Data configuration

The data_params block in the YAML configuration file

⚠️

Check This First!

This article refers to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, please refer to the Snowflake Native App section instead.

The settings in the data_params block relate to the data and are used mainly to divide the training, validation, and test periods (you can read more about how BaseModel handles this in this article), as well as to influence things like the method of sampling or encoding data.

You can find all data parameters in the list below. Note that not all parameters are used during foundation model training; some are only utilized at the scenario model training or inference stages (e.g., test dates). These parameters can be declared now or added and modified at later stages.


Parameters
  • data_start_date : datetime
    No default, required.
    Events before this date will not be used in the training process; events after this date will be considered for training.

  • training_end_date : datetime
    default: validation_start_date - 1 day
    The last date of the training period. If not set explicitly, it defaults to the day before validation_start_date, giving you control over the training timeline independently of the validation period.

  • validation_start_date : datetime
    default: None
    Initial date of the validation period.

  • validation_end_date : datetime
    default: None
    The last date of the validation period.

  • test_start_date : datetime
    default: None
    Initial date of the test period. This parameter is required for scenario model testing or inference purposes. It does not need to be set during foundation model training.

  • test_end_date : datetime
    default: None
    The last date of the test period. This parameter is required for scenario model testing or inference purposes. It does not need to be set during foundation model training.
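The date parameters above partition the event timeline into training, validation, and test periods. The following toy sketch shows the general idea; it is an illustration only, not BaseModel's actual implementation, and the exact boundary conventions (inclusive vs. exclusive) are an assumption here.

```python
from datetime import datetime, timedelta

# Hypothetical dates mirroring the YAML example at the end of this article.
data_start = datetime(2018, 9, 20)
validation_start = datetime(2018, 10, 10)
test_start = datetime(2018, 10, 20)

# Default suggested by the docs: the day before validation_start_date.
training_end = validation_start - timedelta(days=1)

def period(ts: datetime) -> str:
    """Assign an event timestamp to a period (illustrative only)."""
    if ts < data_start:
        return "discarded"   # events before data_start_date are not used
    if ts <= training_end:
        return "training"
    if ts < test_start:
        return "validation"
    return "test"

print(period(datetime(2018, 9, 25)))   # training
print(period(datetime(2018, 10, 15)))  # validation
```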

  • features_path : str
    default: None
    The path to the folder with features created during foundation model training. Do not specify it in the YAML file; instead, provide it as an argument to the pretrain function or terminal command, and it will be written here automatically.

  • timebased_encoding : Literal["two-hot", "fourier"]
    default: "two-hot"
    Controls the encoding of time-based features.

  • target_sampling_strategy : Literal["valid", "random", "existing"]
    default: "random"
    Controls how target events are sampled for each entity. For foundation model training it should always be left as "random". Setting it to "valid" usually improves scenario models for the recommendation task.

  • maximum_splitpoints_per_entity : int
    default: 20
    The maximum number of splits into input and target events per entity. A high value (such as the default) improves model quality, especially for smaller datasets or highly imbalanced classes, because it increases the number of training examples. However, it may result in slow training for very large datasets; setting it to a lower value will decrease runtime.

  • max_data_splits_per_split_point : int
    default: 200
    Limits the number of examples generated for each split point when creating examples based on an entity's event history. This is specifically used in the scenario of contextual recommendation, such as when suggesting a complementary product given other items in the cart.

  • split_point_data_sources : list[str]
    default: None
    Names of event data sources that should contribute their timestamps as split point candidates. If not specified, all event data sources are used. This can be useful in scenarios where the user wants more control over history and future splitting.
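For instance, to draw split point candidates only from a single event data source (the source name below is a hypothetical placeholder, not a name from your dataset):

```yaml
data_params:
  # Only timestamps from this event source become split point candidates.
  # "purchases" is an illustrative source name.
  split_point_data_sources:
    - purchases
```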

  • split_point_inclusion_overrides : Literal["future", "history", "one-future", "one-future-all-variants"]
    default: "future" for the recommendation task, "history" in all other cases
    The strategy for handling situations where multiple events share the exact same timestamp as the chosen split point. Eligible values for each event source are:
    • future - events at the point belong to the "future".
    • history - events at the point belong to the "past".
    • one-future - a random element at the point belongs to the "future", the rest belong to the "past".
    • one-future-all-variants - like one-future but considers all possible selections of "the item in the future" instead of only one that was randomly sampled.
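The tie-breaking strategies can be pictured with a small sketch. This is a conceptual illustration only, not BaseModel's implementation; the one-future-all-variants strategy is omitted for brevity.

```python
import random

def split_events(events, split_point, strategy="history", seed=0):
    """Divide events into past/future around a split point (illustrative only)."""
    past = [e for e in events if e["ts"] < split_point]
    future = [e for e in events if e["ts"] > split_point]
    tied = [e for e in events if e["ts"] == split_point]
    if strategy == "future":
        # All events at the split point belong to the "future".
        future = tied + future
    elif strategy == "history":
        # All events at the split point belong to the "past".
        past = past + tied
    elif strategy == "one-future":
        # One randomly chosen tied event goes to the "future"; the rest to the "past".
        chosen = random.Random(seed).choice(tied) if tied else None
        future = ([chosen] if chosen is not None else []) + future
        past = past + [e for e in tied if e is not chosen]
    return past, future

events = [{"ts": 1, "id": "a"}, {"ts": 2, "id": "b"},
          {"ts": 2, "id": "c"}, {"ts": 3, "id": "d"}]
past, future = split_events(events, split_point=2, strategy="history")
print([e["id"] for e in past])    # ['a', 'b', 'c']
print([e["id"] for e in future])  # ['d']
```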

  • use_recency_sketches : boolean
    default: True
    If true, recency sketches (storing information about how far in the past the interactions took place) are used in training. This results in improved relevance and faster adaptation of the model to changes in the data. Setting this parameter to false will decrease the size of the model, typically at the cost of its performance.

  • extra_columns : list['Extra_Column']
    default: None
    Certain columns may be discarded during the foundation model fitting stage due to reasons such as insufficient variability. If these columns are needed, for example, to define a scenario model's target function, they must be declared here to be made available for the scenario model training script.

  • dynamic_events_sampling : boolean
    default: True
    A flag indicating whether to dynamically sample events from the input data. This is useful to avoid overfitting.

  • apply_event_count_weighting : boolean
    default: False
    If set to True, enables weighting based on the count of events. This means that the influence of each example in the dataset is adjusted according to the number of events it represents, typically to balance the contribution of examples with varying event counts.

  • apply_recency_based_weighting : boolean
    default: False
    If set to True, enables weighting based on the age of examples. This strategy assigns weights to examples based on their temporal proximity to a specific end date, giving preference to more recent examples under the assumption that they may be more relevant or indicative of current trends.
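BaseModel's exact weighting formulas are not documented here, but the general idea behind both flags can be sketched as follows. The normalization constant, half-life, and functional forms below are assumptions chosen for illustration.

```python
def example_weight(event_count, age_days, count_norm=10.0, half_life_days=30.0):
    """Illustrative example weight; not BaseModel's actual formulas."""
    # apply_event_count_weighting: damp examples representing many events
    # so a few very active entities do not dominate training.
    count_w = count_norm / (count_norm + event_count)
    # apply_recency_based_weighting: prefer recent examples, here via
    # exponential decay with a chosen half-life.
    recency_w = 0.5 ** (age_days / half_life_days)
    return count_w * recency_w

print(round(example_weight(event_count=5, age_days=0), 3))   # 0.667
print(round(example_weight(event_count=5, age_days=60), 3))  # 0.167
```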

Example

The following example shows how to configure data parameters in a YAML file. For examples of modifying data parameters with a Python script during scenario model training or inference, please refer to the relevant guides.

data_params:
  data_start_date: 2018-09-20 00:00:00
  validation_start_date: 2018-10-10 00:00:00
  test_start_date: 2018-10-20 00:00:00
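An extended configuration might additionally set some of the optional parameters described above. The values shown are illustrative, not recommendations:

```yaml
data_params:
  data_start_date: 2018-09-20 00:00:00
  training_end_date: 2018-10-09 00:00:00   # optional; defaults to validation_start_date - 1
  validation_start_date: 2018-10-10 00:00:00
  validation_end_date: 2018-10-19 00:00:00
  test_start_date: 2018-10-20 00:00:00
  test_end_date: 2018-10-31 00:00:00
  timebased_encoding: two-hot
  maximum_splitpoints_per_entity: 20
  use_recency_sketches: true
  apply_recency_based_weighting: false
```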