Data configuration
`data_params` block in the YAML configuration file
Check This First!
This article refers to BaseModel accessed via a Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as a Snowflake GUI application.
The `data_params` block in the YAML file defines key settings related to your data. These parameters serve two main purposes:
- Splitting data into training, validation, and test sets for model tuning and evaluation,
- Controlling how your event data are sampled, split, and weighted for model training and inference.
While some parameters are specific to the foundation model training stage, others only become relevant later, during scenario model training or inference. You can define all of them at once or update them at any point in the workflow.
Training, validation, and test sets
BaseModel supports two distinct strategies for splitting data into training, validation, and test sets:
- Entity-based split: the dataset is partitioned by entities (e.g., users, customers, subscribers). A percentage of entities is held out for validation, and the rest are used for training. This is the default and recommended approach for production use because:
  - No test set is created for additional metrics evaluation unless one is specifically filtered out (e.g., with a `WHERE` clause),
  - The most recent data can be used for training.
- Time-based split: the dataset is divided based on timestamps. Events are separated into training, validation, and test periods. This approach is particularly useful for initial experiments, offline evaluation, and retrospective testing, as it mimics how the model would have performed at different points in the past.
Good practice
⏱️ Use time-based split when you want to evaluate performance historically.
👥 Use entity-based split in production to maximize learning from recent behavior while validating on unseen users.
Parameters
- `data_start_date: datetime`
  No default, required.
  Events before this date will be excluded from the training process.
- `split: DataEntitySplit | DataTimeSplit`
  No default, required.
  Configuration of the division into training, validation, and test sets.
  - `type: Literal["entity", "time"]`
    No default, required.
    Specifies the split mode:
    - `entity` - entity-based split, where validation is done on a user holdout. The following additional parameters should be provided:
      - `training: int`
        Percentage of entities used for training.
      - `validation: int`
        Percentage of entities used for validation.
    - `time` - time-based split, where training, validation, and test are defined by time periods:
      - `training: TimeRange`
        No default, required.
        Definition of the training period.
        - `start_date: datetime`
          No default, required.
          Initial date of the training period.
        - `end_date: datetime`
          default: None
          The last date of the training period. If not provided, it will be set to the validation `start_date` - 1.
      - `validation: TimeRange`
        No default, required.
        Definition of the validation period.
        - `start_date: datetime`
          No default, required.
          Initial date of the validation period.
        - `end_date: datetime`
          default: None
          The last date of the validation period. If not provided, it will be set to the test `start_date` - 1; if the test `start_date` is not provided, it will be set to the current date.
      - `test: TimeRange | None`
        default: None
        Definition of the test period.
        - `start_date: datetime`
          default: None
          Initial date of the test period. This parameter is required for scenario model testing; it does not need to be set during foundation model training.
        - `end_date: datetime`
          default: None
          The last date of the test period. This parameter is required for scenario model testing; it does not need to be set during foundation model training. If not provided while the test `start_date` is set, it will be set to the current date.
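  For illustration, a minimal sketch of a time-based split that also sets a test period; dates are illustrative, and the `end_date` fields are omitted where the defaults described above apply:

  ```yaml
  data_params:
    split:
      type: time
      training:
        start_date: 2022-06-01 00:00:00   # end_date defaults to validation start_date - 1
      validation:
        start_date: 2023-06-01 00:00:00   # end_date defaults to test start_date - 1
      test:
        start_date: 2023-09-01 00:00:00   # end_date defaults to the current date
  ```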
- `target_sampling_strategy: Literal["valid", "random", "existing"]`
  default: "random"
  Controls the data sampling for each entity: `random` randomly splits events into input and target events per entity, `valid` ensures that at least one event will occur in the target, and `existing` splits only on timestamps of existing events. For foundation model training it should always be left as `random`. Setting `"valid"` usually improves scenario models with a recommendation task.
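  As a sketch, a scenario model configuration for a recommendation task might override the default as follows (the remaining `data_params` keys are unchanged):

  ```yaml
  data_params:
    target_sampling_strategy: valid   # guarantee at least one event in the target
  ```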
- `maximum_splitpoints_per_entity: int`
  default: 20
  The maximum number of splits into input and target events per entity. The default value improves performance, especially for smaller datasets or highly imbalanced classes, as it increases the number of training examples. However, it may result in slow training for very large datasets; setting it to a lower value will decrease runtime.
- `max_data_splits_per_split_point: int`
  default: 200
  Limits the number of examples generated for each split point when creating examples based on an entity's event history. This is specifically used in contextual recommendation scenarios, such as suggesting a complementary product given the other items in the cart.
- `split_point_data_sources: list[str]`
  default: None
  Names of the event data sources that should contribute their timestamps as split point candidates. If not specified, all event data sources are used. This is useful when you want more control over how history and future are split.
- `split_point_inclusion_overrides: Literal["future", "past", "one-future", "one-future-all-variants"]`
  default: `future` for the recommendation task, `history` in all other cases
  The strategy for handling situations when multiple events have exactly the same timestamp as the chosen split point in time. Eligible values for each event source are:
  - `future` - events at the split point belong to the "future",
  - `history` - events at the split point belong to the "past",
  - `one-future` - a random element at the split point belongs to the "future", the rest belong to the "past",
  - `one-future-all-variants` - like `one-future`, but considers all possible selections of "the item in the future" instead of only the one that was randomly sampled.
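  A sketch combining the split-point parameters; the data source names `transactions` and `page_visits` are hypothetical placeholders for your own event sources:

  ```yaml
  data_params:
    maximum_splitpoints_per_entity: 10            # lowered from the default of 20 to reduce runtime
    split_point_data_sources:                     # only these sources provide split point candidates
      - transactions
      - page_visits
    split_point_inclusion_overrides: one-future   # one same-timestamp event goes to the "future"
  ```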
- `extra_columns: list[Extra_Column]`
  default: None
  Certain columns may be discarded during the foundation model fitting stage, for example due to insufficient variability. If these columns are needed later, e.g. to define a scenario model's target function, they must be declared here to be made available to the scenario model training script.
- `ignore_entities_without_events: bool`
  default: True
  A flag indicating whether entities without events will be discarded during training.
- `dynamic_events_sampling: bool`
  default: True
  A flag indicating whether to dynamically sample events from the input data. This is useful to avoid overfitting.
- `apply_event_count_weighting: bool`
  default: False
  If set to True, enables weighting based on the count of events: the influence of each example in the dataset is adjusted according to the number of events it represents, typically to balance the contribution of examples with varying event counts.
- `apply_recency_based_weighting: bool`
  default: False
  If set to True, enables weighting based on the age of examples. This strategy assigns weights to examples based on their temporal proximity to a specific end date, favoring more recent examples under the assumption that they are more indicative of current trends.
- `limit_entity_num_events: int`
  default: None
  Limits the number of events used per entity. The most recent events are kept.
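  A sketch showing how the sampling and weighting options can be combined (values are illustrative, not recommendations):

  ```yaml
  data_params:
    dynamic_events_sampling: true        # resample events to reduce overfitting
    apply_event_count_weighting: true    # balance examples with varying event counts
    apply_recency_based_weighting: true  # favor more recent examples
    limit_entity_num_events: 1000        # keep only the 1000 most recent events per entity
  ```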
- `cache_path: Path`
  default: None
  Directory to save queried data to, or to read it from if a cached query exists. Applicable when fitting a model or when calling `verify_target`. Not applicable to data sources fed from parquet files; please use the parquet data engine's own caching method as described here. Caching means that during foundation model training, data will be stored at the `cache_path` location in parquet file format. BaseModel will then use that data for subsequent training epochs, as well as for downstream tasks and predictions if so configured.
Remember
With `cache_path` you can enable caching, which will speed up and/or stabilize training when the database connection is not very stable or not very fast.
Examples
The following examples show how to configure data parameters in a YAML file. For examples of modifying data parameters with a Python script during scenario model training or inference, please refer to the relevant guides.
Entity-based split
```yaml
data_params:
  data_start_date: 2018-09-20 00:00:00
  split:
    type: entity
    training: 95
    validation: 5
  cache_path: /path/to/cache
```
Time-based split
```yaml
data_params:
  data_start_date: 2022-06-01 00:00:00
  split:
    type: time
    training:
      start_date: 2022-06-01 00:00:00
    validation:
      start_date: 2023-06-01 00:00:00
  cache_path: /path/to/cache
```