
Data configuration

data_params block in YAML configuration file

⚠️

Check This First!

This article refers to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, please refer to the Snowflake Native App section instead.

The data_params block in the YAML file defines key settings related to your data. These parameters serve two main purposes:

  • Splitting data into training, validation, and test sets for model tuning and evaluation,
  • Controlling how your event data are sampled, split, and weighted for model training and inference.

While some parameters are specific to the foundation model training stage, others only become relevant later, during scenario model training or inference. You can define all of them at once or update them at any point in the workflow.

Training, validation, and test sets

BaseModel supports two distinct strategies for splitting data into training, validation, and test sets:

  • Entity-based split: the dataset is partitioned by entities (e.g., users, customers, subscribers).
    A percentage of entities is held out for validation, and the rest are used for training. This is the default and recommended approach for production use because:
    • no test set is created for additional metrics evaluation unless one is specifically filtered out (e.g. with a WHERE clause),
    • the most recent data can be used for training.
  • Time-based split: the dataset is divided based on timestamps. Events are separated into training, validation, and test periods. This approach is particularly useful for initial experiments, offline evaluation, and retrospective testing, as it mimics how the model would have performed at different points in the past.

📘

Good practice

⏱️
Use time-based split when you want to evaluate performance historically.

👥
Use entity-based split in production to maximize learning from recent behavior while validating on unseen users.

Parameters
  • data_start_date : datetime
    No default, required.
    Events before this date will be excluded from the training process.

  • split: DataEntitySplit | DataTimeSplit
    No default, required.
    Configuration of the division into training, validation and test sets.

    • type: Literal["entity", "time"]
      No default, required.
      Specifies the split mode:
      • entity - entity-based split, where validation is performed on a holdout of entities.
        The following additional parameters should be provided:
        • training: int
          Percentage of entities used for training.
        • validation: int
          Percentage of entities used for validation.
      • time - time-based split, where training, validation and test are defined by time periods.
        • training: TimeRange
          No default, required.
          Definition of the training period.
          • start_date: datetime
            No default, required.
            Initial date of the training period.
          • end_date: datetime
            default: None
            The last date of the training period. If not provided, it will be set to the validation start_date - 1.
        • validation: TimeRange
          No default, required.
          Definition of the validation period.
          • start_date: datetime
            No default, required.
            Initial date of the validation period.
          • end_date: datetime
            default: None
            The last date of the validation period. If not provided, it will be set to the test start_date - 1; if the test start_date is not provided either, it will be set to the current date.
        • test: TimeRange | None
          default: None
          Definition of the test period.
          • start_date: datetime
            default: None
            Initial date of the testing period. This parameter is required for scenario model testing. It does not need to be set during foundation model training.
          • end_date: datetime
            default: None
            The last date of the test period. This parameter is required for scenario model testing and does not need to be set during foundation model training. If it is not provided while the test start_date is set, it will be set to the current date.
  • target_sampling_strategy : Literal["valid", "random", "existing"]
    default: "random"
    Controls how data are sampled for each entity: random randomly splits events into input and target per entity, valid ensures that at least one event occurs in the target, and existing splits only on timestamps of existing events. For foundation model training it should always be left as random. Setting it to valid usually improves scenario models with a recommendation task.

  • maximum_splitpoints_per_entity : int
    default: 20
    The maximum number of splits into input and target events per entity. The default value improves model quality, especially for smaller datasets or highly imbalanced classes, as it increases the number of training examples. However, it may slow down training for very large datasets. Setting it to a lower value will decrease runtime.

  • max_data_splits_per_split_point : int
    default: 200
    Limits the number of examples generated for each split point when creating examples based on an entity's event history. This is specifically used in the scenario of contextual recommendation, such as when suggesting a complementary product given other items in the cart.

  • split_point_data_sources : list[str]
    default: None
    Names of event data sources that should contribute their timestamps as split point candidates. If not specified, all event data sources are used. This can be useful when you want more control over how events are split into history and future.

  • split_point_inclusion_overrides : Literal["future", "history", "one-future", "one-future-all-variants"]
    default: future for the recommendation task, history in all other cases
    The strategy for handling situations where multiple events have exactly the same timestamp as the chosen split point. Eligible values for each event source are:

    • future - events at the point belong to the "future".
    • history - events at the point belong to the "past".
    • one-future - a random element at the point belongs to the "future", the rest belong to the "past".
    • one-future-all-variants - like one-future but considers all possible selections of "the item in the future" instead of only one that was randomly sampled.
  • extra_columns : list[Extra_Column]
    default: None
    Certain columns may be discarded during the foundation model fitting stage for reasons such as insufficient variability. If these columns are needed, for example to define a scenario model's target function, they must be declared here so that they are available to the scenario model training script.

  • ignore_entities_without_events: bool
    default: True
    A flag indicating whether entities without events will be discarded during training.

  • dynamic_events_sampling : bool
    default: True
    A flag indicating whether to dynamically sample events from the input data. This is useful to avoid overfitting.

  • apply_event_count_weighting : bool
    default: False
    If set to True, enables weighting based on the count of events. This means that the influence of each example in the dataset is adjusted according to the number of events it represents, typically to balance the contribution of examples with varying event counts.

  • apply_recency_based_weighting : bool
    default: False
    If set to True, enables weighting based on the age of examples. This strategy assigns weights to examples based on their temporal proximity to a specific end date, giving preference to more recent examples under the assumption that they may be more relevant or indicative of current trends.

  • limit_entity_num_events : int
    default: None
    Limits the number of events per entity used. The most recent events are kept.

  • cache_path : Path
    default: None
    Directory to save queried data to, or to read it from if a cached query already exists. Applicable when fitting a model or when calling verify_target. Not applicable to data sources fed from parquet files - please use the parquet data engine's own caching method as described here.

    Caching means that, during foundation model training, queried data will be stored at the cache_path location in parquet file format. BaseModel will then reuse that data for subsequent training epochs, as well as for downstream tasks and predictions, if so configured by the user.

    🚧

    Remember

    With cache_path you can enable caching, which speeds up and/or stabilizes training when the database connection is slow or unstable.


Examples

The following examples show how to configure data parameters in a YAML file. For examples of modifying data parameters with a Python script during scenario model training or inference, please refer to the relevant guides.

Entity-based split

data_params:
  data_start_date: 2018-09-20 00:00:00
  split:
    type: entity
    training: 95
    validation: 5
  cache_path: /path/to/cache
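
Entity-based split with sampling and weighting options

The following sketch combines the entity-based split with a few of the optional sampling and weighting parameters documented above, as they might be set for a scenario model with a recommendation task. The values are illustrative only, not recommended defaults:

data_params:
  data_start_date: 2018-09-20 00:00:00
  split:
    type: entity
    training: 95
    validation: 5
  target_sampling_strategy: valid
  maximum_splitpoints_per_entity: 10
  limit_entity_num_events: 500
  apply_recency_based_weighting: true
  cache_path: /path/to/cache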

Time-based split

data_params:
  data_start_date: 2022-06-01 00:00:00
  split:
    type: time
    training:
      start_date: 2022-06-01 00:00:00
    validation:
      start_date: 2023-06-01 00:00:00
  cache_path: /path/to/cache
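
Time-based split with test period

A time-based configuration can also define a test period, which is needed for scenario model testing and retrospective evaluation (see the test parameter above). The dates below are purely illustrative:

data_params:
  data_start_date: 2022-06-01 00:00:00
  split:
    type: time
    training:
      start_date: 2022-06-01 00:00:00
      end_date: 2023-05-31 00:00:00
    validation:
      start_date: 2023-06-01 00:00:00
      end_date: 2023-08-31 00:00:00
    test:
      start_date: 2023-09-01 00:00:00
      end_date: 2023-11-30 00:00:00
  cache_path: /path/to/cache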