Data configuration
`data_params` block in the YAML configuration file
Check This First!
This article refers to BaseModel accessed via a Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as a Snowflake GUI application.
The `data_params` block in the YAML file defines key settings related to your data. These parameters serve two main purposes:
- Splitting data into training, validation, and test sets for model tuning and evaluation,
- Controlling how your event data are sampled, split, and weighted for model training and inference.
While some parameters are specific to the foundation model training stage, others only become relevant later, during scenario model training or inference. You can define all of them at once or update them at any point in the workflow.
Training, validation, and test sets
BaseModel supports two distinct strategies for splitting data into training, validation, and test sets:
- Entity-based split: the dataset is partitioned by entities (e.g., users, customers, subscribers). A percentage of entities is held out for validation, and the rest are used for training. This is the default and recommended approach for production use because:
  - No test set is created for additional metrics evaluation unless one is specifically filtered out (e.g., with a `WHERE` clause),
  - The most recent data can be used for training.
- Time-based split: the dataset is divided based on timestamps. Events are separated into training, validation, and test periods. This approach is particularly useful for initial experiments, offline evaluation, and retrospective testing, as it mimics how the model would have performed at different points in the past.
Good practice
⏱️ Use time-based split when you want to evaluate performance historically.
👥 Use entity-based split in production to maximize learning from recent behavior while validating on unseen users.
Parameters
- `data_start_date: datetime`
  No default, required.
  Events before this date will be excluded from the training process.
- `split: DataEntitySplit | DataTimeSplit`
  No default, required.
  Configuration of the division into training, validation, and test sets.
  - `type: Literal["entity", "time"]`
    No default, required.
    Specifies the split mode:
    - `entity` - entity-based split, where validation is done on a user holdout. The following additional parameters should be provided:
      - `training: int`
        Percentage of entities used for training.
      - `validation: int`
        Percentage of entities used for validation.
    - `time` - time-based split, where training, validation, and test are defined by time periods:
      - `training: TimeRange`
        No default, required.
        Definition of the training period.
        - `start_date: datetime`
          No default, required.
          Initial date of the training period.
        - `end_date: datetime`
          default: None
          The last date of the training period. If not provided, it will be set to the validation `start_date` - 1.
      - `validation: TimeRange`
        No default, required.
        Definition of the validation period.
        - `start_date: datetime`
          No default, required.
          Initial date of the validation period.
        - `end_date: datetime`
          default: None
          The last date of the validation period. If not provided, it will be set to the test `start_date` - 1; if the test `start_date` is not provided, it will be set to the current date.
      - `test: TimeRange | None`
        default: None
        Definition of the test period.
        - `start_date: datetime`
          default: None
          Initial date of the test period. This parameter is required for scenario model testing; it does not need to be set during foundation model training.
        - `end_date: datetime`
          default: None
          The last date of the test period. This parameter is required for scenario model testing; it does not need to be set during foundation model training. If not provided while the test `start_date` is set, it will be set to the current date.
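  For illustration, a minimal sketch of a time-based split that also sets a test period; dates are illustrative, and the `end_date` fields are omitted where the defaults described above apply:

  ```yaml
  data_params:
    split:
      type: time
      training:
        start_date: 2022-06-01 00:00:00   # end_date defaults to validation start_date - 1
      validation:
        start_date: 2023-06-01 00:00:00   # end_date defaults to test start_date - 1
      test:
        start_date: 2023-09-01 00:00:00   # end_date defaults to the current date
  ```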
- `target_sampling_strategy: Literal["valid", "random", "existing"]`
  default: "random"
  Controls the data sampling for each entity: `random` randomly splits events into input and target events per entity, `valid` ensures that at least one event will occur in the target, and `existing` splits only on timestamps of existing events. For foundation model training it should always be left as `random`. Setting `"valid"` usually improves scenario models with a recommendation task.
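  As a sketch, a scenario model configuration for a recommendation task might override the default as follows (the remaining `data_params` keys are unchanged):

  ```yaml
  data_params:
    target_sampling_strategy: valid   # guarantee at least one event in the target
  ```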
- `maximum_splitpoints_per_entity: int`
  default: 20
  The maximum number of splits into input and target events per entity. The default value improves performance, especially for smaller datasets or highly imbalanced classes, as it increases the number of training examples. However, it may result in slow training for very large datasets; setting it to a lower value will decrease runtime.
- `max_data_splits_per_split_point: int`
  default: 200
  Limits the number of examples generated for each split point when creating examples based on an entity's event history. This is specifically used in contextual recommendation scenarios, such as suggesting a complementary product given the other items in the cart.
- `split_point_data_sources: list[str]`
  default: None
  Names of the event data sources that should contribute their timestamps as split point candidates. If not specified, all event data sources are used. This is useful when you want more control over how history and future are split.
- `split_point_inclusion_overrides: Literal["future", "past", "one-future", "one-future-all-variants"]`
  default: `future` for the recommendation task, `history` in all other cases
  The strategy for handling situations when multiple events have exactly the same timestamp as the chosen split point in time. Eligible values for each event source are:
  - `future` - events at the split point belong to the "future",
  - `history` - events at the split point belong to the "past",
  - `one-future` - a random element at the split point belongs to the "future", the rest belong to the "past",
  - `one-future-all-variants` - like `one-future`, but considers all possible selections of "the item in the future" instead of only the one that was randomly sampled.
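  A sketch combining the split-point parameters; the data source names `transactions` and `page_visits` are hypothetical placeholders for your own event sources:

  ```yaml
  data_params:
    maximum_splitpoints_per_entity: 10            # lowered from the default of 20 to reduce runtime
    split_point_data_sources:                     # only these sources provide split point candidates
      - transactions
      - page_visits
    split_point_inclusion_overrides: one-future   # one same-timestamp event goes to the "future"
  ```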
- `extra_columns: list[Extra_Column]`
  default: None
  Certain columns may be discarded during the foundation model fitting stage, for example due to insufficient variability. If these columns are needed later, e.g. to define a scenario model's target function, they must be declared here to be made available to the scenario model training script.
- `ignore_entities_without_events: bool`
  default: True
  A flag indicating whether entities without events will be discarded during training.
- `dynamic_events_sampling: bool`
  default: True
  A flag indicating whether to dynamically sample events from the input data. This is useful to avoid overfitting.
- `apply_event_count_weighting: bool`
  default: False
  If set to True, enables weighting based on the count of events: the influence of each example in the dataset is adjusted according to the number of events it represents, typically to balance the contribution of examples with varying event counts.
- `apply_recency_based_weighting: bool`
  default: False
  If set to True, enables weighting based on the age of examples. This strategy assigns weights to examples based on their temporal proximity to a specific end date, favoring more recent examples under the assumption that they are more indicative of current trends.
- `limit_entity_num_events: int`
  default: None
  Limits the number of events used per entity. The most recent events are kept.
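  A sketch showing how the sampling and weighting options can be combined (values are illustrative, not recommendations):

  ```yaml
  data_params:
    dynamic_events_sampling: true        # resample events to reduce overfitting
    apply_event_count_weighting: true    # balance examples with varying event counts
    apply_recency_based_weighting: true  # favor more recent examples
    limit_entity_num_events: 1000        # keep only the 1000 most recent events per entity
  ```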
- `cache_path: Path`
  default: None
  Directory to save queried data to, or to read it from if a cached query exists. Applicable when fitting a model or when calling `verify_target`. Not applicable to data sources fed from parquet files; please use the parquet data engine's own caching method as described here. Caching means that during foundation model training, data will be stored at the `cache_path` location in parquet file format. BaseModel will then use that data for subsequent training epochs, as well as for downstream tasks and predictions if so configured.
Remember
With `cache_path` you can enable caching, which will speed up and/or stabilize training when the database connection is not very stable or not very fast.
Examples
The following examples show how to configure data parameters in a YAML file. For examples of modifying data parameters with a Python script during scenario model training or inference, please refer to the relevant guides.
Entity-based split
```yaml
data_params:
  data_start_date: 2018-09-20 00:00:00
  split:
    type: entity
    training: 95
    validation: 5
  cache_path: /path/to/cache
```
Time-based split
```yaml
data_params:
  data_start_date: 2022-06-01 00:00:00
  split:
    type: time
    training:
      start_date: 2022-06-01 00:00:00
    validation:
      start_date: 2023-06-01 00:00:00
  cache_path: /path/to/cache
```