Data configuration
The data_params block in the YAML configuration file
Check This First!
This article refers to BaseModel accessed via a Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as a Snowflake GUI application.
The settings in the data_params block relate to the data and are used mainly to divide the training, validation, and test periods (you can read more about how BaseModel handles this in this article), as well as to influence things like the method of sampling or encoding data.
You can find all data parameters in the list below. Note that not all parameters are used during foundation model training; some are only utilized at the scenario model training or inference stages (e.g., test dates). These parameters can be declared now or added and modified at later stages.
Parameters
- data_start_date : datetime
No default, required.
Events before this date will not be used in the training process; events after this date will be considered for training.
- training_end_date : datetime
default: validation_start_date - 1
The last date of the training period. If not set explicitly, it defaults to the day before validation_start_date, providing more flexibility and control over model training timelines.
- validation_start_date : datetime
default: None
The first date of the validation period.
- validation_end_date : datetime
default: None
The last date of the validation period.
- test_start_date : datetime
default: None
The first date of the test period. This parameter is required for scenario model testing or inference purposes. It does not need to be set during foundation model training.
- test_end_date : datetime
default: None
The last date of the test period. This parameter is required for scenario model testing or inference purposes. It does not need to be set during foundation model training.
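For instance, a complete set of date parameters might look like the sketch below; the dates are illustrative, and training_end_date is shown only to make its default explicit:

```yaml
data_params:
  data_start_date: 2018-01-01 00:00:00      # events before this date are ignored
  training_end_date: 2018-10-09 00:00:00    # optional; defaults to validation_start_date - 1
  validation_start_date: 2018-10-10 00:00:00
  validation_end_date: 2018-10-19 00:00:00
  test_start_date: 2018-10-20 00:00:00      # needed for scenario model testing/inference
  test_end_date: 2018-10-31 00:00:00
```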
- features_path : str
default: None
The path to the folder with features created during foundation model training. Please do not specify it in the YAML file; it should be provided as an argument to the pretrain function or terminal command and is then overwritten here.
- timebased_encoding : Literal["two-hot", "fourier"]
default: "two-hot"
Controls the encoding of time-based features.
- target_sampling_strategy : Literal["valid", "random", "existing"]
default: "random"
Controls the data sampling for each entity; for foundation model training it should always be left as random. Setting valid usually improves scenario models with a recommendation task.
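For example, a hedged sketch of a scenario model configuration for a recommendation task (values illustrative, not a prescribed setting):

```yaml
data_params:
  target_sampling_strategy: valid   # "random" remains the right choice for foundation model training
  timebased_encoding: two-hot       # default; "fourier" is the alternative
```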
- maximum_splitpoints_per_entity : int
default: 20
The maximum number of splits into input and target events per entity. The default value improves performance, especially for smaller datasets or highly imbalanced classes, as it increases the number of training examples. However, it may result in slow training for very large datasets. Setting it to a lower value will decrease runtime.
- max_data_splits_per_split_point : int
default: 200
Limits the number of examples generated for each split point when creating examples based on an entity's event history. This is specifically used in the scenario of contextual recommendation, such as when suggesting a complementary product given other items in the cart.
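As an illustrative sketch of the trade-off described above, a very large dataset might lower both limits to cut runtime, while small or imbalanced datasets keep the defaults:

```yaml
data_params:
  maximum_splitpoints_per_entity: 5    # default 20; fewer examples per entity, faster training
  max_data_splits_per_split_point: 50  # default 200; caps examples per split point
```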
- split_point_data_sources : list[str]
default: None
Names of event data sources that should contribute their timestamps as split point candidates. If not specified, all event data sources are used. This can be useful in scenarios where the user wants more control over history and future splitting.
- split_point_inclusion_overrides : Literal["future", "past", "one-future", "one-future-all-variants"]
default: "future" for the recommendation task, "past" in all other cases
The strategy for handling situations in which multiple events have the exact same timestamp as the chosen split point. Eligible values for each event source are:
  - future - events at the point belong to the "future".
  - past - events at the point belong to the "past".
  - one-future - a random element at the point belongs to the "future"; the rest belong to the "past".
  - one-future-all-variants - like one-future, but considers all possible selections of "the item in the future" instead of only the one that was randomly sampled.
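A sketch combining the two split-point parameters; the source names (transactions, page_visits) are hypothetical, and the per-source mapping shape of split_point_inclusion_overrides is an assumption based on the "for each event source" wording above:

```yaml
data_params:
  split_point_data_sources:          # only these sources contribute split point candidates
    - transactions                   # hypothetical event source name
  split_point_inclusion_overrides:   # assumed mapping: source name -> strategy
    transactions: one-future
    page_visits: past                # hypothetical event source name
```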
- use_recency_sketches : boolean
default: True
If true, recency sketches (storing information about how far in the past the interactions took place) are used in training. This results in improved relevance and faster adaptation of the model to changes in the data. Setting this parameter to false will decrease the size of the model, typically at the cost of its performance.
- extra_columns : list['Extra_Column']
default: None
Certain columns may be discarded during the foundation model fitting stage due to reasons such as insufficient variability. If these columns are needed, for example, to define a scenario model's target function, they must be declared here to be made available for the scenario model training script.
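A hedged sketch of declaring extra columns; the Extra_Column structure is not spelled out in this article, so the keys below are hypothetical placeholders:

```yaml
data_params:
  extra_columns:
    - data_source: transactions    # hypothetical key: which event source the column comes from
      column_name: discount_rate   # hypothetical key: the column to keep available
```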
- dynamic_events_sampling : boolean
default: True
A flag indicating whether to dynamically sample events from the input data. This is useful to avoid overfitting.
- apply_event_count_weighting : boolean
default: False
If set to True, enables weighting based on the count of events. This means that the influence of each example in the dataset is adjusted according to the number of events it represents, typically to balance the contribution of examples with varying event counts.
- apply_recency_based_weighting : boolean
default: False
If set to True, enables weighting based on the age of examples. This strategy assigns weights to examples based on their temporal proximity to a specific end date, giving preference to more recent examples under the assumption that they may be more relevant or indicative of current trends.
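To see the flags above side by side, a purely illustrative sketch that keeps the sampling and sketch defaults while enabling both weighting strategies:

```yaml
data_params:
  use_recency_sketches: true          # default; set to false to shrink the model
  dynamic_events_sampling: true       # default; helps avoid overfitting
  apply_event_count_weighting: true   # weight examples by the number of events they represent
  apply_recency_based_weighting: true # prefer examples closer to the end date
```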
Example
The following example shows how to configure data parameters in a YAML file. For examples of modifying data parameters with a Python script during scenario model training or inference, please refer to the relevant guides.
```yaml
data_params:
  data_start_date: 2018-09-20 00:00:00
  validation_start_date: 2018-10-10 00:00:00
  test_start_date: 2018-10-20 00:00:00
```