Data configuration
data_params block in the YAML configuration file
Check This First!
This article refers to BaseModel accessed via a Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as an SF GUI application.
The settings in the data_params block relate to data. They are used mainly to manage the temporal splits that divide the training, validation and test periods (you can read more about how BaseModel handles this in this article), and they also influence things like the method of sampling or encoding data.
Parameters |
---|
- data_start_date : datetime
No default, required.
Events before this date will not be used in the training process; events after this date will be considered for training.
- training_end_date : datetime
default: validation_start_date - 1
The last date of the training period. If not set, it defaults to validation_start_date - 1; set it explicitly when you need more flexibility and control over the model training timeline.
- check_target_for_next_N_days : int
default: None
The number of days, after the split point, considered for the model's target function period.
As an exception, it does not apply to downstream models for recommendation tasks.
- validation_start_date : datetime
default: None
Initial date of the validation period.
- validation_end_date : datetime
default: None
The last date of the validation period.
- test_start_date : datetime
default: None
Initial date of the test period. It will be used for downstream models' predictions, but it can be set at a later stage.
- test_end_date : datetime
default: None
The last date of the test period.
- features_path : str
default: None
The path to the folder with features created during the foundation model training. Please do not specify it in the YAML file: it should be provided as an argument to the pretrain function or terminal command, and the value here is then overwritten.
- timebased_encoding : Literal["two-hot", "fourier"]
default: "two-hot"
Controls the encoding of time-based features.
- target_sampling_strategy : Literal["valid", "random"]
default: "random"
Controls the data sampling for each entity; for Foundation Model it should always be left as random.
- maximum_splitpoints_per_entity : int
default: 1
The maximum number of splits into input and target events per entity. For Foundation Model this should be left as 1.
- use_recency_sketches : boolean
default: True
If True, recency sketches are used in training.
- extra_columns : list['Extra_Column']
default: None
Columns that are discarded during the foundation model fit stage but should still be made available in the data_source, e.g. for the definition of a downstream model's target function.
- dynamic_events_sampling : boolean
default: True
A flag indicating whether to dynamically sample events from the input data. This is useful to avoid overfitting.
- [BETA] apply_event_count_weighting : boolean
default: False
If set to True, enables weighting based on the count of events. This means that the influence of each example in the dataset is adjusted according to the number of events it represents, typically to balance the contribution of examples with varying event counts.
- [BETA] apply_recency_based_weighting : boolean
default: False
If set to True, enables weighting based on the age of examples. This strategy assigns weights to examples based on their temporal proximity to a specific end date, giving preference to more recent examples under the assumption that they may be more relevant or indicative of current trends.
- [BETA] window_shuffling_buffer_size : int
default: 5_000_000
Buffer size used by random window shuffling.
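As a rough illustration of the non-temporal settings listed above, the fragment below writes out their documented defaults and switches on one of the BETA weighting options. It is a sketch only: the required date parameters are omitted for brevity, features_path is deliberately left out because it is supplied at run time, and enabling apply_recency_based_weighting here is purely an example choice rather than a recommendation.

```yaml
data_params:
  # Documented defaults, written out explicitly
  timebased_encoding: two-hot          # alternative value: fourier
  target_sampling_strategy: random     # leave as random for the Foundation Model
  maximum_splitpoints_per_entity: 1    # leave as 1 for the Foundation Model
  use_recency_sketches: true
  dynamic_events_sampling: true
  # BETA options (default: false); one is enabled here for illustration only
  apply_event_count_weighting: false
  apply_recency_based_weighting: true
  window_shuffling_buffer_size: 5000000
```

Since every value except apply_recency_based_weighting matches its default, removing those lines should leave the behaviour unchanged; they are shown only to make the available options visible.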
Example |
---|
data_params:
  data_start_date: 2018-09-20 00:00:00
  validation_start_date: 2018-10-10 00:00:00
  test_start_date: 2018-10-20 00:00:00
  check_target_for_next_N_days: 7
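A fuller temporal split might look like the sketch below, which adds an explicit training end date and closes the validation and test periods. The extra dates are illustrative values only, chosen so that the periods follow one another without overlapping; whether end dates are treated as inclusive is not stated in this article, so adjust them to your own data.

```yaml
data_params:
  # Illustrative dates only, not taken from a real dataset
  data_start_date: 2018-09-20 00:00:00
  # Explicit here; if omitted, the default is validation_start_date - 1
  training_end_date: 2018-10-09 00:00:00
  validation_start_date: 2018-10-10 00:00:00
  validation_end_date: 2018-10-19 00:00:00
  # The test period can also be set at a later stage
  test_start_date: 2018-10-20 00:00:00
  test_end_date: 2018-10-27 00:00:00
  # Target window: number of days after the split point
  check_target_for_next_N_days: 7
```

Keeping the periods in chronological order (training, then validation, then test) makes it easier to see which events can end up in each stage.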