Data configuration

data_params block in YAML configuration file

⚠️

Check This First!

This article applies to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, please refer to the Snowflake Native App section.

The settings in the data_params block relate to the data. They are used mainly to manage the temporal splits that divide the training, validation, and test periods (you can read more about how BaseModel handles this in this article), and to control aspects such as the sampling method and the data encoding.

Parameters
  • data_start_date : datetime
    No default, required.
    Events before this date will not be used in the training process; events after this date will be considered for training.

  • training_end_date : datetime
    default: validation_start_date - 1
    The last date of the training period. If not set, it defaults to validation_start_date - 1. Setting it explicitly gives more flexibility and control over the model training timeline.

  • check_target_for_next_N_days : int
    default: None
    The number of days, after the split point, considered for the model's target function period.
    As an exception, it does not apply to downstream models for recommendation tasks.

  • validation_start_date : datetime
    default: None
    Initial date of the validation period.

  • validation_end_date : datetime
    default: None
    The last date of the validation period.

  • test_start_date : datetime
    default: None
    Initial date of the test period. It will be used for downstream models' predictions, but it can be set at a later stage.

  • test_end_date : datetime
    default: None
    The last date of the test period.

  • features_path : str
    default: None
    The path to the folder with features created during foundation model training. Do not specify it in the YAML file: it should be provided as an argument to the pretrain function or the terminal command, and any value set here is then overwritten.

  • timebased_encoding : Literal["two-hot", "fourier"]
    default: "two-hot"
    Controls the encoding of time-based features.

  • target_sampling_strategy : Literal["valid", "random"]
    default: "random"
    Controls the data sampling for each entity. For the Foundation Model it should always be left as "random".

  • maximum_splitpoints_per_entity : int
    default: 1
    The maximum number of splits into input and target events per entity. For Foundation Model this should be left as 1.

  • use_recency_sketches : boolean
    default: True
    If True, recency sketches are used in training.

  • extra_columns : list['Extra_Column']
    default: None
    Columns discarded during the foundation model fit stage that should still be made available in the data_source, e.g. for defining a downstream model's target function.

  • dynamic_events_sampling : boolean
    default: True
    A flag indicating whether to dynamically sample events from the input data. This is useful to avoid overfitting.

  • [BETA] apply_event_count_weighting : boolean
    default: False
    If set to True, enables weighting based on the count of events. This means that the influence of each example in the dataset is adjusted according to the number of events it represents, typically to balance the contribution of examples with varying event counts.

  • [BETA] apply_recency_based_weighting : boolean
    default: False
    If set to True, enables weighting based on the age of examples. This strategy assigns weights to examples based on their temporal proximity to a specific end date, giving preference to more recent examples under the assumption that they may be more relevant or indicative of current trends.

  • [BETA] window_shuffling_buffer_size : int
    default: 5_000_000
    Buffer size used by random window shuffling.
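
The date parameters above describe consecutive periods on a single timeline. As a rough illustration only (this is not BaseModel's internal logic), the sketch below assigns an event timestamp to a period. The function name assign_period, the period labels, the boundary handling, and the reading of "validation_start_date - 1" as one day are all assumptions made for this example. It also assumes validation_end_date and test_end_date are left unset.

```python
from datetime import datetime, timedelta
from typing import Optional


def assign_period(
    event_time: datetime,
    data_start_date: datetime,
    validation_start_date: datetime,
    test_start_date: datetime,
    training_end_date: Optional[datetime] = None,
) -> str:
    """Return 'ignored', 'training', 'gap', 'validation' or 'test' (hypothetical labels)."""
    # Assumption: the default training end is one day before validation starts.
    if training_end_date is None:
        training_end_date = validation_start_date - timedelta(days=1)

    if event_time < data_start_date:
        return "ignored"      # events before data_start_date are not used
    if event_time <= training_end_date:
        return "training"
    if event_time < validation_start_date:
        return "gap"          # after training ends, before validation starts
    if event_time < test_start_date:
        return "validation"   # validation_end_date assumed unset
    return "test"             # test_end_date assumed unset


start = datetime(2018, 9, 20)
validation = datetime(2018, 10, 10)
test = datetime(2018, 10, 20)

print(assign_period(datetime(2018, 9, 25), start, validation, test))   # training
print(assign_period(datetime(2018, 10, 15), start, validation, test))  # validation
```

Passing training_end_date explicitly moves the end of the training period earlier or later without touching the validation dates.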

Example
data_params:
  data_start_date: 2018-09-20 00:00:00
  validation_start_date: 2018-10-10 00:00:00
  test_start_date: 2018-10-20 00:00:00
  check_target_for_next_N_days: 7
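
To make check_target_for_next_N_days more concrete, the following sketch mirrors the example configuration as a plain Python dict and computes the target period that follows a split point. The helper target_window is hypothetical, and using validation_start_date as the split point is an assumption for illustration; in practice the split points depend on the sampling settings.

```python
from datetime import datetime, timedelta
from typing import Tuple

# The example configuration, mirrored as a Python dict for illustration.
data_params = {
    "data_start_date": datetime(2018, 9, 20),
    "validation_start_date": datetime(2018, 10, 10),
    "test_start_date": datetime(2018, 10, 20),
    "check_target_for_next_N_days": 7,
}


def target_window(split_point: datetime, n_days: int) -> Tuple[datetime, datetime]:
    """Return the (start, end) of the target period that follows a split point."""
    return split_point, split_point + timedelta(days=n_days)


# Assumption: treat the start of validation as the split point.
window_start, window_end = target_window(
    data_params["validation_start_date"],
    data_params["check_target_for_next_N_days"],
)
print(window_start.date(), window_end.date())  # 2018-10-10 2018-10-17
```

With a 7-day target period, a split point on 2018-10-10 looks at events up to 2018-10-17, which still falls before the test_start_date of 2018-10-20 in this configuration.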