Control of data loading process

loading_params and data_loader_params blocks in YAML configuration file

⚠️

Check This First!

This article refers to BaseModel accessed via a Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as an SF GUI application.


In this article we will focus on the next two blocks, loading_params and data_loader_params, which together control how data is loaded from the source and into the model.

Data loading from the source

Settings in the loading_params block modify how data is loaded from the source and are defined per DataMode.

Parameters
  • entities_ids_subquery : str
    default: None
    Subquery used to limit the data loaded from the data source.

  • limit_entity_num_events : int
    default: None
    Limits the number of events per entity used. The most recent events are kept.

  • cache_dir : Path
    default: None
    Directory where queried data is saved, or read from if a cached query exists. Applicable when fitting a model or when calling verify_target. Not applicable to data sources fed from parquet files.

    Caching means that during Foundation Model training, queried data is written to the cache_dir location in parquet file format. BaseModel then reuses the cached data in subsequent training epochs and, if so configured by the user, in downstream tasks and predictions.

    🚧

    Remember

    Enabling caching with cache_dir can speed up and stabilize training when the database connection is slow or unstable.
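
The effect of limit_entity_num_events described above can be illustrated with a small sketch. This is not BaseModel's implementation: the function name keep_most_recent and the (entity_id, timestamp, payload) tuple layout are assumptions made only for the illustration. It shows the documented behavior that, per entity, only the most recent events are kept.

```python
from collections import defaultdict


def keep_most_recent(events, limit=None):
    """Keep at most `limit` most recent events per entity (hypothetical helper).

    `events` is an iterable of (entity_id, timestamp, payload) tuples.
    `limit=None` mirrors the documented default and keeps every event.
    """
    by_entity = defaultdict(list)
    for entity_id, timestamp, payload in events:
        by_entity[entity_id].append((timestamp, payload))

    result = {}
    for entity_id, rows in by_entity.items():
        rows.sort(key=lambda row: row[0])  # oldest first
        if limit is None:
            result[entity_id] = rows
        else:
            # Slice from the end so the newest `limit` events survive.
            result[entity_id] = rows[max(len(rows) - limit, 0):]
    return result


events = [
    ("a", 1, "view"),
    ("a", 3, "buy"),
    ("a", 2, "cart"),
    ("b", 5, "view"),
]
trimmed = keep_most_recent(events, limit=2)
```

Here entity "a" loses its oldest event, while entity "b" has fewer events than the limit and is left unchanged.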


Example
loading_params:
  Train:
    cache_dir: /data/USER/cache/name
  Validation:
    cache_dir: /data/USER/cache/name
  Test:
    cache_dir: /data/USER/cache/name
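
The read-or-query behavior that cache_dir enables can be sketched as a simple read-through cache. This is a simplified illustration under stated assumptions, not BaseModel's code: load_with_cache and fake_query are hypothetical names, the cache key is a hash of the query text, and JSON files stand in for the parquet files BaseModel actually writes so that the sketch needs only the standard library.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def load_with_cache(query, run_query, cache_dir=None):
    """Return rows for `query`, reusing a cached copy when one exists."""
    if cache_dir is None:  # documented default: caching is disabled
        return run_query(query)

    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: one cache file per distinct query text.
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"

    if cache_file.exists():  # cache hit: the database is not contacted
        return json.loads(cache_file.read_text())

    rows = run_query(query)  # cache miss: query once, then persist
    cache_file.write_text(json.dumps(rows))
    return rows


calls = []


def fake_query(query):
    """Stand-in for a database call; records how often it is invoked."""
    calls.append(query)
    return [{"entity_id": 1, "event": "view"}]


with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache("SELECT 1", fake_query, cache_dir=tmp)
    second = load_with_cache("SELECT 1", fake_query, cache_dir=tmp)
```

The second call is served from disk, which is why a slow or unstable database connection matters only for the first pass over the data.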

Data loading into the model

The data_loader_params block lets you set constructor parameters for the PyTorch DataLoader.
These settings modify how data is loaded into the model, for example the batch size and the number of workers.

Parameters
  • batch_size : int
    default: 256
    Number of samples to load per batch.

  • num_workers : int
    default: 0
    Number of subprocesses to use for data loading. 0 means that the data will be loaded in the main process. Increasing the number of workers splits queries into smaller pieces, which reduces memory consumption on the database side.

  • pin_memory : boolean
    default: False
    If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them.

  • drop_last : boolean
    default: False
    Set to True to drop the last incomplete batch if the dataset size is not divisible by the batch size.

  • pin_memory_device : str
    default: None
    The device to which memory should be pinned, if pin_memory is True.

  • prefetch_factor : int
    default: 2
    Number of batches loaded in advance by each worker.
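
The interaction between batch_size and drop_last comes down to simple arithmetic, sketched below. The helper num_batches is a hypothetical name used only for this illustration; the rounding rule follows PyTorch DataLoader semantics, where the last partial batch is either kept (rounding up) or dropped (rounding down).

```python
def num_batches(dataset_size, batch_size=256, drop_last=False):
    """Number of batches produced in one pass over the dataset."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    if drop_last:
        # The incomplete final batch is discarded.
        return dataset_size // batch_size
    # The incomplete final batch is kept: integer ceiling division.
    return (dataset_size + batch_size - 1) // batch_size


kept = num_batches(1000, batch_size=256, drop_last=False)
dropped = num_batches(1000, batch_size=256, drop_last=True)
# Samples skipped in each pass when the partial batch is dropped.
skipped = 1000 - dropped * 256
```

With 1000 samples and the default batch size of 256, keeping the last batch yields one extra, smaller batch of 232 samples; dropping it keeps every batch the same size at the cost of skipping those samples in that pass.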

Example
data_loader_params:
  batch_size: 256
  num_workers: 5
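
As a rough guide to the memory footprint of the example above, the sketch below estimates how many samples may be prefetched at once. It assumes PyTorch's documented behavior that a total of prefetch_factor * num_workers batches is prefetched across all workers; prefetched_samples is a hypothetical helper, and the result is an upper bound rather than a measurement.

```python
def prefetched_samples(batch_size, num_workers, prefetch_factor=2):
    """Upper bound on samples loaded in advance across all workers."""
    if num_workers == 0:
        # Data is loaded in the main process, so nothing is prefetched.
        return 0
    return batch_size * num_workers * prefetch_factor


# Values from the example: batch_size 256, num_workers 5, default prefetch_factor.
in_flight = prefetched_samples(batch_size=256, num_workers=5)
```

For the example configuration this gives up to 10 batches, or 2560 samples, held in advance, which is worth keeping in mind when raising num_workers or batch_size.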