Control of data loading process
loading_params and data_loader_params blocks in the YAML configuration file
Check This First!
This article refers to BaseModel accessed via a Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as an SF GUI application.
In this article we focus on the next two blocks, loading_params and data_loader_params, which together control how the data is loaded from the source and into the model.
Data loading from the source
Settings in the loading_params block modify how the data is loaded from the source and are defined per DataMode.
Parameters |
---|
- entities_ids_subquery : int
default: None
Subquery used to limit the data loaded from the data source.
- limit_entity_num_events : int
default: None
Limits the number of events per entity used. The most recent events are kept.
- cache_dir : Path
default: None
Directory to save queried data to, or to read from if a cached query exists. Applicable when fitting a model or when calling verify_target. Not applicable to data sources fed from parquet files.
Caching means that during Foundation Model training, data is saved to the cache_dir location in parquet file format. BaseModel then uses the cached data for subsequent training epochs, as well as for downstream tasks and predictions if so configured by the user.
Remember
With cache_dir you can enable caching, which will speed up and/or stabilize training when the database connection is unstable or slow.
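The effect of limit_entity_num_events can be pictured with a small Python sketch. This is an illustration of the "most recent events are kept" rule only, not BaseModel's implementation; the function name keep_most_recent and the (entity_id, timestamp, payload) record layout are assumptions made for the example.

```python
from collections import defaultdict


def keep_most_recent(events, limit):
    """Keep at most `limit` most recent events per entity.

    `events` is an iterable of (entity_id, timestamp, payload) tuples.
    This layout is hypothetical and chosen only for the illustration.
    """
    if limit is None:
        # Documented default: no limit, every event is kept.
        return sorted(events, key=lambda e: (e[0], e[1]))

    by_entity = defaultdict(list)
    for event in events:
        by_entity[event[0]].append(event)

    kept = []
    for entity_id in sorted(by_entity):
        ordered = sorted(by_entity[entity_id], key=lambda e: e[1])
        # Slice from the end so that only the newest `limit` events remain.
        kept.extend(ordered[max(len(ordered) - limit, 0):])
    return kept


events = [
    ("a", 1, "view"), ("a", 2, "cart"), ("a", 3, "buy"),
    ("b", 1, "view"),
]
recent = keep_most_recent(events, limit=2)
# Entity "a" loses its oldest event; entity "b" has fewer than 2 and is unchanged.
```

With limit=2, entity "a" keeps only its two newest events, while entities that have fewer events than the limit are not affected.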
Example |
---|
loading_params:
  Train:
    cache_dir: /data/USER/cache/name
  Validation:
    cache_dir: /data/USER/cache/name
  Test:
    cache_dir: /data/USER/cache/name
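The read-or-query behaviour that cache_dir enables can be sketched in plain Python. This is a minimal illustration under stated assumptions, not BaseModel's code: load_with_cache and fake_query are hypothetical names, and JSON stands in for the parquet format so that the example needs only the standard library.

```python
import json
import tempfile
from pathlib import Path


def load_with_cache(cache_dir, name, run_query):
    """Return cached rows if present, otherwise run the query and store the result."""
    if cache_dir is None:
        # Caching disabled (the documented default): always hit the source.
        return run_query()

    path = Path(cache_dir) / f"{name}.json"  # JSON is a stand-in for parquet
    if path.exists():
        return json.loads(path.read_text())

    rows = run_query()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(rows))
    return rows


calls = []


def fake_query():
    """Hypothetical stand-in for a database query; records each invocation."""
    calls.append(1)
    return [{"entity": "a", "event": "buy"}]


with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache(tmp, "train", fake_query)   # queries and writes the cache
    second = load_with_cache(tmp, "train", fake_query)  # served from the cache
```

The second call never reaches the data source, which is why caching helps when the database connection is slow or unreliable.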
Data loading into the model
The data_loader_params block allows you to set constructor parameters for the PyTorch DataLoader.
These settings modify how the data is loaded, such as batch size, number of workers, etc.
Parameters |
---|
- batch_size : int
default: 256
The size of the batch: how many samples per batch to load.
- num_workers : int
default: 0
How many sub-processes to use for data loading. 0 means that the data will be loaded in the main process. Increasing the number of workers splits queries into smaller pieces, which reduces memory consumption on the database end.
- pin_memory : boolean
default: False
If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them.
- drop_last : boolean
default: False
Set to True to drop the last incomplete batch if the dataset size is not divisible by the batch size.
- pin_memory_device : str
default: None
The device memory should be pinned to, if pin_memory is True.
- prefetch_factor : int
default: 2
Number of batches loaded in advance by each worker.
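The interaction of batch_size, drop_last, num_workers and prefetch_factor comes down to simple arithmetic, sketched below in Python. The helper names are hypothetical; the formulas follow the parameter descriptions above and the standard PyTorch DataLoader semantics, where each worker prefetches prefetch_factor batches.

```python
import math


def batches_per_epoch(num_samples, batch_size=256, drop_last=False):
    """Number of batches yielded in one pass over the dataset."""
    if drop_last:
        # The trailing incomplete batch is discarded.
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def prefetched_batches(num_workers, prefetch_factor=2):
    """Total batches loaded in advance across all workers."""
    return num_workers * prefetch_factor


# 1000 samples with the default batch size of 256:
with_last = batches_per_epoch(1000)                      # 3 full batches + 1 partial
without_last = batches_per_epoch(1000, drop_last=True)   # partial batch dropped
partial_size = 1000 - without_last * 256                 # samples in the partial batch
in_flight = prefetched_batches(num_workers=5)            # 5 workers x 2 batches each
```

Here the partial batch holds 232 samples, so drop_last: True trades those samples for uniformly sized batches.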
Example |
---|
data_loader_params:
  batch_size: 256
  num_workers: 5
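To see which settings are in effect for a configuration like the one above, the following Python sketch merges user-supplied values with the defaults listed in the parameter table. It assumes that keys left out of the YAML fall back to those documented defaults; effective_params is a hypothetical helper, and rejecting unknown keys is a choice made for this illustration rather than confirmed BaseModel behaviour.

```python
# Defaults copied from the parameter table in this article.
DOCUMENTED_DEFAULTS = {
    "batch_size": 256,
    "num_workers": 0,
    "pin_memory": False,
    "drop_last": False,
    "pin_memory_device": None,
    "prefetch_factor": 2,
}


def effective_params(overrides):
    """Merge user overrides on top of the documented defaults."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        # Assumption for this sketch: typos in key names should fail loudly.
        raise KeyError(f"Unknown data_loader_params keys: {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


# Mirrors the YAML example: only batch_size and num_workers are set explicitly.
params = effective_params({"batch_size": 256, "num_workers": 5})
```

Only num_workers differs from the defaults in this example; every other setting keeps its documented default value.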