Scaling & Memory
This page covers everything that determines how training uses your hardware — device selection, distributed strategies, data loading, model size, query parallelization, and caching. For parameters that affect what the model learns, see Model Tuning.
Devices & Distribution
By default, BaseModel trains on a single automatically selected GPU (accelerator: "gpu", devices: "auto", strategy: null). The "auto" setting picks the least-occupied GPU on the machine, falling back to CPU if no GPUs are available. This is enough for most datasets. Set accelerator: "cpu" only for quick smoke tests — CPU is too slow for real training.
To pin training to a specific GPU, set devices: [0] or any other device index. You may need multi-GPU training if a single card runs out of memory or if you want to speed up a large run. List the specific GPU indices in devices (e.g., [0, 1]) — avoid -1 (all GPUs), because features are loaded onto each device and using all at once may cause out-of-memory errors.
When using multiple devices, set a strategy:
- "ddp" (Distributed Data Parallel): each GPU gets a full copy of the model and processes a different slice of the data. Use this when the model fits on a single card but you want faster training by parallelising across data. Requires at least 2 devices.
- "fsdp" (Fully Sharded Data Parallel): shards the model across all devices so no single GPU needs to hold the full model. Use this when the model is too large for a single card's memory. Requires at least 2 devices.
- "fsdp:%d:%d": explicit control over FSDP parallelism. The first integer sets data parallelism (how many model replicas), the second sets tensor parallelism (how many devices share each replica). Use this when you have many GPUs and want to balance memory savings with throughput; e.g., "fsdp:2:4" on 8 GPUs creates 2 replicas, each sharded across 4 devices.
Start simple, scale if needed
Train on a single GPU first. Move to "ddp" if training is too slow, or to "fsdp" if you hit out-of-memory errors. Check GPU utilization before adding more devices — underutilized GPUs suggest a data loading bottleneck, not a compute one.
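As a rough illustration of moving from one GPU to two, the fragment below is a minimal sketch. It assumes that accelerator, devices, and strategy all sit together in training_params; the device indices are placeholders for your own machine.

```yaml
training_params:
  accelerator: "gpu"
  # Explicit indices instead of -1, so features are loaded only onto these cards.
  devices: [0, 1]
  # Each GPU holds a full model copy and processes a different slice of the data.
  strategy: "ddp"
  # If the model does not fit on one card, "fsdp" would shard it across both devices instead.
```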
See Training Parameters reference for all available fields.
Precision
Precision is set in training_params alongside devices:
| Precision | Speed | Memory | When to use |
|---|---|---|---|
| "bf16-mixed" | ~2× | ~50% | Best default, auto-selected on CUDA. Use whenever your hardware supports it (A100/H100 and newer). |
| "16-mixed" | ~2× | ~50% | Default choice for GPUs without bfloat16 support (pre-Ampere). |
| "32" | baseline | baseline | Debugging, or when you suspect precision-related convergence issues. |
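A minimal sketch of forcing full precision for a debugging run follows. The field name precision is an assumption: this page lists the accepted values but does not name the key, so confirm it in the Training Parameters reference.

```yaml
training_params:
  accelerator: "gpu"
  devices: [0]
  # Assumed key name; only the values are documented on this page.
  # "32" trades speed and memory for the simplest numerics while debugging.
  precision: "32"
```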
Data Loading
The data_loader_params block controls how data is fed to the model during training. Its fields are PyTorch DataLoader settings.
When to adjust:
- OOM errors: decrease batch_size (default 256). If memory is plentiful, increase it for better GPU utilization.
- Slow data pipeline: increase num_workers (default 0, main process only) to parallelise data loading. This also splits queries into smaller pieces, reducing memory on the database side.
- GPU idle between batches: increase prefetch_factor (default 2) so more batches are loaded in advance per worker. Decrease if prefetching causes memory pressure.
- Faster GPU transfers: enable pin_memory to copy tensors to CUDA pinned memory before transferring. In multi-GPU setups, use pin_memory_device to target a specific device.
- Final-batch instability: enable drop_last to skip the last incomplete batch of each epoch, which can have a much smaller size and cause training noise.
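The adjustments above might be combined as in the following sketch. The values are illustrative starting points, not recommendations, and where the block is nested in your config file is not shown here.

```yaml
data_loader_params:
  batch_size: 128       # halved from the default 256 to relieve GPU memory
  num_workers: 4        # default 0 loads data in the main process only
  prefetch_factor: 2    # batches loaded in advance per worker (default 2)
  pin_memory: true      # stage tensors in pinned memory for faster GPU transfers
  drop_last: true       # skip the smaller, incomplete final batch of each epoch
```

Change one setting at a time and watch GPU utilization, so the effect of each change stays visible.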
Automatic DataLoader Calibration
Instead of manually tuning num_workers and prefetch_factor, you can let BaseModel find the best values automatically by enabling calibration in your YAML config.
Before foundation model training begins, BaseModel will:
- Estimate a safe worker cap: based on available CPUs and, by default, available RAM. The RAM check measures per-worker buffer memory by probing a sample of entities and caps workers to avoid out-of-memory errors.
- Benchmark candidate configurations: sweeps through combinations of num_workers and prefetch_factor, measuring throughput in batches per second. Skips warmup batches and stops early when throughput plateaus.
- Select the most efficient config: picks the cheapest configuration that achieves at least 90% of peak throughput, then applies it to the trainer.
The calibration result is saved to the output directory. On resume, the saved result is reloaded without re-running the sweep.
When to adjust:
- Tight memory: lower ram_safety_margin (default 0.2) to allow more workers, or raise it to be more conservative.
- Known hardware limits: specify exact worker counts to test via candidate_workers (e.g., [0, 4, 6, 8, 12, 14]) instead of the default sweep up to 32.
- Fast calibration: reduce timeout_seconds (default 20.0) or max_measure_batches (default 200) to shorten the sweep at the cost of less precise measurements.
- Disable RAM cap: set cap_workers_by_ram: false if you want to rely only on the CPU-based cap (not recommended for memory-constrained environments).
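A shortened sweep using the parameters above could look like the sketch below. It uses only the fields named in this section. The switch that turns calibration on is not shown on this page and is deliberately left out, and the placement of calibration_params within the config is an assumption.

```yaml
calibration_params:
  # Test only these worker counts instead of the default sweep up to 32.
  candidate_workers: [0, 4, 8, 12]
  # Default margin; raise it to be more conservative about RAM.
  ram_safety_margin: 0.2
  # Keep the RAM-based worker cap in addition to the CPU-based one.
  cap_workers_by_ram: true
  # Half of the defaults (20.0 and 200): a faster but less precise sweep.
  timeout_seconds: 10.0
  max_measure_batches: 100
```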
Calibration vs. manual tuning
If you already know the optimal num_workers for your hardware, set it directly in data_loader_params. Use calibration when deploying to new or heterogeneous hardware where the optimal settings aren't known in advance.
See YAML Configuration → calibration_params for the full parameter reference.
Model Size
The memory_constraining_params block controls the architecture size. Use it to fit the model to your available memory or to scale up for richer representations.
When to adjust:
- Memory constraints: reduce hidden_dim (default 2048). This is the primary lever for trading capacity for speed and memory.
- Rich data, large datasets: increase hidden_dim for more model capacity. Only increase num_layers (default 4) if you still see clear underfitting after tuning hidden_dim.
- Reducing input size: lower emde_quality (default 1.0) to produce smaller sketch inputs at the cost of some fidelity. Useful when the model input is very wide.
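For a memory-constrained run, a minimal sketch might shrink only the primary lever and leave the other fields at the defaults stated above. The reduced hidden_dim value is illustrative, and the block's nesting in the config is not shown here.

```yaml
memory_constraining_params:
  hidden_dim: 1024    # halved from the default 2048 to cut memory and compute
  num_layers: 4       # default; raise only if underfitting persists after tuning hidden_dim
  emde_quality: 1.0   # default; lower it to produce smaller sketch inputs
```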
Query Parallelization
The query_optimization block controls how BaseModel queries data from your backend. Useful for large datasets where a single query would exhaust database or local memory:
```yaml
query_optimization:
  cleora_num_query_chunks: 4
  data_loading_num_query_chunks: 1
  num_cpus: 4
  num_concurrent_features: 4
```
Start without query_optimization and add it only if you hit memory constraints during data loading.
When to adjust:
- Database memory pressure: increase cleora_num_query_chunks (fit/embedding phase, default 1) and/or data_loading_num_query_chunks (train/validation/test/predict, default 1) to split queries into more partitions. Higher values reduce peak memory at the cost of more sequential reads. Start with 4 and increase as needed. Mid-epoch resume is unsupported when data_loading_num_query_chunks is greater than 1.
- Faster fit stage: increase num_concurrent_features (default 4) to process more columns in parallel, reducing total fit time. Should not exceed the number of available CPU cores. If image or cleora columns are present, account for num_cpus threads per such task: keep num_concurrent_features × num_cpus ≤ available cores to avoid oversubscription.
- Limiting CPU utilization: lower num_cpus (default 4) to restrict how many CPUs are used per column calculator at the start of the pretrain phase.
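To make the core budget concrete, here is a worked sketch for a hypothetical 16-core machine whose data includes image or cleora columns. The machine size and the chosen values are assumptions for illustration only.

```yaml
# Budget check: num_concurrent_features × num_cpus = 2 × 8 = 16 ≤ 16 available cores.
query_optimization:
  cleora_num_query_chunks: 4         # split fit-phase queries into 4 partitions
  data_loading_num_query_chunks: 1   # values above 1 make mid-epoch resume unsupported
  num_cpus: 8                        # threads per image or cleora column task
  num_concurrent_features: 2         # columns processed in parallel
```

With the defaults (4 and 4) the product is also 16, so the same machine is fully used either way; the split only changes how many columns run at once versus how many threads each one gets.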
Data Partitioning
For very large event tables, you can partition the data query itself by an entity column:
```yaml
- type: event
  name: transactions
  partition_column: customer_id
  partition_values_transformation: hash_mod
  ...
```
Set partition_column to your entity ID column and use partition_values_transformation: hash_mod to distribute entities into roughly equal chunks. This avoids loading the entire table at once.
This is configured per data source (inside data_sources), not as a global setting.
Caching
Set cache_path in data_params to cache queried data locally and avoid re-reading from the database on subsequent epochs or downstream tasks.
When set, BaseModel stores query results as Parquet files at the given path and reuses them for later epochs, scenario training, and inference. This is especially useful when the database connection is slow or unstable.
Cache configuration varies by data source type
cache_path in data_params applies to database-backed sources. For Parquet file sources, use the engine's own cache_path inside connection_params as described in Connect Sources.
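For a database-backed source, a minimal sketch could look like this. The directory is a hypothetical placeholder; pick a location with enough disk space for the queried data.

```yaml
data_params:
  # Hypothetical local directory; query results are written here as Parquet files
  # and reused for later epochs, scenario training, and inference.
  cache_path: /data/basemodel_cache
```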