Scaling & Memory

This page covers everything that determines how training uses your hardware — device selection, distributed strategies, data loading, model size, query parallelization, and caching. For parameters that affect what the model learns, see Model Tuning.

Devices & Distribution

By default, BaseModel trains on a single automatically selected GPU (accelerator: "gpu", devices: "auto", strategy: null). The "auto" setting picks the least-occupied GPU on the machine, falling back to CPU if no GPUs are available. This is enough for most datasets. Set accelerator: "cpu" only for quick smoke tests — CPU is too slow for real training.

yaml
training_params:
  accelerator: "gpu"
  devices: "auto"
  strategy: null

To pin training to a specific GPU, set devices: [0] or any other device index. You may need multi-GPU training if a single card runs out of memory or if you want to speed up a large run. List the specific GPU indices in devices (e.g., [0, 1]) and avoid -1 (all GPUs): features are loaded onto each device, so using every GPU at once may cause out-of-memory errors.

When using multiple devices, set a strategy:

  • "ddp" (Distributed Data Parallel) — each GPU gets a full copy of the model and processes a different slice of the data. Use this when the model fits on a single card but you want faster training by parallelising across data. Requires at least 2 devices.
  • "fsdp" (Fully Sharded Data Parallel) — shards the model across all devices so no single GPU needs to hold the full model. Use this when the model is too large for a single card's memory. Requires at least 2 devices.
  • "fsdp:%d:%d" — explicit control over FSDP parallelism: the first integer sets data parallelism (how many model replicas), the second sets tensor parallelism (how many devices share each replica). Use this when you have many GPUs and want to balance memory savings with throughput — e.g., "fsdp:2:4" on 8 GPUs creates 2 replicas, each sharded across 4 devices.
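
As a rough illustration of the options above, a two-GPU data-parallel run might be configured as follows. The device indices are placeholders for whatever cards are free on your machine, and the commented alternative assumes an 8-GPU host, matching the "fsdp:2:4" example.

```yaml
# Sketch only: device indices are illustrative, not a recommendation.
training_params:
  accelerator: "gpu"
  devices: [0, 1]        # explicit indices rather than -1 (all GPUs)
  strategy: "ddp"        # full model copy per GPU, data split across them

# Alternative for a model that does not fit on one card (8 GPUs assumed):
# training_params:
#   accelerator: "gpu"
#   devices: [0, 1, 2, 3, 4, 5, 6, 7]
#   strategy: "fsdp:2:4"   # 2 replicas, each sharded across 4 devices
```

Switching between the two only changes devices and strategy; the rest of training_params stays as it is.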

Start simple, scale if needed

Train on a single GPU first. Move to "ddp" if training is too slow, or to "fsdp" if you hit out-of-memory errors. Check GPU utilization before adding more devices — underutilized GPUs suggest a data loading bottleneck, not a compute one.

See the Training Parameters reference for all available fields.

Precision

Precision is set in training_params alongside devices:

yaml
training_params:
  precision: "bf16-mixed"

| Precision | Speed | Memory | Details |
|---|---|---|---|
| "bf16-mixed" | ~2× | ~50% | Best default; auto-selected on CUDA. Use whenever your hardware supports it (A100/H100 and newer). |
| "16-mixed" | ~2× | ~50% | Default choice for GPUs without bfloat16 support (pre-Ampere). |
| "32" | baseline | baseline | Debugging, or when you suspect precision-related convergence issues. |
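
For example, on an older (pre-Ampere) card the precision setting could be overridden like this. It is a sketch that reuses the fields shown earlier on this page; the device index is illustrative.

```yaml
training_params:
  accelerator: "gpu"
  devices: [0]             # illustrative device index
  precision: "16-mixed"    # for GPUs without bfloat16 support
```

If training diverges or the loss behaves oddly, temporarily switching precision to "32" is one way to check whether reduced precision is the cause.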

Data Loading

The data_loader_params block controls how data is fed to the model during training. These are PyTorch DataLoader settings:

yaml
data_loader_params:
  batch_size: 256
  num_workers: 5

When to adjust:

  • OOM errors — decrease batch_size (default 256). If memory is plentiful, increase it for better GPU utilization.
  • Slow data pipeline — increase num_workers (default 0, main process only) to parallelise data loading. This also splits queries into smaller pieces, reducing memory on the database side.
  • GPU idle between batches — increase prefetch_factor (default 2) so more batches are loaded in advance per worker. Decrease if prefetching causes memory pressure.
  • Faster GPU transfers — enable pin_memory to copy tensors to CUDA pinned memory before transferring. In multi-GPU setups, use pin_memory_device to target a specific device.
  • Final-batch instability — enable drop_last to skip the last incomplete batch of each epoch, which can have a much smaller size and cause training noise.
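
Putting several of these adjustments together, a configuration for a memory-constrained GPU with a slow data pipeline might look like the sketch below. The values are illustrative, and the boolean form of pin_memory and drop_last is an assumption based on the PyTorch DataLoader arguments of the same name.

```yaml
# Sketch only: tune the numbers to your own hardware.
data_loader_params:
  batch_size: 128        # halved from the default 256 to relieve GPU memory
  num_workers: 8         # parallel loading workers (default 0)
  prefetch_factor: 4     # batches loaded ahead per worker (default 2)
  pin_memory: true       # assumed boolean, as in PyTorch
  drop_last: true        # skip the final incomplete batch of each epoch
```

In PyTorch itself, prefetch_factor only takes effect when num_workers is greater than 0, so the two are usually raised together.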

Automatic DataLoader Calibration

Instead of manually tuning num_workers and prefetch_factor, you can let BaseModel find the best values automatically. Enable calibration in your YAML config:

yaml
calibration_params:
  enabled: true

Before foundation model training begins, BaseModel will:

  1. Estimate a safe worker cap — based on available CPUs and, by default, available RAM. The RAM check measures per-worker buffer memory by probing a sample of entities and caps workers to avoid out-of-memory errors.
  2. Benchmark candidate configurations — sweeps through combinations of num_workers and prefetch_factor, measuring throughput in batches per second. Skips warmup batches and stops early when throughput plateaus.
  3. Select the most efficient config — picks the cheapest configuration that achieves at least 90 % of peak throughput, then applies it to the trainer.

The calibration result is saved to the output directory. On resume, the saved result is reloaded without re-running the sweep.

When to adjust:

  • Tight memory — raise ram_safety_margin (default 0.2) to cap workers more conservatively, or lower it to allow more workers when RAM is plentiful.
  • Known hardware limits — specify exact worker counts to test via candidate_workers (e.g., [0, 4, 6, 8, 12, 14]) instead of the default sweep up to 32.
  • Fast calibration — reduce timeout_seconds (default 20.0) or max_measure_batches (default 200) to shorten the sweep at the cost of less precise measurements.
  • Disable RAM cap — set cap_workers_by_ram: false if you want to rely only on the CPU-based cap (not recommended for memory-constrained environments).
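
The options above could be combined as in the following sketch, which assumes they all nest directly under calibration_params next to enabled. The specific values are illustrative.

```yaml
# Sketch only: a shorter, more conservative calibration sweep.
calibration_params:
  enabled: true
  ram_safety_margin: 0.3            # more conservative than the default 0.2
  candidate_workers: [0, 4, 8]      # test only these worker counts
  timeout_seconds: 10.0             # default 20.0
  max_measure_batches: 100          # default 200
  cap_workers_by_ram: true          # keep the RAM-based worker cap
```

A shorter sweep finishes sooner but measures throughput less precisely, so it suits hardware where you already have a rough idea of the right range.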

Calibration vs. manual tuning

If you already know the optimal num_workers for your hardware, set it directly in data_loader_params. Use calibration when deploying to new or heterogeneous hardware where the optimal settings aren't known in advance.

See YAML Configuration → calibration_params for the full parameter reference.

Model Size

The memory_constraining_params block controls the architecture size — use it to fit the model to your available memory or to scale up for richer representations:

yaml
memory_constraining_params:
  hidden_dim: 2048
  num_layers: 4
  emde_quality: 1.0

When to adjust:

  • Memory constraints — reduce hidden_dim (default 2048). This is the primary lever for trading capacity for speed and memory.
  • Rich data, large datasets — increase hidden_dim for more model capacity. Only increase num_layers (default 4) if you still see clear underfitting after tuning hidden_dim.
  • Reducing input size — lower emde_quality (default 1.0) to produce smaller sketch inputs at the cost of some fidelity. Useful when the model input is very wide.
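
As an illustration, a reduced configuration for a memory-constrained GPU might look like this. The values are examples only, and it is an assumption that emde_quality accepts fractional values below 1.0.

```yaml
# Sketch only: smaller model for limited GPU memory.
memory_constraining_params:
  hidden_dim: 1024       # halved from the default 2048
  num_layers: 4          # left at the default
  emde_quality: 0.5      # assumed valid; smaller sketch inputs, less fidelity
```

Lowering hidden_dim first and leaving num_layers alone follows the order of levers described above.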

Query Parallelization

The query_optimization block controls how BaseModel queries data from your backend. It is useful for large datasets where a single query would exhaust database or local memory:

yaml
query_optimization:
  cleora_num_query_chunks: 4
  data_loading_num_query_chunks: 1
  num_cpus: 4
  num_concurrent_features: 4

Start without query_optimization and add it only if you hit memory constraints during data loading.

When to adjust:

  • Database memory pressure — increase cleora_num_query_chunks (fit/embedding phase, default 1) and/or data_loading_num_query_chunks (train/validation/test/predict, default 1) to split queries into more partitions. Higher values reduce peak memory at the cost of more sequential reads. Start with 4 and increase as needed. Mid-epoch resume is unsupported when data_loading_num_query_chunks is greater than 1.
  • Faster fit stage — increase num_concurrent_features (default 4) to process more columns in parallel, reducing total fit time. Should not exceed the number of available CPU cores. If image or cleora columns are present, account for num_cpus threads per such task: keep num_concurrent_features × num_cpus ≤ available cores to avoid oversubscription.
  • Limiting CPU utilization — lower num_cpus (default 4) to restrict how many CPUs are used per column calculator at the start of the pretrain phase.
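
To make the core-budget rule concrete, here is a sketch for a hypothetical 16-core machine with image or cleora columns present. The core count and chunk values are assumptions for illustration.

```yaml
# Sketch only: assumes 16 available CPU cores.
query_optimization:
  cleora_num_query_chunks: 4         # split fit/embedding queries into 4 partitions
  data_loading_num_query_chunks: 4   # note: mid-epoch resume is unsupported above 1
  num_cpus: 4                        # CPUs per column calculator
  num_concurrent_features: 4         # 4 × 4 = 16, within the 16-core budget
```

On an 8-core machine the same rule would suggest halving either num_cpus or num_concurrent_features so that their product stays at or below 8.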

Data Partitioning

For very large event tables, you can partition the data query itself by an entity column:

yaml
- type: event
  name: transactions
  partition_column: customer_id
  partition_values_transformation: hash_mod
  ...

Set partition_column to your entity ID column and use partition_values_transformation: hash_mod to distribute entities into roughly equal chunks. This avoids loading the entire table at once.

This is configured per data source (inside data_sources), not as a global setting.
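
In context, the partitioning fields sit on the individual source entry, roughly as sketched below. The nesting under data_sources follows the note above; the remaining source settings are omitted here, as in the original snippet.

```yaml
# Sketch only: other fields of the source entry are left out.
data_sources:
  - type: event
    name: transactions
    partition_column: customer_id                   # entity ID column
    partition_values_transformation: hash_mod       # roughly equal chunks
    # ... remaining settings for this source
```

Other entries in data_sources are unaffected unless they set their own partition_column.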

Caching

Cache queried data locally to avoid re-reading from the database on subsequent epochs or downstream tasks:

yaml
data_params:
  cache_path: "/path/to/store/cache"

When set, BaseModel stores query results as Parquet files at the given path and reuses them for later epochs, scenario training, and inference. This is especially useful when the database connection is slow or unstable.

Cache configuration varies by data source type

cache_path in data_params applies to database-backed sources. For Parquet file sources, use the engine's own cache_path inside connection_params as described in Connect Sources.