Troubleshooting
This page walks through the foundation model training pipeline step by step, showing what to look for in logs and how to resolve common failures.
Phase 1: Fitting Behavioral Representation
Configuration & Initialization
Log signature:
INFO - monad.run: Storing config in output directory...
INFO - monad.run: Processing n columns concurrently.
INFO - monad.run: Ray available resources {...}
What can go wrong:
- Invalid config schema: missing ID or date columns, duplicated column definitions, or conflicting settings. Raises ValueError or ValidationError with a description of the problem.
- Fix: correct the flagged entries in your YAML.
- Broken data source logic: non-unique source names, incorrect or cyclic joins, inconsistent type overrides. Raises ValueError pointing to the conflicting fields.
- Fix: review joins and source names.
- Joining on two SQL Lambda columns — not supported. Results in database errors like "missing table" or "column does not exist".
- Fix: precompute one of the join keys in the source data or create a SQL view.
- Invalid data parameters: reversed date ranges, overlapping test and training periods, negative window sizes. Raises ValueError.
- Fix: correct the dates and split config.
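The data-parameter problems in the last item can also be caught before a run is launched. The following is a minimal sketch of such a pre-flight check. The function and argument names (check_data_params, train_start, window_days, and so on) are hypothetical and do not correspond to BaseModel config keys.

```python
from datetime import date


def check_data_params(train_start, train_end, test_start, test_end, window_days):
    """Return a list of human-readable problems; an empty list means none found.

    All argument names are hypothetical, chosen for illustration only.
    """
    problems = []
    if train_start >= train_end:
        problems.append("training date range is reversed or empty")
    if test_start >= test_end:
        problems.append("test date range is reversed or empty")
    # Two ranges overlap when each one starts before the other ends.
    if train_start < test_end and test_start < train_end:
        problems.append("training and test periods overlap")
    if window_days < 0:
        problems.append("window size is negative")
    return problems


if __name__ == "__main__":
    issues = check_data_params(
        train_start=date(2023, 1, 1),
        train_end=date(2023, 12, 31),
        test_start=date(2023, 12, 1),  # starts before training ends
        test_end=date(2024, 1, 31),
        window_days=30,
    )
    for issue in issues:
        print(issue)
```

Returning a list instead of raising on the first problem lets you fix all flagged entries in one pass.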
Data Sampling
Log signature:
INFO - monad.fit.monad_analyzer: Creating connector for data_source abc
INFO - monad.fit.monad_analyzer: Retrieving data sample for abc
INFO - monad.fit.monad_analyzer: Data sample shape: (x, y)
What can go wrong:
- Out-of-memory on database side: the sample query loads too much data at once. You may see engine-specific errors like "Resources exceeded during query execution".
- Fix: reduce the number of features with allowed_columns/disallowed_columns, or ask your platform administrator to increase database-side quotas.
- Query timeouts — long-running queries exceed execution limits.
- Fix: increase database-side retry limits or concurrency settings with your platform administrator.
Column Analysis
Log signature:
INFO - monad.fit.preprocessing: Analyzing abc type...
INFO - monad.fit.preprocessing: Column bcd, type categorical
INFO - monad.fit.preprocessing: Column cde, type decimal
INFO - monad.fit.preprocessing: Column def, type categoricalCompressed
What can go wrong:
- Misinterpreted column types: e.g., a numeric field inferred as categorical because it lacks a decimal part.
- Fix: use column_type_overrides or correct the type in the source database.
- Suboptimal encoding: text or time-series columns defaulting to categorical or decimal. BaseModel may log a hint like "Column 'price' appears to be a time series."
- Fix: add the appropriate column_type_overrides entry. See Enrich & Transform.
- Text columns missing from features: text columns are skipped by default during fitting.
- Fix: explicitly opt in with column_type_overrides: {description: text}. See Enrich & Transform → Column Type Overrides.
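One way to anticipate the first item is to scan a data sample for numeric columns whose values are all whole numbers, since those are the fields described above as at risk of being inferred as categorical. The sketch below is not part of BaseModel. The helper name and sample rows are made up for illustration, and the check only approximates the situation described; it does not reproduce the actual type inference.

```python
def integer_only_columns(rows):
    """Return names of columns whose non-null values are all whole numbers.

    `rows` is a list of dicts, one per record. Flagged columns are candidates
    to review, not a prediction of what type inference will decide.
    """
    flagged = []
    columns = rows[0].keys() if rows else []
    for name in columns:
        values = [r[name] for r in rows if r.get(name) is not None]
        if not values:
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        all_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if all_numeric and all(float(v).is_integer() for v in values):
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    sample = [
        {"quantity": 2, "price": 19.99, "category": "shoes"},
        {"quantity": 5, "price": 4.50, "category": "socks"},
    ]
    print(integer_only_columns(sample))  # ['quantity']
```

Columns flagged this way are worth checking against the column types reported in the log before deciding whether an override is needed.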
Computing Representations
Log signature:
INFO - monad.fit.features.tasks: Computing embeddings fusion
INFO - monad.fit.features.tasks: Running EMDE
INFO - monad.fit.features.calculators.cleora_calculator: Will train cleora for column abc
INFO - monad.utils.feature_stats: Saving feature stats
What can go wrong:
- Out-of-memory (local machine): worker processes crash. Typical error: "A worker died or was killed while executing a task by an unexpected system error."
- Fix: review the automatic resource estimation report to check whether concurrency is within the recommended limit. Tune parallelism with num_concurrent_features and num_cpus in query_optimization (see Scaling & Memory). Allocate more memory to the Docker container if needed.
- Excessively long runtime: heavy feature computation, especially with many sketch-type columns.
- Fix: remove redundant sketch features; e.g., if both product_id and product_name represent the same entity, keep only one. Reduce cardinality where possible: if product IDs encode size/color/volume, restructure so the ID represents the core product and pass variants as separate columns.
- Out-of-memory (database side): same as during sampling.
- Fix: use num_query_chunks in query_optimization to split database queries into smaller parts.
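The cardinality advice in the runtime item can be illustrated with a short sketch. It assumes a purely hypothetical ID layout, a core product code followed by size and color separated by hyphens (for example TSHIRT-M-RED). Real IDs need their own parsing rule, and in practice this restructuring would usually be done in the source data or a SQL view rather than in Python.

```python
def split_product_id(product_id):
    """Split a composite ID like 'TSHIRT-M-RED' into core product and variants.

    The 'CORE-SIZE-COLOR' layout is an assumption made for this example.
    """
    core, size, color = product_id.split("-", 2)
    return {"product": core, "size": size, "color": color}


if __name__ == "__main__":
    ids = ["TSHIRT-M-RED", "TSHIRT-L-RED", "TSHIRT-M-BLUE", "MUG-S-WHITE"]
    rows = [split_product_id(i) for i in ids]
    # Four distinct composite IDs collapse to two distinct core products.
    print(len(set(ids)), len({r["product"] for r in rows}))  # 4 2
```

The core product column keeps a much smaller vocabulary, while size and color remain available as separate low-cardinality columns.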
When the fit stage completes, a columns analysis report summarizes what was processed — column types, skipped columns, and action recommendations. See Run Training → Columns Analysis Report.
Redundancy Report
During the fit stage, BaseModel checks for redundant columns — pairs whose values form a bijection (one-to-one mapping), meaning they carry the same information. Redundant pairs are flagged in a warning log and in the columns analysis report:
WARNING - Detected 1 redundancy groups in provided columns.
Columns groups of categorical type `[('product_code', 'product_id')]`
provide the same information.
The columns analysis report also shows them under Action recommendations with a hint:
Redundant categorical columns: product_code, product_id
Hint: These columns form a bijection (one-to-one mapping). Keeping both adds no new information. Remove the redundant column via disallowed_columns.
Redundant columns add computation time without contributing extra signal. BaseModel does not remove them automatically — you must update the config yourself. However, the suggested_config.yaml generated in the output path already includes one column from each bijection pair in disallowed_columns and adds detected time-series columns as sql_lambdas with matching column_type_overrides entries.
- Fix: review the flagged pairs to confirm they genuinely represent the same dimension. Check the suggested_config.yaml: bijection columns are already excluded and time-series candidates are already configured there. Verify the suggestions, then adopt the file or copy the relevant entries into your config. See also Select & Organize → Filtering Columns.
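To confirm a flagged pair by hand, you can check on a data sample that each value of one column maps to exactly one value of the other, and vice versa. A minimal sketch follows. It is not BaseModel's detection code, and the function name is hypothetical.

```python
def is_bijection(pairs):
    """Return True if the (a, b) pairs map one-to-one in both directions."""
    forward = {}   # a -> b
    backward = {}  # b -> a
    for a, b in pairs:
        # setdefault stores the first partner seen and returns it afterwards.
        if forward.setdefault(a, b) != b:
            return False  # same a paired with two different b values
        if backward.setdefault(b, a) != a:
            return False  # same b paired with two different a values
    return True


if __name__ == "__main__":
    rows = [("P-001", 1), ("P-002", 2), ("P-001", 1)]
    print(is_bijection(rows))                   # True
    print(is_bijection(rows + [("P-003", 2)]))  # False
```

A True result on a sample only means no counterexample was found in that sample, so it is worth running the check on as much data as is practical before dropping a column.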
Phase 2: Training the Model
Loading Representations
Log signature:
INFO - monad.modalities.modality_artifact: Available datasets [names]
INFO - monad.modalities.modality_artifact: Data sources to load: [names]
INFO - monad.datasets.utils: Set fit columns [aliases] for data source abc.
What can go wrong:
- Corrupted or manually modified checkpoint files — leads to hard-to-debug loading failures.
- Fix: do not modify checkpoint contents manually. Re-run the fit stage to produce a clean checkpoint.
Initializing the Trainer
Log signature:
INFO - monad.run: Training foundation model...
INFO - monad.core.fm.model_provider: Number of model parameters: n
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
You should also see a model summary with parameter counts and estimated size.
What can go wrong:
- Long delay before training begins: memory-heavy data preparation, especially with large Parquet sources or multi-GPU setups.
- Fix: exclude low-signal columns with allowed_columns/disallowed_columns. Lower window_shuffling_buffer_size in data_params.
- Out-of-memory: too many columns or aggressive buffer settings.
- Fix: lower window_shuffling_buffer_size in data_params. Exclude low-signal columns. Increase available memory for the container.
Training Loop
Log signature:
Train Epoch 0/2 Entities ━━━━━━━━━━╸━━━━━━━━━━ 45% 4,500/10,000 0:03:12 0:03:55
Train Epoch 0/2 Entities ━━━━━━━━━━━━━━━━━━━━ 100% 10,000/10,000 0:07:08 0:00:00
Validation Entities ━━━━━━━━━━━━━━━━━━━━ 100% 1,000/1,000 0:00:23 0:00:00
`Trainer.fit` stopped: `max_epochs=n` reached.
INFO - monad.run: Training foundation model finished.
INFO - monad.run: Pretraining finished.
The Rich entity progress bar shows percentage complete, entities processed, elapsed time, and estimated time remaining. It supplements the standard PyTorch Lightning training output.
What can go wrong:
- GPU out-of-memory: the model doesn't fit. Typical error: "torch.OutOfMemoryError: CUDA out of memory."
- Fix: check that the GPU isn't occupied by other processes (nvidia-smi or nvtop). If a previous run was interrupted with Ctrl+Z, the process is only suspended; use kill %1 or kill <PID> to free the memory. Otherwise, enable multi-GPU training via devices and strategy (see Scaling & Memory), reduce model size with hidden_dim or emde_quality in memory_constraining_params, or remove redundant features.
- Training is very slow: long pauses between batches or epochs.
- Fix: enable multi-GPU with devices and strategy. Increase batch_size if memory allows. Increase num_workers in data_loader_params.
- Database-side OOM during training: queries during the training loop exceed database memory.
- Fix: increase num_workers to split queries into more parts. Use num_query_chunks in query_optimization. Enable cache_path in data_params to avoid re-querying.
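The error signatures quoted on this page can be collected into a small lookup for triaging a log file. The sketch below is a convenience script, not a BaseModel tool. The function name is hypothetical, and the one-line hints are condensed from the fixes listed above.

```python
# Error substrings quoted on this page, mapped to a condensed hint.
SIGNATURES = {
    "Resources exceeded during query execution":
        "database-side OOM: limit columns or use num_query_chunks",
    "A worker died or was killed while executing a task":
        "local OOM: lower num_concurrent_features / num_cpus or add memory",
    "CUDA out of memory":
        "GPU OOM: free the GPU, use multi-GPU, or reduce model size",
}


def triage(log_lines):
    """Return (line_number, hint) for every line containing a known signature."""
    hits = []
    for number, line in enumerate(log_lines, start=1):
        for signature, hint in SIGNATURES.items():
            if signature in line:
                hits.append((number, hint))
    return hits


if __name__ == "__main__":
    sample = [
        "INFO - monad.run: Training foundation model...",
        "torch.OutOfMemoryError: CUDA out of memory.",
    ]
    for number, hint in triage(sample):
        print(f"line {number}: {hint}")
```

Database error text varies by engine, so extend SIGNATURES with the messages your own platform produces.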