Troubleshooting

This page walks through the foundation model training pipeline step by step, showing what to look for in logs and how to resolve common failures.

Phase 1: Fitting Behavioral Representation

Configuration & Initialization

Log signature:

INFO - monad.run: Storing config in output directory...
INFO - monad.run: Processing n columns concurrently.
INFO - monad.run: Ray available resources {...}

What can go wrong:

  • Invalid config schema — missing ID or date columns, duplicated column definitions, conflicting settings. Raises ValueError or ValidationError with a description of the problem.
    • Fix: correct the flagged entries in your YAML.
  • Broken data source logic — non-unique source names, incorrect or cyclic joins, inconsistent type overrides. Raises ValueError pointing to the conflicting fields.
    • Fix: review joins and source names.
  • Joining on two SQL Lambda columns — not supported. Results in database errors like "missing table" or "column does not exist".
    • Fix: precompute one of the join keys in the source data or create a SQL view.
  • Invalid data parameters — reversed date ranges, overlapping test and training periods, negative window sizes. Raises ValueError.
    • Fix: correct the dates and split config.
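The data-parameter checks listed above can be reproduced before launching a run. The following is a minimal sketch, not BaseModel's own validation: the function name and its arguments are hypothetical and do not correspond to actual config keys, and date ranges are assumed to be half-open (the end date is exclusive).

```python
from datetime import date


def check_split(train_start, train_end, test_start, test_end, window_days):
    """Return a list of problems; an empty list means none were found.

    Hypothetical helper: argument names are illustrative, not config keys.
    """
    problems = []
    if train_start >= train_end:
        problems.append("training date range is reversed or empty")
    if test_start >= test_end:
        problems.append("test date range is reversed or empty")
    # Two half-open ranges overlap when each one starts before the other ends.
    if train_start < test_end and test_start < train_end:
        problems.append("training and test periods overlap")
    if window_days < 0:
        problems.append("window size is negative")
    return problems


issues = check_split(
    train_start=date(2023, 1, 1),
    train_end=date(2023, 10, 1),
    test_start=date(2023, 9, 1),  # starts before training ends
    test_end=date(2023, 12, 31),
    window_days=30,
)
print(issues)
```

Running a check like this against the dates in your YAML can surface the problem before the pipeline raises ValueError.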

Data Sampling

Log signature:

INFO - monad.fit.monad_analyzer: Creating connector for data_source abc
INFO - monad.fit.monad_analyzer: Retrieving data sample for abc
INFO - monad.fit.monad_analyzer: Data sample shape: (x, y)

What can go wrong:

  • Out-of-memory on database side — the sample query loads too much data at once. You may see engine-specific errors like Resources exceeded during query execution.
    • Fix: reduce the number of features with allowed_columns / disallowed_columns, or increase database-side quotas with your platform administrator.
  • Query timeouts — long-running queries exceed execution limits.
    • Fix: increase database-side retry limits or concurrency settings with your platform administrator.

Column Analysis

Log signature:

INFO - monad.fit.preprocessing: Analyzing abc type...
INFO - monad.fit.preprocessing: Column bcd, type categorical
INFO - monad.fit.preprocessing: Column cde, type decimal
INFO - monad.fit.preprocessing: Column def, type categoricalCompressed

What can go wrong:

  • Misinterpreted column types — e.g., a numeric field inferred as categorical because it lacks a decimal part.
    • Fix: use column_type_overrides or correct the type in the source database.
  • Suboptimal encoding — text or time-series columns defaulting to categorical or decimal. BaseModel may log a hint like Column 'price' appears to be a time series.
    • Fix: set the intended type with column_type_overrides. For detected time-series columns, the suggested_config.yaml in the output path already contains sql_lambdas and matching column_type_overrides entries (see Redundancy Report).
  • Text columns missing from features — text columns are skipped by default during fitting.
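To spot misinterpreted types quickly, the inferred types can be collected from the fit log and reviewed in one place. This is a rough sketch that assumes log lines follow the "Column <name>, type <type>" pattern shown in the log signature above; real log lines may carry extra prefixes such as timestamps, which the pattern tolerates. The function name is illustrative.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   INFO - monad.fit.preprocessing: Column bcd, type categorical
COLUMN_LINE = re.compile(
    r"monad\.fit\.preprocessing: Column (?P<name>.+?), type (?P<type>\w+)\s*$"
)


def columns_by_type(log_text):
    """Group column names by the type reported in the fit log."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        match = COLUMN_LINE.search(line)
        if match:
            grouped[match.group("type")].append(match.group("name"))
    return dict(grouped)


sample_log = """\
INFO - monad.fit.preprocessing: Analyzing abc type...
INFO - monad.fit.preprocessing: Column bcd, type categorical
INFO - monad.fit.preprocessing: Column cde, type decimal
INFO - monad.fit.preprocessing: Column def, type categoricalCompressed
"""

for column_type, names in columns_by_type(sample_log).items():
    print(f"{column_type}: {', '.join(names)}")
```

Columns that land under an unexpected type are candidates for column_type_overrides.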

Computing Representations

Log signature:

INFO - monad.fit.features.tasks: Computing embeddings fusion
INFO - monad.fit.features.tasks: Running EMDE
INFO - monad.fit.features.calculators.cleora_calculator: Will train cleora for column abc
INFO - monad.utils.feature_stats: Saving feature stats

What can go wrong:

  • Out-of-memory (local machine) — worker processes crash. Typical error: A worker died or was killed while executing a task by an unexpected system error.
    • Fix: review the automatic resource estimation report to check whether concurrency is within the recommended limit. Tune parallelism with num_concurrent_features and num_cpus in query_optimization (see Scaling & Memory). Allocate more memory to the Docker container if needed.
  • Excessively long runtime — heavy feature computation, especially with many sketch-type columns.
    • Fix: remove redundant sketch features — e.g., if both product_id and product_name represent the same entity, keep only one. Reduce cardinality where possible — if product IDs encode size/color/volume, restructure so the ID represents the core product and pass variants as separate columns.
  • Out-of-memory (database side) — same as during sampling.
    • Fix: use num_query_chunks in query_optimization to split database queries into smaller parts.
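The cardinality advice above can be illustrated with a small sketch. It assumes a purely hypothetical ID layout of "<core>-<color>-<size>"; your IDs will differ, and in practice the split would more likely live in the source database or a SQL view than in Python.

```python
def split_product_id(raw_id):
    """Split a composite ID into a core product and its variant attributes.

    Assumes the hypothetical layout "<core>-<color>-<size>".
    """
    parts = raw_id.split("-")
    if len(parts) != 3:
        raise ValueError(f"unexpected product ID layout: {raw_id!r}")
    core, color, size = parts
    return {"product_id": core, "color": color, "size": size}


# Illustrative IDs only.
raw_ids = ["TSHIRT01-RED-M", "TSHIRT01-RED-L", "TSHIRT01-BLUE-M", "MUG07-WHITE-330ML"]
rows = [split_product_id(raw_id) for raw_id in raw_ids]

distinct_raw = len(set(raw_ids))
distinct_core = len({row["product_id"] for row in rows})
print(f"{distinct_raw} distinct raw IDs, {distinct_core} distinct core products")
```

After the split, the ID column has fewer distinct values, while color and size remain available as separate low-cardinality columns.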

When the fit stage completes, a columns analysis report summarizes what was processed — column types, skipped columns, and action recommendations. See Run Training → Columns Analysis Report.

Redundancy Report

During the fit stage, BaseModel checks for redundant columns — pairs whose values form a bijection (one-to-one mapping), meaning they carry the same information. Redundant pairs are flagged in a warning log and in the columns analysis report:

WARNING - Detected 1 redundancy groups in provided columns.
  Columns groups of categorical type `[('product_code', 'product_id')]`
  provide the same information.

The columns analysis report also shows them under Action recommendations with a hint:

Redundant categorical columns: product_code, product_id
Hint: These columns form a bijection (one-to-one mapping). Keeping both adds no new information. Remove the redundant column via disallowed_columns.

Redundant columns add computation time without contributing extra signal. BaseModel does not remove them automatically — you must update the config yourself. However, the suggested_config.yaml generated in the output path already includes one column from each bijection pair in disallowed_columns and adds detected time-series columns as sql_lambdas with matching column_type_overrides entries.

  • Fix: review the flagged pairs to confirm they genuinely represent the same dimension, and verify the corresponding entries in suggested_config.yaml. Then adopt the file or copy the relevant entries into your config. See also Select & Organize → Filtering Columns.
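When reviewing a flagged pair, it can help to confirm the one-to-one mapping on a data sample of your own. The sketch below shows one way to test two columns for a bijection; it is not BaseModel's detection logic, which may treat nulls or sampling differently, and the column values are illustrative.

```python
def is_bijection(left, right):
    """Return True when values in `left` and `right` map one-to-one."""
    if len(left) != len(right):
        raise ValueError("columns must have the same number of rows")
    forward, backward = {}, {}
    for a, b in zip(left, right):
        # setdefault stores the first partner seen for a value and returns it
        # on later lookups, so a mismatch means one value has two partners.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


# Illustrative sample values.
product_code = ["A1", "B2", "A1", "C3"]
product_id = [101, 202, 101, 303]
category = ["toys", "toys", "toys", "books"]

print(is_bijection(product_code, product_id))  # each code pairs with exactly one ID
print(is_bijection(product_code, category))    # "toys" pairs with two different codes
```

A pair that passes on a sample may still diverge on the full data, so treat a positive result as a prompt to check what the two columns mean rather than as proof.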

Phase 2: Training the Model

Loading Representations

Log signature:

INFO - monad.modalities.modality_artifact: Available datasets [names]
INFO - monad.modalities.modality_artifact: Data sources to load: [names]
INFO - monad.datasets.utils: Set fit columns [aliases] for data source abc.

What can go wrong:

  • Corrupted or manually modified checkpoint files — leads to hard-to-debug loading failures.
    • Fix: do not modify checkpoint contents manually. Re-run the fit stage to produce a clean checkpoint.

Initializing the Trainer

Log signature:

INFO - monad.run: Training foundation model...
INFO - monad.core.fm.model_provider: Number of model parameters: n

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True

You should also see a model summary with parameter counts and estimated size.

What can go wrong:

  • Long delay before training begins — memory-heavy data preparation, especially with large Parquet sources or multi-GPU setups.
    • Fix: exclude low-signal columns with allowed_columns / disallowed_columns. Lower window_shuffling_buffer_size in data_params.
  • Out-of-memory — too many columns or aggressive buffer settings.
    • Fix: lower window_shuffling_buffer_size in data_params. Exclude low-signal columns. Increase available memory for the container.

Training Loop

Log signature:

Train Epoch 0/2 Entities ━━━━━━━━━━╸━━━━━━━━━━  45%  4,500/10,000  0:03:12  0:03:55
Train Epoch 0/2 Entities ━━━━━━━━━━━━━━━━━━━━ 100% 10,000/10,000  0:07:08  0:00:00
Validation Entities      ━━━━━━━━━━━━━━━━━━━━ 100%  1,000/1,000   0:00:23  0:00:00
`Trainer.fit` stopped: `max_epochs=n` reached.
INFO - monad.run: Training foundation model finished.
INFO - monad.run: Pretraining finished.

The Rich entity progress bar shows percentage complete, entities processed, elapsed time, and estimated time remaining. It supplements the standard PyTorch Lightning training output.

What can go wrong:

  • GPU out-of-memory — model doesn't fit. Typical error: torch.OutOfMemoryError: CUDA out of memory.
    • Fix: check that the GPU isn't occupied by other processes (nvidia-smi or nvtop). If a previous run was interrupted with Ctrl+Z, the process is only suspended and still holds its GPU memory; run jobs to find it, then use kill %<job number> (for example kill %1) or kill <PID> to free the memory. Otherwise, enable multi-GPU training via devices and strategy (see Scaling & Memory), reduce model size with hidden_dim or emde_quality in memory_constraining_params, or remove redundant features.
  • Training is very slow — long pauses between batches or epochs.
    • Fix: enable multi-GPU with devices and strategy. Increase batch_size if memory allows. Increase num_workers in data_loader_params.
  • Database-side OOM during training — queries during the training loop exceed database memory.
    • Fix: increase num_workers to split queries into more parts. Use num_query_chunks in query_optimization. Enable cache_path in data_params to avoid re-querying.
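The GPU occupancy check mentioned above can be scripted before each run. This is a minimal sketch built on nvidia-smi's CSV query output; the helper names and the 10% threshold are arbitrary assumptions, and the parser expects plain integer fields, so it will fail on devices that report "[N/A]".

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse `index, used, total` rows (values in MiB) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"index": index, "used_mib": used, "total_mib": total})
    return gpus


def busy_gpus(gpus, threshold=0.10):
    """Return GPUs whose used memory exceeds `threshold` of their capacity."""
    return [gpu for gpu in gpus if gpu["used_mib"] / gpu["total_mib"] > threshold]


def read_gpu_memory():
    """Query nvidia-smi on this host; requires the NVIDIA driver tools."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Illustrative values, not real measurements.
sample_output = "0, 22950, 24576\n1, 12, 24576\n"
busy = busy_gpus(parse_gpu_memory(sample_output))
for gpu in busy:
    print(f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB already in use")
```

On the training host, calling busy_gpus(read_gpu_memory()) before launch shows whether a suspended or leftover process is still holding memory.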