Run Training
Foundation model training can be run in two ways: as a single pipeline or as two separate stages. Both support Python and CLI execution.
Joint Pipeline
The simplest option. The pretrain function validates your config, fits the behavioral representation, and trains the foundation model in one go.
Python
from monad.ui import pretrain
from pathlib import Path

pretrain(
    config_path=Path("path/to/config.yaml"),
    output_path=Path("path/to/store/pretrain/artifacts"),
)
CLI
python -m monad.run \
    --pretrain \
    --config-path "path/to/config.yaml" \
    --features-path "path/to/store/pretrain/artifacts"
Modular Pipeline
Split training into two stages when you need different environments for each — for example, a CPU-heavy machine for fitting and a GPU machine for training.
Stage 1: Fit behavioral representation
Analyzes your data and builds the feature representation.
Python:
from monad.ui import fit_behavioral_representation
from pathlib import Path

fit_behavioral_representation(
    config_path=Path("path/to/config.yaml"),
    output_path=Path("path/to/store/pretrain/artifacts"),
)
CLI:
python -m monad.run \
    --fit \
    --config-path "path/to/config.yaml" \
    --features-path "path/to/store/pretrain/artifacts"
Stage 2: Train foundation model
Trains the neural network using the representations from Stage 1. No config_path is needed — BaseModel reads the config stored during fitting.
Python:
from monad.ui import train_foundation_model
from pathlib import Path

train_foundation_model(
    output_path=Path("path/to/store/pretrain/artifacts"),
)
CLI:
Offline Prediction
Once a model is trained, you can score new data from the command line with the --predict stage — no custom Python code required. It loads a trained checkpoint together with a TestingParams YAML, then writes predictions to the configured save location.
CLI:
python -m monad.run \
    --predict \
    --checkpoint-path "path/to/checkpoint" \
    --testing-params-path "path/to/testing_params.yaml"
Optional flags: --storage-config-path for an external storage config, and --seed to fix the ordering of results.
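When the predict stage is launched from a scheduler or a notebook, the command can be assembled in Python. The sketch below uses only the flags named above. The helper name build_predict_command and the example paths are hypothetical, and passing the seed as an integer is an assumption.

```python
import sys
from pathlib import Path
from typing import List, Optional


def build_predict_command(
    checkpoint_path: Path,
    testing_params_path: Path,
    storage_config_path: Optional[Path] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Assemble the argv list for the --predict stage (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "monad.run",
        "--predict",
        "--checkpoint-path", str(checkpoint_path),
        "--testing-params-path", str(testing_params_path),
    ]
    # Optional flags are appended only when a value is supplied.
    if storage_config_path is not None:
        cmd += ["--storage-config-path", str(storage_config_path)]
    if seed is not None:
        cmd += ["--seed", str(seed)]  # assumption: the seed is an integer
    return cmd


cmd = build_predict_command(
    Path("path/to/checkpoint"),
    Path("path/to/testing_params.yaml"),
    seed=42,
)
print(" ".join(cmd))
```

To launch the command, pass the list to subprocess.run(cmd, check=True) so that a non-zero exit code raises an exception.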
Single GPU
--predict runs on a single GPU. Local output is written as TSV; set remote_save_location in the TestingParams YAML to write to Snowflake or Databricks instead.
See Inference for the full prediction workflow.
Quick Check (Dry Run)
Before committing to a full run, you can validate your configuration and sample a small amount of data with a quick check. This catches config errors, connection issues, and schema mismatches in seconds instead of hours.
Python
Pass quick_check=True to any of the training functions:
from monad.ui import pretrain
from pathlib import Path

pretrain(
    config_path=Path("path/to/config.yaml"),
    output_path=Path("path/to/store/pretrain/artifacts"),
    quick_check=True,
)
The same flag works with fit_behavioral_representation() and train_foundation_model().
CLI
Add --quick-check to any command:
python -m monad.run \
    --pretrain \
    --config-path "path/to/config.yaml" \
    --features-path "path/to/store/pretrain/artifacts" \
    --quick-check
What quick check does
When enabled, BaseModel automatically applies lightweight defaults:
- Fit stage — samples ~1 000 entities with a history limit of 50 and filters to ~5 % of entities via SQL
- Train stage — runs 1 epoch with 5 training batches and 5 validation batches
This validates the full pipeline end-to-end (config parsing, data source connectivity, column analysis, a short training loop) without running a full computation.
Key Flags
All three functions accept the following flags:
- resume: resume from the last checkpoint. Fails if no checkpoint exists.
- overwrite: discard any previous results at output_path before starting. Destructive: the directory contents are deleted.
- seed: set a seed for reproducibility. As of 1.7, this also makes modality sketch-dropping reproducible, so seeded runs are reproducible end to end.
Cannot resume and overwrite simultaneously
resume and overwrite cannot both be True — this raises an error.
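A wrapper script can reject the conflicting combination before any work starts. The sketch below is a hypothetical pre-flight check, not BaseModel's own validation. The helper name check_run_flags and the choice of ValueError are assumptions, since the exception BaseModel raises is not stated here.

```python
def check_run_flags(resume: bool = False, overwrite: bool = False) -> None:
    """Raise early if both mutually exclusive flags are set (hypothetical helper)."""
    if resume and overwrite:
        # Assumption: ValueError is used for illustration only.
        raise ValueError("resume and overwrite cannot both be True")


check_run_flags(resume=True)  # accepted: resume only
try:
    check_run_flags(resume=True, overwrite=True)
except ValueError as err:
    print(f"rejected: {err}")
```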
Function parameters override YAML config
Parameters passed to the function override those in the YAML config.
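The precedence rule can be pictured as a dictionary merge in which explicitly passed arguments win. This is an illustration only: the function effective_settings is hypothetical, and which keys can actually appear in the YAML config is an assumption.

```python
from typing import Any, Dict


def effective_settings(yaml_config: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Merge settings so that explicitly passed arguments win (illustrative only)."""
    merged = dict(yaml_config)
    # None is treated as "not passed", so the YAML value is kept.
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged


settings = effective_settings({"seed": 1, "overwrite": False}, seed=7, overwrite=None)
print(settings)
```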
Adding --overwrite or --resume in CLI
python -m monad.run \
    --pretrain \
    --config-path "path/to/config.yaml" \
    --features-path "path/to/store/pretrain/artifacts" \
    --overwrite
Resource Estimation
During the fit stage, BaseModel automatically estimates memory requirements for each feature computation task and logs a resource estimation report. The report shows:
- Available RAM and safety margin (default 20 %)
- Per-task memory estimates with component breakdowns
- Predicted peak memory usage
- Recommended vs. configured num_concurrent_features
If the configured concurrency exceeds the recommendation, BaseModel prompts for confirmation in interactive mode or logs a warning in non-interactive mode.
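To make the comparison between recommended and configured concurrency concrete, the sketch below applies the 20 % safety margin to the available RAM and counts how many of the heaviest tasks fit in the remaining budget. It is a worst-case illustration under stated assumptions, not BaseModel's actual estimator. The function name, the greedy rule, and the example numbers are all hypothetical.

```python
from typing import Sequence


def recommended_concurrency(
    available_ram_gb: float,
    task_estimates_gb: Sequence[float],
    safety_margin: float = 0.20,
) -> int:
    """Return how many of the largest tasks fit in the RAM budget at once.

    Illustrative worst-case rule, not BaseModel's actual estimator.
    """
    budget_gb = available_ram_gb * (1.0 - safety_margin)
    running_gb = 0.0
    count = 0
    # Assume the heaviest tasks happen to run at the same time.
    for estimate in sorted(task_estimates_gb, reverse=True):
        if running_gb + estimate > budget_gb:
            break
        running_gb += estimate
        count += 1
    return max(count, 1)  # always allow at least one task


configured = 4
recommended = recommended_concurrency(64.0, [20.0, 18.0, 12.0, 6.0])
if configured > recommended:
    print(f"configured={configured} exceeds recommended={recommended}")
```

With 64 GB available, the budget is 51.2 GB, so the three heaviest tasks (50 GB in total) fit and the fourth does not.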
To cap the RAM budget, set max_ram_gb in the query_optimization section of your YAML config.
DataLoader Calibration
When calibration_params.enabled is set to true in your YAML config, BaseModel automatically benchmarks DataLoader settings before foundation model training begins. It sweeps through candidate num_workers and prefetch_factor values, measures throughput, and applies the most efficient configuration.
This is especially useful when deploying to new hardware or when you are unsure which DataLoader settings work best. The calibration result is saved alongside model artifacts, so resumed runs skip re-calibration.
If calibration fails for any reason, training proceeds with the default data_loader_params from your config — no manual intervention required.
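The sweep-and-fallback behavior described above can be sketched as a small grid search. This is an assumption-laden illustration, not BaseModel's implementation: calibrate and fake_throughput are hypothetical names, and the stand-in benchmark replaces a real throughput measurement.

```python
import itertools
from typing import Callable, Dict, Iterable, Tuple

Setting = Tuple[int, int]  # (num_workers, prefetch_factor)


def calibrate(
    measure: Callable[[int, int], float],
    num_workers: Iterable[int],
    prefetch_factors: Iterable[int],
    default: Setting,
) -> Setting:
    """Return the setting with the highest measured throughput, or the default on failure."""
    try:
        results: Dict[Setting, float] = {
            (workers, prefetch): measure(workers, prefetch)
            for workers, prefetch in itertools.product(num_workers, prefetch_factors)
        }
        return max(results, key=results.get)
    except Exception:
        # Mirrors the documented fallback: keep the configured defaults.
        return default


def fake_throughput(workers: int, prefetch: int) -> float:
    """Stand-in benchmark whose throughput peaks at 4 workers and prefetch factor 2."""
    return 100.0 - 10.0 * abs(workers - 4) - 5.0 * abs(prefetch - 2)


best = calibrate(fake_throughput, [2, 4, 8], [2, 4], default=(2, 2))
print(best)
```

Catching every exception keeps the sketch close to the "training proceeds with the defaults" behavior. A real benchmark would time batches drawn from an actual DataLoader.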
For details on all calibration parameters and tuning guidance, see Scaling & Memory → Automatic DataLoader Calibration.
Verifying Training Completed
How to confirm training completed
Training is complete when:
- Console output confirms model checkpoints have been saved.
- A _FINISHED folder appears inside your output_path, containing the best model.
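The second condition is easy to check from a script, for example before starting a downstream job. A minimal sketch, in which the helper name training_finished is hypothetical and only the _FINISHED folder name comes from the text above:

```python
from pathlib import Path


def training_finished(output_path: Path) -> bool:
    """Return True if the _FINISHED folder exists inside output_path."""
    return (output_path / "_FINISHED").is_dir()


print(training_finished(Path("path/to/store/pretrain/artifacts")))
```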
Training progress
During training, BaseModel displays a Rich progress bar showing entity-level progress with percentage complete, elapsed time, and estimated time remaining.
The entity progress bar supplements the standard PyTorch Lightning training output. Set the ENABLE_PROGRESS_BAR environment variable to False to disable it.
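When training is started from Python, the variable can be set in the same script. Setting it before the training call, and ideally before importing monad, is an assumption about when the variable is read.

```python
import os

# Environment variables are strings, so the value is the text "False".
os.environ["ENABLE_PROGRESS_BAR"] = "False"

print(os.environ["ENABLE_PROGRESS_BAR"])
```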
Columns Analysis Report
During the fit stage, BaseModel logs a columns analysis report for each data source. The report lists:
- Column types: every column grouped by its inferred type (decimal, categorical, categoricalCompressed, time_series, text, image)
- Skipped columns: columns excluded from training, with reasons (too many NaNs, text-like, low cardinality, date, mixed-object) and hints on how to include them
- Action recommendations: suggestions such as potential time-series or text columns that may benefit from a column_type_overrides entry, and redundant column pairs to consider removing
Columns analysis report
====================================
Table: transactions
====================================
Column type Columns
------------------------------------
decimal price, quantity, total_amount
categorical channel, region
categoricalCompressed article_id, store_id
------------------------------------
Skipped columns
------------------------------------
Reason Columns
------------------------------------
Text column product_description
Hint: Text columns are skipped by default. To include it,
set its type via column_type_overrides.
------------------------------------
Action recommendations
------------------------------------
Redundant categorical columns product_code, product_id
Hint: These columns form a bijection (one-to-one mapping).
Keeping both adds no new information. Remove the redundant
column via disallowed_columns.
====================================
The report also logs the total fit duration: Fitting took X.XX seconds.
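The bijection hint in the example report means that each value in one column always appears with exactly one value in the other, and vice versa. The sketch below shows one way to test that property on plain Python lists. It illustrates the definition only and is not how BaseModel detects redundancy. The function name and the sample data are hypothetical.

```python
from typing import Hashable, Sequence


def is_bijection(left: Sequence[Hashable], right: Sequence[Hashable]) -> bool:
    """Return True if the values of two columns map one-to-one, row by row."""
    if len(left) != len(right):
        raise ValueError("columns must have the same number of rows")
    pairs = set(zip(left, right))
    # One-to-one: distinct pairs == distinct left values == distinct right values.
    return len(pairs) == len(set(left)) == len(set(right))


product_id = [1, 2, 3, 1]
product_code = ["A", "B", "C", "A"]
channel = ["web", "web", "app", "app"]
print(is_bijection(product_id, product_code))  # True
print(is_bijection(product_id, channel))       # False
```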
Review this report after every fit run:
- Verify column types: confirm that columns are classified as you expect. If a column landed in the wrong type, add a column_type_overrides entry.
- Check skipped columns: some skips are expected and harmless:
  - Date columns: raw dates are not embedded (this is by design). If the column is your event timestamp, configure it as date_column. If it should be a feature, transform it with an sql_lambda.
  - Text columns: skipped by default. Add a column_type_overrides: text entry if the column should contribute semantic features.
  - Low cardinality: columns with fewer than 2 unique values carry no signal and are safe to ignore.
  - Too many NaNs: over 90 % missing or empty values. Clean the source data or fill/impute if you need this column.
  - Mixed-object columns: inconsistent types within a column (e.g., strings mixed with numbers). Normalize the data or override the type explicitly.
  - Unexpected skips: if a column you need is skipped, check whether the data source itself is wrong (e.g., pointing at the wrong table or missing a filter).
- Act on recommendations: remove redundant column pairs via disallowed_columns to reduce computation without losing signal. See also Troubleshooting → Redundancy Report.
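Two of the skip reasons above are simple thresholds, so a column can be pre-checked before a fit run. The sketch below applies the stated limits (more than 90 % missing or empty values, fewer than 2 unique values) to a plain Python list. It is an approximation for local sanity checks, not BaseModel's column analysis. The function names and the treatment of empty strings as missing are assumptions.

```python
import math
from typing import Optional, Sequence


def _is_missing(value: object) -> bool:
    """Treat None, NaN, and empty strings as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def skip_reason(values: Sequence[object], max_missing: float = 0.90) -> Optional[str]:
    """Return a skip reason for a column, or None if it passes both checks."""
    if not values:
        return "low cardinality"
    missing = sum(_is_missing(v) for v in values)
    if missing / len(values) > max_missing:
        return "too many NaNs"
    present = {v for v in values if not _is_missing(v)}
    if len(present) < 2:
        return "low cardinality"
    return None


print(skip_reason([None] * 19 + ["x"]))
print(skip_reason(["web"] * 10))
print(skip_reason(["web", "app", None]))
```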
Suggested Config
After the fit stage completes, BaseModel generates a suggested_config.yaml file in the output directory. This file is a copy of your original config with the column report findings already applied:
| What is applied | How |
|---|---|
| Time-series candidates | An sql_lambda is added with resolve_fn() and a matching column_type_overrides: time_series entry |
| Bijection columns | Redundant columns from each bijection group are added to disallowed_columns (keeping one representative column per group — local columns are preferred over joined ones) |
Auto-generated entries are annotated with inline comments so you can tell them apart from your original config:
data_sources:
  - name: transactions
    disallowed_columns:
      - product_code  # auto: bijection with product_id
    column_type_overrides:
      price_ts: time_series  # auto: time series candidate
    sql_lambdas:
      - alias: price_ts  # auto: time series candidate
        expression: "{{ resolve_fn('price') }}"
How to use
Review the generated file, verify the suggestions make sense for your use case, then rename it to replace your original config. You can also cherry-pick individual suggestions and apply them manually.
Note
suggested_config.yaml is not generated in quick check mode — it requires a full fit run.
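One way to review the suggestions is to diff the generated file against your original config, so only the auto-generated entries stand out. A minimal sketch using the standard library. The helper name config_diff and the inline sample text are hypothetical; in practice the two strings would be read from your config file and from suggested_config.yaml.

```python
import difflib
from typing import List


def config_diff(original: str, suggested: str) -> List[str]:
    """Return unified-diff lines between the original and suggested config text."""
    return list(
        difflib.unified_diff(
            original.splitlines(),
            suggested.splitlines(),
            fromfile="config.yaml",
            tofile="suggested_config.yaml",
            lineterm="",
        )
    )


original = "data_sources:\n  - name: transactions\n"
suggested = (
    "data_sources:\n"
    "  - name: transactions\n"
    "    disallowed_columns:\n"
    "      - product_code  # auto: bijection with product_id\n"
)
print("\n".join(config_diff(original, suggested)))
```

Lines starting with "+" are the additions made by the fit stage. To compare real files, load each one with Path(...).read_text() before calling the helper.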
What Happens During Training
The pipeline runs in two phases regardless of whether you use the joint or modular approach:
- Phase 1: Fit Behavioral Representation
  - Validate config
  - Sample data
  - Analyze columns
  - Compute representations
- Phase 2: Train Foundation Model
  - Load representations
  - Initialize trainer
  - Calibrate DataLoader (if enabled)
  - Training loop
  - Save best model
The fitting phase uses distributed computation — it runs locally on your server and no data leaves the environment. Ensure the server is secured against unauthorized access.
For details on log signatures at each step and how to diagnose failures, see Troubleshooting.