Run Training

Foundation model training can be run in two ways: as a single pipeline or as two separate stages. Both support Python and CLI execution.

Joint Pipeline

The simplest option. The pretrain function validates your config, fits the behavioral representation, and trains the foundation model in one go.

Python

from monad.ui import pretrain
from pathlib import Path

pretrain(
    config_path=Path("path/to/config.yaml"),
    output_path=Path("path/to/store/pretrain/artifacts"),
)

CLI

python -m monad.run \
  --pretrain \
  --config-path "path/to/config.yaml" \
  --features-path "path/to/store/pretrain/artifacts"

Modular Pipeline

Split training into two stages when you need different environments for each — for example, a CPU-heavy machine for fitting and a GPU machine for training.

Stage 1: Fit behavioral representation

Analyzes your data and builds the feature representation.

Python:

from monad.ui import fit_behavioral_representation
from pathlib import Path

fit_behavioral_representation(
    config_path=Path("path/to/config.yaml"),
    output_path=Path("path/to/store/pretrain/artifacts"),
)

CLI:

python -m monad.run \
  --fit \
  --config-path "path/to/config.yaml" \
  --features-path "path/to/store/pretrain/artifacts"

Stage 2: Train foundation model

Trains the neural network using the representations from Stage 1. No config_path is needed — BaseModel reads the config stored during fitting.

Python:

from monad.ui import train_foundation_model
from pathlib import Path

train_foundation_model(
    output_path=Path("path/to/store/pretrain/artifacts"),
)

CLI:

python -m monad.run \
  --fm \
  --features-path "path/to/store/pretrain/artifacts"

Offline Prediction

Once a model is trained, you can score new data from the command line with the --predict stage — no custom Python code required. It loads a trained checkpoint together with a TestingParams YAML, then writes predictions to the configured save location.

CLI:

python -m monad.run \
  --predict \
  --checkpoint-path "path/to/checkpoint" \
  --testing-params-path "path/to/testing_params.yaml"

Optional flags: --storage-config-path for an external storage config, and --seed to fix the ordering of results.

Single GPU

--predict runs on a single GPU. Local output is written as TSV; set remote_save_location in the TestingParams YAML to write to Snowflake or Databricks instead.

See Inference for the full prediction workflow.

Quick Check (Dry Run)

Before committing to a full run, you can validate your configuration and sample a small amount of data with a quick check. This catches config errors, connection issues, and schema mismatches in seconds instead of hours.

Python

Pass quick_check=True to any of the training functions:

from monad.ui import pretrain
from pathlib import Path

pretrain(
    config_path=Path("path/to/config.yaml"),
    output_path=Path("path/to/store/pretrain/artifacts"),
    quick_check=True,
)

The same flag works with fit_behavioral_representation() and train_foundation_model().

CLI

Add --quick-check to any command:

python -m monad.run \
  --pretrain \
  --config-path "path/to/config.yaml" \
  --features-path "path/to/store/pretrain/artifacts" \
  --quick-check

What quick check does

When enabled, BaseModel automatically applies lightweight defaults:

  • Fit stage — samples ~1 000 entities with a history limit of 50 and filters to ~5 % of entities via SQL
  • Train stage — runs 1 epoch with 5 training batches and 5 validation batches

This validates the full pipeline end-to-end (config parsing, data source connectivity, column analysis, a short training loop) without running a full computation.

Key Flags

All three training functions (pretrain(), fit_behavioral_representation(), and train_foundation_model()) accept the following parameters:

  • resume — resume from the last checkpoint. Fails if no checkpoint exists.
  • overwrite — discard any previous results at output_path before starting. Destructive — the directory contents are deleted.
  • seed — set a seed for reproducibility. As of 1.7, this also makes modality sketch-dropping reproducible, so seeded runs are reproducible end to end.

Cannot resume and overwrite simultaneously

resume and overwrite cannot both be True — this raises an error.

Function parameters override YAML config

Parameters passed to the function override those in the YAML config.

Adding --overwrite or --resume in CLI

python -m monad.run \
  --pretrain \
  --config-path "path/to/config.yaml" \
  --features-path "path/to/store/pretrain/artifacts" \
  --overwrite
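
If you launch runs from a script, the CLI invocations on this page can be assembled programmatically. The sketch below is only an illustration: build_command and STAGE_FLAGS are hypothetical names that are not part of monad, and the helper uses only the flags shown on this page. It also rejects resume together with overwrite up front, mirroring the rule above.

```python
import sys
from pathlib import Path
from typing import List, Optional

# Stage flags as they appear in the CLI examples on this page.
STAGE_FLAGS = {"pretrain": "--pretrain", "fit": "--fit", "train": "--fm"}


def build_command(
    stage: str,
    features_path: Path,
    config_path: Optional[Path] = None,
    quick_check: bool = False,
    overwrite: bool = False,
    resume: bool = False,
) -> List[str]:
    """Assemble an argv list for `python -m monad.run` (hypothetical helper)."""
    if resume and overwrite:
        raise ValueError("resume and overwrite cannot both be set")
    # Only the train stage (--fm) is documented as working without a config path.
    if stage != "train" and config_path is None:
        raise ValueError(f"stage {stage!r} needs a config path")

    cmd = [sys.executable, "-m", "monad.run", STAGE_FLAGS[stage]]
    if config_path is not None:
        cmd += ["--config-path", str(config_path)]
    cmd += ["--features-path", str(features_path)]
    if quick_check:
        cmd.append("--quick-check")
    if overwrite:
        cmd.append("--overwrite")
    if resume:
        cmd.append("--resume")
    return cmd


cmd = build_command(
    "pretrain",
    features_path=Path("path/to/store/pretrain/artifacts"),
    config_path=Path("path/to/config.yaml"),
    overwrite=True,
)
# Launch with: subprocess.run(cmd, check=True)
```

Returning a list rather than a shell string avoids quoting problems when paths contain spaces.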

Resource Estimation

During the fit stage, BaseModel automatically estimates memory requirements for each feature computation task and logs a resource estimation report. The report shows:

  • Available RAM and safety margin (default 20 %)
  • Per-task memory estimates with component breakdowns
  • Predicted peak memory usage
  • Recommended vs. configured num_concurrent_features

If the configured concurrency exceeds the recommendation, BaseModel prompts for confirmation in interactive mode or logs a warning in non-interactive mode.

To cap the RAM budget, set max_ram_gb in the query_optimization section of your YAML config:

query_optimization:
  max_ram_gb: 32
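
To get a feel for how the report's inputs relate to a concurrency recommendation, here is a rough back-of-the-envelope sketch. It is not BaseModel's estimator: recommend_concurrency is a hypothetical function, and both the worst-case packing rule and the way max_ram_gb caps the budget before the safety margin is applied are assumptions made for illustration.

```python
from typing import List, Optional


def recommend_concurrency(
    available_ram_gb: float,
    task_estimates_gb: List[float],
    safety_margin: float = 0.20,
    max_ram_gb: Optional[float] = None,
) -> int:
    """Largest number of tasks whose worst-case combined memory fits the budget."""
    # Assumption: an explicit cap replaces available RAM when it is lower.
    ram = available_ram_gb if max_ram_gb is None else min(available_ram_gb, max_ram_gb)
    budget = ram * (1.0 - safety_margin)

    # Worst case: the most memory-hungry tasks all run at the same time.
    used, count = 0.0, 0
    for estimate in sorted(task_estimates_gb, reverse=True):
        if used + estimate > budget:
            break
        used += estimate
        count += 1
    return max(count, 1)  # always allow at least one task


estimates = [20.0, 15.0, 10.0, 5.0]
print(recommend_concurrency(64.0, estimates))                   # uncapped budget
print(recommend_concurrency(64.0, estimates, max_ram_gb=32.0))  # capped budget
```

Comparing the returned value with your configured num_concurrent_features gives a quick sanity check before a long fit run.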

DataLoader Calibration

When calibration_params.enabled is set to true in your YAML config, BaseModel automatically benchmarks DataLoader settings before foundation model training begins. It sweeps through candidate num_workers and prefetch_factor values, measures throughput, and applies the most efficient configuration.

This is especially useful when deploying to new hardware or when you are unsure which DataLoader settings work best. The calibration result is saved alongside model artifacts, so resumed runs skip re-calibration.

If calibration fails for any reason, training proceeds with the default data_loader_params from your config — no manual intervention required.
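
Conceptually, the sweep and its fallback can be pictured as a small grid search. The sketch below is a simplified illustration and not BaseModel's calibration code: calibrate, the candidate grids, and the measure callback are all hypothetical, and "most efficient" is approximated here as "highest measured throughput".

```python
from itertools import product
from typing import Callable, Dict, Iterable


def calibrate(
    measure: Callable[[int, int], float],
    defaults: Dict[str, int],
    num_workers_grid: Iterable[int] = (1, 2, 4, 8),
    prefetch_grid: Iterable[int] = (2, 4),
) -> Dict[str, int]:
    """Return the best-measured setting, or the defaults if anything fails."""
    try:
        best, best_rate = None, float("-inf")
        for workers, prefetch in product(num_workers_grid, prefetch_grid):
            rate = measure(workers, prefetch)  # e.g. batches per second
            if rate > best_rate:
                best = {"num_workers": workers, "prefetch_factor": prefetch}
                best_rate = rate
        return best if best is not None else dict(defaults)
    except Exception:
        # Mirror the documented fallback: keep the configured defaults.
        return dict(defaults)


def fake_measure(workers: int, prefetch: int) -> float:
    """Stand-in benchmark; a real one would time a few DataLoader batches."""
    return workers * 10.0 - abs(prefetch - 4)


chosen = calibrate(fake_measure, {"num_workers": 1, "prefetch_factor": 2})
print(chosen)
```

In practice the measure callback would build a DataLoader with the candidate settings and time a fixed number of batches.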

For details on all calibration parameters and tuning guidance, see Scaling & Memory → Automatic DataLoader Calibration.

Verifying Training Completed

How to confirm training completed

Training is complete when:

  1. Console output confirms model checkpoints have been saved:

    INFO - monad.run: Training foundation model finished.
    INFO - monad.run: Pretraining finished.
    
  2. A _FINISHED folder appears inside your output_path, containing the best model.
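
For automated pipelines, the second check can be scripted. This is a minimal sketch that relies only on the _FINISHED folder described above; training_finished is a hypothetical helper name, not a monad function.

```python
from pathlib import Path


def training_finished(output_path: Path) -> bool:
    """True when the _FINISHED folder exists inside output_path."""
    return (Path(output_path) / "_FINISHED").is_dir()


if training_finished(Path("path/to/store/pretrain/artifacts")):
    print("Training completed.")
else:
    print("No _FINISHED folder found yet.")
```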

Training progress

During training, BaseModel displays a Rich progress bar showing entity-level progress with percentage complete, elapsed time, and estimated time remaining:

Train Epoch 0/2 Entities ━━━━━━━━━━╸━━━━━━━━━━  45%  75,525/167,834  0:12:47  0:15:38

The entity progress bar supplements the standard PyTorch Lightning training output. Set the ENABLE_PROGRESS_BAR environment variable to False to disable it.
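
When launching from Python, the variable can be set in the process environment before training is called. This sketch assumes the variable is read when the run starts, which the page does not state explicitly.

```python
import os

# Set before calling pretrain() or train_foundation_model().
# Assumption: the value is read at the start of the run, not at import time.
os.environ["ENABLE_PROGRESS_BAR"] = "False"
```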

Columns Analysis Report

During the fit stage, BaseModel logs a columns analysis report for each data source. The report lists:

  • Column types — every column grouped by its inferred type (decimal, categorical, categoricalCompressed, time_series, text, image)
  • Skipped columns — columns excluded from training, with reasons (too many NaNs, text-like, low cardinality, date, mixed-object) and hints on how to include them
  • Action recommendations — suggestions such as potential time-series or text columns that may benefit from a column_type_overrides entry, and redundant column pairs to consider removing

An example report:

Columns analysis report
====================================

Table: transactions
====================================
Column type                        Columns
------------------------------------
decimal                            price, quantity, total_amount
categorical                        channel, region
categoricalCompressed              article_id, store_id
------------------------------------

Skipped columns
------------------------------------
Reason                             Columns
------------------------------------
Text column                        product_description
  Hint: Text columns are skipped by default. To include it,
  set its type via column_type_overrides.
------------------------------------

Action recommendations
------------------------------------
Redundant categorical columns      product_code, product_id
  Hint: These columns form a bijection (one-to-one mapping).
  Keeping both adds no new information. Remove the redundant
  column via disallowed_columns.

====================================

The report also logs the total fit duration: Fitting took X.XX seconds.

Review this report after every fit run:

  • Verify column types — confirm that columns are classified as you expect. If a column landed in the wrong type, add a column_type_overrides entry.
  • Check skipped columns — some skips are expected and harmless:
    • Date columns — raw dates are not embedded (this is by design). If the column is your event timestamp, configure it as date_column. If it should be a feature, transform it with an sql_lambda.
    • Text columns — skipped by default. Add a column_type_overrides: text entry if the column should contribute semantic features.
    • Low cardinality — columns with fewer than 2 unique values carry no signal and are safe to ignore.
    • Too many NaNs — over 90 % missing or empty values. Clean the source data or fill/impute if you need this column.
    • Mixed-object columns — inconsistent types within a column (e.g., strings mixed with numbers). Normalize the data or override the type explicitly.
    • Unexpected skips — if a column you need is skipped, check whether the data source itself is wrong (e.g., pointing at the wrong table or missing a filter).
  • Act on recommendations — remove redundant column pairs via disallowed_columns to reduce computation without losing signal. See also Troubleshooting → Redundancy Report.
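
To pre-screen a column before a fit run, the two numeric thresholds mentioned above (over 90 % missing or empty, fewer than 2 unique values) can be reproduced in a few lines. This is an illustrative sketch, not BaseModel's column analyzer: skip_reason is a hypothetical helper, and the definition of "missing" and the order of the checks are assumptions.

```python
import math
from typing import Optional, Sequence


def _is_missing(value: object) -> bool:
    """Treat None, NaN, and blank strings as missing (an assumption)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def skip_reason(values: Sequence[object]) -> Optional[str]:
    """Return a likely skip reason for a column, or None if neither check fires."""
    missing = sum(_is_missing(v) for v in values)
    if values and missing / len(values) > 0.90:
        return "too many NaNs"
    present = {v for v in values if not _is_missing(v)}
    if len(present) < 2:
        return "low cardinality"
    return None


print(skip_reason([None] * 19 + ["a"]))   # 95 % missing
print(skip_reason(["x", "x", "x"]))       # a single unique value
print(skip_reason(["a", "b", None]))      # passes both checks
```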

Suggested Config

After the fit stage completes, BaseModel generates a suggested_config.yaml file in the output directory. This file is a copy of your original config with the column report findings already applied:

What is applied        | How
Time-series candidates | An sql_lambda is added with resolve_fn() and a matching column_type_overrides: time_series entry
Bijection columns      | Redundant columns from each bijection group are added to disallowed_columns (keeping one representative column per group; local columns are preferred over joined ones)
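
To make the bijection criterion concrete: two columns form a bijection when every value in one maps to exactly one value in the other, and vice versa. The check below is a minimal sketch of that definition on in-memory values; is_bijection is a hypothetical helper and says nothing about how BaseModel detects bijections internally.

```python
from typing import Hashable, Sequence


def is_bijection(left: Sequence[Hashable], right: Sequence[Hashable]) -> bool:
    """True when left and right values pair up one-to-one across all rows."""
    if len(left) != len(right):
        raise ValueError("columns must have the same number of rows")
    pairs = set(zip(left, right))
    # One-to-one: as many distinct pairs as distinct values on each side.
    return len(pairs) == len(set(left)) == len(set(right))


product_code = ["A1", "B2", "A1", "C3"]
product_id = [101, 102, 101, 103]
print(is_bijection(product_code, product_id))
```

If the check holds on a representative sample, keeping both columns adds no information, which is why one of them ends up in disallowed_columns.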

Auto-generated entries are annotated with inline comments so you can tell them apart from your original config:

data_sources:
- name: transactions
  disallowed_columns:
  - product_code          # auto: bijection with product_id
  column_type_overrides:
    price_ts: time_series  # auto: time series candidate
  sql_lambdas:
  - alias: price_ts        # auto: time series candidate
    expression: "{{ resolve_fn('price') }}"

How to use

Review the generated file, verify the suggestions make sense for your use case, then rename it to replace your original config. You can also cherry-pick individual suggestions and apply them manually.
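
One way to review the suggestions is to diff the generated file against your original config. The sketch below uses only the Python standard library; config_diff and auto_additions are hypothetical helpers, and the example paths assume suggested_config.yaml sits directly inside the output directory.

```python
import difflib
from pathlib import Path
from typing import List


def config_diff(original: Path, suggested: Path) -> List[str]:
    """Return unified-diff lines between the original and the suggested config."""
    old = original.read_text(encoding="utf-8").splitlines(keepends=True)
    new = suggested.read_text(encoding="utf-8").splitlines(keepends=True)
    return list(
        difflib.unified_diff(old, new, fromfile=str(original), tofile=str(suggested))
    )


def auto_additions(diff_lines: List[str]) -> List[str]:
    """Keep only added lines that carry the '# auto:' annotation."""
    return [
        line[1:].rstrip("\n")
        for line in diff_lines
        if line.startswith("+") and not line.startswith("+++") and "# auto:" in line
    ]


# Example usage (paths are placeholders):
# diff = config_diff(
#     Path("path/to/config.yaml"),
#     Path("path/to/store/pretrain/artifacts/suggested_config.yaml"),
# )
# print("".join(diff))
```

Filtering on the "# auto:" comments isolates the generated entries, which makes cherry-picking individual suggestions easier.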

Note

suggested_config.yaml is not generated in quick check mode — it requires a full fit run.

What Happens During Training

The pipeline runs in two phases regardless of whether you use the joint or modular approach:

  • Phase 1: Fit Behavioral Representation
    • Validate config
    • Sample data
    • Analyze columns
    • Compute representations
  • Phase 2: Train Foundation Model
    • Load representations
    • Initialize trainer
    • Calibrate DataLoader (if enabled)
    • Training loop
    • Save best model

The fitting phase uses distributed computation — it runs locally on your server and no data leaves the environment. Ensure the server is secured against unauthorized access.

For details on log signatures at each step and how to diagnose failures, see Troubleshooting.