Apply to Your Data

You've done the Tutorial. Now point BaseModel at your own data using the onboarding starter package.

Before you start

1. Get the starter package

Recommended folder organization

Before unpacking, set up a working structure on your mounted volume:

  • /basemodel/onboarding_package/ — unpack the starter package here (read-only reference)
  • /basemodel/configs/ — copy and edit configs and scripts here (your working copies)
  • /basemodel/project/ — output directory for FM checkpoints, scenario models, and predictions (set via config)

This keeps the original templates, your edits, and model artefacts cleanly separated.
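
The layout above can also be created from Python. This is a minimal sketch, not part of the starter package: the helper name `create_layout` is hypothetical, and the base directory is a parameter so the sketch can be tried outside `/basemodel`.

```python
from pathlib import Path
import tempfile

# Sub-folders named in the list above.
WORKING_DIRS = ("onboarding_package", "configs", "project")


def create_layout(base: Path) -> list[Path]:
    """Create the three working folders under `base` and return their paths."""
    created = []
    for name in WORKING_DIRS:
        folder = base / name
        folder.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(folder)
    return created


# Demo in a throwaway directory; pass Path("/basemodel") on the real volume.
with tempfile.TemporaryDirectory() as tmp:
    for folder in create_layout(Path(tmp)):
        print(folder.name)
```

Because `exist_ok=True` is set, re-running the helper leaves existing folders and their contents untouched.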

Download and unpack the onboarding package:

bash
mkdir -p /basemodel/onboarding_package
curl -sL https://docs.basemodel.ai/downloads/basemodel-onboarding-latest.tar.gz \
  | tar xz -C /basemodel/onboarding_package/

The onboarding_package contains the following files:

onboarding/
├── 0_foundation/
│   ├── 00_quickstart_fm_config_parquet.yaml      # (1)
│   ├── 00_quickstart_fm_config_snowflake.yaml
│   ├── 00_quickstart_fm_config_bigquery.yaml
│   ├── 00_quickstart_fm_config_*.yaml
│   ├── 00_extended_fm_config_parquet.yaml         # (2)
│   ├── 00_extended_fm_config_*.yaml
│   └── 01_pretrain.py                             # (3)
├── 1_scenarios/
│   ├── binary_classifier/                         # (4)
│   ├── multiclass_classifier/
│   ├── multilabel_classifier/
│   ├── recommendation/
│   └── regression/
└── readme.txt

  1. Minimal FM config, one per supported data engine
  2. Full-featured FM config with joins, enrichments, and advanced options
  3. FM training script
  4. Each folder contains a complete scenario pipeline: train, test, and predict scripts
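
After unpacking, it can be worth confirming that the files listed above are actually present. The following is a rough sketch: `missing_entries` is a hypothetical helper, and the list of expected paths is copied from the listing rather than from any official manifest.

```python
from pathlib import Path

# Relative paths taken from the package listing above (not an official manifest).
EXPECTED = [
    "0_foundation/00_quickstart_fm_config_parquet.yaml",
    "0_foundation/01_pretrain.py",
    "1_scenarios/binary_classifier",
    "1_scenarios/multiclass_classifier",
    "1_scenarios/multilabel_classifier",
    "1_scenarios/recommendation",
    "1_scenarios/regression",
]


def missing_entries(package_root: Path) -> list[str]:
    """Return the expected relative paths that do not exist under package_root."""
    return [rel for rel in EXPECTED if not (package_root / rel).exists()]
```

Calling `missing_entries(Path("/basemodel/onboarding_package/onboarding"))` should return an empty list if the archive unpacked as described.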

2. Configure the foundation model

Copy the config for your data engine into your working directory. For example, for the minimal Parquet setup:

bash
mkdir -p /basemodel/configs
cp /basemodel/onboarding_package/onboarding/0_foundation/00_quickstart_fm_config_parquet.yaml \
   /basemodel/configs/fm_config.yaml

This example uses 00_quickstart_fm_config_parquet.yaml, the minimal config for Parquet files. For another data engine, copy the matching 00_quickstart_fm_config_* file instead.

Quickstart vs extended configs

  • Quickstart (00_quickstart_*) has the minimum fields to get a foundation model running
  • Extended (00_extended_*) adds attribute tables, joins, enrichments, and data loading controls
  • Start with quickstart; move to extended once your first model trains successfully
  • See Foundation Model for full documentation of all options

Open fm_config.yaml and follow the inline comments.

View the full yaml config for Parquet
yaml
# Generic Foundation Model Config — parquet
#
# How to use:
# - select the file for your data engine (this one: parquet)
# - start by configuring at least one event data source (mandatory)
# - optionally add main entity attributes (one row per entity) and attribute tables (dimensions) for joins
# - add necessary configurations for date ranges, training / validation split etc.
#
# Helpful docs (only use if stuck):
# - Parquet data sources: https://docs.basemodel.ai/docs/parquet-data-sources
# - Data sources overview: https://docs.basemodel.ai/docs/connection-to-the-data-sources

# ------------------------------------------------------------------------------
# 1) data_sources block
# - events are mandatory
# - main entity attributes and attribute tables are optional
# ------------------------------------------------------------------------------

data_sources:

  # events (minimum 1 table mandatory)
  - type: event
    # name your table for BaseModel reference
    name: transactions
    data_location:
      database_type: parquet
      connection_params:
        # parquet file path
        path: "/path/to/data_dir/transactions_train.parquet"
        # database cache; keep it local to your project/workdir
        cache_path: "/basemodel/db_cache/"
      # database table reference
      table_name: transactions
    # entity for which we model and predict; multiple time-stamped event rows per entity id are expected
    main_entity_column: customer_id
    # event time column (required for event sources)
    date_column:
      name: t_dat
      format: "%Y-%m-%d"
    # optional: drop columns early via allowed / disallowed columns (helps memory, may prevent leakage)
    disallowed_columns: ["order_id"] # unique ID per event, no signal
    # allowed_columns: ["customer_id", "article_id", "t_dat", "sales_channel_id", "price"]
    # check: attribute tables require explicit joins on each relevant event table 
    # docs: https://docs.basemodel.ai/docs/joins
    joined_data_sources:
      - name: articles # name of the attributes data source
        join_on:
          - [article_id, article_id] # [event_column, attribute_column]

  # main entity attributes (optional)
  # expected: one row per entity id (e.g., customer profile)
  - type: main_entity_attribute
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/customers.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: customers
    main_entity_column: customer_id

  # attributes (optional)
  # expected: dimension tables used in joins into events
  - type: attribute
    name: articles
    allowed_columns: ["product_type_name", "product_group_name", "department_name", "section_name", "colour_group_name", "perceived_colour_master_name"]
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/articles.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: articles

# ------------------------------------------------------------------------------
# 2) data_params block
# ------------------------------------------------------------------------------

data_params:
  # earliest timestamp used for events considered in training/validation
  data_start_date: "2018-09-20 00:00:00"

  # how we split into training and validation; here we use 10% entity hold-out
  split:
    type: entity
    training: 90
    validation: 10

    # optional: hold-out test window (later than training/validation)
    # docs: https://docs.basemodel.ai/docs/controlling-data
    test:
      start_date: "2020-09-05 00:00:00"
      end_date: "2020-09-22 00:00:00"

    training_validation_end: "2020-09-04 00:00:00"


# ------------------------------------------------------------------------------
# 3) training_params block (here only to be used for a smoke test)
# ------------------------------------------------------------------------------

training_params:
  # limit batches to validate the setup quickly, then remove for a real run
  limit_train_batches: 5
  limit_val_batches: 5


# ------------------------------------------------------------------------------
# Optional blocks you may add later (kept out of quickstart on purpose)
#
# - data_loader_params (batching / workers):
#   docs: https://docs.basemodel.ai/docs/data-loading-configuration
#
# - memory_constraining_params (memory / model size constraints):
# - query_optimization (query parallelization)
#   docs: https://docs.basemodel.ai/docs/controlling-space-and-memory
# ------------------------------------------------------------------------------

# ------------------------------------------------------------------------------
# Other configuration options:
# - Model training configuration (metaparams, multi-GPU, precision, early stopping, checkpointing etc.): 
#   docs: https://docs.basemodel.ai/docs/model-training-configuration
# - Data configurations (split point rules, sampling, weighting etc.):
#   docs: https://docs.basemodel.ai/docs/data-configurations
# - Data transformations (filtering, grouping, lambdas, data type overrides): 
#   docs: https://docs.basemodel.ai/docs/data-transformations-advanced
# ------------------------------------------------------------------------------
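
The dates in the data_params block are easy to get out of order when adapting them. The sketch below checks one plausible ordering (data start, then end of training/validation, then test start, then test end). It assumes the config has already been parsed into a plain dict and that all four keys are present; `check_split_order` is a hypothetical helper, and whether BaseModel itself enforces this ordering is not confirmed here.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # timestamp format used in data_params above


def check_split_order(data_params: dict) -> list[str]:
    """Return a list of ordering problems; an empty list means the dates are in order."""
    split = data_params["split"]
    points = [
        ("data_start_date", data_params["data_start_date"]),
        ("split.training_validation_end", split["training_validation_end"]),
        ("split.test.start_date", split["test"]["start_date"]),
        ("split.test.end_date", split["test"]["end_date"]),
    ]
    parsed = [(name, datetime.strptime(value, FMT)) for name, value in points]
    problems = []
    # Compare each boundary with the one that should follow it.
    for (a_name, a), (b_name, b) in zip(parsed, parsed[1:]):
        if not a < b:
            problems.append(f"{a_name} ({a}) should be earlier than {b_name} ({b})")
    return problems


# Values copied from the Parquet quickstart config above.
example = {
    "data_start_date": "2018-09-20 00:00:00",
    "split": {
        "type": "entity",
        "training": 90,
        "validation": 10,
        "test": {
            "start_date": "2020-09-05 00:00:00",
            "end_date": "2020-09-22 00:00:00",
        },
        "training_validation_end": "2020-09-04 00:00:00",
    },
}
print(check_split_order(example))  # prints []
```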

At minimum you must set:

| What to change | Where in YAML |
| --- | --- |
| Entity ID column | main_entity_column |
| File paths or connection credentials | connection_params |
| Table names | table_name, name |
| Date column and format | date_column |
| Temporal split boundaries | data_params.split |
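
A quick pre-flight check of these fields can save a failed training launch. The sketch below works on a config that has already been parsed into a dict (for instance with PyYAML's `yaml.safe_load`, which is a third-party dependency and not used in the code itself). `check_minimum_fields` is a hypothetical helper, and the rules it applies are read off the table above rather than from BaseModel's own validation.

```python
def check_minimum_fields(config: dict) -> list[str]:
    """Return human-readable problems for the minimum settings listed above."""
    problems = []
    sources = config.get("data_sources") or []
    events = [s for s in sources if s.get("type") == "event"]
    if not events:
        problems.append("at least one data source with type: event is required")
    for src in events:
        label = src.get("name", "<unnamed>")
        if not src.get("main_entity_column"):
            problems.append(f"{label}: main_entity_column is missing")
        if not (src.get("date_column") or {}).get("name"):
            problems.append(f"{label}: date_column.name is missing")
        location = src.get("data_location") or {}
        if not location.get("connection_params"):
            problems.append(f"{label}: data_location.connection_params is missing")
        if not location.get("table_name"):
            problems.append(f"{label}: data_location.table_name is missing")
    if "split" not in (config.get("data_params") or {}):
        problems.append("data_params.split is missing")
    return problems


# A dict mirroring the event source of the Parquet quickstart config.
example_config = {
    "data_sources": [
        {
            "type": "event",
            "name": "transactions",
            "main_entity_column": "customer_id",
            "date_column": {"name": "t_dat", "format": "%Y-%m-%d"},
            "data_location": {
                "database_type": "parquet",
                "connection_params": {"path": "/path/to/data_dir/transactions_train.parquet"},
                "table_name": "transactions",
            },
        }
    ],
    "data_params": {"split": {"type": "entity", "training": 90, "validation": 10}},
}
print(check_minimum_fields(example_config))  # prints []
```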

3. Train the foundation model

Copy the training script and update the paths marked # EDIT:

bash
cp /basemodel/onboarding_package/onboarding/0_foundation/01_pretrain.py \
   /basemodel/configs/01_pretrain.py
View the full pretrain script
Python
from pathlib import Path

from monad.ui import pretrain


# --- Names & Paths -----------------------------------------------------------

# EDIT: provide path to your foundation model config (yaml)
# it defines data sources, splits, training params, and resource constraints
config_path = Path("/basemodel/configs/<foundation_model_config.yaml>").resolve()

# EDIT: define where foundation model outputs will be saved (this becomes your project_dir)
# the script will create /fm, /features, /lightning_checkpoints etc. under this directory
project_dir = Path("/basemodel/projects/project_dir").resolve()


# --- Pretrain ----------------------------------------------------------------

# CHECK: overwrite=True starts training from scratch and replaces existing outputs in project_dir
# replace it with resume=True to resume training if partial outputs already exist
pretrain(
    config_path=config_path,
    output_path=project_dir,
    overwrite=True,
)

# For context refer to docs:
# FM stage overview - https://docs.basemodel.ai/docs/pretrain-stage-overview
# FM training - https://docs.basemodel.ai/docs/training-foundation-model
# End-to-end tutorial - https://docs.basemodel.ai/docs/e2e-tutorial-kaggle-hm
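
| Variable in 01_pretrain.py | What to set |
| --- | --- |
| config_path | Path to your edited config, e.g. /basemodel/configs/fm_config.yaml |
| project_dir (passed to pretrain as output_path) | Project directory for FM outputs, e.g. /basemodel/project; the script creates fm/, features/, lightning_checkpoints/ etc. under it |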

Run:

bash
python /basemodel/configs/01_pretrain.py

Foundation model training takes time — proceed to step 4 while it runs.

You can also train via the CLI directly. See Training Execution for details.
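
Because training runs for a while, it can be convenient to start the script without blocking the current shell and to keep its output in a log file. The following is a generic sketch using only the Python standard library; `launch_in_background` is a hypothetical helper and is not part of BaseModel.

```python
import subprocess
import sys
from pathlib import Path


def launch_in_background(script: Path, log_file: Path) -> subprocess.Popen:
    """Start `python script` without waiting for it; stdout and stderr go to log_file."""
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("w") as log:
        # The child process receives its own copy of the file handle,
        # so closing ours when the `with` block ends is safe.
        return subprocess.Popen(
            [sys.executable, str(script)],
            stdout=log,
            stderr=subprocess.STDOUT,
        )
```

A call such as `launch_in_background(Path("/basemodel/configs/01_pretrain.py"), Path("/basemodel/project/logs/pretrain.log"))` returns immediately; the log path here is only an illustration. For sessions that may disconnect, a terminal multiplexer or `nohup` is the more usual choice.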

4. Pick and configure a scenario

While the foundation model trains, prepare your scenario scripts. Browse 1_scenarios/ and copy the folder that matches your use case into /basemodel/configs/:

| Folder | Use case |
| --- | --- |
| binary_classifier/ | Churn, propensity, yes/no predictions |
| multiclass_classifier/ | Category prediction (one of N) |
| multilabel_classifier/ | Multiple simultaneous labels |
| regression/ | Numeric predictions (LTV, spend, …) |
| recommendation/ | Item ranking and suggestions |

For example, to use the binary classifier:

bash
cp -r /basemodel/onboarding_package/onboarding/1_scenarios/binary_classifier/ \
      /basemodel/configs/binary_classifier/
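
The same copy can be done from Python with a guard against overwriting a working copy you have already edited. This is a sketch under the folder names shown above; `copy_scenario` is a hypothetical helper.

```python
import shutil
from pathlib import Path

# Scenario folders listed in the table above.
SCENARIOS = {
    "binary_classifier",
    "multiclass_classifier",
    "multilabel_classifier",
    "regression",
    "recommendation",
}


def copy_scenario(package_root: Path, scenario: str, configs_dir: Path) -> Path:
    """Copy one scenario folder into configs_dir and return the new path."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario {scenario!r}; expected one of {sorted(SCENARIOS)}")
    target = configs_dir / scenario
    # copytree raises FileExistsError if the target already exists,
    # which protects scripts that were edited earlier.
    shutil.copytree(package_root / "1_scenarios" / scenario, target)
    return target
```

For the example above, the call would be `copy_scenario(Path("/basemodel/onboarding_package/onboarding"), "binary_classifier", Path("/basemodel/configs"))`.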

5. Train, test, and run inference

Each scenario folder contains three scripts. The edits depend on the scenario type; the binary classifier serves as the example here.

Train — 01_train.py

View the full training script for binary classifier
Python
from datetime import timedelta
from pathlib import Path
from typing import Dict

import numpy as np

from monad.batch import SPLIT_TIMESTAMP
from monad.ui.config import TrainingParams
from monad.ui.module import BinaryClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events, has_incomplete_training_window


# --- Names & Paths -----------------------------------------------------------

# EDIT: provide path to project directory, PARENT to /fm, /features, /lightning_checkpoints etc.
project_dir = Path("/basemodel/projects/project_dir").resolve()
# EDIT: define name for scenario checkpoints directory; the script will put it under the same parent directory as fm
scenario_name = "scenario_name"

# creating the relative paths
foundation_model_path = project_dir / "fm"
scenario_model_path = project_dir / "scenarios" / scenario_name


# --- Target Function ---------------------------------------------------------

# churn definition, for reference:
# 1: churned (no events in the future window)
# 0: not churned (at least one event in the future window)

# EDIT: provide target details
TARGET_EVENT_TABLE = "transactions" # event data source the target is based on (here: product purchases)
TARGET_WINDOW_DAYS = 21 # how far into the future to look, in days (other time units are possible)


def target_fn(history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray | None:

    # filters out users with too short remaining window
    if has_incomplete_training_window(_ctx, timedelta(days=TARGET_WINDOW_DAYS)):
        return None

    # filters out users with no history
    if history[TARGET_EVENT_TABLE].count() == 0:
        return None

    # trims the future to desired window
    future_window = future.interval_from(
        _ctx[SPLIT_TIMESTAMP],
        timedelta(days=TARGET_WINDOW_DAYS),
    )

    # churn label
    y = 0 if future_window[TARGET_EVENT_TABLE].count() > 0 else 1
    return np.array([y], dtype=np.float32)


# --- Training ----------------------------------------------------------------

# EDIT: metaparams - keep default unless experimenting
learning_rate = 0.0001
epochs = 3 # use 1 for smoke test

# EDIT: limited runs - use for smoke test, then comment out here and below
limit_train_batches = 5 
limit_val_batches = 5

# EDIT: parallelised training - comment out to default to a single GPU
strategy = "ddp"
devices = [0, 1] # list GPU indices

# For more options refer to docs: https://docs.basemodel.ai/reference/trainingparams

task = BinaryClassificationTask()

training_params = TrainingParams(
    checkpoint_dir=scenario_model_path,
    learning_rate=learning_rate,
    epochs=epochs,
    devices=devices,
    strategy=strategy,
    limit_train_batches=limit_train_batches, # smoke test, comment out for full runs
    limit_val_batches=limit_val_batches, # smoke test, comment out for full runs
)

trainer = load_from_foundation_model(
    checkpoint_path=foundation_model_path,
    downstream_task=task,
    target_fn=target_fn,
)

trainer.fit(training_params=training_params, overwrite=True) # replace the last param with resume=True for resumed training

| Variable | What to set |
| --- | --- |
| project_dir | Path to your project directory, e.g. /basemodel/project |
| scenario_name | Name for this scenario's checkpoints |
| TARGET_EVENT_TABLE | Name of your event table (must match FM config) |
| TARGET_WINDOW_DAYS | How far into the future to predict |
| devices | GPU indices to use, e.g. [0] or [0, 1] |
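
To make the churn definition concrete, the labelling rule can be reproduced on plain timestamps, independent of the monad API. This is an illustration only: `churn_label` is a hypothetical function, it treats the future window as half-open (split time included, window end excluded), which is an assumption about `interval_from`, and it leaves out the incomplete-window check that `target_fn` performs.

```python
from datetime import datetime, timedelta


def churn_label(event_times, split_time, window_days):
    """Return 1 (churned), 0 (not churned), or None (entity skipped)."""
    history = [t for t in event_times if t < split_time]
    if not history:
        return None  # no events before the split: skipped, as in target_fn
    window_end = split_time + timedelta(days=window_days)
    # Assumption: the window covers split_time <= t < window_end.
    in_window = [t for t in event_times if split_time <= t < window_end]
    return 0 if in_window else 1


split = datetime(2020, 9, 5)
print(churn_label([datetime(2020, 8, 1), datetime(2020, 9, 10)], split, 21))  # prints 0
print(churn_label([datetime(2020, 8, 1), datetime(2020, 10, 1)], split, 21))  # prints 1
print(churn_label([datetime(2020, 9, 10)], split, 21))  # prints None
```

In the second call the only later purchase falls after the 21-day window ends on 26 September, so the entity counts as churned.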

Once ready, run it with:

bash
python /basemodel/configs/binary_classifier/01_train.py

Test — 02_test.py

View the full test script for binary classifier
Python
from datetime import datetime
from pathlib import Path

from monad.ui.config import MetricParams, OutputType, TestingParams
from monad.ui.module import load_from_checkpoint


# --- Names & Paths -----------------------------------------------------------

# EDIT: provide path to scenario checkpoints directory saved during training
scenario_checkpoint_dir = Path("/basemodel/projects/project_dir/scenarios/scenario_name").resolve()

# EDIT: if desired, modify where test outputs will be saved
predictions_path = scenario_checkpoint_dir / "test" / "predictions_and_ground_truth.tsv"


# --- Test Set Definition -----------------------------------------------------

# EDIT: prediction_date defines the start of the temporally defined test set
prediction_date = datetime(2020, 9, 5)

# EDIT: limited runs - use for smoke test, else comment out
limit_test_batches = 100


# --- Metrics -----------------------------------------------------------------

# EDIT: define metrics to compute during testing
metrics = [
    MetricParams(alias="auroc", metric_name="AUROC", kwargs={"task": "binary"}),
    MetricParams(alias="recall", metric_name="Recall", kwargs={"task": "binary"}),
    MetricParams(alias="precision", metric_name="Precision", kwargs={"task": "binary"}),
]

# For more options refer to docs.
# Testing parameters: https://docs.basemodel.ai/reference/testingparams
# Testing metrics: https://docs.basemodel.ai/docs/customizing-testing-metrics


# --- Test --------------------------------------------------------------------

testing_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)

testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
    metrics=metrics,
    limit_test_batches=limit_test_batches,  # smoke test, comment out for full runs
)

testing_module.test(testing_params=testing_params)

| Variable | What to set |
| --- | --- |
| scenario_checkpoint_dir | Path written by training, e.g. /basemodel/project/scenarios/<name> |
| prediction_date | Start of the test period |
| metrics | Metrics to compute (defaults: AUROC, Recall, Precision) |
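
As a reminder of what two of these metrics measure, precision and recall can be computed by hand from scores and ground-truth labels. The sketch below is library-independent and is not how the test script computes them; the 0.5 decision threshold and the function name `precision_recall` are assumptions made for illustration.

```python
def precision_recall(scores, labels, threshold=0.5):
    """Compute (precision, recall) for binary labels at a fixed score threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Guard against division by zero when there are no positives.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# One true positive, one false positive, one false negative.
print(precision_recall([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1]))  # prints (0.5, 0.5)
```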

Once ready, run it with:

bash
python /basemodel/configs/binary_classifier/02_test.py

Inference — 03_predict.py

View the full inference script for binary classifier
Python
from datetime import datetime
from pathlib import Path

from monad.ui.config import OutputType, TestingParams
from monad.ui.module import load_from_checkpoint


# --- Names & Paths -----------------------------------------------------------

# EDIT: provide path to scenario checkpoints directory saved during training
scenario_checkpoint_dir = Path("/basemodel/projects/project_dir/scenarios/scenario_name").resolve()

# EDIT: define where prediction outputs will be saved
predictions_path = scenario_checkpoint_dir / "predictions" / "predictions.tsv"


# --- Prediction Set Definition -----------------------------------------------

# EDIT: prediction_date defines the split point used for inference
prediction_date = datetime(2020, 9, 23)


# --- Predict -----------------------------------------------------------------

prediction_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)

testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
)

prediction_module.predict(testing_params=testing_params)

| Variable | What to set |
| --- | --- |
| scenario_checkpoint_dir | Same path as in the test script |
| predictions_path | Where to save the output TSV file |
| prediction_date | The date to generate predictions for |

Once ready, run it with:

bash
python /basemodel/configs/binary_classifier/03_predict.py
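
Once the script finishes, a quick look at the output file confirms that predictions were written. The sketch below reads any tab-separated file with the standard library. It assumes the first row is a header, which is not confirmed here, and it makes no assumption about the column names; `preview_tsv` is a hypothetical helper.

```python
import csv
from pathlib import Path


def preview_tsv(path: Path, limit: int = 5) -> tuple[list[str], list[list[str]]]:
    """Return the first row (assumed header) and up to `limit` following rows."""
    with path.open(newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])
        # zip stops after `limit` rows without reading the rest of the file.
        rows = [row for _, row in zip(range(limit), reader)]
    return header, rows
```

With the default paths from the script, the file to pass would be `predictions.tsv` in the `predictions` folder under the scenario checkpoint directory.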