Apply to Your Data

You've done the Tutorial. Now point BaseModel at your own data using the onboarding starter package.

Before you start

Confirm the BaseModel container is running (Setup)
Ensure data is accessible via a supported data source
Prepare at least one event table meeting the Requirements

1. Get the starter package

Recommended folder organization

Before unpacking, set up a working structure on your mounted volume:

/basemodel/onboarding_package/ — unpack the starter package here (read-only reference)
/basemodel/configs/ — copy and edit configs and scripts here (your working copies)
/basemodel/project/ — output directory for FM checkpoints, scenario models, and predictions (set via config)

This keeps the original templates, your edits, and model artefacts cleanly separated.
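
If you prefer to script the setup, the layout above can be created in one step. This is a minimal sketch using only the Python standard library; the function name `create_layout` is illustrative and not part of BaseModel, and the default root `/basemodel` is taken from the list above.

```python
from pathlib import Path


def create_layout(root: str = "/basemodel") -> list[Path]:
    """Create the three working folders described above and return their paths."""
    base = Path(root)
    folders = [
        base / "onboarding_package",  # read-only reference copy of the starter package
        base / "configs",             # your working copies of configs and scripts
        base / "project",             # model artefacts (set via config)
    ]
    for folder in folders:
        # parents=True creates the root if needed; exist_ok=True makes reruns safe
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling `create_layout()` with no arguments targets `/basemodel`; pass a different root if your volume is mounted elsewhere.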

Download and unpack the onboarding package:

bash

mkdir -p /basemodel/onboarding_package
curl -sL https://docs.basemodel.ai/downloads/basemodel-onboarding-latest.tar.gz \
  | tar xz -C /basemodel/onboarding_package/

The onboarding_package contains the following files:

onboarding/
├── 0_foundation/
│   ├── 00_quickstart_fm_config_parquet.yaml      # (1)
│   ├── 00_quickstart_fm_config_snowflake.yaml
│   ├── 00_quickstart_fm_config_bigquery.yaml
│   ├── 00_quickstart_fm_config_*.yaml
│   ├── 00_extended_fm_config_parquet.yaml         # (2)
│   ├── 00_extended_fm_config_*.yaml
│   └── 01_pretrain.py                             # (3)
├── 1_scenarios/
│   ├── binary_classifier/                         # (4)
│   ├── multiclass_classifier/
│   ├── multilabel_classifier/
│   ├── recommendation/
│   └── regression/
└── readme.txt

(1) Minimal FM config, one per supported data engine
(2) Full-featured FM config with joins, enrichments, and advanced options
(3) FM training script
(4) Each folder contains a complete scenario pipeline: train, test, and predict scripts
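
To see which data engines your unpacked copy covers, you can list the config files by their name suffix. The sketch below assumes only the file naming pattern shown in the tree above; `available_engines` is a hypothetical helper, not something shipped in the package.

```python
from pathlib import Path


def available_engines(
    foundation_dir: Path, prefix: str = "00_quickstart_fm_config_"
) -> list[str]:
    """Return the data-engine suffixes for which a config with the given prefix exists."""
    engines = []
    for path in sorted(foundation_dir.glob(f"{prefix}*.yaml")):
        # "00_quickstart_fm_config_parquet.yaml" -> "parquet"
        engines.append(path.stem[len(prefix):])
    return engines
```

For example, `available_engines(Path("/basemodel/onboarding_package/onboarding/0_foundation"))` lists the quickstart variants; pass `prefix="00_extended_fm_config_"` to list the extended ones.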

2. Configure the foundation model

Copy the config that matches your data engine into your working directory. For example, for Parquet:

bash

mkdir -p /basemodel/configs
cp /basemodel/onboarding_package/onboarding/0_foundation/00_quickstart_fm_config_parquet.yaml \
   /basemodel/configs/fm_config.yaml

This example uses the minimal Parquet config, 00_quickstart_fm_config_parquet.yaml. If your data lives in another engine, copy the file with the matching suffix instead.

Quickstart vs extended configs

Quickstart (00_quickstart_*) has the minimum fields to get a foundation model running
Extended (00_extended_*) adds attribute tables, joins, enrichments, and data loading controls
Start with quickstart; move to extended once your first model trains successfully
See Foundation Model for full documentation of all options

Open fm_config.yaml and follow the inline comments.

View the full yaml config for Parquet

yaml

# Generic Foundation Model Config — parquet
#
# How to use:
# - select the file for your data engine (this one: parquet)
# - start by configuring at least one event data source (mandatory)
# - optionally add main entity attributes (one row per entity) and attribute tables (dimensions) for joins
# - add necessary configurations for date ranges, training / validation split etc.
#
# Helpful docs (only use if stuck):
# - Parquet data sources: https://docs.basemodel.ai/docs/parquet-data-sources
# - Data sources overview: https://docs.basemodel.ai/docs/connection-to-the-data-sources

# ------------------------------------------------------------------------------
# 1) data_sources block
# - events are mandatory
# - main entity attributes and attribute tables are optional
# ------------------------------------------------------------------------------

data_sources:

  # events (minimum 1 table mandatory)
  - type: event
    # name your table for BaseModel reference
    name: transactions
    data_location:
      database_type: parquet
      connection_params:
        # parquet file path
        path: "/path/to/data_dir/transactions_train.parquet"
        # database cache; keep it local to your project/workdir
        cache_path: "/basemodel/db_cache/"
      # database table reference
      table_name: transactions
    # entity for which we model and predict; multiple time-stamped event rows per entity id are expected
    main_entity_column: customer_id
    # event time column (required for event sources)
    date_column:
      name: t_dat
      format: "%Y-%m-%d"
    # optional: drop columns early via allowed / disallowed columns (helps memory, may prevent leakage)
    disallowed_columns: ["order_id"] # unique ID per event, no signal
    # allowed_columns: ["customer_id", "article_id", "t_dat", "sales_channel_id", "price"]
    # check: attribute tables require explicit joins on each relevant event table 
    # docs: https://docs.basemodel.ai/docs/joins
    joined_data_sources:
      - name: articles # name of the attributes data source
        join_on:
          - [article_id, article_id] # [event_column, attribute_column]

  # main entity attributes (optional)
  # expected: one row per entity id (e.g., customer profile)
  - type: main_entity_attribute
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/customers.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: customers
    main_entity_column: customer_id

  # attributes (optional)
  # expected: dimension tables used in joins into events
  - type: attribute
    name: articles
    allowed_columns: ["product_type_name", "product_group_name", "department_name", "section_name", "colour_group_name", "perceived_colour_master_name"]
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/articles.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: articles

# ------------------------------------------------------------------------------
# 2) data_params block
# ------------------------------------------------------------------------------

data_params:
  # earliest timestamp used for events considered in training/validation
  data_start_date: "2018-09-20 00:00:00"

  # how we split into training and validation; here we use 10% entity hold-out
  split:
    type: entity
    training: 90
    validation: 10

    # optional: hold-out test window (later than training/validation)
    # docs: https://docs.basemodel.ai/docs/controlling-data
    test:
      start_date: "2020-09-05 00:00:00"
      end_date: "2020-09-22 00:00:00"

    training_validation_end: "2020-09-04 00:00:00"


# ------------------------------------------------------------------------------
# 3) training_params block (here only to be used for a smoke test)
# ------------------------------------------------------------------------------

training_params:
  # limit batches to validate the setup quickly, then remove for a real run
  limit_train_batches: 5
  limit_val_batches: 5


# ------------------------------------------------------------------------------
# Optional blocks you may add later (kept out of quickstart on purpose)
#
# - data_loader_params (batching / workers):
#   docs: https://docs.basemodel.ai/docs/data-loading-configuration
#
# - memory_constraining_params (memory / model size constraints):
# - query_optimization (query parallelization)
#   docs: https://docs.basemodel.ai/docs/controlling-space-and-memory
# ------------------------------------------------------------------------------

# ------------------------------------------------------------------------------
# Other configuration options:
# - Model training configuration (metaparams, multi-GPU, precision, early stopping, checkpointing etc.): 
#   docs: https://docs.basemodel.ai/docs/model-training-configuration
# - Data configurations (split point rules, sampling, weighing etc.):  
#   docs: https://docs.basemodel.ai/docs/data-configurations
# - Data transformations (filtering, grouping, lambdas, data type overrides): 
#   docs: https://docs.basemodel.ai/docs/data-transformations-advanced
# ------------------------------------------------------------------------------

At minimum you must set:

What to change	Where in YAML
Entity ID column	`main_entity_column`
File paths or connection credentials	`connection_params`
Table names	`table_name`, `name`
Date column and format	`date_column`
Data start date and temporal split boundaries	`data_params.data_start_date`, `data_params.split`
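
Before launching a long training run, it can help to check the edited config for the fields listed above. The following is a sketch only: it assumes the YAML has already been loaded into a dictionary (for example with PyYAML's `yaml.safe_load`), it uses the key names from the Parquet quickstart config, and `check_fm_config` is a hypothetical helper. BaseModel's own validation may be stricter or differ in detail.

```python
from datetime import datetime


def check_fm_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    sources = config.get("data_sources", [])
    events = [s for s in sources if s.get("type") == "event"]
    if not events:
        problems.append("at least one data source of type 'event' is required")
    for source in events:
        name = source.get("name", "<unnamed>")
        if not source.get("main_entity_column"):
            problems.append(f"{name}: main_entity_column is missing")
        if not (source.get("date_column") or {}).get("name"):
            problems.append(f"{name}: date_column.name is missing")
        if not source.get("data_location", {}).get("table_name"):
            problems.append(f"{name}: data_location.table_name is missing")

    # the test window is documented as later than training/validation
    split = config.get("data_params", {}).get("split", {})
    test = split.get("test") or {}
    tv_end = split.get("training_validation_end")
    if tv_end and test.get("start_date") and test.get("end_date"):
        fmt = "%Y-%m-%d %H:%M:%S"  # timestamp format used in the quickstart config
        end = datetime.strptime(tv_end, fmt)
        start = datetime.strptime(test["start_date"], fmt)
        stop = datetime.strptime(test["end_date"], fmt)
        if not end < start < stop:
            problems.append(
                "expected training_validation_end < test.start_date < test.end_date"
            )
    return problems
```

An empty result means none of these specific checks failed; it does not guarantee that the paths exist or that the data meets the Requirements.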

3. Train the foundation model

Copy the training script and update the paths marked # EDIT:

bash

cp /basemodel/onboarding_package/onboarding/0_foundation/01_pretrain.py \
   /basemodel/configs/01_pretrain.py

View the full pretrain script

Python

from pathlib import Path

from monad.ui import pretrain


# --- Names & Paths -----------------------------------------------------------

# EDIT: provide path to your foundation model config (yaml)
# it defines data sources, splits, training params, and resource constraints
config_path = Path("/basemodel/configs/<foundation_model_config.yaml>").resolve()

# EDIT: define where foundation model outputs will be saved (this becomes your project_dir)
# the script will create /fm, /features, /lightning_checkpoints etc. under this directory
project_dir = Path("/basemodel/projects/project_dir").resolve()


# --- Pretrain ----------------------------------------------------------------

# CHECK: overwrite=True starts training from scratch and replaces existing outputs in project_dir
# replace it with resume=True to resume training if partial outputs already exist
pretrain(
    config_path=config_path,
    output_path=project_dir,
    overwrite=True,
)

# For context refer to docs:
# FM stage overview - https://docs.basemodel.ai/docs/pretrain-stage-overview
# FM training - https://docs.basemodel.ai/docs/training-foundation-model
# End-to-end tutorial - https://docs.basemodel.ai/docs/e2e-tutorial-kaggle-hm

Variable in `01_pretrain.py`	What to set
`config_path`	Path to your edited config, e.g. `/basemodel/configs/fm_config.yaml`
`project_dir`	Project directory for foundation model outputs, e.g. `/basemodel/project`; the script creates `fm/`, `features/`, and other subfolders under it
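
The scenario scripts in step 5 look for the foundation model under `project_dir / "fm"`, so it is worth confirming that the expected subfolders exist once training has produced output. A minimal sketch follows; the folder names come from the comment in `01_pretrain.py`, the assumption that they are plain directories is mine, and `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path


def missing_outputs(
    project_dir: Path,
    expected: tuple[str, ...] = ("fm", "features", "lightning_checkpoints"),
) -> list[str]:
    """Return the names of expected subfolders that are not present under project_dir."""
    return [name for name in expected if not (project_dir / name).is_dir()]
```

For example, `missing_outputs(Path("/basemodel/project"))` returns an empty list when all three folders are present. Some folders may only appear partway through training, so a non-empty result during a run is not necessarily an error.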

Run:

bash

python /basemodel/configs/01_pretrain.py

Foundation model training takes time — proceed to step 4 while it runs.

You can also train via the CLI directly. See Training Execution for details.

4. Pick and configure a scenario

While the foundation model trains, prepare your scenario scripts. Browse 1_scenarios/ and copy the folder that matches your use case into /basemodel/configs/:

Folder	Use case
`binary_classifier/`	Churn, propensity, yes/no predictions
`multiclass_classifier/`	Category prediction (one of N)
`multilabel_classifier/`	Multiple simultaneous labels
`regression/`	Numeric predictions (LTV, spend, …)
`recommendation/`	Item ranking and suggestions

For example, to use the binary classifier:

bash

cp -r /basemodel/onboarding_package/onboarding/1_scenarios/binary_classifier/ \
      /basemodel/configs/binary_classifier/

5. Training, test, inference

Each scenario folder contains three scripts. The edits depend on the scenario type; here we use the binary classifier as an example.

Train — `01_train.py`

View the full training script for binary classifier

Python

from datetime import timedelta
from pathlib import Path
from typing import Dict

import numpy as np

from monad.batch import SPLIT_TIMESTAMP
from monad.ui.config import TrainingParams
from monad.ui.module import BinaryClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events, has_incomplete_training_window


# --- Names & Paths -----------------------------------------------------------

# EDIT: provide path to project directory, PARENT to /fm, /features, /lightning_checkpoints etc.
project_dir = Path("/basemodel/projects/project_dir").resolve()
# EDIT: define name for scenario checkpoints directory; the script will put it under the same parent directory as fm
scenario_name = "scenario_name"

# creating the relative paths
foundation_model_path = project_dir / "fm"
scenario_model_path = project_dir / "scenarios" / scenario_name


# --- Target Function ---------------------------------------------------------

# churn definition, for reference:
# 1: churned (no events in the future window)
# 0: not churned (at least one event in the future window)

# EDIT: provide target details
TARGET_EVENT_TABLE = "transactions" # the event data source the target logic is based on (here: purchases of products)
TARGET_WINDOW_DAYS = 21 # how far into the future to look, in days (other time units are possible)


def target_fn(history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray | None:

    # filters out users with too short remaining window
    if has_incomplete_training_window(_ctx, timedelta(days=TARGET_WINDOW_DAYS)):
        return None

    # filters out users with no history
    if history[TARGET_EVENT_TABLE].count() == 0:
        return None

    # trims the future to desired window
    future_window = future.interval_from(
        _ctx[SPLIT_TIMESTAMP],
        timedelta(days=TARGET_WINDOW_DAYS),
    )

    # churn label
    y = 0 if future_window[TARGET_EVENT_TABLE].count() > 0 else 1
    return np.array([y], dtype=np.float32)


# --- Training ----------------------------------------------------------------

# EDIT: metaparams - keep default unless experimenting
learning_rate = 0.0001
epochs = 3 # use 1 for smoke test

# EDIT: limited runs - use for smoke test, then comment out here and below
limit_train_batches = 5 
limit_val_batches = 5

# EDIT: parallelised training - comment out to default to a single GPU
strategy = "ddp"
devices = [0, 1] # list GPU indices

# For more options refer to docs: https://docs.basemodel.ai/reference/trainingparams

task = BinaryClassificationTask()

training_params = TrainingParams(
    checkpoint_dir=scenario_model_path,
    learning_rate=learning_rate,
    epochs=epochs,
    devices=devices,
    strategy=strategy,
    limit_train_batches=limit_train_batches, # smoke test, comment out for full runs
    limit_val_batches=limit_val_batches, # smoke test, comment out for full runs
)

trainer = load_from_foundation_model(
    checkpoint_path=foundation_model_path,
    downstream_task=task,
    target_fn=target_fn,
)

trainer.fit(training_params=training_params, overwrite=True) # replace the last param with resume=True for resumed training

Variable	What to set
`project_dir`	Path to your project directory, e.g. `/basemodel/project`
`scenario_name`	Name for this scenario's checkpoints
`TARGET_EVENT_TABLE`	Name of your event table (must match FM config)
`TARGET_WINDOW_DAYS`	How far into the future to predict
`devices`	GPU indices to use, e.g. `[0]` or `[0, 1]`
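
To make the churn rule in `target_fn` easier to reason about, here is the same decision logic written against plain Python lists of timestamps instead of the `monad` `Events` objects. It is a sketch under two assumptions that the script does not spell out: an "incomplete window" is taken to mean that the window would run past the last available data, and the window is treated as starting at the split point (inclusive) and ending before `split_time + window`. The function and parameter names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional


def churn_label(
    event_times: list[datetime],
    split_time: datetime,
    data_end: datetime,
    window: timedelta,
) -> Optional[int]:
    """Mirror of the churn rule: 1 = churned, 0 = not churned, None = skip the sample."""
    # skip when the future window would run past the available data (assumption)
    if split_time + window > data_end:
        return None
    # skip entities with no events before the split point
    history = [t for t in event_times if t < split_time]
    if not history:
        return None
    # keep only future events that fall inside the target window
    in_window = [t for t in event_times if split_time <= t < split_time + window]
    return 0 if in_window else 1
```

With a 21-day window, an entity that bought before the split and again ten days after it gets label 0, while an entity that bought before the split and never again gets label 1.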

Once ready, run it with:

bash

python /basemodel/configs/binary_classifier/01_train.py

Test — `02_test.py`

View the full test script for binary classifier

Python

from datetime import datetime
from pathlib import Path

from monad.ui.config import MetricParams, OutputType, TestingParams
from monad.ui.module import load_from_checkpoint


# --- Names & Paths -----------------------------------------------------------

# EDIT: provide path to scenario checkpoints directory saved during training
scenario_checkpoint_dir = Path("/basemodel/projects/project_dir/scenarios/scenario_name").resolve()

# EDIT: if desired, modify where test outputs will be saved
predictions_path = scenario_checkpoint_dir / "test" / "predictions_and_ground_truth.tsv"


# --- Test Set Definition -----------------------------------------------------

# EDIT: prediction_date defines the start of the temporally defined test set
prediction_date = datetime(2020, 9, 5)

# EDIT: limited runs - use for smoke test, else comment out
limit_test_batches = 100


# --- Metrics -----------------------------------------------------------------

# EDIT: define metrics to compute during testing
metrics = [
    MetricParams(alias="auroc", metric_name="AUROC", kwargs={"task": "binary"}),
    MetricParams(alias="recall", metric_name="Recall", kwargs={"task": "binary"}),
    MetricParams(alias="precision", metric_name="Precision", kwargs={"task": "binary"}),
]

# For more options refer to docs.
# Testing parameters: https://docs.basemodel.ai/reference/testingparams
# Testing metrics: https://docs.basemodel.ai/docs/customizing-testing-metrics


# --- Test --------------------------------------------------------------------

testing_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)

testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
    metrics=metrics,
    limit_test_batches=limit_test_batches,  # smoke test, comment out for full runs
)

testing_module.test(testing_params=testing_params)

Variable	What to set
`scenario_checkpoint_dir`	Path written by training, e.g. `/basemodel/project/scenarios/<name>`
`prediction_date`	Start of the test period
`metrics`	Metrics to compute (defaults: AUROC, Recall, Precision)
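
As a reminder of what two of the default metrics measure, the sketch below computes precision and recall by hand from predicted scores and true labels. It illustrates the definitions only and is not how BaseModel computes them; the 0.5 decision threshold and the function name are assumptions.

```python
def precision_recall(
    scores: list[float], labels: list[int], threshold: float = 0.5
) -> tuple[float, float]:
    """Compute binary precision and recall at a fixed decision threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(predicted, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # true positives
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # false positives
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # false negatives
    # return 0.0 instead of dividing by zero when a denominator is empty
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision answers "of the entities flagged as churners, how many actually churned", and recall answers "of the entities that churned, how many were flagged".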

Once ready, run it with:

bash

python /basemodel/configs/binary_classifier/02_test.py

Inference — `03_predict.py`

View the full inference script for binary classifier

Python

from datetime import datetime
from pathlib import Path

from monad.ui.config import OutputType, TestingParams
from monad.ui.module import load_from_checkpoint


# --- Names & Paths -----------------------------------------------------------

# EDIT: provide path to scenario checkpoints directory saved during training
scenario_checkpoint_dir = Path("/basemodel/projects/project_dir/scenarios/scenario_name").resolve()

# EDIT: define where prediction outputs will be saved
predictions_path = scenario_checkpoint_dir / "predictions" / "predictions.tsv"


# --- Prediction Set Definition -----------------------------------------------

# EDIT: prediction_date defines the split point used for inference
prediction_date = datetime(2020, 9, 23)


# --- Predict -----------------------------------------------------------------

prediction_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)

testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
)

prediction_module.predict(testing_params=testing_params)

Variable	What to set
`scenario_checkpoint_dir`	Same path as test
`predictions_path`	Where to save the output TSV file
`prediction_date`	The date to generate predictions for
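
Once the prediction file has been written, a quick preview confirms that it contains what you expect. This sketch reads any tab-separated file with the standard `csv` module; it assumes the file starts with a header row, and it makes no assumption about which columns BaseModel writes. `preview_tsv` is a hypothetical helper.

```python
import csv
from pathlib import Path


def preview_tsv(path: Path, limit: int = 5) -> tuple[list[str], list[dict]]:
    """Return the header and the first `limit` rows of a tab-separated file."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        rows = []
        for row in reader:
            if len(rows) >= limit:
                break
            rows.append(row)
        # fieldnames is None for an empty file, so fall back to an empty list
        return list(reader.fieldnames or []), rows
```

For example, `header, rows = preview_tsv(predictions_path)` followed by `print(header)` shows the column names of the file produced by `03_predict.py`.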

Once ready, run it with:

bash

python /basemodel/configs/binary_classifier/03_predict.py
