Apply to Your Data
You've done the Tutorial. Now point BaseModel at your own data using the onboarding starter package.
Before you start
- Confirm the BaseModel container is running (Setup)
- Ensure data is accessible via a supported data source
- Prepare at least one event table meeting the Requirements
1. Get the starter package
Recommended folder organization
Before unpacking, set up a working structure on your mounted volume:
- /basemodel/onboarding_package/: unpack the starter package here (read-only reference)
- /basemodel/configs/: copy and edit configs and scripts here (your working copies)
- /basemodel/project/: output directory for FM checkpoints, scenario models, and predictions (set via config)
This keeps the original templates, your edits, and model artefacts cleanly separated.
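As a rough sketch, the same layout can also be created from Python with pathlib. The helper name create_workspace is illustrative and not part of BaseModel; the directory names are the ones listed above.

```python
from pathlib import Path

# Sub-directories from the recommended layout above.
WORKSPACE_DIRS = ("onboarding_package", "configs", "project")


def create_workspace(root: Path) -> list[Path]:
    """Create the recommended sub-directories under `root` and return them."""
    created = []
    for name in WORKSPACE_DIRS:
        path = root / name
        # exist_ok=True makes the call safe to repeat on an existing layout.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_workspace(Path("/basemodel")) would create the three folders on the mounted volume, assuming it is mounted at /basemodel.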
Download and unpack the onboarding package:
mkdir -p /basemodel/onboarding_package
curl -sL https://docs.basemodel.ai/downloads/basemodel-onboarding-latest.tar.gz \
| tar xz -C /basemodel/onboarding_package/
The onboarding_package contains the following files:
onboarding/
├── 0_foundation/
│ ├── 00_quickstart_fm_config_parquet.yaml # (1)
│ ├── 00_quickstart_fm_config_snowflake.yaml
│ ├── 00_quickstart_fm_config_bigquery.yaml
│ ├── 00_quickstart_fm_config_*.yaml
│ ├── 00_extended_fm_config_parquet.yaml # (2)
│ ├── 00_extended_fm_config_*.yaml
│ └── 01_pretrain.py # (3)
├── 1_scenarios/
│ ├── binary_classifier/ # (4)
│ ├── multiclass_classifier/
│ ├── multilabel_classifier/
│ ├── recommendation/
│ └── regression/
└── readme.txt
(1) Minimal FM config, one per supported data engine
(2) Full-featured FM config with joins, enrichments, and advanced options
(3) FM training script
(4) Each folder contains a complete scenario pipeline: train, test, and predict scripts
2. Configure the foundation model
Copy the config for your data engine into your working directory. For example:
mkdir -p /basemodel/configs
cp /basemodel/onboarding_package/onboarding/0_foundation/00_quickstart_fm_config_parquet.yaml \
/basemodel/configs/fm_config.yaml
This example uses the minimal Parquet config, 00_quickstart_fm_config_parquet.yaml; pick the file that matches your data engine.
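If you prefer to script this step, the copy can be sketched in Python. The function name copy_quickstart_config is hypothetical; the file naming pattern and target name fm_config.yaml follow the package listing and the command above, and the engine names are assumed to match the file suffixes (parquet, snowflake, bigquery).

```python
import shutil
from pathlib import Path


def copy_quickstart_config(engine: str, package_root: Path, configs_dir: Path) -> Path:
    """Copy the quickstart config for `engine` to configs_dir/fm_config.yaml."""
    source = (
        package_root / "onboarding" / "0_foundation"
        / f"00_quickstart_fm_config_{engine}.yaml"
    )
    if not source.is_file():
        raise FileNotFoundError(f"No quickstart config for engine '{engine}': {source}")
    configs_dir.mkdir(parents=True, exist_ok=True)
    target = configs_dir / "fm_config.yaml"
    # Copy the contents only, so the template in the package stays untouched.
    shutil.copyfile(source, target)
    return target
```

For the paths used in this guide, the call would be copy_quickstart_config("parquet", Path("/basemodel/onboarding_package"), Path("/basemodel/configs")).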
Quickstart vs extended configs
- Quickstart (00_quickstart_*) has the minimum fields to get a foundation model running
- Extended (00_extended_*) adds attribute tables, joins, enrichments, and data loading controls
- Start with quickstart; move to extended once your first model trains successfully
- See Foundation Model for full documentation of all options
Open fm_config.yaml and follow the inline comments.
View the full yaml config for Parquet
# Generic Foundation Model Config — parquet
#
# How to use:
# - select the file for your data engine (this one: parquet)
# - start by configuring at least one event data source (mandatory)
# - optionally add main entity attributes (one row per entity) and attribute tables (dimensions) for joins
# - add necessary configurations for date ranges, training / validation split etc.
#
# Helpful docs (only use if stuck):
# - Parquet data sources: https://docs.basemodel.ai/docs/parquet-data-sources
# - Data sources overview: https://docs.basemodel.ai/docs/connection-to-the-data-sources
# ------------------------------------------------------------------------------
# 1) data_sources block
# - events are mandatory
# - main entity attributes and attribute tables are optional
# ------------------------------------------------------------------------------
data_sources:
# events (minimum 1 table mandatory)
- type: event
# name your table for BaseModel reference
name: transactions
data_location:
database_type: parquet
connection_params:
# parquet file path
path: "/path/to/data_dir/transactions_train.parquet"
# database cache; keep it local to your project/workdir
cache_path: "/basemodel/db_cache/"
# database table reference
table_name: transactions
# entity for which we model and predict; multiple time-stamped event rows per entity id are expected
main_entity_column: customer_id
# event time column (required for event sources)
date_column:
name: t_dat
format: "%Y-%m-%d"
# optional: drop columns early via allowed / disallowed columns (helps memory, may prevent leakage)
disallowed_columns: ["order_id"] # unique ID per event, no signal
# allowed_columns: ["customer_id", "article_id", "t_dat", "sales_channel_id", "price"]
# check: attribute tables require explicit joins on each relevant event table
# docs: https://docs.basemodel.ai/docs/joins
joined_data_sources:
- name: articles # name of the attributes data source
join_on:
- [article_id, article_id] # [event_column, attribute_column]
# main entity attributes (optional)
# expected: one row per entity id (e.g., customer profile)
- type: main_entity_attribute
name: customers
data_location:
database_type: parquet
connection_params:
path: "/path/to/data_dir/customers.parquet"
cache_path: "/basemodel/db_cache/"
table_name: customers
main_entity_column: customer_id
# attributes (optional)
# expected: dimension tables used in joins into events
- type: attribute
name: articles
allowed_columns: ["product_type_name", "product_group_name", "department_name", "section_name", "colour_group_name", "perceived_colour_master_name"]
data_location:
database_type: parquet
connection_params:
path: "/path/to/data_dir/articles.parquet"
cache_path: "/basemodel/db_cache/"
table_name: articles
# ------------------------------------------------------------------------------
# 2) data_params block
# ------------------------------------------------------------------------------
data_params:
# earliest timestamp used for events considered in training/validation
data_start_date: "2018-09-20 00:00:00"
# how we split into training and validation; here we use 10% entity hold-out
split:
type: entity
training: 90
validation: 10
# optional: hold-out test window (later than training/validation)
# docs: https://docs.basemodel.ai/docs/controlling-data
test:
start_date: "2020-09-05 00:00:00"
end_date: "2020-09-22 00:00:00"
training_validation_end: "2020-09-04 00:00:00"
# ------------------------------------------------------------------------------
# 3) training_params block (here only to be used for a smoke test)
# ------------------------------------------------------------------------------
training_params:
# limit batches to validate the setup quickly, then remove for a real run
limit_train_batches: 5
limit_val_batches: 5
# ------------------------------------------------------------------------------
# Optional blocks you may add later (kept out of quickstart on purpose)
#
# - data_loader_params (batching / workers):
# docs: https://docs.basemodel.ai/docs/data-loading-configuration
#
# - memory_constraining_params (memory / model size constraints):
# - query_optimization (query parallelization)
# docs: https://docs.basemodel.ai/docs/controlling-space-and-memory
# ------------------------------------------------------------------------------
# ------------------------------------------------------------------------------
# Other configuration options:
# - Model training configuration (metaparams, multi-GPU, precision, early stopping, checkpointing etc.):
# docs: https://docs.basemodel.ai/docs/model-training-configuration
# - Data configurations (split point rules, sampling, weighting etc.):
# docs: https://docs.basemodel.ai/docs/data-configurations
# - Data transformations (filtering, grouping, lambdas, data type overrides):
# docs: https://docs.basemodel.ai/docs/data-transformations-advanced
# ------------------------------------------------------------------------------
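The date fields in data_params are easy to get out of order when adapting the example. The sketch below checks one plausible ordering, assuming that data_start_date precedes training_validation_end and that the test window starts after it, as the example values and the "later than training/validation" comment suggest. The function name check_date_order is hypothetical and the strict comparisons are an assumption, not documented BaseModel behaviour.

```python
from datetime import datetime

# Timestamp layout used by the example values in data_params.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def check_date_order(data_start: str, train_val_end: str,
                     test_start: str, test_end: str) -> list[str]:
    """Return a list of problems; an empty list means the order looks consistent."""
    start, tv_end, t_start, t_end = (
        datetime.strptime(value, TIMESTAMP_FORMAT)
        for value in (data_start, train_val_end, test_start, test_end)
    )
    problems = []
    if not start < tv_end:
        problems.append("data_start_date must precede training_validation_end")
    if not tv_end < t_start:
        problems.append("test start_date must follow training_validation_end")
    if not t_start < t_end:
        problems.append("test start_date must precede test end_date")
    return problems
```

With the values from the example config, the function returns an empty list.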
At minimum you must set:
| What to change | Where in YAML |
|---|---|
| Entity ID column | main_entity_column |
| File paths or connection credentials | connection_params |
| Table names | table_name, name |
| Date column and format | date_column |
| Temporal split boundaries | data_params.split |
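A quick way to catch a forgotten field is to check that each key from the table appears somewhere in the config. The sketch below assumes the YAML has already been loaded into a plain dict (for instance with a YAML parser of your choice). It only tests for the presence of keys at any nesting level, not for correct values or placement, and the names REQUIRED_KEYS, collect_keys, and missing_minimum_keys are illustrative.

```python
# Keys named in the "At minimum you must set" table.
REQUIRED_KEYS = {
    "main_entity_column",
    "connection_params",
    "table_name",
    "name",
    "date_column",
    "split",
}


def collect_keys(node) -> set:
    """Collect every mapping key found anywhere in a nested dict/list structure."""
    keys = set()
    if isinstance(node, dict):
        for key, value in node.items():
            keys.add(key)
            keys |= collect_keys(value)
    elif isinstance(node, list):
        for item in node:
            keys |= collect_keys(item)
    return keys


def missing_minimum_keys(config: dict) -> set:
    """Return the required keys that do not appear anywhere in `config`."""
    return REQUIRED_KEYS - collect_keys(config)
```

An empty result means every required key is present at least once; it does not prove that every data source is complete.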
3. Train the foundation model
Copy the training script and update the paths marked # EDIT:
cp /basemodel/onboarding_package/onboarding/0_foundation/01_pretrain.py \
/basemodel/configs/01_pretrain.py
View the full pretrain script
from pathlib import Path
from monad.ui import pretrain
# --- Names & Paths -----------------------------------------------------------
# EDIT: provide path to your foundation model config (yaml)
# it defines data sources, splits, training params, and resource constraints
config_path = Path("/basemodel/configs/<foundation_model_config.yaml>").resolve()
# EDIT: define where foundation model outputs will be saved (this becomes your project_dir)
# the script will create /fm, /features, /lightning_checkpoints etc. under this directory
project_dir = Path("/basemodel/projects/project_dir").resolve()
# --- Pretrain ----------------------------------------------------------------
# CHECK: overwrite=True starts training from scratch and replaces existing outputs in project_dir
# replace it with resume=True to resume training if partial outputs already exist
pretrain(
    config_path=config_path,
    output_path=project_dir,
    overwrite=True,
)
# For context refer to docs:
# FM stage overview - https://docs.basemodel.ai/docs/pretrain-stage-overview
# FM training - https://docs.basemodel.ai/docs/training-foundation-model
# End-to-end tutorial - https://docs.basemodel.ai/docs/e2e-tutorial-kaggle-hm
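| Variable in 01_pretrain.py | What to set |
|---|---|
| config_path | Path to your edited config, e.g. /basemodel/configs/fm_config.yaml |
| project_dir (passed as output_path) | Project directory for foundation model outputs, e.g. /basemodel/project; the script creates fm/, features/, and lightning_checkpoints/ under it |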
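Run the script with the Python interpreter inside the container, for example:
python /basemodel/configs/01_pretrain.py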
Foundation model training takes time — proceed to step 4 while it runs.
You can also train via the CLI directly. See Training Execution for details.
4. Pick and configure a scenario
While the foundation model trains, prepare your scenario scripts. Browse 1_scenarios/ and copy the folder that matches your use case into /basemodel/configs/:
| Folder | Use case |
|---|---|
| binary_classifier/ | Churn, propensity, yes/no predictions |
| multiclass_classifier/ | Category prediction (one of N) |
| multilabel_classifier/ | Multiple simultaneous labels |
| regression/ | Numeric predictions (LTV, spend, …) |
| recommendation/ | Item ranking and suggestions |
For example, to use the binary classifier:
cp -r /basemodel/onboarding_package/onboarding/1_scenarios/binary_classifier/ \
/basemodel/configs/binary_classifier/
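The same copy step can be sketched in Python, with the folder names from the table used to reject typos. The names SCENARIO_USE_CASES and copy_scenario are hypothetical helpers, not part of the onboarding package.

```python
import shutil
from pathlib import Path

# Scenario folders and their use cases, as listed in the table above.
SCENARIO_USE_CASES = {
    "binary_classifier": "Churn, propensity, yes/no predictions",
    "multiclass_classifier": "Category prediction (one of N)",
    "multilabel_classifier": "Multiple simultaneous labels",
    "regression": "Numeric predictions (LTV, spend, ...)",
    "recommendation": "Item ranking and suggestions",
}


def copy_scenario(scenario: str, package_root: Path, configs_dir: Path) -> Path:
    """Copy one scenario folder from the package into the working configs directory."""
    if scenario not in SCENARIO_USE_CASES:
        raise ValueError(
            f"Unknown scenario '{scenario}'; expected one of {sorted(SCENARIO_USE_CASES)}"
        )
    source = package_root / "onboarding" / "1_scenarios" / scenario
    target = configs_dir / scenario
    # copytree raises if the target already exists, which protects earlier edits.
    shutil.copytree(source, target)
    return target
```

Because copytree refuses to overwrite an existing folder, re-running the helper will not silently discard scripts you have already edited.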
5. Training, test, inference
Each scenario folder contains three scripts. Edits depend on the scenario type — here we use binary classifier as an example.
Train — 01_train.py
View the full training script for binary classifier
from datetime import timedelta
from pathlib import Path
from typing import Dict
import numpy as np
from monad.batch import SPLIT_TIMESTAMP
from monad.ui.config import TrainingParams
from monad.ui.module import BinaryClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events, has_incomplete_training_window
# --- Names & Paths -----------------------------------------------------------
# EDIT: provide path to project directory, PARENT to /fm, /features, /lightning_checkpoints etc.
project_dir = Path("/basemodel/projects/project_dir").resolve()
# EDIT: define name for scenario checkpoints directory; the script will put it under the same parent directory as fm
scenario_name = "scenario_name"
# creating the relative paths
foundation_model_path = project_dir / "fm"
scenario_model_path = project_dir / "scenarios" / scenario_name
# --- Target Function ---------------------------------------------------------
# churn definition, for reference:
# 1: churned (no events in the future window)
# 0: not churned (at least one event in the future window)
# EDIT: provide target details
TARGET_EVENT_TABLE = "transactions" # the event data source to base the target logic (here: purchases of products)
TARGET_WINDOW_DAYS = 21 # the length of time to look at into the future (other time units are possible)
def target_fn(history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray | None:
    # filters out users with too short remaining window
    if has_incomplete_training_window(_ctx, timedelta(days=TARGET_WINDOW_DAYS)):
        return None
    # filters out users with no history
    if history[TARGET_EVENT_TABLE].count() == 0:
        return None
    # trims the future to desired window
    future_window = future.interval_from(
        _ctx[SPLIT_TIMESTAMP],
        timedelta(days=TARGET_WINDOW_DAYS),
    )
    # churn label
    y = 0 if future_window[TARGET_EVENT_TABLE].count() > 0 else 1
    return np.array([y], dtype=np.float32)
# --- Training ----------------------------------------------------------------
# EDIT: metaparams - keep default unless experimenting
learning_rate = 0.0001
epochs = 3 # use 1 for smoke test
# EDIT: limited runs - use for smoke test, then comment out here and below
limit_train_batches = 5
limit_val_batches = 5
# EDIT: parallelised training - comment out to default to a single GPU
strategy = "ddp"
devices = [0, 1] # list GPU indices
# For more options refer to docs: https://docs.basemodel.ai/reference/trainingparams
task = BinaryClassificationTask()
training_params = TrainingParams(
    checkpoint_dir=scenario_model_path,
    learning_rate=learning_rate,
    epochs=epochs,
    devices=devices,
    strategy=strategy,
    limit_train_batches=limit_train_batches, # smoke test, comment out for full runs
    limit_val_batches=limit_val_batches, # smoke test, comment out for full runs
)
trainer = load_from_foundation_model(
    checkpoint_path=foundation_model_path,
    downstream_task=task,
    target_fn=target_fn,
)
trainer.fit(training_params=training_params, overwrite=True) # replace the last param with resume=True for resumed training
| Variable | What to set |
|---|---|
| project_dir | Path to your project directory, e.g. /basemodel/project |
| scenario_name | Name for this scenario's checkpoints |
| TARGET_EVENT_TABLE | Name of your event table (must match FM config) |
| TARGET_WINDOW_DAYS | How far into the future to predict |
| devices | GPU indices to use, e.g. [0] or [0, 1] |
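To make the churn rule in target_fn easier to reason about before editing it, here is a simplified sketch that applies the same labelling logic to a plain list of event timestamps instead of the monad Events objects. It omits the has_incomplete_training_window check, and the window boundaries (split time included, window end excluded) are an assumption rather than documented behaviour of interval_from. The function name churn_label is illustrative.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def churn_label(event_times: Sequence[datetime], split_time: datetime,
                window_days: int) -> Optional[int]:
    """Return 1 (churned), 0 (not churned), or None when the entity is skipped."""
    # Entities with no events before the split point are filtered out.
    history = [t for t in event_times if t < split_time]
    if not history:
        return None
    # Only events inside the target window after the split count as activity.
    window_end = split_time + timedelta(days=window_days)
    in_window = [t for t in event_times if split_time <= t < window_end]
    return 0 if in_window else 1
```

For a split on 2020-09-05 with a 21-day window, a customer whose next purchase falls on 2020-10-01 is labelled 1 (churned), because that event lies outside the window.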
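Once ready, run the training script with the Python interpreter inside the container, for example:
python /basemodel/configs/binary_classifier/01_train.py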
Test — 02_test.py
View the full test script for binary classifier
from datetime import datetime
from pathlib import Path
from monad.ui.config import MetricParams, OutputType, TestingParams
from monad.ui.module import load_from_checkpoint
# --- Names & Paths -----------------------------------------------------------
# EDIT: provide path to scenario checkpoints directory saved during training
scenario_checkpoint_dir = Path("/basemodel/projects/project_dir/scenarios/scenario_name").resolve()
# EDIT: if desired, modify where test outputs will be saved
predictions_path = scenario_checkpoint_dir / "test" / "predictions_and_ground_truth.tsv"
# --- Test Set Definition -----------------------------------------------------
# EDIT: prediction_date defines the start of the temporally defined test set
prediction_date = datetime(2020, 9, 5)
# EDIT: limited runs - use for smoke test, else comment out
limit_test_batches = 100
# --- Metrics -----------------------------------------------------------------
# EDIT: define metrics to compute during testing
metrics = [
    MetricParams(alias="auroc", metric_name="AUROC", kwargs={"task": "binary"}),
    MetricParams(alias="recall", metric_name="Recall", kwargs={"task": "binary"}),
    MetricParams(alias="precision", metric_name="Precision", kwargs={"task": "binary"}),
]
# For more options refer to docs.
# Testing parameters: https://docs.basemodel.ai/reference/testingparams
# Testing metrics: https://docs.basemodel.ai/docs/customizing-testing-metrics
# --- Test --------------------------------------------------------------------
testing_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)
testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
    metrics=metrics,
    limit_test_batches=limit_test_batches, # smoke test, comment out for full runs
)
testing_module.test(testing_params=testing_params)
| Variable | What to set |
|---|---|
| scenario_checkpoint_dir | Path written by training, e.g. /basemodel/project/scenarios/<name> |
| prediction_date | Start of the test period |
| metrics | Metrics to compute (defaults: AUROC, Recall, Precision) |
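As a reminder of what the Precision and Recall defaults measure for a binary target, the sketch below computes both from a handful of scores and labels in plain Python. It is only an illustration: the test script computes its metrics through the library, and the 0.5 threshold and the function name precision_recall are assumptions made here.

```python
from typing import Sequence, Tuple


def precision_recall(scores: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Compute precision and recall for binary labels at a score threshold."""
    predicted = [score >= threshold for score in scores]
    # True positives: predicted positive and actually positive.
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For a churn model, precision is the share of flagged customers who really churned, and recall is the share of actual churners that the model flagged.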
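Once ready, run the test script with the Python interpreter inside the container, for example:
python /basemodel/configs/binary_classifier/02_test.py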
Inference — 03_predict.py
View the full inference script for binary classifier
from datetime import datetime
from pathlib import Path
from monad.ui.config import OutputType, TestingParams
from monad.ui.module import load_from_checkpoint
# --- Names & Paths -----------------------------------------------------------
# EDIT: provide path to scenario checkpoints directory saved during training
scenario_checkpoint_dir = Path("/basemodel/projects/project_dir/scenarios/scenario_name").resolve()
# EDIT: define where prediction outputs will be saved
predictions_path = scenario_checkpoint_dir / "predictions" / "predictions.tsv"
# --- Prediction Set Definition -----------------------------------------------
# EDIT: prediction_date defines the split point used for inference
prediction_date = datetime(2020, 9, 23)
# --- Predict -----------------------------------------------------------------
prediction_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)
testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
)
prediction_module.predict(testing_params=testing_params)
| Variable | What to set |
|---|---|
| scenario_checkpoint_dir | Same path as test |
| predictions_path | Where to save the output TSV file |
| prediction_date | The date to generate predictions for |
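The three scenario scripts derive their locations from the same project directory, so it can help to see the resulting layout in one place. The sketch below rebuilds the default paths used in the train, test, and predict scripts from project_dir and scenario_name; the function name scenario_paths and the dictionary keys are illustrative.

```python
from pathlib import Path


def scenario_paths(project_dir: Path, scenario_name: str) -> dict:
    """Rebuild the default paths used by the train, test, and predict scripts."""
    scenario_dir = project_dir / "scenarios" / scenario_name
    return {
        # Read by 01_train.py as the foundation model checkpoint location.
        "foundation_model": project_dir / "fm",
        # Written by 01_train.py, read by 02_test.py and 03_predict.py.
        "scenario_checkpoints": scenario_dir,
        # Default output of 02_test.py.
        "test_output": scenario_dir / "test" / "predictions_and_ground_truth.tsv",
        # Default output of 03_predict.py.
        "predictions": scenario_dir / "predictions" / "predictions.tsv",
    }
```

Keeping scenario_name identical across the three scripts is what lets the test and predict steps find the checkpoints written by training.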
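Once ready, run the inference script with the Python interpreter inside the container, for example:
python /basemodel/configs/binary_classifier/03_predict.py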