Basic Configuration
A foundation model config is a single YAML file with three blocks: where your data lives, what date ranges and splits to use, and how to train. This page walks through the minimum you need to launch a first training run.
Structure at a Glance
data_sources:      # 1. Where your data lives and how tables relate
  - ...
data_params:       # 2. Date range and train / validation split
  ...
training_params:   # 3. Training controls (start minimal, tune later)
  ...
Data Sources
The data_sources list tells BaseModel which tables to read and how they connect. You must include at least one event source containing multiple time-stamped rows per entity. Entity attributes and dimension tables are optional but recommended.
data_sources:
  # events (minimum 1 table mandatory)
  - type: event
    # name your table for BaseModel reference
    name: transactions
    data_location:
      database_type: parquet
      connection_params:
        # parquet file path
        path: "/path/to/data_dir/transactions_train.parquet"
        # database cache; keep it local to your project/workdir
        cache_path: "/basemodel/db_cache/"
      # database table reference
      table_name: transactions
    # entity for which we model and predict;
    # multiple time-stamped event rows per entity id are expected
    main_entity_column: customer_id
    # event time column (required for event sources)
    date_column:
      name: t_dat
      format: "%Y-%m-%d"
    # optional: drop columns early via allowed / disallowed columns
    # (helps memory, may prevent leakage)
    disallowed_columns: ["order_id"] # unique ID per event, no signal
    # allowed_columns: ["customer_id", "article_id", "t_dat", "price"]
    # attribute tables require explicit joins on each relevant event table
    joined_data_sources:
      - name: articles # name of the attribute data source
        join_on:
          - [article_id, article_id] # [event_column, attribute_column]
  # main entity attributes (optional)
  # expected: one row per entity id (e.g., customer profile)
  - type: main_entity_attribute
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/customers.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: customers
    main_entity_column: customer_id
  # attributes (optional)
  # expected: dimension tables used in joins into events
  - type: attribute
    name: articles
    allowed_columns:
      - product_type_name
      - product_group_name
      - department_name
      - section_name
      - colour_group_name
      - perceived_colour_master_name
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/articles.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: articles
Source Types
| Type | Details |
|---|---|
| event | Min. 1 required. Many rows per entity. Time-stamped behavioral data the model learns sequences from. |
| main_entity_attribute | Optional. One row per entity. Static or slowly changing entity properties (e.g. customer profile). |
| attribute | Optional. Dimension table. Enrichment data joined into events (e.g. product catalogue). |
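The requirements in this table can be checked before launching a run. The following is a minimal sketch, assuming the YAML has already been parsed into a plain Python dict (for example with a YAML library of your choice). `check_sources` is a hypothetical helper written for illustration; it is not part of BaseModel.

```python
def check_sources(config: dict) -> list[str]:
    """Return a list of problems found in the data_sources block."""
    problems = []
    sources = config.get("data_sources") or []

    # At least one event source is mandatory.
    events = [s for s in sources if s.get("type") == "event"]
    if not events:
        problems.append("at least one source of type 'event' is required")

    # Every event source needs an event time column.
    for src in events:
        name = src.get("name", "<unnamed>")
        date_column = src.get("date_column") or {}
        if not date_column.get("name"):
            problems.append(f"event source '{name}' has no date_column.name")
    return problems


example = {
    "data_sources": [
        {
            "type": "event",
            "name": "transactions",
            "date_column": {"name": "t_dat", "format": "%Y-%m-%d"},
        },
        {"type": "main_entity_attribute", "name": "customers"},
    ]
}
print(check_sources(example))  # prints []
```

An empty list means the two rules checked here hold. The sketch does not validate paths, joins, or anything else BaseModel itself may enforce.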
Column Filtering
Use disallowed_columns or allowed_columns (never both) to control which columns are included in training. Typical candidates for exclusion include surrogate keys, row-level IDs, and any columns that could lead to target leakage.
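The "never both" rule is easy to violate when a config grows. The sketch below flags any source that sets both keys. It assumes the `data_sources` list has already been loaded into Python dicts, and `check_column_filters` is a hypothetical helper, not a BaseModel function.

```python
def check_column_filters(sources: list[dict]) -> list[str]:
    """Return names of sources that set both allowed_columns and disallowed_columns."""
    return [
        src.get("name", "<unnamed>")
        for src in sources
        if "allowed_columns" in src and "disallowed_columns" in src
    ]


sources = [
    {"name": "transactions", "disallowed_columns": ["order_id"]},
    {"name": "articles", "allowed_columns": ["product_type_name"]},
    {"name": "broken", "allowed_columns": ["a"], "disallowed_columns": ["b"]},
]
print(check_column_filters(sources))  # prints ['broken']
```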
Data Params
data_params defines the date window BaseModel reads and how data is split for training and validation.
data_params:
  # earliest timestamp used for events considered in training/validation
  data_start_date: "2018-09-20 00:00:00"
  # how we split into training and validation;
  # here we use 10% entity hold-out
  split:
    type: entity
    training: 90
    validation: 10
  # optional: hold-out test window (later than training/validation)
  test:
    start_date: "2020-09-05 00:00:00"
    end_date: "2020-09-22 00:00:00"
  training_validation_end: "2020-09-04 00:00:00"
Split Types
| Type | Details |
|---|---|
| entity | A random percentage of entity IDs is held out for validation. Default; a good choice when the entity count is large enough. Should always be used in production. |
| time | Training and validation sets are separated by a date boundary. Best for experimentation, since it tests the model's ability to predict across time. Should not be used in production, where you want the latest data to inform inference. |
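To make the entity split concrete, here is an illustrative sketch of a random 90/10 hold-out by entity ID. It shows only the general idea; it is not BaseModel's implementation, and the function name, seed handling, and rounding are assumptions made for the example.

```python
import random


def split_entities(entity_ids, validation_pct=10, seed=0):
    """Randomly hold out validation_pct percent of distinct entity IDs."""
    ids = sorted(set(entity_ids))          # de-duplicate, fix a stable order
    random.Random(seed).shuffle(ids)       # seeded shuffle for reproducibility
    n_val = round(len(ids) * validation_pct / 100)
    return set(ids[n_val:]), set(ids[:n_val])  # (training IDs, validation IDs)


train_ids, val_ids = split_entities(range(1000), validation_pct=10)
print(len(train_ids), len(val_ids))  # prints 900 100
```

Because the split is made on entity IDs rather than on rows, all events of a given entity land on the same side, and events from every date can still be used for training.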
Ensure clean data split boundaries
Set training_validation_end to the day before your test.start_date to ensure clean separation between training data and the test window.
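The boundary can be derived rather than typed by hand. A minimal sketch, assuming timestamps use the `YYYY-MM-DD HH:MM:SS` layout shown in the examples on this page; `day_before` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the examples on this page (an assumption about your config).
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def day_before(timestamp: str) -> str:
    """Return the timestamp exactly one day earlier, in the same layout."""
    parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return (parsed - timedelta(days=1)).strftime(TIMESTAMP_FORMAT)


test_start_date = "2020-09-05 00:00:00"
print(day_before(test_start_date))  # prints 2020-09-04 00:00:00
```

The printed value matches the `training_validation_end` used in the example config.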
Training Params
For a first run, keep training_params minimal. The goal is to verify the pipeline end-to-end before committing to a full training cycle.
training_params:
  # limit batches to validate the setup quickly,
  # then remove for a real run
  limit_train_batches: 5
  limit_val_batches: 5
This smoke-test config processes only 5 training batches and 5 validation batches, which is enough to confirm that data loads, the model initializes, and outputs are written correctly. Once everything checks out, remove both limits and optionally add the controls covered in Training Controls.
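If you keep a single config and switch between smoke test and full run, the two limits can be stripped programmatically. The following is a sketch under the assumption that the config is held as a Python dict; `without_smoke_limits` is a hypothetical helper, not a BaseModel API.

```python
import copy

SMOKE_TEST_KEYS = ("limit_train_batches", "limit_val_batches")


def without_smoke_limits(config: dict) -> dict:
    """Return a copy of the config with the smoke-test batch limits removed."""
    full = copy.deepcopy(config)  # leave the caller's config untouched
    training = full.get("training_params")
    if isinstance(training, dict):
        for key in SMOKE_TEST_KEYS:
            training.pop(key, None)
    return full


smoke = {"training_params": {"limit_train_batches": 5, "limit_val_batches": 5}}
print(without_smoke_limits(smoke))  # prints {'training_params': {}}
```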
Complete Example
The onboarding package includes a ready-to-use config file — copy it as a starting point and replace paths and column names with your own.
# Generic Foundation Model Config — parquet
#
# How to use:
# - select the file for your data engine (this one: parquet)
# - start by configuring at least one event data source (mandatory)
# - optionally add main entity attributes (one row per entity) and attribute tables (dimensions) for joins
# - add necessary configurations for date ranges, training / validation split etc.
#
# Helpful docs (only use if stuck):
# - Parquet data sources: https://docs.basemodel.ai/docs/parquet-data-sources
# - Data sources overview: https://docs.basemodel.ai/docs/connection-to-the-data-sources
# ------------------------------------------------------------------------------
# 1) data_sources block
# - events are mandatory
# - main entity attributes and attribute tables are optional
# ------------------------------------------------------------------------------
data_sources:
  # events (minimum 1 table mandatory)
  - type: event
    # name your table for BaseModel reference
    name: transactions
    data_location:
      database_type: parquet
      connection_params:
        # parquet file path
        path: "/path/to/data_dir/transactions_train.parquet"
        # database cache; keep it local to your project/workdir
        cache_path: "/basemodel/db_cache/"
      # database table reference
      table_name: transactions
    # entity for which we model and predict; multiple time-stamped event rows per entity id are expected
    main_entity_column: customer_id
    # event time column (required for event sources)
    date_column:
      name: t_dat
      format: "%Y-%m-%d"
    # optional: drop columns early via allowed / disallowed columns (helps memory, may prevent leakage)
    disallowed_columns: ["order_id"] # unique ID per event, no signal
    # allowed_columns: ["customer_id", "article_id", "t_dat", "sales_channel_id", "price"]
    # check: attribute tables require explicit joins on each relevant event table
    # docs: https://docs.basemodel.ai/docs/joins
    joined_data_sources:
      - name: articles # name of the attribute data source
        join_on:
          - [article_id, article_id] # [event_column, attribute_column]
  # main entity attributes (optional)
  # expected: one row per entity id (e.g., customer profile)
  - type: main_entity_attribute
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/customers.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: customers
    main_entity_column: customer_id
  # attributes (optional)
  # expected: dimension tables used in joins into events
  - type: attribute
    name: articles
    allowed_columns: ["product_type_name", "product_group_name", "department_name", "section_name", "colour_group_name", "perceived_colour_master_name"]
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/articles.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: articles
# ------------------------------------------------------------------------------
# 2) data_params block
# ------------------------------------------------------------------------------
data_params:
  # earliest timestamp used for events considered in training/validation
  data_start_date: "2018-09-20 00:00:00"
  # how we split into training and validation; here we use 10% entity hold-out
  split:
    type: entity
    training: 90
    validation: 10
  # optional: hold-out test window (later than training/validation)
  # docs: https://docs.basemodel.ai/docs/controlling-data
  test:
    start_date: "2020-09-05 00:00:00"
    end_date: "2020-09-22 00:00:00"
  training_validation_end: "2020-09-04 00:00:00"
# ------------------------------------------------------------------------------
# 3) training_params block (here only to be used for a smoke test)
# ------------------------------------------------------------------------------
training_params:
  # limit batches to validate the setup quickly, then remove for a real run
  limit_train_batches: 5
  limit_val_batches: 5
# ------------------------------------------------------------------------------
# Optional blocks you may add later (kept out of quickstart on purpose)
#
# - data_loader_params (batching / workers):
# docs: https://docs.basemodel.ai/docs/data-loading-configuration
#
# - memory_constraining_params (memory / model size constraints):
# - query_optimization (query parallelization)
# docs: https://docs.basemodel.ai/docs/controlling-space-and-memory
# ------------------------------------------------------------------------------
# ------------------------------------------------------------------------------
# Other configuration options:
# - Model training configuration (metaparams, multi-GPU, precision, early stopping, checkpointing etc.):
# docs: https://docs.basemodel.ai/docs/model-training-configuration
# - Data configurations (split point rules, sampling, weighting etc.):
# docs: https://docs.basemodel.ai/docs/data-configurations
# - Data transformations (filtering, grouping, lambdas, data type overrides):
# docs: https://docs.basemodel.ai/docs/data-transformations-advanced
# ------------------------------------------------------------------------------