Basic Configuration

A foundation model config is a single YAML file with three blocks: where your data lives, what date ranges and splits to use, and how to train. This page walks through the minimum you need to launch a first training run.

Structure at a Glance

yaml
data_sources:      # 1. Where your data lives and how tables relate
  - ...

data_params:       # 2. Date range and train / validation split
  ...

training_params:   # 3. Training controls (start minimal, tune later)
  ...

Data Sources

The data_sources list tells BaseModel which tables to read and how they connect. You must include at least one event source containing multiple time-stamped rows per entity. Entity attributes and dimension tables are optional but recommended.

yaml
data_sources:

  # events (minimum 1 table mandatory)
  - type: event
    # name your table for BaseModel reference
    name: transactions
    data_location:
      database_type: parquet
      connection_params:
        # parquet file path
        path: "/path/to/data_dir/transactions_train.parquet"
        # database cache; keep it local to your project/workdir
        cache_path: "/basemodel/db_cache/"
      # database table reference
      table_name: transactions
    # entity for which we model and predict;
    # multiple time-stamped event rows per entity id are expected
    main_entity_column: customer_id
    # event time column (required for event sources)
    date_column:
      name: t_dat
      format: "%Y-%m-%d"
    # optional: drop columns early via allowed / disallowed columns
    # (helps memory, may prevent leakage)
    disallowed_columns: ["order_id"]  # unique ID per event, no signal
    # allowed_columns: ["customer_id", "article_id", "t_dat", "price"]

    # attribute tables require explicit joins on each relevant event table
    joined_data_sources:
      - name: articles  # name of the attribute data source
        join_on:
          - [article_id, article_id]  # [event_column, attribute_column]

  # main entity attributes (optional)
  # expected: one row per entity id (e.g., customer profile)
  - type: main_entity_attribute
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/customers.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: customers
    main_entity_column: customer_id

  # attributes (optional)
  # expected: dimension tables used in joins into events
  - type: attribute
    name: articles
    allowed_columns:
      - product_type_name
      - product_group_name
      - department_name
      - section_name
      - colour_group_name
      - perceived_colour_master_name
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/articles.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: articles
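
Before pointing a config at a new table, it can help to confirm that the date_column format really matches the data. The sketch below assumes the format string follows Python's strptime directives, which the "%Y-%m-%d" pattern above suggests but this page does not state. `unparseable_dates` is a hypothetical helper, not part of BaseModel, and the sample values are made up.

```python
from datetime import datetime


def unparseable_dates(values, fmt):
    """Return the values that do not match the given strptime-style format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad


# Sample values as they might appear in the t_dat column (illustrative only).
sample = ["2018-09-20", "2020-09-22", "22/09/2020"]
bad_values = unparseable_dates(sample, "%Y-%m-%d")
print(bad_values)
```

Running a check like this on a small sample of the event table is cheaper than discovering a format mismatch during a training run.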

Source Types

| Type | Details |
| --- | --- |
| event | Min. 1 required. Many rows per entity. Time-stamped behavioral data the model learns sequences from. |
| main_entity_attribute | Optional. One row per entity. Static or slowly-changing entity properties (e.g. customer profile). |
| attribute | Optional. Dimension table. Enrichment data joined into events (e.g. product catalogue). |
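
The rules for source types can be checked on a parsed config before launching anything. The following is a minimal sketch, assuming the YAML has already been loaded into Python lists and dicts. `check_data_sources` is a hypothetical helper rather than BaseModel's own validation, and the rule that every joined_data_sources name must match a defined source is an assumption inferred from the example above.

```python
def check_data_sources(data_sources):
    """Return a list of problems found in a parsed data_sources list."""
    problems = []
    names = {src.get("name") for src in data_sources}
    events = [src for src in data_sources if src.get("type") == "event"]

    # At least one event source is mandatory.
    if not events:
        problems.append("at least one source of type 'event' is required")

    for src in events:
        # Event sources need an event time column.
        if "date_column" not in src:
            problems.append(f"event source '{src.get('name')}' has no date_column")
        # Assumption: a join must point at a source defined in the same config.
        for join in src.get("joined_data_sources", []):
            if join["name"] not in names:
                problems.append(
                    f"'{src.get('name')}' joins unknown source '{join['name']}'"
                )
    return problems


# Trimmed-down version of the example config, already parsed into Python objects.
sources = [
    {
        "type": "event",
        "name": "transactions",
        "main_entity_column": "customer_id",
        "date_column": {"name": "t_dat", "format": "%Y-%m-%d"},
        "joined_data_sources": [
            {"name": "articles", "join_on": [["article_id", "article_id"]]}
        ],
    },
    {"type": "main_entity_attribute", "name": "customers"},
    {"type": "attribute", "name": "articles"},
]

print(check_data_sources(sources))      # full example config
print(check_data_sources(sources[1:]))  # same config with the event source removed
```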

Column Filtering

Use disallowed_columns or allowed_columns (never both) to control which columns are included in training. Typical candidates for exclusion include surrogate keys, row-level IDs, and any columns that could lead to target leakage.
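
To make the difference between the two options concrete, here is a small sketch of allow-list and deny-list filtering, including the "never both" rule. `resolve_columns` is a hypothetical function written for illustration. How BaseModel itself treats key columns such as the entity ID under an allow-list is not covered on this page, so the sketch simply keeps whatever is listed.

```python
def resolve_columns(table_columns, allowed=None, disallowed=None):
    """Return the columns that survive allow-list or deny-list filtering."""
    if allowed is not None and disallowed is not None:
        raise ValueError("use allowed_columns or disallowed_columns, not both")
    if allowed is not None:
        # Allow-list: keep only the named columns, in table order.
        return [c for c in table_columns if c in allowed]
    if disallowed is not None:
        # Deny-list: keep everything except the named columns.
        return [c for c in table_columns if c not in disallowed]
    return list(table_columns)


columns = ["customer_id", "article_id", "t_dat", "price", "order_id"]

kept_by_denylist = resolve_columns(columns, disallowed=["order_id"])
kept_by_allowlist = resolve_columns(
    columns, allowed=["customer_id", "article_id", "t_dat", "price"]
)
print(kept_by_denylist)
print(kept_by_allowlist)
```

A deny-list is shorter when only a few columns need to go. An allow-list is safer when new columns may appear upstream, since they stay excluded until listed.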

Data Params

data_params defines the date window BaseModel reads and how data is split for training and validation.

yaml
data_params:
  # earliest timestamp used for events considered in training/validation
  data_start_date: "2018-09-20 00:00:00"

  # how we split into training and validation;
  # here we use 10% entity hold-out
  split:
    type: entity
    training: 90
    validation: 10

    # optional: hold-out test window (later than training/validation)
    test:
      start_date: "2020-09-05 00:00:00"
      end_date: "2020-09-22 00:00:00"

    training_validation_end: "2020-09-04 00:00:00"

Split Types

| Type | Details |
| --- | --- |
| entity | Random percentage of entity IDs is held out for validation. Default; good when entity count is large enough. Should always be used in production. |
| time | Training and validation sets separated by date boundary. Best for experimentation: tests the model's ability to predict across time. Should not be used in production where you want the latest data to inform inference. |
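
The two split types partition the data along different axes, which the sketch below illustrates. It is not BaseModel's implementation, which this page does not describe. The hash-based bucketing, the function names, and the boundary date are all assumptions chosen to show the idea.

```python
import hashlib
from datetime import datetime


def entity_split(entity_id, validation_pct=10):
    """Assign a whole entity to 'validation' or 'training' by hashing its ID."""
    digest = hashlib.sha256(str(entity_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable pseudo-random bucket in 0..99
    return "validation" if bucket < validation_pct else "training"


def time_split(event_time, boundary):
    """Assign an event to 'training' before the boundary, else 'validation'."""
    return "training" if event_time < boundary else "validation"


events = [
    ("c1", datetime(2019, 3, 1)),
    ("c1", datetime(2020, 8, 15)),
    ("c2", datetime(2020, 8, 20)),
]
boundary = datetime(2020, 8, 1)  # hypothetical date boundary

for customer_id, ts in events:
    print(customer_id, ts.date(), entity_split(customer_id), time_split(ts, boundary))
```

With an entity split, every event of a given customer lands on the same side, so training still sees the most recent dates. With a time split, a single customer's history can straddle the boundary.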

Ensure clean data split boundaries

Set training_validation_end to the day before your test.start_date to ensure clean separation between training data and the test window.
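
The boundary rule is easy to verify mechanically. The sketch below parses the two timestamps in the format shown in the example config and checks that they are exactly one day apart. `boundary_is_clean` is a hypothetical helper, not a BaseModel function.

```python
from datetime import datetime, timedelta

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches the timestamps shown in data_params


def boundary_is_clean(training_validation_end, test_start_date):
    """True when training_validation_end is exactly one day before test.start_date."""
    end = datetime.strptime(training_validation_end, TIMESTAMP_FORMAT)
    start = datetime.strptime(test_start_date, TIMESTAMP_FORMAT)
    return start - end == timedelta(days=1)


# Values from the example config.
clean = boundary_is_clean("2020-09-04 00:00:00", "2020-09-05 00:00:00")
# Same date on both sides: the training window would touch the test window.
overlapping = boundary_is_clean("2020-09-05 00:00:00", "2020-09-05 00:00:00")
print(clean, overlapping)
```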

Training Params

For a first run, keep training_params minimal. The goal is to verify the pipeline end-to-end before committing to a full training cycle.

yaml
training_params:
  # limit batches to validate the setup quickly,
  # then remove for a real run
  limit_train_batches: 5
  limit_val_batches: 5

This smoke-test config processes only 5 training batches and 5 validation batches, enough to confirm that data loads, the model initializes, and outputs are written correctly. Once everything checks out, remove both limits and optionally add the controls covered in Training Controls.
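
If the same config is reused for the smoke test and the real run, the limits can be stripped programmatically instead of by hand. This is a minimal sketch operating on an already-parsed config dict. `without_smoke_limits` is a hypothetical helper, and whether BaseModel accepts an empty training_params block is not stated here, so treat the result as a starting point for adding real training controls.

```python
import copy

SMOKE_TEST_KEYS = ("limit_train_batches", "limit_val_batches")


def without_smoke_limits(config):
    """Return a copy of the config with the smoke-test batch limits removed."""
    full = copy.deepcopy(config)  # leave the caller's config untouched
    training = full.get("training_params") or {}
    for key in SMOKE_TEST_KEYS:
        training.pop(key, None)
    full["training_params"] = training
    return full


smoke_config = {"training_params": {"limit_train_batches": 5, "limit_val_batches": 5}}
full_config = without_smoke_limits(smoke_config)
print(full_config)
```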

Complete Example

The onboarding package includes a ready-to-use config file. Copy it as a starting point and replace the paths and column names with your own.

yaml
# Generic Foundation Model Config — parquet
#
# How to use:
# - select the file for your data engine (this one: parquet)
# - start by configuring at least one event data source (mandatory)
# - optionally add main entity attributes (one row per entity) and attribute tables (dimensions) for joins
# - add necessary configurations for date ranges, training / validation split etc.
#
# Helpful docs (only use if stuck):
# - Parquet data sources: https://docs.basemodel.ai/docs/parquet-data-sources
# - Data sources overview: https://docs.basemodel.ai/docs/connection-to-the-data-sources

# ------------------------------------------------------------------------------
# 1) data_sources block
# - events are mandatory
# - main entity attributes and attribute tables are optional
# ------------------------------------------------------------------------------

data_sources:

  # events (minimum 1 table mandatory)
  - type: event
    # name your table for BaseModel reference
    name: transactions
    data_location:
      database_type: parquet
      connection_params:
        # parquet file path
        path: "/path/to/data_dir/transactions_train.parquet"
        # database cache; keep it local to your project/workdir
        cache_path: "/basemodel/db_cache/"
      # database table reference
      table_name: transactions
    # entity for which we model and predict; multiple time-stamped event rows per entity id are expected
    main_entity_column: customer_id
    # event time column (required for event sources)
    date_column:
      name: t_dat
      format: "%Y-%m-%d"
    # optional: drop columns early via allowed / disallowed columns (helps memory, may prevent leakage)
    disallowed_columns: ["order_id"] # unique ID per event, no signal
    # allowed_columns: ["customer_id", "article_id", "t_dat", "sales_channel_id", "price"]
    # check: attribute tables require explicit joins on each relevant event table 
    # docs: https://docs.basemodel.ai/docs/joins
    joined_data_sources:
      - name: articles # name of the attribute data source
        join_on:
          - [article_id, article_id] # [event_column, attribute_column]

  # main entity attributes (optional)
  # expected: one row per entity id (e.g., customer profile)
  - type: main_entity_attribute
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/customers.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: customers
    main_entity_column: customer_id

  # attributes (optional)
  # expected: dimension tables used in joins into events
  - type: attribute
    name: articles
    allowed_columns: ["product_type_name", "product_group_name", "department_name", "section_name", "colour_group_name", "perceived_colour_master_name"]
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/data_dir/articles.parquet"
        cache_path: "/basemodel/db_cache/"
      table_name: articles

# ------------------------------------------------------------------------------
# 2) data_params block
# ------------------------------------------------------------------------------

data_params:
  # earliest timestamp used for events considered in training/validation
  data_start_date: "2018-09-20 00:00:00"

  # how we split into training and validation; here we use 10% entity hold-out
  split:
    type: entity
    training: 90
    validation: 10

    # optional: hold-out test window (later than training/validation)
    # docs: https://docs.basemodel.ai/docs/controlling-data
    test:
      start_date: "2020-09-05 00:00:00"
      end_date: "2020-09-22 00:00:00"

    training_validation_end: "2020-09-04 00:00:00"


# ------------------------------------------------------------------------------
# 3) training_params block (here only to be used for a smoke test)
# ------------------------------------------------------------------------------

training_params:
  # limit batches to validate the setup quickly, then remove for a real run
  limit_train_batches: 5
  limit_val_batches: 5


# ------------------------------------------------------------------------------
# Optional blocks you may add later (kept out of quickstart on purpose)
#
# - data_loader_params (batching / workers):
#   docs: https://docs.basemodel.ai/docs/data-loading-configuration
#
# - memory_constraining_params (memory / model size constraints):
# - query_optimization (query parallelization)
#   docs: https://docs.basemodel.ai/docs/controlling-space-and-memory
# ------------------------------------------------------------------------------

# ------------------------------------------------------------------------------
# Other configuration options:
# - Model training configuration (metaparams, multi-GPU, precision, early stopping, checkpointing etc.): 
#   docs: https://docs.basemodel.ai/docs/model-training-configuration
# - Data configurations (split point rules, sampling, weighting etc.):
#   docs: https://docs.basemodel.ai/docs/data-configurations
# - Data transformations (filtering, grouping, lambdas, data type overrides): 
#   docs: https://docs.basemodel.ai/docs/data-transformations-advanced
# ------------------------------------------------------------------------------