Select & Organize

Once your data sources are connected, the next step is to tell BaseModel which tables to use, how they relate to each other, and what subset of the data should enter training. This page covers source types, column and row filtering, joins, and the date range / split configuration.

Source Types

Every entry in data_sources has a type that determines how BaseModel treats the table.

| Type | Requirement | Shape | Description |
|---|---|---|---|
| event | Min. 1 required | Many rows per entity | Time-stamped behavioral data the model learns sequences from |
| main_entity_attribute | Optional | One row per entity | Static or slowly-changing entity properties (e.g. customer profile) |
| attribute | Optional | Dimension table | Enrichment data joined into events (e.g. product catalogue) |

Event source

The event source is the only mandatory type. It must include main_entity_column (the entity you model and predict for) and a date_column holding the event timestamp.

yaml
- type: event
  name: transactions
  main_entity_column: customer_id
  date_column:
    name: t_dat
    format: "%Y-%m-%d"
  data_location: ...

Main entity attribute

A single-row-per-entity table — automatically joined to events on main_entity_column. No explicit join needed.

yaml
- type: main_entity_attribute
  name: customers
  main_entity_column: customer_id
  data_location: ...

Attribute (dimension table)

Dimension tables enrich events with properties of other entities. They require an explicit join declared on the event source (see Joining Attribute Tables below).

yaml
- type: attribute
  name: articles
  data_location: ...

Entity IDs must be consistent across sources

The values in main_entity_column must use the same identifiers across all sources. If one table stores customer_id as "C-123" and another as 123, the join will silently produce no matches.
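
A quick overlap check before training can surface this kind of mismatch. The sketch below is illustrative only and is not part of BaseModel: the helper names normalize_id and id_overlap, and the "C-" prefix rule, are assumptions made for the example.

```python
def normalize_id(value):
    """Hypothetical normalization: cast to string, trim, drop a 'C-' prefix."""
    text = str(value).strip()
    return text[2:] if text.startswith("C-") else text


def id_overlap(left_ids, right_ids):
    """Return the fraction of distinct left_ids that also appear in right_ids."""
    left, right = set(left_ids), set(right_ids)
    if not left:
        return 0.0
    return len(left & right) / len(left)


event_ids = ["C-123", "C-456", "C-789"]   # identifiers as stored in the event table
profile_ids = [123, 456, 999]             # identifiers as stored in the profile table

raw_overlap = id_overlap(event_ids, profile_ids)
fixed_overlap = id_overlap(
    [normalize_id(v) for v in event_ids],
    [normalize_id(v) for v in profile_ids],
)
print(f"raw overlap: {raw_overlap:.2f}, after normalization: {fixed_overlap:.2f}")
```

An overlap near zero on the raw values, combined with a high overlap after normalization, suggests the identifiers need to be aligned at the source.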

Filtering Columns

Use allowed_columns or disallowed_columns on any source to control which columns enter training. Pick one — never use both on the same source.

yaml
# exclude specific columns (e.g., surrogate keys, potential leakage)
disallowed_columns: ["order_id"]

# or include only specific columns
allowed_columns: ["customer_id", "article_id", "t_dat", "price"]
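
The "pick one" rule is easy to check mechanically once the config has been loaded into dictionaries. A minimal sketch, assuming each source is a plain dict with the keys shown above; check_column_filters is a hypothetical helper, not a BaseModel function.

```python
def check_column_filters(source):
    """Return which filter a source uses; raise if both filters are declared."""
    has_allowed = "allowed_columns" in source
    has_disallowed = "disallowed_columns" in source
    if has_allowed and has_disallowed:
        raise ValueError(
            f"source {source.get('name')!r}: "
            "use allowed_columns or disallowed_columns, not both"
        )
    if has_allowed:
        return "allowed"
    if has_disallowed:
        return "disallowed"
    return "none"


ok_source = {"name": "transactions", "disallowed_columns": ["order_id"]}
bad_source = {
    "name": "articles",
    "allowed_columns": ["article_id"],
    "disallowed_columns": ["row_hash"],
}

print(check_column_filters(ok_source))
try:
    check_column_filters(bad_source)
except ValueError as err:
    print(err)
```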

Good candidates for exclusion:

  • Row-level unique IDs that carry no signal (e.g., order_id, row_hash)
  • Date columns other than the declared event timestamp — BaseModel does not embed them. If a table represents a sequence within a single row (e.g., booking → check-in → boarding), define separate event sources with each date as the respective date_column and disallow the others.
  • Duplicate columns that describe the same dimension (e.g., product_id / product_code / product_name): keep one and disallow the rest. BaseModel detects duplicates during fitting and logs a redundancy report, but it does not remove them automatically. The suggested_config.yaml generated in the output path already adds one column from each bijection pair to disallowed_columns and configures detected time-series columns as sql_lambdas.
  • Any column that could leak the prediction target
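
To make the "bijection pair" idea concrete: two columns form a bijection when each value of one always co-occurs with exactly one value of the other, and vice versa. The sketch below shows one way to test for that on a small sample. It is an illustration under that definition, not BaseModel's actual detection logic, and the function names are hypothetical.

```python
from itertools import combinations


def is_bijection(rows, col_a, col_b):
    """True if col_a and col_b map one-to-one in both directions across rows."""
    forward, backward = {}, {}
    for row in rows:
        a, b = row[col_a], row[col_b]
        # setdefault stores the first pairing and returns it on later visits
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


def bijection_pairs(rows, columns):
    """List every pair of columns that is redundant in the one-to-one sense."""
    return [(a, b) for a, b in combinations(columns, 2) if is_bijection(rows, a, b)]


sample_rows = [
    {"product_id": 1, "product_code": "A1", "price": 9.99},
    {"product_id": 2, "product_code": "B2", "price": 9.99},
    {"product_id": 1, "product_code": "A1", "price": 4.50},
]
pairs = bijection_pairs(sample_rows, ["product_id", "product_code", "price"])
print(pairs)
```

For each reported pair, keeping one column and adding the other to disallowed_columns removes the redundancy.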

Filtering Rows

Use where_condition to include only a subset of rows from a table. The expression is passed directly to the query engine — use the SQL syntax appropriate for your backend.

yaml
- type: event
  name: transactions
  where_condition: "Timestamp >= today() - 365"
  ...

Common use cases:

  • Sampling — include only a fraction of records for faster iteration.
  • Thresholds — keep only events above a value, e.g., "amount > 0".
  • Separating mixed event types: if a single table contains distinct behaviors (e.g., purchases and returns), split them into separate event sources using different where_condition values. Modeling each behavior as its own event source generally improves model quality.
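
When the event sources are generated programmatically, the split can be expressed as one base definition plus a condition per behavior. This is a sketch only: the event_type column and its values are hypothetical, the SQL must match your backend, and data_location is omitted because it depends on your setup.

```python
import copy

# Shared settings, using the keys from the event source example above.
base_source = {
    "type": "event",
    "main_entity_column": "customer_id",
    "date_column": {"name": "t_dat", "format": "%Y-%m-%d"},
}


def split_by_condition(base, conditions):
    """Build one event source per (name, where_condition) pair."""
    sources = []
    for name, condition in conditions.items():
        source = copy.deepcopy(base)  # avoid sharing the nested date_column dict
        source["name"] = name
        source["where_condition"] = condition
        sources.append(source)
    return sources


split_sources = split_by_condition(
    base_source,
    {
        "purchases": "event_type = 'purchase'",  # hypothetical column and value
        "returns": "event_type = 'return'",
    },
)
for source in split_sources:
    print(source["name"], "->", source["where_condition"])
```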

Joining Attribute Tables

To enrich events with dimension data, first define the attribute source, then reference it in the event's joined_data_sources. Each join is specified as pairs of [event_column, attribute_column].

yaml
# on the event source:
joined_data_sources:
  - name: articles       # must match the attribute source's name
    join_on:
      - [article_id, article_id]

  - name: stores
    join_on:
      - [store_id, store_id]
      - [format, format]   # multi-column join

The entire attribute table is joined unless further filtered by allowed_columns or disallowed_columns on the attribute source itself.
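
Because every join must reference a declared attribute source by name, a small pre-flight check can catch typos before a run. The sketch assumes the config is already loaded into a list of dicts; check_joins is a hypothetical helper, not part of BaseModel.

```python
def check_joins(data_sources):
    """Return a list of problems found in joined_data_sources declarations."""
    attribute_names = {s["name"] for s in data_sources if s["type"] == "attribute"}
    problems = []
    for source in data_sources:
        if source["type"] != "event":
            continue
        for join in source.get("joined_data_sources", []):
            if join["name"] not in attribute_names:
                problems.append(
                    f"{source['name']}: unknown attribute source {join['name']!r}"
                )
            for pair in join.get("join_on", []):
                # each entry should be [event_column, attribute_column]
                if len(pair) != 2:
                    problems.append(
                        f"{source['name']}: join_on entry {pair!r} is not a pair"
                    )
    return problems


config = [
    {"type": "attribute", "name": "articles"},
    {
        "type": "event",
        "name": "transactions",
        "joined_data_sources": [
            {"name": "articles", "join_on": [["article_id", "article_id"]]},
            {"name": "stores", "join_on": [["store_id", "store_id"], ["format"]]},
        ],
    },
]
join_problems = check_joins(config)
for problem in join_problems:
    print(problem)
```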

Main entity attributes auto-join

main_entity_attribute tables are joined automatically on the entity column — you never need to add them to joined_data_sources.

Date Range and Split

The data_params block controls which time window BaseModel reads and how the data is divided for training and validation.

Split types

| Type | How it splits | When to use |
|---|---|---|
| entity | Random percentage of entity IDs is held out for validation. | Default; good when entity count is large enough. Should always be used in production. |
| time | Training and validation sets separated by date boundary. | Best for experimentation: tests the model's ability to predict across time. Should not be used in production where you want the latest data to inform inference. |

Entity split example

yaml
data_params:
  data_start_date: "2018-09-20 00:00:00"

  split:
    type: entity
    training: 90
    validation: 10

    test:
      start_date: "2020-09-05 00:00:00"
      end_date: "2020-09-22 00:00:00"

    training_validation_end: "2020-09-04 00:00:00"

The training and validation keys accept any value from 0.01 to 99.99, representing a percentage of entities. Fractional percentages are supported: for example, 90.5 assigns 90.5% of entities to that split. The two values must sum to 100 or less.
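
The range and sum rules for the entity split can be checked in a few lines. A minimal sketch: check_entity_split is a hypothetical helper, and reporting the leftover percentage is an assumption about what "100 or less" implies, namely that the remainder is assigned to neither split.

```python
def check_entity_split(training, validation):
    """Validate entity split percentages and return the unassigned remainder."""
    for label, value in (("training", training), ("validation", validation)):
        if not 0.01 <= value <= 99.99:
            raise ValueError(f"{label} must be between 0.01 and 99.99, got {value}")
    if training + validation > 100:
        raise ValueError(
            f"training + validation must not exceed 100, got {training + validation}"
        )
    return 100 - training - validation


print(check_entity_split(90, 10))    # full coverage
print(check_entity_split(90.5, 5))   # fractional value, some entities unassigned
```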

Temporal split example

yaml
data_params:
  data_start_date: "2018-09-20 00:00:00"

  split:
    type: time
    training:
      start_date: "2018-09-20 00:00:00"
    validation:
      start_date: "2020-06-01 00:00:00"

    test:
      start_date: "2020-09-05 00:00:00"
      end_date: "2020-09-22 00:00:00"
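
The date boundaries in this example partition the timeline into three windows. The sketch below shows one plausible reading of them. The boundary handling is an assumption, not documented behavior: it treats validation as running up to the test start and treats the test window as inclusive at both ends.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
TRAIN_START = datetime.strptime("2018-09-20 00:00:00", FMT)
VALID_START = datetime.strptime("2020-06-01 00:00:00", FMT)
TEST_START = datetime.strptime("2020-09-05 00:00:00", FMT)
TEST_END = datetime.strptime("2020-09-22 00:00:00", FMT)


def assign_split(timestamp):
    """Return the window an event falls into, or None if it is outside all of them."""
    if TEST_START <= timestamp <= TEST_END:
        return "test"
    if VALID_START <= timestamp < TEST_START:
        return "validation"
    if TRAIN_START <= timestamp < VALID_START:
        return "training"
    return None


for day in ("2019-01-01 00:00:00", "2020-07-01 00:00:00", "2020-09-10 00:00:00"):
    print(day, "->", assign_split(datetime.strptime(day, FMT)))
```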

Test window

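The optional test block holds out a time window that is never seen during training or validation. Set training_validation_end to the day before test.start_date to ensure clean separation. In the entity split example above, training_validation_end is 2020-09-04 and the test window starts on 2020-09-05.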
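
Deriving the boundary from test.start_date avoids off-by-one mistakes around month ends. A minimal sketch using the timestamp format from the examples; day_before is a hypothetical helper name.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def day_before(start_date):
    """Return the timestamp exactly one day before start_date, in the same format."""
    return (datetime.strptime(start_date, FMT) - timedelta(days=1)).strftime(FMT)


test_start = "2020-09-05 00:00:00"
training_validation_end = day_before(test_start)
print(training_validation_end)
```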

Extra Columns

Some columns are discarded by BaseModel as features but are still needed inside a target function, e.g. order IDs, flags, raw prices, or metadata. Declare them in data_params.extra_columns so they survive the fit stage and remain accessible during scenario training.

yaml
data_params:
  extra_columns:
    # plain column names — loaded as-is from the source
    - data_source_name: transactions
      columns:
        - order_id
        - return_flag

    # SQL lambda — compute a derived column on the fly
    - data_source_name: transactions
      columns:
        - alias: price_float
          expression: "CAST({{ resolve_fn('price', data_sources_path=['products']) }} AS FLOAT)"

    # works on non-event sources too (here, the customers table)
    - data_source_name: customers
      columns:
        - is_test_account

Each entry names a data_source_name (must match a source defined in data_sources) and a columns list. Columns can be:

  • Plain strings — the column is fetched unchanged from the source.
  • SQL lambdas (alias + expression) — a derived column evaluated at query time, following the same syntax as sql_lambdas on data sources.
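
The two accepted column shapes, and the requirement that data_source_name match a declared source, can be checked before a run. The sketch assumes the extra_columns block is already loaded into Python structures; check_extra_columns is a hypothetical helper, not a BaseModel function.

```python
def check_extra_columns(extra_columns, source_names):
    """Return a list of problems found in an extra_columns declaration."""
    problems = []
    for entry in extra_columns:
        name = entry.get("data_source_name")
        if name not in source_names:
            problems.append(f"unknown data source {name!r}")
        for column in entry.get("columns", []):
            if isinstance(column, str):
                continue  # plain column name
            is_lambda = isinstance(column, dict) and {"alias", "expression"} <= column.keys()
            if not is_lambda:
                problems.append(f"{name}: column {column!r} needs alias and expression")
    return problems


declared_sources = {"transactions", "customers"}
extra_columns = [
    {"data_source_name": "transactions", "columns": ["order_id", "return_flag"]},
    {
        "data_source_name": "transactions",
        "columns": [{"alias": "price_float", "expression": "CAST(price AS FLOAT)"}],
    },
    {"data_source_name": "orders", "columns": [{"alias": "missing_expression"}]},
]
extra_problems = check_extra_columns(extra_columns, declared_sources)
for problem in extra_problems:
    print(problem)
```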

Minimize extra columns for performance

Only declare columns you actually reference in your target function. Every extra column adds query and memory overhead without contributing to the model's learned features.

See Target Function → Extra Columns for how to access them via .extra in your target code.