Select & Organize
Once your data sources are connected, the next step is to tell BaseModel which tables to use, how they relate to each other, and what subset of the data should enter training. This page covers source types, column and row filtering, joins, and the date range / split configuration.
Source Types
Every entry in data_sources has a type that determines how BaseModel treats the table.
| Type | Details |
|---|---|
| event | Min. 1 required. Many rows per entity. Time-stamped behavioral data the model learns sequences from. |
| main_entity_attribute | Optional. One row per entity. Static or slowly-changing entity properties (e.g. customer profile). |
| attribute | Optional. Dimension table. Enrichment data joined into events (e.g. product catalogue). |
Event source
The only mandatory type. Must include main_entity_column (the entity you model and predict for) and a date_column with the event timestamp.
- type: event
  name: transactions
  main_entity_column: customer_id
  date_column:
    name: t_dat
    format: "%Y-%m-%d"
  data_location: ...
Main entity attribute
A single-row-per-entity table — automatically joined to events on main_entity_column. No explicit join needed.
- type: main_entity_attribute
  name: customers
  main_entity_column: customer_id
  data_location: ...
Attribute (dimension table)
Dimension tables that enrich events with properties of other entities. Require an explicit join on the event source (see Joining Attribute Tables below).
Entity IDs must be consistent across sources
The values in main_entity_column must use the same identifiers across all sources. If one table stores customer_id as "C-123" and another as 123, the join will silently produce no matches.
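A quick pre-flight check can catch this kind of mismatch before training. The sketch below is not part of BaseModel: it assumes you can pull a sample of ID values from each source into plain Python lists, and the helper name `id_overlap` is hypothetical.

```python
def id_overlap(left_ids, right_ids):
    """Return the share of distinct left IDs that also appear in right_ids.

    Values are compared exactly as stored, so "C-123" and 123 do not match.
    """
    left = set(left_ids)
    right = set(right_ids)
    if not left:
        return 0.0
    return len(left & right) / len(left)


# Hypothetical samples: one source stores prefixed strings, the other integers.
transactions_ids = ["C-123", "C-456", "C-789"]
customers_ids = [123, 456, 789]

print(id_overlap(transactions_ids, customers_ids))  # 0.0, the formats differ

# After normalizing both sides to the same representation, the IDs line up.
normalized_ids = [f"C-{value}" for value in customers_ids]
print(id_overlap(transactions_ids, normalized_ids))  # 1.0
```

An overlap near zero between two sources that should share entities is a strong hint that the identifiers need normalizing upstream.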
Filtering Columns
Use allowed_columns or disallowed_columns on any source to control which columns enter training. Pick one — never use both on the same source.
# exclude specific columns (e.g., surrogate keys, potential leakage)
disallowed_columns: ["order_id"]
# or include only specific columns
allowed_columns: ["customer_id", "article_id", "t_dat", "price"]
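To make the two options concrete, here is a minimal Python sketch of how they relate. It only illustrates the semantics described above and is not BaseModel's implementation; the function name `effective_columns` is hypothetical.

```python
def effective_columns(all_columns, allowed=None, disallowed=None):
    """Return the columns that remain after applying one of the two filters."""
    if allowed is not None and disallowed is not None:
        raise ValueError("use allowed_columns or disallowed_columns, not both")
    if allowed is not None:
        # Keep only the listed columns, preserving the table's column order.
        return [c for c in all_columns if c in allowed]
    if disallowed is not None:
        # Keep everything except the listed columns.
        return [c for c in all_columns if c not in disallowed]
    return list(all_columns)


columns = ["customer_id", "article_id", "t_dat", "price", "order_id"]

by_exclusion = effective_columns(columns, disallowed=["order_id"])
by_inclusion = effective_columns(
    columns, allowed=["customer_id", "article_id", "t_dat", "price"]
)
print(by_exclusion == by_inclusion)  # True, both keep the same four columns
```

For this table the two settings are equivalent; which one to use depends on whether it is easier to list what to keep or what to drop.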
Good candidates for exclusion:
- Row-level unique IDs that carry no signal (e.g., order_id, row_hash)
- Date columns other than the declared event timestamp; BaseModel does not embed them. If a table represents a sequence within a single row (e.g., booking → check-in → boarding), define separate event sources with each date as the respective date_column and disallow the others.
- Duplicate columns that describe the same dimension (e.g., product_id / product_code / product_name): keep one, disallow the rest. BaseModel detects duplicates during fitting and logs a redundancy report. It will not remove them automatically, but the suggested_config.yaml generated in the output path already adds one column from each bijection pair to disallowed_columns and configures detected time-series columns as sql_lambdas.
- Any column that could leak the prediction target
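If you want to spot duplicate columns yourself before fitting, a one-to-one (bijection) check is straightforward. The sketch below is an independent illustration in plain Python, not the detection BaseModel runs internally. The helper `is_bijection` and the sample rows are hypothetical.

```python
def is_bijection(rows, col_a, col_b):
    """True if each col_a value maps to exactly one col_b value and vice versa."""
    a_to_b = {}
    b_to_a = {}
    for row in rows:
        a, b = row[col_a], row[col_b]
        # setdefault stores the first pairing and returns it on later rows.
        if a_to_b.setdefault(a, b) != b:
            return False
        if b_to_a.setdefault(b, a) != a:
            return False
    return True


rows = [
    {"product_id": 1, "product_code": "A1", "price": 9.99},
    {"product_id": 2, "product_code": "B2", "price": 9.99},
    {"product_id": 1, "product_code": "A1", "price": 4.50},
]

print(is_bijection(rows, "product_id", "product_code"))  # True
print(is_bijection(rows, "product_id", "price"))  # False
```

Columns that form a bijection carry the same information, so keeping one of them is enough.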
Filtering Rows
Use where_condition to include only a subset of rows from a table. The expression is passed directly to the query engine — use the SQL syntax appropriate for your backend.
Common use cases:
- Sampling: include only a fraction of records for faster iteration.
- Thresholds: keep only events above a value, e.g., "amount > 0".
- Separating mixed event types: if a single table contains distinct behaviors (e.g., purchases and returns), split them into separate event sources using different where_condition values. This will improve model quality.
Joining Attribute Tables
To enrich events with dimension data, first define the attribute source, then reference it in the event's joined_data_sources. Each join is specified as pairs of [event_column, attribute_column].
# on the event source:
joined_data_sources:
  - name: articles            # must match the attribute source's name
    join_on:
      - [article_id, article_id]
  - name: stores
    join_on:
      - [store_id, store_id]
      - [format, format]      # multi-column join
The entire attribute table is joined unless further filtered by allowed_columns or disallowed_columns on the attribute source itself.
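The effect of a multi-column join_on can be pictured with a small Python sketch. This is a conceptual illustration only: the function `enrich` and the sample data are hypothetical, and the choice to keep unmatched events unchanged (left-join behaviour) is an assumption, not a documented BaseModel guarantee.

```python
def enrich(events, attributes, join_on):
    """Attach attribute columns to events where every join_on pair matches.

    join_on is a list of [event_column, attribute_column] pairs.
    Assumption: events without a match are kept as they are.
    """
    event_cols = [pair[0] for pair in join_on]
    attr_cols = [pair[1] for pair in join_on]
    # Index attribute rows by the tuple of their join-key values.
    index = {tuple(row[c] for c in attr_cols): row for row in attributes}
    enriched = []
    for event in events:
        key = tuple(event[c] for c in event_cols)
        match = index.get(key, {})
        extra = {k: v for k, v in match.items() if k not in attr_cols}
        enriched.append({**event, **extra})
    return enriched


stores = [
    {"store_id": 1, "format": "kiosk", "city": "Oslo"},
    {"store_id": 1, "format": "flagship", "city": "Bergen"},
]
events = [
    {"customer_id": "C-1", "store_id": 1, "format": "flagship"},
    {"customer_id": "C-2", "store_id": 2, "format": "kiosk"},
]

result = enrich(events, stores, [["store_id", "store_id"], ["format", "format"]])
print(result[0]["city"])    # Bergen
print("city" in result[1])  # False
```

Both key columns must match for a row to be enriched, which is why the first event picks up the flagship store rather than the kiosk with the same store_id.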
Main entity attributes auto-join
main_entity_attribute tables are joined automatically on the entity column — you never need to add them to joined_data_sources.
Date Range and Split
The data_params block controls which time window BaseModel reads and how the data is divided for training and validation.
Split types
| Type | Details |
|---|---|
| entity | Random percentage of entity IDs is held out for validation. Default. Good when entity count is large enough. Should always be used in production. |
| time | Training and validation sets separated by date boundary. Best for experimentation: tests the model's ability to predict across time. Should not be used in production where you want the latest data to inform inference. |
Entity split example
data_params:
  data_start_date: "2018-09-20 00:00:00"
  split:
    type: entity
    training: 90
    validation: 10
    test:
      start_date: "2020-09-05 00:00:00"
      end_date: "2020-09-22 00:00:00"
    training_validation_end: "2020-09-04 00:00:00"
training and validation accept any value from 0.01 to 99.99, representing a percentage of entities. Fractional percentages are supported — for example, 90.5 assigns 90.5 % of entities to that split. The two values must sum to 100 or less.
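The constraints above are easy to check before launching a run. The following is a minimal sketch in plain Python. The function `validate_entity_split` is hypothetical and simply encodes the documented rules. Reporting the remainder as "unassigned" is an interpretation of "100 or less", not a confirmed BaseModel behaviour.

```python
def validate_entity_split(training, validation):
    """Check entity-split percentages and return the unassigned remainder."""
    for name, value in (("training", training), ("validation", validation)):
        if not 0.01 <= value <= 99.99:
            raise ValueError(f"{name} must be between 0.01 and 99.99, got {value}")
    if training + validation > 100:
        raise ValueError("training and validation must sum to 100 or less")
    # Percentage of entities assigned to neither split.
    return 100 - training - validation


print(validate_entity_split(90, 10))  # 0
print(validate_entity_split(80, 10))  # 10
```

Calling `validate_entity_split(95, 10)` raises a ValueError because the two values sum to 105.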
Temporal split example
data_params:
  data_start_date: "2018-09-20 00:00:00"
  split:
    type: time
    training:
      start_date: "2018-09-20 00:00:00"
    validation:
      start_date: "2020-06-01 00:00:00"
    test:
      start_date: "2020-09-05 00:00:00"
      end_date: "2020-09-22 00:00:00"
Test window
The optional test block holds out a time window that is never seen during training or validation. Set training_validation_end to the day before test.start_date to ensure clean separation.
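Deriving that boundary programmatically avoids off-by-one mistakes when the test window moves. The helper below is a small illustrative sketch using only the Python standard library. The name `day_before` is hypothetical and the timestamp format is assumed to match the examples on this page.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # assumed to match the timestamps used in the config


def day_before(test_start):
    """Return the timestamp exactly one day before test_start, same format."""
    parsed = datetime.strptime(test_start, FMT)
    return (parsed - timedelta(days=1)).strftime(FMT)


print(day_before("2020-09-05 00:00:00"))  # 2020-09-04 00:00:00
```

The printed value is the training_validation_end that pairs with a test start_date of 2020-09-05.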
Extra Columns
Some columns would be discarded by BaseModel as features but are still needed inside a target function, e.g. order IDs, flags, raw prices, metadata. Declare them in data_params.extra_columns so they survive the fit stage and remain accessible during scenario training.
data_params:
  extra_columns:
    # plain column names: loaded as-is from the source
    - data_source_name: transactions
      columns:
        - order_id
        - return_flag
    # SQL lambda: compute a derived column on the fly
    - data_source_name: transactions
      columns:
        - alias: price_float
          expression: "CAST({{ resolve_fn('price', data_sources_path=['products']) }} AS FLOAT)"
    # works on attribute sources too
    - data_source_name: customers
      columns:
        - is_test_account
Each entry names a data_source_name (must match a source defined in data_sources) and a columns list. Columns can be:
- Plain strings: the column is fetched unchanged from the source.
- SQL lambdas (alias + expression): a derived column evaluated at query time, following the same syntax as sql_lambdas on data sources.
Minimize extra columns for performance
Only declare columns you actually reference in your target function. Every extra column adds query and memory overhead without contributing to the model's learned features.
See Target Function → Extra Columns for how to access them via .extra in your target code.