Enrich & Transform
With your sources connected and organized, you can augment the data before it enters training — compute new columns, override how BaseModel interprets existing ones, let it auto-detect event subgroups, or unify entities that appear across multiple tables.
Computed Columns
sql_lambdas let you derive new columns using SQL expressions evaluated by the query engine. Each lambda needs an alias (the new column name) and an expression.
sql_lambdas:
  - alias: price_float
    expression: "TO_DOUBLE(price)"
  - alias: discount_pct
    expression: "(list_price - price) / list_price"
  - alias: tenure_years
    expression: "date_diff('year', enrolment_date::DATE, DATE '2024-12-01')"
Lambdas run at query time and can use any function your backend supports — casts, arithmetic, date functions, CASE WHEN, etc.
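As a sketch of the kinds of expression mentioned above, the fragment below adds a CASE WHEN bucket and a division guarded against zero. It assumes list_price and price are numeric columns; the aliases price_band and discount_pct_safe and the bucket thresholds are illustrative, not part of BaseModel. CASE WHEN and NULLIF are standard SQL, but check that your backend accepts them in this form.

```yaml
sql_lambdas:
  # Bucket a numeric price into three bands (thresholds are illustrative).
  - alias: price_band
    expression: "CASE WHEN list_price < 20 THEN 'low' WHEN list_price < 100 THEN 'mid' ELSE 'high' END"
  # NULLIF turns a zero list_price into NULL, so the division yields NULL
  # rather than a division-by-zero error.
  - alias: discount_pct_safe
    expression: "(list_price - price) / NULLIF(list_price, 0)"
```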
Make use of multiple date columns as features
Apart from the event timestamp (which BaseModel handles automatically), raw date columns are not embedded. Use sql_lambdas to transform them into numeric values — such as tenure in years or days since an event — so the model can learn from them.
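A minimal sketch of the "days since an event" case, reusing the date_diff pattern from the tenure_years example. The column last_purchase_date and the alias are hypothetical, the reference date is a fixed literal chosen for illustration, and the date function may be named differently on your backend.

```yaml
sql_lambdas:
  # Days between a raw date column and a fixed reference date.
  # last_purchase_date is a hypothetical column name.
  - alias: days_since_last_purchase
    expression: "date_diff('day', last_purchase_date::DATE, DATE '2024-12-01')"
```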
Column Type Overrides
BaseModel auto-detects feature types (see Managing Data), but you can override the detection when the default is not ideal.
Check integer columns that represent quantities
An integer column like num_items (a count of purchased items) represents a continuous quantity where ordering and magnitude matter. However, BaseModel may classify it as categorical because it contains a small set of distinct whole numbers. Override it to decimal so the model treats it as a numerical feature.
Standard overrides
| Override value | When to use |
|---|---|
| categorical | Force a numeric column to be treated as categories (e.g., status codes, tiers) |
| decimal | Force a column to be treated as continuous numeric (e.g., integers that represent quantities) |
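The two standard overrides might be applied as sketched below. This assumes that column_type_overrides is a mapping from column name to override value and that it sits inside a data source definition; neither detail is confirmed by this page, and status_code is a hypothetical column.

```yaml
# Assumed placement: inside a data source definition.
column_type_overrides:
  num_items: decimal        # integer count treated as continuous numeric
  status_code: categorical  # numeric code treated as categories (hypothetical column)
```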
Rich encoding overrides
These unlock richer feature representations but require explicit opt-in.
Time Series
Use this for numeric columns with meaningful temporal patterns, such as price history, account balances, or sensor readings. Best practice is to create a derived column via sql_lambdas and mark that column as time_series, so the model gets both the raw value and the temporal encoding.
BaseModel will extract sequence-based features rather than treating values independently.
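The derived-column pattern described above could look like the following sketch. The alias price_ts is hypothetical, the expression reuses the TO_DOUBLE cast from the Computed Columns example, and the mapping form of column_type_overrides (column name to override value) is an assumption.

```yaml
sql_lambdas:
  # Derived copy of the raw price column; the original column stays available as-is.
  - alias: price_ts
    expression: "TO_DOUBLE(price)"
column_type_overrides:
  price_ts: time_series   # assumed syntax: column name mapped to override value
```

Keeping the raw column untouched and overriding only the derived one is what lets the model see both encodings.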
Text
Use this for free-form string columns that average more than about 5 tokens per record, such as product descriptions, reviews, or customer feedback.
BaseModel generates tokenised semantic embeddings instead of treating the column as a categorical token.
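A text opt-in might be written as below. Both the override value text and the mapping form of column_type_overrides are assumptions inferred from the surrounding sections, and product_description is used only as an illustrative column.

```yaml
column_type_overrides:
  product_description: text   # assumed override value for free-form text columns
```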
Text columns are skipped by default
BaseModel skips text columns during fitting unless you explicitly opt in with column_type_overrides. If you expect a text column to contribute features but it appears under "Skipped columns" in the columns analysis report, add a text override for that column.
Image
Use this for columns containing a path or URL to a visual asset, such as product photos, property listings, or travel destinations.
BaseModel accepts local paths, cloud URIs, or web URLs and generates visual embeddings from the image content.
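An image opt-in could follow the same shape. The override value image, the mapping syntax, and the column name image_url are all assumptions made for illustration.

```yaml
column_type_overrides:
  image_url: image   # assumed override value; column holds a local path, cloud URI, or web URL
```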
Automatic Event Grouping
When an event table contains diverse event patterns that are hard to separate manually with where_condition, BaseModel can detect groups automatically and process them separately.
When to consider auto-grouping
Use num_groups when your event table contains latent event types characterized by combinations of metadata columns — for example, visits to different store formats, banking transactions that mix card payments with wire transfers, or medical encounters that range from routine checkups to emergency admissions. If you can't easily express the split with a single where_condition, auto-grouping will find the natural boundaries for you.
num_groups triggers pattern recognition that divides events into disjoint groups based on frequent combinations of categorical column values. If the data cannot support the requested number of groups, training will fail with a message indicating the maximum allowed value.
Auto-grouping only for event sources
num_groups is only valid for event sources.
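A sketch of how num_groups might be set on an event source. The source name, the value 4, and the placement of the key at the source level are assumptions; only the parameter name and its restriction to event sources come from the text above.

```yaml
- type: event
  name: transactions   # hypothetical source name
  # other source settings omitted
  num_groups: 4        # illustrative value; training fails if the data cannot support it
```

If training reports a lower maximum, reduce num_groups to that value or below.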
Shared Entities
If the same real-world entity (e.g., a product) appears in multiple event tables under the same column name, BaseModel normally treats each occurrence independently. shared_entities lets you unify them into a single representation, enriching the model's understanding of that entity across all sources.
When to consider shared entities
Use shared_entities when an entity like a product, store, or content item appears across multiple event tables — for example, the same article_id in both a purchases table and a page-views table. Without sharing, BaseModel builds a separate representation of that entity in each source and misses cross-source signals. Sharing unifies those representations, so the model learns that the product a customer browsed is the same one they later bought.
Basic usage — same column in multiple event tables
- type: event
  name: product_buy
  ...
  shared_entities:
    - name: product
      columns: [article_id]
      id_column: article_id
- type: event
  name: page_visit
  ...
  shared_entities:
    - name: product
      columns: [article_id]
      id_column: article_id
The name must match across sources to identify the same shared entity.
Adding extra columns
When to add extra columns
Add extra columns when different event tables hold complementary information about the same entity. For example, a purchases table might have a product name while a page-views table has a product_description. Pulling both into the shared entity gives BaseModel a richer picture of each product than either source provides alone.
You can enrich the unified entity with additional columns from each source:
# product_buy has a "name" column, page_visit has "product_description"
shared_entities:
  - name: product
    columns: [article_id, name]  # on product_buy
    id_column: article_id
shared_entities:
  - name: product
    columns: [article_id, product_description]  # on page_visit
    id_column: article_id
Combining with joined attributes
When an attribute table is joined to an event source, you can pull its columns into the shared entity using the syntax [column_name, [source_name]]:
shared_entities:
  - name: product
    columns:
      - article_id
      - [description, [product_attributes]]
      - [name, [product_attributes]]
    id_column: article_id
Shared entities support limited column types
Shared entities currently support only Text and Categorical Compressed (Cleora-based) columns.
Parameters
| Parameter | Type | Description |
|---|---|---|
| name | str | Identifier for the shared entity — must match across sources |
| columns | list | Columns composing the entity. Strings for local columns; [col, [source]] for joined columns |
| id_column | str | The main identifying column, usually the one with highest cardinality |