Enrich & Transform

With your sources connected and organized, you can augment the data before it enters training — compute new columns, override how BaseModel interprets existing ones, let it auto-detect event subgroups, or unify entities that appear across multiple tables.

Computed Columns

sql_lambdas let you derive new columns using SQL expressions evaluated by the query engine. Each lambda needs an alias (the new column name) and an expression.

yaml
sql_lambdas:
  - alias: price_float
    expression: "TO_DOUBLE(price)"
  - alias: discount_pct
    expression: "(list_price - price) / list_price"
  - alias: tenure_years
    expression: "date_diff('year', enrolment_date::DATE, DATE '2024-12-01')"

Lambdas run at query time and can use any function your backend supports — casts, arithmetic, date functions, CASE WHEN, etc.
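
As an illustration of a conditional expression, the sketch below buckets a numeric column into bands with CASE WHEN. The alias price_band and the thresholds are hypothetical, and it assumes price is already numeric; adjust them to your schema and backend.

```yaml
sql_lambdas:
  # Hypothetical example: bucket a numeric price into three bands.
  - alias: price_band
    expression: "CASE WHEN price < 10 THEN 'low' WHEN price < 100 THEN 'mid' ELSE 'high' END"
```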

Make use of multiple date columns as features

Apart from the event timestamp (which BaseModel handles automatically), raw date columns are not embedded. Use sql_lambdas to transform them into numeric values — such as tenure in years or days since an event — so the model can learn from them.
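
For instance, a "days since" feature could follow the same pattern as tenure_years in the Computed Columns example. This is only a sketch: signup_date is a hypothetical column, the reference date is arbitrary, and the date_diff call mirrors the earlier example, so its name and argument order may differ on your backend.

```yaml
sql_lambdas:
  # Hypothetical column: days elapsed between signup_date and a fixed reference date.
  - alias: days_since_signup
    expression: "date_diff('day', signup_date::DATE, DATE '2024-12-01')"
```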

Column Type Overrides

BaseModel auto-detects feature types (see Managing Data), but you can override the detection when the default is not ideal.

yaml
column_type_overrides:
  loyalty_tier: categorical
  quantity: decimal

Check integer columns that represent quantities

An integer column like num_items (a count of purchased items) represents a continuous quantity where ordering and magnitude matter. However, BaseModel may classify it as categorical because it contains a small set of distinct whole numbers. Override it to decimal so the model treats it as a numerical feature.
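
Assuming the column really is named num_items, as in the description above, the override might look like this:

```yaml
column_type_overrides:
  num_items: decimal   # treat the item count as a numerical feature, not as categories
```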

Standard overrides

| Override value | When to use |
| --- | --- |
| categorical | Force a numeric column to be treated as categories (e.g., status codes, tiers) |
| decimal | Force a column to be treated as continuous numeric (e.g., integers that represent quantities) |

Rich encoding overrides

These unlock richer feature representations but require explicit opt-in.

Time Series

For numeric columns with meaningful temporal patterns, such as price history, account balances, or sensor readings. Best practice is to create a derived copy of the column via sql_lambdas and mark the copy as time_series. The original column keeps its auto-detected type, so the model gets both the raw value and the temporal encoding:

yaml
sql_lambdas:
  - alias: price_ts
    expression: price
column_type_overrides:
  price_ts: time_series

BaseModel will extract sequence-based features rather than treating values independently.

Text

For free-form string columns with an average of more than ~5 tokens per record — product descriptions, reviews, customer feedback:

yaml
column_type_overrides:
  description: text

BaseModel tokenises the text and generates semantic embeddings, instead of treating each distinct string as a single categorical value.

Text columns are skipped by default

BaseModel skips text columns during fitting unless you explicitly opt in with column_type_overrides. If you expect a text column to contribute features but it appears under "Skipped columns" in the columns analysis report, add the override shown above.

Image

For columns containing a path or URL to a visual asset — product photos, property listings, travel destinations:

yaml
column_type_overrides:
  image_url: image

BaseModel accepts local paths, cloud URIs, or web URLs and generates visual embeddings from the image content.

Automatic Event Grouping

When an event table contains diverse event patterns that are hard to separate manually with where_condition, BaseModel can detect groups automatically and process them separately.

When to consider auto-grouping

Use num_groups when your event table contains latent event types characterized by combinations of metadata columns — for example, visits to different store formats, banking transactions that mix card payments with wire transfers, or medical encounters that range from routine checkups to emergency admissions. If you can't easily express the split with a single where_condition, auto-grouping will find the natural boundaries for you.

yaml
- type: event
  name: transactions
  num_groups: 3
  ...

num_groups triggers pattern recognition that divides events into disjoint groups based on frequent combinations of categorical column values. If the data cannot support the requested number of groups, training will fail with a message indicating the maximum allowed value.
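
For contrast, a manual split of the same table might look like the sketch below, with one event source per hand-written filter. It assumes where_condition sits at the event-source level and takes a SQL predicate string; the channel column, its values, and the source names are hypothetical. With num_groups you would keep a single source and let BaseModel find the split instead.

```yaml
# Manual alternative (hypothetical predicates): one source per filter.
- type: event
  name: card_payments
  where_condition: "channel = 'card'"
  ...

- type: event
  name: wire_transfers
  where_condition: "channel = 'wire'"
  ...
```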

Auto-grouping only for event sources

num_groups is only valid for event sources.

Shared Entities

If the same real-world entity (e.g., a product) appears in multiple event tables under the same column name, BaseModel normally treats each occurrence independently. shared_entities lets you unify them into a single representation, enriching the model's understanding of that entity across all sources.

When to consider shared entities

Use shared_entities when an entity like a product, store, or content item appears across multiple event tables — for example, the same article_id in both a purchases table and a page-views table. Without sharing, BaseModel builds a separate representation of that entity in each source and misses cross-source signals. Sharing unifies those representations, so the model learns that the product a customer browsed is the same one they later bought.

Basic usage — same column in multiple event tables

yaml
- type: event
  name: product_buy
  ...
  shared_entities:
    - name: product
      columns: [article_id]
      id_column: article_id

- type: event
  name: page_visit
  ...
  shared_entities:
    - name: product
      columns: [article_id]
      id_column: article_id

The name must match across sources to identify the same shared entity.

Adding extra columns

When to add extra columns

Add extra columns when different event tables hold complementary information about the same entity. For example, a purchases table might have a name column while a page-views table has a product_description column. Pulling both into the shared entity gives BaseModel a richer picture of each product than either source provides alone.

You can enrich the unified entity with additional columns from each source:

yaml
- type: event
  name: product_buy
  ...
  shared_entities:
    - name: product
      columns: [article_id, name]                 # product_buy has a "name" column
      id_column: article_id

- type: event
  name: page_visit
  ...
  shared_entities:
    - name: product
      columns: [article_id, product_description]  # page_visit has "product_description"
      id_column: article_id

Combining with joined attributes

When an attribute table is joined to an event source, you can pull its columns into the shared entity using the syntax [column_name, [source_name]]:

yaml
shared_entities:
  - name: product
    columns:
      - article_id
      - [description, [product_attributes]]
      - [name, [product_attributes]]
    id_column: article_id

Shared entities support limited column types

Shared entities currently support only Text and Categorical Compressed (Cleora-based) columns.

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Identifier for the shared entity; must match across sources |
| columns | list | Columns composing the entity. Strings for local columns; [col, [source]] for joined columns |
| id_column | str | The main identifying column, usually the one with highest cardinality |