Managing Data

BaseModel learns from behavioral data — time-stamped interactions between entities. This page explains the data model BaseModel expects, how it automatically turns your columns into features, and where to go for each configuration task.

Basic Configuration already sets up working data_sources and data_params blocks. The guides below help you expand and refine that starting point.

In this section

| Guide | Description |
|---|---|
| This page | Understand the data model and automatic feature encoding |
| Connect Sources | Swap Parquet for the production backend you use |
| Select & Organize | Add more tables, filter columns and rows, configure joins, and fine-tune the split |
| Enrich & Transform | Layer on computed columns, type overrides, event grouping, and shared entities |

The Behavioral Data Model

The behavioral data model that powers every foundation model is built on three concepts:

  • entities — the actors and objects
  • events — their interactions over time
  • attributes — their properties

[Figure: BaseModel Data Structure]

Entities

Entities are the subjects and objects linked by interaction: users, customers, employees on one side; products, stores, services on the other. Each entity has a unique ID.

  • Main entity — the one whose behavior you want to model and predict. Most often a customer or user, though it can be anything with a behavioral history (ATMs, cell towers, points of sale).
  • Other entities — the objects the main entity interacts with: products, stores, services, content items.

Events

Events are time-stamped records of interactions between entities — transactions, page views, support calls, sensor readings. For BaseModel to learn, each event must include the main entity's ID and a timestamp. You need at least one event table to train a foundation model.
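The two requirements above (a main entity ID and a timestamp on every event) can be checked before training. The following is a minimal sketch using only the Python standard library; the helper `find_invalid_events`, the column names, and the ISO 8601 timestamp format are assumptions for illustration and are not part of BaseModel.

```python
from datetime import datetime


def find_invalid_events(events, entity_col, date_col):
    """Return indices of events that lack a main entity ID or a parseable timestamp."""
    invalid = []
    for i, event in enumerate(events):
        entity_id = event.get(entity_col)
        if entity_id in (None, ""):
            invalid.append(i)
            continue
        try:
            # Assumption: timestamps are ISO 8601 strings.
            datetime.fromisoformat(event.get(date_col))
        except (TypeError, ValueError):
            invalid.append(i)
    return invalid


# Hypothetical event rows; column names are illustrative only.
events = [
    {"customer_id": "c1", "event_time": "2024-03-01T10:15:00", "product_id": "p9"},
    {"customer_id": None, "event_time": "2024-03-01T10:16:00", "product_id": "p2"},
    {"customer_id": "c2", "event_time": "not a date", "product_id": "p9"},
]

print(find_invalid_events(events, "customer_id", "event_time"))  # [1, 2]
```

Rows flagged this way are worth fixing or filtering at the source, since an event without an entity ID or timestamp cannot be placed in any entity's history.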

Attributes

Attributes are the characteristics of entities that enrich the model's understanding — customer demographics, product categories, store metadata.

  • Main entity attributes — one row per main entity (e.g., a customer profile table). Automatically joined on the entity ID.
  • Attributes — dimension tables for other entities (e.g., a product catalogue). Require explicit joins to event tables.

Attribute tables are optional but recommended — they give the model richer context about the entities involved in each event.
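One way to picture how the three concepts fit together is the sketch below. It uses plain Python rows with hypothetical table and column names, and the `enrich` helper is illustrative only; how BaseModel performs these joins internally is not described here.

```python
# Hypothetical event table: every row carries the main entity ID and a timestamp.
events = [
    {"customer_id": "c1", "timestamp": "2024-03-01T10:15:00", "product_id": "p1"},
    {"customer_id": "c2", "timestamp": "2024-03-02T08:00:00", "product_id": "p2"},
]

# Main entity attributes: one row per main entity, keyed by the entity ID.
customers = {"c1": {"age_group": "25-34"}, "c2": {"age_group": "45-54"}}

# Attributes of another entity: a dimension table keyed by its own ID.
products = {"p1": {"category": "shoes"}, "p2": {"category": "books"}}


def enrich(event, main_attrs, other_attrs, main_key, join_key):
    """Attach main entity attributes and other-entity attributes to one event."""
    row = dict(event)
    row.update(main_attrs.get(event[main_key], {}))   # keyed on the main entity ID
    row.update(other_attrs.get(event[join_key], {}))  # explicit join key
    return row


enriched = [enrich(e, customers, products, "customer_id", "product_id") for e in events]
print(enriched[0])
```

The product lookup needs an explicit key (`product_id` here) because nothing ties that table to the main entity, which mirrors why other-entity attribute tables require a configured join.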

How BaseModel Encodes Your Data

BaseModel automatically infers feature types from your columns and encodes them for training. You don't need to do feature engineering — but understanding the mapping helps you validate the schema and decide when an override is worthwhile.

The inference pipeline works in this order:

  1. Declared columns — columns you designate as date_column and main_entity_column are excluded from feature generation.
  2. User overrides — any column listed in column_type_overrides gets the assigned type; inference is skipped.
  3. Auto-detection — all remaining columns are classified based on schema metadata and value statistics.
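The precedence described in the three steps above can be sketched as a small function. This is an illustration, not BaseModel's implementation: the function names, the type labels, and in particular the cardinality cutoff are assumptions, since the real threshold is not stated on this page.

```python
# Assumption: the actual cardinality cutoff is not given here; 50 is a placeholder.
HYPOTHETICAL_CARDINALITY_THRESHOLD = 50


def auto_detect(dtype, n_unique):
    """Toy auto-detection based on schema type and number of distinct values."""
    if dtype == "float":
        return "decimal"
    if n_unique <= HYPOTHETICAL_CARDINALITY_THRESHOLD:
        return "categorical"
    return "categorical_compressed"


def resolve_column_type(name, dtype, n_unique, declared, overrides):
    """Apply the order: declared columns, then user overrides, then auto-detection."""
    if name in declared:
        return None  # excluded from feature generation
    if name in overrides:
        return overrides[name]  # inference is skipped
    return auto_detect(dtype, n_unique)


declared = {"event_time", "customer_id"}
overrides = {"quantity": "decimal"}
columns = [
    ("event_time", "string", 1000),
    ("customer_id", "string", 500),
    ("quantity", "int", 12),
    ("price", "float", 300),
    ("channel", "string", 4),
    ("product_id", "string", 20000),
]

resolved = {
    name: resolve_column_type(name, dtype, n_unique, declared, overrides)
    for name, dtype, n_unique in columns
}
print(resolved)
```

Note that `quantity` would have been auto-detected as categorical (an integer with 12 distinct values), but the override wins because it is checked first.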

Standard feature types

| Type | When inferred | Encoding | Example columns |
|---|---|---|---|
| Decimal | Float columns | Normalized numeric | price, amount, rating |
| Categorical | String / int with low cardinality | One-hot | status, country, channel |
| Categorical Compressed | String / int with high cardinality | Learned embedding | product_id, session_id |
| Timestamp | Declared date_column | Temporal decomposition | transaction_date |
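To make two of the encodings named above more concrete, here is a generic sketch of one-hot encoding and temporal decomposition. Which time components BaseModel actually extracts is not stated here, so the choice of year, month, weekday, and hour is an assumption, as are the helper names.

```python
from datetime import datetime


def one_hot(value, vocabulary):
    """Encode a low-cardinality value as a 0/1 vector over a fixed vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]


def decompose_timestamp(raw):
    """Split an ISO 8601 timestamp into calendar components (assumed set)."""
    ts = datetime.fromisoformat(raw)
    return {
        "year": ts.year,
        "month": ts.month,
        "weekday": ts.weekday(),  # Monday is 0
        "hour": ts.hour,
    }


channel_vector = one_hot("web", ["app", "store", "web"])
time_parts = decompose_timestamp("2024-03-01T10:15:00")

print(channel_vector)  # [0, 0, 1]
print(time_parts)
```

One-hot vectors grow with the number of distinct values, which is why high-cardinality columns such as product_id are given a learned embedding instead.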

Advanced feature types (require explicit override)

| Type | Override value | Use case |
|---|---|---|
| Time Series | time_series | Numeric column with meaningful temporal patterns, e.g., price history, balances, sensor data |
| Text | text | Free-form string with more than about 5 tokens on average, e.g., descriptions, reviews, feedback |
| Image | image | Path or URL to a visual asset, e.g., product photos, property listings |

Advanced overrides are configured via column_type_overrides and sometimes paired with sql_lambdas. See Enrich & Transform for details.
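When deciding whether a string column is a candidate for the text override, the "more than about 5 tokens on average" guideline can be approximated with a quick check. The sketch below assumes whitespace tokenization, which may differ from the tokenizer BaseModel uses; both helper names are hypothetical.

```python
def average_token_count(values):
    """Mean number of whitespace-separated tokens across non-empty strings."""
    texts = [v for v in values if isinstance(v, str) and v.strip()]
    if not texts:
        return 0.0
    return sum(len(t.split()) for t in texts) / len(texts)


def looks_like_free_text(values, min_avg_tokens=5):
    """Heuristic: treat a column as free text if it averages more than min_avg_tokens."""
    return average_token_count(values) > min_avg_tokens


reviews = [
    "Great fit and very comfortable for long walks",
    "Arrived late but the quality is excellent overall",
]
statuses = ["shipped", "pending", "returned"]

print(looks_like_free_text(reviews))   # True
print(looks_like_free_text(statuses))  # False
```

A column that fails this check is usually better left to the default categorical handling.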

Handle integer columns correctly

Integers are treated as categorical by default. If a column should be numeric (e.g., quantity, count), either override it to decimal or cast it with an sql_lambda.
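A quick way to prepare such overrides is to list the integer columns you know to be quantities and map each to decimal. The sketch below is illustrative: the helper name, the schema representation, and the assumption that column_type_overrides takes a column-to-type mapping are not confirmed by this page, so check Enrich & Transform for the exact syntax.

```python
def suggest_decimal_overrides(schema, numeric_columns):
    """Map integer columns that should be numeric to the 'decimal' type."""
    return {
        name: "decimal"
        for name, dtype in schema.items()
        if dtype == "int" and name in numeric_columns
    }


# Hypothetical schema: column name -> storage type.
schema = {"quantity": "int", "store_id": "int", "price": "float"}

# Columns you know to be true quantities rather than identifiers or codes.
numeric_columns = {"quantity"}

suggested = suggest_decimal_overrides(schema, numeric_columns)
print(suggested)  # {'quantity': 'decimal'}
```

Here store_id stays categorical because it is an identifier, and price needs no override because float columns are already inferred as Decimal.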