Data Types & Feature Encoding
Learn how BaseModel maps source data types to feature encodings
Automated feature encoding
BaseModel automatically infers and encodes feature types from your connected data sources, while still allowing full user control through overrides. During schema processing, BaseModel evaluates each column in the following order:
- Timestamp or entity ID detection: if the column is defined as the event timestamp or the main entity identifier, it is excluded from feature inference.
- User override check: if the column appears in column_type_overrides, the assigned type is applied and inference is skipped.
- Automatic inference: for all remaining columns, BaseModel analyzes schema metadata and column statistics to determine the most suitable feature representation.
This process defines how each column is ultimately represented in the model — for example, as a numeric feature, one-hot vector, sketch embedding, etc. Understanding this mapping helps you validate the inferred schema, ensure consistent representations, and fine-tune feature behavior when needed.
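The evaluation order can be illustrated with a short Python sketch. This is not BaseModel's implementation: the function name `resolve_feature_type`, its parameters, and the `infer` callback are hypothetical and only mirror the three steps described above.

```python
from typing import Callable, Dict, Optional


def resolve_feature_type(
    column: str,
    timestamp_column: str,
    entity_id_column: str,
    column_type_overrides: Dict[str, str],
    infer: Callable[[str], str],
) -> Optional[str]:
    """Return the feature type for a column, or None if it is excluded.

    Hypothetical helper that mirrors the documented order of checks.
    """
    # Step 1: the event timestamp and the main entity ID are not features.
    if column in (timestamp_column, entity_id_column):
        return None
    # Step 2: an explicit override wins and inference is skipped.
    if column in column_type_overrides:
        return column_type_overrides[column]
    # Step 3: every remaining column falls back to automatic inference.
    return infer(column)


def placeholder_inference(column: str) -> str:
    # Stand-in for real inference, which uses schema metadata and statistics.
    return "categorical"


overrides = {"description": "text"}
for name in ("event_ts", "description", "status"):
    print(name, resolve_feature_type(name, "event_ts", "user_id", overrides, placeholder_inference))
```

In this sketch an override is consulted only after the timestamp and entity ID check, which matches the order listed above.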
In the following paragraphs, you can learn which feature types BaseModel supports, which ones are automatically inferred and encoded, and how they work — before moving on to advanced representations that require explicit configuration.
Overview of feature types
Each feature type defines how BaseModel interprets and encodes a column during model training. The list below summarizes the main feature categories, their intended use cases, and how they are internally represented.
- Decimal:
- Continuous numeric feature; values where scale and magnitude matter.
- Typical use case: amounts, counts, ratios, ratings.
- Encoding: numeric, normalized.
- Categorical:
- Low-cardinality identifiers or enums.
- Typical use case: status, gender, country, activity flag.
- Encoding: one-hot encoding.
- Categorical Compressed:
- High-cardinality identifiers; similarity-preserving and robust to unseen values.
- Typical use case: product, session, destination, or user IDs.
- Encoding: sketch embedding learned using Cleora and emde.
- Timestamp:
- Event timestamp used to learn temporal behavioral patterns; not used directly as a feature.
- Typical use case: transaction date, delivery time, signup moment.
- Encoding: extracted into other feature types (e.g., hour, day, recency).
- Time Series:
- Sequential numeric feature that captures seasonality and temporal patterns with sufficient history and consistent intervals. Requires explicit enabling.
- Typical use case: price history, balance trends, sensor readings.
- Encoding: temporal encoding.
- Text:
- Free-form text that provides meaningful context when the field is clean and informative. Requires explicit enabling.
- Typical use case: product descriptions, reviews, customer feedback.
- Encoding: tokenized semantic embeddings.
- Image:
- An image associated with an entity. Provided as path or URL. Requires explicit enabling.
- Typical use case: product photos, property listings, travel destinations.
- Encoding: visual embedding learned using a vision transformer and emde.
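To make the two simplest encodings in the list concrete, here is a minimal Python sketch of one-hot encoding and numeric normalization. The list above only says that Decimal features are "normalized"; the z-score standardization used here is an assumption, and the helper names `one_hot` and `standardize` are illustrative rather than part of BaseModel.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def one_hot(values: Sequence[str]) -> List[List[int]]:
    """Encode each value as a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [
        [1 if index[value] == i else 0 for i in range(len(categories))]
        for value in values
    ]


def standardize(values: Sequence[float]) -> List[float]:
    """Z-score normalization (assumed method): subtract the mean, divide by the std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no scale information.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


print(one_hot(["active", "churned", "active"]))
print(standardize([10.0, 20.0, 30.0]))
```

One-hot vectors grow with the number of distinct values, which is one reason a different encoding is used for high-cardinality columns.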
Supported data engines and type mapping
BaseModel supports multiple data engines and file formats. Each source has its own data type system, which is automatically mapped to BaseModel's internal feature representation.
Data types of major databases mapped to each BaseModel feature:

| BaseModel Feature | Snowflake | Databricks | BigQuery | ClickHouse | Hive | Parquet |
|---|---|---|---|---|---|---|
| Categorical¹ | VARCHAR, BOOLEAN, INTEGER | STRING, BOOLEAN, INT | STRING, BOOL, INT64 | String, Enum, UInt8 | STRING, BOOLEAN, SMALLINT | string, boolean |
| Categorical Compressed¹ | VARCHAR, INTEGER | STRING, INT | STRING, INT64 | String, Enum, UInt8 | STRING, SMALLINT, INT | string, int32 |
| Sketch² | VARCHAR | STRING | STRING | String, UUID | STRING | string |
| Decimal³ | NUMBER, FLOAT, DECIMAL | DOUBLE, FLOAT, DECIMAL | FLOAT64, NUMERIC | Float32, Float64, Decimal | DOUBLE, DECIMAL | float, double, int32 |
| Time Series⁴ adv. | NUMBER, FLOAT, DECIMAL | DOUBLE, FLOAT, DECIMAL | FLOAT64, NUMERIC | Float32, Float64, Decimal | DOUBLE, DECIMAL | float, double, int32 |
| Text⁵ adv. | VARCHAR | STRING | STRING | String | STRING | string |
| Image⁶ adv. | VARCHAR (URL / path) | STRING (URL / path) | STRING (URL / path) | String (URL / path) | STRING (URL / path) | string (URL / path) |
¹ Categorical inferred when column cardinality is below ~2,500 unique values.
² Categorical Compressed inferred when cardinality exceeds ~2,500 unique values.
³ Decimal is only used for float values; integers require a user override or conversion to be interpreted as numbers.
⁴ Time Series modeling requires a column_type_overrides entry and is typically applied to a derived numeric column created via an sql_lambdas entry.
⁵ Text embedding can be created for unstructured string fields with an average token count > 5; requires a column_type_overrides entry.
⁶ Image embedding can be created for entity-linked visual assets; accepts file paths, cloud URIs, or URLs and generates visual embeddings; requires a column_type_overrides entry.
When using Parquet, type inference depends on the file's logical schema — ensure consistent typing across partitions.
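The inference rules in the footnotes can be sketched in Python as follows. This is a simplified illustration, not BaseModel's logic: the function `infer_feature_type` and the returned labels are hypothetical, the threshold is the approximate value from the footnotes, and the treatment of a column with exactly 2,500 distinct values is an assumption because the footnotes do not specify it.

```python
from typing import Any, Sequence

# Approximate threshold taken from the footnotes above.
CARDINALITY_THRESHOLD = 2_500


def infer_feature_type(values: Sequence[Any]) -> str:
    """Pick a feature type from sample values (hypothetical helper)."""
    non_null = [value for value in values if value is not None]
    # Only float columns are treated as Decimal; integers are not.
    if non_null and all(isinstance(value, float) for value in non_null):
        return "decimal"
    distinct = len(set(non_null))
    # Assumption: a column at exactly the threshold counts as high cardinality.
    if distinct < CARDINALITY_THRESHOLD:
        return "categorical"
    return "categorical_compressed"


print(infer_feature_type([19.99, 5.49, 102.0]))
print(infer_feature_type([1, 2, 3]))
print(infer_feature_type([f"sku_{i}" for i in range(3_000)]))
```

The integer column is classified as categorical here, which is why an override or a conversion is needed when integers should be treated as numbers.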
Handling Date type
A date column is only consumed as an event timestamp when it defines the temporal order of records in event tables. Otherwise, BaseModel does not directly use raw DATE or TIMESTAMP columns as model features.
If a date field contains meaningful business information — such as enrollment date, contract start date, or renewal date — it can be included in modeling in two ways:
- Transforming it with sql_lambdas to create derived numeric or categorical features. For example, you can compute a customer’s tenure or the number of days since a particular event and then reference this new column as TENURE or similar:

      sql_lambdas:
        - alias: TENURE
          expression: date_diff('year', ENROLMENT_DATE::DATE, DATE '2024-12-01')

- Including it in the extra_columns block to make it accessible to a target function, while not directly used as a model feature.
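To check what a derived date feature should look like before writing the SQL expression, the same quantities can be computed in Python. The helpers `completed_years` and `days_since` are illustrative names. Note that `completed_years` counts full years elapsed, while some SQL engines count calendar-year boundaries crossed for a year-level date difference, so results may differ from your engine's `date_diff`.

```python
from datetime import date


def completed_years(start: date, reference: date) -> int:
    """Number of full years elapsed between start and reference."""
    years = reference.year - start.year
    # Subtract one if the anniversary has not been reached yet this year.
    if (reference.month, reference.day) < (start.month, start.day):
        years -= 1
    return years


def days_since(event: date, reference: date) -> int:
    """Number of days between an event and the reference date."""
    return (reference - event).days


reference_date = date(2024, 12, 1)
print(completed_years(date(2020, 12, 2), reference_date))
print(days_since(date(2024, 11, 1), reference_date))
```

Using a fixed reference date, as in the sql_lambdas example, keeps the derived value reproducible across runs.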
Advanced feature type overrides
In some cases you may want to override automatic inference to produce richer feature representations. Three common advanced overrides are:
- text: for longer string fields (e.g., descriptions, reviews)
- image: for visual assets linked to entities (e.g., product photos)
- time_series: for numeric signals that evolve over time
Each can be defined explicitly under the column_type_overrides section in your data source definition.
Text
If a string column contains long-form content — typically more than five words per record — BaseModel can embed it as a semantic vector instead of treating it as a categorical token. This allows the model to capture semantic meaning from unstructured text such as product descriptions, customer reviews, or user bios.
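A quick way to judge whether a string column is a candidate for the text override is to estimate its average length in words. The sketch below assumes simple whitespace tokenization, which may not match the tokenizer BaseModel uses; the function names and the `min_avg_tokens` parameter are hypothetical.

```python
from typing import Optional, Sequence


def average_token_count(values: Sequence[Optional[str]]) -> float:
    """Average number of whitespace-separated tokens over non-empty values."""
    texts = [value for value in values if value]
    if not texts:
        return 0.0
    return sum(len(text.split()) for text in texts) / len(texts)


def looks_like_text(values: Sequence[Optional[str]], min_avg_tokens: float = 5.0) -> bool:
    """True when the column averages more than min_avg_tokens tokens per record."""
    return average_token_count(values) > min_avg_tokens


descriptions = [
    "Lightweight waterproof jacket with a detachable hood",
    "Soft cotton shirt in a relaxed summer fit",
]
print(looks_like_text(descriptions))
print(looks_like_text(["red", "blue", None]))
```

Columns that pass this kind of check still need an explicit override, because text is never inferred automatically.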
column_type_overrides:
  description: text
Image
If your entities (e.g., products, places, properties) have associated images, you can include them as an image column. An image column holds a string path or URL pointing to the image file and can be either:
- provided in a separate table keyed by entity_id, or
- added to an attributes table that already references the same entity.
BaseModel automatically handles local paths, cloud URIs, or web URLs, processes the image through a vision backbone, and generates embeddings or sketches.
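Before enabling the override, it can help to check which kinds of references an image column actually contains. The following sketch classifies a value as a web URL, a cloud URI, or a local path. The function name and the list of cloud schemes are assumptions made for illustration; they do not describe which storage backends BaseModel supports.

```python
from urllib.parse import urlparse

# Assumed set of cloud storage schemes, for illustration only.
CLOUD_SCHEMES = ("s3", "gs", "abfs", "abfss")


def classify_image_reference(value: str) -> str:
    """Classify an image reference by its URI scheme (hypothetical helper)."""
    scheme = urlparse(value).scheme.lower()
    if scheme in ("http", "https"):
        return "web_url"
    if scheme in CLOUD_SCHEMES:
        return "cloud_uri"
    # Anything else, including values without a scheme, is treated as a path.
    return "local_path"


for reference in (
    "https://example.com/img/1.jpg",
    "s3://bucket/img/1.jpg",
    "/data/images/1.jpg",
):
    print(reference, classify_image_reference(reference))
```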
column_type_overrides:
  image_url: image
Time_series
Sometimes a numeric feature carries temporal structure — where patterns, trends, or seasonality matter more than single values. Examples include:
- evolving balances or daily totals,
- metrics or aggregates sampled over time,
- or any sequence where the shape and trend convey information.
In these cases, create a derived numeric column using an entry in the sql_lambdas block and mark it as time_series:
- this instructs BaseModel to extract sequence-based features (e.g. temporal aggregates, slopes, and recency effects) rather than treating values independently,
- this approach also keeps date values interpretable and usable in downstream computations without treating them as raw features.
sql_lambdas:
  - alias: price_ts
    expression: price
column_type_overrides:
  price_ts: time_series
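To give an intuition for sequence-based features such as temporal aggregates and slopes, here is a small Python sketch. It is not BaseModel's temporal encoding, which is not described here; the helpers `slope` and `summarize` are hypothetical and assume observations are ordered in time and evenly spaced.

```python
from statistics import mean
from typing import Dict, Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their position 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Aggregate level, trend, and most recent value of a numeric sequence."""
    return {"mean": mean(values), "slope": slope(values), "last": values[-1]}


prices = [10.0, 11.0, 12.0, 13.0]
print(summarize(prices))
```

Two sequences with the same mean can have very different slopes, which is the kind of information lost when values are treated independently.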
