Automated feature encoding

BaseModel automatically infers and encodes feature types from your connected data sources, while still allowing full user control through overrides. During schema processing, BaseModel evaluates each column in the following order:

1. Timestamp or entity ID detection: if the column is defined as the event timestamp or the main entity identifier, it is excluded from feature inference.
2. User override check: if the column appears in column_type_overrides, the assigned type is applied and inference is skipped.
3. Automatic inference: for all remaining columns, BaseModel analyzes schema metadata and column statistics to determine the most suitable feature representation.
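
The precedence of these three checks can be sketched in a few lines of Python. This is only an illustration of the evaluation order, not BaseModel's implementation: `resolve_feature_type`, its parameters, and the `infer` callback are hypothetical names, and the returned type labels are placeholders.

```python
from typing import Callable, Dict, Optional


def resolve_feature_type(
    column: str,
    timestamp_column: str,
    entity_id_column: str,
    overrides: Dict[str, str],
    infer: Callable[[str], str],
) -> Optional[str]:
    """Illustrative precedence only; every name here is hypothetical."""
    # Timestamp and entity ID columns are excluded from feature inference.
    if column in (timestamp_column, entity_id_column):
        return None
    # An explicit override wins, and inference is skipped.
    if column in overrides:
        return overrides[column]
    # Every remaining column falls through to automatic inference.
    return infer(column)


overrides = {"description": "text"}
for name in ("event_time", "description", "status"):
    resolved = resolve_feature_type(
        name, "event_time", "customer_id", overrides, lambda c: "inferred"
    )
    print(name, resolved)
```

The point of the sketch is that an override never reaches the inference step, so a column listed in column_type_overrides keeps its assigned type regardless of its statistics.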

This process defines how each column is ultimately represented in the model, for example as a numeric feature, a one-hot vector, or a sketch embedding. Understanding this mapping helps you validate the inferred schema, ensure consistent representations, and fine-tune feature behavior when needed.
The following sections describe which feature types BaseModel supports, which ones are inferred and encoded automatically, and how they work, before moving on to advanced representations that require explicit configuration.

Overview of feature types

Each feature type defines how BaseModel interprets and encodes a column during model training. The list below summarizes the main feature categories, their intended use cases, and how they are internally represented.

Decimal:
- Continuous numeric feature; values where scale and magnitude matter.
- Typical use case: amounts, counts, ratios, ratings.
- Encoding: numeric, normalized.
Categorical:
- Low-cardinality identifiers or enums.
- Typical use case: status, gender, country, activity flag.
- Encoding: one-hot encoding.
Categorical Compressed:
- High-cardinality identifiers; similarity-preserving and robust to unseen values.
- Typical use case: product, session, destination, or user IDs.
- Encoding: sketch embedding learned using Cleora and EMDE.
Timestamp:
- Event timestamp used to learn temporal behavioral patterns; not used directly as a feature.
- Typical use case: transaction date, delivery time, signup moment.
- Encoding: extract into other feature types (e.g., hour, day, recency).
Time Series:
- Sequential numeric feature that captures seasonality and temporal patterns with sufficient history and consistent intervals. Requires explicit enabling.
- Typical use case: price history, balance trends, sensor readings.
- Encoding: temporal encoding.
Text:
- Free-form text that provides meaningful context when the field is clean and informative. Requires explicit enabling.
- Typical use case: product descriptions, reviews, customer feedback.
- Encoding: tokenized semantic embeddings.
Image:
- An image associated with an entity. Provided as path or URL. Requires explicit enabling.
- Typical use case: product photos, property listings, travel destinations.
- Encoding: visual embedding learned using a vision transformer and EMDE.
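
To make the two simplest encodings in this list concrete, here is a minimal Python sketch of one-hot encoding for a Categorical column and normalization for a Decimal column. The source does not say which normalization BaseModel applies, so z-score standardization is used here purely as an assumption; the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def one_hot(values: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Map each distinct value to its own position in a 0/1 vector."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        vectors.append(row)
    return categories, vectors


def standardize(values: Sequence[float]) -> List[float]:
    """Z-score scaling (an assumed choice of normalization)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no scale information.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


categories, vectors = one_hot(["active", "churned", "active"])
scaled = standardize([10.0, 20.0, 30.0])
```

The one-hot vector grows by one position per distinct value, which is why this encoding suits only low-cardinality columns.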

Supported data engines and type mapping

BaseModel supports multiple data engines and file formats. Each source has its own data type system, which is automatically mapped to BaseModel's internal feature representation.

BaseModel Feature	Data Types of Major Databases
BaseModel Feature	Snowflake	Databricks	BigQuery	ClickHouse	Hive	Parquet
Categorical¹	VARCHAR, BOOLEAN, INTEGER	STRING, BOOLEAN, INT	STRING, BOOL, INT64	String, Enum, UInt8	STRING, BOOLEAN, SMALLINT	string, boolean
Categorical Compressed¹	VARCHAR, INTEGER	STRING, INT	STRING, INT64	String, Enum, UInt8	STRING, SMALLINT, INT	string, int32
Sketch²	VARCHAR	STRING	STRING	String, UUID	STRING	string
Decimal³	NUMBER, FLOAT, DECIMAL	DOUBLE, FLOAT, DECIMAL	FLOAT64, NUMERIC	Float32, Float64, Decimal	DOUBLE, DECIMAL	float, double, int32
Time Series⁴ adv.	NUMBER, FLOAT, DECIMAL	DOUBLE, FLOAT, DECIMAL	FLOAT64, NUMERIC	Float32, Float64, Decimal	DOUBLE, DECIMAL	float, double, int32
Text⁵ adv.	VARCHAR	STRING	STRING	String	STRING	string
Image⁶ adv.	VARCHAR (URL / path)	STRING (URL / path)	STRING (URL / path)	String (URL / path)	STRING (URL / path)	string (URL / path)

¹ Categorical inferred when column cardinality is below ~2,500 unique values.
² Categorical Compressed inferred when cardinality exceeds ~2,500 unique values.
³ Decimal is only used for float values; integers require a user override or conversion to be interpreted as numbers.
⁴ Time Series modeling requires a column_type_overrides entry and is typically applied to a derived numeric column created via a sql_lambdas entry.
⁵ Text embedding can be created for unstructured string fields with an average token count > 5; requires a column_type_overrides entry.
⁶ Image embedding can be created for entity-linked visual assets; accepts file paths, cloud URIs, or URLs and generates visual embeddings; requires a column_type_overrides entry.
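
The inference rules in these notes can be summarized as a small Python sketch. It is an illustration under stated assumptions rather than BaseModel's logic: the function name, the `is_float` flag, and the returned labels are hypothetical, and because the source gives the threshold only as approximately 2,500, the behavior at exactly that value is an arbitrary choice here.

```python
# Approximate threshold quoted in the notes above; the exact cut-off is not specified.
CARDINALITY_THRESHOLD = 2500


def infer_from_statistics(is_float: bool, n_unique: int) -> str:
    """Hypothetical summary of the automatic inference rules."""
    # Only float columns become Decimal; integers are treated by cardinality
    # unless the user overrides or converts them.
    if is_float:
        return "decimal"
    # Low-cardinality columns are one-hot encoded.
    if n_unique < CARDINALITY_THRESHOLD:
        return "categorical"
    # High-cardinality columns get a compressed (sketch) representation.
    return "categorical_compressed"


print(infer_from_statistics(is_float=False, n_unique=12))
print(infer_from_statistics(is_float=False, n_unique=180_000))
print(infer_from_statistics(is_float=True, n_unique=180_000))
```

Time Series, Text, and Image never come out of this path, since each requires an explicit column_type_overrides entry.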

Note: When using Parquet, type inference depends on the file's logical schema, so ensure consistent typing across partitions.

Handling Date type

A date column is only consumed as an event timestamp when it defines the temporal order of records in event tables. Otherwise, BaseModel does not directly use raw DATE or TIMESTAMP columns as model features.

If a date field contains meaningful business information — such as enrollment date, contract start date, or renewal date — it can be included in modeling in two ways:

Transforming it with sql_lambdas to create derived numeric or categorical features. For example, you can compute a customer’s tenure or the number of days since a particular event and then reference this new column as TENURE or similar:
```
sql_lambdas:
  - alias: TENURE
    expression: date_diff('year', ENROLMENT_DATE::DATE, DATE '2024-12-01')
```
Including it in the extra_columns block, which makes it accessible to a target function without using it directly as a model feature.
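
For the first option, it helps to check what a derived date feature should evaluate to before writing the SQL expression. The Python sketch below computes two common derivations, completed years and elapsed days, with hypothetical function names. Note that some SQL engines count calendar-year boundaries crossed rather than completed years in a year-level date difference, so verify the semantics of your engine against a few known values.

```python
from datetime import date


def full_years_between(start: date, end: date) -> int:
    """Completed years between two dates (anniversary-based)."""
    years = end.year - start.year
    # Subtract one year if the anniversary has not been reached yet.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def days_since(event_date: date, reference_date: date) -> int:
    """Number of days elapsed from event_date to reference_date."""
    return (reference_date - event_date).days


reference = date(2024, 12, 1)
tenure_years = full_years_between(date(2020, 12, 15), reference)  # 3 completed years
recency_days = days_since(date(2024, 11, 1), reference)           # 30 days
```

In this example a boundary-counting engine would report 4 years for the same pair of dates, which is the kind of discrepancy worth catching before the feature reaches training.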

Advanced feature type overrides

As mentioned above, BaseModel infers feature types automatically, but in some cases you may want to override the inferred type to produce richer feature representations. Three common advanced overrides are:

text — for longer string fields (e.g., descriptions, reviews)
image — for visual assets linked to entities (e.g., product photos)
time_series — for numeric signals that evolve over time

Each can be defined explicitly under the column_type_overrides section in your data source definition.

Text

If a string column contains long-form content — typically more than five words per record — BaseModel can embed it as a semantic vector instead of treating it as a categorical token. This allows the model to capture semantic meaning from unstructured text such as product descriptions, customer reviews, or user bios.

    column_type_overrides:
      description: text
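
Before adding such an override, you may want to check whether a column really holds long-form content. The following Python sketch estimates the average number of words per record; it assumes whitespace tokenization, which may differ from the tokenizer BaseModel uses, and the function name and default threshold argument are hypothetical.

```python
from typing import Iterable, Optional


def looks_like_free_text(values: Iterable[Optional[str]], min_avg_tokens: float = 5) -> bool:
    """Return True when non-empty records average more than min_avg_tokens words."""
    non_empty = [value for value in values if value and value.strip()]
    if not non_empty:
        return False
    # Whitespace splitting is a rough stand-in for real tokenization.
    average = sum(len(value.split()) for value in non_empty) / len(non_empty)
    return average > min_avg_tokens


descriptions = [
    "Lightweight waterproof hiking jacket with adjustable hood",
    "Soft cotton t-shirt available in five colours",
]
colours = ["red", "blue", "dark green"]

print(looks_like_free_text(descriptions))
print(looks_like_free_text(colours))
```

A column such as `colours` is better left to automatic inference, where it would be treated as categorical.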

Image

If your entities (e.g., products, places, properties) have associated images, you can include them as an image column. An image column holds a string path or URL pointing to the image file and can be either:

provided in a separate table keyed by entity_id, or
added to an attributes table that already references the same entity.

BaseModel automatically handles local paths, cloud URIs, or web URLs, processes the image through a vision backbone, and generates embeddings or sketches.

    column_type_overrides:
      image_url: image
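
Because an image column mixes several kinds of references, a quick audit of its values can catch malformed entries before training. The sketch below classifies a reference by its URL scheme using Python's standard library. The function name, the category labels, and the list of cloud schemes are assumptions made for illustration; the source does not state which storage schemes BaseModel accepts.

```python
from urllib.parse import urlparse

# Assumed examples of cloud storage schemes; not a list confirmed by BaseModel.
CLOUD_SCHEMES = {"s3", "gs", "abfs", "abfss"}


def classify_image_reference(reference: str) -> str:
    """Classify a string as a web URL, cloud URI, or local path."""
    scheme = urlparse(reference).scheme.lower()
    if scheme in ("http", "https"):
        return "web_url"
    if scheme in CLOUD_SCHEMES:
        return "cloud_uri"
    # No scheme, a file:// URI, or a single drive letter is treated as a local path.
    if scheme in ("", "file") or len(scheme) == 1:
        return "local_path"
    return "unknown"


for ref in ("https://example.com/p/123.jpg", "s3://bucket/p/123.jpg", "/data/images/123.jpg"):
    print(ref, classify_image_reference(ref))
```

Counting the `unknown` results over a sample of rows gives a cheap signal that a column needs cleaning before it is marked as image.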

Time series

Sometimes a numeric feature carries temporal structure — where patterns, trends, or seasonality matter more than single values. Examples include:

evolving balances or daily totals,
metrics or aggregates sampled over time,
or any sequence where the shape and trend convey information.

In these cases, create a derived numeric column with an entry in the sql_lambdas block and mark it as time_series under column_type_overrides. Doing so has two effects:

it instructs BaseModel to extract sequence-based features (e.g., temporal aggregates, slopes, and recency effects) rather than treating values independently,
it keeps date values interpretable and usable in downstream computations without treating them as raw features.

    sql_lambdas:
      - alias: price_ts
        expression: price
    column_type_overrides:
      price_ts: time_series
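
To illustrate what sequence-based features means in practice, the Python sketch below derives an aggregate, a trend, and a recency value from an ordered series. It is not BaseModel's temporal encoding, which is not described here; the function name and the choice of features (mean, least-squares slope over evenly spaced steps, last value) are assumptions for illustration.

```python
from typing import Dict, Sequence


def sequence_features(values: Sequence[float]) -> Dict[str, float]:
    """Summarize an ordered, evenly spaced numeric series."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    average = sum(values) / n
    if n == 1:
        slope = 0.0
    else:
        # Ordinary least-squares slope against the step index 0..n-1.
        x_mean = (n - 1) / 2
        numerator = sum((i - x_mean) * (v - average) for i, v in enumerate(values))
        denominator = sum((i - x_mean) ** 2 for i in range(n))
        slope = numerator / denominator
    return {"mean": average, "slope": slope, "last": values[-1]}


features = sequence_features([10.0, 12.0, 14.0, 16.0])
# mean is 13.0, slope is 2.0 per step, last is 16.0
```

Two series with the same mean can have opposite slopes, which is the information lost when each value is treated independently.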