Snowflake · Data Types & Feature Encoding

Summary of Types

Decimal – numbers; can be used as-is.

Categorical – low-cardinality features encoded as one-hot (default for short VARCHAR, BOOLEAN, small INTEGER).

Sketch – compact learned embedding for high-cardinality columns (e.g., user_id, sku).

Date – currently not consumed by BaseModel; extract calendar features upstream.

Text – tokenized text embeddings only when explicitly enabled.

Time Series – sequential numeric features only when explicitly enabled.

Snowflake → Feature type mapping (defaults)

Feature Type	Typical Snowflake column types	Encoding	When to use	Notes
Decimal	`NUMBER`, `DECIMAL`, `FLOAT`	raw	continuous measures (price, qty, score)	—
Categorical	`BOOLEAN`, `VARCHAR` (low cardinality), `INT`(low cardinality)	one-hot	flags, small enums (status, channel)	—
Sketch	`VARCHAR`/`NUMBER`/ `INT` with high cardinality	learned embedding	products/keys (e.g., `article_id`, `sku`)	Avoids huge one-hots; better model capacity.
Date	`DATE`, `TIMESTAMP_NTZ/LTZ/TZ`	(not used currently)	—	Extract calendar features in SQL; see below.
Text	long `VARCHAR`	tokenized text embedding	descriptions, reviews	Off by default; enable via override.
Time Series	numeric	sequential modeling	time-varying signals (sales/sensors over time, amounts etc.)	Enable via override.

Important: Date/Datetime columns are ignored by the current model as inputs. Add time-derived features upstream.

How type inference works

Automatic: we infer an initial type from Snowflake types + heuristics (e.g., low distinct count → Categorical).
You can override: set the desired feature type per column (UI/YAML) and, where relevant, choose normalization or embedding.

Override example (YAML)

column_type_overrides:
  amount: decimal
  is_return: categorical
  product_id: categorical_compressed               
  review_text: text         
  sales_ts: time_series

Decimal

Use when: feature is numeric and continuous (not categorical). Rule of thumb: values where magnitude and relative differences matter.

Pros: preserves order and scale, good for regression and ranking tasks. Cons: sensitive to outliers, may require normalization or scaling.

Categorical

Use when: low cardinality (flags/enums, small stable IDs). Rule of thumb: n_unique ≤ 200 → Categorical.

Pros: simple, interpretable. Cons: can explode if categories grow.

Snowflake tips:

-- normalize strings to reduce accidental cardinality
UPPER(TRIM(REPLACE(code,'-','_'))) AS code_norm;

-- check cardinality
SELECT COUNT(DISTINCT status) AS nunique FROM tbl;

Sketch

Use when: high cardinality (user_id, sku, long tail; many unseen). Rule of thumb: n_unique ≥ 200 or fast-growing vocab → Sketch.

Pros: compact, captures similarity, robust to tail. Cons: less directly interpretable; needs enough data to learn.

Date

Basemodel does not currently consume DATE/TIMESTAMP features directly.

Do this upstream in Snowflake:

SELECT
  order_id,
  order_ts,
  EXTRACT(DAYOFWEEK FROM order_ts) AS dow,
  EXTRACT(HOUR      FROM order_ts) AS hour,
  EXTRACT(DAY       FROM order_ts) AS day,
  EXTRACT(MONTH     FROM order_ts) AS month,
  EXTRACT(YEAR      FROM order_ts) AS year,
  DATEDIFF('minute', order_ts, CURRENT_TIMESTAMP()) AS minutes_since_order
FROM orders;

Treat dow, hour, month, etc. as Categorical (one-hot) or Decimal (if you prefer numeric buckets).

Use “time since” (minutes_since_order) as a Decimal.

Text

Use when: unstructured text may provide additional context or signals Rule of thumb: enable only if text fields are informative and relatively clean

Pros: captures semantic meaning, allows richer feature representation via embeddings Cons: disabled by default to control cost and noise; may require cleaning (e.g. lowercasing, stripping markup)

Time Series

Use when: feature evolves over time and past values influence the future Rule of thumb: sequential signals like price history, transaction amount trends, end-of-day balances, sensor readings

Pros: captures temporal patterns, seasonality, and trends Cons: more complex to model; requires sufficient history and sequence handling