Data Types & Feature Encoding

Snowflake · Data Types & Feature Encoding

Summary of Types

  • Decimal – numbers; can be used as-is.
  • Categorical – low-cardinality features encoded as one-hot (default for short VARCHAR, BOOLEAN, small INTEGER).
  • Sketch – compact learned embedding for high-cardinality columns (e.g., user_id, sku).
  • Date – currently not consumed by BaseModel; extract calendar features upstream.
  • Text – tokenized text embeddings only when explicitly enabled.
  • Time Series – sequential numeric features only when explicitly enabled.

Snowflake → Feature type mapping (defaults)

Feature Type | Typical Snowflake column types | Encoding | When to use | Notes
-------------|--------------------------------|----------|-------------|------
Decimal | NUMBER, DECIMAL, FLOAT | raw | continuous measures (price, qty, score) | -
Categorical | BOOLEAN, VARCHAR (low cardinality), INT (low cardinality) | one-hot | flags, small enums (status, channel) | -
Sketch | VARCHAR/NUMBER/INT with high cardinality | learned embedding | products/keys (e.g., article_id, sku) | Avoids huge one-hots; better model capacity.
Date | DATE, TIMESTAMP_NTZ/LTZ/TZ | (not used currently) | - | Extract calendar features in SQL; see below.
Text | long VARCHAR | tokenized text embedding | descriptions, reviews | Off by default; enable via override.
Time Series | numeric | sequential modeling | time-varying signals (sales/sensors over time, amounts etc.) | Enable via override.

Important: Date/Datetime columns are ignored by the current model as inputs. Add time-derived features upstream.


How type inference works

  • Automatic: we infer an initial type from Snowflake types + heuristics (e.g., low distinct count → Categorical).
  • You can override: set the desired feature type per column (UI/YAML) and, where relevant, choose normalization or embedding.
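The inference step can be pictured with a minimal Python sketch. This is not the product's actual logic: the function name infer_feature_type, the type sets, and the use of the 200-distinct-values rule of thumb as a hard cutoff are all assumptions made for illustration. In this simplified version NUMBER/DECIMAL/FLOAT always map to Decimal, and Text and Time Series are never inferred because they are enabled only via override.

```python
from typing import Optional

# Cutoff borrowed from this guide's rule of thumb; treating it as a hard limit is an assumption.
CATEGORICAL_MAX_UNIQUE = 200

CONTINUOUS_TYPES = {"NUMBER", "DECIMAL", "FLOAT"}
CARDINALITY_TYPES = {"VARCHAR", "INT", "INTEGER"}
DATE_TYPES = {"DATE", "TIMESTAMP_NTZ", "TIMESTAMP_LTZ", "TIMESTAMP_TZ"}


def infer_feature_type(snowflake_type: str, n_unique: int) -> Optional[str]:
    """Hypothetical helper: guess a default feature type for one column."""
    # Drop precision/length suffixes, e.g. "VARCHAR(16)" -> "VARCHAR".
    base = snowflake_type.upper().split("(")[0].strip()
    if base in DATE_TYPES:
        return None  # date/time columns are not consumed as model inputs
    if base == "BOOLEAN":
        return "Categorical"
    if base in CARDINALITY_TYPES:
        return "Categorical" if n_unique <= CATEGORICAL_MAX_UNIQUE else "Sketch"
    if base in CONTINUOUS_TYPES:
        return "Decimal"
    return None  # unknown type: leave the decision to an explicit override
```

For example, infer_feature_type("VARCHAR(16)", 5) yields "Categorical", while the same column with 50,000 distinct values yields "Sketch".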

Override example (YAML)

column_type_overrides:
  amount: decimal
  is_return: categorical
  product_id: categorical_compressed               
  review_text: text         
  sales_ts: time_series
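Conceptually, overrides are layered on top of the inferred defaults. The Python sketch below shows one way that merge could work. The helper apply_overrides and the KNOWN_TYPES set are hypothetical: the set contains only the values that appear in the YAML example above, and the real list of accepted values may differ.

```python
# Values taken from the YAML example above; the full accepted set is an assumption.
KNOWN_TYPES = {"decimal", "categorical", "categorical_compressed", "text", "time_series"}


def apply_overrides(inferred: dict, overrides: dict) -> dict:
    """Hypothetical helper: overlay per-column overrides on inferred defaults."""
    unknown_cols = set(overrides) - set(inferred)
    if unknown_cols:
        raise KeyError(f"override for unknown column(s): {sorted(unknown_cols)}")
    bad_types = {col: t for col, t in overrides.items() if t not in KNOWN_TYPES}
    if bad_types:
        raise ValueError(f"unrecognized feature type(s): {bad_types}")
    # Overrides win; columns without an override keep their inferred type.
    return {**inferred, **overrides}


inferred = {"amount": "decimal", "product_id": "categorical", "review_text": "categorical"}
overrides = {"product_id": "categorical_compressed", "review_text": "text"}
resolved = apply_overrides(inferred, overrides)
```

Failing fast on misspelled column names or type values keeps a typo in the override file from silently falling back to the inferred default.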

Decimal

Use when: feature is numeric and continuous (not categorical). Rule of thumb: values where magnitude and relative differences matter.

Pros: preserves order and scale, good for regression and ranking tasks. Cons: sensitive to outliers, may require normalization or scaling.
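To illustrate the outlier sensitivity mentioned above, here is a minimal Python sketch of one common remedy: clip values to a quantile range, then standardize. It is an upstream illustration only, not a description of what the platform does internally. The function name clip_and_standardize and the 1%/99% defaults are assumptions.

```python
import statistics


def clip_and_standardize(values, lower_pct=0.01, upper_pct=0.99):
    """Clip to approximate quantiles, then rescale to zero mean and unit variance."""
    ordered = sorted(values)
    n = len(ordered)
    # Simple index-based quantile estimate (no interpolation).
    lo = ordered[int(lower_pct * (n - 1))]
    hi = ordered[int(upper_pct * (n - 1))]
    clipped = [min(max(v, lo), hi) for v in values]
    mean = statistics.fmean(clipped)
    std = statistics.pstdev(clipped)
    if std == 0:
        return [0.0 for _ in clipped]  # constant column carries no signal
    return [(v - mean) / std for v in clipped]


prices = [10.0, 12.0, 11.0, 9.0, 10_000.0]
scaled = clip_and_standardize(prices)
```

In this example the 10,000.0 outlier is clipped to 12.0 before scaling, so it no longer dominates the mean and variance.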


Categorical

Use when: low cardinality (flags/enums, small stable IDs). Rule of thumb: n_unique ≤ 200 → Categorical.

Pros: simple, interpretable. Cons: can explode if categories grow.

Snowflake tips:

-- normalize strings to reduce accidental cardinality
SELECT UPPER(TRIM(REPLACE(code, '-', '_'))) AS code_norm FROM tbl;

-- check cardinality
SELECT COUNT(DISTINCT status) AS nunique FROM tbl;
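The effect of normalization on cardinality can be checked on a small sample. The Python sketch below mirrors the SQL expression above; normalize_code and the sample values are made up for illustration.

```python
def normalize_code(code: str) -> str:
    """Mirror UPPER(TRIM(REPLACE(code, '-', '_'))) for a single value."""
    # strip(" ") removes only spaces, matching TRIM's default behavior.
    return code.replace("-", "_").strip(" ").upper()


raw_codes = ["ab-1", " AB_1 ", "Ab-1", "cd-2"]  # illustrative sample values
n_unique_raw = len(set(raw_codes))
n_unique_norm = len({normalize_code(c) for c in raw_codes})
```

Here four raw spellings collapse into two normalized categories (AB_1 and CD_2), which halves the one-hot width for this column.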

Sketch

Use when: high cardinality (user_id, sku, long tail; many unseen). Rule of thumb: n_unique > 200 or fast-growing vocab → Sketch.

Pros: compact, captures similarity, robust to tail. Cons: less directly interpretable; needs enough data to learn.

Date

BaseModel does not currently consume DATE/TIMESTAMP features directly.

Do this upstream in Snowflake:

SELECT
  order_id,
  order_ts,
  EXTRACT(DAYOFWEEK FROM order_ts) AS dow,
  EXTRACT(HOUR      FROM order_ts) AS hour,
  EXTRACT(DAY       FROM order_ts) AS day,
  EXTRACT(MONTH     FROM order_ts) AS month,
  EXTRACT(YEAR      FROM order_ts) AS year,
  DATEDIFF('minute', order_ts, CURRENT_TIMESTAMP()) AS minutes_since_order
FROM orders;

Treat dow, hour, month, etc. as Categorical (one-hot) or Decimal (if you prefer numeric buckets).

Use “time since” (minutes_since_order) as a Decimal.
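If the extraction happens in a Python preprocessing step instead of SQL, the same features can be derived as in the sketch below. The function calendar_features is hypothetical. Two assumptions are labeled in the comments: the day-of-week numbering (0 = Sunday), which in Snowflake depends on the WEEK_START session parameter, and the minute difference, which is computed between minute boundaries to approximate DATEDIFF.

```python
from datetime import datetime


def _floor_to_minute(ts: datetime) -> datetime:
    return ts.replace(second=0, microsecond=0)


def calendar_features(order_ts: datetime, now: datetime) -> dict:
    """Hypothetical helper: derive the calendar features from the query above."""
    minutes = (_floor_to_minute(now) - _floor_to_minute(order_ts)).total_seconds() // 60
    return {
        # Assumed convention: 0 = Sunday ... 6 = Saturday.
        "dow": order_ts.isoweekday() % 7,
        "hour": order_ts.hour,
        "day": order_ts.day,
        "month": order_ts.month,
        "year": order_ts.year,
        # Counts minute boundaries crossed, approximating DATEDIFF('minute', ...).
        "minutes_since_order": int(minutes),
    }


features = calendar_features(
    datetime(2024, 3, 10, 14, 30, 45),  # a Sunday
    now=datetime(2024, 3, 10, 15, 0, 10),
)
```

Passing now explicitly, rather than reading the clock inside the function, keeps the "time since" value reproducible.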


Text

Use when: unstructured text may provide additional context or signals. Rule of thumb: enable only if text fields are informative and relatively clean.

Pros: captures semantic meaning, allows richer feature representation via embeddings. Cons: disabled by default to control cost and noise; may require cleaning (e.g. lowercasing, stripping markup).
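The cleaning mentioned above can be as simple as the following Python sketch, which strips markup, decodes HTML entities, collapses whitespace, and lowercases. The helper clean_text is illustrative. The regex-based tag removal is a rough approximation suited to lightly marked-up text, not a full HTML parser.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip simple markup, decode entities, collapse whitespace, lowercase."""
    no_tags = _TAG_RE.sub(" ", raw)
    unescaped = html.unescape(no_tags)
    return _WS_RE.sub(" ", unescaped).strip().lower()


cleaned = clean_text("<p>Soft &amp; warm,  <b>LOVED</b> it!</p>")
```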


Time Series

Use when: feature evolves over time and past values influence the future. Rule of thumb: sequential signals like price history, transaction amount trends, end-of-day balances, sensor readings.

Pros: captures temporal patterns, seasonality, and trends. Cons: more complex to model; requires sufficient history and sequence handling.
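As a rough picture of what "sufficient history and sequence handling" involves, the Python sketch below groups raw rows into time-ordered sequences per entity and drops entities with too little history. The helper build_sequences, the row layout (entity, timestamp, value), and the min_length default are all assumptions made for illustration.

```python
from collections import defaultdict


def build_sequences(rows, min_length=3):
    """Group (entity, timestamp, value) rows into time-ordered value lists."""
    grouped = defaultdict(list)
    for entity, ts, value in rows:
        grouped[entity].append((ts, value))
    sequences = {}
    for entity, points in grouped.items():
        if len(points) < min_length:
            continue  # not enough history to form a useful sequence
        points.sort(key=lambda p: p[0])  # order by timestamp
        sequences[entity] = [value for _, value in points]
    return sequences


rows = [
    ("acct_1", "2024-01-03", 120.0),
    ("acct_1", "2024-01-01", 100.0),
    ("acct_1", "2024-01-02", 110.0),
    ("acct_2", "2024-01-01", 50.0),
]
sequences = build_sequences(rows)
```

ISO-formatted date strings sort chronologically, so they work as timestamps in this example. In practice the rows would come from a query ordered by the event time column.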