Skip to content

Snowflake · Data Types & Feature Encoding

Summary of Types

  • Decimal – numbers; can be used as-is.
  • Categorical – low-cardinality features encoded as one-hot (default for short VARCHAR, BOOLEAN, small INTEGER).
  • Sketchcompact learned embedding for high-cardinality columns (e.g., user_id, sku).
  • Datecurrently not consumed by BaseModel; extract calendar features upstream.
  • Text – tokenized text embeddings only when explicitly enabled.
  • Time Series – sequential numeric features only when explicitly enabled.

Snowflake → Feature type mapping (defaults)

Feature Type Typical Snowflake column types Encoding When to use Notes
Decimal NUMBER, DECIMAL, FLOAT raw continuous measures (price, qty, score)
Categorical BOOLEAN, VARCHAR (low cardinality), INT(low cardinality) one-hot flags, small enums (status, channel)
Sketch VARCHAR/NUMBER/ INT with high cardinality learned embedding products/keys (e.g., article_id, sku) Avoids huge one-hots; better model capacity.
Date DATE, TIMESTAMP_NTZ/LTZ/TZ (not used currently) Extract calendar features in SQL; see below.
Text long VARCHAR tokenized text embedding descriptions, reviews Off by default; enable via override.
Time Series numeric sequential modeling time-varying signals (sales/sensors over time, amounts etc.) Enable via override.

Important: Date/Datetime columns are ignored by the current model as inputs. Add time-derived features upstream.


How type inference works

  • Automatic: we infer an initial type from Snowflake types + heuristics (e.g., low distinct count → Categorical).
  • You can override: set the desired feature type per column (UI/YAML) and, where relevant, choose normalization or embedding.

Override example (YAML)

column_type_overrides:
  amount: decimal
  is_return: categorical
  product_id: categorical_compressed               
  review_text: text         
  sales_ts: time_series


Decimal

Use when: feature is numeric and continuous (not categorical). Rule of thumb: values where magnitude and relative differences matter.

Pros: preserves order and scale, good for regression and ranking tasks. Cons: sensitive to outliers, may require normalization or scaling.


Categorical

Use when: low cardinality (flags/enums, small stable IDs). Rule of thumb: low cardinality → Categorical. Check the columns analysis report for BaseModel's recommendation.

Pros: simple, interpretable. Cons: can explode if categories grow.

Snowflake tips:

-- normalize strings to reduce accidental cardinality
UPPER(TRIM(REPLACE(code,'-','_'))) AS code_norm;

-- check cardinality
SELECT COUNT(DISTINCT status) AS nunique FROM tbl;

Sketch

Use when: high cardinality (user_id, sku, long tail; many unseen). Rule of thumb: high cardinality or fast-growing vocab → Sketch. BaseModel chooses automatically; override if needed.

Pros: compact, captures similarity, robust to tail. Cons: less directly interpretable; needs enough data to learn.

Date

BaseModel does not currently consume DATE/TIMESTAMP features directly.

Do this upstream in Snowflake:

SELECT
  order_id,
  order_ts,
  EXTRACT(DAYOFWEEK FROM order_ts) AS dow,
  EXTRACT(HOUR      FROM order_ts) AS hour,
  EXTRACT(DAY       FROM order_ts) AS day,
  EXTRACT(MONTH     FROM order_ts) AS month,
  EXTRACT(YEAR      FROM order_ts) AS year,
  DATEDIFF('minute', order_ts, CURRENT_TIMESTAMP()) AS minutes_since_order
FROM orders;

Treat dow, hour, month, etc. as Categorical (one-hot) or Decimal (if you prefer numeric buckets).

Use “time since” (minutes_since_order) as a Decimal.


Text

Use when: unstructured text may provide additional context or signals Rule of thumb: enable only if text fields are informative and relatively clean

Pros: captures semantic meaning, allows richer feature representation via embeddings Cons: disabled by default to control cost and noise; may require cleaning (e.g. lowercasing, stripping markup)


Time Series

Use when: feature evolves over time and past values influence the future Rule of thumb: sequential signals like price history, transaction amount trends, end-of-day balances, sensor readings

Pros: captures temporal patterns, seasonality, and trends Cons: more complex to model; requires sufficient history and sequence handling