Data Types & Feature Encoding
Snowflake · Data Types & Feature Encoding
Summary of Types
- Decimal – numbers; can be used as-is.
- Categorical – low-cardinality features encoded as one-hot (default for short
VARCHAR,BOOLEAN, smallINTEGER).- Sketch – compact learned embedding for high-cardinality columns (e.g.,
user_id,sku).- Date – currently not consumed by BaseModel; extract calendar features upstream.
- Text – tokenized text embeddings only when explicitly enabled.
- Time Series – sequential numeric features only when explicitly enabled.
Snowflake → Feature type mapping (defaults)
| Feature Type | Typical Snowflake column types | Encoding | When to use | Notes |
|---|---|---|---|---|
| Decimal | NUMBER, DECIMAL, FLOAT | raw | continuous measures (price, qty, score) | — |
| Categorical | BOOLEAN, VARCHAR (low cardinality), INT(low cardinality) | one-hot | flags, small enums (status, channel) | — |
| Sketch | VARCHAR/NUMBER/ INT with high cardinality | learned embedding | products/keys (e.g., article_id, sku) | Avoids huge one-hots; better model capacity. |
| Date | DATE, TIMESTAMP_NTZ/LTZ/TZ | (not used currently) | — | Extract calendar features in SQL; see below. |
| Text | long VARCHAR | tokenized text embedding | descriptions, reviews | Off by default; enable via override. |
| Time Series | numeric | sequential modeling | time-varying signals (sales/sensors over time, amounts etc.) | Enable via override. |
Important: Date/Datetime columns are ignored by the current model as inputs. Add time-derived features upstream.
How type inference works
- Automatic: we infer an initial type from Snowflake types + heuristics (e.g., low distinct count → Categorical).
- You can override: set the desired feature type per column (UI/YAML) and, where relevant, choose normalization or embedding.
Override example (YAML)
column_type_overrides:
amount: decimal
is_return: categorical
product_id: categorical_compressed
review_text: text
sales_ts: time_seriesDecimal
Use when: feature is numeric and continuous (not categorical). Rule of thumb: values where magnitude and relative differences matter.
Pros: preserves order and scale, good for regression and ranking tasks. Cons: sensitive to outliers, may require normalization or scaling.
Categorical
Use when: low cardinality (flags/enums, small stable IDs).
Rule of thumb: n_unique ≤ 200 → Categorical.
Pros: simple, interpretable. Cons: can explode if categories grow.
Snowflake tips:
-- normalize strings to reduce accidental cardinality
UPPER(TRIM(REPLACE(code,'-','_'))) AS code_norm;
-- check cardinality
SELECT COUNT(DISTINCT status) AS nunique FROM tbl;Sketch
Use when: high cardinality (user_id, sku, long tail; many unseen).
Rule of thumb: n_unique ≥ 200 or fast-growing vocab → Sketch.
Pros: compact, captures similarity, robust to tail. Cons: less directly interpretable; needs enough data to learn.
Date
Basemodel does not currently consume DATE/TIMESTAMP features directly.
Do this upstream in Snowflake:
SELECT
order_id,
order_ts,
EXTRACT(DAYOFWEEK FROM order_ts) AS dow,
EXTRACT(HOUR FROM order_ts) AS hour,
EXTRACT(DAY FROM order_ts) AS day,
EXTRACT(MONTH FROM order_ts) AS month,
EXTRACT(YEAR FROM order_ts) AS year,
DATEDIFF('minute', order_ts, CURRENT_TIMESTAMP()) AS minutes_since_order
FROM orders;Treat dow, hour, month, etc. as Categorical (one-hot) or Decimal (if you prefer numeric buckets).
Use “time since” (minutes_since_order) as a Decimal.
Text
Use when: unstructured text may provide additional context or signals Rule of thumb: enable only if text fields are informative and relatively clean
Pros: captures semantic meaning, allows richer feature representation via embeddings Cons: disabled by default to control cost and noise; may require cleaning (e.g. lowercasing, stripping markup)
Time Series
Use when: feature evolves over time and past values influence the future Rule of thumb: sequential signals like price history, transaction amount trends, end-of-day balances, sensor readings
Pros: captures temporal patterns, seasonality, and trends Cons: more complex to model; requires sufficient history and sequence handling
Updated about 1 month ago
