Data Connectors

BaseModel reads data from multiple database types. This page documents the connection_params block for each supported database within the YAML configuration.

Overview

| Database | database_type | Required fields |
| --- | --- | --- |
| Parquet | parquet | path |
| Snowflake | snowflake | user, password, account, warehouse, database, db_schema |
| BigQuery | bigquery | filename |
| Databricks | databricks | host, warehouse_id, (token or client_id + client_secret) |
| Hive | hive | hive_params |
| ClickHouse | clickhouse | host |
| Synapse | synapse | server_name, database_name, user |

Parquet

Read from local or cloud-stored Parquet files. Supports single files, directories, and glob patterns.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| path | Path | required | Path to the Parquet file or directory. Glob patterns supported (e.g., *.parquet). |
| cache_path | Path \| None | None | Database cache path. Can be reused between trainings for faster restarts. |
| config_overrides | dict | {} | DuckDB connection overrides (e.g., {"max_memory": "100GB"}). See DuckDB configuration reference. |

data_location:
  database_type: parquet
  connection_params:
    path: "/data/transactions.parquet"
    cache_path: "/basemodel/db_cache/"
  table_name: transactions

Glob pattern example:

connection_params:
  path: "/data/transactions/*.parquet"
  cache_path: "/basemodel/db_cache/"
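
The parameter table also allows a directory path and DuckDB overrides. The fragment below is a minimal sketch combining the two. The directory /data/transactions/ is a placeholder, and the max_memory value is copied from the table's example rather than offered as a recommendation.

```yaml
data_location:
  database_type: parquet
  connection_params:
    path: "/data/transactions/"        # placeholder: a directory of Parquet files
    cache_path: "/basemodel/db_cache/"
    config_overrides:
      max_memory: "100GB"              # DuckDB setting; value taken from the table's example
  table_name: transactions
```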

Snowflake

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| user | str | required | Login name of the user. Supports environment variables (e.g., ${SNOWFLAKE_USER}). |
| password | str | required | Password for the user. Supports environment variables. |
| account | str | required | Snowflake account identifier. |
| warehouse | str | required | Virtual warehouse to use. |
| database | str | required | Default database. |
| db_schema | str | required | Default schema (alias: schema). |
| role | str | "PUBLIC" | Snowflake role to use for the session. |

data_location:
  database_type: snowflake
  connection_params:
    user: "${SNOWFLAKE_USER}"
    password: "${SNOWFLAKE_PASSWORD}"
    account: "${SNOWFLAKE_ACCOUNT}"
    warehouse: "${SNOWFLAKE_WAREHOUSE}"
    database: "${SNOWFLAKE_DATABASE}"
    schema: "${SNOWFLAKE_SCHEMA}"
    role: "${SNOWFLAKE_ROLE}"
  table_name: transactions

OAuth Token Authentication (Snowflake Container Services)

For Snowflake Container Services, use the OAuth token-based configuration. This variant reads the token from /snowflake/session/token and uses SNOWFLAKE_ACCOUNT and SNOWFLAKE_HOST environment variables. Fields: warehouse, database, db_schema, authenticator (set to "oauth").
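
As a rough sketch, the fields listed above might be laid out as follows. It assumes this variant keeps database_type: snowflake and the same connection_params nesting as the password-based example, neither of which this page confirms. The environment variable names are reused from that example.

```yaml
data_location:
  database_type: snowflake              # assumption: same database_type as the password-based variant
  connection_params:
    warehouse: "${SNOWFLAKE_WAREHOUSE}"
    database: "${SNOWFLAKE_DATABASE}"
    db_schema: "${SNOWFLAKE_SCHEMA}"
    authenticator: "oauth"              # token is read from /snowflake/session/token
  table_name: transactions
```

No user or password appears here because the token, SNOWFLAKE_ACCOUNT, and SNOWFLAKE_HOST are taken from the container environment.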


BigQuery

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| filename | Path | required | Path to the service account JSON file. |
| project_id | str \| None | None | BigQuery project ID. Only needed if different from the one in the service account. |

data_location:
  database_type: bigquery
  connection_params:
    filename: "/secrets/bigquery-user.json"
    project_id: "my-bigquery-project"
  table_name: transactions
  schema_name: my_dataset

Databricks

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| host | str | required | Server hostname for your Databricks cluster or SQL warehouse. |
| warehouse_id | str | required | SQL warehouse ID for your Databricks SQL warehouse. |
| token | str \| None | None | Personal access token. Mutually exclusive with client_id/client_secret. |
| client_id | str \| None | None | Service principal client ID. Requires client_secret. |
| client_secret | str \| None | None | Service principal client secret. Requires client_id. |
| catalog | str \| None | None | Unity Catalog name. |
| db_schema | str | "default" | Schema name (alias: schema). |

Warning

token and client_id/client_secret are mutually exclusive. Use either personal access token or service principal authentication, not both.

# Personal access token authentication
data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_SERVER_HOSTNAME}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    token: "${DATABRICKS_TOKEN}"
  table_name: transactions

# Service principal authentication
data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_SERVER_HOSTNAME}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    client_id: "${DATABRICKS_CLIENT_ID}"
    client_secret: "${DATABRICKS_CLIENT_SECRET}"
  table_name: transactions
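
Neither example above sets the optional catalog and db_schema parameters. The sketch below adds them to the token-based variant. The values "main" and "sales" are placeholders, not names this page prescribes.

```yaml
# Personal access token authentication with an explicit catalog and schema
data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_SERVER_HOSTNAME}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    token: "${DATABRICKS_TOKEN}"
    catalog: "main"        # placeholder Unity Catalog name
    db_schema: "sales"     # placeholder; falls back to "default" when omitted
  table_name: transactions
```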

Hive

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hive_params | HiveParamsConfig | required | Connection parameters. Configure via DSN or Driver + Port + HiveServerType. |
| ini_file | str \| None | $ODBCINI env var | Path to ODBC .ini file. Required when using DSN. |
| kerberos_params | KerberosParamsConfig \| None | None | Kerberos authentication parameters. Required for Kerberos-secured clusters. |

HiveParamsConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| DSN | str \| None | None | ODBC Data Source Name. |
| Driver | str \| None | None | ODBC driver path. Required if DSN is not set. |
| Port | int \| None | None | Hive server port. Required if DSN is not set. |
| HiveServerType | int \| None | None | Hive server type (e.g., 2). Required if DSN is not set. |

KerberosParamsConfig

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| user | str | required | Kerberos principal name. |
| kinit_realm | str | required | Kerberos realm. |
| kerberos_host | str | required | Kerberos service host IP. |
| kerberos_service_name | str | required | Kerberos service name (e.g., "hive"). |
| kerberos_fqdn | str | required | Fully qualified domain name. |
| keytab_path | str | required | Path to the keytab file. |
| krb5_config_path | str | "/etc/krb5.conf" | Path to krb5.conf. |
| password | str \| None | None | Password (plain text or file path). File-based works only with Heimdal Kerberos. |
| kerberos_renewal_interval_minutes | int | 540 | Ticket renewal interval in minutes. |
| verbose | bool | False | Whether to print verbose output. |

# DSN-based configuration
data_location:
  database_type: hive
  connection_params:
    hive_params:
      DSN: "MyHiveDSN"
    ini_file: "/etc/odbc.ini"
    kerberos_params:
      user: "hive_user@REALM"
      kinit_realm: "REALM"
      kerberos_host: "10.0.0.1"
      kerberos_service_name: "hive"
      kerberos_fqdn: "hive-server.example.com"
      keytab_path: "/etc/security/keytabs/hive.keytab"
  table_name: transactions
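
The HiveParamsConfig table also describes a configuration without a DSN. A minimal sketch of that form follows, using only the fields listed in the table. The driver path and port are placeholders, and kerberos_params is left out on the assumption that the cluster is not Kerberos-secured.

```yaml
# Driver-based configuration (no DSN, so ini_file is not set)
data_location:
  database_type: hive
  connection_params:
    hive_params:
      Driver: "/opt/hive-odbc/lib/libhiveodbc.so"   # placeholder ODBC driver path
      Port: 10000                                    # placeholder Hive server port
      HiveServerType: 2
  table_name: transactions
```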

ClickHouse

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| host | str | required | ClickHouse connection string (e.g., clickhouse://user:password@host:port/database). |

Warning

The CLICKHOUSE_VERSION environment variable must be set. Supported versions: 22.8 through 24.4.

data_location:
  database_type: clickhouse
  connection_params:
    host: "${CLICKHOUSE_HOST}"
  table_name: transactions

Synapse

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| server_name | str | required | Azure Synapse server name. |
| database_name | str | required | Dedicated SQL pool name. |
| user | str | required | Username. Supports environment variables. |
| password | str \| None | None | Password. Supports environment variables. |

data_location:
  database_type: synapse
  connection_params:
    server_name: "my-synapse.sql.azuresynapse.net"
    database_name: "my_pool"
    user: "${SYNAPSE_USER}"
    password: "${SYNAPSE_PASSWORD}"
  table_name: transactions

Column Selection

Control which columns are included in training using allowed_columns or disallowed_columns on any data source.

| Parameter | Type | Description |
| --- | --- | --- |
| allowed_columns | list[str] | Only these columns will be used. All others are excluded. |
| disallowed_columns | list[str] | These columns are excluded. All others are used. |

Note

allowed_columns and disallowed_columns are mutually exclusive. Use disallowed_columns to drop PII, unique IDs, or columns that would cause data leakage.

# Drop columns that add no signal
disallowed_columns: ["order_id", "internal_id"]

# Or explicitly select columns
allowed_columns: ["customer_id", "article_id", "t_dat", "price", "sales_channel_id"]

Date Column Formats

The date_column block specifies the event timestamp column and its format.

| Format String | Example Value |
| --- | --- |
| "%Y-%m-%d" | 2024-01-15 |
| "%Y-%m-%d %H:%M:%S" | 2024-01-15 14:30:00 |
| "%d/%m/%Y" | 15/01/2024 |
| "unix" | 1705312200 (seconds since epoch) |
| "unix_ms" | 1705312200000 (milliseconds since epoch) |

date_column:
  name: t_dat
  format: "%Y-%m-%d"

Date format depends on database engine

The format string syntax differs by database:

  • Snowflake uses SQL-standard tokens: YYYY-MM-DD, YYYY-MM-DD HH24:MI:SS
  • All other engines (Parquet, BigQuery, Databricks, ClickHouse, Hive) use Python strftime: %Y-%m-%d, %Y-%m-%d %H:%M:%S

Using the wrong syntax can make date filtering fail silently or raise "Can't parse date" errors at training time.
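
To make the difference concrete, the sketch below writes the same timestamp layout in both syntaxes, as two separate YAML documents. The column name t_dat is reused from the example above. Which block applies depends on the database_type of the data source.

```yaml
# Snowflake data source: SQL-standard tokens
date_column:
  name: t_dat
  format: "YYYY-MM-DD HH24:MI:SS"
---
# Parquet, BigQuery, Databricks, ClickHouse, or Hive data source: Python strftime
date_column:
  name: t_dat
  format: "%Y-%m-%d %H:%M:%S"
```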


Joining Tables

Use joined_data_sources on event data sources to join attribute (dimension) tables.

# On the event data source
joined_data_sources:
  - name: articles           # Name of the attribute data source
    join_on:
      - [article_id, article_id]  # [event_column, attribute_column]

Multiple joins are supported:

joined_data_sources:
  - name: articles
    join_on:
      - [article_id, article_id]
  - name: stores
    join_on:
      - [store_id, store_id]

Tip

The attribute data source must be defined separately in data_sources with type: attribute. The join_on pairs map [event_table_column, attribute_table_column].
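
Because each pair is [event_table_column, attribute_table_column], the two column names do not have to match. The sketch below assumes a hypothetical event column product_code that holds the same values as article_id in the articles attribute table.

```yaml
# On the event data source
joined_data_sources:
  - name: articles
    join_on:
      - [product_code, article_id]   # hypothetical event column, matched to articles.article_id
```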