Data Connectors

BaseModel can read data from several database types. This page documents the `connection_params` block of the YAML configuration for each supported database.

Overview

Database	`database_type`	Required fields
Parquet	`parquet`	`path`
Snowflake	`snowflake`	`user`, `password`, `account`, `warehouse`, `database`, `db_schema`
BigQuery	`bigquery`	`filename`
Databricks	`databricks`	`host`, `warehouse_id`, (`token` or `client_id` + `client_secret`)
Hive	`hive`	`hive_params`
ClickHouse	`clickhouse`	`host`
Synapse	`synapse`	`server_name`, `database_name`, `user`

Parquet

Read from local or cloud-stored Parquet files. Supports single files, directories, and glob patterns.

Parameter	Type	Default	Description
`path`	`Path`	required	Path to the Parquet file or directory. Glob patterns supported (e.g., `*.parquet`).
`cache_path`	`Path | None`	`None`	Database cache path. Can be reused between trainings for faster restarts.
`config_overrides`	`dict`	`{}`	DuckDB connection overrides (e.g., `{"max_memory": "100GB"}`). See DuckDB configuration reference.

data_location:
  database_type: parquet
  connection_params:
    path: "/data/transactions.parquet"
    cache_path: "/basemodel/db_cache/"
  table_name: transactions

Glob pattern example:

connection_params:
  path: "/data/transactions/*.parquet"
  cache_path: "/basemodel/db_cache/"
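The `config_overrides` parameter from the table above can sit alongside `path` and `cache_path`. A minimal sketch, reusing the `max_memory` value quoted in the parameter description; the value is illustrative and should be sized to the memory actually available on the host:

```yaml
data_location:
  database_type: parquet
  connection_params:
    path: "/data/transactions/*.parquet"
    cache_path: "/basemodel/db_cache/"
    # DuckDB setting taken from the parameter table; adjust to the host
    config_overrides:
      max_memory: "100GB"
  table_name: transactions
```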

Snowflake

Parameter	Type	Default	Description
`user`	`str`	required	Login name of the user. Supports environment variables (e.g., `${SNOWFLAKE_USER}`).
`password`	`str`	required	Password for the user. Supports environment variables.
`account`	`str`	required	Snowflake account identifier.
`warehouse`	`str`	required	Virtual warehouse to use.
`database`	`str`	required	Default database.
`db_schema`	`str`	required	Default schema (alias: `schema`).
`role`	`str`	`"PUBLIC"`	Snowflake role to use for the session.

data_location:
  database_type: snowflake
  connection_params:
    user: "${SNOWFLAKE_USER}"
    password: "${SNOWFLAKE_PASSWORD}"
    account: "${SNOWFLAKE_ACCOUNT}"
    warehouse: "${SNOWFLAKE_WAREHOUSE}"
    database: "${SNOWFLAKE_DATABASE}"
    schema: "${SNOWFLAKE_SCHEMA}"
    role: "${SNOWFLAKE_ROLE}"
  table_name: transactions

OAuth Token Authentication (Snowflake Container Services)

For Snowflake Container Services, use the OAuth token-based configuration. This variant reads the token from `/snowflake/session/token` and takes the account and host from the `SNOWFLAKE_ACCOUNT` and `SNOWFLAKE_HOST` environment variables. Fields: `warehouse`, `database`, `db_schema`, and `authenticator` (set to `"oauth"`).
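A sketch of what such a configuration might look like, built only from the four fields listed for this variant. It assumes that `database_type` stays `snowflake` and that `user` and `password` are omitted because the token replaces them; neither point is confirmed on this page. The environment variable names are reused from the password-based example.

```yaml
data_location:
  database_type: snowflake          # assumption: same type as password-based auth
  connection_params:
    warehouse: "${SNOWFLAKE_WAREHOUSE}"
    database: "${SNOWFLAKE_DATABASE}"
    db_schema: "${SNOWFLAKE_SCHEMA}"
    authenticator: "oauth"          # token is read from /snowflake/session/token
  table_name: transactions
```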

BigQuery

Parameter	Type	Default	Description
`filename`	`Path`	required	Path to the service account JSON file.
`project_id`	`str | None`	`None`	BigQuery project ID. Only needed if different from the one in the service account.

data_location:
  database_type: bigquery
  connection_params:
    filename: "/secrets/bigquery-user.json"
    project_id: "my-bigquery-project"
  table_name: transactions
  schema_name: my_dataset

Databricks

Parameter	Type	Default	Description
`host`	`str`	required	Server hostname for your Databricks cluster or SQL warehouse.
`warehouse_id`	`str`	required	SQL warehouse ID for your Databricks SQL warehouse.
`token`	`str | None`	`None`	Personal access token. Mutually exclusive with `client_id`/`client_secret`.
`client_id`	`str | None`	`None`	Service principal client ID. Requires `client_secret`.
`client_secret`	`str | None`	`None`	Service principal client secret. Requires `client_id`.
`catalog`	`str | None`	`None`	Unity Catalog name.
`db_schema`	`str`	`"default"`	Schema name (alias: `schema`).

Warning

`token` and `client_id`/`client_secret` are mutually exclusive. Use either a personal access token or service principal authentication, not both.

# Personal access token authentication
data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_SERVER_HOSTNAME}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    token: "${DATABRICKS_TOKEN}"
  table_name: transactions

# Service principal authentication
data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_SERVER_HOSTNAME}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    client_id: "${DATABRICKS_CLIENT_ID}"
    client_secret: "${DATABRICKS_CLIENT_SECRET}"
  table_name: transactions
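The optional `catalog` and `db_schema` parameters can be added to either authentication variant. A sketch using the token variant; the names `main` and `sales` are hypothetical placeholders, not values from this page:

```yaml
# Personal access token authentication with an explicit catalog and schema
data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_SERVER_HOSTNAME}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    token: "${DATABRICKS_TOKEN}"
    catalog: "main"        # hypothetical Unity Catalog name
    db_schema: "sales"     # hypothetical; "default" is used when omitted
  table_name: transactions
```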

Hive

Parameter	Type	Default	Description
`hive_params`	`HiveParamsConfig`	required	Connection parameters. Configure via DSN or Driver + Port + HiveServerType.
`ini_file`	`str | None`	`$ODBCINI` env var	Path to ODBC `.ini` file. Required when using DSN.
`kerberos_params`	`KerberosParamsConfig | None`	`None`	Kerberos authentication parameters. Required for Kerberos-secured clusters.

`HiveParamsConfig`

Parameter	Type	Default	Description
`DSN`	`str | None`	`None`	ODBC Data Source Name.
`Driver`	`str | None`	`None`	ODBC driver path. Required if DSN is not set.
`Port`	`int | None`	`None`	Hive server port. Required if DSN is not set.
`HiveServerType`	`int | None`	`None`	Hive server type (e.g., `2`). Required if DSN is not set.

`KerberosParamsConfig`

Parameter	Type	Default	Description
`user`	`str`	required	Kerberos principal name.
`kinit_realm`	`str`	required	Kerberos realm.
`kerberos_host`	`str`	required	Kerberos service host IP.
`kerberos_service_name`	`str`	required	Kerberos service name (e.g., `"hive"`).
`kerberos_fqdn`	`str`	required	Fully qualified domain name.
`keytab_path`	`str`	required	Path to the keytab file.
`krb5_config_path`	`str`	`"/etc/krb5.conf"`	Path to `krb5.conf`.
`password`	`str | None`	`None`	Password (plain text or file path). File-based works only with Heimdal Kerberos.
`kerberos_renewal_interval_minutes`	`int`	`540`	Ticket renewal interval in minutes.
`verbose`	`bool`	`False`	Whether to print verbose output.

# DSN-based configuration
data_location:
  database_type: hive
  connection_params:
    hive_params:
      DSN: "MyHiveDSN"
    ini_file: "/etc/odbc.ini"
    kerberos_params:
      user: "hive_user@REALM"
      kinit_realm: "REALM"
      kerberos_host: "10.0.0.1"
      kerberos_service_name: "hive"
      kerberos_fqdn: "hive-server.example.com"
      keytab_path: "/etc/security/keytabs/hive.keytab"
  table_name: transactions
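When no DSN is available, the `HiveParamsConfig` table indicates that `Driver`, `Port`, and `HiveServerType` are set instead. The sketch below uses only those three documented fields; the driver path and port are hypothetical placeholders, and `kerberos_params` is left out on the assumption that the cluster is not Kerberos-secured. Any further ODBC keys a driver may need are not documented on this page and are not shown.

```yaml
# Driver-based configuration (no DSN)
data_location:
  database_type: hive
  connection_params:
    hive_params:
      Driver: "/opt/hive-odbc/lib/libhiveodbc.so"   # hypothetical driver path
      Port: 10000                                    # hypothetical; use your server's port
      HiveServerType: 2
  table_name: transactions
```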

ClickHouse

Parameter	Type	Default	Description
`host`	`str`	required	ClickHouse connection string (e.g., `clickhouse://user:password@host:port/database`).

Warning

The `CLICKHOUSE_VERSION` environment variable must be set. Supported versions: 22.8 through 24.4.

data_location:
  database_type: clickhouse
  connection_params:
    host: "${CLICKHOUSE_HOST}"
  table_name: transactions

Synapse

Parameter	Type	Default	Description
`server_name`	`str`	required	Azure Synapse server name.
`database_name`	`str`	required	Dedicated SQL pool name.
`user`	`str`	required	Username. Supports environment variables.
`password`	`str | None`	`None`	Password. Supports environment variables.

data_location:
  database_type: synapse
  connection_params:
    server_name: "my-synapse.sql.azuresynapse.net"
    database_name: "my_pool"
    user: "${SYNAPSE_USER}"
    password: "${SYNAPSE_PASSWORD}"
  table_name: transactions

Column Selection

Use `allowed_columns` or `disallowed_columns` on any data source to control which columns are included in training.

Parameter	Type	Description
`allowed_columns`	`list[str]`	Only these columns will be used. All others are excluded.
`disallowed_columns`	`list[str]`	These columns are excluded. All others are used.

Note

`allowed_columns` and `disallowed_columns` are mutually exclusive. Use `disallowed_columns` to drop PII, unique IDs, or columns that would cause data leakage.

# Drop columns that add no signal
disallowed_columns: ["order_id", "internal_id"]

# Or explicitly select columns
allowed_columns: ["customer_id", "article_id", "t_dat", "price", "sales_channel_id"]

Date Column Formats

The `date_column` block specifies the event timestamp column and its format.

Format String	Example Value
`"%Y-%m-%d"`	`2024-01-15`
`"%Y-%m-%d %H:%M:%S"`	`2024-01-15 14:30:00`
`"%d/%m/%Y"`	`15/01/2024`
`"unix"`	`1705312200` (seconds since epoch)
`"unix_ms"`	`1705312200000` (milliseconds since epoch)

date_column:
  name: t_dat
  format: "%Y-%m-%d"
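For columns that store epoch timestamps rather than formatted strings, the table lists the `"unix"` and `"unix_ms"` keywords. A short sketch for a millisecond column; the column name `event_ts` is a hypothetical placeholder:

```yaml
date_column:
  name: event_ts      # hypothetical column holding values such as 1705312200000
  format: "unix_ms"   # use "unix" instead when values are in seconds
```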

Date format depends on database engine

The format string syntax differs by database:

Snowflake uses SQL-standard tokens: `YYYY-MM-DD`, `YYYY-MM-DD HH24:MI:SS`
All other engines (Parquet, BigQuery, Databricks, ClickHouse, Hive) use Python strftime: `%Y-%m-%d`, `%Y-%m-%d %H:%M:%S`

Using the wrong syntax causes silent data filtering failures or `Can't parse date` errors at training time.
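To make the difference concrete, here is how a `date_column` block might look for a Snowflake source, using the SQL-style tokens quoted above. On the other listed engines the same timestamp layout would be written as `"%Y-%m-%d %H:%M:%S"`.

```yaml
# Snowflake data source: SQL-style tokens, not strftime
date_column:
  name: t_dat
  format: "YYYY-MM-DD HH24:MI:SS"
```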

Joining Tables

Use `joined_data_sources` on event data sources to join attribute (dimension) tables.

# On the event data source
joined_data_sources:
  - name: articles           # Name of the attribute data source
    join_on:
      - [article_id, article_id]  # [event_column, attribute_column]

Multiple joins are supported:

joined_data_sources:
  - name: articles
    join_on:
      - [article_id, article_id]
  - name: stores
    join_on:
      - [store_id, store_id]

Tip

The attribute data source must be defined separately in `data_sources` with `type: attribute`. Each `join_on` pair maps `[event_table_column, attribute_table_column]`.
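Putting the pieces together, the sketch below shows one way an event source and its attribute source might sit side by side. Only `data_sources`, `type: attribute`, `joined_data_sources`, `name`, and `join_on` come from this page. The list layout, the `type: event` value, and the file path are assumptions, and keys the event source would normally need (such as its own `data_location` and `date_column`) are omitted for brevity.

```yaml
data_sources:
  - type: event                     # assumption: event sources are marked this way
    name: transactions
    joined_data_sources:
      - name: articles              # must match the attribute source's name below
        join_on:
          - [article_id, article_id]   # [event_column, attribute_column]
  - type: attribute
    name: articles
    data_location:
      database_type: parquet
      connection_params:
        path: "/data/articles.parquet"   # hypothetical path
      table_name: articles
```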