Data Connectors
BaseModel reads data from multiple database types. This page documents the connection_params block for each supported database within the YAML configuration.
Overview
| Database | database_type | Required fields |
|---|---|---|
| Parquet | parquet | path |
| Snowflake | snowflake | user, password, account, warehouse, database, db_schema |
| BigQuery | bigquery | filename |
| Databricks | databricks | host, warehouse_id, (token or client_id + client_secret) |
| Hive | hive | hive_params |
| ClickHouse | clickhouse | host |
| Synapse | synapse | server_name, database_name, user |
Parquet
Read from local or cloud-stored Parquet files. Supports single files, directories, and glob patterns.
| Parameter | Type | Default | Description |
|---|---|---|---|
| path | Path | required | Path to the Parquet file or directory. Glob patterns supported (e.g., *.parquet). |
| cache_path | Path \| None | None | Database cache path. Can be reused between trainings for faster restarts. |
| config_overrides | dict | {} | DuckDB connection overrides (e.g., {"max_memory": "100GB"}). See DuckDB configuration reference. |
data_location:
  database_type: parquet
  connection_params:
    path: "/data/transactions.parquet"
    cache_path: "/basemodel/db_cache/"
  table_name: transactions
Glob pattern example:
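A minimal sketch of a glob-based configuration, assuming the Parquet files sit under a hypothetical /data/transactions/ directory. The config_overrides entry reuses the value from the parameter table and is optional.

```yaml
data_location:
  database_type: parquet
  connection_params:
    # Hypothetical directory; the glob matches every Parquet file inside it
    path: "/data/transactions/*.parquet"
    cache_path: "/basemodel/db_cache/"
    # Optional DuckDB override, value taken from the parameter table
    config_overrides:
      max_memory: "100GB"
  table_name: transactions
```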
Snowflake
| Parameter | Type | Default | Description |
|---|---|---|---|
| user | str | required | Login name of the user. Supports environment variables (e.g., ${SNOWFLAKE_USER}). |
| password | str | required | Password for the user. Supports environment variables. |
| account | str | required | Snowflake account identifier. |
| warehouse | str | required | Virtual warehouse to use. |
| database | str | required | Default database. |
| db_schema | str | required | Default schema (alias: schema). |
| role | str | "PUBLIC" | Snowflake role to use for the session. |
data_location:
  database_type: snowflake
  connection_params:
    user: "${SNOWFLAKE_USER}"
    password: "${SNOWFLAKE_PASSWORD}"
    account: "${SNOWFLAKE_ACCOUNT}"
    warehouse: "${SNOWFLAKE_WAREHOUSE}"
    database: "${SNOWFLAKE_DATABASE}"
    schema: "${SNOWFLAKE_SCHEMA}"
    role: "${SNOWFLAKE_ROLE}"
  table_name: transactions
OAuth Token Authentication (Snowflake Container Services)
For Snowflake Container Services, use the OAuth token-based configuration. This variant reads the token from /snowflake/session/token and uses SNOWFLAKE_ACCOUNT and SNOWFLAKE_HOST environment variables. Fields: warehouse, database, db_schema, authenticator (set to "oauth").
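The fields listed above might be arranged as follows. This is a sketch only: it assumes the variant keeps database_type: snowflake and the same data_location layout as the password-based example, which this page does not confirm. The token file and the SNOWFLAKE_ACCOUNT and SNOWFLAKE_HOST variables are read from the container environment, so they do not appear in the block.

```yaml
data_location:
  database_type: snowflake  # assumption: same type as password-based auth
  connection_params:
    warehouse: "${SNOWFLAKE_WAREHOUSE}"
    database: "${SNOWFLAKE_DATABASE}"
    db_schema: "${SNOWFLAKE_SCHEMA}"
    authenticator: "oauth"  # selects token-based authentication
  table_name: transactions
```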
BigQuery
| Parameter | Type | Default | Description |
|---|---|---|---|
| filename | Path | required | Path to the service account JSON file. |
| project_id | str \| None | None | BigQuery project ID. Only needed if different from the one in the service account. |
data_location:
  database_type: bigquery
  connection_params:
    filename: "/secrets/bigquery-user.json"
    project_id: "my-bigquery-project"
  table_name: transactions
  schema_name: my_dataset
Databricks
| Parameter | Type | Default | Description |
|---|---|---|---|
| host | str | required | Server hostname for your Databricks cluster or SQL warehouse. |
| warehouse_id | str | required | SQL warehouse ID for your Databricks SQL warehouse. |
| token | str \| None | None | Personal access token. Mutually exclusive with client_id/client_secret. |
| client_id | str \| None | None | Service principal client ID. Requires client_secret. |
| client_secret | str \| None | None | Service principal client secret. Requires client_id. |
| catalog | str \| None | None | Unity Catalog name. |
| db_schema | str | "default" | Schema name (alias: schema). |
Warning
token and client_id/client_secret are mutually exclusive. Use either personal access token or service principal authentication, not both.
# Personal access token authentication
data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_SERVER_HOSTNAME}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    token: "${DATABRICKS_TOKEN}"
  table_name: transactions

# Service principal authentication
data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_SERVER_HOSTNAME}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    client_id: "${DATABRICKS_CLIENT_ID}"
    client_secret: "${DATABRICKS_CLIENT_SECRET}"
  table_name: transactions
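The optional catalog and db_schema fields from the parameter table can be added to either authentication variant to target a specific Unity Catalog and schema. A sketch using hypothetical catalog and schema names:

```yaml
data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_SERVER_HOSTNAME}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    token: "${DATABRICKS_TOKEN}"
    catalog: "main"     # hypothetical Unity Catalog name
    db_schema: "sales"  # hypothetical schema; "default" is used when omitted
  table_name: transactions
```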
Hive
| Parameter | Type | Default | Description |
|---|---|---|---|
| hive_params | HiveParamsConfig | required | Connection parameters. Configure via DSN or Driver + Port + HiveServerType. |
| ini_file | str \| None | $ODBCINI env var | Path to ODBC .ini file. Required when using DSN. |
| kerberos_params | KerberosParamsConfig \| None | None | Kerberos authentication parameters. Required for Kerberos-secured clusters. |
HiveParamsConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| DSN | str \| None | None | ODBC Data Source Name. |
| Driver | str \| None | None | ODBC driver path. Required if DSN is not set. |
| Port | int \| None | None | Hive server port. Required if DSN is not set. |
| HiveServerType | int \| None | None | Hive server type (e.g., 2). Required if DSN is not set. |
KerberosParamsConfig
| Parameter | Type | Default | Description |
|---|---|---|---|
| user | str | required | Kerberos principal name. |
| kinit_realm | str | required | Kerberos realm. |
| kerberos_host | str | required | Kerberos service host IP. |
| kerberos_service_name | str | required | Kerberos service name (e.g., "hive"). |
| kerberos_fqdn | str | required | Fully qualified domain name. |
| keytab_path | str | required | Path to the keytab file. |
| krb5_config_path | str | "/etc/krb5.conf" | Path to krb5.conf. |
| password | str \| None | None | Password (plain text or file path). File-based works only with Heimdal Kerberos. |
| kerberos_renewal_interval_minutes | int | 540 | Ticket renewal interval in minutes. |
| verbose | bool | False | Whether to print verbose output. |
# DSN-based configuration
data_location:
  database_type: hive
  connection_params:
    hive_params:
      DSN: "MyHiveDSN"
    ini_file: "/etc/odbc.ini"
    kerberos_params:
      user: "hive_user@REALM"
      kinit_realm: "REALM"
      kerberos_host: "10.0.0.1"
      kerberos_service_name: "hive"
      kerberos_fqdn: "hive-server.example.com"
      keytab_path: "/etc/security/keytabs/hive.keytab"
  table_name: transactions
ClickHouse
| Parameter | Type | Default | Description |
|---|---|---|---|
| host | str | required | ClickHouse connection string (e.g., clickhouse://user:password@host:port/database). |
Warning
The CLICKHOUSE_VERSION environment variable must be set. Supported versions: 22.8 through 24.4.
data_location:
  database_type: clickhouse
  connection_params:
    host: "${CLICKHOUSE_HOST}"
  table_name: transactions
Synapse
| Parameter | Type | Default | Description |
|---|---|---|---|
| server_name | str | required | Azure Synapse server name. |
| database_name | str | required | Dedicated SQL pool name. |
| user | str | required | Username. Supports environment variables. |
| password | str \| None | None | Password. Supports environment variables. |
data_location:
  database_type: synapse
  connection_params:
    server_name: "my-synapse.sql.azuresynapse.net"
    database_name: "my_pool"
    user: "${SYNAPSE_USER}"
    password: "${SYNAPSE_PASSWORD}"
  table_name: transactions
Column Selection
Control which columns are included in training using allowed_columns or disallowed_columns on any data source.
| Parameter | Type | Description |
|---|---|---|
| allowed_columns | list[str] | Only these columns will be used. All others are excluded. |
| disallowed_columns | list[str] | These columns are excluded. All others are used. |
Note
allowed_columns and disallowed_columns are mutually exclusive. Use disallowed_columns to drop PII, unique IDs, or columns that would cause data leakage.
# Drop columns that add no signal
disallowed_columns: ["order_id", "internal_id"]
# Or explicitly select columns
allowed_columns: ["customer_id", "article_id", "t_dat", "price", "sales_channel_id"]
Date Column Formats
The date_column block specifies the event timestamp column and its format.
| Format String | Example Value |
|---|---|
| "%Y-%m-%d" | 2024-01-15 |
| "%Y-%m-%d %H:%M:%S" | 2024-01-15 14:30:00 |
| "%d/%m/%Y" | 15/01/2024 |
| "unix" | 1705312200 (seconds since epoch) |
| "unix_ms" | 1705312200000 (milliseconds since epoch) |
Date format depends on database engine
The format string syntax differs by database:
- Snowflake uses SQL-standard tokens: YYYY-MM-DD, YYYY-MM-DD HH24:MI:SS
- All other engines (Parquet, BigQuery, Databricks, ClickHouse, Hive) use Python strftime: %Y-%m-%d, %Y-%m-%d %H:%M:%S

Using the wrong syntax causes silent data filtering failures or "Can't parse date" errors at training time.
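To illustrate the difference, the same daily timestamp column might be declared in the two ways below. The name and format keys inside date_column are assumptions, since this page does not list the block's fields. The column name t_dat is borrowed from the column-selection example.

```yaml
# Snowflake source: SQL-standard tokens
date_column:
  name: t_dat           # assumed key for the timestamp column
  format: "YYYY-MM-DD"  # assumed key for the format string
---
# Parquet, BigQuery, Databricks, ClickHouse, or Hive source: Python strftime
date_column:
  name: t_dat
  format: "%Y-%m-%d"
```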
Joining Tables
Use joined_data_sources on event data sources to join attribute (dimension) tables.
# On the event data source
joined_data_sources:
  - name: articles  # Name of the attribute data source
    join_on:
      - [article_id, article_id]  # [event_column, attribute_column]
Multiple joins are supported:
joined_data_sources:
  - name: articles
    join_on:
      - [article_id, article_id]
  - name: stores
    join_on:
      - [store_id, store_id]
Tip
The attribute data source must be defined separately in data_sources with type: attribute. The join_on pairs map [event_table_column, attribute_table_column].
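Putting the pieces together, the sketch below shows an event source joined to a separately defined attribute source. The type: event value and the placement of name and data_location inside each data_sources entry are assumptions inferred from the fragments on this page. The file paths are hypothetical, and a real configuration may need further fields not covered here.

```yaml
data_sources:
  # Attribute (dimension) table, defined as its own data source
  - type: attribute
    name: articles
    data_location:
      database_type: parquet
      connection_params:
        path: "/data/articles.parquet"  # hypothetical path
      table_name: articles
  # Event table that joins the attribute table by its name
  - type: event  # assumed counterpart of type: attribute
    name: transactions
    data_location:
      database_type: parquet
      connection_params:
        path: "/data/transactions.parquet"
      table_name: transactions
    joined_data_sources:
      - name: articles
        join_on:
          - [article_id, article_id]  # [event_column, attribute_column]
```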