Connect Sources

Every entry in the data_sources list needs a data_location block that tells BaseModel where to read from. This page covers the supported backends and the connection parameters for each.

Supported databases

Database	`database_type`	Required fields
Parquet	`parquet`	`path`
Snowflake	`snowflake`	`user`, `password`, `account`, `warehouse`, `database`, `db_schema`
BigQuery	`bigquery`	`filename`
Databricks	`databricks`	`host`, `warehouse_id`, (`token` or `client_id` + `client_secret`)
Hive	`hive`	`hive_params`
ClickHouse	`clickhouse`	`host`
Synapse	`synapse`	`server_name`, `database_name`, `user`

Connection pattern

Every data_location block follows the same structure:

```yaml
data_location:
  database_type: ...        # see table above
  connection_params:
    ...                     # engine-specific, see reference
  table_name: my_table
```
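
As a minimal sketch of this pattern filled in for one backend, the block below uses the Parquet row from the table above. It assumes that `path` is placed under `connection_params`. The file path and table name are hypothetical placeholders, and the reference remains the authority on exact field placement.

```yaml
data_location:
  database_type: parquet
  connection_params:
    # Hypothetical path; assumed to sit under connection_params
    path: /data/exports/transactions.parquet
  # Hypothetical table name
  table_name: transactions
```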

Engine-specific details

Each engine has its own required fields, optional parameters, and authentication methods. See the full reference for YAML examples and parameter tables:

Parquet — local/mounted files
Snowflake — password or key-pair auth
BigQuery — service-account JSON
Databricks — token or service principal
ClickHouse — connection string
Azure Synapse — SQL pool
Apache Hive — DSN or driver, optional Kerberos

Best Practices

Never commit credentials to config files

Keep passwords and tokens out of YAML configuration files. All connection_params values support ${ENV_VAR} syntax, so secrets can be read from environment variables instead.
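
For illustration, a Snowflake block with every credential read from the environment might look like the sketch below. The field names come from the table of supported databases. The environment variable names and the table name are hypothetical, and placing all fields directly under `connection_params` is an assumption.

```yaml
data_location:
  database_type: snowflake
  connection_params:
    # Environment variable names below are hypothetical examples
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    account: ${SNOWFLAKE_ACCOUNT}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: ${SNOWFLAKE_DATABASE}
    db_schema: ${SNOWFLAKE_SCHEMA}
  # Hypothetical table name
  table_name: customers
```

With the variables set in the shell or job environment before launch, a file like this can be committed without exposing secrets.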

Stick to one database_type per config

All tables within a single foundation model config should use the same backend. If your data lives in multiple engines, either load the smaller tables into the engine that holds most of your data or export everything to Parquet.
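
The single-backend rule can be sketched as a config in which every data_sources entry points at the same database_type, here Parquet. This is an illustrative layout only: the paths and table names are hypothetical, `path` is assumed to sit under `connection_params`, and any other per-source keys that a real config requires are omitted.

```yaml
data_sources:
  # Both entries use the same backend (parquet)
  - data_location:
      database_type: parquet
      connection_params:
        path: /data/exports/customers.parquet      # hypothetical path
      table_name: customers                        # hypothetical name
  - data_location:
      database_type: parquet
      connection_params:
        path: /data/exports/transactions.parquet   # hypothetical path
      table_name: transactions                     # hypothetical name
```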

Test connectivity before training

Verify that your credentials, network access, and table names are correct using your database client before launching a training run.

For the full list of parameters per engine, see the Data Configuration Reference.