Connect Sources
Every entry in the data_sources list needs a data_location block that tells BaseModel where to read from. This page covers the supported backends and the connection parameters for each.
Supported databases
| Database | database_type | Required fields |
|---|---|---|
| Parquet | parquet | path |
| Snowflake | snowflake | user, password, account, warehouse, database, db_schema |
| BigQuery | bigquery | filename |
| Databricks | databricks | host, warehouse_id, (token or client_id + client_secret) |
| Hive | hive | hive_params |
| ClickHouse | clickhouse | host |
| Synapse | synapse | server_name, database_name, user |
Connection pattern
Every data_location block follows the same structure:
data_location:
  database_type: ...       # see table above
  connection_params:
    ...                    # engine-specific; see reference
  table_name: my_table
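As an illustration, a Parquet source might fill in the template roughly as follows. This is a minimal sketch, not a reference configuration: the file path and table name are hypothetical, and placing table_name next to connection_params (rather than inside it) is an assumption based on the template above.

```yaml
data_location:
  database_type: parquet            # value from the "Supported databases" table
  connection_params:
    path: /data/events.parquet      # hypothetical path to a local or mounted file
  table_name: events                # hypothetical table name
```

The other engines keep the same outer shape and change only database_type and the keys under connection_params.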
Engine-specific details
Each engine has its own required fields, optional parameters, and authentication methods. See the full reference for YAML examples and parameter tables:
- Parquet — local/mounted files
- Snowflake — password or key-pair auth
- BigQuery — service-account JSON
- Databricks — token or service principal
- ClickHouse — connection string
- Azure Synapse — SQL pool
- Apache Hive — DSN or driver, optional Kerberos
Best Practices
Never commit credentials to config files
Keep passwords and tokens out of YAML configuration files. All connection_params values support ${ENV_VAR} syntax, so secrets can be supplied through environment variables instead.
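A Snowflake block that relies on ${ENV_VAR} substitution could look something like the sketch below. The field names come from the table of required fields. The environment variable names and the table name are hypothetical, and the nesting of table_name is assumed from the connection pattern shown earlier.

```yaml
data_location:
  database_type: snowflake
  connection_params:
    user: ${SNOWFLAKE_USER}             # hypothetical variable names; use your own
    password: ${SNOWFLAKE_PASSWORD}     # the secret itself never appears in the file
    account: ${SNOWFLAKE_ACCOUNT}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: ${SNOWFLAKE_DATABASE}
    db_schema: ${SNOWFLAKE_SCHEMA}
  table_name: transactions              # hypothetical table name
```

Non-secret values such as warehouse or database can also be written literally. Only the credentials need to stay out of version control.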
Stick to one database_type per config
All tables within a single foundation model config should use the same backend. If your data lives in multiple engines, either export the smaller tables into the larger one or export everything to Parquet.
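For example, a config with two sources that were both exported to Parquet might be laid out as in this sketch. The paths and table names are hypothetical. The exact layout of a data_sources entry is also an assumption here: only the data_location block is shown, and any other per-source settings are omitted.

```yaml
data_sources:
  - data_location:
      database_type: parquet              # same backend for every source
      connection_params:
        path: /data/customers.parquet     # hypothetical path
      table_name: customers               # hypothetical table name
  - data_location:
      database_type: parquet              # matches the first entry
      connection_params:
        path: /data/transactions.parquet  # hypothetical path
      table_name: transactions            # hypothetical table name
```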
Test connectivity before training
Verify that your credentials, network access, and table names are correct using your database client before launching a training run.
For the full list of parameters per engine, see the Data Configuration Reference.