
Databricks data sources

Connection parameters in YAML configuration file

⚠️

Check This First!

This article refers to BaseModel accessed via the Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as a Snowflake GUI application.


Data sources are specified in the YAML file used by the pretrain function and are configured by the entries in the data_location section. Below is an example that should be adapted to your configuration.

data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_HOST}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    client_id: "${DATABRICKS_CLIENT_ID}"
    client_secret: "${DATABRICKS_CLIENT_SECRET}"
    catalog: hm_kaggle
    db_schema: private
  table_name: some_table
Parameters
  • database_type : str, required No default value.
    Information about the database type or source file. All data tables should be stored in the same database type. Set to: databricks.
  • connection_params : dict, required Configures the connection to the database. For Databricks, keyword arguments are:
    • host : str, required No default value.
      Server Hostname value for your cluster or SQL warehouse.
    • warehouse_id : str, required No default value.
      HTTP Path value for your cluster or SQL warehouse.
    • token : str, optional default=None.
      Databricks personal access token; an alternative to client_id and client_secret (see the token-based sketch after this list).
    • client_id : str, optional default=None.
      Client ID. Can't be set if token is set.
    • client_secret : str, optional default=None.
      Client secret. Can't be set if token is set.
    • catalog : str, optional default=None.
      Initial catalog to use for the connection.
    • db_schema : str, optional default="default".
      Initial schema to use for the connection.
  • table_name : str, required No default value.
    Specifies the table to use to create features. Example: customers.
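
If you authenticate with a personal access token instead of OAuth client credentials, only token needs to be supplied alongside host and warehouse_id. The following is a minimal sketch; the DATABRICKS_TOKEN variable name is an assumption and should match whatever environment variable you actually set.

data_location:
  database_type: databricks
  connection_params:
    host: "${DATABRICKS_HOST}"
    warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
    # hypothetical variable name; token must not be combined with client_id/client_secret
    token: "${DATABRICKS_TOKEN}"
    catalog: hm_kaggle
    db_schema: private
  table_name: some_table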

The connection_params should be set separately in each data_location block, for each data source.

⚠️

Note

For security reasons, avoid providing the token and other Databricks connection variables directly in the configuration; instead, set them as environment variables and reference them with the ${VARIABLE_NAME} syntax, as in the example below.
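
For instance, a client secret pasted directly into the file should be replaced with a reference to the corresponding environment variable. The snippet below is a minimal sketch of that substitution; the commented line only illustrates the pattern to avoid.

connection_params:
  host: "${DATABRICKS_HOST}"
  warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
  client_id: "${DATABRICKS_CLIENT_ID}"
  # discouraged: client_secret: "<secret written directly in the YAML file>"
  client_secret: "${DATABRICKS_CLIENT_SECRET}"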

Example

The following example demonstrates the connection to Databricks in the context of a simple configuration with two data sources.

data_sources:
  - type: main_entity_attribute
    main_entity_column: UserID
    name: customers
    data_location:
      database_type: databricks
      connection_params:
        host: "${DATABRICKS_HOST}"
        warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
        client_id: "${DATABRICKS_CLIENT_ID}"
        client_secret: "${DATABRICKS_CLIENT_SECRET}"
        catalog: hm_kaggle
        db_schema: private
      table_name: customers
    disallowed_columns: [CreatedAt]
  - type: event
    main_entity_column: UserID
    name: purchases
    date_column:
      name: Timestamp
    data_location:
      database_type: databricks
      connection_params:
        host: "${DATABRICKS_HOST}"
        warehouse_id: "${DATABRICKS_WAREHOUSE_ID}"
        client_id: "${DATABRICKS_CLIENT_ID}"
        client_secret: "${DATABRICKS_CLIENT_SECRET}"
        catalog: hm_kaggle
        db_schema: private
      table_name: purchases
    where_condition: "Timestamp >= today() - 365"
    sql_lambdas:
      - alias: price_float
        expression: "TO_DOUBLE(price)"

For more details about the Python connector to Databricks, please refer to the Databricks documentation.

📘

Learn More

A detailed description of optional fields such as disallowed_columns, where_condition, sql_lambdas, and many others is provided here.