
Google BigQuery data sources

Connection parameters in YAML configuration file

⚠️

Check This First!

This article refers to BaseModel accessed via the Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as a Snowflake GUI application.


Data sources are specified in the YAML file used by the pretrain function and configured by the entries in the data_location section. Below is an example configuration for BigQuery; adapt it to your environment:

data_location:
  database_type: bigquery
  connection_params:
    filename: /path/to/your/file.json
    project_id: your_project_id  # optional; set only if it differs from the service account's project
  schema_name: your_schema_name
  table_name: your_table_name
Parameters
  • database_type : str, required
    No default value.
    Specifies the type of database. All listed tables must come from the same source type. Set to: bigquery.

  • connection_params : dict, required
    Connection configuration. For BigQuery, the following keys are supported (a quick validation sketch follows this parameter list):

    • filename : str, required
      No default value.
      Path to the service account JSON file (see example below).

      Example: bigquery-user.json.

    • project_id: str, optional
      Default value: None
      Overrides the Google Cloud project ID defined in the service account JSON file. Use this if the BigQuery project you want to use for querying or computation is different from the one the service account belongs to.

  • schema_name : str, required
    No default value.
    The schema to use when building features. You may also pass project_id.schema_name if the data resides in a different GCP project than the one used to execute the query. Example: test_schema.

  • table_name : str, required
    No default value.
    Specifies the table to use to create features. Example: customers.

    📘

    Note

    The connection_params must be defined individually for each data_location block, even if the same file is reused across multiple sources.
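
Before launching a pretrain run, it can be worth sanity-checking that a data_location block carries the keys described above. A minimal sketch, assuming PyYAML and a hypothetical config.yaml containing the block (this helper is illustrative, not a BaseModel API):

# Illustrative check only -- not part of BaseModel.
import yaml

REQUIRED_KEYS = {"database_type", "connection_params", "schema_name", "table_name"}

with open("config.yaml") as f:  # hypothetical path to your YAML config
    cfg = yaml.safe_load(f)

loc = cfg["data_location"]
missing = REQUIRED_KEYS - loc.keys()
assert not missing, f"data_location is missing keys: {missing}"
assert loc["database_type"] == "bigquery"
assert "filename" in loc["connection_params"], "BigQuery requires a key file path"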


To connect BaseModel to BigQuery, you need a Google Cloud service account key file in JSON format. This file contains credentials that authorize secure, programmatic access to your BigQuery project.

You can generate this file in the Google Cloud Console (IAM & Admin → Service Accounts) by creating a service account, assigning the necessary BigQuery roles (typically BigQuery Data Viewer and BigQuery Job User), and downloading the key as a JSON file.

Below is a sample structure of the file:

{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "your-private-key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nYOUR-PRIVATE-KEY\n-----END PRIVATE KEY-----\n",
  "client_email": "[email protected]",
  "client_id": "your-client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/your-service-account%40your-project-id.iam.gserviceaccount.com",
  "universe_domain": "googleapis.com"
}
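
You can confirm the key file works before handing it to BaseModel. A short sketch using the google-cloud-bigquery client (an optional pre-check on our part; BaseModel does not require this step):

# Standalone credential check -- assumes `pip install google-cloud-bigquery`.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/your/file.json"
)
# The client defaults to the project_id embedded in the key file,
# mirroring the optional `project_id` connection parameter.
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

# Listing datasets fails fast if the key or its roles are wrong.
for dataset in client.list_datasets():
    print(dataset.dataset_id)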
Examples

The example below demonstrates how to configure multiple BigQuery-based data sources, including main entity and event data:

data_sources:
  - type: main_entity_attribute
    main_entity_column: UserID
    name: customers
    data_location:
      database_type: bigquery
      connection_params:
        filename: /path/to/your/file.json
      schema_name: your_schema_name
      table_name: customers
    disallowed_columns: [CreatedAt]
  - type: event
    main_entity_column: UserID
    name: purchases
    date_column:
      name: Timestamp
    data_location:
      database_type: bigquery
      connection_params:
        filename: /path/to/your/file.json
      schema_name: your_schema_name
      table_name: purchases
    where_condition: "Timestamp >= today() - 365"
    sql_lambdas:
      - alias: price_float
        expression: "TO_DOUBLE(price)"
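
As a pre-flight for a configuration like the one above, you can also check that every referenced table is reachable with the same key file. A sketch with the BigQuery Python client, reusing the placeholder schema and table names from the example (the check itself is not a BaseModel feature):

# Pre-flight: confirm the configured tables resolve with the service account.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/your/file.json"
)
client = bigquery.Client(credentials=credentials)

for table_name in ("customers", "purchases"):
    # Raises google.api_core.exceptions.NotFound if the table is absent.
    table = client.get_table(f"your_schema_name.{table_name}")
    print(table.full_table_id, table.num_rows)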

If your data is stored in a project that differs from the one used to execute the query, you can specify both the compute and storage projects explicitly:

data_sources:
  - type: event
    main_entity_column: UserID
    name: purchases
    date_column:
      name: Timestamp
    data_location:
      database_type: bigquery
      connection_params:
        filename: /path/to/your/file.json
        project_id: compute_project_id
      schema_name: data_project_id.data_schema_name
      table_name: purchases
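
For reference, the same compute/storage split expressed directly against the BigQuery client looks roughly like this (a sketch; compute_project_id and data_project_id are the placeholders from the YAML above):

# Billing and job execution run in the compute project...
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/your/file.json"
)
client = bigquery.Client(credentials=credentials, project="compute_project_id")

# ...while the table itself is read from the data project.
query = "SELECT COUNT(*) AS n FROM `data_project_id.data_schema_name.purchases`"
for row in client.query(query).result():
    print(row.n)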

📘

Learn More

The detailed description of optional fields such as disallowed_columns, where_condition, sql_lambdas, and many others is provided here.