Defining Data Sources
Configuring your data in a YAML configuration file
Note
This article refers to BaseModel accessed via a Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as a Snowflake GUI application.
Configuration File Overview (YAML)
Before a foundation model is trained with the pretrain function, a YAML file needs to be prepared to store its configuration. The overall structure looks as follows:
data_sources:
- A list of data sources and their configurations.
data_params:
- Data-related parameters, e.g. temporal splits for the training / validation / test sets.
data_loader_params:
- Parameters that modify how the data is loaded from the source, such as batch sizes, number of workers, etc.
training_params:
- Parameters describing the training process, such as learning rate, number of epochs, etc.
memory_constraining_params:
- Parameters that control the size of the model, e.g. the size of the network's hidden dimension.
query_optimization:
- Parameters controlling the degree of parallelization, e.g. dividing queries into chunks.
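Taken together, a minimal configuration file skeleton might look like the sketch below. Only the six top-level block names come from the sections above; the comments inside each block are illustrative reminders, not an exhaustive or authoritative list of parameters:

```yaml
# Sketch of the overall YAML layout; only the six top-level block
# names are fixed -- the contents shown are illustrative placeholders.
data_sources:
  - # one entry per data source, described in this article
data_params:
  # e.g. temporal splits for the training / validation / test sets
data_loader_params:
  # e.g. batch sizes, number of workers
training_params:
  # e.g. learning rate, number of epochs
memory_constraining_params:
  # e.g. size of the network's hidden dimension
query_optimization:
  # e.g. dividing queries into chunks
```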
This article focuses on defining and connecting data, which is done in the data_sources block.
For the other parts, customizing the data loading and the model parameters, please refer to the dedicated article.
Defining Data Sources
All data sources are declared in the data_sources block of the YAML file, one after another.
The example below has two data sources, but the flow and logic are the same if there are more sources to connect:
data_sources:
  - type: main_entity_attribute
    main_entity_column: UserID
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: PATH_TO_PARQUET
        table_name: customers
    disallowed_columns: [CreatedAt]
  - type: event
    main_entity_column: UserID
    name: purchases
    date_column:
      name: Timestamp
    data_location:
      database_type: parquet
      connection_params:
        path: PATH_TO_PARQUET
        table_name: purchases
    where_condition: "Timestamp >= today() - 365"
    sql_lambdas:
      - alias: price_float
        expression: "TO_DOUBLE(price)"
Remember
If you want to combine several data sources, the main entity identifier needs to match between sources.
Mandatory fields of data_sources block
We will now describe the fields provided as part of data_sources — a list of data sources in the format List[Dict].
- type (str)
  Example: attribute
  Defines the kind of data table, with possible values of event, main_entity_attribute and attribute.
  For more information regarding the data source types refer to this article.
  Remember: Defining a data source with the attribute type allows you to join it to an event type data source. However, a Main Entity Attribute source is automatically joined with the event data on the main_entity_column and does not require the explicit joins described later.
- main_entity_column (str)
  Example: UserID
  Specifies a column with a unique identifier for the entity whose attributes are stored in the table.
- name (str)
  Example: customers
  Specifies the data source name.
- date_column (dict)
  A block defining the timestamp of the event; it is therefore only configured for event data sources.
  Parameters:
  - name (str)
    Example: TransactionDate
    Specifies the column with the event timestamp.
  - format (str)
    Example: "%Y%m%D"
    The format in which the column should be parsed; provider dependent. If not provided, no parsing is performed and the column is assumed to be streamed as a date / datetime type.
- data_location (dict)
  A block detailing the information needed to connect to databases or to load data from files.
  Supported are bigquery, clickhouse, databricks, hive, snowflake, synapse and parquet.
  Please refer to the subsection specific to your data source:
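As an illustration, an event source whose timestamp column is stored as text could declare an explicit format. The format string below is an illustrative assumption; the accepted syntax is provider dependent, as noted above:

```yaml
# Hypothetical event source whose timestamp column arrives as text;
# the format value is an assumed example and depends on your provider.
- type: event
  main_entity_column: UserID
  name: purchases
  date_column:
    name: TransactionDate
    format: "%Y-%m-%d"  # assumption for illustration; provider dependent
  data_location:
    database_type: parquet
    connection_params:
      path: PATH_TO_PARQUET
      table_name: purchases
```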
Joining Other Entities' Attributes to Event Tables
If you intend to enrich your event data by joining it with additional attributes of other entities, then once you have defined your attribute type data sources please refer here.
Remember
Joining attribute tables for data supplied by parquet files is currently not supported.
Optional fields of data_sources block
There are many more options to customize your data sources, such as specifying columns to use or ignore, adding a where condition, overriding column data types, adding lambda functions, etc. These options are described in this article.
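The event source from the example earlier in this article already uses several of these optional fields; a sketch combining them (all field names are taken from that example):

```yaml
# Illustrative combination of optional fields shown in this article:
# ignoring a column, filtering rows, and casting via a SQL lambda.
- type: event
  main_entity_column: UserID
  name: purchases
  date_column:
    name: Timestamp
  data_location:
    database_type: parquet
    connection_params:
      path: PATH_TO_PARQUET
      table_name: purchases
  disallowed_columns: [CreatedAt]                 # column to ignore
  where_condition: "Timestamp >= today() - 365"   # keep only the last year
  sql_lambdas:
    - alias: price_float                          # derived column name
      expression: "TO_DOUBLE(price)"              # cast applied at load time
```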