Defining Data Sources

Configuring your data in a YAML configuration file

⚠️

Check This First!

This article refers to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, please refer to the Snowflake Native App section.


Configuration File Overview (YAML)

Before a foundation model is trained with the pretrain function, a YAML file needs to be prepared to store its configuration. The overall structure looks as follows:

data_sources:
    - List of data sources and their configurations.

data_params:
    - Data-related parameters, e.g. temporal splits for the training / validation / test sets.

data_loader_params:
    - Parameters that modify how the data is loaded from the source, such as batch sizes, workers, etc.

training_params:
    - Parameters describing the training process, such as learning rate, epochs, etc.

memory_constraining_params:
    - Parameters that control the size of the model, e.g. the size of the network's hidden dimension.

query_optimization:
    - Parameters controlling the degree of parallelization, e.g. dividing a query into chunks.

This article focuses on defining and connecting data, which is done in the data_sources block.
For the other parts (customizing the data loading and model parameters), please refer to the dedicated article.
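Because a misspelled block name is easy to miss in a long file, a quick check of the top-level keys can help. The sketch below is not part of BaseModel: the helper name summarize_blocks is hypothetical, and it assumes the YAML file has already been parsed into a Python dict by a YAML loader of your choice. It only compares key names against the six blocks listed above and makes no claim about which of them are required.

```python
# Top-level blocks named in the configuration overview.
EXPECTED_BLOCKS = (
    "data_sources",
    "data_params",
    "data_loader_params",
    "training_params",
    "memory_constraining_params",
    "query_optimization",
)


def summarize_blocks(config: dict) -> dict:
    """Report which overview blocks are present, absent, or unrecognized."""
    present = [b for b in EXPECTED_BLOCKS if b in config]
    absent = [b for b in EXPECTED_BLOCKS if b not in config]
    unknown = sorted(k for k in config if k not in EXPECTED_BLOCKS)
    return {"present": present, "absent": absent, "unknown": unknown}


# Hypothetical parsed config with one misspelled block name.
config = {"data_sources": [], "data_params": {}, "trainig_params": {}}
summary = summarize_blocks(config)
print(summary["unknown"])  # ['trainig_params']
```

An entry under "unknown" usually points to a typo rather than a real option.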

Defining Data Sources

All data sources are declared in the data_sources block of the YAML file, one after another.
The example below has two data sources, but the flow and logic are the same when there are more sources to connect:

data_sources:
  # Attributes of the main entity
  - type: main_entity_attribute
    main_entity_column: UserID
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: PATH_TO_PARQUET
      table_name: customers
    disallowed_columns: [CreatedAt]
  # Events, timestamped by the Timestamp column
  - type: event
    main_entity_column: UserID
    name: purchases
    date_column:
      name: Timestamp
    data_location:
      database_type: parquet
      connection_params:
        path: PATH_TO_PARQUET
      table_name: purchases
    where_condition: "Timestamp >= today() - 365"
    sql_lambdas:
      - alias: price_float
        expression: "TO_DOUBLE(price)"

❗️

Watch Out

If you want to combine several data sources, the main entity identifier needs to match between sources.
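One reading of this requirement is that the same identifier values appear under the main entity column in every source. The sketch below illustrates that reading only: the helper unmatched_entity_ids and the in-memory rows are hypothetical stand-ins for your real tables, not a BaseModel API.

```python
def unmatched_entity_ids(entity_rows, event_rows, entity_column):
    """Return event-side IDs that have no row in the main entity table."""
    known = {row[entity_column] for row in entity_rows}
    return sorted({row[entity_column] for row in event_rows} - known)


# Hypothetical sample rows keyed by the UserID column from the example.
customers = [{"UserID": 1}, {"UserID": 2}]
purchases = [{"UserID": 1, "price": "9.99"}, {"UserID": 3, "price": "4.50"}]

missing = unmatched_entity_ids(customers, purchases, "UserID")
print(missing)  # [3]
```

A non-empty result means some events refer to an entity that the attribute table does not describe.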

Mandatory fields of data_sources block

We will now describe the fields that must be provided for each entry of data_sources, which is a list of data sources in the format List[Dict].

  • type (str)
    Example: attribute
    Defines the kind of data table. Possible values are event, main_entity_attribute and attribute.
    For more information regarding the data source type refer to this article.

    ⚠️

    Remember

    Data sources of type attribute can be joined to event type data sources.
    A Main Entity Attribute source, however, is joined automatically with the event data on main_entity_column and does not require the explicit joins described later.

  • main_entity_column (str)
    Example: UserID
    Specifies the column with the unique identifier of the entity whose attributes are stored in the table.

  • name (str)
    Example: customers
    Specifies the data source name.

  • date_column (dict)
    A block defining the timestamp of the event; it is configured only for event data sources.
    Parameters:

    • name (str)
      Example: TransactionDate
      Specifies a column with the event timestamp.

    • format (str)
      Example: "%Y%m%d"
      Format in which the column should be parsed; the syntax depends on the data provider. If not provided, no parsing is performed and the column is assumed to be streamed as a date / datetime type.

  • data_location (dict)
    A block with the information needed to connect to databases or load data from files.
    Supported sources are bigquery, clickhouse, databricks, hive, snowflake, synapse and parquet.
    Please refer to the subsection specific to your data source:
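The mandatory fields described in this section can be checked mechanically before training starts. The following is a minimal sketch, not BaseModel's own validation: check_data_source is a hypothetical helper, it works on one data_sources entry already parsed into a dict, and it assumes the supported source names are the values expected in data_location.database_type, as in the parquet example earlier.

```python
ALLOWED_TYPES = {"event", "main_entity_attribute", "attribute"}
SUPPORTED_DATABASES = {
    "bigquery", "clickhouse", "databricks", "hive",
    "snowflake", "synapse", "parquet",
}


def check_data_source(source: dict) -> list:
    """Return a list of problems found in one data_sources entry."""
    problems = []
    for field in ("type", "main_entity_column", "name", "data_location"):
        if field not in source:
            problems.append(f"missing field: {field}")
    if "type" in source and source["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type: {source['type']}")
    # date_column is described only for event sources.
    if source.get("type") == "event":
        date_column = source.get("date_column") or {}
        if "name" not in date_column:
            problems.append("event source needs date_column.name")
    # Assumption: database_type must be one of the supported source names.
    database_type = (source.get("data_location") or {}).get("database_type")
    if "data_location" in source and database_type not in SUPPORTED_DATABASES:
        problems.append(f"unsupported database_type: {database_type}")
    return problems


# An event source that forgot its date_column block.
purchases = {
    "type": "event",
    "main_entity_column": "UserID",
    "name": "purchases",
    "data_location": {"database_type": "parquet"},
}
problems = check_data_source(purchases)
print(problems)  # ['event source needs date_column.name']
```

Returning a list of messages rather than raising on the first problem lets you report every issue in a source at once.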

Joining Other Entities' Attributes to Event Tables

If you intend to enrich your event data by joining it with additional attributes of other entities, first define your attribute type data sources and then refer here.

❗️

Watch Out

Joining attribute tables for data supplied by parquet files is currently not supported.
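To spot this situation early, you could scan the parsed data_sources list for attribute type sources that read from parquet. This is a sketch under stated assumptions: parquet_attribute_sources is a hypothetical helper, the source name products is made up for illustration, and the check relies on the type and data_location.database_type fields described above.

```python
def parquet_attribute_sources(data_sources):
    """Names of attribute-type sources whose data comes from parquet files."""
    return [
        source.get("name", "<unnamed>")
        for source in data_sources
        if source.get("type") == "attribute"
        and source.get("data_location", {}).get("database_type") == "parquet"
    ]


# Hypothetical parsed data_sources list.
sources = [
    {"type": "event", "name": "purchases",
     "data_location": {"database_type": "parquet"}},
    {"type": "attribute", "name": "products",
     "data_location": {"database_type": "parquet"}},
]
flagged = parquet_attribute_sources(sources)
print(flagged)  # ['products']
```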

Optional fields of data_sources block

There are many more options to customize your data sources, such as specifying columns to use or ignore, adding a where condition, overriding column data types and adding lambda functions. These options are described in this article.