Defining Data Sources
Configuring your data in the YAML configuration file
Check This First!
This article refers to BaseModel accessed via a Docker container. Please refer to the Snowflake Native App section if you are using BaseModel as a Snowflake GUI application.
Configuration File Overview (YAML)
Before a foundation model is trained with the pretrain function, a YAML file needs to be prepared to store its configuration. The overall structure looks as follows:
data_sources:
  - List of data sources and their configurations.
data_params:
  - Data-related parameters, e.g. temporal splits for the training / validation / test sets.
data_loader_params:
  - Parameters that modify how the data is loaded from the source, such as batch sizes, workers etc.
training_params:
  - Parameters describing the training process, such as learning rate, epochs etc.
memory_constraining_params:
  - Parameters that control the size of the model, e.g. the size of the network's hidden dimension.
query_optimization:
  - Parameters controlling the degree of parallelization, e.g. dividing a query into chunks.
This article focuses on defining and connecting data, which is done in the data_sources block.
For the other parts (customizing the data loading and model parameters), please refer to the dedicated article.
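Once the file has been parsed into a dictionary (for example with a YAML library), a quick look at its top-level keys can catch misspelled block names early. The sketch below is illustrative only and is not part of BaseModel: the helper `unknown_blocks` is hypothetical, and since this article does not state which blocks are mandatory, it only reports keys that do not appear in the structure listed above.

```python
# Block names taken from the configuration overview in this article.
KNOWN_BLOCKS = {
    "data_sources",
    "data_params",
    "data_loader_params",
    "training_params",
    "memory_constraining_params",
    "query_optimization",
}


def unknown_blocks(config):
    """Return top-level keys that are not among the blocks listed in this article."""
    return sorted(set(config) - KNOWN_BLOCKS)


# Hypothetical parsed configuration with a misspelled block name.
config = {"data_sources": [], "training_parms": {}}
print(unknown_blocks(config))
```

A non-empty result is only a hint to double-check spelling; the real configuration may accept keys that this article does not list.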
Defining Data Sources
All data sources are declared in the data_sources block of the YAML file, one after another.
The example below has two data sources, but the flow and logic are the same if there are more sources to connect:
data_sources:
  - type: main_entity_attribute
    main_entity_column: UserID
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: PATH_TO_PARQUET
      table_name: customers
    disallowed_columns: [CreatedAt]
  - type: event
    main_entity_column: UserID
    name: purchases
    date_column:
      name: Timestamp
    data_location:
      database_type: parquet
      connection_params:
        path: PATH_TO_PARQUET
      table_name: purchases
    where_condition: "Timestamp >= today() - 365"
    sql_lambdas:
      - alias: price_float
        expression: "TO_DOUBLE(price)"
Watch Out
If you want to combine several data sources, the main entity identifier needs to match between sources.
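One way to sanity-check this before training is to compare the identifier values found in each source. The sketch below is not part of BaseModel; it assumes the identifier columns have already been read into plain Python lists, and the helper name `id_overlap` and the sample values are hypothetical.

```python
def id_overlap(attribute_ids, event_ids):
    """Return the share of distinct event IDs that also appear in the attribute source."""
    attribute_set = set(attribute_ids)
    event_set = set(event_ids)
    if not event_set:
        return 0.0
    return len(event_set & attribute_set) / len(event_set)


# Hypothetical UserID values read from the customers and purchases sources.
customers_user_ids = ["u1", "u2", "u3"]
purchases_user_ids = ["u1", "u1", "u3", "u9"]

share = id_overlap(customers_user_ids, purchases_user_ids)
print(f"{share:.0%} of purchasing users have a customer record")
```

A low share often points to a type or formatting mismatch, for example integer identifiers in one source and strings in the other, since `1` and `"1"` do not compare equal.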
Mandatory fields of the data_sources block
We will now describe the fields provided as part of data_sources, a list of data sources in the format List[Dict].
- type (str)
  Example: attribute
  Defines the kind of data table, with possible values of event, main_entity_attribute and attribute.
  For more information regarding the data source type, refer to this article.
  Remember
  Defining data sources with the attribute type allows you to join them to an event type data source.
  However, a Main Entity Attribute source is automatically joined on the main_entity_column with the events data and does not require the explicit joins described later.
- main_entity_column (str)
  Example: UserID
  Specifies the column with a unique identifier of the entity whose attributes are stored in the table.
- name (str)
  Example: customers
  Specifies the data source name.
- date_column (dict)
  A block defining the timestamp of the event, hence only configured for event data sources.
  Parameters:
  - name (str)
    Example: TransactionDate
    Specifies the column with the event timestamp.
  - format (str)
    Example: "%Y%m%D"
    Format in which the column should be parsed, provider dependent. If not provided, no parsing is performed and the column is assumed to be streamed as a date / datetime type.
- data_location (dict)
  A block detailing the information needed to connect to databases or load data from files.
  Supported are bigquery, clickhouse, databricks, hive, snowflake, synapse and parquet.
  Please refer to the subsection specific to your data source.
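As a rough illustration of the mandatory fields described above, the following sketch checks a single data source that has already been parsed into a Python dictionary. It is not BaseModel's own validation: the function `check_data_source` is hypothetical, and treating date_column as required for every event source (and as a problem on other source types) is an interpretation of this article rather than confirmed behavior.

```python
# Values taken from the field descriptions in this article.
SOURCE_TYPES = {"event", "main_entity_attribute", "attribute"}
SUPPORTED_DATABASE_TYPES = {
    "bigquery", "clickhouse", "databricks", "hive",
    "snowflake", "synapse", "parquet",
}


def check_data_source(source):
    """Return a list of problems found in one data source dict (empty if none)."""
    problems = []
    for field in ("type", "main_entity_column", "name", "data_location"):
        if field not in source:
            problems.append(f"missing field: {field}")

    source_type = source.get("type")
    if source_type is not None and source_type not in SOURCE_TYPES:
        problems.append(f"unknown type: {source_type}")

    # Assumption: date_column (with a name) belongs to event sources only.
    date_column = source.get("date_column")
    if source_type == "event":
        if not isinstance(date_column, dict) or "name" not in date_column:
            problems.append("event source needs date_column with a name")
    elif date_column is not None:
        problems.append("date_column is only configured for event sources")

    data_location = source.get("data_location")
    if isinstance(data_location, dict):
        database_type = data_location.get("database_type")
        if database_type not in SUPPORTED_DATABASE_TYPES:
            problems.append(f"unsupported database_type: {database_type}")
    elif data_location is not None:
        problems.append("data_location must be a mapping")
    return problems


# The purchases source from the example above, reduced to the checked fields.
purchases = {
    "type": "event",
    "main_entity_column": "UserID",
    "name": "purchases",
    "date_column": {"name": "Timestamp"},
    "data_location": {"database_type": "parquet"},
}
print(check_data_source(purchases))
```

The check deliberately ignores connection_params, because their keys depend on the chosen database type and are covered in the provider-specific subsections.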
Joining Other Entities' Attributes to Event Tables
If you intend to enrich your event data by joining it with additional attributes for other entities, please refer here once you have defined your attribute type data sources.
Watch Out
Joining attribute tables for data supplied by parquet files is currently not supported.
Optional fields of the data_sources block
There are many more options to customize your data sources, such as specifying columns to use or ignore, adding a where condition, overriding column data types, or adding lambda functions. These options are described in this article.