Parquet data sources

⚠️
Check This First!
This article applies to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, refer to the Snowflake Native App section.

Data sources are specified in the YAML file used by the pretrain function and are configured by the entries in the data_location section. Besides connecting directly to a database, you can also load data from a parquet file. The example below should be adapted to your configuration.

data_location:
  database_type: parquet
  connection_params:
    path: /path/to/parquet/file
    cache_path: /path/to/cache/directory/
  table_name: TABLE_NAME

Parameters

database_type : str
No default value. Required.
Specifies the database type or source file format. All data tables should use the same type. Set to: parquet.
connection_params : dict
Configures the connection to the data source. This block is mandatory, because a parquet data source requires at least the path argument.
- path : str
  No default value. Required.
  The full path to the parquet file. For parquet files divided into parts, the path should be <path_to_directory_with_parquet_files>/*.parquet. Example: /home/data/customers.parquet or /home/data/customers/*.parquet.
- cache_path : str
  No default value. Optional.
  The full path to the cache directory, which stores temporary query results and persisted data to improve performance and avoid redundant computation. Example: /home/data/cache/.
- max_memory : str
  No default value. Optional.
  Controls the maximum amount of memory that BaseModel may use for query execution. Example: '10GB'.
table_name : str
No default value. Required.
Specifies the table used to create features. If cache_path is used, it must be the same as the path parameter. Example: customers.
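
As a sketch of how the parameters above fit together, the fragment below configures a parquet source that is split into parts and sets all three connection_params entries. The paths, memory limit, and table name are the illustrative values from the parameter descriptions, not defaults, so replace them with your own. Placing max_memory inside connection_params follows the parameter list above.

```yaml
data_location:
  database_type: parquet
  connection_params:
    # Glob pattern for a parquet dataset divided into parts (illustrative path)
    path: "/home/data/customers/*.parquet"
    # Optional: directory for temporary query results and persisted data
    cache_path: /home/data/cache/
    # Optional: upper bound on memory used for query execution
    max_memory: '10GB'
  table_name: customers
```

The glob pattern is quoted so that the `*` character is always read as part of a plain string.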

The connection_params should be set separately in each data_location block, for each data source.

Example

The following example shows a simple configuration that uses parquet files for two data sources.

data_sources:
  - type: main_entity_attribute
    main_entity_column: UserID
    name: customers
    # Each data source carries its own data_location and connection_params
    data_location:
      database_type: parquet
      connection_params:
        path: /path/to/parquet/file
        cache_path: /path/to/cache/directory/
      table_name: customers
    disallowed_columns: [CreatedAt]
  - type: event
    main_entity_column: UserID
    name: purchases
    date_column:
      name: Timestamp
    data_location:
      database_type: parquet
      connection_params:
        path: /path/to/parquet/file
        cache_path: /path/to/cache/directory/
      table_name: purchases
    where_condition: "Timestamp >= today() - 365"
    sql_lambdas:
      - alias: price_float
        expression: "TO_DOUBLE(price)"

📘
Learn More
A detailed description of optional fields such as disallowed_columns, where_condition, sql_lambdas, and many others is provided here.