Data partitioning
Defining partitioning to optimize queries
Check This First!
This article refers to BaseModel accessed via Docker container. Please refer to Snowflake Native App section if you are using BaseModel as SF GUI application.
BaseModel divides queries into smaller chunks (partitions) to decrease memory consumption of the database. Please note that the database partitioning is different than BaseModel partitioning and does not have to be defined identically, however, setting BaseModel chunks to align with the database partitioning will result in better query optimization. The database partitioned by the same column as defined in the BaseModel config significantly decreases the execution time of queries.
To increase or decrease the number of chunks change num_query_chunks
described in the relevant section of Managing space and memory or num_workers
described in the relevant section of Control of data loading process.
Parameters |
---|
- partition_column : str
default: main_entity_column
Column used for data partitioning. Chunks will be defined based on values in this column. By default it will be the same column as defined in the fieldmain_entity_column
.Important!
The partition column defined in the config should be either the
main_entity_column
or any other column which values disjointly separates main entity values. In other words, a given main entity value can be in only one partition.
Not fulfilling this condition will result in the wrong data distribution during training.
- partition_values_transformation : PartitionValueTransformation
default: PartitionValueTransformation.HASH_MOD
Type of transformation applied to thepartition_column
to determine the id of the partition. Available options:PartitionValueTransformation.HASH_MOD
: Can be used for any values in the partition column. The chunk id is a hash of partition column value modulo number of chunks.PartitionValueTransformation.EXACT_GROUP
: Can be used for any values in the partition column. Partitions will be created to obtain a similar size of each chunk. Should not be used with high cardinality columns as it might be cost-ineffective.PartitionValueTransformation.EXACT_INT_MOD
: Can be used when partition values are integers of any range. Values will be distributed to chunks based on the modulo of a number of partitions.PartitionValueTransformation.EXACT_INT
: Can be used when partition column values are consecutive integers in the range [0, number of partitions - 1]. Chunk id is the value in the partition column.
Example:
Suppose you have two data sources, product_buy
and page_visit
, both containing a column named user_id
that identifies the main entity. Data source page_visit
is partitioned in the database by a column user_group
that divides users into 30 disjoint groups marked with integers. To optimize queries you can use the following configuration:
data_sorces:
- type: event
main_entity_column: user_id
name: product_buy
date_column:
name: timestamp
data_location:
database_type: snowflake
connection_params:
user: username,
password: strongpassword123,
account: xy12345.west-europe.azure,
database: EXAMPLE_DB,
schema: EXAMPLE_SCHEMA,
role: ACCOUNT_ADMIN
table_name: product_buy
- type: event
main_entity_column: user_id
name: page_visit
date_column:
name: timestamp
data_location:
database_type: snowflake
connection_params:
user: username,
password: strongpassword123,
account: xy12345.west-europe.azure,
database: EXAMPLE_DB,
schema: EXAMPLE_SCHEMA,
role: ACCOUNT_ADMIN
table_name: page_visit
partition_column: user_group
partition_values_transformation: exact_int_mod
You can leave default BaseModel partitioning for product_buy
since it is not partitioned in the database. Whereas page_visit
is already partitioned by the column that disjointly divides users, it might be beneficial to set custom partition_column
and partition_values_transformation
parameters. You can use exact_int_mod
transformation because values in user_group
are integers.
Updated 5 days ago