Guides

Data partitioning

Defining partitioning to optimize queries

⚠️

Check This First!

This article refers to BaseModel accessed via Docker container. Please refer to Snowflake Native App section if you are using BaseModel as SF GUI application.

BaseModel divides queries into smaller chunks (partitions) to decrease memory consumption of the database. Please note that the database partitioning is different than BaseModel partitioning and does not have to be defined identically, however, setting BaseModel chunks to align with the database partitioning will result in better query optimization. The database partitioned by the same column as defined in the BaseModel config significantly decreases the execution time of queries.

To increase or decrease the number of chunks change num_query_chunks described in the relevant section of Managing space and memory or num_workers described in the relevant section of Control of data loading process.


Parameters
  • partition_column : str
    default: main_entity_column
    Column used for data partitioning. Chunks will be defined based on values in this column. By default it will be the same column as defined in the field main_entity_column.

    ⚠️

    Important!

    The partition column defined in the config should be either the main_entity_column or any other column which values disjointly separates main entity values. In other words, a given main entity value can be in only one partition.
    Not fulfilling this condition will result in the wrong data distribution during training.


  • partition_values_transformation : PartitionValueTransformation
    default: PartitionValueTransformation.HASH_MOD
    Type of transformation applied to the partition_column to determine the id of the partition. Available options:
    • PartitionValueTransformation.HASH_MOD: Can be used for any values in the partition column. The chunk id is a hash of partition column value modulo number of chunks.
    • PartitionValueTransformation.EXACT_GROUP: Can be used for any values in the partition column. Partitions will be created to obtain a similar size of each chunk. Should not be used with high cardinality columns as it might be cost-ineffective.
    • PartitionValueTransformation.EXACT_INT_MOD: Can be used when partition values are integers of any range. Values will be distributed to chunks based on the modulo of a number of partitions.
    • PartitionValueTransformation.EXACT_INT: Can be used when partition column values are consecutive integers in the range [0, number of partitions - 1]. Chunk id is the value in the partition column.


Example:

Suppose you have two data sources, product_buy and page_visit, both containing a column named user_id that identifies the main entity. Data source page_visit is partitioned in the database by a column user_group that divides users into 30 disjoint groups marked with integers. To optimize queries you can use the following configuration:

data_sorces:
  - type: event
    main_entity_column: user_id
    name: product_buy
    date_column:
       name: timestamp
    data_location:
       database_type: snowflake
       connection_params:
           user: username,
           password: strongpassword123,
           account: xy12345.west-europe.azure,
           database: EXAMPLE_DB,
           schema: EXAMPLE_SCHEMA,
           role: ACCOUNT_ADMIN
        table_name: product_buy        
  - type: event
    main_entity_column: user_id
    name: page_visit
    date_column:
       name: timestamp
    data_location:
       database_type: snowflake
       connection_params:
           user: username,
           password: strongpassword123,
           account: xy12345.west-europe.azure,
           database: EXAMPLE_DB,
           schema: EXAMPLE_SCHEMA,
           role: ACCOUNT_ADMIN
        table_name: page_visit
    partition_column: user_group
    partition_values_transformation: exact_int_mod

You can leave default BaseModel partitioning for product_buy since it is not partitioned in the database. Whereas page_visit is already partitioned by the column that disjointly divides users, it might be beneficial to set custom partition_column and partition_values_transformation parameters. You can use exact_int_mod transformation because values in user_group are integers.