Data transformations

Additional options to select or modify your data

📘

Note

This article refers to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, please refer to the Snowflake Native App section instead.

Available data transformations

You can customize how BaseModel consumes each of your data sources by:

  • selecting only a subset of columns or rows to be processed during learning,
  • modifying column values or data types,
  • dividing events into groups, so that heterogeneous behaviors can be modeled separately.

Selecting columns and rows

The following parameters enable you to select columns and rows to include in the learning:

  • disallowed_columns (List, optional)
    Specify which columns to exclude from feature generation.

  • allowed_columns (List, optional)
    Use only specified columns for feature generation.

    🚧

    Remember

If you specify both parameters, BaseModel will process the set difference: the "allowed columns" minus the "disallowed columns", rather than all columns in the data source.

  • where_condition (str, optional)
Filter table entries by column values, including only a subset of rows.

    📘

    Note

    There are a few scenarios where this functionality comes in very handy:

    Sampling
    You may want to include only a fraction of the records in processing.

    Discarding events based on criteria
    You may want to include only the transactions of a certain customer segment, or only those meeting a certain threshold.

    Heterogeneous events in a single table
    Sometimes data sources contain events that are, in fact, different behaviors or different phenomena, e.g. incomes and expenses in account_transactions, or product purchases and returns in store_transactions.

    In such cases, BaseModel's predictive performance will usually improve if these groups of events are processed as separate data sources. You can reference the same database table in several data source definitions and use where_condition to split it, as sketched below.
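    As a minimal sketch, the same store_transactions table split into two event data sources. The TransactionType and ReturnReason column names are hypothetical, and connection details and date_column are omitted for brevity; the full key layout follows the complete example later in this article:

    data_sources:
       - type: event
         name: product_purchases
         main_entity_column: UserID
         data_location:
            table_name: store_transactions
            # Keep only purchase events in this data source.
            where_condition: "TransactionType = 'purchase'"
       - type: event
         name: product_returns
         main_entity_column: UserID
         data_location:
            table_name: store_transactions
            # The complementary subset becomes a second data source.
            where_condition: "TransactionType = 'return'"
            # Columns can also be excluded per data source.
            disallowed_columns: [ReturnReason]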

Modifying data values or types

You may want to explicitly declare the type of a variable, or cast its values to a specified type.
This can be achieved with one of the following options (combined in the sketch after this list):

  • sql_lambdas (List or str, optional)
    SQL functions to modify columns, e.g., casting column entries to the specified data type.

  • column_type_overrides (dict, str)
    While BaseModel automatically detects the data types of the columns at the initial processing stage, it is sometimes useful to define overrides, e.g.:

    • integers are treated as categorical variables unless overridden,
    • we also recommend explicitly defining text variables here.
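As a sketch, both options inside a data_location block. The sql_lambdas format mirrors the full example later in this article, while the exact value format accepted by column_type_overrides (here, a column-to-type mapping with a text type name) is an assumption:

data_location:
   table_name: purchases
   # Cast Price entries to floating-point values via an SQL expression.
   sql_lambdas:
      - alias: price_float
        expression: "TO_DOUBLE(Price)"
   # Assumed format: map a column name to the type it should be
   # treated as, e.g. force ProductName to be a text variable.
   column_type_overrides:
      ProductName: text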

Dividing event tables

Sometimes event data sources consist of diverse events which cannot easily be divided manually (e.g. when they follow patterns that are difficult to identify or too complex to specify with where_condition). In such cases, BaseModel can automatically detect the groups and process them separately to deliver increased predictive performance.

To divide events based on pattern recognition, use the num_groups parameter:

  • num_groups (int, optional)
    Number of groups into which data source will be divided for computing decimal features and histograms.
    If set to None, no division into groups will be performed.

This option is valid only for event data sources and triggers identification of groups that should:
- exhibit similar frequent combinations of values of the categorical columns that appear in events,
- be disjoint,
- be as similar in size as possible, while each meeting the minimum of 100 events.
If the minimum group size makes the requested number of groups unattainable, the script will fail with an error reporting the maximum valid value for this parameter. A configuration sketch follows below.
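For illustration, a sketch of an event data source requesting automatic division into three groups (placing num_groups alongside the other data source keys is an assumption; adjust to your configuration layout):

- type: event
  name: purchases
  main_entity_column: UserID
  # Detect 3 disjoint event groups automatically; omit this key or
  # set it to None to skip the division into groups.
  num_groups: 3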

📘

Please note:

You can set two environment variables when using grouping:

  • GROUPS_N_RETRIES - default value 20
  • GROUPS_RETRY_INTERVAL - default value 60

If the group computation exceeds GROUPS_N_RETRIES * GROUPS_RETRY_INTERVAL, the pretrain phase will stop with an error.
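Since this article covers the Docker deployment, one way to adjust these variables is through the container environment, e.g. in a docker-compose file (a minimal sketch; the service and image names are hypothetical):

services:
   basemodel:
      image: basemodel:latest   # hypothetical image name
      environment:
         # Double the retry budget: 40 retries at the default interval
         # of 60 gives grouping more time before pretrain fails.
         GROUPS_N_RETRIES: "40"
         GROUPS_RETRY_INTERVAL: "60"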

Example scenario

Let's consider the scenario below as a complete example where data set manipulation options come together.

Data sources

Consider the two data sources available in the grocery_app dataset:

Transactions: "purchases" table

This table records user interactions with a service (product-buy actions). Therefore, it is an event type table.
The following columns are available:

  • UserID: the unique identifier of the user who made the purchase
  • ProductID: The unique identifier of the product that was bought
  • ProductName: The name of the product that was purchased from the grocery store
  • ProductCategory: The category to which the product belongs, such as Fruits, Dairy, Bakery, etc.
  • ProductBrand: The brand of the product
  • Quantity: The number of units of the product that were purchased
  • Timestamp: The date and time when the purchase occurred
  • Price: The price of each unit of the product
  • PaymentMethod: The method used for payment, such as credit card, cash, or mobile payment
| UserID | ProductID | ProductName | ProductCategory | ProductBrand | Quantity | Timestamp | Price | PaymentMethod |
|---|---|---|---|---|---|---|---|---|
| 12789a4bc1234d | 4567ef890ab2cd | Bananas | Fruits | Chiquita | 2 | 2023-06-10 09:15:32 | $3.99 | CreditCard |
| 9e3f6d2a1b5c8f | 2c4d6e8f0a1b3c | Milk | Dairy | OrganicFarms | 1 | 2023-06-11 17:42:19 | $2.49 | Cash |
| 8f9e6d5c4b3a2f | a1b2c3d4e5f6g7 | Bread | Bakery | WonderBread | 3 | 2023-06-12 11:20:05 | $1.99 | CreditCard |
| 7g6f5e4d3c2b1a | 123456789abcdef | Apples | Fruits | Honeycrisp | 1 | 2023-06-13 14:35:11 | $0.79 | MobilePayment |
| 3c2b1a7g6f5e4d | 7g6f5e4d3c2b1a | Eggs | Dairy | OrganicFarms | 2 | 2023-06-14 19:55:48 | $2.99 | CreditCard |

Customer metadata: "customers" table.

The table stores customer information. Since we intend to predict the future behavior of users, it is the main_entity_attribute table. The following columns are available:

  • UserID: The unique identifier of the user that matches the corresponding customer in the previous table
  • Age: The age of the customer
  • Gender: The gender of the customer
  • Geolocation: The geolocation information of the customer
  • CreatedAt: The timestamp when the customer's account was created
| UserID | Age | Gender | Geolocation | CreatedAt |
|---|---|---|---|---|
| 12789a4bc1234d | 35 | Male | New York, USA | 2022-03-15 10:23:41 |
| 9e3f6d2a1b5c8f | 28 | Female | London, UK | 2021-09-28 16:50:19 |
| 8f9e6d5c4b3a2f | 42 | Male | Sydney, Australia | 2020-06-10 09:35:57 |
| 7g6f5e4d3c2b1a | 50 | Female | Paris, France | 2023-01-05 14:12:33 |

Implementation

To enable BaseModel to process the above tables, we will now prepare the data_sources block of the YAML file:

  • for convenience, we will keep customers and purchases as the table names;
    these names will be used in further stages to access features created based on the corresponding tables,

  • the type should be main_entity_attribute for customers and event for purchases,

  • both data tables are stored in a Snowflake database, so we set database_type to snowflake,
    adding the required connection_params to configure database access as described in the Snowflake article,

  • we set main_entity_column to UserID, the unique identifier of the customer,

  • we use the appropriate names for table_name, again keeping customers and purchases,

  • we also provide date_column for the purchases event table - Timestamp in our case,

  • we then want to:

    • in the customers table, exclude the CreatedAt column from the feature extraction process;
      we do this by providing its name as an item of the list argument of the disallowed_columns parameter,
    • in the purchases table, include only those events that happened in the last year;
      this is done with where_condition set to "Timestamp >= today() - 365",
    • change the type of the Price column values to float;
      a sql_lambdas entry with alias price_float and expression "TO_DOUBLE(Price)" will do that for us.
data_sources:
   - type: main_entity_attribute
     main_entity_column: UserID
     name: customers
     data_location:
        database_type: snowflake
        connection_params:
           user: username
           password: strongpassword123
           account: xy12345.west-europe.azure
           database: EXAMPLE_DB
           schema: EXAMPLE_SCHEMA
           role: ACCOUNT_ADMIN
        table_name: customers
        disallowed_columns: [CreatedAt]
   - type: event
     main_entity_column: UserID
     name: purchases
     date_column:
        name: Timestamp
     data_location:
        database_type: snowflake
        connection_params:
           user: username
           password: strongpassword123
           account: xy12345.west-europe.azure
           database: EXAMPLE_DB
           schema: EXAMPLE_SCHEMA
           role: ACCOUNT_ADMIN
        table_name: purchases
        where_condition: "Timestamp >= today() - 365"
        sql_lambdas:
           - alias: price_float
             expression: "TO_DOUBLE(Price)"

BaseModel feature extraction

Based on this configuration, BaseModel will infer the column types and handle each column accordingly to create features.

  • Age, Gender, and Geolocation will be transformed into one-hot-encoded categorical features.

  • CreatedAt column will be omitted as it is in the disallowed_columns list.

  • ProductName column will be transformed into a graph that captures relations between entities in the main_entity_column (here UserID) and products in ProductName, reflecting which products were bought by which user. Our proprietary algorithm Cleora will generate product embeddings based on
    the relations in this graph.

  • ProductBrand column, similarly to ProductName, will be transformed into a graph capturing the relation
    between UserID and ProductBrand, reflecting which brands were bought by which user; Cleora is used to create brand embeddings.

  • ProductCategory and PaymentMethod do not have enough distinct values to be transformed into graphs; these columns will be one-hot encoded.

  • Quantity and Price will be encoded into quantiles.