Data transformations

Additional options to select or modify your data

📘

Note

This article refers to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, please refer to the Snowflake Native App section instead.

Available data transformations

You can customize how BaseModel consumes each of your data sources by:

  • selecting only a subset of columns or rows to be processed during learning,
  • modifying column values or data types,
  • dividing events into groups, so that heterogeneous behaviors can be modeled separately.

Selecting columns and rows

The following parameters enable you to select columns and rows to include in the learning:

  • disallowed_columns (List, optional)
    Specify which columns to exclude from feature generation.

  • allowed_columns (List, optional)
    Use only specified columns for feature generation.

    🚧

    Remember

If you specify both parameters, BaseModel will process the set difference: the "allowed columns" minus the "disallowed columns", rather than all columns in the data source.

  • where_condition (str, optional)
Filter table entries by column values, including only a subset of rows.

    📘

    Note

    There are a few scenarios where this functionality comes in very handy:

    Sampling
    You may want to include only a fraction of the records in processing.

    Discarding events based on criteria
    You may want to include only the transactions of a certain customer segment, or only those meeting a certain threshold.

    Heterogeneous events in a single table
    Sometimes data sources contain events that are, in fact, different behaviors or different phenomena, e.g. incomes and expenses in account_transactions, or product purchases and returns in store_transactions.

    In such cases, BaseModel's predictive performance will usually improve if these groups of events are processed as separate data sources. You can reference the same database table in several data source definitions and use where_condition to split it, as sketched below.
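    As a minimal sketch, the same store_transactions table split into two event data sources. The TransactionType and ReturnReason column names are hypothetical, and connection details and date_column are omitted for brevity; the full key layout follows the complete example later in this article:

    data_sources:
       - type: event
         name: product_purchases
         main_entity_column: UserID
         data_location:
            table_name: store_transactions
            # Keep only purchase events in this data source.
            where_condition: "TransactionType = 'purchase'"
       - type: event
         name: product_returns
         main_entity_column: UserID
         data_location:
            table_name: store_transactions
            # The complementary subset becomes a second data source.
            where_condition: "TransactionType = 'return'"
            # Columns can also be excluded per data source.
            disallowed_columns: [ReturnReason]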

Modifying data values or types

You may want to explicitly declare the type of a variable, or cast its values to a specified type.
This can be achieved with one of the following options (combined in the sketch after this list):

  • sql_lambdas (List or str, optional)
    SQL functions to modify columns, e.g., casting column entries to the specified data type.

  • column_type_overrides (dict, str)
    While BaseModel automatically detects the data types of the columns at the initial processing stage, it is sometimes useful to define overrides, e.g.:

    • integers are treated as categorical variables unless overridden,
    • we also recommend explicitly defining text variables here.
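As a sketch, both options inside a data_location block. The sql_lambdas format mirrors the full example later in this article, while the exact value format accepted by column_type_overrides (here, a column-to-type mapping with a text type name) is an assumption:

data_location:
   table_name: purchases
   # Cast Price entries to floating-point values via an SQL expression.
   sql_lambdas:
      - alias: price_float
        expression: "TO_DOUBLE(Price)"
   # Assumed format: map a column name to the type it should be
   # treated as, e.g. force ProductName to be a text variable.
   column_type_overrides:
      ProductName: text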

Dividing event tables

Sometimes event data sources consist of diverse events which cannot easily be divided manually (e.g. when they follow patterns that are difficult to identify or too complex to specify with where_condition). In such cases, BaseModel can automatically detect the groups and process them separately to deliver increased predictive performance.

To divide events based on pattern recognition, use the num_groups parameter:

  • num_groups (int, optional)
    Number of groups into which data source will be divided for computing decimal features and histograms.
    If set to None, no division into groups will be performed.

This option is valid only for event data sources and triggers identification of groups that should:
- exhibit similar frequent combinations of values of the categorical columns that appear in events,
- be disjoint,
- be as similar in size as possible, while each meeting the minimum of 100 events.
If the minimum group size makes the requested number of groups unattainable, the script will fail with an error reporting the maximum valid value for this parameter. A configuration sketch follows below.
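For illustration, a sketch of an event data source requesting automatic division into three groups (placing num_groups alongside the other data source keys is an assumption; adjust to your configuration layout):

- type: event
  name: purchases
  main_entity_column: UserID
  # Detect 3 disjoint event groups automatically; omit this key or
  # set it to None to skip the division into groups.
  num_groups: 3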

📘

Please note:

You can set two environment variables when using grouping:

  • GROUPS_N_RETRIES - default value 20
  • GROUPS_RETRY_INTERVAL - default value 60

If the group computation exceeds GROUPS_N_RETRIES * GROUPS_RETRY_INTERVAL, the pretrain phase will stop with an error.
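Since this article covers the Docker deployment, one way to adjust these variables is through the container environment, e.g. in a docker-compose file (a minimal sketch; the service and image names are hypothetical):

services:
   basemodel:
      image: basemodel:latest   # hypothetical image name
      environment:
         # Double the retry budget: 40 retries at the default interval
         # of 60 gives grouping more time before pretrain fails.
         GROUPS_N_RETRIES: "40"
         GROUPS_RETRY_INTERVAL: "60"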

Example scenario

Let's consider the scenario below as a complete example where data set manipulation options come together.

Data sources

Consider the two data sources available in the grocery_app dataset:

Transactions: "purchases" table

This table records user interactions with a service (product-buy actions). Therefore, it is an event type table.
The following columns are available:

  • UserID: the unique identifier of the user who made the purchase
  • ProductID: The unique identifier of the product that was bought
  • ProductName: The name of the product that was purchased from the grocery store
  • ProductCategory: The category to which the product belongs, such as Fruits, Dairy, Bakery, etc.
  • ProductBrand: The brand of the product
  • Quantity: The number of units of the product that were purchased
  • Timestamp: The date and time when the purchase occurred
  • Price: The price of each unit of the product
  • PaymentMethod: The method used for payment, such as credit card, cash, or mobile payment
| UserID | ProductID | ProductName | ProductCategory | ProductBrand | Quantity | Timestamp | Price | PaymentMethod |
|---|---|---|---|---|---|---|---|---|
| 12789a4bc1234d | 4567ef890ab2cd | Bananas | Fruits | Chiquita | 2 | 2023-06-10 09:15:32 | $3.99 | CreditCard |
| 9e3f6d2a1b5c8f | 2c4d6e8f0a1b3c | Milk | Dairy | OrganicFarms | 1 | 2023-06-11 17:42:19 | $2.49 | Cash |
| 8f9e6d5c4b3a2f | a1b2c3d4e5f6g7 | Bread | Bakery | WonderBread | 3 | 2023-06-12 11:20:05 | $1.99 | CreditCard |
| 7g6f5e4d3c2b1a | 123456789abcdef | Apples | Fruits | Honeycrisp | 1 | 2023-06-13 14:35:11 | $0.79 | MobilePayment |
| 3c2b1a7g6f5e4d | 7g6f5e4d3c2b1a | Eggs | Dairy | OrganicFarms | 2 | 2023-06-14 19:55:48 | $2.99 | CreditCard |

Customer metadata: "customers" table.

The table stores customer information. Since we intend to predict the future behavior of users, it is the main_entity_attribute table. The following columns are available:

  • UserID: The unique identifier of the user that matches the corresponding customer in the previous table
  • Age: The age of the customer
  • Gender: The gender of the customer
  • Geolocation: The geolocation information of the customer
  • CreatedAt: The timestamp when the customer's account was created
| UserID | Age | Gender | Geolocation | CreatedAt |
|---|---|---|---|---|
| 12789a4bc1234d | 35 | Male | New York, USA | 2022-03-15 10:23:41 |
| 9e3f6d2a1b5c8f | 28 | Female | London, UK | 2021-09-28 16:50:19 |
| 8f9e6d5c4b3a2f | 42 | Male | Sydney, Australia | 2020-06-10 09:35:57 |
| 7g6f5e4d3c2b1a | 50 | Female | Paris, France | 2023-01-05 14:12:33 |

Implementation

To enable BaseModel to process the above tables, we will now prepare the data_sources block of the YAML file:

  • for convenience, we will keep customers and purchases as the table names;
    these names will be used in further stages to access features created based on the corresponding tables,

  • the type should be main_entity_attribute for customers and event for purchases,

  • both data tables are stored in a Snowflake database, so we set database_type to snowflake,
    adding the required connection_params to configure database access as described in the Snowflake article,

  • we set main_entity_column to UserID, the unique identifier of the customer,

  • we use the appropriate names for table_name, again keeping customers and purchases,

  • we also provide date_column for the purchases event table - Timestamp in our case,

  • we then want to:

    • in the customers table, exclude the CreatedAt column from the feature extraction process;
      we do this by providing its name as an item of the list argument of the disallowed_columns parameter,
    • in the purchases table, include only those events that happened in the last year;
      this is done with where_condition set to "Timestamp >= today() - 365",
    • change the type of the Price column values to float;
      a sql_lambdas entry with alias price_float and expression "TO_DOUBLE(Price)" will do that for us.
data_sources:
   - type: main_entity_attribute
     main_entity_column: UserID
     name: customers
     data_location:
        database_type: snowflake
        connection_params:
           user: username
           password: strongpassword123
           account: xy12345.west-europe.azure
           database: EXAMPLE_DB
           schema: EXAMPLE_SCHEMA
           role: ACCOUNT_ADMIN
        table_name: customers
        disallowed_columns: [CreatedAt]
   - type: event
     main_entity_column: UserID
     name: purchases
     date_column:
        name: Timestamp
     data_location:
        database_type: snowflake
        connection_params:
           user: username
           password: strongpassword123
           account: xy12345.west-europe.azure
           database: EXAMPLE_DB
           schema: EXAMPLE_SCHEMA
           role: ACCOUNT_ADMIN
        table_name: purchases
        where_condition: "Timestamp >= today() - 365"
        sql_lambdas:
           - alias: price_float
             expression: "TO_DOUBLE(Price)"

BaseModel feature extraction

Based on this configuration, BaseModel will infer the column types and handle each column accordingly to create features.

  • Age, Gender, and Geolocation will be transformed into one-hot-encoded categorical features.

  • CreatedAt column will be omitted as it is in the disallowed_columns list.

  • ProductName column will be transformed into a graph that captures relations between entities in the main_entity_column (here UserID) and products in ProductName, reflecting which products were bought by which user. Our proprietary algorithm Cleora will generate product embeddings based on
    the relations in this graph.

  • ProductBrand column, similarly to ProductName, will be transformed into a graph capturing the relation
    between UserID and ProductBrand, reflecting which brands were bought by which user; Cleora is used to create brand embeddings.

  • ProductCategory and PaymentMethod do not have enough distinct values to be transformed into graphs; these columns will be one-hot encoded.

  • Quantity and Price will be encoded into quantiles.