Data transformations
Additional options to select or modify your data
Check This First!
This article refers to BaseModel accessed via Docker container. Please refer to Snowflake Native App section if you are using BaseModel as SF GUI application.
Available data transformations
You can customize how BaseModel shall consume each of the data sources:
- selecting only a subset of columns or rows to be processed in the learning process,
- modifying their values or data types,
- divide them into groups [what for?].
Selecting columns and rows
The following parameters enable you to select columns and rows to include in the learning:
-
disallowed_columns (List, optional)
Specify which columns to exclude from feature generation. -
allowed_columns (List, optional)
Use only specified columns for feature generation.Remember
If you specify both allowed and disallowed columns, BaseModel will process only the "allowed columns" minus the "disallowed columns" instead of all columns in a data source.
-
where_condition (str, optional)
Filter table entries by column values, only including subset of rows.Note
There are a few scenarios where this functionality comes in very handy:
Sampling
You may want to only include a fraction of the records in the processing.Discarding events based on criteria
You may want to only include transactions of certain segment of customers, or meeting certain threshold.Heterogeneous events in single table
Sometimes data sources contain events that are, in fact, different behaviors or different phenomena, eg. incomes and expenses in account_transactions, product purchases and returns in store_transactions etc.In such cases it will usually increase BaseModel predictive performance if these groups of events are processed as separate data sources. You can call the same table in data base and use
where_condition
to separate it into more data sources.
Modifying data values or types
You may want to explicitly declare the type of the variable, or cast the values to specified type.
This can be achieved with one of the following options:
-
sql_lambdas (List or str, optional)
SQL functions to modify columns, e.g., casting column entries to the specified data type. -
column_type_overrides (dict, str)
While BaseModel automatically detects the data types of the columns at the initial processing stage, it is sometimes useful to define overrides, eg.::- Integers are treated as categorical variables unless overridden,
- We also recommend explicitly defining text variables here.
Dividing event tables
Sometimes event data sources consist of diverse events which cannot be simply divided manually (eg. when they follow patterns that are difficult to identify or very complex to specify with where_condition
). In such cases BaseModel can automatically detect them and process separately to deliver increased predictive performance.
In order for events to be divided based on patter recognition the num_groups
parameter is used:
- num_groups (int, optional)
Number of groups into which data source will be divided for computing decimal features and histograms.
If set toNone
, no division into groups will be performed.
This option is valid only for event data sources and triggers identification of groups that should:
- exhibit similar frequent combinations of values of categorical columns that appear in events,
- be disjoint,
- be as similar in in size as possible, and meet the minimum of 100 events.
If minimum group size prohibits obtaining defined group number splits the script will fail informing about max value for this parameter.
Note:
You can set 2 environmental variables when using grouping:
- GROUPS_N_RETRIES - default value 20
- GROUPS_RETRY_INTERVAL - default value 60
- In case that the group computation exceeds the time of GROUPS_N_RETRIES * GROUPS_RETRY_INTERVAL then the pretrain phase will stop with an error.
Example scenario
Let's consider the scenario below as a complete example where data set manipulation options come together.
Data sources
Consider the two data sources available in the grocery_app
dataset:
Transactions: "purchases" table
This table records user interactions with a service (product-buy actions). Therefore, it is an event
type table.
The following columns are available:
- UserID: the unique identifier of the user who made the purchase
- ProductID: The unique identifier of the product that was bought
- ProductName: The name of the product that was purchased from the grocery store
- ProductCategory: The category to which the product belongs, such as Fruits, Dairy, Bakery, etc.
- ProductBrand: The brand of the product
- Quantity: The number of units of the product that were purchased
- Timestamp: The date and time when the purchase occurred
- Price: The price of each unit of the product
- PaymentMethod: The method used for payment, such as credit card, cash, or mobile payment
UserID | ProductID | ProductName | ProductCategory | ProductBrand | Quantity | Timestamp | Price | PaymentMethod |
---|---|---|---|---|---|---|---|---|
12789a4bc1234d | 4567ef890ab2cd | Bananas | Fruits | Chiquita | 2 | 2023-06-10 09:15:32 | $3.99 | CreditCard |
9e3f6d2a1b5c8f | 2c4d6e8f0a1b3c | Milk | Dairy | OrganicFarms | 1 | 2023-06-11 17:42:19 | $2.49 | Cash |
8f9e6d5c4b3a2f | a1b2c3d4e5f6g7 | Bread | Bakery | WonderBread | 3 | 2023-06-12 11:20:05 | $1.99 | CreditCard |
7g6f5e4d3c2b1a | 123456789abcdef | Apples | Fruits | Honeycrisp | 1 | 2023-06-13 14:35:11 | $0.79 | MobilePayment |
3c2b1a7g6f5e4d | 7g6f5e4d3c2b1a | Eggs | Dairy | OrganicFarms | 2 | 2023-06-14 19:55:48 | $2.99 | CreditCard |
Customer metadata: "customers" table.
The table stores customers information. Since we intend to predict future behavior of the user, it is the main_entity _attribute. The following columns are available:
- UserID: The unique identifier of the user that matches the corresponding customer in the previous table
- Age: The age of the customer
- Gender: The gender of the customer
- Geolocation: The geolocation information of the customer
- CreatedAt: The timestamp when the customer's account was created
UserID | Age | Gender | Geolocation | CreatedAt |
---|---|---|---|---|
12789a4bc1234d | 35 | Male | New York, USA | 2022-03-15 10:23:41 |
9e3f6d2a1b5c8f | 28 | Female | London, UK | 2021-09-28 16:50:19 |
8f9e6d5c4b3a2f | 42 | Male | Sydney, Australia | 2020-06-10 09:35:57 |
7g6f5e4d3c2b1a | 50 | Female | Paris, France | 2023-01-05 14:12:33 |
Implementation
To enable BaseModel to process the above tables, we will now prepare the data_sources
block ot he YAML
file:
-
for convenience, we will keep the
customers
andpurchases
asname
for tables;
these names will be used in further stages to access features created based on the corresponding tables, -
the
type
should bemain_entity_attribute
for customers andevent
for purchases, -
both data tables are stored in a Snowflake database, so we set
source
tosnowflake
,
adding requiredconnection_params
to configure the database access as described in Snowflake article, -
we should now set the
main_entity_column
to beUserID
, the unique identifier of the customer, -
we use the appropriate names for
table_name
, again keepingcustomers
andpurchases
, -
we should also provide
date_column
specifically for purchasesevent
table '-Timestamp
in our case. -
we then want to:
- in customers table, exclude 'CreatedAt' column from the feature extraction process;
we do this by providing its name as an item of a list argument indisallow_columns
parameter. - in purchases table, include only those events that happened in the last year;
this is done with thewhere_condition
set to"Timestamp >= today() - 365"
- change the type of "Price" column values to float;
the list ofsql_lambdas
defined byalias
price_float
andexpression
"TO_DOUBLE(Price)"
will do that for us.
- in customers table, exclude 'CreatedAt' column from the feature extraction process;
data_sources:
-type: main_entity_attribute
main_entity_column: UserID
name: customers
data_location:
database_type: snowflake
connection_params:
user: username,
password: strongpassword123,
account: xy12345.west-europe.azure,
database: EXAMPLE_DB,
schema: EXAMPLE_SCHEMA,
role: ACCOUNT_ADMIN
table_name: customers
disallowed_columns: [CreatedAt]
-type: event
main_entity_column: UserID
name: purchases
date_column:
name: Timestamp
data_location:
database_type: snowflake
connection_params:
user: username,
password: strongpassword123,
account: xy12345.west-europe.azure,
database: EXAMPLE_DB,
schema: EXAMPLE_SCHEMA,
role: ACCOUNT_ADMIN
table_name: purchases
where_condition: "Timestamp >= today() - 365"
sql_lambdas:
- alias: price_float
expression: "TO_DOUBLE(Price)"
BaseModel feature extraction
Based on this configuration, BaseModelAI will assert column types and handle them respectively to create features.
-
Age, Gender, and Geolocation will be transformed into one-hot-encoded categorical features.
-
CreatedAt column will be omitted as it is in the
disallowed_columns
list. -
ProductName column will be transformed into a graph that shows relations between entities in the
main_entity_column
(here UserID) and products in ProductName — reflecting which products were bought by which user. Our proprietary algorithm Cleora will generate product embeddings based on
relations in the created graph. -
ProductBrand column, similarly to ProductName, will be transformed into a graph showing the relation
between UserID and ProductBrand — reflecting which brands were bought by which user; Cleora is used to create brand embeddings. -
ProductCategory and PaymentMethods don't have enough distinct values in the column to be transformed into a graph; the columns will be one-hot encoded.
-
Quantity and Price will be encoded into quantiles.
Updated 15 days ago