End-to-end example using Kaggle H&M fashion dataset
Complete flow from foundation model training to inference
In this tutorial, we will guide you through making propensity predictions on a Kaggle dataset: H&M Personalized Fashion Recommendations.
Step 1 - configure the foundation model and its data sources
The first step is to configure the data source connections and the foundation model. In this example, we will be using a Snowflake data warehouse, but it is possible to use any other supported connection type found in the documentation.
data_sources:
  - type: main_entity_attribute
    main_entity_column: customer_id
    name: customers
    data_location:
      database_type: snowflake
      connection_params:
        user: ${SNOWFLAKE_USER}
        password: ${SNOWFLAKE_PASSWORD}
        account: ${SNOWFLAKE_ACCOUNT}
        warehouse: ${SNOWFLAKE_WAREHOUSE}
        database: HM_KAGGLE
        db_schema: PRIVATE
      table_name: customers
  - type: event
    main_entity_column: customer_id
    name: transactions
    date_column:
      name: t_dat
    data_location:
      database_type: snowflake
      connection_params:
        user: ${SNOWFLAKE_USER}
        password: ${SNOWFLAKE_PASSWORD}
        account: ${SNOWFLAKE_ACCOUNT}
        warehouse: ${SNOWFLAKE_WAREHOUSE}
        database: HM_KAGGLE
        db_schema: PRIVATE
      table_name: transactions
data_params:
  data_start_date: 2018-09-20 00:00:00
  validation_start_date: 2020-09-01 00:00:00
loader_params:
  batch_size: 256
  num_workers: 5
training_params:
  learning_rate: 0.0003
  epochs: 3
  hidden_dim: 2048
In this case, we have already joined the data from the ARTICLES table into the TRANSACTIONS table, so we only provide two data sources: one of event type and one of attribute type.
For the purposes of this tutorial, we leave the rest of the parameters unchanged. More details about them can be found in the documentation.
Naturally, under connection_params, users need to provide their own credentials, for example for Snowflake or any other supported database.
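The ${...} placeholders suggest environment-variable substitution. Below is a minimal sketch of that idea; the library's actual resolution mechanism is an assumption here, and the values are hypothetical:

import os
import re

# Illustrative only: replace ${VAR} placeholders with values from the
# environment, leaving unknown placeholders untouched.
def resolve_placeholders(text: str) -> str:
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), m.group(0)), text)

os.environ["SNOWFLAKE_USER"] = "analytics_user"  # hypothetical value
print(resolve_placeholders("user: ${SNOWFLAKE_USER}"))  # prints: user: analytics_user

With the configuration in place, we can move on to the next stage.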
Step 2 - run the preprocessing and train foundation model
At this point, the only thing we need to do is run BaseModel on the previously prepared configuration file.
Let's run the following CLI command:
python -m pretrain --config <path/to/config.yml> --features-path <path/to/store/pretrain/artifacts>
At this stage, the user can monitor what is happening by reviewing the logs output to the console.
After a while, data preprocessing will be done and the FoundationModel will be ready. All necessary files will be stored in the artifacts folder defined by --features-path above:
- fm - foundation model folder with the model checkpoint
- features - folder with feature transformations/embeddings
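Put together, the artifacts directory might look roughly like this (an illustrative layout, not a guaranteed structure):

<path/to/store/pretrain/artifacts>/
├── fm/         # foundation model checkpoint
└── features/   # feature transformations / embeddings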
That's it! The foundation model is ready, and we can move on to the next stage.
Step 3 - configure a scenario model
At this stage, we need to prepare the configuration for the downstream task training phase. This is done by writing your own Python script or by modifying one of the existing templates.
In this case, let's say we want to calculate each user's propensity to buy specific product categories.
Let's have a look at the training script used in this tutorial:
from typing import Dict

import numpy as np

from monad.ui.config import TrainingParams
from monad.ui.module import MultilabelClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events, has_incomplete_training_window, next_n_days, SPLIT_TIMESTAMP

TARGET_NAMES = [
    "Denim Trousers",
    "Swimwear",
    "Trousers",
    "Jersey Basic",
    "Ladies Sport Bottoms",
    "Basic 1",
    "Jersey fancy",
    "Blouse",
    "Shorts",
    "Trouser",
    "Ladies Sport Bras",
    "Casual Lingerie",
    "Expressive Lingerie",
    "Dress",
    "Dresses",
    "Tops Knitwear",
    "Skirt",
    "Nightwear",
    "Knitwear",
]

TARGET_ENTITY = "department_name"


def target_fn(_history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray:
    # Label purchases made within a 21-day window after the split timestamp.
    target_window_days = 21
    if has_incomplete_training_window(_ctx, target_window_days):
        return None
    future = next_n_days(future, _ctx[SPLIT_TIMESTAMP], target_window_days)
    purchase_target, _ = future["transactions"].groupBy(TARGET_ENTITY).exists(groups=TARGET_NAMES)
    return purchase_target


# type of ML task and size of the model output
num_outputs = len(TARGET_NAMES)  # number of classes to predict, here equal to the count of departments
task = MultilabelClassificationTask(num_classes=num_outputs)

# paths
fm_path = "/path/to/your/model/fm"
checkpoint_dir = "/path/to/store/your/downstream/scenario"

training_params = TrainingParams(
    learning_rate=0.0001,
    checkpoint_dir=checkpoint_dir,
    epochs=1,
    devices=[1],
)

if __name__ == "__main__":
    trainer = load_from_foundation_model(
        checkpoint_path=fm_path,
        downstream_task=task,
        target_fn=target_fn,
    )
    trainer.fit(training_params=training_params)
TARGET_ENTITY - in this case, it is the target column containing the categories of interest, which we list in TARGET_NAMES.
Next, we define the target_fn in a way that describes users' purchase behaviour. In this case, it labels with 1 the classes (specific departments) that the customer will purchase within the 21-day target window.
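To make the target's shape concrete, here is a small illustrative sketch (not library code; the exact return value of exists() is an assumption) of the kind of binary vector target_fn produces for one customer:

import numpy as np

# Illustrative only: one entry per department in TARGET_NAMES, set to 1
# where the customer buys from that department within the target window.
target_names = ["Denim Trousers", "Swimwear", "Trousers"]   # shortened list
purchased_departments = {"Swimwear", "Trousers"}            # hypothetical purchases
purchase_target = np.array(
    [1.0 if name in purchased_departments else 0.0 for name in target_names]
)
print(purchase_target)  # [0. 1. 1.]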
We then need to specify the type of ML task and the size of the output: num_outputs = len(TARGET_NAMES) specifies the number of classes the model (a multilabel classifier in this case) needs to predict, and task = MultilabelClassificationTask(num_classes=num_outputs) defines the downstream learning task that will be used in training.
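As a side note on what "multilabel" means here: each department is scored independently, so several can be predicted for the same customer at once. A hedged, library-independent illustration:

import numpy as np

# Not library code: a multilabel head produces one logit per class, and a
# sigmoid turns each logit into an independent probability, unlike softmax
# in single-label multiclass tasks.
logits = np.array([2.1, -0.7, 0.4])
probs = 1.0 / (1.0 + np.exp(-logits))
print(probs)  # approx. [0.89 0.33 0.60]; entries need not sum to 1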
We also need to define two paths: one to the foundation model fm folder (to load the model) and one to the directory where we want to store our model checkpoints. We pass the latter to TrainingParams, where we can also choose to modify some other parameters.
The final part of the script instantiates the trainer by loading the foundation model and attaching our new task, target function, and output size, and then trains the model using the training parameters we just defined.
Once the file is prepared, we run it via python train.py.
Step 4 - generate predictions
Now for the final step: running predictions. You run it the same way as before, by preparing and running a Python script. In our case it will look like this:
from monad.ui.config import TestingParams
from monad.ui.module import load_from_checkpoint

if __name__ == "__main__":
    checkpoint_dir = "checkpoint"
    testing_module = load_from_checkpoint(checkpoint_dir)
    testing_params = TestingParams(
        local_save_location="checkpoint/preds.csv",
    )
    testing_module.predict(testing_params=testing_params)
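Once the run completes, the predictions are saved to the CSV file given by local_save_location. A minimal sketch of inspecting them with pandas, assuming one row per customer with one score column per target department (the exact column layout is an assumption):

import pandas as pd

# Assumption: preds.csv holds a customer_id column plus one propensity
# column per department from TARGET_NAMES.
preds = pd.read_csv("checkpoint/preds.csv")
print(preds.head())

# For example, the top 3 departments for the first customer by predicted score:
score_columns = [c for c in preds.columns if c != "customer_id"]
print(preds[score_columns].iloc[0].sort_values(ascending=False).head(3))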