End-to-End Tutorial: H&M Kaggle Example
Building first foundation and scenario models with the H&M Kaggle dataset
This guide will walk you through the process of creating your first foundation model and scenario model using the H&M Personalized Fashion Recommendations dataset. We'll cover data preparation, model training, testing, and interpreting predictions.
Prerequisites
Before we begin, ensure you have:
- Access to the H&M dataset, either:
- Stored locally as Parquet files.
- Uploaded to a Snowflake account.
- BaseModel installed and configured.
We assume you have basic understanding of YAML and Python.
Preparing the Data
The H&M dataset consists of three main tables:
- Customers: Customer metadata.
- Transactions: Purchase history.
- Articles: Product information.
Ensure these tables are available in your chosen data storage:
- Local Parquet Files: Convert the CSV files to Parquet format and store them locally.
- Snowflake: Upload the CSV files to your Snowflake account.
Preparing the Configuration File
Create a configuration file to define your data sources, loading, optimization, and training parameters.
You can do that by creating a new text file and saving it with a .yaml extension (e.g., config.yaml).
Review the sections below to learn how to:
- define your data sources,
- configure data-related, optimization, and training parameters.
Defining Data Sources
In this YAML configuration, we define three data sources for BaseModel:
- Customers Table: Designated as `main_entity_attribute`, this table contains metadata about the primary entities (customers) and is automatically joined with event data based on the `customer_id`.
- Articles Table: Defined as an `attribute` type, this table holds information about products.
- Transactions Table: Marked as an `event` type, this table records customer transactions over time. It includes a `date_column` (`t_dat`) to timestamp events. The `joined_data_sources` section specifies the join to the articles attribute table on `article_id`, enriching event data with product attributes.
This setup enables BaseModel to process customer behaviors (transactions) enriched with product information, facilitating the training of a foundation model that captures the interactions between customers and products.
data_sources:
  - type: main_entity_attribute
    main_entity_column: customer_id
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/hm_data/customers.parquet"
        cache_path: "/db_cache/"
      table_name: customers
  - type: attribute
    name: articles
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/hm_data/articles.parquet"
        cache_path: "/db_cache/"
      table_name: articles
  - type: event
    main_entity_column: customer_id
    name: transactions
    date_column:
      name: t_dat
      format: '%Y-%m-%d'
    joined_data_sources:
      - name: articles
        join_on:
          - [article_id, article_id]
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/hm_data/transactions_train.parquet"
        cache_path: "/db_cache/"
      table_name: transactions
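Before moving on, it can help to sanity-check that the file parses as valid YAML and lists the sources you expect. The `data_source_names` helper below is illustrative, not part of BaseModel, and assumes PyYAML is installed:

```python
import yaml  # PyYAML


def data_source_names(config_text: str) -> list[str]:
    """Parse a BaseModel-style YAML config and list its data-source names."""
    config = yaml.safe_load(config_text)
    return [src["name"] for src in config.get("data_sources", [])]


# Usage with the file from this tutorial:
# with open("config.yaml") as fh:
#     print(data_source_names(fh.read()))
# For the config above this should print: ['customers', 'articles', 'transactions']
```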
Configuring Parameters
The data sources in the YAML file should be followed by the configuration of data sets, data loading, training, and other parameters. In our example:
- In `data_params` we set:
  - the initial date to be considered for training (`data_start_date`) to 2018-09-20 00:00:00,
  - the boundaries for training, validation, and test sets to be temporal (`split` → `type` → `time`),
  - and separate these datasets with the `start_date` parameter in their respective blocks.
- In `data_loader_params` we set:
  - `batch_size` to 256, to process the training in batches of 256 samples (main entity IDs),
  - `num_workers` to 5, to parallelize data loading in five subprocesses.
- In `training_params` we configure:
  - `learning_rate` to 0.0003, a good starting point for BaseModel,
  - `epochs` to 1, so the training goes through the entire dataset only once.
data_params:
  data_start_date: 2018-09-20 00:00:00
  split:
    type: time
    training:
      start_date: 2018-09-20 00:00:00
    validation:
      start_date: 2020-07-01 00:00:00
    test:
      start_date: 2020-09-01 00:00:00
data_loader_params:
  batch_size: 256
  num_workers: 5
training_params:
  learning_rate: 0.0003
  epochs: 1
Train the Foundation Model
With the YAML configuration saved, you can now train the foundation model—either via a terminal command or a Python script calling the appropriate function.
Command Line
Run the following command to train the foundation model:
python -m monad.run \
--pretrain \
--config-path "path/to/config.yaml" \
--output-path "path/to/store/pretrain/artifacts" \
--overwrite
Python Script
Run the following script to train the foundation model:
from pathlib import Path

from monad.ui import pretrain

pretrain(
    config_path=Path("path/to/config.yaml"),
    output_path=Path("path/to/store/pretrain/artifacts"),
)
Train the Scenario Model
With BaseModel, you can fine-tune your foundation model for a broad range of scenarios and their supporting ML problems. In this tutorial we will demonstrate a buying-propensity use case, which is a multi-label classification problem.
Import required libraries
First, we need to import a few required libraries:
- We want to load our foundation model and build a multi-label classifier on top of it, so we need `load_from_foundation_model` and `MultilabelClassificationTask` from `monad.ui.module`.
- From `monad.ui.target_function` we need a few other imports required to define the target function.
- `TrainingParams` from `monad.ui.config` allows us to configure the training.
- The `numpy` library and `Dict` from `typing` close our list for this script.
from typing import Dict
import numpy as np
from monad.ui.config import TrainingParams
from monad.ui.module import MultilabelClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events, has_incomplete_training_window, get_qualified_column_name, next_n_days, SPLIT_TIMESTAMP
Define the target function
Now we will define the objective for our model:
- We start by defining `TARGET_NAMES` as a list of product categories (e.g., "Denim Trousers", "Swimwear") that the model should predict.
- We also need to point BaseModel to the right column storing the category names as `TARGET_ENTITY`; in our case it lives in the joined attribute table, which is why we need the helper function `get_qualified_column_name`.
- We then implement the target function (`target_fn`) by:
  - checking for sufficient historical data using `has_incomplete_training_window`, otherwise returning `None` for that main entity (customer),
  - limiting future events in scope to a 21-day campaign window using `next_n_days`,
  - creating `purchase_target`: a binary vector indicating which categories were purchased by each customer; this is done by grouping future transactions by `TARGET_ENTITY` and flagging the presence of each target category with `exists`,
  - and returning that vector.
TARGET_NAMES = [
    "Denim Trousers", "Swimwear", "Trousers", "Jersey Basic", "Ladies Sport Bottoms",
    "Basic 1", "Jersey fancy", "Blouse", "Shorts", "Trouser", "Ladies Sport Bras",
    "Casual Lingerie", "Expressive Lingerie", "Dress", "Dresses", "Tops Knitwear",
    "Skirt", "Nightwear", "Knitwear",
]
TARGET_ENTITY = get_qualified_column_name("department_name", ["articles"])

def target_fn(_history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray:
    target_window_days = 21
    if has_incomplete_training_window(_ctx, target_window_days):
        return None
    future = next_n_days(future, _ctx[SPLIT_TIMESTAMP], target_window_days)
    purchase_target, _ = future["transactions"].groupBy(TARGET_ENTITY).exists(groups=TARGET_NAMES)
    return purchase_target
For more examples, see the Target Function Examples page.
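Conceptually, the `groupBy(...).exists(...)` step yields one binary flag per entry of `TARGET_NAMES`. The following plain-numpy sketch mimics that behavior for a single customer; `purchase_vector` is an illustrative stand-in, not part of the BaseModel API:

```python
import numpy as np

TARGET_NAMES = ["Denim Trousers", "Swimwear", "Trousers"]


def purchase_vector(purchased_categories: list[str]) -> np.ndarray:
    """Illustrative stand-in for groupBy(...).exists(groups=TARGET_NAMES):
    one binary flag per target category, in TARGET_NAMES order."""
    purchased = set(purchased_categories)
    return np.array([1.0 if name in purchased else 0.0 for name in TARGET_NAMES])


# A customer who bought swimwear and socks within the 21-day window;
# "Socks" is not a target category, so it is ignored:
purchase_vector(["Swimwear", "Socks"])  # → array([0., 1., 0.])
```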
Configure and fit the model
In the final section of our scenario training script we:
- Initialize the task (`MultilabelClassificationTask`), providing it with the names of target categories as `class_names`.
- Set paths to the pre-trained foundation model (`fm_path`) and the directory to save the scenario model checkpoints (`checkpoint_dir`).
- Configure `TrainingParams`: you can define your desired location to save the model with `checkpoint_dir`, modify the default `learning_rate` and `epochs`, set the GPU with `devices`, etc.
- Train the model:
  - load the foundation model using `load_from_foundation_model`, providing the `checkpoint_path`, `downstream_task`, and `target_fn`,
  - call `fit` on the trainer with the specified `training_params` to commence training.
task = MultilabelClassificationTask(class_names=TARGET_NAMES)
fm_path = "/path/to/your/model/fm"
checkpoint_dir = "/path/to/store/your/downstream/scenario"
training_params = TrainingParams(
    learning_rate=0.0001,
    checkpoint_dir=checkpoint_dir,
    epochs=1,
    devices=[1],
)

if __name__ == "__main__":
    trainer = load_from_foundation_model(
        checkpoint_path=fm_path,
        downstream_task=task,
        target_fn=target_fn,
    )
    trainer.fit(training_params=training_params)
Save this script and run it to train the scenario model.
Evaluate the model
Once you have trained a downstream model, you will likely want to test its performance. To do this, you should prepare and execute a Python script with the following steps:
- Import the required functions and packages; at minimum, we need:
  - `load_from_checkpoint` from `monad.ui.module`, to instantiate the testing module,
  - `TestingParams` from `monad.ui.config`, to configure your evaluation.

  Please note that in the example below we import a few others, in order to use a temporal split for the test set and to specify the model output.
- Instantiate the testing module by calling `load_from_checkpoint` and providing `checkpoint_path` (the location of your scenario model's checkpoints). Additionally, as we want to use a specific period for our test, we pass `start_date` and `end_date` as `TEST` in `split`.
- Configure `TestingParams`: here we want to specify the output as `DECODED`, so we get interpretable likelihood values, and point BaseModel to the desired location to save our results (`local_save_location`).
- Run the evaluation by calling the `test()` method of the testing module, providing `testing_params` as its argument.
from datetime import datetime

from monad.config import TimeRange
from monad.ui.config import DataMode, OutputType, TestingParams
from monad.ui.module import load_from_checkpoint

# declare variables
checkpoint_path = "/path/to/downstream/model/checkpoints"  # location of scenario model checkpoints
save_path = "/path/to/test/predictions_and_ground_truth.tsv"  # location to store evaluation results
test_start_date = datetime(2020, 9, 1)  # first day of test period
test_end_date = datetime(2020, 9, 21)  # last day of test period

# load scenario model to instantiate testing module
testing_module = load_from_checkpoint(
    checkpoint_path=checkpoint_path,
    split={DataMode.TEST: TimeRange(start_date=test_start_date, end_date=test_end_date)},
)

# define testing parameters
testing_params = TestingParams(
    local_save_location=save_path,
    output_type=OutputType.DECODED,
)

# run evaluation
testing_module.test(testing_params=testing_params)
Save this script and execute it to test your scenario model.
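Once the evaluation finishes, the TSV at `save_path` contains decoded likelihoods alongside ground truth. The exact column layout depends on your BaseModel version, so the `pred_`/`true_` column names below are assumptions for illustration; adapt them to the file you actually get. A sketch that thresholds likelihoods at 0.5 and computes per-category accuracy:

```python
import pandas as pd


def per_category_accuracy(results: pd.DataFrame, categories: list[str]) -> dict[str, float]:
    """Compare thresholded likelihoods ("pred_<name>", assumed column naming)
    against ground-truth flags ("true_<name>") for each target category."""
    scores = {}
    for name in categories:
        predicted = (results[f"pred_{name}"] >= 0.5).astype(int)
        actual = results[f"true_{name}"].astype(int)
        scores[name] = float((predicted == actual).mean())
    return scores


# In practice the table would come from the evaluation output, e.g.:
# results = pd.read_csv(save_path, sep="\t")
results = pd.DataFrame({
    "pred_Swimwear": [0.9, 0.2, 0.7],
    "true_Swimwear": [1, 0, 0],
})
per_category_accuracy(results, ["Swimwear"])  # two of three rows match → 2/3
```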
Generate predictions
Time for the final step of this tutorial: running predictions. As before, we need to prepare and run a Python script with the following elements:
- Import the required functions and packages; at minimum, we need:
  - `load_from_checkpoint` from `monad.ui.module`, to instantiate the testing module,
  - `TestingParams` from `monad.ui.config`, to configure your inference.
- Instantiate the testing module by calling `load_from_checkpoint` and providing `checkpoint_path` (the location of your scenario model's checkpoints).
- Configure `TestingParams`: here we want to specify the output as `DECODED`, so we get interpretable likelihood values, and point BaseModel to the desired location to save our results (`local_save_location`). We also pass the `prediction_date`, i.e. the date we want to obtain our prediction for.
- Run inference by calling the `predict()` method of the testing module, providing `testing_params` as its argument.
from datetime import datetime

from monad.ui.config import OutputType, TestingParams
from monad.ui.module import load_from_checkpoint

# declare variables
save_path = "/path/to/predictions/predictions.tsv"  # location to store predictions
checkpoint_path = "/path/to/downstream/model/checkpoints"  # location of scenario model checkpoints
prediction_date = datetime(2020, 9, 22)  # first day of prediction period

# load scenario model to instantiate testing module
testing_module = load_from_checkpoint(
    checkpoint_path=checkpoint_path,
)

# define testing parameters
testing_params = TestingParams(
    local_save_location=save_path,
    output_type=OutputType.DECODED,
    prediction_date=prediction_date,
)

# run inference
testing_module.predict(testing_params=testing_params)
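A common follow-up is turning the per-category likelihoods into a shortlist of categories to target per customer. The snippet below is an illustrative sketch: the column-per-category layout is an assumption about the saved TSV, so adapt the loading step to the file your BaseModel version writes.

```python
import pandas as pd


def top_k_categories(row_scores: pd.Series, k: int = 3) -> list[str]:
    """Return the k category names with the highest predicted likelihood."""
    return list(row_scores.sort_values(ascending=False).head(k).index)


# Example likelihoods for one customer; in practice a row would come from
# something like pd.read_csv(save_path, sep="\t").
scores = pd.Series({"Swimwear": 0.81, "Dress": 0.12, "Knitwear": 0.45})
top_k_categories(scores, k=2)  # → ['Swimwear', 'Knitwear']
```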
This concludes our tutorial using the H&M Personalized Fashion Recommendations Kaggle dataset.