End-to-End Tutorial: H&M Kaggle Example
Building first foundation and scenario models with the H&M Kaggle dataset
This guide will walk you through the process of creating your first foundation model and scenario model using the H&M Personalized Fashion Recommendations dataset. We'll cover data preparation, model training, testing, and interpreting predictions.
Prerequisites
Before we begin, ensure you have:
- Access to the H&M dataset, either:
- Stored locally as Parquet files.
- Uploaded to a Snowflake account.
- BaseModel installed and configured.
We assume you have basic understanding of YAML and Python.
Preparing the Data
The H&M dataset consists of three main tables:
- Customers: Customer metadata.
- Transactions: Purchase history.
- Articles: Product information.
Ensure these tables are available in your chosen data storage:
- Local Parquet Files: Convert the CSV files to Parquet format and store them locally.
- Snowflake: Upload the CSV files to your Snowflake account.
Preparing the Configuration File
Create a configuration file to define your data sources, loading, optimization, and training parameters.
You can do that by creating a new text file and saving it with a .yaml extension (e.g., config.yaml).
Review the sections below to learn how to:
- define your data sources,
- configure data-related, optimization, and training parameters.
Defining Data Sources
In this YAML configuration, we define three data sources for BaseModel:
- Customers Table: Designated as `main_entity_attribute`, this table contains metadata about the primary entities (customers) and is automatically joined with event data based on the `customer_id`.
- Articles Table: Defined as an `attribute` type, this table holds information about products.
- Transactions Table: Marked as an `event` type, this table records customer transactions over time. It includes a `date_column` (`t_dat`) to timestamp events. The `joined_data_sources` section specifies the join to the articles attribute table on `article_id`, enriching event data with product attributes.
This setup enables BaseModel to process customer behaviors (transactions) enriched with product information, facilitating the training of a foundation model that captures the interactions between customers and products.
data_sources:
  - type: main_entity_attribute
    main_entity_column: customer_id
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/hm_data/customers.parquet"
        cache_path: "/db_cache/"
      table_name: customers
  - type: attribute
    name: articles
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/hm_data/articles.parquet"
        cache_path: "/db_cache/"
      table_name: articles
  - type: event
    main_entity_column: customer_id
    name: transactions
    date_column:
      name: t_dat
      format: '%Y-%m-%d'
    joined_data_sources:
      - name: articles
        join_on:
          - [article_id, article_id]
    data_location:
      database_type: parquet
      connection_params:
        path: "/path/to/hm_data/transactions_train.parquet"
        cache_path: "/db_cache/"
      table_name: transactions
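Before moving on, it can help to sanity-check that the file parses as valid YAML and lists the sources you expect. The `data_source_names` helper below is illustrative, not part of BaseModel, and assumes PyYAML is installed:

```python
import yaml  # PyYAML


def data_source_names(config_text: str) -> list[str]:
    """Parse a BaseModel-style YAML config and list its data-source names."""
    config = yaml.safe_load(config_text)
    return [src["name"] for src in config.get("data_sources", [])]


# Usage with the file from this tutorial:
# with open("config.yaml") as fh:
#     print(data_source_names(fh.read()))
# For the config above this should print: ['customers', 'articles', 'transactions']
```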
Configuring Parameters
The data sources in the YAML file should be followed by the configuration of data sets, data loading, training, and other parameters. In our example:
- In `data_params` we set:
  - the initial date to be considered for training (`data_start_date`) to 2018-09-20 00:00:00,
  - the boundaries for training, validation, and test sets to be temporal (`split` → `type` → `time`),
  - and separate these datasets with the `start_date` parameter in their respective blocks.
- In `data_loader_params` we set:
  - `batch_size` to 256, to process the training in batches of 256 samples (main entity IDs),
  - `num_workers` to 5, to parallelize data loading in five subprocesses.
- In `training_params` we configure:
  - `learning_rate` to 0.0003, a good starting point for BaseModel,
  - `epochs` to 1, so the training goes through the entire dataset only once.
data_params:
  data_start_date: 2018-09-20 00:00:00
  split:
    type: time
    training:
      start_date: 2018-09-20 00:00:00
    validation:
      start_date: 2020-07-01 00:00:00
    test:
      start_date: 2020-09-01 00:00:00
data_loader_params:
  batch_size: 256
  num_workers: 5
training_params:
  learning_rate: 0.0003
  epochs: 1
Train the Foundation Model
With the YAML configuration saved, you can now train the foundation model—either via a terminal command or a Python script calling the appropriate function.
Command Line
Run the following command to train the foundation model:
python -m monad.run \
--pretrain \
--config-path "path/to/config.yaml" \
--output-path "path/to/store/pretrain/artifacts" \
--overwrite
Python Script
Run the following script to train the foundation model:
from pathlib import Path

from monad.ui import pretrain

pretrain(
    config_path=Path("path/to/config.yaml"),
    output_path=Path("path/to/store/pretrain/artifacts"),
)
Train the Scenario Model
With BaseModel, you can fine-tune your foundation model for a broad range of scenarios and their supporting ML problems. In this tutorial we will demonstrate a buying-propensity use case, which is a multi-label classification problem.
Import required libraries
First, we need to import a few required libraries:
- We want to load our foundation model and build a multi-label classifier on top of it, so we need `load_from_foundation_model` and `MultilabelClassificationTask` from `monad.ui.module`.
- From `monad.ui.target_function` we need a few other imports required to define the target function.
- `TrainingParams` from `monad.ui.config` allows us to configure the training.
- The `numpy` library and `Dict` from `typing` close our list for this script.
from typing import Dict
import numpy as np
from monad.ui.config import TrainingParams
from monad.ui.module import MultilabelClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events, has_incomplete_training_window, get_qualified_column_name, next_n_days, SPLIT_TIMESTAMP
Define the target function
Now we will define the objective for our model:
- We start by defining `TARGET_NAMES` as a list of product categories (e.g., "Denim Trousers", "Swimwear") that the model should predict.
- We also need to point BaseModel to the right column storing the category names as `TARGET_ENTITY`; in our case it lives in the joined attribute table, which is why we need the helper function `get_qualified_column_name`.
- We then implement the target function (`target_fn`) by:
  - checking for sufficient historical data using `has_incomplete_training_window`, otherwise returning `None` for that main entity (customer),
  - limiting future events in scope to a 21-day campaign window using `next_n_days`,
  - creating `purchase_target`: a binary vector indicating which categories were purchased by each customer; this is done by grouping future transactions by `TARGET_ENTITY` and flagging the presence of each target category with `exists`,
  - and returning that vector.
TARGET_NAMES = [
    "Denim Trousers", "Swimwear", "Trousers", "Jersey Basic", "Ladies Sport Bottoms",
    "Basic 1", "Jersey fancy", "Blouse", "Shorts", "Trouser", "Ladies Sport Bras",
    "Casual Lingerie", "Expressive Lingerie", "Dress", "Dresses", "Tops Knitwear",
    "Skirt", "Nightwear", "Knitwear",
]
TARGET_ENTITY = get_qualified_column_name("department_name", ["articles"])

def target_fn(_history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray:
    target_window_days = 21
    if has_incomplete_training_window(_ctx, target_window_days):
        return None
    future = next_n_days(future, _ctx[SPLIT_TIMESTAMP], target_window_days)
    purchase_target, _ = future["transactions"].groupBy(TARGET_ENTITY).exists(groups=TARGET_NAMES)
    return purchase_target
For more examples, see the Target Function Examples page.
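Conceptually, the `groupBy(...).exists(...)` step yields one binary flag per entry of `TARGET_NAMES`. The following plain-numpy sketch mimics that behavior for a single customer; `purchase_vector` is an illustrative stand-in, not part of the BaseModel API:

```python
import numpy as np

TARGET_NAMES = ["Denim Trousers", "Swimwear", "Trousers"]


def purchase_vector(purchased_categories: list[str]) -> np.ndarray:
    """Illustrative stand-in for groupBy(...).exists(groups=TARGET_NAMES):
    one binary flag per target category, in TARGET_NAMES order."""
    purchased = set(purchased_categories)
    return np.array([1.0 if name in purchased else 0.0 for name in TARGET_NAMES])


# A customer who bought swimwear and socks within the 21-day window;
# "Socks" is not a target category, so it is ignored:
purchase_vector(["Swimwear", "Socks"])  # → array([0., 1., 0.])
```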
Configure and fit the model
In the final section of our scenario training script we:
- Initialize the task (`MultilabelClassificationTask`), providing it with the names of target categories as `class_names`.
- Set paths to the pre-trained foundation model (`fm_path`) and the directory to save the scenario model checkpoints (`checkpoint_dir`).
- Configure `TrainingParams`: you can define your desired location to save the model with `checkpoint_dir`, modify the default `learning_rate` and `epochs`, set the GPU with `devices`, etc.
- Train the model:
  - load the foundation model using `load_from_foundation_model`, providing the `checkpoint_path`, `downstream_task`, and `target_fn`,
  - call `fit` on the trainer with the specified `training_params` to commence training.
task = MultilabelClassificationTask(class_names=TARGET_NAMES)
fm_path = "/path/to/your/model/fm"
checkpoint_dir = "/path/to/store/your/downstream/scenario"
training_params = TrainingParams(
    learning_rate=0.0001,
    checkpoint_dir=checkpoint_dir,
    epochs=1,
    devices=[1],
)

if __name__ == "__main__":
    trainer = load_from_foundation_model(
        checkpoint_path=fm_path,
        downstream_task=task,
        target_fn=target_fn,
    )
    trainer.fit(training_params=training_params)
Save this script and run it to train the scenario model.
Evaluate the model
Once you have trained a downstream model, you will likely want to test its performance. To do this, you should prepare and execute a Python script with the following steps:
- Import the required functions and packages; at minimum, we need:
  - `load_from_checkpoint` from `monad.ui.module`, to instantiate the testing module,
  - `TestingParams` from `monad.ui.config`, to configure your evaluation.

  Please note that in the example below we import a few others, in order to use a temporal split for the test set and to specify the model output.
- Instantiate the testing module by calling `load_from_checkpoint` and providing `checkpoint_path` (the location of your scenario model's checkpoints). Additionally, as we want to use a specific period for our test, we pass `start_date` and `end_date` as `TEST` in `split`.
- Configure `TestingParams`: here we want to specify the output as `DECODED`, so we get interpretable likelihood values, and point BaseModel to the desired location to save our results (`local_save_location`).
- Run the evaluation by calling the `test()` method of the testing module, providing `testing_params` as its argument.
from datetime import datetime

from monad.config import TimeRange
from monad.ui.config import DataMode, OutputType, TestingParams
from monad.ui.module import load_from_checkpoint

# declare variables
checkpoint_path = "/path/to/downstream/model/checkpoints"  # location of scenario model checkpoints
save_path = "/path/to/test/predictions_and_ground_truth.tsv"  # location to store evaluation results
test_start_date = datetime(2020, 9, 1)  # first day of test period
test_end_date = datetime(2020, 9, 21)  # last day of test period

# load scenario model to instantiate testing module
testing_module = load_from_checkpoint(
    checkpoint_path=checkpoint_path,
    split={DataMode.TEST: TimeRange(start_date=test_start_date, end_date=test_end_date)},
)

# define testing parameters
testing_params = TestingParams(
    local_save_location=save_path,
    output_type=OutputType.DECODED,
)

# run evaluation
testing_module.test(testing_params=testing_params)
Save this script and execute it to test your scenario model.
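Once the evaluation finishes, the TSV at `save_path` contains decoded likelihoods alongside ground truth. The exact column layout depends on your BaseModel version, so the `pred_`/`true_` column names below are assumptions for illustration; adapt them to the file you actually get. A sketch that thresholds likelihoods at 0.5 and computes per-category accuracy:

```python
import pandas as pd


def per_category_accuracy(results: pd.DataFrame, categories: list[str]) -> dict[str, float]:
    """Compare thresholded likelihoods ("pred_<name>", assumed column naming)
    against ground-truth flags ("true_<name>") for each target category."""
    scores = {}
    for name in categories:
        predicted = (results[f"pred_{name}"] >= 0.5).astype(int)
        actual = results[f"true_{name}"].astype(int)
        scores[name] = float((predicted == actual).mean())
    return scores


# In practice the table would come from the evaluation output, e.g.:
# results = pd.read_csv(save_path, sep="\t")
results = pd.DataFrame({
    "pred_Swimwear": [0.9, 0.2, 0.7],
    "true_Swimwear": [1, 0, 0],
})
per_category_accuracy(results, ["Swimwear"])  # two of three rows match → 2/3
```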
Generate predictions
Time for the final step of this tutorial: running predictions. As before, we need to prepare and run a Python script with the following elements:
- Import the required functions and packages; at minimum, we need:
  - `load_from_checkpoint` from `monad.ui.module`, to instantiate the testing module,
  - `TestingParams` from `monad.ui.config`, to configure your inference.
- Instantiate the testing module by calling `load_from_checkpoint` and providing `checkpoint_path` (the location of your scenario model's checkpoints).
- Configure `TestingParams`: here we want to specify the output as `DECODED`, so we get interpretable likelihood values, and point BaseModel to the desired location to save our results (`local_save_location`). We also pass the `prediction_date`, i.e. the date we want to obtain our prediction for.
- Run inference by calling the `predict()` method of the testing module, providing `testing_params` as its argument.
from datetime import datetime

from monad.ui.config import OutputType, TestingParams
from monad.ui.module import load_from_checkpoint

# declare variables
save_path = "/path/to/predictions/predictions.tsv"  # location to store predictions
checkpoint_path = "/path/to/downstream/model/checkpoints"  # location of scenario model checkpoints
prediction_date = datetime(2020, 9, 22)  # first day of prediction period

# load scenario model to instantiate testing module
testing_module = load_from_checkpoint(
    checkpoint_path=checkpoint_path,
)

# define testing parameters
testing_params = TestingParams(
    local_save_location=save_path,
    output_type=OutputType.DECODED,
    prediction_date=prediction_date,
)

# run inference
testing_module.predict(testing_params=testing_params)
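A common follow-up is turning the per-category likelihoods into a shortlist of categories to target per customer. The snippet below is an illustrative sketch: the column-per-category layout is an assumption about the saved TSV, so adapt the loading step to the file your BaseModel version writes.

```python
import pandas as pd


def top_k_categories(row_scores: pd.Series, k: int = 3) -> list[str]:
    """Return the k category names with the highest predicted likelihood."""
    return list(row_scores.sort_values(ascending=False).head(k).index)


# Example likelihoods for one customer; in practice a row would come from
# something like pd.read_csv(save_path, sep="\t").
scores = pd.Series({"Swimwear": 0.81, "Dress": 0.12, "Knitwear": 0.45})
top_k_categories(scores, k=2)  # → ['Swimwear', 'Knitwear']
```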
This concludes our tutorial using the H&M Personalized Fashion Recommendations Kaggle dataset.