Tutorial on Kaggle's H&M dataset

End-to-end walkthrough: from raw data to predictions using the H&M Personalized Fashion Recommendations Kaggle dataset.

Before you start

  • Complete the environment setup (Setup)
  • Confirm the container is running with /basemodel/ and /data/ mounted

1. Get the H&M dataset

  1. Sign up for the H&M competition on Kaggle
  2. Download only these three CSV files (skip the images/ folder to save space):
    • customers.csv
    • transactions_train.csv
    • articles.csv
  3. Place them in /data/hm/ on your data mount and convert them to Parquet.

    Save this script as convert_to_parquet.py:

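    # convert_to_parquet.py
    # Converts the three H&M CSV files in /data/hm/ to Parquet.
    # Note: DataFrame.to_parquet needs a Parquet engine (pyarrow or fastparquet) installed.
    import pandas as pd
    
    for name in ["customers", "transactions_train", "articles"]:
        # Each CSV is loaded fully into memory; transactions_train is the largest of the three.
        df = pd.read_csv(f"/data/hm/{name}.csv")
        df.to_parquet(f"/data/hm/{name}.parquet", index=False)
        print(f"✓ {name}: {len(df):,} rows")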
    

    Run it:

    python convert_to_parquet.py
    

You should now have three .parquet files in /data/hm/.
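
To confirm the conversion before moving on, a quick check along these lines may help. This is a minimal sketch using only the standard library; the helper name missing_parquet_files is hypothetical and not part of the tooling described in this tutorial.

```python
from pathlib import Path
from typing import List

# The three tables converted by convert_to_parquet.py.
EXPECTED_TABLES = ["customers", "transactions_train", "articles"]


def missing_parquet_files(data_dir: str) -> List[str]:
    """Return the expected Parquet file names that are absent or empty in data_dir."""
    root = Path(data_dir)
    missing = []
    for name in EXPECTED_TABLES:
        path = root / f"{name}.parquet"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.name)
    return missing


if __name__ == "__main__":
    missing = missing_parquet_files("/data/hm")
    print("All Parquet files present" if not missing else f"Missing: {missing}")
```

An empty result means all three files exist and are non-empty; it does not validate their contents.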

2. Configure the foundation model

Create the configuration in two parts: data sources and training parameters.

Data sources

Create /basemodel/configs/hm_fm_config.yaml:

yaml
data_sources:

  - type: main_entity_attribute
    main_entity_column: customer_id
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/data/hm/customers.parquet"
        cache_path: "/db_cache/"
      table_name: customers

  - type: attribute
    name: articles
    data_location:
      database_type: parquet
      connection_params:
        path: "/data/hm/articles.parquet"
        cache_path: "/db_cache/"
      table_name: articles

  - type: event
    main_entity_column: customer_id
    name: transactions
    date_column:
      name: t_dat
      format: '%Y-%m-%d'
    joined_data_sources:
      - name: articles
        join_on:
          - [article_id, article_id]
    data_location:
      database_type: parquet
      connection_params:
        path: "/data/hm/transactions_train.parquet"
        cache_path: "/db_cache/"
      table_name: transactions

What's happening: You define three data sources: customers (the main entity attribute table, keyed by customer_id), articles (an attribute table), and transactions (an event table timestamped by t_dat, with articles joined in on article_id).
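
Conceptually, the joined_data_sources entry means that each transaction event carries the attributes of the purchased article. The plain-Python sketch below illustrates that idea on made-up rows. It is not how the library performs the join: the sample values, the helper name enrich_events, and the left-join behavior for unmatched keys are all assumptions made for illustration.

```python
from typing import Dict, List

# Made-up rows loosely shaped like articles.csv and transactions_train.csv (illustrative values only).
articles = {
    "A1": {"product_type_name": "Vest top"},
    "A2": {"product_type_name": "Trousers"},
}
transactions = [
    {"customer_id": "c1", "t_dat": "2020-09-01", "article_id": "A1"},
    {"customer_id": "c1", "t_dat": "2020-09-03", "article_id": "A2"},
]


def enrich_events(events: List[Dict], attributes: Dict[str, Dict], key: str) -> List[Dict]:
    """Attach attribute columns to each event by matching on `key` (assumed left join)."""
    enriched = []
    for event in events:
        extra = attributes.get(event[key], {})  # no match: the event keeps only its own columns
        enriched.append({**event, **extra})
    return enriched


joined = enrich_events(transactions, articles, key="article_id")
print(joined[0])
```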

Training parameters

Append to the same file:

yaml
data_params:
  data_start_date: 2018-09-20 00:00:00
  split:
    type: time
    training:
      start_date: 2018-09-20 00:00:00
    validation:
      start_date: 2020-07-01 00:00:00
    test:
      start_date: 2020-09-01 00:00:00

data_loader_params:
  batch_size: 256
  num_workers: 5

training_params:
  learning_rate: 0.0003
  epochs: 1

What's happening: Data is split chronologically into a training set (from September 20, 2018 up to the validation start), a validation set (from July 1, 2020 up to the test start), and a test set (from September 1, 2020 to the last available timestamp).
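
The chronological split can be pictured as a simple lookup from timestamp to split name. The sketch below copies the boundaries from data_params and assumes each start date is inclusive and each range ends just before the next one starts; the exact boundary handling inside the library is not confirmed here, and assign_split is a hypothetical helper.

```python
from datetime import datetime

# Boundaries copied from data_params in hm_fm_config.yaml.
TRAINING_START = datetime(2018, 9, 20)
VALIDATION_START = datetime(2020, 7, 1)
TEST_START = datetime(2020, 9, 1)


def assign_split(timestamp: datetime) -> str:
    """Map a timestamp to the split whose date range contains it (start-inclusive, assumed)."""
    if timestamp < TRAINING_START:
        return "excluded"  # earlier than data_start_date
    if timestamp < VALIDATION_START:
        return "training"
    if timestamp < TEST_START:
        return "validation"
    return "test"


for ts in [datetime(2019, 5, 1), datetime(2020, 7, 15), datetime(2020, 9, 10)]:
    print(ts.date(), assign_split(ts))
```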

3. Train the foundation model

Create /basemodel/configs/hm_pretrain.py:

Python
from monad.ui import pretrain
from pathlib import Path

pretrain(
    config_path=Path("/basemodel/configs/hm_fm_config.yaml"),
    output_path=Path("/basemodel/project/fm/"),
)

Run it:

bash
python /basemodel/configs/hm_pretrain.py

Checkpoints will be saved to /basemodel/project/fm/. You can also train via the CLI — see Training Execution.

4. Train a scenario model

We are building a binary churn classifier that predicts whether a customer will make no purchase in the next 21 days (label 1 = churned, label 0 = at least one purchase).

Create /basemodel/configs/hm_scenario_train.py:

Python
from datetime import timedelta
from pathlib import Path
from typing import Dict

import numpy as np

from monad.batch import SPLIT_TIMESTAMP
from monad.ui.config import TrainingParams
from monad.ui.module import BinaryClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events, has_incomplete_training_window


# --- Names & Paths -----------------------------------------------------------

project_dir = Path("/basemodel/project").resolve()
scenario_name = "hm_churn_21d"

foundation_model_path = project_dir / "fm"
scenario_model_path = project_dir / "scenarios" / scenario_name


# --- Target Function ---------------------------------------------------------

# churn definition:
# 1: churned (no events in the future window)
# 0: not churned (at least one event in the future window)

TARGET_EVENT_TABLE = "transactions"
TARGET_WINDOW_DAYS = 21


def target_fn(history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray | None:

    # filters out users whose remaining window is too short
    if has_incomplete_training_window(_ctx, timedelta(days=TARGET_WINDOW_DAYS)):
        return None

    # filters out users with no history
    if history[TARGET_EVENT_TABLE].count() == 0:
        return None

    # trims the future to desired window
    future_window = future.interval_from(
        _ctx[SPLIT_TIMESTAMP],
        timedelta(days=TARGET_WINDOW_DAYS),
    )

    # churn label
    y = 0 if future_window[TARGET_EVENT_TABLE].count() > 0 else 1
    return np.array([y], dtype=np.float32)


# --- Training ----------------------------------------------------------------

task = BinaryClassificationTask()

training_params = TrainingParams(
    checkpoint_dir=scenario_model_path,
    learning_rate=0.0001,
    epochs=3,
    devices=[0],
    limit_train_batches=5,                # smoke test — remove for full run
    limit_val_batches=5,                  # smoke test — remove for full run
)

trainer = load_from_foundation_model(
    checkpoint_path=foundation_model_path,
    downstream_task=task,
    target_fn=target_fn,
)

trainer.fit(training_params=training_params, overwrite=True)

Run it:

bash
python /basemodel/configs/hm_scenario_train.py

What's happening: You define a target function that labels each customer as churned or not based on a 21-day window, then fine-tune the foundation model into a binary classifier. Target functions and scenario parameters are covered in depth in the Scenarios guide.
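
The labeling rule itself can be illustrated without the library. The sketch below applies the same 21-day rule to plain datetime values; churn_label is a hypothetical helper, and treating the window as start-inclusive and end-exclusive is an assumption, since the boundary behavior of interval_from is not specified in this tutorial.

```python
from datetime import datetime, timedelta
from typing import List

WINDOW = timedelta(days=21)


def churn_label(split_time: datetime, future_events: List[datetime]) -> int:
    """Return 1 (churned) if no event falls in [split_time, split_time + WINDOW), else 0."""
    window_end = split_time + WINDOW
    purchased = any(split_time <= ts < window_end for ts in future_events)
    return 0 if purchased else 1


split = datetime(2020, 8, 1)
print(churn_label(split, [datetime(2020, 8, 10)]))  # a purchase inside the window
print(churn_label(split, [datetime(2020, 9, 15)]))  # a purchase only after the window
```

A customer whose only future purchase falls after the window is still labeled as churned, which is why the window length is part of the scenario definition.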

5. Evaluate the model

Create /basemodel/configs/hm_scenario_test.py:

Python
from datetime import datetime
from pathlib import Path

from monad.ui.config import MetricParams, OutputType, TestingParams
from monad.ui.module import load_from_checkpoint


# --- Names & Paths -----------------------------------------------------------

scenario_checkpoint_dir = Path("/basemodel/project/scenarios/hm_churn_21d").resolve()
predictions_path = scenario_checkpoint_dir / "test" / "predictions_and_ground_truth.tsv"


# --- Test Set Definition -----------------------------------------------------

prediction_date = datetime(2020, 9, 5)
limit_test_batches = 100                  # smoke test — remove for full run


# --- Metrics -----------------------------------------------------------------

metrics = [
    MetricParams(alias="auroc", metric_name="AUROC", kwargs={"task": "binary"}),
    MetricParams(alias="recall", metric_name="Recall", kwargs={"task": "binary"}),
    MetricParams(alias="precision", metric_name="Precision", kwargs={"task": "binary"}),
]


# --- Test --------------------------------------------------------------------

testing_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)

testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
    metrics=metrics,
    limit_test_batches=limit_test_batches,  # smoke test — remove for full run
)

testing_module.test(testing_params=testing_params)

Run it:

bash
python /basemodel/configs/hm_scenario_test.py

What's happening: You load the trained scenario checkpoint, run it against a held-out test set at a specific prediction date, and compute metrics (AUROC, recall, precision). Results and metrics are saved in TSV format. See Scenarios for the full evaluation workflow.
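
For readers less familiar with the metrics, here is a minimal pure-Python sketch of how precision and recall follow from labels and predicted scores. It is independent of the library's own metric computation; the 0.5 threshold, the sample values, and the helper name precision_recall are assumptions for illustration.

```python
from typing import List, Tuple


def precision_recall(
    y_true: List[int], y_score: List[float], threshold: float = 0.5
) -> Tuple[float, float]:
    """Compute precision and recall for the positive class (1 = churned)."""
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and scores, not real model output.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.6, 0.4, 0.8, 0.1]
print(precision_recall(labels, scores))
```

Precision answers "of the customers flagged as churners, how many actually churned", while recall answers "of the customers who churned, how many were flagged".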

6. Generate predictions

Create /basemodel/configs/hm_scenario_predict.py:

Python
from datetime import datetime
from pathlib import Path

from monad.ui.config import OutputType, TestingParams
from monad.ui.module import load_from_checkpoint


# --- Names & Paths -----------------------------------------------------------

scenario_checkpoint_dir = Path("/basemodel/project/scenarios/hm_churn_21d").resolve()
predictions_path = scenario_checkpoint_dir / "predictions" / "predictions.tsv"


# --- Prediction Set Definition -----------------------------------------------

prediction_date = datetime(2020, 9, 23)


# --- Predict -----------------------------------------------------------------

prediction_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)

testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
)

prediction_module.predict(testing_params=testing_params)

Run it:

bash
python /basemodel/configs/hm_scenario_predict.py

What's happening: You load the same scenario checkpoint and generate predictions for all entities as of a given date. Output is saved in TSV format. See Scenarios — Inference for prediction options and output formats.
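
To take a first look at the output, something like the following sketch can be used. It relies only on the standard csv module and makes no assumption about column names; whether the file starts with a header row is not stated in this tutorial, and peek_tsv is a hypothetical helper.

```python
import csv
import os
from typing import List


def peek_tsv(path: str, n_rows: int = 5) -> List[List[str]]:
    """Return the first n_rows + 1 lines of a tab-separated file (a header, if present, plus data)."""
    rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        for i, row in enumerate(reader):
            if i > n_rows:
                break
            rows.append(row)
    return rows


if __name__ == "__main__":
    # Path taken from hm_scenario_predict.py.
    path = "/basemodel/project/scenarios/hm_churn_21d/predictions/predictions.tsv"
    if os.path.exists(path):
        for row in peek_tsv(path):
            print(row)
    else:
        print(f"No predictions file found at {path}")
```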