Tutorial on Kaggle's H&M dataset
End-to-end walkthrough: from raw data to predictions using the H&M Personalized Fashion Recommendations Kaggle dataset.
Before you start
- Complete the environment setup (Setup)
- Confirm the container is running with /basemodel/ and /data/ mounted
1. Get the H&M dataset
- Sign up for the H&M competition on Kaggle
- Download only these three CSV files (skip the images/ folder to save space):
  - customers.csv
  - transactions_train.csv
  - articles.csv
- Place them in your data mount and convert to Parquet.
Save this script as convert_to_parquet.py:
# convert_to_parquet.py
import pandas as pd

for name in ["customers", "transactions_train", "articles"]:
    df = pd.read_csv(f"/data/hm/{name}.csv")
    df.to_parquet(f"/data/hm/{name}.parquet", index=False)
    print(f"✓ {name}: {len(df):,} rows")
Run it in an environment that can see /data/hm/, for example with python convert_to_parquet.py inside the container.
You should now have three .parquet files in /data/hm/.
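Before moving on, it can help to confirm that the conversion produced every file the configuration expects. The following is a minimal sketch using only the standard library. The helper name missing_parquet_files is hypothetical and not part of any tool used in this tutorial, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path

# File stems written by convert_to_parquet.py
EXPECTED = ("customers", "transactions_train", "articles")


def missing_parquet_files(data_dir, names=EXPECTED):
    """Return the names whose <name>.parquet file is absent from data_dir."""
    data_dir = Path(data_dir)
    return [n for n in names if not (data_dir / f"{n}.parquet").is_file()]


if __name__ == "__main__":
    missing = missing_parquet_files("/data/hm")
    print("All Parquet files present." if not missing else f"Missing: {missing}")
```

An empty list means all three files are in place.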
2. Configure the foundation model
Create the configuration in two parts: data sources and training parameters.
Data sources
Create /basemodel/configs/hm_fm_config.yaml:
data_sources:
  - type: main_entity_attribute
    main_entity_column: customer_id
    name: customers
    data_location:
      database_type: parquet
      connection_params:
        path: "/data/hm/customers.parquet"
        cache_path: "/db_cache/"
      table_name: customers
  - type: attribute
    name: articles
    data_location:
      database_type: parquet
      connection_params:
        path: "/data/hm/articles.parquet"
        cache_path: "/db_cache/"
      table_name: articles
  - type: event
    main_entity_column: customer_id
    name: transactions
    date_column:
      name: t_dat
      format: '%Y-%m-%d'
    joined_data_sources:
      - name: articles
        join_on:
          - [article_id, article_id]
    data_location:
      database_type: parquet
      connection_params:
        path: "/data/hm/transactions_train.parquet"
        cache_path: "/db_cache/"
      table_name: transactions
What's happening: You create three data sources: Customers (main entity attribute table), Transactions (event data source, timestamped by t_dat), and Articles (attribute table joined to transactions via article_id).
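To make the join concrete, here is a small plain-Python sketch of what attaching article attributes to transaction events via article_id amounts to. It only illustrates the idea and is not how the platform implements joins internally. The function name enrich_events, the sample rows, and the choice to keep unmatched transactions are all made up for the example; the column names (t_dat, customer_id, article_id, price, product_type_name) come from the H&M CSV files.

```python
def enrich_events(transactions, articles):
    """Attach article attributes to each transaction row by article_id.

    Transactions without a matching article are kept unchanged
    (left-join behaviour, chosen here purely for illustration).
    """
    by_id = {a["article_id"]: a for a in articles}
    enriched = []
    for t in transactions:
        attrs = by_id.get(t["article_id"], {})
        enriched.append({**attrs, **t})
    return enriched


# Made-up sample rows, not real dataset values
articles = [
    {"article_id": "A1", "product_type_name": "Vest top"},
    {"article_id": "A2", "product_type_name": "Trousers"},
]
transactions = [
    {"t_dat": "2018-09-20", "customer_id": "c1", "article_id": "A1", "price": 0.05},
    {"t_dat": "2018-09-21", "customer_id": "c1", "article_id": "A9", "price": 0.02},
]

rows = enrich_events(transactions, articles)
print(rows[0]["product_type_name"])  # Vest top
```

Each event row now carries the attributes of the article that was bought, which is the kind of enriched view the joined_data_sources entry describes.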
Training parameters
Append to the same file:
data_params:
  data_start_date: 2018-09-20 00:00:00
  split:
    type: time
    training:
      start_date: 2018-09-20 00:00:00
    validation:
      start_date: 2020-07-01 00:00:00
    test:
      start_date: 2020-09-01 00:00:00
data_loader_params:
  batch_size: 256
  num_workers: 5
training_params:
  learning_rate: 0.0003
  epochs: 1
What's happening: The data is split chronologically. The training set runs from 20 September 2018 until the validation start, the validation set runs from 1 July 2020 until the test start, and the test set runs from 1 September 2020 to the last available timestamp.
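The chronological split can be pictured as a simple lookup on the event timestamp. The sketch below is illustrative only: the function name assign_split is hypothetical, and treating each start date as inclusive is an assumption rather than documented behaviour of the platform.

```python
from datetime import datetime

# Start dates copied from the data_params section of the config
TRAINING_START = datetime(2018, 9, 20)
VALIDATION_START = datetime(2020, 7, 1)
TEST_START = datetime(2020, 9, 1)


def assign_split(timestamp):
    """Return the split name for a timestamp, or None if it precedes the data start."""
    if timestamp < TRAINING_START:
        return None
    if timestamp < VALIDATION_START:
        return "training"
    if timestamp < TEST_START:
        return "validation"
    return "test"


for day in (datetime(2019, 5, 1), datetime(2020, 7, 15), datetime(2020, 9, 10)):
    print(day.date(), assign_split(day))
```

Because the boundaries are dates rather than random samples, no event from a later period can leak into an earlier split.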
3. Train the foundation model
Create /basemodel/configs/hm_pretrain.py:
from monad.ui import pretrain
from pathlib import Path
pretrain(
    config_path=Path("/basemodel/configs/hm_fm_config.yaml"),
    output_path=Path("/basemodel/project/fm/"),
)
Checkpoints will be saved to /basemodel/project/fm/. You can also train via the CLI — see Training Execution.
4. Train a scenario model
We are building a binary churn classifier. The model predicts whether a customer will churn, meaning they make no purchase in the next 21 days.
Create /basemodel/configs/hm_scenario_train.py:
from datetime import timedelta
from pathlib import Path
from typing import Dict
import numpy as np
from monad.batch import SPLIT_TIMESTAMP
from monad.ui.config import TrainingParams
from monad.ui.module import BinaryClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events, has_incomplete_training_window
# --- Names & Paths -----------------------------------------------------------
project_dir = Path("/basemodel/project").resolve()
scenario_name = "hm_churn_21d"
foundation_model_path = project_dir / "fm"
scenario_model_path = project_dir / "scenarios" / scenario_name
# --- Target Function ---------------------------------------------------------
# churn definition:
# 1: churned (no events in the future window)
# 0: not churned (at least one event in the future window)
TARGET_EVENT_TABLE = "transactions"
TARGET_WINDOW_DAYS = 21
def target_fn(history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray | None:
    # filters out users with too short remaining window
    if has_incomplete_training_window(_ctx, timedelta(days=TARGET_WINDOW_DAYS)):
        return None
    # filters out users with no history
    if history[TARGET_EVENT_TABLE].count() == 0:
        return None
    # trims the future to desired window
    future_window = future.interval_from(
        _ctx[SPLIT_TIMESTAMP],
        timedelta(days=TARGET_WINDOW_DAYS),
    )
    # churn label
    y = 0 if future_window[TARGET_EVENT_TABLE].count() > 0 else 1
    return np.array([y], dtype=np.float32)
# --- Training ----------------------------------------------------------------
task = BinaryClassificationTask()
training_params = TrainingParams(
    checkpoint_dir=scenario_model_path,
    learning_rate=0.0001,
    epochs=3,
    devices=[0],
    limit_train_batches=5,  # smoke test; remove for full run
    limit_val_batches=5,  # smoke test; remove for full run
)
trainer = load_from_foundation_model(
    checkpoint_path=foundation_model_path,
    downstream_task=task,
    target_fn=target_fn,
)
trainer.fit(training_params=training_params, overwrite=True)
What's happening: You define a target function that labels each customer as churned or not based on a 21-day window, then fine-tune the foundation model into a binary classifier. Target functions and scenario parameters are covered in depth in the Scenarios guide.
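The labeling rule in the target function can also be read as plain Python over a list of purchase timestamps. This is a sketch of the logic only and does not use the monad API: the function name churn_label, the data_end argument, and the half-open window [split, split + 21 days) are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def churn_label(purchase_times, split_time, data_end, window=timedelta(days=21)):
    """Return 1 (churned), 0 (not churned), or None if the customer is skipped."""
    # Not enough data after the split to observe the full window: skip.
    if data_end - split_time < window:
        return None
    # No purchases before the split: skip.
    if not any(t < split_time for t in purchase_times):
        return None
    window_end = split_time + window
    purchased = any(split_time <= t < window_end for t in purchase_times)
    return 0 if purchased else 1


split = datetime(2020, 6, 1)
data_end = datetime(2020, 9, 22)  # example value for the last available timestamp

# Purchase nine days after the split: not churned
print(churn_label([datetime(2020, 5, 1), datetime(2020, 6, 10)], split, data_end))
# Next purchase falls outside the 21-day window: churned
print(churn_label([datetime(2020, 5, 1), datetime(2020, 7, 10)], split, data_end))
```

Returning None mirrors the two filters in target_fn: customers with no history and split points too close to the end of the data yield no training example.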
5. Evaluate the model
Create /basemodel/configs/hm_scenario_test.py:
from datetime import datetime
from pathlib import Path
from monad.ui.config import MetricParams, OutputType, TestingParams
from monad.ui.module import load_from_checkpoint
# --- Names & Paths -----------------------------------------------------------
scenario_checkpoint_dir = Path("/basemodel/project/scenarios/hm_churn_21d").resolve()
predictions_path = scenario_checkpoint_dir / "test" / "predictions_and_ground_truth.tsv"
# --- Test Set Definition -----------------------------------------------------
prediction_date = datetime(2020, 9, 5)
limit_test_batches = 100 # smoke test — remove for full run
# --- Metrics -----------------------------------------------------------------
metrics = [
    MetricParams(alias="auroc", metric_name="AUROC", kwargs={"task": "binary"}),
    MetricParams(alias="recall", metric_name="Recall", kwargs={"task": "binary"}),
    MetricParams(alias="precision", metric_name="Precision", kwargs={"task": "binary"}),
]
# --- Test --------------------------------------------------------------------
testing_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)
testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
    metrics=metrics,
    limit_test_batches=limit_test_batches,  # smoke test; remove for full run
)
testing_module.test(testing_params=testing_params)
What's happening: You load the trained scenario checkpoint, run it against a held-out test set at a specific prediction date, and compute metrics (AUROC, recall, precision). Results and metrics are saved in TSV format. See Scenarios for the full evaluation workflow.
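For readers less familiar with the metrics, the sketch below shows how precision and recall follow from true labels and predicted scores. It is a hand-rolled illustration, not the implementation used by the library: the function name precision_recall and the 0.5 decision threshold are assumptions, and the sample values are made up.

```python
def precision_recall(y_true, y_score, threshold=0.5):
    """Compute precision and recall for binary labels at a fixed threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels (1 = churned) and scores
y_true = [1, 0, 1, 1, 0]
y_score = [0.9, 0.6, 0.4, 0.8, 0.1]

precision, recall = precision_recall(y_true, y_score)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predicted churners really churned, and two of the three actual churners were caught, so both metrics come out at two thirds. AUROC, by contrast, is threshold-free and summarizes ranking quality across all thresholds.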
6. Generate predictions
Create /basemodel/configs/hm_scenario_predict.py:
from datetime import datetime
from pathlib import Path
from monad.ui.config import OutputType, TestingParams
from monad.ui.module import load_from_checkpoint
# --- Names & Paths -----------------------------------------------------------
scenario_checkpoint_dir = Path("/basemodel/project/scenarios/hm_churn_21d").resolve()
predictions_path = scenario_checkpoint_dir / "predictions" / "predictions.tsv"
# --- Prediction Set Definition -----------------------------------------------
prediction_date = datetime(2020, 9, 23)
# --- Predict -----------------------------------------------------------------
prediction_module = load_from_checkpoint(checkpoint_path=scenario_checkpoint_dir)
testing_params = TestingParams(
    output_type=OutputType.DECODED,
    local_save_location=predictions_path,
    prediction_date=prediction_date,
)
prediction_module.predict(testing_params=testing_params)
What's happening: You load the same scenario checkpoint and generate predictions for all entities as of a given date. Output is saved in TSV format. See Scenarios — Inference for prediction options and output formats.
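Once the TSV file exists, it can be inspected with the standard library alone. The sketch below assumes the file has a header row, and the column names customer_id and score are hypothetical placeholders: check the actual header of predictions.tsv before relying on them. An in-memory sample stands in for the real file.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# In-memory stand-in for predictions.tsv; column names are hypothetical
sample = io.StringIO("customer_id\tscore\nc1\t0.91\nc2\t0.12\n")
rows = read_tsv(sample)

# Rank customers by predicted score, highest first
ranked = sorted(rows, key=lambda r: float(r["score"]), reverse=True)
print(ranked[0]["customer_id"])
```

For the real output, open the file with open(predictions_path, newline="") and pass the handle to read_tsv.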