Example: H&M Fashion Dataset on Kaggle

In this tutorial, we will guide you through making propensity predictions on one of the Kaggle datasets: H&M Personalized Fashion Recommendations.

Step 1 - Foundation Model and Data Sources Configuration

The first step is to correctly configure the data source connection and the foundation model. In this example, we will be using a Snowflake data warehouse, but it is possible to use any other supported connection type listed in the documentation.

datasources:
  - type: main_entity_attribute
    main_entity_column: customer_id
    name: customers
    data_location:
      source: snowflake
      connection_params:
        user: ${SNOWFLAKE_USER}
        password: ${SNOWFLAKE_PASSWORD}
        account: ${SNOWFLAKE_ACCOUNT}
        warehouse: ${SNOWFLAKE_WAREHOUSE}
        role: ${SNOWFLAKE_ROLE}
        database: HM_KAGGLE
        db_schema: PRIVATE
      table_name: customers
  - type: event
    main_entity_column: customer_id
    name: transactions
    date_column: t_dat
    text_columns:
      - prod_name
      - detail_desc
    data_location:
      source: snowflake
      connection_params:
        user: ${SNOWFLAKE_USER}
        password: ${SNOWFLAKE_PASSWORD}
        account: ${SNOWFLAKE_ACCOUNT}
        warehouse: ${SNOWFLAKE_WAREHOUSE}
        role: ${SNOWFLAKE_ROLE}
        database: HM_KAGGLE
        db_schema: PRIVATE
      table_name: transactions
      

data_params:
  data_start_date: 2018-09-20 00:00:00
  validation_start_date: 2020-09-01 00:00:00
  check_target_for_next_N_days: 21

loader_params:
  batch_size: 256
  num_workers: 10

training_params:
  learning_rate: 0.0001
  epochs: 3

hidden_dim: 2048

In this case, we have already joined the data from the ARTICLES table into the TRANSACTIONS table, so we only provide two data sources: one of the event type and one of the attribute type.

For the purpose of this tutorial, we leave the rest of the parameters unchanged. More details about them can be found in the documentation.

Naturally, under connection_params users need to provide their credentials, for example for Snowflake or any other supported database.
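If, as the ${...} syntax suggests, these placeholders are resolved from environment variables, a small pre-flight check can save a failed run. The sketch below is our own illustration, not part of BaseModel:

import os

# Assumption: the ${...} placeholders in the config above are filled in
# from environment variables with the same names.
required = [
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_ROLE",
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing Snowflake credentials: {', '.join(missing)}")

With the credentials in place, we can move on to the next stage: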

Data Preprocessing and Foundation Model Training

At this point, the only thing we need to do is run BaseModel on the previously prepared configuration file.

Let's run the following CLI command:

python -m pretrain --config <path/to/config.yml> --features-path <path/to/store/pretrain/artifacts>

At this stage, the user can monitor what is happening by reviewing the logs output to the console.

After a while, data preprocessing will be done and the foundation model will be ready. All necessary files will be stored in the artifacts folder defined by the --features-path argument above:

  • fm - foundation model folder with model checkpoint
  • features - folder with features transformations/embeddings
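As a quick sanity check, you can verify that both folders were created. This is a minimal sketch; the artifact path below is only a placeholder for whatever you passed as --features-path, and the exact contents of each folder may differ between versions:

from pathlib import Path

# Placeholder path: use the directory you passed as --features-path.
artifacts = Path("path/to/store/pretrain/artifacts")
for subfolder in ("fm", "features"):
    print(subfolder, "present:", (artifacts / subfolder).is_dir())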

That's it! The foundation model is ready, and we can move on to the next stage.

Step 2 - Downstream Model Training

At this stage, we need to prepare the configuration for the downstream task training phase. This is done by writing your own Python script or by modifying one of the existing templates.

In this case, let's say we want to calculate the propensity to buy specific categories per user.

Let's have a look at the training script used in this tutorial:

from typing import Dict

import numpy as np

from monad.ui.config import MonadTrainingParams
from monad.ui.module import MultilabelClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events

TARGET_NAMES = [
    "Denim Trousers",
    "Swimwear",
    "Trousers",
    "Jersey Basic",
    "Ladies Sport Bottoms",
    "Basic 1",
    "Jersey fancy",
    "Blouse",
    "Shorts",
    "Trouser",
    "Ladies Sport Bras",
    "Casual Lingerie",
    "Expressive Lingerie",
    "Dress",
    "Dresses",
    "Tops Knitwear",
    "Skirt",
    "Nightwear",
    "Knitwear",
]
TARGET_ENTITY = "department_name"


def target_fn(_history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray:

    purchase_target, _ = future["transactions"].groupBy(TARGET_ENTITY).exists(groups=TARGET_NAMES)
    return purchase_target


task = MultilabelClassificationTask()
fm_path = "/data1/monad/inference/new_features_hm/fm"
num_outputs = len(TARGET_NAMES)
training_params = MonadTrainingParams(
    learning_rate=5e-5,
    checkpoint_dir="checkpoints/hm-propensity",
    epochs=1,
    devices=[1],
)

if __name__ == "__main__":
    trainer = load_from_foundation_model(
        checkpoint_path=fm_path, downstream_task=task, target_fn=target_fn, num_outputs=num_outputs
    )
    trainer.fit(training_params=training_params)

TARGET_ENTITY - in this case, it is the target column that contains the categories of interest, which we define in TARGET_NAMES.

Next, we define the target_fn in a way that describes users' purchase behaviour - in this case, making future transactions in specific departments.
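To make the pattern concrete, here is a hypothetical single-category variant that reuses only the calls already shown in the script above (same imports and TARGET_ENTITY; the function name and the choice of "Swimwear" are ours for illustration). Using it would also mean setting num_outputs to 1:

def swimwear_target_fn(_history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray:
    # Same groupBy/exists pattern as above, restricted to one department:
    # the label is 1 if the customer buys anything from "Swimwear" within
    # the prediction window, 0 otherwise.
    target, _ = future["transactions"].groupBy(TARGET_ENTITY).exists(groups=["Swimwear"])
    return target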

task = MultilabelClassificationTask() defines the downstream learning task that will be used in training.

The rest of the script defines the path to the foundation model, the output path for model checkpoints, and the training parameters.

Once this file is properly prepared, we run it via python train.py. After it is done, it will create a new checkpoint folder at the location defined by checkpoint_dir (checkpoints/hm-propensity in this example).

Step 3 - Predictions

Now the final step - running predictions. You run it the same way as before, by preparing and running a Python script. In our case it will look like this (note that checkpoint_dir should point to the checkpoint directory produced in Step 2):

from monad.ui.module import load_from_checkpoint
from monad.ui.config import MonadTestingParams

if __name__ == "__main__":
    checkpoint_dir = "checkpoint"
    testing_module = load_from_checkpoint(checkpoint_dir)
    testing_params = MonadTestingParams(
        save_path="checkpoint/preds.csv",
        limit_test_batches=100,
    )
    testing_module.predict(testing_params=testing_params)
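Note that limit_test_batches appears to cap the number of batches that are scored, which is convenient for a quick test run. After the run completes, the predictions are written to the save_path defined above. A simple way to inspect them, assuming a plain CSV (the exact column names depend on your BaseModel version and the target definition):

import pandas as pd

# Load the saved propensity scores; we only assume a regular CSV at the
# save_path used above.
preds = pd.read_csv("checkpoint/preds.csv")
print(preds.shape)
print(preds.head())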