Example: H&M Fashion Dataset on Kaggle
In this tutorial, we will guide you through making propensity predictions on one of the Kaggle datasets, namely H&M Personalized Fashion Recommendations.
Step 1 - Foundation Model and Data Sources configuration
The first step is to correctly configure the data source connections and the foundation model. In this example, we will be using a Snowflake data warehouse, but it is possible to use any other supported connection type listed in the documentation.
datasources:
  - type: main_entity_attribute
    main_entity_column: customer_id
    name: customers
    data_location:
      source: snowflake
      connection_params:
        user: ${SNOWFLAKE_USER}
        password: ${SNOWFLAKE_PASSWORD}
        account: ${SNOWFLAKE_ACCOUNT}
        warehouse: ${SNOWFLAKE_WAREHOUSE}
        role: ${SNOWFLAKE_ROLE}
      database: HM_KAGGLE
      db_schema: PRIVATE
      table_name: customers
  - type: event
    main_entity_column: customer_id
    name: transactions
    date_column: t_dat
    text_columns:
      - prod_name
      - detail_desc
    data_location:
      source: snowflake
      connection_params:
        user: ${SNOWFLAKE_USER}
        password: ${SNOWFLAKE_PASSWORD}
        account: ${SNOWFLAKE_ACCOUNT}
        warehouse: ${SNOWFLAKE_WAREHOUSE}
        role: ${SNOWFLAKE_ROLE}
      database: HM_KAGGLE
      db_schema: PRIVATE
      table_name: transactions
data_params:
  data_start_date: 2018-09-20 00:00:00
  validation_start_date: 2020-09-01 00:00:00
  check_target_for_next_N_days: 21
loader_params:
  batch_size: 256
  num_workers: 10
training_params:
  learning_rate: 0.0001
  epochs: 3
  hidden_dim: 2048
In this case, we have already joined the data from the ARTICLES table into the TRANSACTIONS table, so we only provide two data sources: one of event type and one of attribute type.
For the purpose of this tutorial, we leave the rest of the parameters unchanged. More details about them can be found in the documentation.
Naturally, under connection_params users need to provide their credentials, for example for Snowflake or any other supported database; in this configuration they are supplied via the ${...} placeholders, as in the sketch below.
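The ${...} placeholders are typically resolved from environment variables. Assuming that is how your setup supplies them (this check is our own convenience, not part of BaseModel), a small pre-flight script before launching training could look like this:

import os

# Optional sanity check; assumption: the ${...} placeholders in the config
# are substituted from environment variables. Fails fast if any are missing.
REQUIRED_VARS = [
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_ROLE",
]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing Snowflake credentials: {', '.join(missing)}")

With the credentials in place, we can move on to the next stage: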
Data Preprocessing and Foundation Model Training
At this point, the only thing we need to do is to run BaseModel on the previously prepared configuration file.
Let's run the following CLI command:
python -m pretrain --config <path/to/config.yml> --features-path <path/to/store/pretrain/artifacts>
At this stage, the user can monitor what is happening by reviewing the logs printed to the console.
After a while, data preprocessing will be done and the foundation model will be ready. All necessary files will be stored in the artifacts folder defined via --features-path above:
- fm - foundation model folder with model checkpoint
- features - folder with feature transformations/embeddings
That's it! The foundation model is ready, and we can move on to the next stage.
Step 2 - Downstream Model training
At this stage, we need to prepare the configuration for the downstream task training phase. This is done by writing your own Python script or by modifying one of the existing templates.
In this case, let's say we want to calculate the propensity to buy specific categories per user.
Let's have a look at the training script used in this tutorial:
from typing import Dict

import numpy as np

from monad.ui.config import MonadTrainingParams
from monad.ui.module import MultilabelClassificationTask, load_from_foundation_model
from monad.ui.target_function import Attributes, Events

# Department names we want to compute purchase propensities for
TARGET_NAMES = [
    "Denim Trousers",
    "Swimwear",
    "Trousers",
    "Jersey Basic",
    "Ladies Sport Bottoms",
    "Basic 1",
    "Jersey fancy",
    "Blouse",
    "Shorts",
    "Trouser",
    "Ladies Sport Bras",
    "Casual Lingerie",
    "Expressive Lingerie",
    "Dress",
    "Dresses",
    "Tops Knitwear",
    "Skirt",
    "Nightwear",
    "Knitwear",
]
TARGET_ENTITY = "department_name"


def target_fn(_history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray:
    # Binary vector: for each name in TARGET_NAMES, does a future transaction
    # in that department exist for this customer?
    purchase_target, _ = future["transactions"].groupBy(TARGET_ENTITY).exists(groups=TARGET_NAMES)
    return purchase_target


task = MultilabelClassificationTask()
fm_path = "/data1/monad/inference/new_features_hm/fm"
num_outputs = len(TARGET_NAMES)
training_params = MonadTrainingParams(
    learning_rate=5e-5,
    checkpoint_dir="checkpoints/hm-propensity",
    epochs=1,
    devices=[1],
)

if __name__ == "__main__":
    trainer = load_from_foundation_model(
        checkpoint_path=fm_path, downstream_task=task, target_fn=target_fn, num_outputs=num_outputs
    )
    trainer.fit(training_params=training_params)
TARGET_ENTITY - in this case, it is simply the target column that contains the categories of interest, which we define in TARGET_NAMES.
Next, we define the target_fn in a way that describes the users' purchase behaviour - in this case, making future purchases in specific categories.
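The same mechanism can be adapted to other targets. As a purely illustrative sketch (the function name below is ours, it reuses the imports and constants from the script above, and "Swimwear" is simply one of the entries in TARGET_NAMES), a single-category propensity would pair a one-element group list with num_outputs=1:

# Hypothetical single-output variant of target_fn, reusing Events, Attributes,
# TARGET_ENTITY and np from the script above; not part of the tutorial script.
def swimwear_target_fn(_history: Events, future: Events, _entity: Attributes, _ctx: Dict) -> np.ndarray:
    # exists() yields a 0/1 indicator per requested group - here only "Swimwear"
    target, _ = future["transactions"].groupBy(TARGET_ENTITY).exists(groups=["Swimwear"])
    return target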
task = MultilabelClassificationTask() defines the downstream learning task that will be used in training.
The rest of the script defines the path to the foundation model, the output path for model checkpoints, and the training parameters.
Once this file is properly prepared, we run it via python train.py. After it is done, it will create a new folder with the trained model checkpoints (the checkpoint_dir set in the training parameters).
Step 3 - Predictions
Now the final step - running predictions. You run it the same way as before, by preparing and running a Python script. In our case, it will look like this:
from monad.ui.module import load_from_checkpoint
from monad.ui.config import MonadTestingParams

if __name__ == "__main__":
    checkpoint_dir = "checkpoint"
    testing_module = load_from_checkpoint(checkpoint_dir)
    testing_params = MonadTestingParams(
        save_path="checkpoint/preds.csv",
        limit_test_batches=100,
    )
    testing_module.predict(testing_params=testing_params)
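After prediction finishes, the scores are written to the CSV file given in save_path. As a quick way to inspect the output (the exact column layout of preds.csv depends on the BaseModel version, so treat this as a sketch):

# Minimal inspection of the saved predictions; adjust column handling to the
# schema you actually find in preds.csv (typically an entity id plus one
# propensity score per target).
import pandas as pd

preds = pd.read_csv("checkpoint/preds.csv")
print(preds.head())   # first rows: one line per customer with its scores
print(preds.shape)    # number of scored rows and output columns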