Training the Model

Usage of pretrain function

⚠️

Check This First!

This article refers to BaseModel accessed via a Docker container. If you are using BaseModel as a Snowflake GUI application, please refer to the Snowflake Native App section.


Once you have configured your data and model parameters in the YAML file, it is time to train your foundation model. With BaseModel running as a Docker container, you can do that in two ways:

  • run the pretrain Python function, or
  • run the monad.pretrain command in your terminal.

Both ways are explained in more detail below.

Start the training using a Python function

The most basic syntax for launching the training from a Python environment is shown in the example below:

from monad.ui import pretrain
from pathlib import Path


pretrain(
    config_path=Path("path/to/config.yaml"), 
    output_path=Path("path/to/store/pretrain/artifacts")
)

The pretrain function accepts additional arguments to manage the training process. Below is the full list of arguments accepted at this stage.

⚠️

Note

Some of these arguments directly correspond to YAML parameters. If specified here, they will override the entries in the configuration file.

Parameters
  • config_path : str
    Required. No default.
    The path to the YAML configuration file.

  • output_path : str
    Required. No default.
    The path to the folder where you intend to store the results.

  • storage_config_path : str
    Optional. Default: None
    The path to a file that configures the file system.

  • resume : boolean
    Optional. Default: False
    Whether to resume the training. If True, training is resumed from the last checkpoint if one exists; otherwise an error is raised.

  • overwrite : boolean
    Optional. Default: False
    Whether to overwrite the previous training results. If True, the results are overwritten. If False, resume is not set, and checkpoints from a previous training are present, an error is raised.

  • callbacks : list[Callback]
    Optional. Default: Lightning factory default
    List of additional PyTorch Lightning callbacks to add to the training.

  • pl_logger : str
    Optional. Default: None
    PyTorch Lightning logger to use.

  • uniqueness_threshold : float
    Optional. Default: 0.9
    DO NOT USE - TEST OPTION ONLY. Maximum uniqueness ratio to hash a column.

  • nan_threshold : float
    Optional. Default: 0.9
    Maximum fraction of missing values allowed in a column for it to be processed.
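
The resume and overwrite arguments described above cover three common situations: a first run, continuing an interrupted run, and starting over in the same folder. The sketch below shows one way to keep those choices explicit. The helper names pretrain_kwargs and launch and the mode labels are hypothetical and not part of BaseModel; only the pretrain arguments themselves come from the list above.

```python
from pathlib import Path


def pretrain_kwargs(config_path, output_path, mode="fresh"):
    """Build keyword arguments for pretrain (hypothetical helper).

    mode="fresh"     -> defaults only (resume=False, overwrite=False)
    mode="resume"    -> resume=True, continue from the last checkpoint
    mode="overwrite" -> overwrite=True, discard previous results
    """
    if mode not in {"fresh", "resume", "overwrite"}:
        raise ValueError(f"unknown mode: {mode!r}")
    kwargs = {
        "config_path": Path(config_path),
        "output_path": Path(output_path),
    }
    if mode == "resume":
        kwargs["resume"] = True
    elif mode == "overwrite":
        kwargs["overwrite"] = True
    return kwargs


def launch(config_path, output_path, mode="fresh"):
    """Start training; only works where monad is installed (e.g. the container)."""
    from monad.ui import pretrain  # imported lazily so the helper above stays testable

    pretrain(**pretrain_kwargs(config_path, output_path, mode=mode))
```

Inside the BaseModel container, launch("path/to/config.yaml", "path/to/store/pretrain/artifacts", mode="resume") would then continue an interrupted run.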


Initiate training in a command line

Alternatively, you can launch the training from your terminal using the syntax below:

python -m monad.pretrain \
  --config "path/to/config.yml" \
  --features-path "path/to/store/pretrain/artifacts" \
  --overwrite

As before, you can add arguments to manage the training process, but fewer options are available:

Parameters
  • --config
    required
    Requires "str". The path to the YAML configuration file. Equivalent to config_path in Python.

  • --features-path
    required
    Requires "str". The path to the folder where you intend to store the results. Equivalent to output_path in Python.

  • --storage-config
    optional
    Requires "str". The path to a file that configures the file system. Equivalent to storage_config_path in Python.

  • --resume
    optional
    If provided, training is resumed from the last checkpoint if one exists; otherwise an error is raised.

  • --overwrite
    optional
    If provided, previous training results are overwritten. If neither --overwrite nor --resume is provided and checkpoints from a previous training are present, an error is raised.

⚠️

Note

Some of these arguments directly correspond to YAML parameters (some names are slightly different).
If specified here, they will override the entries in the configuration file.
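
If the command-line form has to be started from a script (for example from a scheduler), the flags listed above can be assembled programmatically. The following is a minimal sketch using only the Python standard library. The helper names build_pretrain_command and run_pretrain are hypothetical; the flags are the ones documented in this section.

```python
import subprocess
import sys


def build_pretrain_command(config, features_path, storage_config=None,
                           resume=False, overwrite=False):
    """Assemble the argument list for `python -m monad.pretrain` (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "monad.pretrain",
        "--config", str(config),
        "--features-path", str(features_path),
    ]
    if storage_config is not None:
        cmd += ["--storage-config", str(storage_config)]
    # --resume and --overwrite are plain switches without a value.
    if resume:
        cmd.append("--resume")
    if overwrite:
        cmd.append("--overwrite")
    return cmd


def run_pretrain(**options):
    """Run the command; only works where monad is installed (e.g. the container)."""
    # check=True raises CalledProcessError if the process exits with a non-zero status.
    return subprocess.run(build_pretrain_command(**options), check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.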

End of Foundation Model training

Training is finished once the console output states that the model checkpoints have been saved and a _FINISHED folder containing the best model has been created under output_path.
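
A script that depends on the trained model can check for this marker before continuing. The sketch below assumes only what is stated above, namely that a folder named _FINISHED appears directly under output_path; the function name training_finished is hypothetical.

```python
from pathlib import Path


def training_finished(output_path):
    """Return True if a _FINISHED folder exists directly under output_path."""
    return (Path(output_path) / "_FINISHED").is_dir()
```

For example, training_finished("path/to/store/pretrain/artifacts") returns False while training is still running or if it stopped before completion.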