Change Log
Release 1.70
Unreleased — targets monad 1.7.0
This release adds offline batch prediction from the command line and Databricks as a prediction output, gives finer control over distributed-training timeouts and query chunking, makes seeding reproducible down to modality dropout, and improves GradientSHAP attribution quality — alongside reliability fixes for checkpoint resume and progress reporting.
New Features
- Offline batch prediction from the command line
A new python -m monad.run --predict stage scores new data with an already-trained checkpoint, without custom code. It takes a --checkpoint-path and a --testing-params-path YAML (prediction_date, output_type, entity_ids, and a local and/or remote save location). It runs on a single GPU, and local output is written as TSV. See Inference and Training Execution.
- Write predictions to Databricks
remote_save_location now supports Databricks in addition to Snowflake. The target table is created on demand and rows are appended in batches, with the batch size tunable through the DATABRICKS_WRITE_BATCH_SIZE environment variable (default 1000). See Writing Predictions.
- Configurable distributed-training timeouts
Two new TrainingParams fields help long multi-GPU runs avoid spurious timeouts: nccl_timeout extends the timeout for NCCL collective operations, and rank_sync_timeout adds a dedicated per-step barrier that absorbs data-loading skew between ranks independently of the gradient synchronization. Both default to off and are ignored on a single device. See Distributed Training.
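As a rough illustration of the batched append described for Databricks output, the sketch below reads DATABRICKS_WRITE_BATCH_SIZE (falling back to the documented default of 1000) and splits rows into batches. The helper names write_batch_size and iter_batches are hypothetical and not part of monad, and the fallback for invalid values is an assumption; the actual writer may behave differently.

```python
import os
from itertools import islice
from typing import Iterable, Iterator, List, Mapping, Optional

DEFAULT_BATCH_SIZE = 1000  # default stated in the release note


def write_batch_size(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the batch size from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("DATABRICKS_WRITE_BATCH_SIZE", "")
    try:
        value = int(raw)
    except ValueError:
        # Assumption: unset or non-numeric values fall back to the default.
        return DEFAULT_BATCH_SIZE
    # Assumption: non-positive values also fall back to the default.
    return value if value > 0 else DEFAULT_BATCH_SIZE


def iter_batches(rows: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists holding at most `size` rows each."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

A smaller batch size lowers the memory held per append at the cost of more round trips; a larger one does the opposite.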
Improvements
- Separate query-chunking controls for fit and data loading
The single num_query_chunks setting is replaced by two: cleora_num_query_chunks for the fit (embedding) phase and data_loading_num_query_chunks for the train, validation, test and predict queries, so database memory pressure can be tuned for each phase independently. This is a breaking change: configurations that set num_query_chunks must be migrated to the new fields. Note that mid-epoch resume is not supported when data_loading_num_query_chunks is greater than 1. See query_optimization.
- Reproducible modality dropout
The seed parameter now also governs sketch-dropping randomness inside modalities, which was previously fixed internally. Seeded runs are now reproducible end to end.
- More meaningful GradientSHAP attributions
GradientSHAP now draws its baseline from a real background distribution sampled across the predict set, instead of an all-zero baseline. Attributions are more representative of the data and, as a consequence, are now stochastic and seed-dependent, so they will differ from earlier releases. See GradientShapInterpreter.
- Resilient caching
Cache-write failures are now retried (up to three attempts) and then fall back to streaming data directly from the database, instead of failing the run.
- Smaller checkpoints and faster mid-epoch resume
Buffer state is serialized in a compact form, making checkpoints smaller and resuming training mid-epoch faster and lighter on memory.
- More accurate Cleora resource estimation
Per-column basket statistics are now computed in a single SQL pass, improving the accuracy of memory and worker sizing estimates.
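The resilient-caching behaviour above (retry a failed cache write, then stream from the database) follows a common retry-then-fallback pattern. The sketch below shows that pattern only; the function name, the callables, and the choice of OSError as the failure type are assumptions for illustration and do not describe monad internals.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_CACHE_WRITE_ATTEMPTS = 3  # "up to three attempts" per the release note


def load_with_cache_fallback(
    write_cache: Callable[[], T],
    stream_from_db: Callable[[], T],
    attempts: int = MAX_CACHE_WRITE_ATTEMPTS,
) -> T:
    """Try the cache write a few times, then fall back to streaming."""
    for attempt in range(1, attempts + 1):
        try:
            return write_cache()
        except OSError as exc:  # assumption: cache failures surface as OSError
            print(f"cache write failed (attempt {attempt}/{attempts}): {exc}")
    # Every attempt failed: stream directly instead of failing the run.
    return stream_from_db()
```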
Fixes
- Checkpoint resume with time-series features
Fixed a failure when loading or resuming from a checkpoint for models that use time-series features.
- Progress bar respects the entity_ids filter
When a run is restricted with entity_ids, the progress bar's total entity count now reflects the filtered subset instead of the full entity set.
Release 1.50
May 2026
This release expands interpretability with a second attribution method and a client-ready visual report, adds composable pattern-matching utilities for target function authoring, makes the Cleora embedding dimension configurable, and brings reliability and performance improvements across Databricks, Cleora, fasttext, and DuckDB.
New Features
- GradientSHAP attribution method
interpret() now supports a second attribution method, GradientSHAP, selected with method="gradient_shap". It can be faster than the default Integrated Gradients on many workloads and produces denser per-feature attributions, while Integrated Gradients remains the default for deterministic and audit-friendly workflows. See Attribution Methods for the full comparison and a switch example.
- SHAP-library-style report for client hand-off
A new save_shap_plots=True flag on interpret() renders a parallel shap/ directory containing beeswarm, global bar, heatmap, top-N waterfall, and interactive force plots, plus a single static shap_report.html index that links everything together. The report is designed for client hand-off and requires the optional shap extra (poetry install -E interpretability). See the new SHAP Report guide.
- Direct interpreter access for scripted attribution
Four new task-specific interpreter classes (GradientShapInterpreter, ClassificationGradientShapInterpreter, RecommendationGradientShapInterpreter, and RegressionGradientShapInterpreter) are exported from monad.interpretability for users who batch attribution outside the standard interpret() pipeline or want to tune knobs like n_samples, stdevs, or seed.
- Standalone SHAP report helpers
New attributions_to_shap_explanation() and save_shap_report() helpers are exposed at the package root, so the SHAP-style report can be rendered from already-computed attributions without re-running interpret(). This is useful when integrating reports into custom batched pipelines.
- Pattern utilities for target function authoring
The Pattern API for matching events in target functions gains six new composable utilities (first and last properties, not_followed_by, to_pattern, elapsed_time, and count_within), reducing repetitive timestamp-handling code for negation patterns, occurrence selection, time-to-event, and frequency counting.
- Configurable Cleora embedding dimension
A new cleora_dim field controls the output dimension of Cleora graph embeddings, letting you steer model input size and memory footprint per dataset instead of relying on the previously hard-coded default.
- Suppress selected foundation model features during fine-tuning
Selected foundation model modalities can now be ignored when fine-tuning a downstream model. This is useful when adapting a pretrained foundation model to a task that does not benefit from the full FM feature set.
- Pre-fit uniqueness check on the main entity column
BaseModel now verifies that the main entity attribute column contains unique values before training begins, catching schema mistakes early instead of mid-run.
- Skip already-computed features on resume
Resumed fits no longer re-run data sample analysis for features that were already computed in a previous run, shortening warm-restart cycles on large datasets.
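For readers unfamiliar with GradientSHAP, the sketch below shows the general idea on a toy model: sample a baseline, add noise to the input, pick a random point between the two, and average gradient times (input minus baseline). It is a generic pure-Python estimate, not monad's implementation; the function gradient_shap and its signature are hypothetical, and the parameter names n_samples, stdevs, and seed only mirror the knobs named above.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def gradient_shap(
    grad_fn: Callable[[Vector], Vector],
    x: Sequence[float],
    background: Sequence[Sequence[float]],
    n_samples: int = 50,
    stdevs: float = 0.0,
    seed: int = 0,
) -> Vector:
    """Monte Carlo estimate of expected gradients for one input `x`."""
    rng = random.Random(seed)
    totals = [0.0] * len(x)
    for _ in range(n_samples):
        baseline = rng.choice(background)                    # random baseline row
        noisy = [xi + rng.gauss(0.0, stdevs) for xi in x]    # smoothing noise
        alpha = rng.random()                                 # position on the path
        point = [b + alpha * (n - b) for b, n in zip(baseline, noisy)]
        grads = grad_fn(point)
        for i, g in enumerate(grads):
            totals[i] += g * (noisy[i] - baseline[i])
    return [t / n_samples for t in totals]


# Toy linear model f(x) = 2*x0 - x1, whose gradient is constant.
def linear_grad(_point: Vector) -> Vector:
    return [2.0, -1.0]


attributions = gradient_shap(linear_grad, [3.0, 4.0], [[0.0, 0.0]])
```

Because the baseline, the noise, and the path position are all sampled, results depend on the seed unless the model is linear and the noise is zero, as in the toy call above.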
Improvements
- Resource estimation is now opt-in
The resource estimation step introduced in Release 1.30 is no longer enabled by default, reducing time-to-first-run for users who do not need an estimate on every fit. It can still be enabled explicitly when needed.
- Columnar streaming from Databricks
Data is now streamed from Databricks in columnar format, reducing transfer overhead and improving throughput on large reads.
- Databricks reliability and extensibility
The Databricks integration now retries transient failures automatically, and extra parameters can be passed through to the Databricks SQL connector for advanced configuration.
- Faster Cleora aggregation on large graphs
Cleora query planning and aggregation have been optimized to reduce runtime on large graphs.
- Lower fasttext memory and CPU footprint
fasttext training now caps CPU usage and limits the volume of text it consumes per run, reducing RAM pressure and contention on machines with many cores.
- Lower DuckDB memory footprint
DuckDB-backed workloads now use less RAM, making large fits more feasible on smaller machines.
Fixes
- Correct NULL handling in Cleora subqueries
Fixed an issue where NULL values in Cleora subqueries were not handled correctly, which could produce inaccurate downstream aggregation results. Users who ran Cleora-based pipelines on datasets containing NULLs in graph inputs are advised to re-run affected fits to ensure result correctness.
- Faster query planning via duplicate CTE elimination
Fixed a planner issue where common table expressions could be emitted twice during query concatenation, causing the same work to be executed redundantly. Affected queries now run faster with no change in output.
- Third-party dependency refresh
Upstream dependencies have been updated to their latest compatible versions for improved security and stability. No breaking changes are expected.
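To make the duplicate-CTE fix above more concrete, the sketch below shows one way to concatenate CTE lists so that each name is emitted only once. It is an illustration of the general idea under assumed data shapes (a CTE as a name and body pair); merge_ctes and render_with_clause are hypothetical helpers, not the planner's actual code.

```python
from typing import Dict, Iterable, List, Tuple

Cte = Tuple[str, str]  # (name, body SQL); an assumed representation


def merge_ctes(parts: Iterable[List[Cte]]) -> List[Cte]:
    """Concatenate CTE lists, emitting each name only once."""
    seen: Dict[str, str] = {}
    merged: List[Cte] = []
    for ctes in parts:
        for name, body in ctes:
            if name in seen:
                # Same name with different SQL would be a genuine conflict.
                if seen[name] != body:
                    raise ValueError(f"conflicting definitions for CTE {name!r}")
                continue  # identical duplicate: skip it
            seen[name] = body
            merged.append((name, body))
    return merged


def render_with_clause(ctes: List[Cte]) -> str:
    """Render the merged CTEs as a single WITH clause."""
    return "WITH " + ", ".join(f"{name} AS ({body})" for name, body in ctes)
```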
Release 1.30
March 2026
This release introduces real-time training progress visibility, pre-training resource estimation, a quick validation mode, and richer data profiling reports — alongside inference performance improvements and expanded database support.
New Features
- Live training progress tracking
Training now displays a live progress bar showing entity counts, so you can monitor long-running jobs without checking logs. Progress is reported for both foundation model training and inference.
- Resource estimation before training
You can now estimate memory and compute requirements before launching a full training run. This helps you choose the right hardware configuration and avoid out-of-memory failures on large datasets.
- Quick check mode
A new fast validation mode lets you run a quick sanity check on your configuration and data pipeline before committing to a full training run. Quick check applies data limits automatically so you get rapid feedback on configuration errors or data issues.
- Enriched data profiling reports
The fit report now includes inferred column types, lists of skipped and special columns, and actionable recommendations. This makes it easier to validate your data setup and catch misconfigured columns before training.
- Automatic redundancy detection for categorical columns
BaseModel now detects redundant categorical columns (e.g. two columns that are exact mappings of each other) during the fit phase and reports them, helping you simplify your data schema.
- Suggested config generation
After the fit stage, BaseModel now generates a suggested_config.yaml file that applies the column report findings to your original config. Detected time-series columns are added as sql_lambdas with column_type_overrides, and redundant bijection columns are added to disallowed_columns. After review, the file can be used directly as your new config.yaml.
- Automatic DataLoader calibration
BaseModel can now automatically find the optimal num_workers and prefetch_factor for the DataLoader before foundation model training. When enabled, the system benchmarks multiple configurations and selects the most efficient one, improving data loading throughput without manual tuning.
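The redundancy detection described above looks for columns that are exact mappings of each other. A minimal sketch of such a bijection check is shown below; is_bijection is a hypothetical helper written for illustration, and BaseModel's own detection may use different criteria (for example, handling of missing values).

```python
from typing import Dict, Hashable, Sequence


def is_bijection(left: Sequence[Hashable], right: Sequence[Hashable]) -> bool:
    """True if every value in `left` pairs with exactly one value in `right`, and vice versa."""
    if len(left) != len(right):
        raise ValueError("columns must have the same number of rows")
    forward: Dict[Hashable, Hashable] = {}
    backward: Dict[Hashable, Hashable] = {}
    for a, b in zip(left, right):
        # setdefault stores the first pairing and returns it on later rows.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


country_code = ["PL", "DE", "PL", "FR"]
dial_prefix = [48, 49, 48, 33]
redundant = is_bijection(country_code, dial_prefix)
```

When two columns pass this check, one of them carries no extra information, which is why such columns are candidates for disallowed_columns.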
Improvements
- Faster and more efficient inference pipeline
The inference pipeline has been restructured for better throughput and lower latency, with improved memory handling during data decoding.
- Flexible entity split percentages
Entity split ratios for training, validation, and test sets now accept decimal values (e.g. 0.7, 0.15, 0.15), giving you finer control over data partitioning.
- Configurable validation batch size
A new val_batch_size parameter allows you to set a separate batch size for validation, useful when validation data has different memory characteristics than training data.
- Parquet data source entity ID support
Parquet data sources now support entity ID filtering, reaching full feature parity with other supported database connectors.
- Clearer error messages
Error messages across training and inference have been consolidated and improved, providing more context and actionable guidance when something goes wrong.
- Improved text feature handling
Text feature processing is now more robust, with better normalization of text embeddings for more consistent model performance.
- Horizontal scaling and model quantization for inference
Inference deployments now support num_replicas for horizontal scaling. Inference can also load quantized models for reduced memory footprint.
- Automatic GPU device selection
training_params.devices now defaults to "auto", which automatically selects the least-occupied GPU. It falls back to CPU if no GPUs are available. Existing configs with explicit device values are unaffected.
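The "auto" device behaviour above can be pictured as picking the GPU with the lowest occupancy and falling back to CPU when none is visible. The sketch below assumes occupancy is measured as the fraction of memory in use and that devices are named in the cuda:N style; both are assumptions, and pick_device is a hypothetical helper rather than the library's selection logic.

```python
from typing import Mapping


def pick_device(gpu_memory_used: Mapping[int, float]) -> str:
    """Return the least-occupied GPU, or "cpu" when no GPU is visible."""
    if not gpu_memory_used:
        return "cpu"  # no GPUs reported: fall back to CPU
    # Assumption: lower memory-in-use fraction means "less occupied".
    best = min(gpu_memory_used, key=lambda idx: gpu_memory_used[idx])
    return f"cuda:{best}"
```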
Fixes
- Dataset seed consistency
Fixed an issue where the random seed was not incremented correctly between epochs, which could lead to less varied sampling across training runs.
- Batch limit enforcement
Fixed an issue where dataset element limits were not always enforced at epoch boundaries, potentially causing longer-than-expected training epochs.
- Third-party package updates
Updated packages to improve performance, strengthen security, and ensure compatibility with the latest features.
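The dataset seed fix above concerns how the random seed advances between epochs. One common scheme, sketched below, derives each epoch's seed from a base seed plus the epoch index so that every epoch gets a different but reproducible order. The additive scheme and the helper epoch_order are assumptions for illustration; the release note does not say how the seed is derived.

```python
import random
from typing import List, Sequence


def epoch_order(items: Sequence[int], base_seed: int, epoch: int) -> List[int]:
    """Shuffle `items` with a seed derived from the base seed and the epoch."""
    rng = random.Random(base_seed + epoch)  # assumption: seed advances by 1 per epoch
    order = list(items)
    rng.shuffle(order)
    return order


first_epoch = epoch_order(range(20), base_seed=7, epoch=0)
second_epoch = epoch_order(range(20), base_seed=7, epoch=1)
```

If the epoch term were dropped, every epoch would replay the same order, which is the kind of reduced sampling variety the fix addresses.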
Release 1.20
November 17, 2025
This release introduces more flexible time-window handling, improved checkpointing and resume behavior, richer metric support, and continued enhancements to image-based modeling.
New Features
- Mid-epoch resume support when using multiple GPUs
Training can now also be resumed mid-epoch when training in parallel on multiple GPUs. By default, checkpoints are created between training and validation. Users can also configure checkpointing every n steps, enabling faster recovery from interruptions without restarting an entire epoch.
- Flexible time window selection for history and future
next_n_days and next_n_hours are replaced by interval_from, a more general time-window utility that lets users define arbitrary time intervals using precise durations. This enables selecting both historical and future periods with higher precision and clearer semantics, especially for short or irregular time horizons. Previously used target functions require refactoring; please refer to the documentation and recipes.
- Ranking metrics support (MRR, NDCG)
Added built-in support for common ranking metrics such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG), simplifying evaluation of recommendation and ranking scenarios.
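For reference, the two ranking metrics named above can be computed as in the sketch below. It uses the standard textbook definitions with linear gain for DCG (an exponential-gain variant also exists); the function names are illustrative, and the built-in metrics may differ in details such as cut-off at k or tie handling.

```python
import math
from typing import Sequence


def reciprocal_rank(relevance: Sequence[float]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Sequence[Sequence[float]]) -> float:
    """Average reciprocal rank over several ranked result lists."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def dcg(relevance: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))


def ndcg(relevance: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0
```

Each input list holds relevance labels in the order the model ranked the items, so [0, 1, 0] means the only relevant item was ranked second.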
Improvements
- NaN-robust aggregations by default
Aggregation functions (sum, mean, min, max) now ignore NaN values by default, leading to more stable and predictable results when working with incomplete or noisy data.
- Flexible expressions across grouping and filtering
Grouping and filtering operations, as well as aggregation methods, now accept Python callables everywhere a column name was previously required. This allows users to compute values dynamically, such as trans['price'] * trans['quantity'], and use them directly for grouping, filtering, or aggregation.
- Image features in shared entities
Shared entities now support image embeddings, allowing visual features to be unified across multiple data sources just like text or categorical attributes.
- More precise training window validation
The has_incomplete_training_window function now supports finer-grained time units, allowing checks in minutes, hours, or days. Previously used target functions require refactoring; please refer to the documentation and recipes.
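The NaN-robust aggregation behaviour above can be illustrated with plain Python: NaN values are dropped before aggregating, whereas the built-in sum would propagate them. The helpers below are a standalone sketch, not the library's aggregation functions, and returning NaN for an all-NaN input is an assumption.

```python
import math
from typing import Iterable, List


def _valid(values: Iterable[float]) -> List[float]:
    """Keep only the values that are not NaN."""
    return [v for v in values if not math.isnan(v)]


def nan_sum(values: Iterable[float]) -> float:
    return sum(_valid(values))


def nan_mean(values: Iterable[float]) -> float:
    kept = _valid(values)
    # Assumption: an input with no valid values yields NaN.
    return sum(kept) / len(kept) if kept else math.nan


def nan_min(values: Iterable[float]) -> float:
    kept = _valid(values)
    return min(kept) if kept else math.nan


def nan_max(values: Iterable[float]) -> float:
    kept = _valid(values)
    return max(kept) if kept else math.nan
```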
Fixes
- Databricks timestamp handling
Fixed an issue where timestamps could be misinterpreted when reading data from Databricks sources.
- Third-party package updates
Updated packages to improve performance, strengthen security, and ensure compatibility with the latest features.
Release 1.00
November 3, 2025
This major release introduces image embedding support, improved data streaming efficiency, and enhanced caching performance monitoring, alongside multiple stability and documentation updates.
With the introduction of image embeddings, multiple core refactors, and the publication of the API Reference, BaseModel officially reaches version 1.00.
New Features
- Image embedding support
Users can now add images, and BaseModel automatically generates image embeddings that integrate seamlessly with behavioral, text, and tabular data for complete multimodal modeling.
- Unix timestamp support
Users can now use Unix timestamps directly in time-related functions for greater flexibility in data processing.
Improvements
- Improved data streaming efficiency
Reduced memory usage and increased performance for large datasets, resulting in smoother and faster data handling.
- Revamped timezone handling
Enhanced timestamp alignment across multiple data sources for consistent temporal comparisons.
- Robust handling of missing numerical data
More stable aggregation and event computations when numerical values are partially missing.
- Optimized data transformations
Improved efficiency when processing large data structures within pipelines.
- Simplified async stream handling
Streamlined background data operations for greater reliability and maintainability.
- Improved query consistency
More predictable and stable query behavior across data modules, enhancing reliability in data access.
- Additional caching performance benchmarks
Improved cache performance benchmarking across supported databases, enabling further optimization.
Fixes
- Trainer loss logging
Fixed an issue where train_loss_epoch could log as NaN during certain training configurations.
- Time series count handling
Corrected how the time-series model handles series of counts, ensuring accurate scaling and alignment.
- Checkpoint reliability
Fixed a checkpointing issue that could prevent model state from saving correctly during long training sessions.
- Minor bugs and code maintenance
Fixed various small issues and improved overall code stability.
Documentation
- Comprehensive API Reference
Users can now access a complete API Reference section describing all classes and functions available in BaseModel.
- New guides and FAQ
Added a new FAQ section and a detailed guide on Data Types & Features.
Release 0.20
October 9, 2025
This technical release focuses on stability and model robustness improvements.
New Features
- Custom metrics support
Users can now define and register custom evaluation metrics within downstream models. This includes full compatibility with torchmetrics, duplicate-name checks, and consistent integration across training, validation, and monitoring phases.
Improvements
- Improved stability and speed of real-time inference
Optimized the inference server interface and pipeline for more stable initialization, better resource utilization, and faster CI execution.
- Improved numerical stability of training on highly multimodal data
Enhanced buffer sampling and shuffling to ensure better coverage of training examples, smoother convergence, and improved overall training stability.
- Improved regression and classification predictions
Revised the random splitting strategy and applied a uniform mixture to raw scores, leading to more balanced score distributions and reduced training bias.
Fixes
- Sketch width and depth alignment
Prevented potential crashes caused by conflicting sketch dimensions when handling certain class counts.
- Date parsing for time series
Fixed an issue where date columns were not parsed or sanitized in time-series data when a date format was provided.
- One-hot recommendation metrics for low candidate counts
Prevented crashes on OneHotRecommendationTask models when the candidate pool was smaller than k.
- Text embedding stability
Fixed occasional crashes during text model training under specific edge conditions.
- Short time-series handling
Resolved a crash that occurred when the time-series length was shorter than the kernel size.
- Insufficient training data handling
Introduced graceful exit with a clear message when available data is insufficient to fill a full batch across all selected devices.
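The one-hot recommendation fix above deals with a candidate pool smaller than k. A common way to make top-k selection safe in that case is to clamp k to the number of candidates, as in the sketch below. The helper top_k_indices is hypothetical and only illustrates the clamping idea; it is not the code used by OneHotRecommendationTask.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, clamping k to the candidate count."""
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, len(scores))  # never ask for more items than exist
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Three candidates but k = 10: all three are returned, best first.
top = top_k_indices([0.1, 0.7, 0.2], k=10)
```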
Release 0.19
September 1, 2025
New Features
- Time window slicing for event data
Event data sources can now be restricted to a defined start and end timestamp with the new slice_time_window function. This allows analyses and training runs to focus on specific periods without extra preprocessing.
- Direct access to date columns
Event data sources now expose a date column (timestamps), allowing time-based filtering and grouping with the filter() and groupBy() functions.
Improvements
- Simplified interpretation output
The interpret function now produces cleaner results by removing non-essential fields.
- Improved window shuffling defaults
The buffer size has been changed to 100k, enhancing randomness in shuffled windows and improving generalization during training.
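The effect of the shuffle buffer size mentioned above can be seen in a generic streaming shuffle: items are held in a bounded buffer and emitted at random, so a larger buffer mixes the stream more thoroughly. The sketch below is a standard buffered-shuffle pattern written for illustration; it is not the library's data loader, and buffered_shuffle is a hypothetical name.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream using a bounded buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer full: emit a random element and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    rng.shuffle(buffer)  # drain the remainder in random order
    yield from buffer
```

With a buffer of size 1 the stream comes out in its original order, while a buffer as large as the stream gives a full shuffle; sizes in between trade memory for randomness.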
Fixes
- Loss computation
Refined loss calculations in downstream models.
- Module visibility
Ensured consistent access to essential modules.
- Date handling
Added support for numeric date formats and fixed timestamp edge cases.
- Split point generation
Split point generation is now restricted to explicitly defined data sources.
Documentation
- Expanded sample target functions
A new set of ready-to-use target functions has been added to help prototype and compare approaches more quickly.
Release 0.18
August 7, 2025
New Features
- New recommendation task for limited item pool
When fine-tuning for recommendation problems, users can now choose from two specialized classes: OneHotRecommendationTask (fixed-size vector for the total number of recommendable entities) and RecommendationTask (probabilistic sketch representation). The one-hot variant is especially beneficial when the number of recommendable entities is relatively low (e.g. fewer than 5,000).
- Loss weighting for One-Hot Recommendation and Multilabel Classification
Users can now optionally return weights in the target function to control the relative importance of individual target elements.
- Entity filtering
Users can now define which targets should be included or excluded from predictions and ground truth using the predictions_to_include_fn and predictions_to_exclude_fn parameters.
- Adjustable number of split points for random sampling strategy
The maximum number of split points per observation used during a single epoch has been increased from 1 to the square root of the number of event timestamps when target_sampling_strategy="random".
- Flexible entity split
Entity split parameters such as training, validation, test, and training_validation_end can now be changed during scenario model training.
- Sketch merging support
Sketches derived from shared entity columns can now be added to other sketches, enabling hybrid representations that combine different behavioral signals.
- Retention policy support during training
Users can now use the entity_history_limit parameter to define the maximum history time range for a single observation per data source during training.
- Extended logging in ML experiment tracking tools
The number of validation batches is now logged in lifecycle management tools such as Neptune and MLflow.
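To illustrate the new split-point limit for the random sampling strategy described above, the sketch below draws up to the square root of the number of event timestamps as split points. Rounding the square root down, de-duplicating timestamps, and always drawing the maximum are assumptions made for the example; sample_split_points is a hypothetical helper, not the library's sampler.

```python
import math
import random
from typing import List, Sequence


def sample_split_points(timestamps: Sequence[int], seed: int = 0) -> List[int]:
    """Pick floor(sqrt(n)) distinct split points at random from n timestamps."""
    unique = sorted(set(timestamps))  # assumption: duplicates count once
    if not unique:
        return []
    limit = math.isqrt(len(unique))   # assumption: square root is rounded down
    rng = random.Random(seed)
    return sorted(rng.sample(unique, limit))


# 100 distinct event timestamps allow up to 10 split points per epoch.
splits = sample_split_points(list(range(100)))
```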
Improvements
- Consistent entity split
Entity split into training, validation, and test is now consistent between the fit and train_foundation_model phases.
Fixes
- Timezone in date columns
Fixed an error raised during the fit phase when date columns contained time zone information.
- Snowflake token authorization
Resolved an issue where Snowflake token authorization was used unconditionally if a token was present.
- Multi-GPU checkpoint overwrite
Fixed an issue where the overwrite setting removed checkpoint files during multi-GPU training.
- Cache creation
Restored cache creation when enabled in configuration.
Release 0.17
July 4, 2025
New Features
- Hybrid train/test split
Users can now combine entity-based training and validation splits with a time-based test set, mirroring production scenarios more closely.
- Limited end date of training and validation
Introduced the training_validation_end parameter to limit the latest date included in training and validation splits.
- Flexible training validation interval
Introduced the check_val_every_n_steps and check_val_every_n_epochs parameters for more granular control over validation frequency.
- Reproducible results
A seed parameter has been added to key methods (fit_behavioral_representation, train_foundation_model, pretrain, fit, evaluate, predict, and test) to ensure consistent outputs across runs.
- Flexible Kerberos configuration
A separate realm for Kerberos can now be defined with the kinit_realm parameter, while the realm for the connection string can be defined in the ini file.
Improvements
- Refactored run continuation logic
The overwrite and resume parameters must now be passed directly to the fit method rather than being read from TrainingParams.
- Refined interpretability date specification
The target date for interpretation should now be provided via the prediction_date parameter.
- Normalized feature importance in interpretability
Feature importance scores are now normalized based on each feature's input size, enabling fairer comparisons.
- Improved Parquet cache behavior
The cache is automatically refreshed if the source Parquet file has changed; no manual deletion is required.
- Accelerated training
Multiple internal optimizations have led to 2–3× faster performance on benchmark datasets, including accelerated data loading, more efficient handling of time-based features, and streamlined validation logic.
Fixes
- Complex column types
Resolved an issue where complex column types (e.g. lists of strings) caused errors during preprocessing.
- Prediction memory usage
Fixed excessive memory usage during the prediction phase.
Release 0.15
May 7, 2025
New Features
- Entity-based splits
Users can now separate training, validation, and testing sets based on time range or entity ID.
- New method for selecting entity IDs
A new entity_ids parameter can be added to YAML configuration to act as a global filter during foundation model training. The same parameter is available in TrainingParams and TestingParams for scenario-level control. The previously used entities_ids_subquery parameter has been removed.
Improvements
- Refactored feature loading pipeline
Improvements to ensure stability and clarity of error messages. Key changes: use_recency_sketches and use_last_basket_sketches must now be passed directly to pretrain() or train_foundation_model; features_path can no longer be modified at the scenario training stage; the data_loading_params block has been removed from the configuration file.
- Enhanced BigQuery connector
Users can now specify one project as the computation engine and a different one as the data location.
- Standardized dependency error messages
All optional dependency checks now generate standardized error messages with clear instructions.
Fixes
- Validation set date
Fixed an error where the validation set starting date could result in empty history.
- Test set date handling
Fixed an error where the starting date of the testing set could be treated as history.
- Low-cardinality categorical columns
Fixed an error where a column with fewer than two values that was overridden to categorical type caused preprocessing to fail.
- Interpretability duplicated modalities
Fixed an error where interpretability returned duplicated modality names.
- Classification threshold requirement
Fixed an error where classification tasks required thresholds even when the output type was set to SEMANTIC.
- Package updates
Updated packages to improve performance, security, and compatibility.
Release 0.14
April 10, 2025
New Features — Core BaseModel Repository
- Modular foundation model training
The two components of the pretrain stage (data preprocessing and representation fitting, and FM training) can now be run independently via fit_behavioral_representation and train_foundation_model.
- Flexible prediction outputs
Introduced different types of predicted output, defined by the mandatory output_type parameter. Added readout_sketch and read_target_entity_ids functions to map recommendation outputs to feature values.
- Enhanced model training with early stopping
Introduced the early_stopping parameter to prevent overfitting.
- Expanded model interpretability
Introduced the interpret_entity function to compute event-level attributions for a single main entity.
- Automated model testing
Introduced the test method to compute metrics based on predictions and ground truth.
- Flexible BigQuery connection
Added the project_id parameter to define a project different from the one in the service account.
New Features — GUI Application (Snowflake Native)
- Cascading run execution
Enables dependent jobs/runs to trigger in sequence.
- Run and job status tracking
Added detailed status tracking for better monitoring.
- New table designs
Updated tables with improved layout and readability.
- Validation improvements
Enhanced input and data validation across the platform.
- Multi-GPU training
Enabled distributed model training across multiple GPUs.
- Listing state restoration
Automatically restores UI state when returning to listings.
Fixes
- Distributed training entity loading
Fixed an issue where all main entities were loaded on one GPU during distributed training.
- Multiclass default metric
Fixed an issue where the default multiclass metric returned an error.
- Time-series interpretability
Fixed an issue where interpretability attributions for time-series features were empty.
Dependencies
- The dask library is no longer a dependency.
Release 0.13
March 17, 2025
New Features
- Expanded model scalability
Added support for FSDP2 to enable distributed training across multiple GPUs.
- Flexible inference customization
Introduced the targets_to_include and targets_to_exclude parameters for more control over inference outputs.
- Enhanced diagnostics and issue tracking
Improved exception handling and logging during inference for better troubleshooting.
- Advanced interpretability for time series
Enabled support for interpreting time-series variables in model explanations.
Fixes
- Time-series model resume
Fixed an error that prevented models with time-series features from resuming properly.
- Foundation model head parameter
Fixed an issue where the parameter controlling the use of the foundation model head was not properly passed.
- Package updates
Updated packages to improve performance, security, and compatibility.
Release 0.12
February 13, 2025
New Features
- Enhancements to Foundation Model training
Improvements to the foundation model training process, leading to better performance in downstream applications. Key changes include simplified configuration by removing certain time-based parameters, an updated optimizer that eliminates manual learning rate scheduling, improved feature representation through automated parameter tuning, and performance gains from optimized data handling.
- Improved Parquet file support
Faster and more scalable data processing with an upgraded backend engine, enhanced memory management, and expanded support for advanced query operations.
- Interpretability
Added support for event-level interpretability in classification and regression models.
Fixes
- Column fit resume flag
Fixed an issue where the _FINISHED flag for column fit tasks was occasionally set incorrectly, resulting in an unstable resume option.
- Training log output
Fixed an issue where the log in main.log was empty, incomplete, or incorrectly written during model training.
- Next-n-hours timestamp exclusion
Fixed an issue where the next_n_hours parameter used in the target function was excluding the starting timestamp.
- Package updates
Updated packages to improve performance, security, and compatibility.
Release 0.11
February 13, 2025
New Features
- Improved handling of time series (BETA)
Users can now enable improved handling of time series by declaring selected numeric columns as time-series. This feature provides superior representation of event sequences and intervals.
- Automated sanitization and qualification of column names in where_condition
The resolve() function can now be used in where_condition to enhance consistency and reduce the risk of errors.
- Optimized memory utilization for Parquet data sources
More stable handling of Parquet files, including filtering data at an early stage and reading Parquet files in chunks to reduce peak memory usage.
- Enhanced history / future splitting
An additional sampling strategy (existing) supports more modeling scenarios, such as basket context for next purchase prediction. Regular timestamps are now used for split points instead of day timestamps.
- Enhanced interpretability of time-based features
Provides deeper insight into the impact of time-based features by separating out periodical counts, sums, and means.
- Event aggregations without grouping
Users can now perform aggregation operations such as sum(), count(), mean(), min(), and max() in the target function without needing to group events.
- Capping number of CPU resources at fit stage
Users can now limit the utilization of computation resources during the fit stage with the num_cpus argument.
Fixes
- Custom metric casting
Fixed an issue where certain custom metrics were not automatically cast to the appropriate data type.
- Feature saving after pretraining failure
Fixed an issue where certain features were not saved after a pretraining failure.
- Recommendation validation metrics
Fixed an issue where the most frequently interacting entities could be partially ignored when calculating validation metrics in recommendation tasks.
- Duplicate column names across data sources
Fixed an issue where repeating column names across joined data sources might result in conflict.
- NaN percentage calculation
Fixed an issue where the percentage of NaN values was incorrectly calculated for columns containing both NaN values and empty strings.
- CPU cap enforcement
Fixed an issue where the CPU cap set with the num_cpus argument was ignored.
- Predictions file suffix
Fixed an issue where a .csv suffix was expected instead of .tsv for the predictions file.
- File lock during event grouping
Fixed an issue where a file lock set during event grouping resulted in a FileExistsError in case of slow storage.
- Interpretability with shared entities
Fixed an issue where interpret() resulted in an error for data sources with shared entities.
- Interpretability with empty quantiles
Fixed an issue where interpret() resulted in an error in case of empty quantiles for groups with no events.
- Package updates
Updated packages to improve performance, security, and compatibility.