New features:

  • Enhancements to Foundation Model training
    Improvements to the foundation model training process, leading to better performance in downstream applications.

    • Simplified configuration by removing the need to predefine certain time-based parameters, including the check_target_for_next_n_days parameter.
      This change requires updates to previously written configuration files (see the sketch after this list).
    • Optimized training with an updated optimizer that eliminates the need for manual learning rate scheduling.
    • Improved feature representation through automated parameter tuning.
    • Performance gains from optimized data handling and more efficient training strategies.

  • Improved Parquet File Support
    Faster and more scalable data processing with an upgraded backend engine.

    • Enhanced memory management and stability for large-scale data.
    • Expanded support for advanced query operations on parquet data sources.

  • Interpretability
    Added support for event-level interpretability in classification and regression models.
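
  For the configuration change above, a minimal migration sketch, assuming the configuration is a YAML file loaded with PyYAML; the key name comes from this changelog, the file layout is an assumption:

      import yaml

      # Load an existing configuration written for the previous release.
      with open("config.yaml") as f:
          cfg = yaml.safe_load(f)

      # check_target_for_next_n_days no longer needs to be predefined;
      # drop it if present so the file matches the new schema.
      cfg.pop("check_target_for_next_n_days", None)

      with open("config.yaml", "w") as f:
          yaml.safe_dump(cfg, f)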

Fixes

  • Fixed an issue where the _FINISHED flag for column fit tasks was occasionally set incorrectly, resulting in an unstable resume option.

  • Fixed an issue where the main.log file was empty, incomplete, or incorrectly written during model training.

  • Fixed an issue where the next_n_hours parameter used in the target function excluded the starting timestamp.

  • Updated packages to improve performance, security, and compatibility with the latest features.

New features:

  • Improved handling of time series (BETA)
    Users can now enable improved handling of time series by declaring selected numeric columns as time-series. This feature provides superior representation of event sequences and intervals.

  • Automated sanitization and qualification of column names in where_condition
    The resolve() function can now be used in where_condition to enhance consistency and reduce the risk of errors (see the sketch after this list).

  • Optimized memory utilization for data sources in parquet file format
    More stable handling of parquet files used as data sources, including filtering data at an early stage and reading parquet files in chunks to reduce peak memory usage.

  • Enhanced history / future splitting
    Additional sampling strategy ("existing") supports more modeling scenarios, such as basket context for next purchase prediction. Regular timestamps are now used for split points instead of day timestamps.

  • Enhanced interpretability of time-based features
    Provides deeper insight into the impact of time-based features by separating out periodic counts, sums, and means.

  • Event aggregations without grouping
    Users can now perform aggregation operations such as sum(), count(), mean(), min(), and max() in the target function without needing to group events (see the sketch after this list).

  • Capping number of CPU resources at fit stage
    Users can now limit the utilization of computation resources during the fit stage with the num_cpus argument (see the sketch after this list).
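
  For resolve() in where_condition above, a shape-only sketch; the changelog confirms only that resolve() may appear inside where_condition, so the surrounding keys and the exact call form are assumptions:

      # Hypothetical data source definition fragment; only the resolve()
      # usage inside where_condition is taken from this changelog.
      data_source = {
          "name": "transactions",
          # resolve() sanitizes and qualifies the column name for us.
          "where_condition": "resolve(order_value) > 100",
      }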
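
  For the ungrouped aggregations above, an illustrative stand-in using pandas; only the aggregation names come from this changelog, and the real target-function API may differ:

      import pandas as pd

      # Stand-in for the events visible to a target function.
      events = pd.DataFrame({"amount": [10.0, 25.5, 7.25]})

      # Aggregates applied directly, with no grouping step.
      total = events["amount"].sum()
      count = events["amount"].count()
      mean = events["amount"].mean()
      low, high = events["amount"].min(), events["amount"].max()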
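
  For the num_cpus cap above, a shape-only sketch; the argument name comes from this changelog, while the fit entry point shown is a hypothetical stand-in:

      # Hypothetical stand-in for the library's fit entry point.
      def fit(config_path: str, num_cpus: int | None = None) -> None:
          """Run the fit stage, using at most num_cpus CPU cores."""
          ...

      # Cap the fit stage at 8 cores.
      fit("config.yaml", num_cpus=8)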

Fixes

  • Fixed an issue where certain custom metrics were not automatically cast to the appropriate data type.

  • Fixed an issue where certain features were not saved after a pretraining failure.

  • Fixed an issue where the most frequently interacting entities could be partially ignored when calculating validation metrics in recommendation tasks.

  • Fixed an issue where repeated column names across joined data sources could result in conflicts.

  • Fixed an issue where the percentage of NaN values was incorrectly calculated for columns containing both NaN values and empty strings.

  • Fixed an issue where the CPU cap set with the num_cpus argument was ignored.

  • Fixed an issue where a .csv suffix was expected instead of .tsv for the predictions file.

  • Fixed an issue where a file lock set during event grouping resulted in a FileExistsError in case of slow storage.

  • Fixed an issue where interpret() resulted in an error for data sources with shared entities.

  • Fixed an issue where interpret() resulted in an error in case of empty quantiles for groups with no events.

  • Updated packages to improve performance, security, and compatibility with the latest features.

New features:

  • Extended data type error messages
    Users can now see the affected column and data source when a data type error occurs.

  • Accelerated processing of text features
    Text features now employ a proprietary serialized implementation.

  • Support for SQL lambdas when filtering entity IDs
    Users can now use SQL lambdas when filtering with entities_ids_subquery (see the sketch after this list).
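
  For the SQL lambdas above, a shape-only sketch; entities_ids_subquery is named in this changelog, while the lambda form and the SQL itself are illustrative assumptions:

      # Hypothetical placement inside a data source definition; only the
      # parameter name comes from this changelog.
      params = {
          "entities_ids_subquery": lambda: (
              "SELECT entity_id FROM customers WHERE is_active = 1"
          ),
      }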

Fixes:

  • Updated packages to improve performance, security, and compatibility with the latest features.

New features

  • Grouped Decimal Features in Interpretability
    Introduced the ability to handle and analyze grouped decimal features, enhancing model interpretability.

  • Event Attributions to interpret recommendation models
    Users can now trace back and understand how specific events influence model outputs and predictions.

  • Prediction Storage in Snowflake Database
    Added functionality to save predictions directly into a Snowflake database.

  • Data Source Name in Minimum Group Size Logs
    Added logging of the data source name when enforcing minimum group size requirements.

  • Join Functionality for Attribute Data Sources (enhanced)
    Expanded support to allow joining attribute data sources with multiple data sources.

  • Filtering on Extra Columns in Data Source Definition
    Users can now filter, group, and leverage extra columns passed in the data source definition.

  • New Parameter in DataParams: training_end_date
    Introduced the training_end_date parameter, providing more flexibility and control over model training timelines (see the sketch after this list).

  • New Parameters in TestingParams: local_save_location, remote_save_location
    Introduced local_save_location and remote_save_location as parameters within TestingParams (see the sketch after this list).

    Note: Please adapt your configuration file to reflect this syntax change.

  • Extended Group Max Retries
    Default values for group computation retries and the retry interval have been increased: GROUPS_N_RETRIES now defaults to 20 and GROUPS_RETRY_INTERVAL to 60. This reduces the likelihood of failures due to transient issues and improves overall robustness. For more information, refer to the Dividing event tables section (an override sketch follows this list).

  • Entity Number Limit for Target Function Validation
    The number of entities that can be used when validating target functions is now capped to ensure efficiency and prevent overload during the validation process.

  • Enhanced Debug Messages for Target Function Validation
    More comprehensive debug messages have been added during target function validation to assist in troubleshooting and increase transparency in the validation process.
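
  For training_end_date above, a shape-only sketch; the parameter and class names come from this changelog, while the class layout and value format are assumptions:

      from dataclasses import dataclass

      @dataclass
      class DataParams:  # stand-in, not the library's class
          training_end_date: str | None = None

      # Train only on data up to the given date.
      data_params = DataParams(training_end_date="2024-06-30")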
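
  For the TestingParams change above, a shape-only sketch in the same spirit; the parameter names come from this changelog, everything else is illustrative:

      from dataclasses import dataclass

      @dataclass
      class TestingParams:  # stand-in, not the library's class
          local_save_location: str | None = None
          remote_save_location: str | None = None

      testing_params = TestingParams(
          local_save_location="/tmp/predictions",
          remote_save_location="s3://my-bucket/predictions",  # placeholder URI
      )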
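
  For the retry defaults above, the names GROUPS_N_RETRIES and GROUPS_RETRY_INTERVAL come from this changelog; whether they are read from the environment as shown is an assumption:

      import os

      # Override the new defaults (20 retries, interval of 60) if needed.
      os.environ["GROUPS_N_RETRIES"] = "30"
      os.environ["GROUPS_RETRY_INTERVAL"] = "120"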

Fixes

  • Fixed issues with None values in grouping.

  • Fixed regression loss calculation and logging.

  • Fixed errors in pandas query parsing.

  • Improved Neptune alerter logging.

  • Removed unused validations and loss functions.

  • Optimized memory usage in interpretability.

  • Fixed handling of missing metrics in Neptune.

  • Reduced memory consumption.

  • Improved directory creation based on cache path.

  • Enhanced schema selection in Hive builder.

  • Handled potential NaN values in decimal calculations.

Docs

  • Updated the documentation navigation to be more readable and user-friendly.

  • Added Recipes section for easy reference when building target functions.

Features

  • Add max groups to event data source config
  • Support grouping for decimal modality
  • Implement groups for feature stats

Fixes

  • Align predict between classification and recommendation
  • Allow loss weighting in multiclass classification
  • Make clickhouse dialect provider support nullable columns
  • Fix max splitpoints setting and logging configuration
  • Fix interpretability for decimal features

New features

  • Add dimension checks in interpretability
  • Allow setting a maximum percentage of nulls in a column
  • Forbid undefined fields in configs
  • Handle None as entity_id in parquet files
  • Create new data source definition
  • New config.yaml design
  • Enable caching queried data
  • Handle duplicated column names when joining tables
  • Support parquet data source
  • Fix monad metrics
  • Validate allowed_columns
  • Support lambdas at config level
  • Implement mechanism for metric initialization
  • Add joins to benchmarking configs
  • Cast main_entity_id to string
  • Validate column uniqueness
  • Allow defining lambdas in extra columns
  • Add recommendations to interpretability
  • Set max number of expressions via environment variable
  • Verify if data source name contains any forbidden sequences

Fixes

  • Add recency modality slices to feature value interpretability
  • Allow join_on column in select
  • Allow None value for limit_train_batches
  • Always use stored config at pretraining phase
  • Change defaults for loader params
  • Check data source type before accessing date column
  • Fix Recommendation model
  • Fix to_date parsing in Hive
  • Make snowflake config work with new setup
  • Use alias and table name correctly
  • Fix metrics in training params
  • Append suffix to with clause alias
  • Fix detecting cyclic joins

0.6.0 (2024-04-23)

Features

  • Add BM colors to interpretability plot
  • Add interpret function for use in scripts (see the sketch after this list)
  • Add methods for weighting training examples
  • Adjust Hive to use ini files
  • Enable setting the ignore_entities_without_events flag
  • Extract queries from connectors
  • Create common mechanism for query execution
  • Refactor query builders
  • Add treemap visualization
  • Add treemap generation from predefined hierarchy
  • Replace sampling method with actual sampling
  • Make attribution average optional
  • Introduce Python 3.11
  • Add regression task to interpretability
  • Support training resuming
  • Create chunks based on partition column
  • Support booleans in fit stage
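
  For the script-level interpret function above, a shape-only sketch; interpret() is named in this changelog, while the signature, arguments, and return value are assumptions:

      # Hypothetical stand-in mirroring how a script-level entry point
      # might be called; not the library's documented signature.
      def interpret(model_path: str, sample_size: int = 1000) -> dict:
          """Compute interpretability results for a trained model."""
          return {}

      results = interpret("artifacts/model", sample_size=500)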

Fixes

  • Add quotation marks around table names in dialect providers
  • Add quotation marks to entity ids subquery
  • Add reset method to LongCastingMetric
  • Add return statement to FM get trainable module
  • Cast Hive decimal columns to float
  • Change cache dir type
  • Fix id info parsing
  • Handle empty iterator while caching
  • Fix hash sketches hashing function and tests
  • Add options to change interpretability sample size
  • Fix time shift when caching datetimes
  • Handle decimal types in Hive training iterator
  • Fix ignore_entities_without_events flag
  • Fix combining tiles with the same name and different id
  • Catch prediction on None object and fix runtime threshold
  • Remove dask-ml, bump ray, use compatible dask version
  • Set enable_checkpointing flag accordingly to the callbacks setup
  • Small fix in one-hot-encoders
  • Stop logging warnings for uppercase unquoted columns in snowflake

0.5.0 (2024-01-18)

Features

  • Add interpretability
  • Add logging column names
  • Add resume option for columns & fix minor bug related to text columns processing
  • Add target filtering to the inference module
  • Use PyODBC for connecting with Hive (see the sketch after this list)
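
  For the PyODBC change above, a generic PyODBC usage sketch rather than the project's internal connector code; the DSN, credentials, and query are placeholders:

      import pyodbc

      # Connect through an ODBC DSN configured for Hive.
      conn = pyodbc.connect("DSN=HiveDSN;UID=analyst;PWD=secret", autocommit=True)
      cursor = conn.cursor()
      cursor.execute("SELECT COUNT(*) FROM events")
      print(cursor.fetchone()[0])
      conn.close()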

Fixes

  • Fix chunking in Hive queries
  • Convert max num columns to int
  • Fix cleora circular dependency imports