Frequently asked questions

In this section we cover everything you need to know about BaseModel.ai.

What do you need to start working with BaseModel.ai?

  1. At least 1 event data source (behavioral entity id, timestamp, attributes)
  2. At least 100k interactions observed per month (tied to behavioral profiles)
  3. At least 10k individual behavioral profiles (customers, subscribers, members, etc.)

What competencies are necessary to deploy and run BaseModel?

You need:
a. A person who can deploy a Docker container to a GPU-equipped VM
b. A person who can grant (read-only) access to the necessary tables in the data warehouse and fill in the data source configuration for BaseModel
c. A person who can define business objectives and prediction targets

We’re already using Google Vertex / Amazon Sagemaker / Azure ML / H2O / DataBricks. Don’t they offer the same things that BaseModel does?

First of all, that’s great news: BaseModel is integrated with Google Vertex, Amazon Sagemaker, Azure ML and (soon) DataBricks! Our models can be created, managed, deployed, scaled, and monitored from within these environments. Without BaseModel, all of the above solutions require you or your ML/DS teams to manually prepare the data; produce appropriate transformations, joins, feature embeddings and encodings, and feature aggregates; and choose or design a model architecture, loss function, data split strategy, and training & evaluation protocols for every single problem you want to solve. Any additional data sources, schema changes, or even slight modifications to the original problem statement that come up in the future require additional maintenance work to keep those pipelines up to date. BaseModel solves all those tasks in an elegant, unified way, delivering unprecedented quality by automatically training a self-supervised foundation model on all your behavioral data. After that, the only thing needed to obtain a ready-to-use fine-tuned model is to define the prediction objective.

What are the final artifacts that BaseModel generates?

The final artifacts are trained, supervised PyTorch models, which can be deployed anywhere. Predictions can be served either individually or in batches and persisted back to the data warehouse or elsewhere (e.g., a message queue).

We use XGBoost/CatBoost/LightGBM or DataRobot models for behavioral modeling. What are the advantages of BaseModel?

GBDT (Gradient Boosted Decision Tree) models require you or your ML/DS teams to manually prepare the data; produce appropriate transformations, joins, feature embeddings and encodings, and feature aggregates; tune hyperparameters; and choose a loss function, data split strategy, and training & evaluation protocols for every single problem you want to solve. Any additional data sources, schema changes, or even slight modifications to the original problem statement that come up in the future require additional maintenance work to keep those pipelines up to date. BaseModel eliminates these problems.

GBDTs are discriminative models that work well on well-structured problems, for example classification based on some attributes (age, height, total spending). You can think of this problem as 0-dimensional: the input to a GBDT is a flat file with a single target value for each row. Sequence modeling, as in large language models, can be thought of as a 1-dimensional problem. It is not suitable for GBDTs because the input is a sequence of observations (e.g., text tokens), and such data cannot easily be represented as a flat file.

Generalized behavioral modeling can be thought of as an N-dimensional problem. The data is multi-source and multi-modal, and it exhibits graph and hypergraph interaction structures in addition to temporal aspects. Translating rich and diverse temporal hypergraph interaction data into a flat file is a formidable challenge. A product catalog may contain millions of SKUs, a website may have hundreds of thousands of URLs, and millions of telco subscribers may interact with each other, forming user-item, user-user, and item-item graphs, or hypergraphs with tens of millions of nodes and billions of edges. Manual feature aggregation treats the problem as if the richness of this information could be described with a few dozen coarse-grained features. It cannot, at least not without a significant loss of information.

Manually created aggregate features depend on the data scientist’s effort, diligence, and creativity. Features suitable for, e.g., churn prediction for a debit card will be vastly different from the features for propensity prediction for FX brokerage services. Features for predicting the re-purchase time of baked goods will be quite different from features for predicting expected future spending on a brand of detergent, or the probability of a first-time purchase of a never-before-bought product type. Such manual per-use-case feature creation can miss crucial information about the customer that is present in the available data sources, and it also has an exceedingly high maintenance cost. Every change in the underlying data source schemas and every additional field or data source requires work to keep the data engineering pipelines up to date and operational. If this work is neglected, the quality of models can silently deteriorate.

BaseModel eliminates all these problems by considering the totality of available information and deriving a universal understanding that is not limited to a few dozen manually created features. Internal representations in BaseModel can have hundreds of thousands of features created on-the-fly during model training and inference, from the freshest available data. This allows us to represent behavioral profiles with extremely high accuracy & capture fine detail. Subsequent fine-tuning can easily adapt the universal foundation model to specific use-cases without any manual work.

Is BaseModel a Transformer? What is the architecture?

BaseModel architecture has some analogies and similarities to the transformer architecture, but it is specially designed to solve the aggregation problem for multi-modal, multi-relational, rich event data with interaction graphs, hypergraphs, numeric and categorical values, metadata, attributes, texts, and images. The aggregation problem can be defined as representing a variable-length history of a person within a business ecosystem as a single (very wide and mostly sparse) vector, in such a way that individual events, their attributes, and metadata can be reconstructed from the representation with good accuracy. These representations can be thought of as “heatmaps” of behavior in extremely high-dimensional spaces. The neural networks BaseModel uses are trained to predict the “heatmap” of future behavior, conditioned on historic behavior. The resulting “heatmaps” of future behaviors can be probed for specific information, yielding probability density estimates for various behaviors.

Additionally, BaseModel can be modulated with exogenous context applied at the moment of prediction, e.g., weather conditions at a point in (space)time, information about major sports events, currently running marketing campaigns, competitor activity, product availability, etc. The exogenous context modulation mechanism allows BaseModel to include all available information that can affect behaviors.
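To make the idea of folding a variable-length history into one wide, mostly sparse vector more concrete, here is a minimal sketch using the generic hashing trick. This is not BaseModel's actual encoding, which this FAQ does not describe. The function name `sparse_profile`, the width `DIM`, and the `(field, value)` event layout are all assumptions made for illustration.

```python
import zlib

DIM = 100_000  # illustrative vector width (assumption, not a BaseModel setting)


def sparse_profile(events, dim=DIM):
    """Fold a variable-length list of (field, value) events into {index: count}.

    Only non-zero positions are stored, so the result stays small even though
    the conceptual vector has `dim` entries.
    """
    vec = {}
    for field, value in events:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        idx = zlib.crc32(f"{field}={value}".encode("utf-8")) % dim
        vec[idx] = vec.get(idx, 0) + 1
    return vec


# Hypothetical histories of different lengths map into the same vector space.
short_history = [("sku", "A1"), ("url", "/home")]
long_history = [("sku", "A1")] * 3 + [("url", "/home"), ("url", "/cart")]

print(len(sparse_profile(short_history)), "non-zero entries out of", DIM)
print(len(sparse_profile(long_history)), "non-zero entries out of", DIM)
```

Whatever the history length, the output lives in the same fixed-width space, and asking "how often did this event occur?" amounts to reading one position of the vector (up to hash collisions).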

We have some additional aggregate features, without underlying events. Can we include them in BaseModel data sources?

Yes. Aggregate features, which have no underlying events, can be treated as entity attributes. This often happens in the case of socio-demographic data, product attributes, object metadata.

We do not have any event data, can we use BaseModel?

No, it would not make sense. Such cases are well covered by GBDTs.

Isn't training a foundation model expensive and slow?

No. While large language models cannot be trained efficiently as of today, BaseModel is hyper-optimized for behavioral data. Full training from scratch on a large enterprise data warehouse, spanning 10M behavioral profiles and 1 year’s worth of data, can be completed within 2 hours on a single GPU. (See the detailed table above.)

Is BaseModel a vector database?

No. Vector databases are storage and retrieval engines for vector embeddings. BaseModel is neither a vector database, nor does it require one to operate.

How difficult is BaseModel deployment?

BaseModel is a self-contained Docker container that can be deployed in the cloud or on-premises. If your organization is already using Docker or Kubernetes, deployment will be extremely easy.

Is BaseModel a generative model?

In the strict sense, BaseModel is a generative model. It is trained as an autoregressive model that outputs a compressed representation of extremely high dimensional probability distributions. Sampling long trajectories of complex events from such distributions is an open problem, much harder than sampling the next token from a fixed dictionary in large language models. BaseModel can answer counterfactual questions, run simulations and handle “what-if” scenarios.

Is BaseModel a feature store? Aren’t they cool?

No & no. Feature stores are repositories of pre-calculated, manually created features. Features in such stores are updated periodically and cannot capture fine-grained temporal details with good resolution. BaseModel calculates all features from raw data on-the-fly, in real time, during pre-training, fine-tuning, and inference. They are streamed directly to the GPU to maximize compute utilization and reap all the benefits of modern deep learning hardware. This approach has a few amazing advantages:

- BaseModel behavioral profiles are never stale; they always reflect the newest events and behaviors that have been observed and have entered the data warehouse.
- A feature-store approach, with features recalculated periodically (e.g., at midnight every day), may miss crucial decision points. Imagine a scenario where a customer visits the website, calls the call center, and visits the physical branch, finalizing their journey with a subscription, all in a single day. The feature store will reflect that with 2 snapshots of user features: one from before that day and one from after it. Thanks to on-the-fly feature calculation, BaseModel performs temporal splits between any 2 events during pre-training and fine-tuning, which makes it possible to intercept the user’s journey at any point in time and, e.g., provide insights to the call-center employee based on the most recent website behavior.
- BaseModel can generate rich and broad behavioral profile representations. A typical feature store may hold hundreds of manually created features (usually up to 5000 due to technical limitations). When the space of behaviors involves millions of product, attribute, and location combinations, representing a person with a few hundred hand-crafted features would lead to extremely suboptimal results. BaseModel can operate with hundreds of thousands of features, leading to unprecedented resolution and quality of behavioral profiles with coverage of multiple data sources.
- Since the behavioral profiles are transient (materialized from raw data on-the-fly) and do not need to be persisted at any time, any storage, latency, or throughput concerns are eliminated. Your GPUs will not be bottlenecked by network I/O.
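The single-day journey from the second point can be sketched in a few lines. This is only an illustration of the difference between a daily snapshot and a split made at an arbitrary timestamp. The event names, timestamps, and function names are hypothetical, and nothing here reflects BaseModel's internal API.

```python
from datetime import datetime

# Hypothetical single-day journey, mirroring the scenario described above.
JOURNEY = [
    (datetime(2024, 3, 1, 9, 0), "website_visit"),
    (datetime(2024, 3, 1, 11, 30), "call_center_call"),
    (datetime(2024, 3, 1, 15, 0), "branch_visit"),
    (datetime(2024, 3, 1, 16, 45), "subscription"),
]


def visible_from_midnight_snapshot(events, at):
    """Events a daily midnight snapshot has seen by time `at`."""
    last_refresh = at.replace(hour=0, minute=0, second=0, microsecond=0)
    return [name for ts, name in events if ts < last_refresh]


def visible_on_the_fly(events, at):
    """Events that precede `at` when the history is split at `at` itself."""
    return [name for ts, name in events if ts < at]


# The moment the customer reaches the call center.
call_time = datetime(2024, 3, 1, 11, 30)
print(visible_from_midnight_snapshot(JOURNEY, call_time))  # []
print(visible_on_the_fly(JOURNEY, call_time))  # ['website_visit']
```

At the moment of the call, the snapshot-based view knows nothing about that morning's website visit, while the on-the-fly split includes it.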

Does BaseModel send or copy my data anywhere?

No. All computation happens within your infrastructure. BaseModel does not use any additional databases; it reads directly from your data warehouse. The only data we collect is diagnostic information to help us identify problems and improve the product. Even this can be turned off for maximum-security, air-gapped deployments.

How often should BaseModel be retrained?

BaseModel always uses the freshest available data, straight from the original data source. Hence, predictions always reflect the most up-to-date state of the behavioral profiles, regardless of when the model was trained. Nonetheless, there are seasonal effects that make periodic retraining of models a good practice. For some industries, like fast fashion, both the product landscape and customer preferences can change very quickly. In specialized applications, some of these effects can be remediated by an additional re-scoring/re-ranking model that is trained on-line. We typically recommend BaseModel retraining times between 1 week and 1 month.

How much data is required to train BaseModel?

The simplest use-cases require:

- At least 1 event data source (behavioral profile id, timestamp, other attributes and metadata)
- At least 100,000 events observed per month (tied to behavioral profiles)
- At least 10,000 individual behavioral profiles (customers, subscribers, members, ...)

The required timespan of data depends on the frequency of interactions with your business. For banking, telco, and FMCG, where people interact with your business daily or weekly, 3+ months of data may be enough. For businesses with less frequent interactions, like fashion, insurance, and automotive, we typically recommend 1+ year of data for optimal results.
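A quick way to check an event log against the thresholds above is sketched below. The function name `check_readiness` and the `(profile_id, timestamp)` input layout are assumptions for illustration. The sketch also takes the strict reading that every observed month should meet the monthly event threshold, which is one possible interpretation.

```python
from collections import defaultdict
from datetime import datetime

# Thresholds quoted in the answer above.
MIN_EVENTS_PER_MONTH = 100_000
MIN_PROFILES = 10_000


def check_readiness(events, min_events_per_month=MIN_EVENTS_PER_MONTH,
                    min_profiles=MIN_PROFILES):
    """Summarise an event log against the minimum data requirements.

    `events` is an iterable of (profile_id, timestamp) pairs, where each
    timestamp is a datetime. This layout is an assumption of the sketch.
    """
    per_month = defaultdict(int)
    profiles = set()
    for profile_id, ts in events:
        per_month[(ts.year, ts.month)] += 1
        profiles.add(profile_id)

    # Strict reading: the quietest observed month must still meet the threshold.
    lowest_month = min(per_month.values()) if per_month else 0
    return {
        "months_observed": len(per_month),
        "lowest_monthly_events": lowest_month,
        "profiles": len(profiles),
        "enough_events": lowest_month >= min_events_per_month,
        "enough_profiles": len(profiles) >= min_profiles,
    }


# Tiny hypothetical log, far below the real thresholds.
sample = [
    ("a", datetime(2024, 1, 5)),
    ("b", datetime(2024, 1, 9)),
    ("a", datetime(2024, 2, 1)),
]
print(check_readiness(sample))
```

In practice the same counts would usually be computed with a query in the data warehouse; the sketch only shows which quantities to look at.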

Does the modelled entity have to be an identified customer?

BaseModel is flexible in this regard. We define the “primary entity” as a unique identifier which undergoes interactions and generates observations that can be linked to it within the data warehouse. Typical “primary entities” are:

- Users
- Customers
- Subscribers
- Members
- Patients

They can also be things like:

- Contract IDs
- IP addresses
- Hashed credit card numbers
- Telephone numbers
- Device IDs
- Smart device / IoT (Internet of Things) IDs
- and many more

Does BaseModel scale well to really big data?

BaseModel is extremely efficient even on a single VM with a single GPU for proof-of-concept. For large-scale production workloads, we support multi-GPU and multi-node training for all models. Once the foundation model has been trained, supervised downstream models can be further fine-tuned on multiple disjoint machines or clusters.

Can I use BaseModel to optimize my accounting / manufacturing process / predict the stock market?

No. BaseModel is dedicated exclusively to predicting future behaviors of individuals on a granular level. That requires event-level data attached to the entities we model (i.e., user, customer, subscriber...). In large enterprises, there are many processes which can be optimized. Some of them inherently happen on an aggregate level, where granular event data is unavailable or cannot be attached to a meaningful entity. These are cases which we do not aim to solve. Simple, standard approaches like GBDTs usually work well in these scenarios.

Does BaseModel offer model management such as: model registry, deployment and monitoring?

Our sole focus is on the quality and efficiency of the models we provide. We delegate model management to external solutions of your choice, such as Kubeflow, Azure ML & MLOps, AWS Sagemaker, Google Vertex, DataBricks, and similar.

Does BaseModel combine training data from multiple companies? Can it work even if I do not have my own data?

No. BaseModel is a private foundation model, which means it is trained exclusively on the data you provide. You can even use it on-premises, in a secure air-gapped environment. BaseModel is trained from scratch (the process is RIDICULOUSLY fast) on data you have access to. Every data warehouse looks different, starting with the schema, ending with the breadth, and meaning of data inside it. Using a single model for multiple companies would not make sense. Using a single, unified architecture makes a lot of sense.

Can BaseModel just create features to populate my feature store?

While theoretically possible, this is a bad idea. By putting raw features into a feature store, you would lose many benefits provided by BaseModel:
a. Extremely fine-grained, high-dimensional representations (100k+ dimensions)
b. Fine-grained temporal splits between any 2 historical events
c. Extremely optimized neural architectures suitable for conditional density estimation
d. Capability for precise modulation of the neural networks with exogenous information

What we recommend instead is including your pre-computed features as additional inputs to BaseModel.

Are BaseModel results interpretable?

Yes, much more so than classic models! Classic models based on aggregate features can only provide attribution to the level of an aggregate (e.g., “total spending”, “number of products bought”). BaseModel works with raw event-level data and can provide attribution for every prediction, down to the level of a single event. In other words, if you are wondering what contributed to a particular score or prediction, BaseModel can return the (positive or negative) contributions of every single event attached to the behavioral profile. This allows for unprecedented explainability and insight into the underlying decision process.
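To illustrate what event-level attribution means, the sketch below uses leave-one-out occlusion on a toy scoring function. This is a generic technique chosen for illustration, not a description of how BaseModel computes attributions. `toy_score`, its weights, and the event names are hypothetical stand-ins.

```python
def toy_score(events):
    """Hypothetical stand-in for a trained model: maps an event list to a score."""
    weights = {"complaint": -4, "purchase": 3, "app_login": 1}
    return sum(weights.get(kind, 0) for kind in events)


def leave_one_out(events, score_fn):
    """Contribution of each event = full score minus the score without it."""
    full = score_fn(events)
    return [
        (kind, full - score_fn(events[:i] + events[i + 1:]))
        for i, kind in enumerate(events)
    ]


history = ["purchase", "complaint", "app_login"]
for kind, contribution in leave_one_out(history, toy_score):
    print(f"{kind:>10}: {contribution:+d}")
```

Because the toy score is additive, each contribution simply equals the event's weight. For a real, non-additive model, leave-one-out is only one of several possible approximations, but the output has the same shape: one signed contribution per event rather than one per aggregate.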

Can I have a conversation with BaseModel like with ChatGPT, Bard, Llama?

No. GPT-class models are language foundation models, while BaseModel is a behavioral foundation model. Large language models are great at understanding and generating texts. BaseModel is great at understanding behavioral event data and predicting future events and behaviors. To picture the differences, imagine the contents of Wikipedia versus the contents of an enterprise data warehouse. Both BaseModel and GPT-class models are self-supervised foundation models, outputting probability distributions, and both can be further fine-tuned for supervised tasks.

I have a rule-based model. Why is BaseModel better?

Rule-based systems rely fully on the observations made by the domain experts. They cannot discover any new insights based on the data, because no learning takes place. They are the opposite of AI. Nowadays, rule-based systems are almost never used for behavioral problems in the industry.

I have an XGBoost model based on aggregates. Why is BaseModel better?

XGBoost works extremely well on well-structured problems, where features are given as-is, for example classification based on attributes (age, color, height, etc.). You can think of this problem as 0-dimensional. By analogy, sequence modeling would be a 1-dimensional problem: there is only past and future. Behavioral data belongs to the class of N-dimensional problems, and it is extremely unstructured.

It has an inherent form of a graph, which means that it is very irregular. Nodes have arbitrarily many connections in possibly tens of thousands of directions. Additionally, the graphs have a temporal nature (1 extra dimension).

Translating a graph into numeric data is a complex problem. Generally it cannot be done with a simple translation into a low-dimensional array of clearly interpretable numbers.

Graph nodes have dozens of named properties, such as node degree, and many more unnamed and unknown properties which are task-dependent.

Connectivity is arbitrary and impossible to capture with low dimensional features when we have millions of nodes (customers and/or items) and each can be connected with any number of others.

Feature aggregation treats behavioral event modeling problems AS IF they were well-structured and 0-dimensional (so that you can use them with XGBoost). However, this is not their inherent nature. That is why aggregation must lose a significant amount of information (think of a forced compression from N-dimensional space to 0-dimensional space).
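The information loss described above can be shown with a deliberately tiny example. The customers, items, and the two aggregate features are hypothetical; the point is only that very different event histories can collapse into the same flat row.

```python
def aggregate(events):
    """Collapse a list of (item, amount) purchase events into flat features."""
    return {
        "n_purchases": len(events),
        "total_spending": sum(amount for _, amount in events),
    }


# Two hypothetical customers with clearly different behavior.
customer_a = [("diapers", 30), ("baby_food", 20), ("wipes", 10)]
customer_b = [("video_game", 50), ("energy_drink", 5), ("snacks", 5)]

print(aggregate(customer_a))  # {'n_purchases': 3, 'total_spending': 60}
print(aggregate(customer_b))  # {'n_purchases': 3, 'total_spending': 60}
print(aggregate(customer_a) == aggregate(customer_b))  # True
```

A model that only sees the aggregated row cannot tell these two customers apart. Adding more aggregates narrows the gap, but each one has to be designed, built, and maintained by hand.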

Practical limitations:

Aggregates depend on data scientist effort. They are usually very domain-specific, so they must be built separately for every problem. A lot of work goes into the design and testing of each feature (up to 5-6 months in total).

Aggregates sometimes need updates, which again must be done by a data scientist.

Sometimes aggregated features must be added to the data store by a data engineering pipeline, so ad-hoc analysis of feature influence on the model is very difficult. As a result, data scientists often create features in notebooks for their own use case, which leads to technical debt (as per one of our discovery interviews).

DS teams tend to create features per use case. These features are usually not understood by other teams due to a lack of documentation, so in practice many versions of the same or very similar aggregates are created within a company.