Frequently Asked Questions

Getting started

What do I need to start working with BaseModel?

Three things:

  • Data — at least one event table with entity IDs, timestamps, and ≥ 100k interactions/month across ≥ 10k entities. See Requirements.
  • Infrastructure — a Docker-capable environment with GPU meeting the hardware requirements.
  • A data scientist — someone who can configure data sources, define prediction targets, and operate the training process.
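
The data requirement can be sanity-checked before setup by counting events per calendar month and distinct entities. The sketch below is not part of BaseModel: `check_volume` and its input shape (pairs of entity ID and timestamp) are hypothetical, and it reads the requirement as "every month has at least 100k events and the table covers at least 10k entities", which is only one possible interpretation.

```python
from collections import defaultdict
from datetime import datetime

# Thresholds taken from the data requirement stated above.
MIN_EVENTS_PER_MONTH = 100_000
MIN_ENTITIES = 10_000


def check_volume(events, min_events=MIN_EVENTS_PER_MONTH, min_entities=MIN_ENTITIES):
    """Hypothetical helper. `events` is an iterable of (entity_id, timestamp)."""
    per_month = defaultdict(int)
    entities = set()
    for entity_id, ts in events:
        per_month[(ts.year, ts.month)] += 1
        entities.add(entity_id)
    # Months whose event count falls below the minimum, as (year, month) pairs.
    low_months = sorted(m for m, n in per_month.items() if n < min_events)
    return {
        "entities": len(entities),
        "enough_entities": len(entities) >= min_entities,
        "months_below_minimum": low_months,
    }
```

The first and last month of an export are often partial, so a low count there does not necessarily mean the table is too small.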

What skills or roles are needed?

At minimum a data scientist. Depending on your organization you may also need:

  • DevOps / Infra — to deploy the container and manage GPU resources
  • Data Engineer — to grant read-only warehouse access and configure data sources
  • Analyst / Product Owner — to define business objectives and evaluate results

How long does it take to get BaseModel running?

If infrastructure and data access are in place, BaseModel can be operational within a few hours. Most teams have a foundation model training on day one. The first scenario model (e.g. churn prediction) is typically ready within a week.

Can BaseModel be used without event data?

No. BaseModel requires timestamped event streams linked to identifiable entities. Static-only data is better served by traditional models such as gradient-boosted decision trees (GBDTs).

Data

Which types of events work best?

Any recurring, timestamped interaction tied to an entity ID: transactions, page views, calls, service activations, support tickets, campaign responses, etc. See Data Connect Sources.

How much historical data do I need?

Depends on interaction frequency:

  • High-frequency (banking, telco, FMCG) — 3+ months
  • Low-frequency (fashion, insurance, automotive) — 1+ year

See Requirements for full volume guidelines.

Can I add static or aggregated features?

Yes:

  • Temporally aggregated features (e.g. weekly spend) → treat as timestamped events
  • Static features (e.g. demographics, product metadata) → include as entity or item attribute tables
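
To illustrate the first point, a pre-aggregated table can be reshaped into event rows by giving each aggregate a timestamp. This is a minimal sketch, not BaseModel code: the function name, the row layout, and the choice to stamp each value at the end of its week (so the aggregate never appears earlier than the period it summarizes) are all assumptions.

```python
from datetime import date, datetime, time, timedelta, timezone


def weekly_aggregates_to_events(rows):
    """Hypothetical helper. `rows` yields (entity_id, week_start: date, weekly_spend)."""
    events = []
    for entity_id, week_start, spend in rows:
        # Assumption: timestamp the aggregate at the end of the week (UTC midnight).
        week_end = datetime.combine(
            week_start + timedelta(days=7), time.min, tzinfo=timezone.utc
        )
        events.append(
            {"entity_id": entity_id, "timestamp": week_end, "weekly_spend": spend}
        )
    return events


# Usage: one weekly aggregate becomes one timestamped event row.
example_events = weekly_aggregates_to_events([("c1", date(2024, 1, 1), 42.5)])
```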

How are missing values handled?

BaseModel does not impute missing values. A column is excluded from training when its share of missing values exceeds a defined threshold. Check the logs of the foundation model's fit stage for warnings about excluded columns.
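
To anticipate which columns might be dropped, you can measure the share of missing values per column before training. The sketch below is a standalone illustration, not BaseModel's internal logic: the helper names are hypothetical, rows are assumed to be dictionaries with `None` marking a missing value, and the threshold is a parameter because the actual value is not stated here.

```python
def missing_fraction(rows, columns):
    """Return the fraction of rows in which each column is None or absent."""
    counts = {c: 0 for c in columns}
    total = 0
    for row in rows:
        total += 1
        for c in columns:
            if row.get(c) is None:
                counts[c] += 1
    return {c: (counts[c] / total if total else 0.0) for c in columns}


def columns_over_threshold(rows, columns, threshold):
    """List columns whose missing fraction is strictly above `threshold`."""
    fractions = missing_fraction(rows, columns)
    return sorted(c for c, f in fractions.items() if f > threshold)
```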

How are timezones processed?

All timestamps are converted to UTC. If a source timezone is specified, BaseModel converts timestamps from that timezone to UTC. If no timezone is provided, timestamps are interpreted as already being in UTC.
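
The rule described above can be mirrored in plain Python, which is useful for checking what a given timestamp will look like after conversion. This is an illustrative sketch only: `to_utc` is a hypothetical helper, not a BaseModel function, and it takes the source timezone as a `tzinfo` object.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts, source_tz=None):
    """Hypothetical helper mirroring the rule above for a single timestamp."""
    if ts.tzinfo is not None:
        # Already timezone-aware: convert to UTC.
        return ts.astimezone(timezone.utc)
    if source_tz is None:
        # No timezone given: interpret the naive timestamp as UTC.
        return ts.replace(tzinfo=timezone.utc)
    # Naive timestamp with a declared source timezone: attach it, then convert.
    return ts.replace(tzinfo=source_tz).astimezone(timezone.utc)


# Usage: noon at UTC+02:00 versus noon with no timezone information.
plus_two = timezone(timedelta(hours=2))
with_tz = to_utc(datetime(2024, 3, 1, 12, 0), plus_two)
without_tz = to_utc(datetime(2024, 3, 1, 12, 0))
```

For named zones with daylight saving rules, a `zoneinfo.ZoneInfo` object from the standard library can be passed as `source_tz` instead of a fixed offset.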

How do I prevent data leakage?

Data splits are enforced chronologically by design. Training and test sets must be separated in time. If this rule is violated, BaseModel raises an error. See Foundation Model — Basic Configuration for split setup.
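
The same rule can be checked on your own data before configuring the splits. The sketch below is a generic illustration of a chronological split check, not BaseModel's validation code; the function name and error message are hypothetical.

```python
from datetime import datetime


def check_chronological_split(train_timestamps, test_timestamps):
    """Raise ValueError unless every training timestamp precedes every test timestamp."""
    latest_train = max(train_timestamps)
    earliest_test = min(test_timestamps)
    if latest_train >= earliest_test:
        raise ValueError(
            f"split overlaps in time: latest training event {latest_train} "
            f"is not before earliest test event {earliest_test}"
        )
    return latest_train, earliest_test


# Usage: a clean split passes and returns the two boundary timestamps.
train = [datetime(2024, 1, 5), datetime(2024, 2, 20)]
test = [datetime(2024, 3, 1), datetime(2024, 3, 15)]
boundary = check_chronological_split(train, test)
```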

Training

How do I troubleshoot target function problems?

Run verify_target() on your target function before training to catch runtime errors, data type mismatches, or excessive None returns. See Function Validation.
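
The kinds of problems listed above can also be explored by hand with a small dry run. The sketch below is not `verify_target()` and does not reproduce its signature or behavior: `dry_run_target`, its arguments, and the 50% limit on `None` returns are hypothetical, and the sample objects are whatever your target function accepts.

```python
def dry_run_target(target_fn, samples, max_none_ratio=0.5):
    """Hypothetical check: call target_fn on each sample and summarize problems."""
    errors, nones, types = [], 0, set()
    for i, sample in enumerate(samples):
        try:
            value = target_fn(sample)
        except Exception as exc:  # record runtime errors instead of stopping
            errors.append((i, repr(exc)))
            continue
        if value is None:
            nones += 1
        else:
            types.add(type(value).__name__)
    n = len(samples)
    none_ratio = nones / n if n else 0.0
    return {
        "errors": errors,
        "none_ratio": none_ratio,
        "return_types": sorted(types),
        # OK when nothing raised, one return type was seen at most, and
        # the share of None returns stays within the allowed ratio.
        "ok": not errors and len(types) <= 1 and none_ratio <= max_none_ratio,
    }
```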

How do I decide when to stop training?

Training can run for a fixed number of epochs or use configurable early stopping. Track metrics via MLflow or another logger to monitor convergence. See Training Controls.
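
For readers unfamiliar with early stopping, the idea is to stop once the validation metric has not improved for a set number of epochs. The class below is a generic, framework-independent sketch of that rule; it is not BaseModel's implementation, and the names `patience` and `min_delta` follow common convention rather than BaseModel's configuration keys.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

A larger `patience` tolerates noisy validation curves at the cost of extra epochs; `min_delta` keeps negligible improvements from resetting the counter.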

Can BaseModel handle class imbalance?

BaseModel's representation learning generally copes well with naturally imbalanced data, so start without rebalancing. If results are unsatisfactory, apply sampling or loss weighting and compare the variants on validation metrics.
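
As an example of loss weighting, a common heuristic weights each class inversely to its frequency, so that rare classes contribute as much to the loss as frequent ones. The sketch below computes such weights from a list of labels; it is a generic illustration, and how the weights are passed to training in BaseModel is not covered here.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight per class = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Usage: a 90/10 split between a majority class 0 and a minority class 1.
weights = inverse_frequency_weights([0] * 90 + [1] * 10)
```

With these weights, the summed weight of each class is equal, which is why the minority class receives the larger value.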

How can I make training faster?

In priority order:

  1. Optimize data — use interpretation to identify high-impact columns, remove duplicates, experiment with shorter time windows
  2. Parallelise — scale across GPUs, tune number of workers
  3. Enable caching — reuse preprocessed batches between runs
  4. Reduce model complexity — lower graph quality or network size (may reduce accuracy)

See Training Scaling.

Can BaseModel forecast further into the future?

Yes. The target function defines a prediction interval relative to the prediction date, for example churn occurring between 30 and 90 days after it. The prediction date itself must not fall after the last date covered by your data.
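
The idea of a prediction interval offset from the prediction date can be sketched as follows. This is not a BaseModel target function and does not use its API: the function name, its arguments, and the definition of churn as "no events inside the window" are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def churned_in_window(event_times, prediction_date, start_days=30, end_days=90):
    """Return 1.0 if no event falls in [prediction_date + start, prediction_date + end)."""
    window_start = prediction_date + timedelta(days=start_days)
    window_end = prediction_date + timedelta(days=end_days)
    active = any(window_start <= t < window_end for t in event_times)
    return 0.0 if active else 1.0


# Usage: with a prediction date of 1 Jan 2024 the window runs from 31 Jan to 31 Mar.
prediction_date = datetime(2024, 1, 1)
early_only = churned_in_window([datetime(2024, 1, 10)], prediction_date)
inside = churned_in_window([datetime(2024, 2, 15)], prediction_date)
```

Events before the window opens are ignored by design: the label describes behavior 30 to 90 days out, not the days immediately after the prediction date.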

Deployment & integration

Does BaseModel send or copy my data anywhere?

No. All computation happens inside your infrastructure. BaseModel reads directly from your data warehouse or local files.

How does BaseModel integrate with ML platforms?

BaseModel connects to major data platforms (Snowflake, Databricks, BigQuery, ClickHouse, Hive, Azure Synapse) and integrates with MLflow for experiment tracking. See Monitoring with MLflow.

What artifacts does BaseModel produce?

Feature embeddings, model checkpoints, logs, configuration files, and prediction outputs (TSV format). See Inference for output details.

How scalable is BaseModel?

It runs efficiently on a single GPU and scales linearly across multi-GPU and multi-node setups. Preprocessing can run in CPU-only environments; training and inference require GPUs.

How often should I retrain?

Typically every 4 weeks to account for new entities, seasonal patterns, and behavioral shifts. The optimal frequency depends on your data refresh rate and domain dynamics.

Interpretability & compliance

How interpretable are the results?

BaseModel provides event-level attribution: which individual interactions contributed most to a prediction. This goes beyond aggregate feature importance and maps directly to behavioral drivers. See Interpretation.

Can I explain a prediction made months ago?

Yes — as long as the model checkpoint and corresponding data are still available. Preserving checkpoints for audit purposes is the client's responsibility.

How is data privacy handled?

All processing happens within your infrastructure. BaseModel does not transfer data externally, does not perform federated learning, and does not train across clients. Data governance and compliance are the client's responsibility.