Sensor Offline Detection

Task type: BinaryClassificationTask
Industry: IoT / Manufacturing

Unplanned sensor downtime in manufacturing environments leads to blind spots in process monitoring, delayed fault detection, and potential safety incidents. By predicting which sensors are likely to experience extended offline periods, maintenance teams can schedule proactive replacements or firmware updates before critical monitoring gaps occur.

What makes this advanced? Sorted event gap detection — the target function sorts sensor signal events by timestamp, then iterates through consecutive pairs checking both the status value and the time gap between events to identify sustained offline periods.

Prerequisites

Before writing a target function you need:

A trained foundation model built on event data that includes the relevant data sources.
The monad library installed in your environment.
Data source(s): sensor_signal

Target Function

The target function tells monad how to label each entity for training. It receives four arguments:

| Argument | Type | Description |
| --- | --- | --- |
| `history` | `Events` | All events before the temporal split. |
| `future` | `Events` | All events after the temporal split. |
| `attributes` | `Attributes` | Static entity attributes. |
| `ctx` | `Dict` | Context dictionary containing `SPLIT_TIMESTAMP`, data mode, etc. |

The function must return one of:

np.array([1], dtype=np.float32) — positive case
np.array([0], dtype=np.float32) — negative case
None — exclude this entity from training

Full Example

Python

import numpy as np
from datetime import timedelta
from typing import Dict

from monad.ui.target_function import Events, Attributes
from monad.ui.target_function import SPLIT_TIMESTAMP
from monad.ui.target_function import has_incomplete_training_window


# === Configuration ===
MAX_OFFLINE_HOURS = 12
TARGET_WINDOW_DAYS = 30
SENSOR_DATA_SOURCE = "sensor_signal"
STATUS_COLUMN = "status"

def sensor_offline_target_fn(
    history: Events,
    future: Events,
    attributes: Attributes,
    ctx: Dict,
) -> np.ndarray | None:
    """Predict if sensor stays offline for 12+ continuous hours."""

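    # Threshold in seconds. The comparison in step 4 assumes signals.timestamps
    # holds numeric epoch seconds; if it holds datetime objects, compare the
    # difference against a timedelta instead.
    max_offline = timedelta(hours=MAX_OFFLINE_HOURS).total_seconds()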
    split_ts = ctx[SPLIT_TIMESTAMP]

    if has_incomplete_training_window(ctx, required_length=timedelta(days=TARGET_WINDOW_DAYS)):
        return None

    # 1. Trim future signals to the target window
    future = future.interval_from(split_ts, timedelta(days=TARGET_WINDOW_DAYS - 1, hours=12))
    signals = future[SENSOR_DATA_SOURCE]

    # 2. No signals at all = assume offline
    if signals.count() == 0:
        return np.array([1], dtype=np.float32)

    # 3. Sort events by timestamp with their status
    timestamps = signals.timestamps
    status = signals[STATUS_COLUMN]
    events = sorted(zip(timestamps, status))

    # 4. Check for offline gaps exceeding threshold
    for i in range(len(events) - 1):
        ts1, st1 = events[i]
        ts2, _ = events[i + 1]
        if st1 == "offline" and (ts2 - ts1) >= max_offline:
            return np.array([1], dtype=np.float32)

    return np.array([0], dtype=np.float32)

Step-by-Step Breakdown

① Trim future signals to the target window

Python

future = future.interval_from(split_ts, timedelta(days=TARGET_WINDOW_DAYS - 1, hours=12))
signals = future[SENSOR_DATA_SOURCE]

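The future events are trimmed to a window that starts at the split timestamp and spans TARGET_WINDOW_DAYS - 1 days plus 12 hours (29 days and 12 hours with the default configuration). That is slightly shorter than the 30 days required by the has_incomplete_training_window check. Trimming restricts the label to near-term sensor behavior.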
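
The window arithmetic can be checked with the standard library alone. The sketch below assumes the second argument to interval_from is a duration measured from the split timestamp, and it uses a made-up split timestamp purely for illustration.

```python
from datetime import datetime, timedelta, timezone

TARGET_WINDOW_DAYS = 30

# Hypothetical split timestamp, chosen only for illustration.
split_ts = datetime(2024, 5, 1, tzinfo=timezone.utc)

# Same duration expression as in the target function.
window = timedelta(days=TARGET_WINDOW_DAYS - 1, hours=12)
window_end = split_ts + window

# Express the window length in hours for comparison with a full 30 days.
window_hours = window.total_seconds() / 3600
full_hours = timedelta(days=TARGET_WINDOW_DAYS).total_seconds() / 3600

print(window_end.isoformat(), window_hours, full_hours)
```

With the default configuration the window covers 708 hours, which is 12 hours short of a full 30 days (720 hours).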

② Handle no-signal case

Python

if signals.count() == 0:
    return np.array([1], dtype=np.float32)

If no signals are received at all during the target window, the sensor is assumed to be offline for the entire period — a clear positive case.

③ Sort events by timestamp with their status

Python

timestamps = signals.timestamps
status = signals[STATUS_COLUMN]
events = sorted(zip(timestamps, status))

Events are paired with their status values and sorted chronologically. This is necessary because events may not arrive in strict timestamp order across distributed IoT systems.
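
A minimal sketch of the sorting step on plain Python tuples, with made-up events standing in for signals.timestamps and the status column. It sorts on the timestamp only (a small variation on the code above), so events that share a timestamp keep their arrival order and status values are never compared with each other.

```python
from datetime import datetime, timezone

# Hypothetical events in arrival order, as (timestamp, status) pairs.
raw_events = [
    (datetime(2024, 5, 2, 8, tzinfo=timezone.utc), "online"),
    (datetime(2024, 5, 1, 20, tzinfo=timezone.utc), "offline"),
    (datetime(2024, 5, 1, 6, tzinfo=timezone.utc), "online"),
]

# Sort chronologically using only the timestamp as the key.
# sorted() is stable, so ties keep their original relative order.
events = sorted(raw_events, key=lambda event: event[0])

statuses_in_order = [status for _, status in events]
print(statuses_in_order)
```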

④ Iterate and detect offline gaps

Python

for i in range(len(events) - 1):
    ts1, st1 = events[i]
    ts2, _ = events[i + 1]
    if st1 == "offline" and (ts2 - ts1) >= max_offline:
        return np.array([1], dtype=np.float32)

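The function walks through consecutive event pairs. When an event has status "offline" and the next event arrives 12 or more hours later, the sensor experienced a sustained offline period. Only the first event's status matters, because the gap measures how long the sensor stayed in that state before the next signal arrived. Note that an "offline" event that is the last event in the window is never counted, since there is no following event to measure a gap against.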
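
The same pairwise check can be written as a standalone helper and exercised without monad. This is a sketch under stated assumptions: timestamps are numeric epoch seconds, and the optional window_end parameter is a hypothetical extension (not part of the example above) that also counts an "offline" event left open at the end of the window.

```python
from typing import Optional, Sequence, Tuple

Event = Tuple[float, str]  # (epoch seconds, status); assumed representation


def has_sustained_offline(
    events: Sequence[Event],
    max_offline_s: float,
    window_end: Optional[float] = None,
) -> bool:
    """Return True if an "offline" event is followed by a long enough gap."""
    ordered = sorted(events, key=lambda event: event[0])

    # Compare each event with the one that follows it.
    for (ts1, st1), (ts2, _) in zip(ordered, ordered[1:]):
        if st1 == "offline" and ts2 - ts1 >= max_offline_s:
            return True

    # Hypothetical extension: treat the end of the window as the closing
    # boundary for a trailing "offline" event.
    if window_end is not None and ordered:
        last_ts, last_status = ordered[-1]
        if last_status == "offline" and window_end - last_ts >= max_offline_s:
            return True

    return False


HOUR = 3600.0
long_gap = [(14 * HOUR, "online"), (0.0, "online"), (1 * HOUR, "offline")]
short_gap = [(0.0, "online"), (1 * HOUR, "offline"), (5 * HOUR, "online")]
trailing = [(0.0, "online"), (10 * HOUR, "offline")]

print(has_sustained_offline(long_gap, 12 * HOUR))
print(has_sustained_offline(short_gap, 12 * HOUR))
print(has_sustained_offline(trailing, 12 * HOUR))
print(has_sustained_offline(trailing, 12 * HOUR, window_end=30 * HOUR))
```

Whether a trailing "offline" event should count as a positive is a labeling decision; the helper leaves it off unless window_end is supplied.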

Training

Once the target function is defined, fine-tune a downstream model:

Python

from pathlib import Path
from monad.ui.config import TrainingParams, MetricParams, MetricMonitoringMode
from monad.config.early_stopping import EarlyStopping

from monad.ui.module import load_from_foundation_model, BinaryClassificationTask

module = load_from_foundation_model(
    checkpoint_path=Path("./foundation_model"),
    downstream_task=BinaryClassificationTask(),
    target_fn=sensor_offline_target_fn,
)

training_params = TrainingParams(
    checkpoint_dir=Path("./<this_model>"),
    learning_rate=1e-4,
    epochs=20,
    devices=[0],
    metrics=[
        MetricParams(alias="auroc", metric_name="AUROC", kwargs={"task": "binary"}),
        MetricParams(alias="auprc", metric_name="AveragePrecision", kwargs={"task": "binary"}),
        MetricParams(alias="recall", metric_name="Recall", kwargs={"task": "binary"}),
        MetricParams(alias="precision", metric_name="Precision", kwargs={"task": "binary"}),
    ],
    metric_to_monitor="val_auroc_0",
    metric_monitoring_mode=MetricMonitoringMode.MAX,
    early_stopping=EarlyStopping(min_delta=1e-4, patience=5),
)

module.fit(training_params, seed=42)

Evaluation

Python

from pathlib import Path
from datetime import datetime, timezone
from monad.ui.module import load_from_checkpoint
from monad.ui.config import TestingParams, MetricParams, OutputType

module = load_from_checkpoint(Path("./<this_model>"))

testing_params = TestingParams(
    prediction_date=datetime(2024, 5, 1, tzinfo=timezone.utc),
    output_type=OutputType.DECODED,
    devices=[0],
    metrics=[
        MetricParams(alias="auroc", metric_name="AUROC", kwargs={"task": "binary"}),
        MetricParams(alias="auprc", metric_name="AveragePrecision", kwargs={"task": "binary"}),
        MetricParams(alias="recall", metric_name="Recall", kwargs={"task": "binary"}),
    ],
)

results = module.test(testing_params)

Prediction

Python

from pathlib import Path
from datetime import datetime, timezone
from monad.ui.module import load_from_checkpoint
from monad.ui.config import TestingParams, OutputType

module = load_from_checkpoint(Path("./<this_model>"))

testing_params = TestingParams(
    local_save_location=Path("./predictions.tsv"),
    output_type=OutputType.DECODED,
    prediction_date=datetime(2024, 6, 1, tzinfo=timezone.utc),
    devices=[0],
)

predictions = module.predict(testing_params)

Recommended Metrics

| Metric | Why it matters |
| --- | --- |
| AUROC | Measures overall ranking quality. |
| AUPRC | More informative when the positive class is rare. |
| Recall | Proportion of actual positives caught. |
| Precision | Proportion of predicted positives that are correct. |
| F1 Score | Harmonic mean of precision and recall. |
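
For the threshold-dependent metrics, the arithmetic can be worked through by hand. The sketch below computes precision, recall, and F1 from confusion-matrix counts; the counts are invented for illustration and are not results from any model.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 sensors correctly flagged, 10 false alarms,
# 20 offline sensors missed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

Here precision is 30/40 = 0.75, recall is 30/50 = 0.60, and F1 is roughly 0.667.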

Production Tips

Calibrate the offline threshold to your SLA. 12 hours is a reasonable default, but critical sensors in safety-relevant processes may need a much shorter threshold (e.g., 1-2 hours).
Account for expected maintenance windows. Scheduled downtime should not be labeled as unexpected offline events. Filter out known maintenance periods before labeling, or add a maintenance calendar as an attribute.
Handle clock drift in edge devices. IoT sensors often have imprecise clocks. Ensure timestamps are NTP-synchronized or add a tolerance margin to the gap threshold.
Monitor class balance across sensor types. Battery-powered sensors naturally go offline more often than wired ones. Consider training separate models or adding sensor type as a feature.
Validate with incident logs. Cross-reference predicted offline events against actual maintenance tickets to ensure the model is capturing genuine failures, not just signal noise.
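
The maintenance-window and clock-drift tips can be combined in a small pre-labeling check. The sketch below is one possible policy, not a monad feature: it subtracts time covered by known maintenance windows (assumed not to overlap each other) from an offline interval, then compares the remainder against the threshold plus a drift tolerance. All names and values are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (start, end) of a maintenance period


def unplanned_offline(start: datetime, end: datetime,
                      maintenance: List[Window]) -> timedelta:
    """Offline time in [start, end) not covered by maintenance windows."""
    covered = timedelta(0)
    for m_start, m_end in maintenance:
        overlap_start = max(start, m_start)
        overlap_end = min(end, m_end)
        if overlap_end > overlap_start:
            covered += overlap_end - overlap_start
    return (end - start) - covered


def is_unplanned_outage(start: datetime, end: datetime,
                        maintenance: List[Window],
                        threshold: timedelta,
                        drift_tolerance: timedelta = timedelta(0)) -> bool:
    """True if unplanned offline time meets the threshold plus tolerance."""
    remaining = unplanned_offline(start, end, maintenance)
    return remaining >= threshold + drift_tolerance


utc = timezone.utc
offline_start = datetime(2024, 5, 10, 0, tzinfo=utc)
offline_end = datetime(2024, 5, 10, 14, tzinfo=utc)  # 14 hours offline
calendar = [(datetime(2024, 5, 10, 2, tzinfo=utc),
             datetime(2024, 5, 10, 6, tzinfo=utc))]  # 4 hours planned

threshold = timedelta(hours=12)
tolerance = timedelta(minutes=5)

with_calendar = is_unplanned_outage(
    offline_start, offline_end, calendar, threshold, tolerance)
without_calendar = is_unplanned_outage(
    offline_start, offline_end, [], threshold, tolerance)
print(with_calendar, without_calendar)
```

In this example the 14-hour gap counts as an outage only when the 4-hour maintenance window is ignored; with the calendar applied, 10 unplanned hours remain, which is below the 12-hour threshold.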