Drift Detection Notebook (PSI, JS Divergence, KS + Score Drift)

What drift is

In production ML, drift is any persistent change in the data or system behavior that makes your model’s past assumptions less true today.

Types of drift you actually see

Data drift (input drift)
The distribution of inputs changes. Example: more non-English text, different message lengths, more URLs, new slang.
Label drift (target drift)
The base rate of the label changes. Example: the fraction of truly violating content increases due to an event, even if the input features “look normal.”
Concept drift
The relationship between inputs and the label changes. Example: a new euphemism appears, so “benign-looking” text is now harmful. This is the hardest type of drift to detect: you often need human labels to confirm it.
Pipeline drift (system drift)
Something upstream changes and silently shifts behavior. Example: language detector updated, tokenization changed, new preprocessing bug, routing changed by region.

A key operational stance: drift is not always bad. Sometimes it reflects real-world changes your system should adapt to. The goal is to detect meaningful changes and respond appropriately.

Why drift matters in production

Drift matters because production ML systems are rarely judged on “offline accuracy.” They are judged on harm, cost, and reliability.

In moderation and safety systems, drift can mean:

False negatives increase: harmful content slips through.
False positives increase: legitimate users get blocked, appeal volume rises, trust drops.
Capacity pressure: human review queues spike.
Policy risk: decision behavior changes without an intentional policy update.

In practice, drift monitoring is a stability and safety layer:

It’s a guardrail against silent degradation.
It’s an early warning that your world changed.
It helps you decide whether the right response is to investigate, recalibrate thresholds, retrain, or roll back.

Monitoring strategy

A practical production strategy monitors multiple signals because no single metric reliably catches all failure modes.

1) Input drift

Track changes in the input distributions:

Numeric: text length, URL count, emoji rate, punctuation rate, token counts
Categorical: language, region, surface, device class, content type
Embeddings (optional later): representation drift (more expensive, more powerful)

2) Score drift

Track changes in model score distributions:

Overall score distribution drift
Per-category score drift (e.g., hate, harassment, self-harm)
Tail rates above decision thresholds, e.g., score > 0.9 (not covered here)
“Uncertainty rate”: the share of scores near 0.5, which can indicate boundary shifts or ambiguity (not covered here)

3) Decision drift

Track changes in operational decisions:

Allow rate, block rate, escalate rate
Decision changes by slice (language, region, surface)
Rate limiting / fallback behavior can create decision drift even when scores are stable

4) Human-signal drift

Track signals from humans and downstream systems:

Appeal rate, overturn rate, reviewer disagreement
Time-to-resolution, backlog growth
“Policy exceptions” increasing

The idea: drift alerts should be interpretable. If an alert fires, you want to quickly answer:

What changed?
Where did it change?
Does it increase harm or cost?
What should we do next?

Metrics

No metric is perfect. In practice you use a small set that is:

cheap to compute
stable enough under sampling noise
interpretable for on-call debugging

PSI (Population Stability Index) for binned numeric features

PSI compares how a numeric feature’s binned distribution changes between a baseline and current window.
It grows as probability mass shifts across bins, making it a simple, interpretable “how different is this feature now?” signal.

Good for: numeric features where bins are meaningful (or quantile bins)
Intuition: compares baseline vs current bin proportions; grows when mass shifts

PSI is commonly interpreted with rough heuristics:

< 0.1: small change
0.1–0.25: moderate
> 0.25: large
(These are not universal truths; you tune them for your system.)
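
As a minimal sketch of the PSI calculation described above, assuming quantile bins fitted on the baseline and a small floor on bin proportions so that empty bins do not produce log(0). The helper names (quantile_edges, bin_proportions, psi) and the defaults n_bins=10 and eps=1e-6 are illustrative choices, not part of any library:

```python
import math
from bisect import bisect_right

def quantile_edges(baseline, n_bins=10):
    """Interior bin edges taken at baseline quantiles (duplicate edges dropped)."""
    s = sorted(baseline)
    edges = []
    for i in range(1, n_bins):
        e = s[min(len(s) - 1, (i * len(s)) // n_bins)]
        if not edges or e > edges[-1]:
            edges.append(e)
    return edges

def bin_proportions(values, edges, eps=1e-6):
    """Share of values in each bin, floored at eps (an assumption) to avoid log(0)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [max(c / total, eps) for c in counts]

def psi(baseline, current, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (q - p) * ln(q / p), with p = baseline, q = current."""
    edges = quantile_edges(baseline, n_bins)
    p = bin_proportions(baseline, edges, eps)
    q = bin_proportions(current, edges, eps)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Fitting the edges on the baseline only keeps the bins fixed between windows, so a change in PSI reflects a change in the data rather than a change in the binning.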

JS divergence for categorical distributions

JS divergence (Jensen–Shannon Divergence) measures how different two probability distributions are (commonly categorical like language/region), in a symmetric and bounded way.

Good for: language distribution drift, region drift, surface drift
Symmetric and bounded (more stable than KL in practice)
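
A minimal sketch of JS divergence over raw category labels, assuming log base 2 so that the result lies in [0, 1]. The function name js_divergence is illustrative:

```python
import math
from collections import Counter

def js_divergence(baseline_labels, current_labels):
    """Jensen-Shannon divergence in bits between two samples of categorical labels."""
    p_counts, q_counts = Counter(baseline_labels), Counter(current_labels)
    n_p, n_q = sum(p_counts.values()), sum(q_counts.values())
    total = 0.0
    for cat in set(p_counts) | set(q_counts):
        p = p_counts[cat] / n_p  # Counter returns 0 for unseen categories
        q = q_counts[cat] / n_q
        m = 0.5 * (p + q)        # mixture distribution; m > 0 whenever p or q is
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total
```

Because the mixture m is positive wherever either distribution is, new or missing categories need no smoothing, which is one reason this is easier to operate than KL. Note that scipy.spatial.distance.jensenshannon returns the JS distance (the square root of the divergence), so the two values are not directly comparable.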

KS test for numeric features

The KS test (Kolmogorov–Smirnov two-sample test) quantifies the maximum distance between two empirical CDFs for a numeric feature, detecting distribution changes without choosing bins. In production, treat the KS statistic as the drift magnitude and alert on persistence across windows rather than on p-values.

Good for: detecting distribution changes without choosing bins
Outputs a statistic and a p-value, but in production you usually care more about:
- effect size (the KS statistic)
- stability across windows (with large samples, even negligible differences produce tiny p-values)
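
A minimal sketch of the KS statistic itself (the effect size, without the p-value), computed directly from the two empirical CDFs. The function name ks_statistic is illustrative; in a notebook with SciPy available, scipy.stats.ks_2samp returns both the statistic and the p-value.

```python
from bisect import bisect_right

def ks_statistic(baseline, current):
    """Maximum absolute gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    d = 0.0
    # The gap can only reach its maximum at an observed value, so check each one.
    for x in set(a) | set(b):
        fa = bisect_right(a, x) / len(a)  # share of baseline values <= x
        fb = bisect_right(b, x) / len(b)  # share of current values <= x
        d = max(d, abs(fa - fb))
    return d
```

The statistic is bounded in [0, 1] regardless of sample size, which makes it easier to threshold across windows of different volume than a p-value.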

Score-quantile distribution drift (per category)

Score distribution drift tracks how key score percentiles (p50/p90/p95) move over time within each class/category slice.
For moderation-style systems, this is often more actionable than raw input drift because it directly reflects changes that can shift decisions and queue load.

Track per category (and per slice when possible):

quantile shifts (p50, p90, p95)
tail rate above threshold (e.g., P(score > 0.9))
uncertainty rate (e.g., P(0.45 <= score <= 0.55))

Why these matter:

quantiles tell you where the distribution moved
tail rates tell you whether decisions will likely change
uncertainty rate can spike when the model sees unfamiliar inputs or boundary shifts
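
The three score signals above can be sketched as one summary per category (or slice), compared between baseline and current windows. The names score_summary and summary_delta are illustrative, and the defaults (threshold 0.9, band 0.45 to 0.55) simply mirror the examples in the list:

```python
from statistics import quantiles

def score_summary(scores, threshold=0.9, band=(0.45, 0.55)):
    """Quantiles, tail rate and uncertainty rate for one window of scores.

    Needs at least two scores; call once per category or slice.
    """
    # 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(scores, n=100, method="inclusive")
    n = len(scores)
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p95": cuts[94],
        "tail_rate": sum(s > threshold for s in scores) / n,
        "uncertainty_rate": sum(band[0] <= s <= band[1] for s in scores) / n,
    }

def summary_delta(baseline_summary, current_summary):
    """Current minus baseline for every summary field."""
    return {k: current_summary[k] - baseline_summary[k] for k in baseline_summary}
```

A typical use would be to compute score_summary per category for both windows and alert on the deltas, since a shift in tail_rate maps more directly to decision changes than a shift in p50.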

Thresholding strategy

Thresholds are where “drift monitoring” becomes “operational monitoring.”

Baselines

Baselines should reflect:

recent reality (e.g., last 14–28 days)
stable periods (avoid known incidents if you want “normal”)
the same slicing keys you monitor (language/region/surface)

Percentile-based thresholds (pragmatic default)

Instead of hardcoding “PSI > 0.2 means alert,” a robust approach is:

compute metric values for historical windows vs baseline
set warn/alert thresholds as percentiles of observed variability

Example:

warn if metric exceeds p95 of historical values
alert if metric exceeds p99 of historical values
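
A minimal sketch of fitting those thresholds, assuming you already have a list of the metric's values from historical windows (each compared against the baseline). The nearest-rank percentile and the names fit_thresholds and classify are illustrative choices:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    s = sorted(values)
    rank = max(1, math.ceil(pct * len(s) / 100))
    return s[rank - 1]

def fit_thresholds(historical_metric_values, warn_pct=95, alert_pct=99):
    """Warn/alert thresholds as percentiles of the metric's observed variability."""
    return {
        "warn": percentile(historical_metric_values, warn_pct),
        "alert": percentile(historical_metric_values, alert_pct),
    }

def classify(value, thresholds):
    """Map one window's metric value to ok / warn / alert."""
    if value > thresholds["alert"]:
        return "alert"
    if value > thresholds["warn"]:
        return "warn"
    return "ok"
```

Thresholds fitted this way are per metric and, ideally, per slice, since a low-volume slice will show more window-to-window noise than global traffic.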

Persistence windows (reduce noise)

A single noisy window should rarely page someone. Use persistence rules like:

alert only if condition holds for 3 consecutive windows
warn if 2 out of last 3 windows exceed warn threshold
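
The two persistence rules above can be sketched as a small stateful checker that is fed one metric value per window. The class name PersistenceRule and its parameters are hypothetical; the defaults mirror the "3 consecutive" and "2 out of last 3" examples:

```python
from collections import deque

class PersistenceRule:
    """Alert on N consecutive breaches; warn on K breaches within the last M windows."""

    def __init__(self, warn_threshold, alert_threshold,
                 alert_run=3, warn_hits=2, warn_window=3):
        self.warn_threshold = warn_threshold
        self.alert_threshold = alert_threshold
        self.alert_run = alert_run
        self.warn_hits = warn_hits
        self.warn_window = warn_window
        # Keep only as many windows as the longest rule needs.
        self.history = deque(maxlen=max(alert_run, warn_window))

    def update(self, value):
        """Record one window's metric value and return ok / warn / alert."""
        self.history.append(value)
        recent = list(self.history)
        alert_slice = recent[-self.alert_run:]
        if (len(alert_slice) == self.alert_run
                and all(v > self.alert_threshold for v in alert_slice)):
            return "alert"
        warn_slice = recent[-self.warn_window:]
        if sum(v > self.warn_threshold for v in warn_slice) >= self.warn_hits:
            return "warn"
        return "ok"
```

Feeding it the values 0.05, 0.3, 0.3, 0.3 with thresholds 0.1 (warn) and 0.25 (alert) produces a warn on the third window and an alert only on the fourth, once three consecutive windows have breached.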

Severity tiers

A typical tiering:

Info: drift detected, log and track
Warn: investigate during business hours
Alert: trigger on-call triage

Triage runbook

When a drift alert fires, the goal is not “prove drift exists.” The goal is to identify the cause, scope, and risk quickly.

Step 0: Sanity check the monitoring job

Did the pipeline run on time?
Is sample size reasonable?
Did upstream schemas change?
Did the baseline window update correctly?

Step 1: Identify which signal fired

Input drift only?
Score drift only?
Decision drift?
Human-signal drift?

This matters because:

input drift without score drift may be benign
score drift without input drift can indicate pipeline changes or calibration issues
decision drift can indicate threshold changes or routing changes

Step 2: Localize the drift with slicing

Slice by:

language
region
surface (web, app)
content category
traffic source

Look for “one slice blew up” vs “everything moved.”

Step 3: Inspect top contributing features

For numeric features with PSI/KS:

overlay histograms: baseline vs current
check whether the shift is in the tail or the center

For categorical JS drift:

list top categories by absolute delta
check for new categories or missing ones
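
For the categorical checks above, a minimal sketch that ranks categories by absolute change in share and flags new or missing ones. It assumes string labels passed as lists; the function name category_deltas and the top_k default are illustrative:

```python
from collections import Counter

def category_deltas(baseline_labels, current_labels, top_k=5):
    """Rank categories by absolute change in share; flag new and missing ones."""
    p, q = Counter(baseline_labels), Counter(current_labels)
    n_p, n_q = len(baseline_labels), len(current_labels)
    rows = []
    for cat in set(p) | set(q):
        base_share = p[cat] / n_p
        cur_share = q[cat] / n_q
        rows.append((cat, base_share, cur_share, cur_share - base_share))
    # Largest absolute delta first; ties broken by category name.
    rows.sort(key=lambda r: (-abs(r[3]), r[0]))
    return {
        "top": rows[:top_k],
        "new": sorted(set(q) - set(p)),      # seen now, absent from baseline
        "missing": sorted(set(p) - set(q)),  # in baseline, absent now
    }
```

A new category with a large share (for example, an unexpected language code) often points to an upstream change such as a detector update rather than a real traffic shift.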

Step 4: Connect drift to decisions

For moderation:

did tail rate above threshold increase?
did block rate increase?
did escalation rate increase?

If score drift does not affect decisions, it may not require urgent action.

Step 5: Check pipeline and product changes

Common causes:

language detector update
text normalization change
new UI entry point changes message style
region routing changes
sampling changes

Step 6: Decide the response

Use the action ladder below.

How to trigger actions

You don’t want drift monitoring to be a “retrain machine.” You want it to be a decision support system.

Investigate (default)

Trigger when:

drift is localized to a slice
decision rates are stable
human signals are stable

Actions:

open an incident ticket
sample examples from drifting slices
verify pipeline versions
check if a known event explains the change

Rollback or mitigate

Trigger when:

score drift + decision drift causes harm or cost spikes
pipeline drift is suspected (recent deploy correlated with drift)

Actions:

roll back the last pipeline/model change
enable fallback or conservative thresholds temporarily
tighten human review sampling for the drifting slices

Retrain or recalibrate

Trigger when:

drift persists
human signals worsen (appeals, overturns)
offline evaluation on newly labeled data confirms degradation

Actions:

collect labels for drifting segments
retrain with refreshed data
recalibrate thresholds per slice if appropriate
validate fairness and policy consistency
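
The action ladder can be sketched as a small decision function. This is only one possible encoding: the field names are hypothetical, and it assumes the retrain triggers must all hold together while either rollback trigger is sufficient on its own, which the text above does not specify.

```python
from dataclasses import dataclass

@dataclass
class DriftSignals:
    """Boolean summary of what monitoring and triage found (hypothetical fields)."""
    score_drift: bool = False
    decision_drift: bool = False
    harm_or_cost_spike: bool = False
    recent_deploy_correlated: bool = False
    drift_persists: bool = False
    human_signals_worse: bool = False
    offline_eval_degraded: bool = False

def choose_action(s):
    """Pick a rung of the action ladder; investigate is the default."""
    # Rollback/mitigate: harmful score + decision drift, or a suspicious deploy.
    if (s.score_drift and s.decision_drift and s.harm_or_cost_spike) \
            or s.recent_deploy_correlated:
        return "rollback_or_mitigate"
    # Retrain/recalibrate: persistent drift confirmed by humans and offline eval
    # (treating the three triggers as jointly required is an assumption).
    if s.drift_persists and s.human_signals_worse and s.offline_eval_degraded:
        return "retrain_or_recalibrate"
    return "investigate"
```

Keeping investigate as the fall-through matches the stance that drift monitoring is decision support, not an automatic retrain trigger.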

Limitations and future improvements

Limitations of this approach

Input drift does not guarantee performance drift.
Concept drift typically requires fresh labels or reliable human signals.
Multiple testing across many slices can create false alarms if you don’t control alerting.
Thresholds tuned on one period can be brittle during seasonal events or launches.

What I would add at scale

streaming sketches (approx quantiles, count-min) for real-time metrics
embedding drift for semantic shifts
active learning hooks to prioritize labeling of drifting segments
automated correlation with deploys and feature flags
per-slice calibration monitoring (ECE, Brier)
richer incident automation (links to dashboards, sampled examples, pipeline diffs)

