Harmful Content Detection (Text-only, Pre-Publish Moderation)
Case study goal: Design an end-to-end machine learning system that blocks harmful posts before publishing, using low-latency inference, a policy-driven decision layer, and feedback loops for continuous improvement.
Table of contents
- At a glance
- Problem definition & requirements
- High-level architecture
- Data & feature design
- Modeling approach
- Training & evaluation
- Model serving & inference
- Monitoring & feedback loops
- Scalability, reliability & tradeoffs
- Security, privacy & ethics
- V2 extension: images/video (architecture changes)
- Summary
- Appendix: “What I’d say in an interview” (short callouts)
At a glance
- Problem: Detect policy-violating content in user posts before they go live.
- Scope (V1): Text-only moderation at publish time.
- Primary actions: Allow, Block, Request edit (optional), Log + audit.
- Latency SLO (target): P95 ≤ 300ms end-to-end decision for text posts.
- Key design choices:
- Policy engine decoupled from ML models (thresholds/actions versioned and auditable)
- Inference cascade (rules → fast ML → heavy ML on uncertain cases)
- Human-in-the-loop for appeals + sampling + training labels (not synchronous gating in V1)
Problem definition & requirements
Restate the problem
We need a system that intercepts post creation, evaluates text for policy violations, and prevents publishing when content violates community guidelines.
Functional requirements
- Score each new post against a policy taxonomy (e.g., harassment, hate, self-harm, sexual content, threats).
- Return a decision before publish:
- Allow (safe)
- Block (clear violation)
- Request edit (borderline / “rewrite to comply”) (optional UX path)
- Produce an audit record for every decision:
- post_id, user_id, timestamp
- policy_version, model_version
- decision, category scores, thresholds applied
- Support appeals and reversals (with durable traceability).
- Enable labeling pipelines from moderation + appeals outcomes.
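The audit record described above can be sketched as an immutable dataclass; field names mirror the list, and the example values are illustrative only:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditRecord:
    """One immutable record per moderation decision (field names are illustrative)."""
    post_id: str
    user_id: str
    timestamp: str
    policy_version: str
    model_version: str
    decision: str            # "allow" | "block" | "request_edit"
    category_scores: dict    # e.g. {"harassment": 0.91, "hate": 0.02}
    thresholds_applied: dict # thresholds in force at decision time

record = AuditRecord(
    post_id="p-123",
    user_id="u-456",
    timestamp=datetime.now(timezone.utc).isoformat(),
    policy_version="policy-2024-06-01",
    model_version="fast-clf-v7",
    decision="block",
    category_scores={"harassment": 0.93},
    thresholds_applied={"harassment": 0.85},
)
```

Freezing the dataclass makes accidental post-hoc mutation of a logged decision harder, which matters for appeals traceability.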
Non-functional requirements
- Latency: P95 ≤ 300ms for text-only decisioning (including feature fetches).
- Availability: 99.9%+ for the gating path; degrade safely.
- Throughput: handle burst traffic; scale horizontally.
- Consistency: deterministic decisions given the same model/policy versions.
- Evolvability: policy rules change frequently; changes must be deployable quickly.
Constraints & assumptions (explicit)
- Pre-publish enforcement means we cannot rely on long-running inference in the hot path.
- Text-only V1; multimedia comes later.
- Policy outcomes must be explainable at the policy level (category + guideline reference), even if the model is complex.
High-level architecture
Key idea
Use a low-latency moderation gate in front of publishing, powered by:
- a fast rules layer for high-precision patterns
- an ML cascade for generalization and adversarial robustness
- a policy engine that maps ML scores to actions
Online (pre-publish) path
- User submits post text to Post API
- Post API calls Moderation Gateway (synchronous)
- Moderation Gateway:
- normalizes text + detects language
- runs Rules/Heuristics
- runs Fast ML model
- conditionally runs Heavy ML model (only when needed)
- Policy Engine applies policy thresholds and returns decision
- Post API:
- publishes (Allow), or
- rejects (Block), or
- returns edit request (optional)
- Decision + scores are logged to Audit Log
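The online path's cascade can be sketched as a single orchestration function; the uncertainty band and the model callables are assumptions, not a fixed API:

```python
def moderate(text, rules_check, fast_model, heavy_model,
             uncertain_band=(0.3, 0.7)):
    """Cascade: rules -> fast model -> heavy model only for uncertain scores.

    rules_check(text) -> True if a high-precision rule fires (immediate block).
    fast_model(text) / heavy_model(text) -> violation probability in [0, 1].
    """
    if rules_check(text):
        return {"decision": "block", "stage": "rules", "score": 1.0}
    score = fast_model(text)
    lo, hi = uncertain_band
    if lo <= score <= hi:           # ambiguous middle -> escalate to heavy model
        score = heavy_model(text)
        stage = "heavy"
    else:
        stage = "fast"
    decision = "block" if score > hi else "allow"
    return {"decision": decision, "stage": stage, "score": score}
```

In production this would be an async service call with per-stage timeouts, but the control flow stays the same.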
Offline path (training + analytics)
- Ingest audit logs, moderator labels, appeals outcomes into Data Lake/Warehouse
- Run feature pipelines + training jobs
- Register models + evaluation artifacts
- Deploy via shadow → canary → rollout
Diagram: high-level architecture (online + offline paths)
Data & feature design
Data sources (V1)
- Raw post text
- Metadata: language, locale/region, time, client type
- User/account signals (optional, carefully constrained):
- account age, prior enforcement counts, rate-limit signals
Design choice: In V1, I bias toward content-first modeling.
User/account signals are useful for triage and rate limiting, but can create fairness issues if used too heavily in core content classification.
Feature extraction
Online (synchronous)
- Text normalization:
- unicode normalization, whitespace cleanup
- URL expansion (domain extraction)
- basic obfuscation handling (e.g., repeated punctuation, leetspeak patterns)
- Language detection (fast)
- Tokenization/encoding for ML model
- Optional low-latency user features from an online feature store
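A minimal version of the online normalization step, with an illustrative (not production-grade) leetspeak table:

```python
import re
import unicodedata

# Hypothetical leetspeak map; a real system would use a broader, maintained table.
_LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                       "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    """Lightweight online normalization: NFKC, lowercase, de-leet, collapse repeats."""
    text = unicodedata.normalize("NFKC", text).lower()  # unicode normalization
    text = text.translate(_LEET)                        # basic obfuscation handling
    text = re.sub(r"([!?.])\1{2,}", r"\1", text)        # "!!!!" -> "!"
    text = re.sub(r"\s+", " ", text).strip()            # whitespace cleanup
    return text
```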
Offline (asynchronous)
- Full text featurization for training
- Hard-negative mining datasets
- Drift analysis datasets
Feature storage
- Online feature store (KV): small set of stable, low-latency user/account features
- Offline store: parquet/warehouse tables for reproducible training
Modeling approach
V1 strategy: start simple, then earn complexity
Pre-publish enforcement forces a pragmatic approach: the model must be fast, stable, and calibratable.
Baseline models (V1)
- Rules/heuristics (high precision)
- explicit banned phrases
- known slurs/threat patterns
- spam URL/domain rules
- Fast text classifier
- options:
- TF-IDF + logistic regression (fast, interpretable)
- small/distilled transformer (better generalization)
- Heavy model (optional in V1, used sparingly)
- a stronger transformer (still CPU-friendly or GPU-batched)
- only invoked for uncertain cases or high-impact surfaces
Output format
- Multi-label category probabilities (e.g., hate, harassment, self-harm)
- Severity score (optional) or severity derived via category thresholds
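The TF-IDF + logistic regression option, producing the multi-label category probabilities described above, might look like this sketch (scikit-learn assumed; toy data and category names are illustrative):

```python
# Sketch of the fast-model option: TF-IDF + one-vs-rest logistic regression.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

CATEGORIES = ["harassment", "spam"]
texts = ["you are an idiot", "buy cheap pills now", "nice photo",
         "click this link now", "shut up loser", "great game last night"]
labels = np.array([[1, 0], [0, 1], [0, 0], [0, 1], [1, 0], [0, 0]])  # multi-label targets

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(texts, labels)

# One independent probability per category (multi-label, not softmax).
scores = clf.predict_proba(["you are a loser"])[0]
category_scores = dict(zip(CATEGORIES, scores))
```

The one-vs-rest structure matters: categories are not mutually exclusive, so each gets its own calibrated-ish probability that the policy engine can threshold independently.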
Why a cascade?
- Most content is clearly safe → fast pass
- A small slice is clearly harmful → rules + fast model catch
- The ambiguous middle → heavy model (bounded by a budget)
Diagram: inference cascade (rules → fast model → heavy model)
Training & evaluation
Label sources
- Moderator labels (highest quality)
- Appeals outcomes (critical for reducing false positives)
- User reports (noisy; used with debiasing and sampling strategies)
Data splits
- Time-based splits to avoid leakage and simulate real production drift:
- Train: weeks 1–6, Validate: week 7, Test: week 8
- Slice evaluations:
- language/locale
- short vs long posts
- obfuscation-heavy text (emoji, punctuation, leetspeak)
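The time-based split can be sketched directly; the 6/1/1-week boundaries follow the example above, and the row schema is illustrative:

```python
from datetime import date, timedelta

def time_based_split(rows, train_end: date, val_end: date):
    """Split labeled rows by timestamp: train < train_end <= val < val_end <= test."""
    train = [r for r in rows if r["date"] < train_end]
    val   = [r for r in rows if train_end <= r["date"] < val_end]
    test  = [r for r in rows if r["date"] >= val_end]
    return train, val, test

start = date(2024, 1, 1)
rows = [{"date": start + timedelta(days=i), "text": f"post {i}"} for i in range(56)]  # 8 weeks
train, val, test = time_based_split(rows,
                                    start + timedelta(weeks=6),   # weeks 1-6: train
                                    start + timedelta(weeks=7))   # week 7: val, week 8: test
```

Splitting on time rather than randomly prevents near-duplicate posts from leaking across splits and mimics how the model actually meets future data.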
Metrics (decision-centric)
Because we enforce via thresholds, model quality is not just “accuracy”:
- Precision/Recall per category (especially for severe harms)
- PR-AUC for imbalanced classes
- Calibration (reliability/ECE), because thresholds assume meaningful probabilities
- Overturn / appeal rates (operational quality signal)
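Calibration can be checked with a small ECE implementation (equal-width bins; a minimal sketch, not a library call):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted mean |accuracy - confidence| over equal-width probability bins."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # include the right edge only in the last bin
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            conf = probs[mask].mean()    # mean predicted probability in bin
            acc = labels[mask].mean()    # empirical positive rate in bin
            ece += mask.mean() * abs(acc - conf)
    return ece
```

Low ECE is what makes fixed thresholds in the policy engine meaningful: a 0.9 score should mean roughly 90% of such posts are true positives.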
Thresholding & policy mapping
- Thresholds live in the Policy Engine, not inside model code.
- Category thresholds differ by harm type:
- e.g., very low tolerance for credible threats; more nuance for sensitive-but-allowed content.
- Maintain policy_version + model_version in every logged decision.
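A minimal policy-engine sketch, with thresholds living in versioned config rather than model code (category names, version ids, and numbers are all illustrative):

```python
POLICY = {
    "version": "policy-v12",
    "thresholds": {
        # (block_at, review_at) per category; stricter for severe harms
        "credible_threat": (0.50, 0.30),
        "hate":            (0.85, 0.60),
        "harassment":      (0.90, 0.70),
    },
}

def apply_policy(category_scores: dict, policy=POLICY) -> dict:
    """Map model scores to an action; any category over its block threshold wins."""
    decision, triggered = "allow", None
    for cat, score in category_scores.items():
        block_at, review_at = policy["thresholds"][cat]
        if score >= block_at:
            return {"decision": "block", "category": cat,
                    "policy_version": policy["version"]}
        if score >= review_at and decision == "allow":
            decision, triggered = "request_edit", cat
    return {"decision": decision, "category": triggered,
            "policy_version": policy["version"]}
```

Because the config carries its own version id, every logged decision can cite exactly which thresholds produced it, and a policy change is a config rollout rather than a retrain.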
Model serving & inference
Serving goals (V1)
- Predictable low latency
- Safe degradation
- Versioned, auditable decisions
Serving components
- Moderation Gateway: orchestrates normalization → cascade → policy decision
- Model Serving:
- CPU-first for fast model
- heavy model behind a budget (CPU or GPU-batched)
- Policy Engine:
- thresholds + actions
- region/locale variants
- explainable policy labels
Latency budget example (P95 target 300ms)
- 20ms: text normalization + language detect
- 40ms: rules + fast model inference
- 40–150ms: heavy model (only for a small %)
- 20ms: policy decision + logging (async logging when possible)
Caching (optional)
- Cache scores for:
- exact duplicates (hash)
- near duplicates (optional later with embeddings)
- Cache is versioned by (model_version, policy_version).
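The versioned cache key for exact duplicates is a one-liner; SHA-256 here is an assumption, any stable content hash works:

```python
import hashlib

def cache_key(text: str, model_version: str, policy_version: str) -> str:
    """Exact-duplicate cache key: content hash namespaced by model and policy
    versions, so a model or policy rollout automatically invalidates stale scores."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_version}:{policy_version}:{digest}"
```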
Monitoring & feedback loops
What to monitor (V1)
System health
- latency (P50/P95/P99) by stage
- error rates, timeouts, fallbacks invoked
- throughput and queue backpressure
Model health (proxy signals)
- score distribution drift (by category)
- changes in “uncertainty rate” (heavy model invocation rate)
- appeal success rate and moderator overturn rate
- sampled audits of “allowed” posts to estimate false negatives
Retraining triggers
- scheduled retrains (e.g., weekly)
- drift thresholds exceeded
- spike in false positives (appeals/overturns)
- new policy category introduced
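Score-distribution drift (the first trigger class above) is often measured with the population stability index; a minimal sketch, with the common "PSI > 0.2" rule of thumb noted as an assumption to tune per system:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline score distribution and a live one.

    Rule of thumb (assumption, tune per system): PSI > 0.2 signals meaningful drift
    and is a candidate retraining trigger.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    e_hist, _ = np.histogram(expected, bins=bins)
    a_hist, _ = np.histogram(actual, bins=bins)
    e_frac = e_hist / max(e_hist.sum(), 1) + eps  # eps avoids log(0) in empty bins
    a_frac = a_hist / max(a_hist.sum(), 1) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Running this per category, comparing a frozen baseline week against a rolling live window, turns "drift thresholds exceeded" into an automatable check.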
Human-in-the-loop (V1 posture)
In V1, human moderation is not synchronous with publishing. Instead it supports:
- appeals processing
- QA sampling
- generating high-quality labels for future improvements
Diagram: feedback loop
Scalability, reliability & tradeoffs
Scalability
- Stateless Moderation Gateway → horizontal autoscaling
- Use an event bus for:
- async logging
- offline training ingestion
- Heavy model capacity planning:
- keep invocation rate bounded (e.g., ≤ 5% of posts)
Reliability & safe degradation
Pre-publish enforcement means we must explicitly define what happens when moderation itself fails.
- Fail closed (safer): block publishing if moderation is down
- best for safety-critical contexts, but hurts UX
- Fail open (riskier): allow publishing if moderation is down
- unacceptable for many platforms/categories
- Pragmatic V1 recommendation: Fail “soft-closed”:
- allow only for trusted users / low-risk surfaces, otherwise block
- show user a retry message for temporary outage
- rate limit suspicious bursts
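The soft-closed failover can be sketched as a tiny decision function; trust tiers and risk labels are illustrative, and a real system would derive them from account history and surface-level risk scoring:

```python
def degraded_decision(user_trust_tier: str, surface_risk: str) -> dict:
    """Failover policy when the moderation gate is unavailable ("soft-closed").

    Trusted users on low-risk surfaces may publish; everyone else is asked to retry.
    """
    if user_trust_tier == "trusted" and surface_risk == "low":
        return {"decision": "allow", "reason": "degraded-trusted"}
    return {"decision": "retry_later", "reason": "moderation-unavailable"}
```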
Key tradeoffs
- Safety vs UX: strict blocking reduces harm but increases false positives and user friction.
- Speed vs accuracy: cascades improve both, but add system complexity.
- Policy flexibility vs simplicity: decoupled policy engine is slightly more work up front, but pays off long-term.
Security, privacy & ethics
Security
- Rate-limit moderation endpoints to prevent probing.
- Protect model internals; expose only policy-level outcomes externally.
- RBAC for moderation tools and audit logs.
Privacy
- Minimize stored raw text; prefer hashed identifiers + necessary excerpts for audit/appeals.
- Encrypt at rest/in transit; retention policies by region.
Fairness & bias
- Evaluate error rates by language/locale slices.
- Avoid over-weighting user-history features in core classification without strong justification.
- Appeals provide a corrective channel; overturned decisions feed retraining.
V2 extension: images/video (architecture changes)
V1 is synchronous text gating. Multimedia typically requires hybrid moderation because media inference is slower.
What changes in V2
- Add an Async Media Moderation Pipeline:
- OCR (text-in-image), vision embeddings, keyframe sampling for video
- Change enforcement model to:
- Text pre-publish gate
- Media post-publish fast follow (seconds-level), with rapid enforcement actions
- Add Fusion logic in decision layer:
- combine text score + OCR score + vision score
- policy engine remains the same interface
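The fusion step can start as simple late fusion; taking the max over available modality scores is an assumption (any single confident modality can trigger enforcement), with a learned fusion model as the natural upgrade:

```python
def fuse_scores(text_score: float, ocr_score=None, vision_score=None) -> float:
    """Late fusion over available modality scores for one category.

    Max-pooling is a conservative baseline: a confident signal from any one
    modality is enough to drive the policy decision. Missing modalities
    (e.g. a text-only post) are simply skipped.
    """
    scores = [s for s in (text_score, ocr_score, vision_score) if s is not None]
    return max(scores)
```

Because the output is still one score per category, the policy engine interface is unchanged between V1 and V2, which is the point of the design.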
Diagram: V1 vs V2
Diagram: V2
Summary
This design delivers a practical, production-oriented harmful content detection system for text-only, pre-publish enforcement:
- Fast, scalable gating with predictable latency
- Cascade inference to balance accuracy and cost
- Policy engine separation for fast policy iteration and auditability
- Monitoring + feedback loops to improve over time
- A clean path to multimodal expansion without rewriting the core decisioning layer
Appendix: “What I’d say in an interview” (short callouts)
- Why decouple policy from model? Policies evolve weekly; retraining shouldn’t be required to adjust enforcement thresholds. Separation also improves auditability.
- How do you prevent unsafe behavior during outages? Define explicit failover modes. For pre-publish, default to soft-closed behavior with graceful UX + trust-tier exceptions.
- How do you tune thresholds? Use calibration + PR curves per category. Tune to minimize false negatives for severe harms, and use appeals/overturns to correct false positives.