Harmful Content Detection (Text-only, Pre-Publish Moderation)
Case study goal: Design an end-to-end machine learning system that blocks harmful posts before publishing, using low-latency inference, a policy-driven decision layer, and feedback loops for continuous improvement.
Table of contents
- At a glance
- Problem definition & requirements
- High-level architecture
- Data & feature design
- Modeling approach
- Training & evaluation
- Model serving & inference
- Monitoring & feedback loops
- Scalability, reliability & tradeoffs
- Security, privacy & ethics
- V2 extension: images/video (architecture changes)
- Summary
- Appendix: “What I’d say in an interview” (short callouts)
At a glance
- Problem: Detect policy-violating content in user posts before they go live.
- Scope (V1): Text-only moderation at publish time.
- Primary actions: Allow, Block, Request edit (optional), Log + audit.
- Latency SLO (target): P95 ≤ 300ms end-to-end decision for text posts.
- Key design choices:
- Policy engine decoupled from ML models (thresholds/actions versioned and auditable)
- Inference cascade (rules → fast ML → heavy ML on uncertain cases)
- Human-in-the-loop for appeals + sampling + training labels (not synchronous gating in V1)
Problem definition & requirements
Restate the problem
We need a system that intercepts post creation, evaluates text for policy violations, and prevents publishing when content violates community guidelines.
Functional requirements
- Score each new post against a policy taxonomy (e.g., harassment, hate, self-harm, sexual content, threats).
- Return a decision before publish:
- Allow (safe)
- Block (clear violation)
- Request edit (borderline / “rewrite to comply”) (optional UX path)
- Produce an audit record for every decision:
- post_id, user_id, timestamp
- policy_version, model_version
- decision, category scores, thresholds applied
- Support appeals and reversals (with durable traceability).
- Enable labeling pipelines from moderation + appeals outcomes.
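The audit record described above can be sketched as an immutable dataclass; field names mirror the list, and the example values are illustrative only:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditRecord:
    """One immutable record per moderation decision (field names are illustrative)."""
    post_id: str
    user_id: str
    timestamp: str
    policy_version: str
    model_version: str
    decision: str            # "allow" | "block" | "request_edit"
    category_scores: dict    # e.g. {"harassment": 0.91, "hate": 0.02}
    thresholds_applied: dict # thresholds in force at decision time

record = AuditRecord(
    post_id="p-123",
    user_id="u-456",
    timestamp=datetime.now(timezone.utc).isoformat(),
    policy_version="policy-2024-06-01",
    model_version="fast-clf-v7",
    decision="block",
    category_scores={"harassment": 0.93},
    thresholds_applied={"harassment": 0.85},
)
```

Freezing the dataclass makes accidental post-hoc mutation of a logged decision harder, which matters for appeals traceability.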
Non-functional requirements
- Latency: P95 ≤ 300ms for text-only decisioning (including feature fetches).
- Availability: 99.9%+ for the gating path; degrade safely.
- Throughput: handle burst traffic; scale horizontally.
- Consistency: deterministic decisions given the same model/policy versions.
- Evolvability: policy rules change frequently; changes must be deployable quickly.
Constraints & assumptions (explicit)
- Pre-publish enforcement means we cannot rely on long-running inference in the hot path.
- Text-only V1; multimedia comes later.
- Policy outcomes must be explainable at the policy level (category + guideline reference), even if the model is complex.
High-level architecture
Key idea
Use a low-latency moderation gate in front of publishing, powered by:
- a fast rules layer for high-precision patterns
- an ML cascade for generalization and adversarial robustness
- a policy engine that maps ML scores to actions
Online (pre-publish) path
- User submits post text to Post API
- Post API calls Moderation Gateway (synchronous)
- Moderation Gateway:
- normalizes text + detects language
- runs Rules/Heuristics
- runs Fast ML model
- conditionally runs Heavy ML model (only when needed)
- Policy Engine applies policy thresholds and returns decision
- Post API:
- publishes (Allow), or
- rejects (Block), or
- returns edit request (optional)
- Decision + scores are logged to Audit Log
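The online path's cascade can be sketched as a single orchestration function; the uncertainty band and the model callables are assumptions, not a fixed API:

```python
def moderate(text, rules_check, fast_model, heavy_model,
             uncertain_band=(0.3, 0.7)):
    """Cascade: rules -> fast model -> heavy model only for uncertain scores.

    rules_check(text) -> True if a high-precision rule fires (immediate block).
    fast_model(text) / heavy_model(text) -> violation probability in [0, 1].
    """
    if rules_check(text):
        return {"decision": "block", "stage": "rules", "score": 1.0}
    score = fast_model(text)
    lo, hi = uncertain_band
    if lo <= score <= hi:           # ambiguous middle -> escalate to heavy model
        score = heavy_model(text)
        stage = "heavy"
    else:
        stage = "fast"
    decision = "block" if score > hi else "allow"
    return {"decision": decision, "stage": stage, "score": score}
```

In production this would be an async service call with per-stage timeouts, but the control flow stays the same.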
Offline path (training + analytics)
- Ingest audit logs, moderator labels, appeals outcomes into Data Lake/Warehouse
- Run feature pipelines + training jobs
- Register models + evaluation artifacts
- Deploy via shadow → canary → rollout
Diagram: high-level architecture (online + offline paths)
Data & feature design
Data sources (V1)
- Raw post text
- Metadata: language, locale/region, time, client type
- User/account signals (optional, carefully constrained):
- account age, prior enforcement counts, rate-limit signals
Design choice: In V1, I bias toward content-first modeling.
User/account signals are useful for triage and rate limiting, but can create fairness issues if used too heavily in core content classification.
Feature extraction
Online (synchronous)
- Text normalization:
- unicode normalization, whitespace cleanup
- URL expansion (domain extraction)
- basic obfuscation handling (e.g., repeated punctuation, leetspeak patterns)
- Language detection (fast)
- Tokenization/encoding for ML model
- Optional low-latency user features from an online feature store
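A minimal version of the online normalization step, with an illustrative (not production-grade) leetspeak table:

```python
import re
import unicodedata

# Hypothetical leetspeak map; a real system would use a broader, maintained table.
_LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                       "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize(text: str) -> str:
    """Lightweight online normalization: NFKC, lowercase, de-leet, collapse repeats."""
    text = unicodedata.normalize("NFKC", text).lower()  # unicode normalization
    text = text.translate(_LEET)                        # basic obfuscation handling
    text = re.sub(r"([!?.])\1{2,}", r"\1", text)        # "!!!!" -> "!"
    text = re.sub(r"\s+", " ", text).strip()            # whitespace cleanup
    return text
```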
Offline (asynchronous)
- Full text featurization for training
- Hard-negative mining datasets
- Drift analysis datasets
Feature storage
- Online feature store (KV): small set of stable, low-latency user/account features
- Offline store: parquet/warehouse tables for reproducible training
Modeling approach
V1 strategy: start simple, then earn complexity
Pre-publish enforcement forces a pragmatic approach: the model must be fast, stable, and calibratable.
Baseline models (V1)
- Rules/heuristics (high precision)
- explicit banned phrases
- known slurs/threat patterns
- spam URL/domain rules
- Fast text classifier
- options:
- TF-IDF + logistic regression (fast, interpretable)
- small/distilled transformer (better generalization)
- Heavy model (optional in V1, used sparingly)
- a stronger transformer (still CPU-friendly or GPU-batched)
- only invoked for uncertain cases or high-impact surfaces
Output format
- Multi-label category probabilities (e.g., hate, harassment, self-harm)
- Severity score (optional) or severity derived via category thresholds
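The TF-IDF + logistic regression option, producing the multi-label category probabilities described above, might look like this sketch (scikit-learn assumed; toy data and category names are illustrative):

```python
# Sketch of the fast-model option: TF-IDF + one-vs-rest logistic regression.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

CATEGORIES = ["harassment", "spam"]
texts = ["you are an idiot", "buy cheap pills now", "nice photo",
         "click this link now", "shut up loser", "great game last night"]
labels = np.array([[1, 0], [0, 1], [0, 0], [0, 1], [1, 0], [0, 0]])  # multi-label targets

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(texts, labels)

# One independent probability per category (multi-label, not softmax).
scores = clf.predict_proba(["you are a loser"])[0]
category_scores = dict(zip(CATEGORIES, scores))
```

The one-vs-rest structure matters: categories are not mutually exclusive, so each gets its own calibrated-ish probability that the policy engine can threshold independently.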
Why a cascade?
- Most content is clearly safe → fast pass
- A small slice is clearly harmful → rules + fast model catch
- The ambiguous middle → heavy model (bounded by a budget)
Diagram: inference cascade (rules → fast model → heavy model)
Training & evaluation
Label sources
- Moderator labels (highest quality)
- Appeals outcomes (critical for reducing false positives)
- User reports (noisy; used with debiasing and sampling strategies)
Data splits
- Time-based splits to avoid leakage and simulate real production drift:
- Train: weeks 1–6, Validate: week 7, Test: week 8
- Slice evaluations:
- language/locale
- short vs long posts
- obfuscation-heavy text (emoji, punctuation, leetspeak)
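The time-based split can be sketched directly; the 6/1/1-week boundaries follow the example above, and the row schema is illustrative:

```python
from datetime import date, timedelta

def time_based_split(rows, train_end: date, val_end: date):
    """Split labeled rows by timestamp: train < train_end <= val < val_end <= test."""
    train = [r for r in rows if r["date"] < train_end]
    val   = [r for r in rows if train_end <= r["date"] < val_end]
    test  = [r for r in rows if r["date"] >= val_end]
    return train, val, test

start = date(2024, 1, 1)
rows = [{"date": start + timedelta(days=i), "text": f"post {i}"} for i in range(56)]  # 8 weeks
train, val, test = time_based_split(rows,
                                    start + timedelta(weeks=6),   # weeks 1-6: train
                                    start + timedelta(weeks=7))   # week 7: val, week 8: test
```

Splitting on time rather than randomly prevents near-duplicate posts from leaking across splits and mimics how the model actually meets future data.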
Metrics (decision-centric)
Because we enforce via thresholds, model quality is not just “accuracy”:
- Precision/Recall per category (especially for severe harms)
- PR-AUC for imbalanced classes
- Calibration (reliability/ECE), because thresholds assume meaningful probabilities
- Overturn / appeal rates (operational quality signal)
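Calibration can be checked with a small ECE implementation (equal-width bins; a minimal sketch, not a library call):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted mean |accuracy - confidence| over equal-width probability bins."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # include the right edge only in the last bin
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            conf = probs[mask].mean()    # mean predicted probability in bin
            acc = labels[mask].mean()    # empirical positive rate in bin
            ece += mask.mean() * abs(acc - conf)
    return ece
```

Low ECE is what makes fixed thresholds in the policy engine meaningful: a 0.9 score should mean roughly 90% of such posts are true positives.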
Thresholding & policy mapping
- Thresholds live in the Policy Engine, not inside model code.
- Category thresholds differ by harm type:
- e.g., very low tolerance for credible threats; more nuance for sensitive-but-allowed content.
- Maintain policy_version + model_version in every logged decision.
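A minimal policy-engine sketch, with thresholds living in versioned config rather than model code (category names, version ids, and numbers are all illustrative):

```python
POLICY = {
    "version": "policy-v12",
    "thresholds": {
        # (block_at, review_at) per category; stricter for severe harms
        "credible_threat": (0.50, 0.30),
        "hate":            (0.85, 0.60),
        "harassment":      (0.90, 0.70),
    },
}

def apply_policy(category_scores: dict, policy=POLICY) -> dict:
    """Map model scores to an action; any category over its block threshold wins."""
    decision, triggered = "allow", None
    for cat, score in category_scores.items():
        block_at, review_at = policy["thresholds"][cat]
        if score >= block_at:
            return {"decision": "block", "category": cat,
                    "policy_version": policy["version"]}
        if score >= review_at and decision == "allow":
            decision, triggered = "request_edit", cat
    return {"decision": decision, "category": triggered,
            "policy_version": policy["version"]}
```

Because the config carries its own version id, every logged decision can cite exactly which thresholds produced it, and a policy change is a config rollout rather than a retrain.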
Model serving & inference
Serving goals (V1)
- Predictable low latency
- Safe degradation
- Versioned, auditable decisions
Serving components
- Moderation Gateway: orchestrates normalization → cascade → policy decision
- Model Serving:
- CPU-first for fast model
- heavy model behind a budget (CPU or GPU-batched)
- Policy Engine:
- thresholds + actions
- region/locale variants
- explainable policy labels
Latency budget example (P95 target 300ms)
- 20ms: text normalization + language detect
- 40ms: rules + fast model inference
- 40–150ms: heavy model (only for a small %)
- 20ms: policy decision + logging (async logging when possible)
Caching (optional)
- Cache scores for:
- exact duplicates (hash)
- near duplicates (optional later with embeddings)
- Cache is versioned by (model_version, policy_version).
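The versioned cache key for exact duplicates is a one-liner; SHA-256 here is an assumption, any stable content hash works:

```python
import hashlib

def cache_key(text: str, model_version: str, policy_version: str) -> str:
    """Exact-duplicate cache key: content hash namespaced by model and policy
    versions, so a model or policy rollout automatically invalidates stale scores."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_version}:{policy_version}:{digest}"
```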
Monitoring & feedback loops
What to monitor (V1)
System health
- latency (P50/P95/P99) by stage
- error rates, timeouts, fallbacks invoked
- throughput and queue backpressure
Model health (proxy signals)
- score distribution drift (by category)
- changes in “uncertainty rate” (heavy model invocation rate)
- appeal success rate and moderator overturn rate
- sampled audits of “allowed” posts to estimate false negatives
Retraining triggers
- scheduled retrains (e.g., weekly)
- drift thresholds exceeded
- spike in false positives (appeals/overturns)
- new policy category introduced
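Score-distribution drift (the first trigger class above) is often measured with the population stability index; a minimal sketch, with the common "PSI > 0.2" rule of thumb noted as an assumption to tune per system:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline score distribution and a live one.

    Rule of thumb (assumption, tune per system): PSI > 0.2 signals meaningful drift
    and is a candidate retraining trigger.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    e_hist, _ = np.histogram(expected, bins=bins)
    a_hist, _ = np.histogram(actual, bins=bins)
    e_frac = e_hist / max(e_hist.sum(), 1) + eps  # eps avoids log(0) in empty bins
    a_frac = a_hist / max(a_hist.sum(), 1) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Running this per category, comparing a frozen baseline week against a rolling live window, turns "drift thresholds exceeded" into an automatable check.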
Human-in-the-loop (V1 posture)
In V1, human moderation is not synchronous with publishing. Instead it supports:
- appeals processing
- QA sampling
- generating high-quality labels for future improvements
Diagram: feedback loop
Scalability, reliability & tradeoffs
Scalability
- Stateless Moderation Gateway → horizontal autoscaling
- Use an event bus for:
- async logging
- offline training ingestion
- Heavy model capacity planning:
- keep invocation rate bounded (e.g., ≤ 5% of posts)
Reliability & safe degradation
Pre-publish enforcement means we must explicitly define what happens when moderation itself fails.
- Fail closed (safer): block publishing if moderation is down
- best for safety-critical contexts, but hurts UX
- Fail open (riskier): allow publishing if moderation is down
- unacceptable for many platforms/categories
- Pragmatic V1 recommendation: Fail “soft-closed”:
- allow only for trusted users / low-risk surfaces, otherwise block
- show user a retry message for temporary outage
- rate limit suspicious bursts
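The soft-closed failover can be sketched as a tiny decision function; trust tiers and risk labels are illustrative, and a real system would derive them from account history and surface-level risk scoring:

```python
def degraded_decision(user_trust_tier: str, surface_risk: str) -> dict:
    """Failover policy when the moderation gate is unavailable ("soft-closed").

    Trusted users on low-risk surfaces may publish; everyone else is asked to retry.
    """
    if user_trust_tier == "trusted" and surface_risk == "low":
        return {"decision": "allow", "reason": "degraded-trusted"}
    return {"decision": "retry_later", "reason": "moderation-unavailable"}
```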
Key tradeoffs
- Safety vs UX: strict blocking reduces harm but increases false positives and user friction.
- Speed vs accuracy: cascades improve both, but add system complexity.
- Policy flexibility vs simplicity: decoupled policy engine is slightly more work up front, but pays off long-term.
Security, privacy & ethics
Security
- Rate-limit moderation endpoints to prevent probing.
- Protect model internals; expose only policy-level outcomes externally.
- RBAC for moderation tools and audit logs.
Privacy
- Minimize stored raw text; prefer hashed identifiers + necessary excerpts for audit/appeals.
- Encrypt at rest/in transit; retention policies by region.
Fairness & bias
- Evaluate error rates by language/locale slices.
- Avoid over-weighting user-history features in core classification without strong justification.
- Appeals provide a corrective channel; overturned decisions feed retraining.
V2 extension: images/video (architecture changes)
V1 is synchronous text gating. Multimedia typically requires hybrid moderation because media inference is slower.
What changes in V2
- Add an Async Media Moderation Pipeline:
- OCR (text-in-image), vision embeddings, keyframe sampling for video
- Change enforcement model to:
- Text pre-publish gate
- Media post-publish fast follow (seconds-level), with rapid enforcement actions
- Add Fusion logic in decision layer:
- combine text score + OCR score + vision score
- policy engine remains the same interface
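The fusion step can start as simple late fusion; taking the max over available modality scores is an assumption (any single confident modality can trigger enforcement), with a learned fusion model as the natural upgrade:

```python
def fuse_scores(text_score: float, ocr_score=None, vision_score=None) -> float:
    """Late fusion over available modality scores for one category.

    Max-pooling is a conservative baseline: a confident signal from any one
    modality is enough to drive the policy decision. Missing modalities
    (e.g. a text-only post) are simply skipped.
    """
    scores = [s for s in (text_score, ocr_score, vision_score) if s is not None]
    return max(scores)
```

Because the output is still one score per category, the policy engine interface is unchanged between V1 and V2, which is the point of the design.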
Diagram: V1 vs V2
Diagram: V2
Summary
This design delivers a practical, production-oriented harmful content detection system for text-only, pre-publish enforcement:
- Fast, scalable gating with predictable latency
- Cascade inference to balance accuracy and cost
- Policy engine separation for fast policy iteration and auditability
- Monitoring + feedback loops to improve over time
- A clean path to multimodal expansion without rewriting the core decisioning layer
Appendix: “What I’d say in an interview” (short callouts)
- Why decouple policy from model? Policies evolve weekly; retraining shouldn’t be required to adjust enforcement thresholds. Separation also improves auditability.
- How do you prevent unsafe behavior during outages? Define explicit failover modes. For pre-publish, default to soft-closed behavior with graceful UX + trust-tier exceptions.
- How do you tune thresholds? Use calibration + PR curves per category. Tune to minimize false negatives for severe harms, and use appeals/overturns to correct false positives.