Introduction
ML systems fail silently: predictions degrade without raising obvious errors. DoorDash's ML monitoring system catches these issues before they impact food delivery. This case study explores their approach to ML observability.
Why ML Monitoring is Different
Traditional Software
- Clear errors: Exceptions, crashes, timeouts
- Binary states: Working or broken
- Deterministic: Same input, same output
ML Systems
- Silent failures: Wrong predictions, no errors
- Continuous degradation: Quality gradually declines
- Non-deterministic: Inherent variance
DoorDash's Monitoring Framework
Three Pillars
1. Data Quality -> Are inputs correct?
2. Model Quality -> Are predictions accurate?
3. System Health -> Is infrastructure working?
Data Monitoring
Input Validation
```python
class DataMonitor:
    def check_features(self, features):
        # Run every validation on the incoming feature batch;
        # the batch passes only if all individual checks pass.
        checks = [
            self.check_schema(features),
            self.check_distributions(features),
            self.check_missing_values(features),
            self.check_outliers(features),
        ]
        return all(checks)
```
Distribution Drift
- Statistical tests (KS, Chi-squared), as sketched below
- Threshold-based alerts
- Historical comparisons
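A minimal sketch of threshold-based drift detection on a single numeric feature, using a two-sample Kolmogorov-Smirnov test from SciPy. The baseline data, window sizes, and p-value threshold are illustrative assumptions, not DoorDash's actual configuration:

```python
import numpy as np
from scipy import stats

def detect_drift(reference: np.ndarray, current: np.ndarray, p_threshold: float = 0.01) -> bool:
    """Flag drift when the current window's distribution differs from the reference.

    Uses a two-sample Kolmogorov-Smirnov test for a numeric feature; a
    chi-squared test would play the same role for a categorical feature.
    """
    statistic, p_value = stats.ks_2samp(reference, current)
    return p_value < p_threshold

# Illustrative usage: compare the last hour's feature values against a training-time baseline.
baseline = np.random.normal(loc=30.0, scale=5.0, size=10_000)   # e.g. historical prep times
live = np.random.normal(loc=33.0, scale=5.0, size=2_000)        # shifted live traffic
if detect_drift(baseline, live):
    print("Distribution drift detected - raise a P1 alert")
```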
Model Monitoring
Prediction Distribution
Track, as sketched below:
- Mean and variance of predictions
- Score distributions
- Prediction entropy
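A sketch of how these summary statistics might be computed over a rolling window of model scores. Here "prediction entropy" is taken as the entropy of the score histogram (one common definition); the bin count and synthetic scores are illustrative:

```python
import numpy as np

def prediction_stats(scores: np.ndarray, eps: float = 1e-12) -> dict:
    """Summarize a window of prediction scores for monitoring dashboards."""
    # Histogram the scores so the shape of the distribution can be tracked over time.
    hist, _ = np.histogram(scores, bins=20, range=(0.0, 1.0))
    probs = hist / max(hist.sum(), 1)
    # Entropy of the score histogram: a sudden drop often means the model
    # has collapsed toward a single output value.
    entropy = -np.sum(probs * np.log(probs + eps))
    return {
        "mean": float(scores.mean()),
        "variance": float(scores.var()),
        "entropy": float(entropy),
    }

# Illustrative usage with synthetic classifier scores.
window = np.random.beta(a=2, b=5, size=5_000)
print(prediction_stats(window))
```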
Outcome Monitoring
Delayed feedback loop (sketched below):
- Predictions logged at time T
- Outcomes observed at time T + delivery duration
- Compare predictions against actuals once outcomes arrive
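A minimal sketch of the delayed join between logged predictions and later outcomes, using pandas. The column names and the MAE metric are illustrative (an ETA-style regression), not DoorDash's actual schema:

```python
import pandas as pd

# Predictions logged at serving time (time T).
predictions = pd.DataFrame({
    "delivery_id": [1, 2, 3],
    "predicted_eta_min": [32.0, 41.0, 28.0],
})

# Outcomes that arrive only after the delivery completes (time T + delivery).
outcomes = pd.DataFrame({
    "delivery_id": [1, 2, 3],
    "actual_eta_min": [35.0, 39.0, 44.0],
})

# Join on the shared key, then compare predictions against actuals.
joined = predictions.merge(outcomes, on="delivery_id", how="inner")
joined["abs_error"] = (joined["predicted_eta_min"] - joined["actual_eta_min"]).abs()
mae = joined["abs_error"].mean()
print(f"MAE over completed deliveries: {mae:.1f} min")
```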
A/B Test Monitoring
- Metric differences between variants
- Statistical significance tracking (sketched below)
- Guardrail metrics
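A minimal sketch of significance tracking for one metric across two variants, using Welch's t-test from SciPy. The variant data, metric, and alpha threshold are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def compare_variants(control: np.ndarray, treatment: np.ndarray, alpha: float = 0.05) -> dict:
    """Report the metric difference between variants and whether it is significant."""
    # Welch's t-test does not assume equal variance between variants.
    t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)
    return {
        "diff": float(treatment.mean() - control.mean()),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }

# Illustrative usage with synthetic per-delivery metric values (e.g. ETA error in minutes).
control = np.random.normal(loc=5.0, scale=2.0, size=4_000)
treatment = np.random.normal(loc=4.8, scale=2.0, size=4_000)
print(compare_variants(control, treatment))
```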
System Monitoring
Standard Metrics
- Latency (P50, P99)
- Throughput
- Error rates
- Resource utilization
ML-Specific Metrics
- Feature computation time (see the timing sketch after this list)
- Model inference time
- Cache hit rates
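A sketch of recording per-request timings and cache hits, then rolling them up into P50/P99. The in-memory dictionaries stand in for a real metrics backend (e.g. Prometheus or StatsD), and the sleep calls are placeholders for actual work:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

import numpy as np

timings = defaultdict(list)   # metric name -> list of durations in milliseconds
counters = defaultdict(int)   # metric name -> count

@contextmanager
def timed(name):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name].append((time.perf_counter() - start) * 1000)

# Illustrative serving loop: time feature computation and inference separately,
# and count cache hits vs. misses.
for _ in range(200):
    with timed("feature_computation_ms"):
        time.sleep(0.001)      # stand-in for feature store lookups
    with timed("model_inference_ms"):
        time.sleep(0.002)      # stand-in for the model forward pass
    counters["cache_hit" if np.random.rand() < 0.8 else "cache_miss"] += 1

inference = np.array(timings["model_inference_ms"])
print("inference P50/P99 (ms):", np.percentile(inference, 50), np.percentile(inference, 99))
print("cache hit rate:", counters["cache_hit"] / (counters["cache_hit"] + counters["cache_miss"]))
```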
Alerting Strategy
Tiered Alerts
| Severity | Response Time | Example |
|---|---|---|
| P0 | 5 min | Model serving down |
| P1 | 1 hour | Significant drift detected |
| P2 | 1 day | Minor quality degradation |
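One way to encode the tiers above so routing stays consistent in code; the severities and response times mirror the table, while the routing targets are hypothetical:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass(frozen=True)
class AlertTier:
    severity: str
    response_time: timedelta
    route_to: str  # hypothetical routing target

TIERS = {
    "P0": AlertTier("P0", timedelta(minutes=5), "pager"),       # e.g. model serving down
    "P1": AlertTier("P1", timedelta(hours=1), "on-call chat"),  # e.g. significant drift detected
    "P2": AlertTier("P2", timedelta(days=1), "ticket queue"),   # e.g. minor quality degradation
}
```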
Avoiding Alert Fatigue
- Aggregate related alerts (sketched below)
- Clear escalation paths
- Actionable alert descriptions
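A small sketch of aggregating related alerts so one underlying incident produces a single notification instead of dozens; the alert schema, grouping key, and time window are illustrative assumptions:

```python
from collections import defaultdict

def aggregate_alerts(alerts: list[dict], window_minutes: int = 15) -> list[dict]:
    """Collapse alerts that share a likely root cause into one notification.

    Each alert is expected to carry 'model', 'check', and 'timestamp_min' fields
    (a hypothetical schema for this sketch).
    """
    groups = defaultdict(list)
    for alert in alerts:
        # Bucket by model, failing check, and coarse time window.
        key = (alert["model"], alert["check"], alert["timestamp_min"] // window_minutes)
        groups[key].append(alert)
    return [
        {"model": model, "check": check, "count": len(group)}
        for (model, check, _), group in groups.items()
    ]

# Three drift alerts for the same model within 15 minutes become one notification.
alerts = [
    {"model": "eta_v3", "check": "drift", "timestamp_min": 100},
    {"model": "eta_v3", "check": "drift", "timestamp_min": 104},
    {"model": "eta_v3", "check": "drift", "timestamp_min": 110},
]
print(aggregate_alerts(alerts))
```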
Case Study: ETA Prediction
Monitored Metrics
- Prediction accuracy by region
- Feature staleness
- Model latency
Caught Issues
- Feature pipeline delay, detected via staleness monitoring (see the sketch below)
- Regional data issues, detected via accuracy monitoring
- Traffic spikes, detected via latency monitoring
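A sketch of the kind of freshness check that would catch a delayed feature pipeline; the feature names, timestamps, and 30-minute threshold are illustrative:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated: datetime, max_age: timedelta = timedelta(minutes=30)) -> bool:
    """Flag a feature whose upstream pipeline has not refreshed it recently."""
    return datetime.now(timezone.utc) - last_updated > max_age

# Illustrative feature metadata: traffic features refreshed every few minutes,
# restaurant prep-time features expected to refresh hourly.
feature_last_updated = {
    "regional_traffic_speed": datetime.now(timezone.utc) - timedelta(minutes=5),
    "restaurant_prep_time": datetime.now(timezone.utc) - timedelta(hours=3),
}
for name, updated_at in feature_last_updated.items():
    if is_stale(updated_at):
        print(f"STALE: {name} last updated {updated_at.isoformat()}")
```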
Best Practices
- Start with business metrics, then work backwards
- Alert only on actionable conditions
- Invest in root cause analysis tools
- Review alert effectiveness regularly
Master ML monitoring in production systems with our practical courses.