case study 2024-09-20 11 min read

DoorDash ML Monitoring: Building Observability for ML Systems

How DoorDash monitors its ML systems to ensure reliability and catch issues before they impact customers.

Tags: DoorDash, monitoring, observability, MLOps, reliability

Introduction

ML systems fail silently: predictions degrade without throwing obvious errors. DoorDash's ML monitoring system catches these issues before they impact food delivery. This case study explores their approach to ML observability.

Why ML Monitoring is Different

Traditional Software

  • Clear errors: Exceptions, crashes, timeouts
  • Binary states: Working or broken
  • Deterministic: Same input, same output

ML Systems

  • Silent failures: Wrong predictions, no errors
  • Continuous degradation: Quality gradually declines
  • Non-deterministic: Inherent variance

DoorDash's Monitoring Framework

Three Pillars

1. Data Quality    -> Are inputs correct?
2. Model Quality   -> Are predictions accurate?
3. System Health   -> Is infrastructure working?

Data Monitoring

Input Validation

class DataMonitor:
    """Runs input checks before features reach the model."""

    def check_features(self, features):
        # Each check returns True when the batch passes; any
        # single failure blocks or flags the prediction request.
        checks = [
            self.check_schema(features),         # expected fields and types present
            self.check_distributions(features),  # values within learned ranges
            self.check_missing_values(features), # null/NaN rate acceptable
            self.check_outliers(features),       # extreme values flagged
        ]
        return all(checks)
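Two of these checks can be made concrete with a minimal, runnable sketch. The feature names (`subtotal`, `distance_km`) and the expected-key set are hypothetical placeholders, not DoorDash's actual schema:

```python
import math

def check_schema(features, expected_keys=frozenset({"subtotal", "distance_km"})):
    """All expected feature keys are present and hold numeric values."""
    return expected_keys <= features.keys() and all(
        isinstance(features[k], (int, float)) for k in expected_keys
    )

def check_missing_values(features):
    """No None or NaN values among the features."""
    return all(
        v is not None and not (isinstance(v, float) and math.isnan(v))
        for v in features.values()
    )

good = {"subtotal": 23.5, "distance_km": 3.2}
bad = {"subtotal": None, "distance_km": 3.2}
print(check_schema(good), check_missing_values(good))  # True True
print(check_missing_values(bad))                       # False
```

Returning plain booleans keeps the checks composable with `all()`, as in the `DataMonitor` sketch above; a production system would typically also emit per-check metrics rather than a single pass/fail.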

Distribution Drift

  • Statistical tests (KS, Chi-squared)
  • Threshold-based alerts
  • Historical comparisons
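A KS-test drift check of the kind listed above can be sketched in a few lines with `scipy.stats` (the helper name and the `alpha` threshold are illustrative, not DoorDash's implementation):

```python
import numpy as np
from scipy import stats

def detect_drift(reference, current, alpha=0.01):
    """Flag drift when a two-sample KS test rejects the hypothesis
    that the reference and current feature windows share a distribution."""
    _, p_value = stats.ks_2samp(reference, current)
    return bool(p_value < alpha)

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5000)        # historical feature values
shifted = rng.normal(0.5, 1.0, 5000)         # simulated drifted window
print(detect_drift(baseline, baseline[:2500]))  # False: same distribution
print(detect_drift(baseline, shifted))          # True: drift flagged
```

In practice the threshold would be tuned per feature, and alerts would fire only after several consecutive windows flag drift, to avoid reacting to noise.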

Model Monitoring

Prediction Distribution

Track:

  • Mean and variance of predictions
  • Score distributions
  • Prediction entropy
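The three tracked quantities above can be computed per time window along these lines (a sketch; the binning scheme and dictionary keys are assumptions):

```python
import numpy as np

def prediction_stats(scores):
    """Mean, variance, and entropy of a window of prediction scores in [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    # Entropy of the binned score distribution: a collapse toward a
    # single bin (entropy near zero) can signal a stuck or degenerate model.
    hist, _ = np.histogram(scores, bins=10, range=(0.0, 1.0))
    probs = hist / hist.sum()
    probs = probs[probs > 0]
    entropy = float(-(probs * np.log(probs)).sum())
    return {"mean": float(scores.mean()), "var": float(scores.var()), "entropy": entropy}

stats_now = prediction_stats([0.1, 0.4, 0.35, 0.8, 0.95, 0.5])
```

Comparing `stats_now` against the same statistics from a healthy reference window gives a cheap first-line signal even before ground-truth outcomes arrive.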

Outcome Monitoring

Delayed feedback loop:

  • Predictions at time T
  • Outcomes at time T + delivery
  • Compare predictions vs. actuals
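The delayed join above can be sketched as follows. The delivery IDs, ETA values, and the choice of mean absolute error as the comparison metric are illustrative:

```python
# Hypothetical logs: predicted ETA minutes keyed by delivery ID, and
# actual durations that arrive only after each order completes.
predictions = {"d1": 30.0, "d2": 25.0, "d3": 40.0}
actuals = {"d1": 34.0, "d2": 24.0}  # d3 is still in flight

def outcome_error(predictions, actuals):
    """Mean absolute error over deliveries whose outcome has landed."""
    joined = [(predictions[k], actuals[k]) for k in actuals if k in predictions]
    return sum(abs(p - a) for p, a in joined) / len(joined)

print(outcome_error(predictions, actuals))  # 2.5
```

Because outcomes lag predictions, this metric is inherently delayed, which is why the input- and prediction-distribution checks above matter: they catch problems before ground truth arrives.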

A/B Test Monitoring

  • Metric differences between variants
  • Statistical significance tracking
  • Guardrail metrics
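Significance tracking between variants can be sketched with a two-sample t-test from `scipy.stats` (the metric, sample sizes, and effect size here are simulated, not DoorDash data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(30.0, 5.0, 1000)    # e.g., ETA error under variant A
treatment = rng.normal(28.0, 5.0, 1000)  # variant B, simulated improvement

t_stat, p_value = stats.ttest_ind(control, treatment)
significant = bool(p_value < 0.05)
```

Guardrail metrics (e.g., cancellation rate) would be tested the same way, but with the opposite decision rule: a significant regression on a guardrail blocks the rollout even if the primary metric improves.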

System Monitoring

Standard Metrics

  • Latency (P50, P99)
  • Throughput
  • Error rates
  • Resource utilization

ML-Specific Metrics

  • Feature computation time
  • Model inference time
  • Cache hit rates
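Computing the percentile latencies listed above is straightforward with NumPy; the distribution parameters below are simulated stand-ins, not DoorDash's numbers:

```python
import numpy as np

# Simulated per-request inference latencies in milliseconds; real systems
# would read these from a metrics store rather than generate them.
latencies_ms = np.random.default_rng(2).lognormal(mean=3.0, sigma=0.5, size=10_000)

p50, p99 = np.percentile(latencies_ms, [50, 99])
```

Tracking P99 alongside P50 matters because tail latency, not the median, is what delivery-time-sensitive callers experience during traffic spikes.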

Alerting Strategy

Tiered Alerts

| Severity | Response Time | Example |
| --- | --- | --- |
| P0 | 5 min | Model serving down |
| P1 | 1 hour | Significant drift detected |
| P2 | 1 day | Minor quality degradation |

Avoiding Alert Fatigue

  • Aggregate related alerts
  • Clear escalation paths
  • Actionable alert descriptions

Case Study: ETA Prediction

Monitored Metrics

  • Prediction accuracy by region
  • Feature staleness
  • Model latency

Caught Issues

  1. Feature pipeline delay - Detected via staleness monitoring
  2. Regional data issues - Detected via accuracy monitoring
  3. Traffic spikes - Detected via latency monitoring

Best Practices

  1. Start with business metrics, then work backward
  2. Alert on actionable conditions only
  3. Invest in root cause analysis tools
  4. Regular review of alert effectiveness


Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.