How Top ML Teams Work: Inside Airbnb, Uber, and Stripe

What "High-Performing ML Team" Actually Means

Most discussions about ML teams focus on tools and architecture. The more important variables are organizational: who owns what, how decisions are made, how new models get shipped, and how failures are handled.

Three companies — Airbnb, Uber, and Stripe — have published unusually detailed accounts of how their ML organizations work. What emerges is a set of shared practices that explain why their ML investments generate outsized returns.

Airbnb: The Data Science Taxonomy

Airbnb's most influential contribution to ML team design is their "Data Science Taxonomy" — a framework for distinguishing between different types of data science work and staffing them differently.

The Three Types

Analytics: Understanding what happened and why. SQL-heavy, focused on business questions, uses dashboards and reports. This is a BI/analyst function.

Inference: Causal questions. Did this product change cause this outcome? What's the lift from this pricing experiment? This requires statistical expertise and experimental design knowledge.

Prediction: Building models for operational decisions. Fraud detection, search ranking, price recommendations, host quality scoring. This is ML engineering work.

Airbnb's insight: these three types have different tooling needs, different skills, and different success metrics. Teams that conflate them end up hiring the wrong people for each function and using the wrong tools for each problem type.
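To make the Inference category concrete, here is a minimal sketch of the calculation behind a question like "what's the lift from this pricing experiment?". The `Arm` class, the function names, and the traffic numbers are illustrative assumptions, not Airbnb's tooling. The test used is a standard pooled two-proportion z-test.

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    """One experiment arm (hypothetical structure for illustration)."""
    users: int
    conversions: int

    @property
    def rate(self) -> float:
        return self.conversions / self.users


def relative_lift(control: Arm, treatment: Arm) -> float:
    """Relative lift of treatment over control; 0.10 means +10%."""
    return (treatment.rate - control.rate) / control.rate


def two_proportion_z(control: Arm, treatment: Arm) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for a pooled two-proportion z-test."""
    pooled = (control.conversions + treatment.conversions) / (
        control.users + treatment.users
    )
    se = math.sqrt(pooled * (1 - pooled) * (1 / control.users + 1 / treatment.users))
    z = (treatment.rate - control.rate) / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of the standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up experiment: 5.0% vs 5.6% conversion on 10,000 users per arm.
control = Arm(users=10_000, conversions=500)
treatment = Arm(users=10_000, conversions=560)
lift = relative_lift(control, treatment)
z, p = two_proportion_z(control, treatment)
```

With these made-up numbers the measured lift is 12%, yet the p-value is about 0.058, so the result would not clear a conventional 5% significance threshold. Deciding what to do with a result like that is the statistical judgment the Inference role exists for.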

What This Means for Hiring

A team that needs fraud detection doesn't need a statistician — it needs an ML engineer who can build and maintain a production classifier. A team that needs to understand why bookings dropped 15% doesn't need an ML engineer — it needs a causal inference specialist.

Airbnb built separate career ladders for each type. This reduced a common problem: a strong ML engineer is evaluated on analytics output and finds the work demotivating, or an analyst is judged on production models they were never hired to build.

The Standardization Bets

Airbnb invested heavily in shared infrastructure to reduce duplicated work across ML teams:

Zipline: Their feature engineering platform. Centralized feature computation and serving, with point-in-time correctness. Teams contribute features to a shared registry and reuse each other's work.
Bighead: Their ML development environment. Consistent tooling for training, evaluation, and deployment across teams.
Minerva: Their metric platform. Single source of truth for business metrics, used by both analytics and ML teams.

The bet: standardization costs 30% of velocity in the short term and multiplies velocity 3x in the long term, because teams stop reinventing the wheel and building incompatible systems.
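The arithmetic of that bet can be worked through with a small sketch. The 30% cost and 3x multiplier come from the text above; the length of the investment period is an assumption, as are the function names.

```python
def cumulative_output(
    months: float,
    investment_months: float,
    short_term_factor: float = 0.7,  # 30% velocity cost while standardizing
    long_term_factor: float = 3.0,   # 3x velocity afterwards
) -> float:
    """Work delivered by a standardizing team, in baseline team-months."""
    slow = min(months, investment_months) * short_term_factor
    fast = max(months - investment_months, 0.0) * long_term_factor
    return slow + fast


def break_even_month(
    investment_months: float,
    short_term_factor: float = 0.7,
    long_term_factor: float = 3.0,
) -> float:
    """Month at which the standardizing team catches up with a baseline team.

    A baseline team delivers t units by month t. Solving
    short * T + long * (t - T) = t for t gives
    t = T * (long - short) / (long - 1).
    """
    return (
        investment_months
        * (long_term_factor - short_term_factor)
        / (long_term_factor - 1.0)
    )


# Assumed six-month investment period.
catch_up = break_even_month(6)
```

Under these numbers the break-even point is 1.15 times the investment period, so a six-month investment is repaid a little under one month after it ends. The conclusion is only as good as the two factors, which are hard to measure in practice.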

Uber: The Platform-First Model

Uber's ML organization is famous for investing in platform before application teams. Their approach: build the infrastructure that makes ML development 10x faster, then expect product teams to use it.

Michelangelo: The ML Platform

Uber built Michelangelo as an internal ML-as-a-service platform. The design goals:

Any team at Uber should be able to train, evaluate, and deploy a model without specialized infrastructure knowledge
Models trained in the platform should be automatically logged, versioned, and comparable
Serving should be a solved problem — teams shouldn't build serving infrastructure, they should use the platform's

The result: Uber went from building ML systems one at a time to having hundreds of production models maintained with shared tooling. The platform team multiplied the output of every product team.
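The second design goal (models automatically logged, versioned, and comparable) can be illustrated with a toy in-memory registry. This is a sketch under stated assumptions: `ModelRegistry`, `ModelVersion`, and their methods are hypothetical names, not Michelangelo's API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    params: dict
    metrics: dict
    created_at: datetime


class ModelRegistry:
    """In-memory stand-in for a platform-side model registry (hypothetical)."""

    def __init__(self) -> None:
        self._versions: dict[str, list[ModelVersion]] = {}

    def log(self, name: str, params: dict, metrics: dict) -> ModelVersion:
        """Record a training run; the version number is assigned automatically."""
        history = self._versions.setdefault(name, [])
        entry = ModelVersion(
            name=name,
            version=len(history) + 1,
            params=dict(params),
            metrics=dict(metrics),
            created_at=datetime.now(timezone.utc),
        )
        history.append(entry)
        return entry

    def best(self, name: str, metric: str, higher_is_better: bool = True) -> ModelVersion:
        """Compare all logged versions of one model on a single metric."""
        candidates = [v for v in self._versions[name] if metric in v.metrics]
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda v: v.metrics[metric])


# Usage: two training runs of the same (made-up) model.
registry = ModelRegistry()
registry.log("eta_model", {"max_depth": 4}, {"auc": 0.81})
second = registry.log("eta_model", {"max_depth": 6}, {"auc": 0.84})
```

The point of the design is that versioning happens as a side effect of training on the platform, so product teams get comparability without having to remember to record anything.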

The Platform Team vs. Product Team Split

Uber's organizational model separates ML teams into two types:

ML Platform Team: Builds and maintains Michelangelo. Success is measured by engineer-hours saved across all product teams, and the team's work is evaluated on adoption, reliability, and velocity impact.

Product ML Teams: Build application-specific models (surge pricing, fraud, rider recommendations, ETA prediction). They don't maintain infrastructure — they're consumers of the platform.

This split has a real tradeoff: platform teams need to build abstractions that work for many use cases, which sometimes means they can't optimize for any specific use case. Product teams sometimes need to work around platform limitations. Uber manages this with a tiered support model: common use cases are first-class platform features, uncommon use cases have escape hatches.

The Feature Store Investment

Palette, Uber's feature store, is one of the most widely cited examples of feature reuse at scale. The key properties:

Offline store: Large-scale batch features for model training
Online store: Low-latency serving of precomputed features at inference time
Point-in-time correctness: No data leakage — the training pipeline can only see features that were available at the time of the label
Feature discovery: An internal catalog where teams can find and reuse features from other teams
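Point-in-time correctness is the property most easily gotten wrong, so a small sketch may help. It assumes integer timestamps and a hypothetical `FeatureHistory` class standing in for the offline store; none of these names come from Palette.

```python
from bisect import bisect_right


class FeatureHistory:
    """Timestamped values of one feature, per entity (hypothetical stand-in)."""

    def __init__(self) -> None:
        self._times: dict[str, list[int]] = {}     # entity_id -> sorted timestamps
        self._values: dict[str, list[float]] = {}  # entity_id -> aligned values

    def write(self, entity_id: str, timestamp: int, value: float) -> None:
        times = self._times.setdefault(entity_id, [])
        values = self._values.setdefault(entity_id, [])
        idx = bisect_right(times, timestamp)  # keep both lists sorted by time
        times.insert(idx, timestamp)
        values.insert(idx, value)

    def as_of(self, entity_id: str, timestamp: int):
        """Latest value written at or before `timestamp`, or None if none existed."""
        times = self._times.get(entity_id, [])
        idx = bisect_right(times, timestamp)
        return self._values[entity_id][idx - 1] if idx > 0 else None


def build_training_rows(labels, features):
    """Join each label with feature values as they stood at the label's time.

    `labels` is an iterable of (entity_id, label_time, label) tuples and
    `features` maps feature name -> FeatureHistory.
    """
    rows = []
    for entity_id, label_time, label in labels:
        row = {"entity_id": entity_id, "label_time": label_time, "label": label}
        for name, history in features.items():
            row[name] = history.as_of(entity_id, label_time)
        rows.append(row)
    return rows


# Usage with made-up data: the value written at t=20 must not leak into a t=15 label.
cancel_rate = FeatureHistory()
cancel_rate.write("driver_1", 10, 0.2)
cancel_rate.write("driver_1", 20, 0.9)
rows = build_training_rows(
    [("driver_1", 15, 0), ("driver_1", 20, 1), ("driver_1", 5, 0)],
    {"cancel_rate": cancel_rate},
)
```

A naive join on entity ID alone would attach the latest value (0.9) to every row, which is exactly the leakage the property rules out.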

The feature store paid off when Uber's fraud team discovered that the ETA team's driver behavior features were strong signals for fraud detection. Reuse across teams, which is hard to achieve without centralized infrastructure, generated quality improvements that would not have happened in siloed architectures.

Stripe: ML on a Trust and Safety Foundation

Stripe's ML organization is shaped by a specific constraint: the primary use case is fraud detection, where mistakes have direct financial consequences, and where adversaries adapt to model outputs.

The Adversarial Challenge

Most ML systems deal with a data distribution that is stable or drifts slowly: user preferences change gradually and item features are stable. Fraud ML operates in an adversarial environment where professional fraudsters analyze model behavior and adapt.

This changes the architecture in several ways:

Feature engineering is adversarial: Stripe's fraud features include signals designed to be hard for fraudsters to spoof — device fingerprints, behavioral biometrics, network graph signals. Features that are easy to observe and replicate are less valuable.

Feedback loops are dangerous: If fraudsters can probe your model through real transactions, they can learn the decision boundary. Stripe uses a combination of rule-based and model-based decisions to make the decision surface less legible.

Model updates need to be fast: When a new fraud pattern emerges, the time-to-production for a model update is a business-critical metric. Stripe invests in fast retraining pipelines specifically because the adversarial environment changes faster than a weekly retraining cycle.
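The idea of combining rule-based and model-based decisions can be sketched as a rule layer that runs before a model score. Everything here is an assumption made for illustration: the `Transaction` fields, the rule, the toy scoring function, and the 0.8 threshold are invented, and this is not a description of Stripe's system.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Transaction:
    amount: float
    country_mismatch: bool
    card_attempts_last_hour: int


# A rule returns a reason string to block, or None to pass (hypothetical contract).
Rule = Callable[[Transaction], Optional[str]]


def velocity_rule(tx: Transaction) -> Optional[str]:
    return "too_many_attempts" if tx.card_attempts_last_hour > 10 else None


def toy_score(tx: Transaction) -> float:
    """Stand-in for a trained model; returns a fraud score in [0, 1]."""
    return (0.5 if tx.country_mismatch else 0.0) + min(tx.amount / 10_000, 0.5)


def decide(
    tx: Transaction,
    rules: list[Rule],
    score_fn: Callable[[Transaction], float],
    block_threshold: float = 0.8,
) -> tuple[str, str]:
    """Return (decision, source): rules are checked first, then the model score."""
    for rule in rules:
        reason = rule(tx)
        if reason is not None:
            return "block", f"rule:{reason}"
    if score_fn(tx) >= block_threshold:
        return "block", "model"
    return "allow", "model"


by_rule = decide(Transaction(20.0, False, 12), [velocity_rule], toy_score)
by_model = decide(Transaction(5_000.0, True, 1), [velocity_rule], toy_score)
allowed = decide(Transaction(20.0, False, 1), [velocity_rule], toy_score)
```

Because a declined transaction may come from either layer, someone probing the system with real payments sees a boundary that is harder to attribute to any single threshold.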

How Stripe Ships ML Models

Stripe's engineering blog describes a deliberate, cautious deployment process:

Shadow mode: New models run in parallel with the production model for weeks without making decisions. This generates a large dataset of "what would this model have decided?" which is analyzed for unexpected behavior before any traffic is shifted.

Gradual rollout: Traffic is shifted in small increments (1%, 5%, 20%, 50%, 100%) with evaluation between each step. Automated rollback triggers fire if key metrics regress.

Ongoing red-teaming: Stripe employs teams that actively try to fool their own fraud models to find blind spots before external adversaries do.
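The gradual rollout step can be sketched as a loop over traffic fractions with a regression check after each one. The step sizes are the ones listed above; the 2% tolerance, the metric names, and the `evaluate_at` callback are assumptions, and all metrics are treated as higher-is-better for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

ROLLOUT_STEPS = (0.01, 0.05, 0.20, 0.50, 1.00)


@dataclass
class RolloutResult:
    status: str                        # "completed" or "rolled_back"
    traffic: float                     # fraction served by the candidate at exit
    failed_metric: Optional[str] = None


def first_regression(observed: dict, baseline: dict, max_regression: float) -> Optional[str]:
    """Name of the first metric that fell more than `max_regression` below baseline."""
    for name, base in baseline.items():
        if observed[name] < base * (1.0 - max_regression):
            return name
    return None


def run_rollout(
    evaluate_at: Callable[[float], dict],
    baseline: dict,
    max_regression: float = 0.02,
    steps: Sequence[float] = ROLLOUT_STEPS,
) -> RolloutResult:
    """Shift traffic step by step; roll back to 0% on the first regression."""
    for fraction in steps:
        observed = evaluate_at(fraction)  # metrics measured at this traffic level
        failed = first_regression(observed, baseline, max_regression)
        if failed is not None:
            return RolloutResult("rolled_back", 0.0, failed)
    return RolloutResult("completed", steps[-1])


# Usage with made-up metrics: one healthy candidate, one that degrades at 20% traffic.
baseline = {"approval_rate": 0.95, "fraud_catch_rate": 0.80}
healthy = run_rollout(lambda f: {"approval_rate": 0.95, "fraud_catch_rate": 0.82}, baseline)
degraded = run_rollout(
    lambda f: {"approval_rate": 0.95 if f < 0.20 else 0.90, "fraud_catch_rate": 0.82},
    baseline,
)
```

In a real pipeline the evaluation at each step would run for long enough to collect meaningful data, and the rollback would be triggered by monitoring rather than by a synchronous loop.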

The Explanation Requirement

Stripe's models need to explain their decisions. Not for research purposes — for regulatory compliance. When a payment is declined, Stripe may need to provide the cardholder and the card network with an explanation.

This constraint shapes the model architecture: black-box deep models are often replaced or supplemented with interpretable models or post-hoc explanation layers. It also shapes the feature engineering: features that can be explained to a human are preferred over features that are statistically powerful but semantically opaque.
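One simple way to see why interpretable models help here: in a linear model, each feature's contribution to the score is just weight times value, so the largest positive contributions can be reported as reasons. The sketch below assumes invented feature names, weights, and bias; it illustrates the general technique, not Stripe's models.

```python
import math

# Hypothetical logistic-regression weights; positive values push toward "fraud".
WEIGHTS = {
    "card_country_mismatch": 1.8,
    "attempts_last_hour": 0.4,
    "account_age_days": -0.01,
}
BIAS = -3.0


def score_with_reasons(features: dict, top_k: int = 2) -> tuple[float, list[str]]:
    """Return (fraud probability, names of the top risk-increasing features)."""
    contributions = {
        name: weight * features.get(name, 0.0) for name, weight in WEIGHTS.items()
    }
    logit = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    # Only features that raised the score count as reasons for a decline.
    risk_raising = [name for name, value in contributions.items() if value > 0]
    risk_raising.sort(key=lambda name: contributions[name], reverse=True)
    return probability, risk_raising[:top_k]


probability, reasons = score_with_reasons(
    {"card_country_mismatch": 1, "attempts_last_hour": 6, "account_age_days": 30}
)
```

Each reason maps to a feature a human can understand, which is why semantically clear features are preferred even when an opaque one would score slightly better.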

What These Three Companies Have in Common

Despite different domains, the ML teams at Airbnb, Uber, and Stripe share several structural practices:

1. Centralized Feature Infrastructure

All three invested in shared feature stores. The key insight: feature engineering work done by one team is wasted if it can't be reused. Central feature infrastructure converts siloed work into shared leverage.

2. Platform Teams as Force Multipliers

All three have dedicated ML platform teams. These teams are measured differently from product ML teams — their KPI is how much faster they make everyone else, not how many models they ship.

3. Shadow Mode Before Production

All three run new models in shadow mode before committing to them. This is a form of production testing that generates real-world data without real-world risk. It is standard practice among top ML teams and one that less mature teams often skip.
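A minimal sketch of shadow mode follows, assuming both models are plain callables that map a request to a decision. `ShadowRunner` and its methods are hypothetical names. The essential properties are that only the production decision is returned and that a failing candidate can never affect the caller.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ShadowRunner:
    production: Callable[[Any], Any]
    candidate: Callable[[Any], Any]
    log: list = field(default_factory=list)

    def decide(self, request: Any) -> Any:
        """Serve the production decision; record what the candidate would have done."""
        decision = self.production(request)
        try:
            shadow = self.candidate(request)
        except Exception as exc:  # candidate errors are logged, never raised
            shadow = f"error:{type(exc).__name__}"
        self.log.append({"request": request, "production": decision, "shadow": shadow})
        return decision

    def disagreement_rate(self) -> float:
        """Share of logged requests where the two models decided differently."""
        if not self.log:
            return 0.0
        differing = sum(1 for r in self.log if r["production"] != r["shadow"])
        return differing / len(self.log)


# Usage with two made-up threshold models on transaction amounts.
runner = ShadowRunner(
    production=lambda amount: "block" if amount > 100 else "allow",
    candidate=lambda amount: "block" if amount > 80 else "allow",
)
served = [runner.decide(amount) for amount in (50, 90, 150)]
```

The log of paired decisions is the dataset the text describes: disagreements like the 90 case here are the ones worth reviewing by hand before any traffic is shifted.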

4. Post-Mortems on Model Failures

All three have cultures of blameless post-mortems on model failures, not just infrastructure incidents. When a model degrades or causes an incident, the team asks what broke, why monitoring didn't catch it earlier, and what change would prevent the same class of failure in the future.

5. Tight Feedback Loops

The time from "this model has a problem" to "we have a fix in production" is a key velocity metric at all three companies. They invest in fast retraining, fast evaluation, and fast deployment specifically to shorten this loop.
