design pattern · 2025-03-31 · 13 min read

Production ML Anti-Patterns: What Goes Wrong in Real Systems

Learn the most common ML engineering mistakes in production: training-serving skew, silent model degradation, evaluation shortcuts, and the patterns used at top ML teams to avoid them.

MLOps · production ML · best practices · anti-patterns · model monitoring

Why ML Production Failures Are Different

In software, most bugs are detectable: crashes, error logs, failing tests. ML failures are often silent. The model keeps running, keeps returning predictions, keeps logging metrics — but has quietly become worse. By the time you notice, you've served bad predictions for weeks.

This guide documents the most common production ML anti-patterns and how to avoid them.

Anti-Pattern 1: Training-Serving Skew

What it is: The preprocessing applied at serving time differs from what was applied during training. The model receives inputs that don't match the distribution it learned from.

Example:

# Training code
mean = X_train["tenure_days"].mean()  # = 365.0
std = X_train["tenure_days"].std()    # = 180.0
X_train["tenure_days_scaled"] = (X_train["tenure_days"] - mean) / std

# Serving code — different implementation, subtle bug
def preprocess_request(request):
    # Bug: different normalization values hardcoded, or not normalized at all
    return {
        "tenure_days_scaled": request["tenure_days"] / 365  # WRONG
    }

Fix: Serialize the preprocessing logic with the model, not separately.

# Training
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier()),
])
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model.joblib")  # Scaler params travel with the model

# Serving — preprocessing is part of the artifact
pipeline = joblib.load("model.joblib")
prediction = pipeline.predict(new_data)  # Scaler applied automatically

For deep learning models, save preprocessing params explicitly:

import json

preprocessing_config = {
    "feature_means": X_train.mean().to_dict(),
    "feature_stds": X_train.std().to_dict(),
    "categorical_vocab": {col: list(X_train[col].unique()) for col in cat_cols},
}

with open("preprocessing_config.json", "w") as f:
    json.dump(preprocessing_config, f)
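
At serving time the same config is loaded and applied, so the normalization constants can never drift from training. A minimal sketch (the toy config values and the `preprocess` helper are illustrative):

```python
import json

# Assume this file was written at training time; a toy example for illustration:
config = {
    "feature_means": {"tenure_days": 365.0},
    "feature_stds": {"tenure_days": 180.0},
}
with open("preprocessing_config.json", "w") as f:
    json.dump(config, f)

# Serving side: load the config and apply the exact training-time normalization
with open("preprocessing_config.json") as f:
    config = json.load(f)

def preprocess(request: dict) -> dict:
    out = {}
    for feat, mean in config["feature_means"].items():
        std = config["feature_stds"][feat]
        out[f"{feat}_scaled"] = (request[feat] - mean) / std
    return out

print(preprocess({"tenure_days": 545}))  # {'tenure_days_scaled': 1.0}
```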

Anti-Pattern 2: Label Leakage

What it is: Features used in training contain information about the label that won't be available at inference time.

Classic examples:

  • Predicting loan default using "collection_calls_received" (only exists for defaulted loans)
  • Predicting disease using a diagnosis code that's set at the same time as the label
  • Predicting user churn using activity features computed after the churn date

How to catch it:

# Red flag 1: suspiciously high AUC (>0.99 for a hard problem)
# Red flag 2: feature importance dominated by one unexpected feature
# Red flag 3: model degrades catastrophically on new time period data

# Systematic check: for each feature, ask "would I have this at prediction time?"
def check_temporal_leakage(df, feature_cols, event_date_col, computed_at_col):
    """Flag features whose values were computed after the prediction event."""
    for col in feature_cols:
        leaked_rows = df[df[computed_at_col] > df[event_date_col]][col].notna().sum()
        if leaked_rows > 0:
            print(f"WARNING: {col} has {leaked_rows} values computed after the event date")

Anti-Pattern 3: Ignoring Class Imbalance

What it is: Training on imbalanced classes without accounting for it, then evaluating with accuracy.

# 99% negative, 1% positive
# A model that always predicts negative achieves 99% accuracy
# But it's completely useless

# Wrong metric for imbalanced data:
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    classification_report,
    roc_auc_score,
)

accuracy_score(y_test, predictions)  # 99% — misleading

# Right metrics:
print(classification_report(y_test, predictions))
# Shows precision, recall, F1 per class

print(f"ROC-AUC: {roc_auc_score(y_test, probabilities):.4f}")
print(f"PR-AUC: {average_precision_score(y_test, probabilities):.4f}")
# PR-AUC is especially informative for severe imbalance

Fixes:

# Class weights (easiest)
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

model = LogisticRegression(class_weight="balanced")
model = XGBClassifier(scale_pos_weight=neg_count / pos_count)

# Threshold tuning (don't always use 0.5)
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, probabilities)
# precision/recall have one more entry than thresholds — drop the final point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
# Choose the threshold that maximizes F1, or one that meets your business constraint
optimal_threshold = thresholds[np.argmax(f1)]
predictions = (probabilities >= optimal_threshold).astype(int)

Anti-Pattern 4: Evaluation on Shuffled Time-Series Data

What it is: Randomly splitting time-series data into train/test, which allows the model to "see" future data.

# WRONG for time-series data:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
# The test set includes data from 6 months ago; the train set includes data from today

# RIGHT: temporal split
cutoff_date = "2024-10-01"
train = df[df["date"] < cutoff_date]
test = df[df["date"] >= cutoff_date]

# Even better: include a gap to account for label delay
train = df[df["date"] < "2024-09-01"]
test = df[(df["date"] >= "2024-10-01") & (df["date"] < "2024-12-01")]
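
For cross-validation on temporal data, the same rule applies: every validation fold must come strictly after its training fold. scikit-learn's `TimeSeriesSplit` enforces this, as a short sketch shows (rows are assumed to be sorted by date):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # rows assumed sorted by date
tscv = TimeSeriesSplit(n_splits=4)

for train_idx, test_idx in tscv.split(X):
    # Every test index comes strictly after every train index
    assert train_idx.max() < test_idx.min()
    print(f"train=[0..{train_idx.max()}]  test=[{test_idx.min()}..{test_idx.max()}]")
```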

Anti-Pattern 5: No Monitoring in Production

What it is: Deploying a model and only checking if it's still running, not if it's still accurate.

Models degrade because:

  • Input distribution shifts (user behavior changes, new markets)
  • Label distribution shifts (concept drift — what "fraud" looks like changes)
  • Data pipeline issues (upstream schema changes, nulls, wrong values)

Minimum viable monitoring:

import pandas as pd
from scipy import stats

class ModelMonitor:
    def __init__(self, reference_data: pd.DataFrame):
        # Keep the raw reference sample (needed by the KS test) plus summary stats
        self.reference_data = reference_data
        self.reference_stats = {
            col: {
                "mean": reference_data[col].mean(),
                "std": reference_data[col].std(),
                "p25": reference_data[col].quantile(0.25),
                "p75": reference_data[col].quantile(0.75),
            }
            for col in reference_data.select_dtypes("number").columns
        }

    def check_drift(self, current_data: pd.DataFrame, alpha: float = 0.05) -> dict:
        results = {}
        for col in self.reference_stats:
            if col not in current_data.columns:
                results[col] = {"status": "MISSING"}
                continue

            # Kolmogorov-Smirnov test: are the distributions the same?
            stat, p_value = stats.ks_2samp(
                self.reference_data[col].dropna(),
                current_data[col].dropna()
            )
            results[col] = {
                "drifted": p_value < alpha,
                "p_value": p_value,
                "current_mean": current_data[col].mean(),
                "reference_mean": self.reference_stats[col]["mean"],
            }
        return results

# Alert if more than 20% of features have drifted
monitor = ModelMonitor(training_data)
drift_results = monitor.check_drift(this_weeks_data)
drifted_features = [k for k, v in drift_results.items() if v.get("drifted")]

if len(drifted_features) / len(drift_results) > 0.2:
    send_alert(f"Significant drift detected in: {drifted_features}")
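
One caveat with the KS test: on large samples it flags even trivial shifts as "significant." A common complementary score is the population stability index (PSI), which measures the magnitude of a shift rather than its statistical detectability. A minimal sketch (the cutoffs in the docstring are industry rules of thumb, not hard limits):

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two 1-D samples.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover out-of-range serving values
    ref_pct = np.histogram(reference, edges)[0] / len(reference)
    cur_pct = np.histogram(current, edges)[0] / len(current)
    eps = 1e-6  # avoid log(0) for empty bins
    ref_pct, cur_pct = ref_pct + eps, cur_pct + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
print(psi(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000)))    # near zero: no drift
print(psi(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000)))  # clearly elevated
```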

Anti-Pattern 6: Optimizing the Wrong Metric

What it is: The ML metric improves but the business metric doesn't, because the proxy metric doesn't capture what you actually care about.

Examples:

  • Optimizing click-through rate → users click but don't convert (misleading signal)
  • Optimizing completion rate → model recommends short, easy content (but not valuable)
  • Optimizing for accuracy on a balanced test set → performance collapses on the real-world, imbalanced population

Fix: before any modeling, align on the business metric. Define success criteria that go beyond ML metrics:

# Define evaluation protocol that matches deployment
# If you're A/B testing, define exactly what you'll measure
evaluation_criteria = {
    "primary_metric": "7_day_retention_lift",
    "guardrail_metrics": {
        "complaint_rate": {"threshold": "must_not_increase", "tolerance": 0.001},
        "p99_latency_ms": {"threshold": "must_be_below", "value": 200},
    },
    "minimum_sample_size": 10000,
    "minimum_duration_days": 7,
    "success_threshold": 0.02,  # Must improve retention by 2%+ to ship
}
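
To make criteria like these enforceable rather than aspirational, they can be checked mechanically against experiment results before a ship decision. A sketch with a hypothetical helper and result names (nothing here is a library API):

```python
# Hypothetical helper: turn the criteria dict into an explicit ship/no-ship decision
evaluation_criteria = {
    "primary_metric": "7_day_retention_lift",
    "guardrail_metrics": {
        "complaint_rate": {"threshold": "must_not_increase", "tolerance": 0.001},
        "p99_latency_ms": {"threshold": "must_be_below", "value": 200},
    },
    "minimum_sample_size": 10000,
    "minimum_duration_days": 7,
    "success_threshold": 0.02,
}

def check_ship_decision(results: dict, criteria: dict):
    failures = []
    if results[criteria["primary_metric"]] < criteria["success_threshold"]:
        failures.append("primary metric below success threshold")
    for metric, rule in criteria["guardrail_metrics"].items():
        if rule["threshold"] == "must_not_increase" and results[metric] > rule["tolerance"]:
            failures.append(f"{metric} increased beyond tolerance")
        elif rule["threshold"] == "must_be_below" and results[metric] >= rule["value"]:
            failures.append(f"{metric} above hard limit")
    if results["sample_size"] < criteria["minimum_sample_size"]:
        failures.append("not enough samples")
    if results["duration_days"] < criteria["minimum_duration_days"]:
        failures.append("experiment ran too short")
    return len(failures) == 0, failures

ab_results = {
    "7_day_retention_lift": 0.031,
    "complaint_rate": 0.0004,  # delta vs. control
    "p99_latency_ms": 142,
    "sample_size": 48_000,
    "duration_days": 14,
}
ship, reasons = check_ship_decision(ab_results, evaluation_criteria)
print(ship, reasons)  # True []
```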

Anti-Pattern 7: No Rollback Plan

What it is: Deploying a new model version without the ability to quickly revert.

# Bad: hard-coded model path
model = load_model("models/latest_model.pt")

# Good: versioned model registry with rollback
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Production model is always in the registry with explicit version
model = mlflow.pyfunc.load_model("models:/ChurnModel/Production")

# Rollback: one command (assumes current_version holds the live model's version number)
client.transition_model_version_stage(
    name="ChurnModel",
    version=current_version - 1,
    stage="Production",
    archive_existing_versions=True,
)

Keep the previous model version deployed in shadow mode during rollout. If metrics degrade, flip back in minutes.
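
The shadow pattern itself is small: the incumbent model answers the request, while the candidate runs on the same input and only gets logged. A minimal sketch (function and model names are hypothetical):

```python
import logging

logger = logging.getLogger("shadow")

def predict_with_shadow(request, production_model, shadow_model):
    # The incumbent model answers the live request
    prod_pred = production_model.predict(request)
    try:
        # The candidate sees the same input but never affects the response
        shadow_pred = shadow_model.predict(request)
        logger.info("shadow_compare prod=%s shadow=%s", prod_pred, shadow_pred)
    except Exception:
        # A broken candidate must not take down serving
        logger.exception("shadow model failed")
    return prod_pred
```

Comparing the two logged prediction streams offline tells you whether the candidate is safe to promote, before any user sees its output.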


Build production-grade ML systems with our guides to MLOps pipelines and ML system design.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.