MLOps Is DevOps for Models
If you've done DevOps, MLOps will feel familiar in intent and foreign in detail. The goals are the same: automate the path from code change to production, catch problems early, enable fast iteration.
What's different: in ML, the artifact is a model, not a binary. And models degrade in production not because of code bugs, but because the world changes. This adds an entire monitoring dimension that software pipelines don't have.
The ML Continuous Delivery Lifecycle
```
Data Changes
      │
      ▼
[Data Validation] — Schema checks, distribution drift
      │
      ▼
[Feature Pipeline] — Compute and store features
      │
      ▼
[Model Training] — Train on new data
      │
      ▼
[Model Validation] — Performance metrics, regression tests
      │
      ▼
[Model Registry] — Versioned model storage
      │
      ▼
[Staging Deployment] — Shadow mode, canary
      │
      ▼
[Production] — Serve predictions
      │
      ▼
[Monitoring] — Drift, performance, feedback
      │
      └────▶ triggers retraining (back to [Model Training])
```
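The loop above can be sketched as code: each stage either passes its artifact forward or halts the run. This is a toy skeleton to show where the gates sit, with placeholder stage bodies rather than a real implementation.

```python
from typing import Optional

def validate_data(data: dict) -> Optional[dict]:
    # Gate 1: halt on schema problems before spending compute on training
    required = {"user_id", "label"}
    return data if required <= set(data) else None

def train_model(data: dict) -> dict:
    # Placeholder "training": a real stage would fit and serialize a model
    return {"trained_on": sorted(data)}

def validate_model(model: dict) -> bool:
    # Gate 2: a real stage would compare metrics against a baseline
    return "label" in model["trained_on"]

def run_pipeline(data: dict) -> str:
    checked = validate_data(data)
    if checked is None:
        return "halted: data validation failed"
    model = train_model(checked)
    if not validate_model(model):
        return "halted: model validation failed"
    return "deployed to staging"
```

The point of the shape: every arrow in the diagram is a place where the pipeline can stop, and stopping early is cheaper than shipping a bad model.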
GitHub Actions for ML Pipelines
A minimal CI pipeline for an ML project:
```yaml
# .github/workflows/ml_pipeline.yml
name: ML Pipeline

on:
  push:
    paths:
      - 'src/**'
      - 'params.yaml'
  schedule:
    - cron: '0 2 * * 1'  # Weekly retraining

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run data validation
        run: python src/validate_data.py

  train:
    needs: validate-data
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: python src/train.py
      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: model
          path: models/

  evaluate:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: model
      - name: Evaluate model
        run: python src/evaluate.py
      - name: Gate on performance
        run: |
          python - <<'EOF'
          import json

          with open('metrics/test_metrics.json') as f:
              metrics = json.load(f)
          # Baseline metrics are assumed to be versioned alongside the code
          with open('metrics/baseline_metrics.json') as f:
              baseline = json.load(f)

          assert metrics['auc'] >= 0.85, f"AUC {metrics['auc']} below threshold"
          assert metrics['auc'] >= baseline['auc'] - 0.01, "Regression detected"
          EOF

  deploy:
    needs: evaluate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: python src/deploy.py --env staging
```
The performance gate (`assert metrics['auc'] >= 0.85`) is the ML equivalent of a test suite. Don't deploy a model that regresses.
Data Validation with Great Expectations
```python
import great_expectations as gx
import pandas as pd

# The batch to validate (path is illustrative)
df = pd.read_parquet("data/training.parquet")

context = gx.get_context()
datasource = context.sources.add_pandas("my_data")
asset = datasource.add_dataframe_asset("training_data")

# Define expectations (your "schema tests")
suite = context.add_expectation_suite("training_suite")
validator = context.get_validator(
    batch_request=asset.build_batch_request(dataframe=df),
    expectation_suite=suite,
)

# Column existence
validator.expect_column_to_exist("user_id")
validator.expect_column_to_exist("label")

# Value constraints
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
validator.expect_column_values_to_not_be_null("label")
validator.expect_column_values_to_be_in_set("plan_type", ["free", "pro", "enterprise"])

# Distribution checks
validator.expect_column_mean_to_be_between("income", min_value=20000, max_value=200000)
validator.expect_column_proportion_of_unique_values_to_be_between("user_id", 0.99, 1.0)

# Run validation
results = validator.validate()
assert results.success, f"Data validation failed: {results}"
```
Catch data quality issues before they become model quality issues.
Experiment Tracking with MLflow
```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("churn-prediction")

with mlflow.start_run():
    # Log parameters
    mlflow.log_params({
        "model_type": "gradient_boosting",
        "learning_rate": 0.05,
        "n_estimators": 200,
        "max_depth": 4,
    })

    # Train
    model = GradientBoostingClassifier(
        learning_rate=0.05, n_estimators=200, max_depth=4
    )
    model.fit(X_train, y_train)

    # Log metrics
    mlflow.log_metrics({
        "train_auc": roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
        "val_auc": roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]),
        "test_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    })

    # Log model with an input/output signature for serving
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=mlflow.models.infer_signature(X_train, model.predict(X_train)),
    )
    run_id = mlflow.active_run().info.run_id

# Later: register the best run
mlflow.register_model(f"runs:/{run_id}/model", "ChurnModel")
```
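"Register the best run" usually means querying the tracking server rather than remembering a run ID. A sketch of that selection step: `mlflow.search_runs` is injected as a parameter here so the logic stands alone; the experiment and metric names match the example above.

```python
def best_model_uri(search_runs_fn, experiment: str = "churn-prediction") -> str:
    # Top run by validation AUC; pass mlflow.search_runs in real use
    runs = search_runs_fn(
        experiment_names=[experiment],
        order_by=["metrics.val_auc DESC"],
        max_results=1,
    )
    best_run_id = runs.iloc[0]["run_id"]  # search_runs returns a DataFrame
    return f"runs:/{best_run_id}/model"

# Real usage (assumes a reachable tracking server):
# mlflow.register_model(best_model_uri(mlflow.search_runs), "ChurnModel")
```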
The Model Registry Pattern
```python
import mlflow.pyfunc
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Transition model versions through stages:
# Staging → Production (after validation)
client.transition_model_version_stage(
    name="ChurnModel",
    version=5,
    stage="Production",
    archive_existing_versions=True,  # previous Production → Archived
)

# Load the current Production model
model = mlflow.pyfunc.load_model("models:/ChurnModel/Production")
predictions = model.predict(X_new)
```
The registry enforces the promotion workflow: a model can't jump straight to production without going through staging.
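The staging-to-production hop can itself be a gated operation. A sketch of one way to do it, assuming runs log a `val_auc` metric as in the tracking example; the 0.01 regression margin is an assumption, and `client` is a real `MlflowClient` in production use.

```python
def promote_if_better(client, name: str, version: int, margin: float = 0.01) -> bool:
    # Metric recorded by the candidate's training run
    staged = client.get_model_version(name=name, version=version)
    staged_auc = client.get_run(staged.run_id).data.metrics["val_auc"]

    # Compare against the incumbent Production model, if any
    prod = client.get_latest_versions(name, stages=["Production"])
    if prod:
        prod_auc = client.get_run(prod[0].run_id).data.metrics["val_auc"]
        if staged_auc < prod_auc - margin:
            return False  # regression: keep the current Production model

    client.transition_model_version_stage(
        name=name, version=version, stage="Production",
        archive_existing_versions=True,
    )
    return True
```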
Model Serving Patterns
Online Serving (REST API)
```python
# mlflow models serve -m "models:/ChurnModel/Production" -p 5001
# Or implement the endpoint manually:
from fastapi import FastAPI
import mlflow.pyfunc
import pandas as pd

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/ChurnModel/Production")

@app.post("/predict")
async def predict(request: dict):
    df = pd.DataFrame([request])
    return {"probability": float(model.predict(df)[0])}
```
Batch Scoring
```python
# For high-volume, non-latency-sensitive use cases: score inside Spark so
# work parallelizes across partitions instead of collecting to the driver.
import mlflow.pyfunc

def batch_score(date: str):
    # Wrap the registered model as a Spark UDF
    # (assumes an active SparkSession named `spark`)
    predict_udf = mlflow.pyfunc.spark_udf(
        spark, "models:/ChurnModel/Production", result_type="double"
    )

    # Load all users active since the cutoff date
    users = spark.sql(f"SELECT * FROM users WHERE last_active >= '{date}'")

    # Score in parallel across partitions, then write back
    feature_cols = [c for c in users.columns if c not in ("user_id", "last_active")]
    scored = users.withColumn("churn_score", predict_udf(*feature_cols))
    scored.write.mode("overwrite").saveAsTable("churn_scores")
```
Production Monitoring
```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Compare reference data (training) vs current production data
report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=training_df, current_data=production_df)
report.save_html("drift_report.html")

# Programmatic check (trigger_retraining_pipeline and alert_on_call_team
# are your own hooks, not Evidently functions)
drift_results = report.as_dict()
if drift_results["metrics"][0]["result"]["dataset_drift"]:
    trigger_retraining_pipeline()
    alert_on_call_team("Data drift detected; retraining triggered")
```
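Drift presets like these wrap standard statistical tests. To see the underlying idea, here is a hand-rolled Population Stability Index, a common drift score for a numeric feature; values above roughly 0.2 are a conventional alert threshold, and bin edges come from the reference sample.

```python
import math

def psi(reference, current, bins=10):
    """Population Stability Index between two samples of a numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) or 1.0  # degenerate reference collapses to one bin

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width * bins)
            counts[max(0, min(i, bins - 1))] += 1  # clamp out-of-range values
        # Small floor avoids log(0) for empty bins
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical distributions score near zero; a feature whose production values have shifted well past the training range scores far above the 0.2 threshold.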
The Key Differences from Software DevOps
| Software CI/CD | ML CI/CD |
|---|---|
| Test suite gates deploy | Metric threshold gates deploy |
| Bugs cause failures | Data drift causes silent degradation |
| Code is the artifact | Model (code + data + params) is the artifact |
| Rollback = revert code | Rollback = revert to previous model version |
| No scheduled retraining | Periodic or triggered retraining |
| Single versioning system | Separate versioning for data, code, models |
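The rollback row in the table deserves a sketch: with a model registry, rolling back means re-promoting the previously archived version rather than reverting code. This assumes promotions used `archive_existing_versions=True` (so the old Production model sits in Archived), and `client` is a real `MlflowClient` in production use.

```python
def rollback(client, name: str) -> int:
    # Latest Archived version is the model that was in Production last
    archived = client.get_latest_versions(name, stages=["Archived"])
    if not archived:
        raise RuntimeError(f"No archived version of {name} to roll back to")
    previous = archived[0]
    client.transition_model_version_stage(
        name=name,
        version=previous.version,
        stage="Production",
        archive_existing_versions=True,  # current (bad) model → Archived
    )
    return previous.version
```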
Explore the full production ML stack in our guide to large-scale ML system design.