design patterns · 2025-03-13 · 14 min read

MLOps for Software Engineers: CI/CD for Machine Learning

Apply CI/CD principles to machine learning. Learn how to automate training, testing, and deployment of ML models with practical patterns from production ML systems.

Tags: MLOps, CI/CD, DevOps, model deployment, automation

MLOps Is DevOps for Models

If you've done DevOps, MLOps will feel familiar in intent and foreign in detail. The goals are the same: automate the path from code change to production, catch problems early, enable fast iteration.

What's different: in ML, the artifact is a model, not a binary. And models degrade in production not because of code bugs, but because the world changes. This adds an entire monitoring dimension that software pipelines don't have.

The ML Continuous Delivery Lifecycle

Data Changes
     │
     ▼
[Data Validation]        — Schema checks, distribution drift
     │
     ▼
[Feature Pipeline]       — Compute and store features
     │
     ▼
[Model Training]         — Train on new data
     │
     ▼
[Model Validation]       — Performance metrics, regression tests
     │
     ▼
[Model Registry]         — Versioned model storage
     │
     ▼
[Staging Deployment]     — Shadow mode, canary
     │
     ▼
[Production]             — Serve predictions
     │
     ▼
[Monitoring]             — Drift, performance, feedback
     │
     └──── triggers retraining ───▶ back to [Model Training]

GitHub Actions for ML Pipelines

A minimal CI pipeline for an ML project:

# .github/workflows/ml_pipeline.yml
name: ML Pipeline

on:
  push:
    paths:
      - 'src/**'
      - 'params.yaml'
  schedule:
    - cron: '0 2 * * 1'  # Weekly retraining

jobs:
  validate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run data validation
        run: python src/validate_data.py

  train:
    needs: validate-data
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Train model
        run: python src/train.py
      - name: Upload model artifact
        uses: actions/upload-artifact@v4
        with:
          name: model
          path: models/

  evaluate:
    needs: train
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: model
          path: models/
      - name: Evaluate model
        run: python src/evaluate.py
      - name: Gate on performance
        run: |
          python - <<'EOF'
          import json
          with open('metrics/test_metrics.json') as f:
              metrics = json.load(f)
          # Baseline = the current production model's metrics, exported by evaluate.py
          with open('metrics/baseline_metrics.json') as f:
              baseline_auc = json.load(f)['auc']
          assert metrics['auc'] >= 0.85, f"AUC {metrics['auc']} below threshold"
          assert metrics['auc'] >= baseline_auc - 0.01, 'Regression detected'
          EOF

  deploy:
    needs: evaluate
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to staging
        run: python src/deploy.py --env staging

The performance gate (assert metrics['auc'] >= 0.85) is the ML equivalent of a test suite. Don't deploy a model that regresses.
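The gate also works as a small, testable helper instead of inline YAML. A minimal sketch, assuming evaluate.py writes the candidate's metrics and the current production baseline to JSON files (the paths and the baseline file are assumptions, not part of any standard layout):

```python
import json

def passes_gate(metrics_path: str, baseline_path: str,
                min_auc: float = 0.85, max_regression: float = 0.01) -> bool:
    """Return True only if the candidate model clears both gates:
    an absolute AUC floor, and no regression beyond max_regression
    against the current production baseline."""
    with open(metrics_path) as f:
        auc = json.load(f)["auc"]
    with open(baseline_path) as f:
        baseline_auc = json.load(f)["auc"]
    return auc >= min_auc and auc >= baseline_auc - max_regression
```

Wired into CI as a script that exits non-zero when `passes_gate` returns False, this blocks the deploy job exactly like a failing test suite would.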

Data Validation with Great Expectations

import great_expectations as gx
import pandas as pd

context = gx.get_context()
datasource = context.sources.add_pandas("my_data")
asset = datasource.add_dataframe_asset("training_data")

df = pd.read_parquet("data/training.parquet")  # the batch to validate (path is illustrative)

# Define expectations (your "schema tests")
suite = context.add_expectation_suite("training_suite")

validator = context.get_validator(
    batch_request=asset.build_batch_request(dataframe=df),
    expectation_suite=suite,
)

# Column existence
validator.expect_column_to_exist("user_id")
validator.expect_column_to_exist("label")

# Value constraints
validator.expect_column_values_to_be_between("age", min_value=0, max_value=120)
validator.expect_column_values_to_not_be_null("label")
validator.expect_column_values_to_be_in_set("plan_type", ["free", "pro", "enterprise"])

# Distribution checks
validator.expect_column_mean_to_be_between("income", min_value=20000, max_value=200000)
validator.expect_column_proportion_of_unique_values_to_be_between("user_id", 0.99, 1.0)

# Run validation
results = validator.validate()
assert results.success, f"Data validation failed: {results}"

Catch data quality issues before they become model quality issues.

Experiment Tracking with MLflow

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("churn-prediction")

with mlflow.start_run():
    # Log parameters
    params = {
        "model_type": "gradient_boosting",
        "learning_rate": 0.05,
        "n_estimators": 200,
        "max_depth": 4,
    }
    mlflow.log_params(params)

    # Train with the same hyperparameters we logged
    model = GradientBoostingClassifier(
        learning_rate=params["learning_rate"],
        n_estimators=params["n_estimators"],
        max_depth=params["max_depth"],
    )
    model.fit(X_train, y_train)

    # Log metrics
    mlflow.log_metrics({
        "train_auc": roc_auc_score(y_train, model.predict_proba(X_train)[:,1]),
        "val_auc": roc_auc_score(y_val, model.predict_proba(X_val)[:,1]),
        "test_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:,1]),
    })

    # Log model (with preprocessing pipeline)
    mlflow.sklearn.log_model(
        model,
        "model",
        signature=mlflow.models.infer_signature(X_train, model.predict(X_train)),
    )

    run_id = mlflow.active_run().info.run_id

# Later: register the best run
mlflow.register_model(f"runs:/{run_id}/model", "ChurnModel")

The Model Registry Pattern

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Transition model versions through stages
# Staging → Production (after validation)
client.transition_model_version_stage(
    name="ChurnModel",
    version=5,
    stage="Production",
    archive_existing_versions=True  # Previous production → Archived
)

# Load production model
model = mlflow.pyfunc.load_model("models:/ChurnModel/Production")
predictions = model.predict(X_new)

The registry makes the promotion workflow explicit: every version carries a stage, so a model shouldn't reach Production without first passing validation in Staging, and the transition history records who promoted what, and when.
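A promotion script can combine the metric gate with the stage transition. A sketch, not a definitive implementation: the `client` parameter is injectable purely so the decision logic can be exercised without a tracking server, and the AUC comparison rule is an assumption.

```python
def promote_if_better(name: str, staging_version: int,
                      staging_auc: float, production_auc: float,
                      client=None) -> bool:
    """Promote a Staging model version to Production only if it beats
    the current Production model; the registry archives the old one."""
    if staging_auc <= production_auc:
        return False  # keep the incumbent
    if client is None:
        # Lazy import: keeps the decision logic testable without MLflow installed
        from mlflow.tracking import MlflowClient
        client = MlflowClient()
    client.transition_model_version_stage(
        name=name,
        version=staging_version,
        stage="Production",
        archive_existing_versions=True,
    )
    return True
```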

Model Serving Patterns

Online Serving (REST API)

# mlflow models serve -m "models:/ChurnModel/Production" -p 5001
# Or implement manually:

from fastapi import FastAPI
import mlflow.pyfunc
import pandas as pd

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/ChurnModel/Production")

@app.post("/predict")
async def predict(request: dict):
    df = pd.DataFrame([request])
    return {"probability": float(model.predict(df)[0])}

Batch Scoring

# For high-volume, non-latency-sensitive use cases
import mlflow.pyfunc
from pyspark.sql.functions import struct

def batch_score(date: str):
    # Wrap the registry model as a Spark UDF so scoring actually runs in
    # parallel across partitions (toPandas() would collect to the driver)
    predict_udf = mlflow.pyfunc.spark_udf(spark, "models:/ChurnModel/Production")

    # Load all users active since the cutoff date
    users = spark.sql(f"SELECT * FROM users WHERE last_active >= '{date}'")

    # Score and write back (feature column selection is illustrative)
    feature_cols = [c for c in users.columns if c not in ("user_id", "last_active")]
    users.withColumn("churn_score", predict_udf(struct(*feature_cols))) \
         .write.mode("overwrite").saveAsTable("churn_scores")

Production Monitoring

from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset

# Compare reference data (training) vs current production data
report = Report(metrics=[DataDriftPreset(), TargetDriftPreset()])
report.run(reference_data=training_df, current_data=production_df)
report.save_html("drift_report.html")

# Programmatic check
drift_results = report.as_dict()
if drift_results["metrics"][0]["result"]["dataset_drift"]:
    trigger_retraining_pipeline()
    alert_on_call_team("Data drift detected — retraining triggered")
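The trigger itself reduces to a small decision function over the report dict, which keeps the retraining policy testable. A sketch, assuming the `dataset_drift` and `share_of_drifted_columns` keys from the drift preset's result dict (the 0.3 threshold is an assumption to tune):

```python
def should_retrain(drift_report: dict, drift_share_threshold: float = 0.3) -> bool:
    """Decide whether to kick off retraining from an Evidently-style
    report dict: retrain on the dataset-level drift flag, or when the
    share of drifting columns exceeds the threshold."""
    result = drift_report["metrics"][0]["result"]
    drifted_share = result.get("share_of_drifted_columns", 0.0)
    return bool(result.get("dataset_drift")) or drifted_share > drift_share_threshold
```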

The Key Differences from Software DevOps

Software CI/CD             | ML CI/CD
---------------------------|----------------------------------------------
Test suite gates deploy    | Metric threshold gates deploy
Bugs cause failures        | Data drift causes silent degradation
Code is the artifact       | Model (code + data + params) is the artifact
Rollback = revert code     | Rollback = revert to previous model version
No scheduled retraining    | Periodic or triggered retraining
Single versioning system   | Separate versioning for data, code, models
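The rollback row deserves emphasis: with a model registry, rolling back is one stage transition, not a code redeploy. A sketch under the same registry conventions as above (the injectable `client` is for offline testing; the version number is hypothetical):

```python
def rollback_to(name: str, version: int, client=None) -> None:
    """Re-promote a previous known-good model version to Production;
    archive_existing_versions demotes the bad current version."""
    if client is None:
        from mlflow.tracking import MlflowClient  # lazy, so this is testable offline
        client = MlflowClient()
    client.transition_model_version_stage(
        name=name,
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )

# e.g. rollback_to("ChurnModel", 4)  # 4 = last known-good version (hypothetical)
```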

Explore the full production ML stack in our guide to large-scale ML system design.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.