design pattern 2025-04-15 13 min read

ML Model Serving Patterns: Online, Batch, Streaming, and Embedded Inference

A systems design guide to the four core ML serving patterns—online, batch, streaming, and embedded—with architecture diagrams, tradeoffs, and when to use each.

model serving inference batch prediction streaming MLOps system design

The Four Serving Patterns

ML serving is not one problem—it's four distinct problems with fundamentally different constraints. Picking the wrong pattern is one of the most common and costly mistakes in production ML.

Pattern     Latency         Throughput   Complexity
Online      < 100ms         Low–Medium   Medium
Batch       Minutes–Hours   Very High    Low
Streaming   Seconds         High         High
Embedded    < 1ms           Low          Low

Pattern 1: Online (Synchronous) Serving

A client sends a request and waits for a prediction. The model runs on the hot path.

Architecture

Client → Load Balancer → Model Server (replicas) → Feature Store → Model
                                                           ↓
                                                    Prediction Cache
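
The prediction cache in the diagram can be as simple as a cache-aside lookup with a time-to-live. The sketch below is a minimal in-process version, assuming a prediction may be reused for `ttl_seconds`. `TTLPredictionCache` and `cached_predict` are illustrative names, not part of any library, and a production setup would more likely use a shared store such as Redis.

```python
import time
from typing import Callable, Dict, Hashable, Optional, Tuple


class TTLPredictionCache:
    """Cache-aside store for predictions with a per-entry time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        # key -> (expires_at, score)
        self._entries: Dict[Hashable, Tuple[float, float]] = {}

    def get(self, key: Hashable) -> Optional[float]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, score = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # stale: drop it and report a miss
            return None
        return score

    def put(self, key: Hashable, score: float) -> None:
        self._entries[key] = (self.clock() + self.ttl_seconds, score)


def cached_predict(cache: TTLPredictionCache, key: Hashable,
                   compute_score: Callable[[], float]) -> float:
    """Return the cached score if still fresh; otherwise compute, store, and return it."""
    score = cache.get(key)
    if score is None:
        score = compute_score()
        cache.put(key, score)
    return score
```

The TTL is the knob that trades freshness for load: a longer TTL removes more feature fetches and model calls from the hot path, at the cost of serving older scores.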

When to use

  • User-facing predictions (search ranking, recommendations, fraud detection)
  • Latency SLA < 200ms
  • Low to medium QPS (< 100K requests/second without horizontal scaling)

Implementation with FastAPI (traditional ML model)

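# For traditional ML models
import joblib
import numpy as np
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool
from pydantic import BaseModel

# Placeholders: substitute your own feature names and version string
FEATURE_LIST = ["txn_count_24h", "avg_amount_7d"]
MODEL_VERSION = "v1"

class PredictRequest(BaseModel):
    user_id: str  # assumed to be a string ID

app = FastAPI()
model = joblib.load("model.pkl")
# FeatureStoreClient stands in for your feature store's async client (not defined here)
feature_client = FeatureStoreClient()

@app.post("/predict")
async def predict(request: PredictRequest):
    # 1. Fetch features (often the bottleneck, so keep this call async)
    features = await feature_client.get_features(
        entity_id=request.user_id,
        feature_names=FEATURE_LIST,
    )

    # 2. Build feature vector
    X = np.array([[features[f] for f in FEATURE_LIST]])

    # 3. Predict in a worker thread so CPU-bound inference does not block the event loop
    proba = await run_in_threadpool(model.predict_proba, X)
    prediction = proba[0][1]

    return {"score": float(prediction), "model_version": MODEL_VERSION}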

Latency budget breakdown (typical fraud detection)

Total budget:          100ms
  Feature fetch:        40ms  (network to feature store)
  Model inference:      15ms  (XGBoost or small NN)
  Overhead:             10ms  (serialization, routing)
  Buffer:               35ms
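
One way to keep the feature fetch inside its share of the budget is to put a timeout on it and fall back to default features when the timeout fires. The sketch below assumes an async `fetch` callable and a dictionary of fallback values. `fetch_with_budget`, `DEFAULT_FEATURES`, and the feature names are illustrative; only the 40ms figure comes from the breakdown above.

```python
import asyncio
from typing import Awaitable, Callable, Dict

FEATURE_FETCH_BUDGET_S = 0.040  # 40 ms, the feature-fetch share of the budget

# Hypothetical fallback values used when the feature store is too slow
DEFAULT_FEATURES: Dict[str, float] = {"txn_count_24h": 0.0, "avg_amount_7d": 0.0}


async def fetch_with_budget(
    fetch: Callable[[], Awaitable[Dict[str, float]]],
    budget_s: float = FEATURE_FETCH_BUDGET_S,
) -> Dict[str, float]:
    """Await fetch(), but return a copy of the defaults once the budget is spent."""
    try:
        return await asyncio.wait_for(fetch(), timeout=budget_s)
    except asyncio.TimeoutError:
        return dict(DEFAULT_FEATURES)
```

Whether scoring on default features is acceptable depends on the model; for some fraud use cases a safer fallback may be to route the request to a rule-based check instead.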

Scaling for traffic spikes

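Use horizontal pod autoscaling to absorb spikes. The manifest below scales on CPU utilization, which suits CPU-bound models; for GPU-backed neural models, scale on GPU utilization or request queue depth instead, both of which must be exposed to the HPA as custom or external metrics: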

# k8s HPA for model server
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
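
When sizing `minReplicas` and `maxReplicas`, it helps to know how the autoscaler picks a replica count. Kubernetes documents the core rule as desired = ceil(currentReplicas × currentMetric / targetMetric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The function below is a simplified sketch of that rule using the bounds from the manifest above. It ignores stabilization windows, pod readiness, and missing metrics, and `desired_replicas` is an illustrative name.

```python
import math


def desired_replicas(
    current: int,
    current_util: float,
    target_util: float = 60.0,
    min_replicas: int = 3,
    max_replicas: int = 50,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule: scale by the metric ratio, then clamp to the bounds."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: no scaling
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 replicas at 90% average CPU against a 60% target scale to 5, while 10 replicas at 63% stay at 10 because the ratio is inside the tolerance band.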

Pattern 2: Batch Serving

Run predictions over a large dataset on a schedule. No client waiting.

Architecture

Scheduler → Batch Job → Read from data warehouse
                      → Run model in parallel
                      → Write predictions back to data warehouse / feature store

When to use

  • Pre-computing scores for all users (email targeting, daily recommendations)
  • High volume where online latency SLA can't be met
  • Exploratory analysis or model evaluation

Implementation with Spark + MLflow

# batch_predict.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import mlflow.pyfunc
import pandas as pd

spark = SparkSession.builder.appName("batch-inference").getOrCreate()

# Load model (broadcast to all executors)
model_uri = "models:/churn-model/production"
model = mlflow.pyfunc.load_model(model_uri)
broadcast_model = spark.sparkContext.broadcast(model)

@pandas_udf("double")
def predict_udf(features: pd.Series) -> pd.Series:
    m = broadcast_model.value
    X = pd.DataFrame(features.tolist())
    return pd.Series(m.predict(X))

# Read from data warehouse
df = spark.read.parquet("s3://data-warehouse/users/features/date=2025-04-15/")

# Generate predictions
predictions = df.withColumn(
    "churn_score",
    predict_udf(df["feature_vector"]),
)

# Write back
predictions.select("user_id", "churn_score").write.mode("overwrite").parquet(
    "s3://data-warehouse/predictions/churn/date=2025-04-15/"
)

Batch throughput tuning

  • Vectorization: Use NumPy/Pandas batch predict, not single-row predict in a loop
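  • Parallelism: Aim for roughly 2–4 Spark partitions per executor core (total partitions ≈ 2–4× total cores) so every core stays busy without excessive task overhead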
  • Memory: For large models, broadcast only if model fits in executor memory; otherwise load per-partition
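
The vectorization and per-partition loading advice can be sketched without Spark. The function below is a minimal illustration, assuming a `load_model` callable that returns an object with a batch `predict(rows)` method. `predict_partition`, `batch_size`, and the model interface are assumptions made for the example. In Spark, a function of this shape is the kind of thing you would pass to a per-partition API such as `mapPartitions`.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Sequence


def predict_partition(
    rows: Iterable[Sequence[float]],
    load_model: Callable[[], Any],
    batch_size: int = 1024,
) -> Iterator[float]:
    """Load the model once per partition, then score rows in fixed-size batches."""
    model = load_model()  # paid once per partition, not once per row
    it = iter(rows)
    while True:
        batch: List[Sequence[float]] = list(islice(it, batch_size))
        if not batch:
            break
        # One vectorized call per batch instead of one predict() call per row
        yield from model.predict(batch)
```

Batching also bounds memory: only `batch_size` rows are materialized at a time, regardless of how large the partition is.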

Pattern 3: Streaming (Near-Real-Time) Serving

Events flow through a message queue; predictions are computed within seconds of new data arriving.

Architecture

Event Source → Kafka → Stream Processor (Flink/Spark Streaming)
                              ↓
                    Feature Computation + Model Inference
                              ↓
                         Output Topic → Downstream Systems

When to use

  • Fraud detection where features depend on recent events (last 5 minutes of activity)
  • Content moderation (flag posts within seconds of publishing)
  • Real-time personalization with session-level features

The streaming feature problem

The hardest part of streaming inference isn't the model—it's feature computation. You need:

  1. Point-in-time correct features: No leakage from future events
  2. Windowed aggregations: "Number of transactions in last 10 minutes" requires stateful processing
  3. Low-latency feature store: Redis or DynamoDB, not BigQuery

Implementation with PyFlink

# Flink Python API example
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import ProcessFunction

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(8)

class FraudScoringFunction(ProcessFunction):
    def open(self, runtime_context):
        import joblib
        self.model = joblib.load("/models/fraud_v3.pkl")
        self.feature_store = RedisFeatureStore(host="redis:6379")

    def process_element(self, transaction, ctx):
        # Fetch pre-computed user features from Redis
        user_features = self.feature_store.get(transaction["user_id"])

        # Combine with transaction features
        X = build_feature_vector(transaction, user_features)

        # Cast to a plain float so the result serializes cleanly downstream
        score = float(self.model.predict_proba([X])[0][1])

        # Block transactions above the decision threshold; allow the rest
        action = "block" if score > 0.85 else "allow"
        yield {"transaction_id": transaction["id"], "action": action, "score": score}

transactions = env.from_source(kafka_source, ...)
results = transactions.process(FraudScoringFunction())
results.sink_to(kafka_sink)

env.execute("fraud-scoring")
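
The windowed aggregation named in the feature requirements ("number of transactions in last 10 minutes") is the kind of per-key state a stream processor has to maintain. The class below is a plain-Python sketch of that state, not Flink code. It assumes events for a key arrive in event-time order and counts only events at or before the current one, which is what keeps the feature point-in-time correct. `SlidingWindowCounter` and its method name are illustrative.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Per-key count of events in the trailing window, driven by event time."""

    def __init__(self, window_seconds: float = 600.0):
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def add_and_count(self, key: str, event_time: float) -> int:
        """Record an event and return the count in (event_time - window, event_time]."""
        events = self._events[key]
        events.append(event_time)
        cutoff = event_time - self.window_seconds
        # Evict events that have aged out of the window
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)
```

In a real job this state would live in the stream processor's keyed state, and out-of-order events would need watermark handling, which the sketch leaves out.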

Pattern 4: Embedded (Edge) Inference

The model runs inside the client application—no network call, no server.

When to use

  • Mobile apps where latency or offline support matters
  • Privacy requirements (data never leaves device)
  • Cost: serving infrastructure is expensive at scale

Model formats

ONNX           → Cross-platform, good for sklearn/PyTorch/TF models
TensorFlow Lite → Android/iOS, optimized for ARM
Core ML        → iOS/macOS, hardware-accelerated on Apple Silicon
GGUF           → LLMs on CPU/GPU via llama.cpp
ExecuTorch     → PyTorch-native mobile deployment (Meta)

Exporting to ONNX

import torch
import onnx

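# SentimentClassifier is your own torch.nn.Module; import it from your training code
model = SentimentClassifier()
# map_location="cpu" lets the export run on a machine without a GPU
model.load_state_dict(torch.load("model.pth", map_location="cpu"))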
model.eval()

dummy_input = torch.randint(0, 32000, (1, 128))  # batch=1, seq_len=128

torch.onnx.export(
    model,
    dummy_input,
    "sentiment.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
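    # Mark batch as dynamic on the output too, assuming logits has shape (batch, num_classes)
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "logits": {0: "batch"},
    },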
    opset_version=17,
)

# Verify
onnx_model = onnx.load("sentiment.onnx")
onnx.checker.check_model(onnx_model)

Running with ONNX Runtime

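import onnxruntime as ort
import numpy as np

session = ort.InferenceSession(
    "sentiment.onnx",
    # CPU only for on-device use; add a hardware-specific provider if the target supports one
    providers=["CPUExecutionProvider"],
)

# Stand-in for real tokenizer output: int64 token IDs of shape (batch, seq_len)
input_ids = np.random.randint(0, 32000, size=(1, 128), dtype=np.int64)

# ~5–20ms on CPU for a small model
(logits,) = session.run(
    ["logits"],
    {"input_ids": input_ids},
)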
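
`session.run` returns raw logits, so the client still has to turn them into a label. A minimal sketch in plain Python, assuming a single row of logits and a hypothetical two-class label order (`LABELS` is not taken from the original model):

```python
import math
from typing import List, Sequence, Tuple

LABELS = ["negative", "positive"]  # hypothetical class order


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits: Sequence[float], labels: Sequence[str] = LABELS) -> Tuple[str, float]:
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]
```

With the array returned by ONNX Runtime, the call would look like `top_label(logits[0].tolist())`.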

Choosing the Right Pattern: Decision Tree

Is a human waiting for the result?
  YES → Is the latency SLA < 500ms?
          YES → Online serving
          NO  → Can you pre-compute?
                  YES → Batch (pre-compute before the request)
  NO  → Are features time-sensitive (stale within minutes)?
          YES → Streaming
          NO  → Batch (scheduled job)

Would sending data off the device create unacceptable risk or cost?
  YES → Embedded
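
The tree can also be written as a small function, which makes the branch order explicit. This is a sketch of the tree as drawn, with illustrative parameter names. Checking the on-device question first is an assumption, since the tree lists it as a separate question, and the one combination the tree leaves open returns `None`.

```python
from typing import Optional


def choose_pattern(
    human_waiting: bool,
    latency_sla_ms: float,
    can_precompute: bool,
    features_time_sensitive: bool,
    data_must_stay_on_device: bool,
) -> Optional[str]:
    """Map the decision-tree questions to a serving pattern name."""
    if data_must_stay_on_device:
        return "embedded"
    if human_waiting:
        if latency_sla_ms < 500:
            return "online"
        if can_precompute:
            return "batch"
        return None  # slow SLA and nothing to pre-compute: not covered by the tree
    return "streaming" if features_time_sensitive else "batch"
```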

Hybrid Patterns

Real systems often combine patterns:

  • Batch + Online: Batch computes user embeddings nightly; online ranking uses them in real-time
  • Streaming + Online: Streaming updates a real-time feature (recent activity count); online model reads it
  • Embedded + Online: Embedded model handles common cases; online model handles edge cases the small model is uncertain about (cascading)
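
The cascading variant in the last bullet can be sketched as a confidence gate. The example assumes the embedded model returns class probabilities and that `remote_predict` is a callable wrapping the online endpoint. Both names and the 0.9 threshold are placeholders to tune, not recommendations.

```python
from typing import Callable, Sequence, Tuple


def cascade_predict(
    local_probs: Sequence[float],
    remote_predict: Callable[[], int],
    threshold: float = 0.9,
) -> Tuple[int, str]:
    """Return (class_index, source), where source is 'embedded' or 'online'."""
    best = max(range(len(local_probs)), key=local_probs.__getitem__)
    if local_probs[best] >= threshold:
        return best, "embedded"  # confident: no network call needed
    return remote_predict(), "online"  # uncertain: defer to the larger model
```

The threshold sets the split of traffic between the two models: raising it sends more requests to the online model, which costs more but leans less on the small model's judgment.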

Design the full ML system end-to-end with our ML Systems Design Patterns guide.
