The Problem Feature Stores Solve
Imagine you build a churn prediction model. You engineer features: days since last login, number of support tickets, usage trend. The model performs well. You deploy it.
Three months later, a second team builds a revenue prediction model. They need the same features. They reimplement them — slightly differently. Now you have two versions of "days since last login," computed with different logic, stored in different tables, not necessarily consistent.
A year later, you want to retrain your churn model. But you can't reproduce the exact feature values you trained on — the feature computation logic has drifted, and historical data is stored differently.
This is the feature engineering mess that feature stores solve.
What a Feature Store Is
A feature store is a system for:
- Defining feature computation logic centrally
- Computing features consistently for training and serving
- Storing features in both offline (historical) and online (low-latency) stores
- Serving features at training time and inference time from the same definitions
Feature Store Architecture:

    Raw Data Sources
    (Databases, Streams, Files)
                │
                ▼
      [Feature Definitions]   ← Centralized, versioned feature code
                │
           ┌────┴────┐
           ▼         ▼
       [Offline]  [Online]
       [Store]    [Store]
       (S3,       (Redis,
       Parquet)   DynamoDB)
           │         │
           ▼         ▼
       Training    Serving
         Jobs        API
The Online/Offline Split
This is the core architectural decision in any feature store:
Offline store: historical data for training. Can be slow (minutes to hours), high volume, columnar format.
Online store: current feature values for serving. Must be fast (<10ms), single-entity lookups.
# Online store lookup (happens at prediction time)
# "What are the current features for user 12345?"
features = feature_store.get_online_features(
    features=["user_features:days_since_login", "user_features:num_tickets"],
    entity_rows=[{"user_id": "12345"}],
)
# Returns in ~5ms from Redis

# Offline store lookup (happens at training time)
# "What were the features for all users on 2024-01-15?"
training_df = feature_store.get_historical_features(
    features=["user_features:days_since_login", "user_features:num_tickets"],
    entity_df=entities_with_timestamps,  # user_id + event_timestamp
)
# Returns hours of data, takes minutes
Point-in-Time Correctness
The hardest problem in feature stores: avoiding temporal leakage.
If you're training a model to predict churn at time T, your features should only contain data available before T. This sounds obvious but is easy to get wrong.
# WRONG: features computed as of today, labels from the past
# This leaks future data into features
bad_query = """
SELECT u.user_id,
       u.num_logins_last_30d,  -- computed TODAY
       l.churned               -- label from 3 months ago
FROM users u
JOIN churn_labels l ON u.user_id = l.user_id
"""

# RIGHT: features computed as of the label date
# Feast handles this automatically with point-in-time joins
entity_df = pd.DataFrame({
    "user_id": user_ids,
    "event_timestamp": label_dates,  # Feature values as of THIS date
})
training_df = feature_store.get_historical_features(
    features=["user_features:num_logins_last_30d"],
    entity_df=entity_df,  # Feast joins features as of each row's timestamp
)
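Under the hood, a point-in-time join is essentially a backward as-of merge: for each label row, take the most recent feature value at or before that row's timestamp. A minimal pandas sketch of the idea (illustrative data, not Feast's actual implementation):

```python
import pandas as pd

# Feature values as they were recorded over time (hypothetical data)
feature_log = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-02-01"]),
    "num_logins_last_30d": [12, 25, 3],
}).sort_values("event_timestamp")

# Label rows: one per training example, stamped with the label date
labels = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "event_timestamp": pd.to_datetime(["2024-01-15", "2024-02-01"]),
    "churned": [1, 0],
}).sort_values("event_timestamp")

# Backward as-of join: each label row gets the latest feature value
# at or before its own timestamp -- never a future value
training_df = pd.merge_asof(
    labels,
    feature_log,
    on="event_timestamp",
    by="user_id",
    direction="backward",
)
# u1's label at 2024-01-15 picks up the 2024-01-01 value (12),
# not the later 2024-02-01 value (3)
print(training_df[["user_id", "num_logins_last_30d", "churned"]])
```

This is the operation a feature store performs for you at scale, across many feature tables at once.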
Implementing with Feast
Feast is the most widely used open-source feature store.
Feature Definitions
# features/user_features.py
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64

# Define the entity (primary key)
user = Entity(
    name="user_id",
    description="User ID",
)

# Data source
user_stats_source = FileSource(
    path="s3://your-bucket/user_stats.parquet",
    timestamp_field="event_timestamp",
)

# Feature view
user_stats_fv = FeatureView(
    name="user_features",
    entities=[user],
    ttl=timedelta(days=30),
    schema=[
        Field(name="days_since_login", dtype=Float64),
        Field(name="num_support_tickets", dtype=Int64),
        Field(name="usage_score", dtype=Float64),
        Field(name="subscription_months", dtype=Int64),
    ],
    online=True,
    source=user_stats_source,
)
Materializing Features (Offline → Online)
# Apply feature definitions
feast apply
# Materialize feature values to online store
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
This job reads the latest feature values from the offline store and writes them to the online store (Redis, in this setup), so serving lookups always see current values.
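In production the materialization command is typically run on a schedule rather than by hand. A crontab sketch (the repo path and log path are placeholders):

```shell
# Every 15 minutes: push fresh feature values to the online store.
# Crontab entry; /opt/feature-repo and /var/log/feast.log are placeholders.
*/15 * * * * cd /opt/feature-repo && feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S") >> /var/log/feast.log 2>&1
```

The interval bounds how stale online features can be; pick it based on how quickly your features actually change.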
Training Data Generation
from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path=".")

# Entity dataframe with timestamps
entity_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "event_timestamp": pd.to_datetime(["2024-01-15", "2024-02-01", "2024-03-10"]),
    "churned": [1, 0, 1],  # Your labels
})

# Get point-in-time features
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "user_features:days_since_login",
        "user_features:num_support_tickets",
        "user_features:usage_score",
    ],
).to_df()

# training_df now has features as of each user's event_timestamp
# No leakage — guaranteed
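From here, training is ordinary: drop the entity and timestamp columns and fit a model. A sketch with scikit-learn (a small stand-in frame replaces the Feast output so the snippet is self-contained; any framework works):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in for get_historical_features(...).to_df()
training_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "event_timestamp": pd.to_datetime(["2024-01-15"] * 4),
    "churned": [1, 0, 1, 0],
    "days_since_login": [40.0, 2.0, 55.0, 1.0],
    "num_support_tickets": [5, 0, 3, 1],
    "usage_score": [0.1, 0.9, 0.2, 0.8],
})

# Keep only the feature columns; entity/timestamp columns are metadata
feature_cols = ["days_since_login", "num_support_tickets", "usage_score"]
X = training_df[feature_cols]
y = training_df["churned"]

model = LogisticRegression().fit(X, y)
print(model.predict_proba(X)[:, 1])  # per-row churn probabilities
```

The column order in `feature_cols` matters: the serving path must assemble its feature vector in exactly the same order.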
Serving Features
# In your prediction service
store = FeatureStore(repo_path=".")

def predict_churn(user_id: str) -> float:
    # Fetch current features from Redis (<10ms)
    features = store.get_online_features(
        features=[
            "user_features:days_since_login",
            "user_features:num_support_tickets",
            "user_features:usage_score",
        ],
        entity_rows=[{"user_id": user_id}],
    ).to_dict()

    # Must match the column order used at training time
    feature_vector = [
        features["days_since_login"][0],
        features["num_support_tickets"][0],
        features["usage_score"][0],
    ]
    return model.predict_proba([feature_vector])[0, 1]  # model: your trained classifier
The same feature definitions are used for training data generation and for serving. This eliminates training-serving skew.
When You Actually Need a Feature Store
Feature stores add complexity. Don't use one until you have these symptoms:
Use a feature store when:
- Multiple models share features (reduces duplication)
- Training-serving skew is causing accuracy gaps in production
- You need point-in-time correct training data
- Features are expensive to compute and worth caching
- Teams are duplicating feature engineering work
Don't use one when:
- You have one model with simple features
- Your features are just transformations of request-time data
- You're still exploring / building a first version
For most ML projects, start with a simple preprocessing pipeline. Graduate to a feature store when reuse and consistency become genuine pain points.
Alternatives to Full Feature Stores
# Lightweight: store precomputed features in a database

# At training time:
features = db.query(
    "SELECT user_id, days_since_login FROM user_features "
    "WHERE computed_at = '2024-01-15'"
)

# At serving time (same feature, same logic):
feature = redis.get(f"user:{user_id}:days_since_login")
This manual approach works for early-stage systems. The feature store becomes valuable when the number of features, models, and teams grows.
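A minimal sketch of the dual-write pattern behind the manual approach: one function defines the feature, and a single job writes its output to both the historical table and the cache. SQLite and a dict stand in for the warehouse and Redis; all names here are illustrative:

```python
import sqlite3
from datetime import date

def days_since_login(last_login: date, as_of: date) -> int:
    # Single definition of the feature -- shared by both stores
    return (as_of - last_login).days

# Offline store stand-in: a SQL table keyed by (user_id, computed_at)
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE user_features "
    "(user_id TEXT, computed_at TEXT, days_since_login INT)"
)

# Online store stand-in: a key-value cache (Redis in production)
cache = {}

def materialize(user_id: str, last_login: date, as_of: date) -> None:
    # Compute once, write to both stores in the same job
    value = days_since_login(last_login, as_of)
    db.execute(
        "INSERT INTO user_features VALUES (?, ?, ?)",
        (user_id, as_of.isoformat(), value),
    )
    cache[f"user:{user_id}:days_since_login"] = value

materialize("12345", date(2024, 1, 1), date(2024, 1, 15))

# Training reads the historical row; serving reads the cache --
# both values came from the same function
row = db.execute(
    "SELECT days_since_login FROM user_features WHERE user_id='12345'"
).fetchone()
print(row[0], cache["user:12345:days_since_login"])  # 14 14
```

The fragile part is discipline: nothing in this setup stops a second team from writing their own `days_since_login` with different logic, which is exactly the failure mode a feature store guards against.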
See feature stores in the context of full production ML systems in our ML system design guide.