Introduction
ML systems are notoriously hard to test: behavior depends on data as much as on code, and outputs are probabilistic rather than exact. This guide provides practical strategies for building confidence in your ML systems through comprehensive testing.
The Testing Pyramid for ML
Layer 1: Unit Tests
Test individual components:
```python
import pytest

def test_feature_transformation():
    raw = {"price": 100, "quantity": 5}
    features = transform_features(raw)
    assert features["total"] == 500
    assert features["price_normalized"] == pytest.approx(0.1)
```
Layer 2: Integration Tests
Test component interactions:
```python
def test_feature_pipeline():
    input_data = load_test_batch()
    features = feature_pipeline.transform(input_data)
    assert features.shape == (100, 50)
    assert not features.isnull().any().any()
```
Layer 3: Model Tests
Test model behavior:
```python
def test_model_predictions():
    model = load_model("production")
    test_cases = [
        ({"feature": 0}, "class_a"),
        ({"feature": 1}, "class_b"),
    ]
    for features, expected in test_cases:
        pred = model.predict(features)
        assert pred == expected
```
ML-Specific Testing Patterns
Data Validation
```python
import great_expectations as ge

def validate_training_data(df):
    # Expectations are methods on a wrapped dataset, not module-level functions
    dataset = ge.from_pandas(df)
    results = [
        dataset.expect_column_values_to_be_between("age", min_value=0, max_value=120),
        dataset.expect_column_values_to_not_be_null("target"),
        dataset.expect_column_unique_value_count_to_be_between("category", min_value=5, max_value=20),
    ]
    return all(result.success for result in results)
```
Model Invariants
Test that model respects known invariants:
```python
def test_model_invariants():
    # Monotonicity: higher credit score -> lower risk
    low_score = model.predict(credit_score=500)
    high_score = model.predict(credit_score=800)
    assert high_score["risk"] < low_score["risk"]
```
Slice Testing
Test performance on important subgroups:
```python
def test_model_fairness():
    for group in ["A", "B", "C"]:
        subset = test_data[test_data["group"] == group]
        accuracy = model.score(subset)
        assert accuracy > 0.8, f"Low accuracy for group {group}"
```
Continuous Testing
Pre-commit Checks
- Feature schema validation
- Code linting
- Fast unit tests
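A feature schema check like the one above can be a plain function that runs in milliseconds, which keeps it viable as a pre-commit hook. A minimal sketch, assuming a hypothetical `REQUIRED_COLUMNS` schema and `validate_schema` helper not present in the original:

```python
# Minimal pre-commit schema check: verify each record has the expected
# columns with the expected types. Column names and types are assumptions.
REQUIRED_COLUMNS = {"price": float, "quantity": int, "target": int}

def validate_schema(record: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record passes."""
    errors = []
    for col, expected_type in REQUIRED_COLUMNS.items():
        if col not in record:
            errors.append(f"missing column: {col}")
        elif not isinstance(record[col], expected_type):
            errors.append(f"{col}: expected {expected_type.__name__}")
    return errors

print(validate_schema({"price": 9.99, "quantity": 3, "target": 1}))  # []
print(validate_schema({"price": "9.99", "quantity": 3}))
```

Returning a list of violations rather than a bare boolean makes the hook's failure message actionable without re-running anything.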
PR Checks
- Full test suite
- Model quality benchmarks
- Integration tests
Pre-deployment
- Shadow deployment comparison
- Canary metrics
- Rollback triggers
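A rollback trigger can be as simple as comparing canary metrics against the baseline and flagging any degradation beyond a tolerance. A sketch under the assumption that metrics are higher-is-better; the function name and the 0.02 tolerance are illustrative, not from the original:

```python
# Canary rollback trigger: roll back when any metric drops more than
# `tolerance` below the baseline (assumes higher-is-better metrics).
def should_rollback(baseline: dict, canary: dict, tolerance: float = 0.02) -> bool:
    return any(canary[metric] < baseline[metric] - tolerance for metric in baseline)

baseline = {"accuracy": 0.91, "auc": 0.88}
print(should_rollback(baseline, {"accuracy": 0.90, "auc": 0.88}))  # False
print(should_rollback(baseline, {"accuracy": 0.85, "auc": 0.88}))  # True
```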
Testing Infrastructure
Test Data Management
- Maintain representative test sets
- Version control test data
- Synthetic data for edge cases
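Synthetic edge-case data works best when it is seeded (so test runs are reproducible) and combines sampled records with hand-written boundary cases. A sketch with hypothetical column names:

```python
# Seeded synthetic test data: sampled records plus hand-written edge cases.
# Column names and ranges are illustrative assumptions.
import random

def make_synthetic_records(n: int, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    records = [
        {"age": rng.randint(18, 90), "income": rng.uniform(1e4, 2e5)}
        for _ in range(n)
    ]
    # Boundary cases that rarely appear in sampled data
    records += [{"age": 0, "income": 0.0}, {"age": 120, "income": 2e5}]
    return records

data = make_synthetic_records(100)
print(len(data))  # 102
```

Because the generator is deterministic for a given seed, the synthetic set can be regenerated on demand instead of being checked into version control wholesale.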
Test Environments
- Staging mirrors production
- Isolated for experimentation
- Reproducible state
Common Pitfalls
- Testing on training data: Always use held-out data
- Ignoring edge cases: Test failure modes explicitly
- Flaky tests: ML has inherent variance; pin random seeds and assert within tolerances
- Outdated test data: Keep test sets fresh
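The flaky-test pitfall above usually comes down to exact-match assertions on stochastic outputs. A sketch of the alternative: pin seeds and assert within a tolerance band. The `noisy_accuracy` stand-in and the 0.02 band are assumptions for illustration:

```python
# Handling ML variance in tests: fixed seeds plus tolerance-based
# assertions instead of exact equality.
import random
import statistics

def noisy_accuracy(seed: int) -> float:
    # Stand-in for a stochastic evaluation; the seed pins each run
    rng = random.Random(seed)
    return 0.90 + rng.uniform(-0.01, 0.01)

def test_accuracy_within_tolerance():
    scores = [noisy_accuracy(seed) for seed in range(5)]
    mean = statistics.mean(scores)
    assert abs(mean - 0.90) < 0.02  # assert a band, not an exact value

test_accuracy_within_tolerance()
print("ok")
```

Averaging over several seeds also catches a model that is only accurate under one lucky initialization.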
Learn testing best practices in our production ML courses.