Introduction
Transformers have dominated deep learning since 2017, but they carry a fundamental cost: self-attention scales quadratically with sequence length. For a sequence of length N, attention requires O(N²) compute and memory. At 100K tokens, that is 10 billion pairwise scores per layer.
In 2023, Mamba introduced a new architecture based on selective state space models that achieves transformer-quality results with linear scaling in sequence length and up to 5x faster inference. In 2024, hybrid architectures combining Mamba with transformers (Jamba, Zamba) began appearing in production. Understanding how Mamba works has become essential knowledge for ML engineers building systems over long sequences.
The Problem: Quadratic Attention
Self-attention computes a query-key dot product between every pair of tokens:
Attention cost: O(N²·d)
N=1K tokens: 1M pairwise scores per layer
N=10K tokens: 100M pairwise scores per layer
N=100K tokens: 10B pairwise scores per layer
KV cache memory scales linearly with N (O(N·d·layers)), so serving a 128K-context model can require tens to hundreds of gigabytes of KV cache per request, depending on model size and attention variant. This is why long-context inference is expensive.
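To make these scaling laws concrete, here is a minimal back-of-envelope calculator. The model dimensions (d_model=4096, 32 layers, fp16) are illustrative assumptions, not the specs of any particular model:

```python
def attention_flops_per_layer(n_tokens, d_model):
    """Pairwise-score and value-mixing cost: O(N^2 * d)."""
    return 2 * n_tokens ** 2 * d_model  # QK^T scores plus attention-weighted V

def kv_cache_bytes(n_tokens, d_model, n_layers, bytes_per_elem=2):
    """KV cache: two tensors (K and V) per layer, each N x d, fp16 by default."""
    return 2 * n_tokens * d_model * n_layers * bytes_per_elem

# Illustrative numbers only (d_model and layer count are assumptions)
for n in (1_000, 10_000, 100_000):
    gb = kv_cache_bytes(n, 4096, 32) / 1e9
    print(f"N={n:>7}: {attention_flops_per_layer(n, 4096):.2e} FLOPs/layer, "
          f"{gb:.1f} GB KV cache")
```

Note how compute grows 100x per 10x of context while cache memory grows only 10x, yet both become serving bottlenecks at long context.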
State Space Models: The Mathematical Foundation
State space models (SSMs) originate from control theory. The core idea: represent a sequence as a continuous dynamical system.
Continuous form:
h'(t) = A·h(t) + B·x(t) # hidden state update
y(t) = C·h(t) # output
Discrete form (for sequences):
h_t = Ā·h_{t-1} + B̄·x_t
y_t = C·h_t
where Ā, B̄ are discretized versions of A, B using step size Δ
The hidden state h acts like compressed memory of the entire past sequence. This is the key difference from attention: instead of storing all past keys and values, SSMs compress history into a fixed-size state vector.
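The fixed-size-state property is easiest to see by running the discrete recurrence directly. A toy scalar version (the values of Ā, B̄, C here are arbitrary illustrative constants):

```python
def ssm_scan(xs, a_bar=0.9, b_bar=0.1, c=1.0):
    """Run the discrete SSM: h_t = Ā·h_{t-1} + B̄·x_t, y_t = C·h_t.
    The state h is a single number here; memory stays O(1) no matter
    how long the input sequence xs is."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x   # the entire past is compressed into h
        ys.append(c * h)
    return ys

# Impulse response: a single input echoes through the state, decaying geometrically
print(ssm_scan([1.0, 0.0, 0.0, 0.0]))
```

Contrast with attention, which would keep all four positions' keys and values around to answer any future query.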
S4: Structured State Spaces
The original S4 model (Gu et al., 2021) made SSMs trainable at scale by parameterizing A in diagonal-plus-low-rank form. Because A, B, C are fixed across time steps, the recurrence unrolls into a convolution, and the structured form of A makes the convolution kernel cheap to compute. This gives two equivalent computation modes:
During training: use convolution formulation (parallel, fast)
y = SSM_conv(u) ← all positions computed simultaneously
During inference: use recurrence formulation (O(1) per step)
h_t = Ā·h_{t-1} + B̄·x_t
y_t = C·h_t
Training is parallel like transformers; inference is sequential like RNNs but with a fixed memory footprint regardless of context length.
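The equivalence of the two modes can be verified in a few lines. For a time-invariant scalar SSM, unrolling the recurrence gives a causal convolution with kernel k_j = C·Ā^j·B̄ (scalar parameters here are an illustrative simplification of the real matrix-valued model):

```python
def ssm_recurrent(xs, a, b, c):
    # Sequential form: O(1) state per step (inference mode)
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

def ssm_convolution(xs, a, b, c):
    # Parallel form: the unrolled recurrence is a causal convolution with
    # kernel k_j = C * Ā^j * B̄. This only works because a, b, c are the
    # same at every step, which is exactly what Mamba's selectivity breaks.
    n = len(xs)
    kernel = [c * (a ** j) * b for j in range(n)]
    return [sum(kernel[j] * xs[t - j] for j in range(t + 1)) for t in range(n)]

xs = [0.5, -1.0, 2.0, 0.25]
rec = ssm_recurrent(xs, 0.8, 0.3, 1.5)
conv = ssm_convolution(xs, 0.8, 0.3, 1.5)
assert all(abs(p - q) < 1e-12 for p, q in zip(rec, conv))
```

During training the convolution form computes all positions at once; at inference the recurrent form steps forward with constant memory.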
Mamba: Selective State Spaces
The fundamental limitation of S4 is that A, B, C are input-independent — the same state transition applies to every token regardless of content. This makes it hard to focus on specific information (like attention does).
Mamba's key innovation: make the SSM parameters input-dependent.
# Standard SSM: fixed parameters, shared across all tokens
A, B, C = fixed_matrices
h_t = A @ h_{t-1} + B @ x_t

# Mamba: selective SSM, parameters depend on the input
def selective_ssm(x):              # one recurrence step for token x
    B = linear_B(x)                # B changes per token
    C = linear_C(x)                # C changes per token
    Δ = softplus(linear_Δ(x))      # step size changes per token
    # Discretize A with the token-dependent Δ (zero-order hold)
    Ā = exp(Δ * A)
    B̄ = (Ā - I) @ A⁻¹ @ B
    h_t = Ā @ h_{t-1} + B̄ @ x_t
    return C @ h_t
By making B, C, and Δ input-dependent, Mamba can selectively remember important tokens (set Δ large → strong state update) and forget irrelevant ones (set Δ small → state barely updates). This is analogous to attention's ability to selectively focus.
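The remember/forget behavior of Δ can be demonstrated with a scalar state. This is a toy illustration (a = -1 and the state/input values are arbitrary), using the same zero-order-hold discretization as above:

```python
import math

def selective_step(h_prev, x, delta, a=-1.0, b=1.0):
    """One selective-SSM step with scalar state (a < 0, so the state decays)."""
    a_bar = math.exp(delta * a)       # Ā = exp(Δ·A)
    b_bar = (a_bar - 1.0) / a * b     # B̄ = (Ā - I)·A⁻¹·B
    return a_bar * h_prev + b_bar * x

h = 5.0  # accumulated state from earlier tokens
# Small Δ: Ā ≈ 1 and B̄ ≈ 0, so the state is preserved and the token ignored
kept = selective_step(h, x=100.0, delta=0.001)
# Large Δ: Ā ≈ 0 and B̄ ≈ 1, so the state resets and absorbs the token
absorbed = selective_step(h, x=100.0, delta=10.0)
print(kept, absorbed)
```

With a small Δ the output stays near the old state (about 5); with a large Δ it jumps to the new input (about 100). Since Δ is computed from each token, the model learns per-token which behavior to apply.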
The Mamba Block Architecture
A Mamba layer replaces the self-attention + FFN block in a transformer:
Mamba Block:
x_input
↓
[Linear projection] → two branches
Branch 1: Linear → Conv1D → SiLU → Selective SSM → output
Branch 2: Linear → SiLU (gating)
↓
[Element-wise multiply] → [Linear projection] → output
The 1D convolution before the SSM provides local context (similar to a receptive field), while the SSM captures long-range dependencies.
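A scalar-valued sketch of the two branches, to show the data flow only: it omits the input/output linear projections and the real channel dimensions, and the conv kernel and SSM constants are arbitrary illustrative choices, not Mamba's trained parameters:

```python
import math

def silu(xs):
    # SiLU activation: x * sigmoid(x)
    return [x / (1.0 + math.exp(-x)) for x in xs]

def causal_conv1d(xs, kernel):
    # Short causal convolution: provides local context before the SSM
    return [sum(w * xs[t - j] for j, w in enumerate(kernel) if t - j >= 0)
            for t in range(len(xs))]

def selective_ssm(xs, a=-1.0, b=1.0, c=1.0):
    # Toy selective SSM: the step size is derived from the input via softplus
    h, ys = 0.0, []
    for x in xs:
        delta = math.log1p(math.exp(x))      # softplus, input-dependent Δ
        a_bar = math.exp(delta * a)
        b_bar = (a_bar - 1.0) / a * b
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys

def mamba_block(xs):
    # Branch 1: conv -> SiLU -> selective SSM; Branch 2: SiLU gate
    branch1 = selective_ssm(silu(causal_conv1d(xs, [0.5, 0.5])))
    gate = silu(xs)
    # Element-wise multiply merges the two branches
    return [u * g for u, g in zip(branch1, gate)]

print(mamba_block([0.1, -0.2, 0.3]))
```

The gating branch plays a role similar to the gated MLP in a transformer block, which is why a single Mamba block can stand in for attention plus FFN.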
Hardware-Aware Implementation
Naive selective SSM implementation is slow because making parameters input-dependent breaks the efficient convolution trick. Mamba introduces a hardware-aware parallel scan algorithm:
Key insight: the recurrence h_t = Ā_t · h_{t-1} + B̄_t · x_t
is an associative (prefix-scan) operation, so it can be computed with
O(N) total work in O(log N) parallel steps on GPU, parallelizing across
the sequence dimension.
Memory trick: don't materialize intermediate states in HBM (GPU main memory);
compute the scan in SRAM (fast on-chip memory), as FlashAttention does.
This makes Mamba layers roughly 3-5x faster than equivalent attention layers on long sequences, closing the hardware efficiency gap.
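The associativity that makes the scan parallelizable can be checked directly. Each step is the affine map h → Ā_t·h + B̄_t·x_t, and composing two such maps gives another one; a sequential sketch of the operator (a real GPU kernel applies it in a log-depth tree):

```python
def combine(e1, e2):
    # Associative operator on (Ā, B̄·x) pairs:
    # applying e1 then e2 equals applying their composition.
    a1, b1 = e1
    a2, b2 = e2
    return (a2 * a1, a2 * b1 + b2)

def scan_states(elems):
    # Inclusive scan under `combine`; because `combine` is associative,
    # a GPU can evaluate this in O(log N) depth instead of N steps.
    out, acc = [], (1.0, 0.0)  # identity element: h -> 1·h + 0
    for e in elems:
        acc = combine(acc, e)
        out.append(acc[1])     # h_t is the additive component
    return out

# Check against the plain recurrence h_t = Ā_t·h_{t-1} + B̄_t·x_t
abar = [0.9, 0.5, 0.8]
bx   = [1.0, 2.0, -1.0]      # each entry is B̄_t · x_t
h, ref = 0.0, []
for a, b in zip(abar, bx):
    h = a * h + b
    ref.append(h)
assert scan_states(list(zip(abar, bx))) == ref
```

This is the same trick that makes cumulative sums parallelizable; the selective SSM just scans a richer associative operator.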
Mamba-2: Structured State Space Duality
Mamba-2 (2024) revealed a surprising mathematical connection: selective SSMs and attention are instances of the same framework (Structured State Space Duality, SSD).
This unification means:
- Mamba-2 can use tensor parallelism strategies developed for transformers
- The same theoretical understanding applies to both
- Hybrid models can mix Mamba-2 and attention layers coherently
Hybrid Architectures: The Current State of the Art
Pure Mamba models slightly underperform transformers on tasks requiring precise in-context recall (e.g., "what was the user's name mentioned 50K tokens ago?"). This is because the compressed state can lose information.
Hybrid models address this by mixing Mamba and attention layers:
- Jamba (AI21 Labs): 1 attention layer per 7 Mamba layers. Achieves transformer-quality results with 3x higher throughput at long context.
- Zamba2: Shared attention layers + Mamba layers, reaching top-tier benchmark performance.
- Falcon Mamba: Fully SSM-based 7B model competitive with same-size transformers.
The emerging consensus: a small number of attention layers provides the precise recall SSMs lack, while Mamba layers provide efficient long-range context at a fraction of the cost.
Hybrid architecture (Jamba-style):
Layer 1: Mamba
Layer 2: Mamba
Layer 3: Mamba
Layer 4: Mamba
Layer 5: Mamba
Layer 6: Mamba
Layer 7: Self-Attention ← 1 attention per 7 Mamba
Layer 8: Mamba
...
Mamba vs. Transformers: When to Choose Each
| Factor | Mamba/SSM | Transformer |
|---|---|---|
| Sequence length | Excels (linear scaling) | Degrades (quadratic) |
| Recall of specific tokens | Weaker | Strong |
| Inference memory per token | O(1) (fixed state) | O(N) (KV cache grows) |
| Long-generation throughput | 2-5x faster | Baseline |
| Ecosystem maturity | Emerging | Mature |
| Pretraining at scale | Competitive | Best-tested |
Use Mamba when: sequences are very long (>10K tokens), memory efficiency matters, or throughput at long context is the bottleneck.
Use transformers when: precise retrieval of specific tokens from context is required, or you need the widest ecosystem and tooling support.
Engineering Implications
For teams building long-context applications today:
- Watch hybrid models: Jamba and Zamba2 are production-ready and significantly cheaper to serve at 128K+ context than pure transformer models.
- KV cache alternatives: Mamba's O(1) inference state eliminates the KV cache memory problem entirely — for sequences where this is the bottleneck, Mamba-based serving is transformative.
- Fine-tuning SSMs: PEFT methods (LoRA) work on Mamba layers. The fine-tuning ecosystem is catching up rapidly.
Conclusion
Mamba and structured SSMs represent the most credible architectural challenge to transformers since transformers replaced RNNs. The selective mechanism makes them expressive; the hardware-aware scan makes them fast. The question isn't whether SSMs will matter — they already do in production hybrid models — but how the transformer-SSM balance will evolve as we push context windows to millions of tokens.
Want to understand the inference challenges that motivate Mamba? Read our guide on KV Cache Optimization for LLM Serving.