Inference: Memory Optimizations
Techniques for reducing memory consumption during inference:
KV Cache Optimizations
- Predictive Caching
- Parallel KV cache generation
- Cross-layer KV sharing
- RadixAttention (KV cache reuse across shared prefixes)
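As a rough illustration of the prefix-reuse idea behind RadixAttention, the sketch below keeps completed requests' per-token KV entries in a shared store keyed by token prefix, so a new request that shares a prefix skips recomputing those entries. This is a simplified assumption-laden toy (a dict rather than a radix tree, strings standing in for KV tensors; `PrefixKVCache` and its methods are hypothetical names, not any library's API):

```python
# Toy sketch of prefix-based KV cache reuse. Real systems store KV
# tensors in a radix tree; here a dict of token-id prefixes suffices
# to show the lookup pattern.
from typing import Dict, List, Tuple

class PrefixKVCache:
    def __init__(self) -> None:
        # Maps a token-id prefix to its (mock) per-token KV entries.
        self._store: Dict[Tuple[int, ...], List[str]] = {}

    def insert(self, tokens: List[int], kv: List[str]) -> None:
        # Record KV entries for every prefix of the sequence so later
        # requests can match their longest shared prefix.
        for i in range(1, len(tokens) + 1):
            self._store[tuple(tokens[:i])] = kv[:i]

    def longest_match(self, tokens: List[int]) -> Tuple[int, List[str]]:
        # Return how many leading tokens already have cached KV entries,
        # plus those entries.
        for i in range(len(tokens), 0, -1):
            hit = self._store.get(tuple(tokens[:i]))
            if hit is not None:
                return i, hit
        return 0, []

cache = PrefixKVCache()
prompt_a = [1, 2, 3, 4]              # e.g. a shared system prompt
cache.insert(prompt_a, [f"kv{t}" for t in prompt_a])

prompt_b = [1, 2, 3, 9]              # shares the first three tokens
reused, kv = cache.longest_match(prompt_b)
print(reused)  # → 3: only the last token's KV must be computed
```

The win is that prefill compute and KV memory for the shared prefix are paid once across requests, which matters most for long shared system prompts.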
Model Compression
- AWQ (activation-aware weight quantization)
- SqueezeLLM
- 8-bit optimizers via block-wise quantization
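The block-wise quantization underlying the techniques above can be sketched numerically: split a tensor into fixed-size blocks, store one absmax scale per block, and round each block to signed 8-bit integers. This is a minimal NumPy sketch of the general scheme, not any particular library's implementation; the function names and block size are illustrative:

```python
import numpy as np

def quantize_blockwise(x: np.ndarray, block: int = 64):
    # Flatten, pad to a multiple of the block size, then compute one
    # absmax scale per block and map values to int8 in [-127, 127].
    flat = x.ravel()
    pad = (-flat.size) % block
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0          # avoid divide-by-zero on empty blocks
    q = np.round(blocks / scales * 127).astype(np.int8)
    return q, scales, x.shape, pad

def dequantize_blockwise(q, scales, shape, pad):
    # Invert the mapping: rescale each block and restore the shape.
    flat = ((q.astype(np.float32) / 127) * scales).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 32)).astype(np.float32)
q, s, shape, pad = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s, shape, pad)
print(q.dtype, float(np.abs(w - w_hat).max()))
```

Per-block scales are the key design choice: a single outlier only inflates the quantization step inside its own block instead of degrading the whole tensor, which is why block-wise schemes tolerate 8-bit storage well.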