
Large language model inference is fundamentally bottlenecked by serial token generation—each token depends on all previous tokens, forcing models to run dozens or hundreds of forward passes to produce meaningful output. Two techniques have emerged as game-changers for production deployments: speculative decoding, which parallelizes token generation using draft models, and efficient memory management via PagedAttention and continuous batching, which maximizes GPU utilization by eliminating KV cache waste.

These approaches have demonstrated speedups ranging from 2x to 24x in real-world benchmarks while maintaining identical output quality, making them essential tools when every millisecond matters in production environments.

What is Speculative Decoding?

Speculative decoding is an algorithmic technique that accelerates autoregressive transformer inference by computing multiple tokens in parallel without changing the output distribution. First introduced by Google Research in November 2022, the method leverages two key observations about LLM inference: not all tokens are equally difficult to generate, and modern AI hardware has substantial spare computational capacity during memory-bound operations.1

How Does Speculative Decoding Work?

The core insight behind speculative decoding is that hard language modeling tasks often contain easier subtasks that can be approximated by smaller, more efficient models. The algorithm works as follows (a simplified code sketch appears after the list):

  1. Draft Generation: A smaller “draft” model (or efficient approximation) rapidly generates K candidate tokens speculatively
  2. Parallel Verification: The larger “target” model evaluates all K draft tokens in a single forward pass
  3. Acceptance Sampling: Using a novel modified rejection sampling scheme, the target model accepts or rejects tokens while preserving its exact output distribution
  4. Rollback: Rejected tokens are discarded, and the process continues from the last accepted token
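
To make the loop concrete, here is a minimal, self-contained sketch under illustrative assumptions: draft_probs and target_probs are hypothetical stand-ins (toy distributions over a 16-token vocabulary) for the two models' forward passes, and a production engine would run step 2 as a single batched target-model call rather than a Python loop.

```python
import numpy as np

VOCAB = 16  # tiny toy vocabulary so the sketch runs end to end

def _toy_probs(prefix, temperature):
    # Deterministic toy distribution; a real system would run a model forward pass.
    logits = np.cos(np.arange(VOCAB) * (1.0 + sum(prefix) % 7)) / temperature
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_probs(prefix):   # hypothetical small, fast "draft" model
    return _toy_probs(prefix, temperature=1.5)

def target_probs(prefix):  # hypothetical large, accurate "target" model
    return _toy_probs(prefix, temperature=1.0)

def speculative_step(prefix, K, rng):
    """One draft-then-verify round; returns the newly accepted tokens."""
    # 1. Draft: sample K candidate tokens autoregressively from the draft model.
    drafts, q, ctx = [], [], list(prefix)
    for _ in range(K):
        qi = draft_probs(ctx)
        t = int(rng.choice(VOCAB, p=qi))
        drafts.append(t); q.append(qi); ctx.append(t)

    # 2. Verify: the target model scores all K+1 positions; a real engine
    #    does this in one batched forward pass, not a loop.
    p = [target_probs(list(prefix) + drafts[:i]) for i in range(K + 1)]

    # 3. Modified rejection sampling: accept draft token i with probability
    #    min(1, p_i(t) / q_i(t)); on rejection, resample from the residual
    #    distribution max(0, p - q) and stop. This preserves the target's
    #    exact output distribution.
    out = []
    for i, t in enumerate(drafts):
        if rng.random() < min(1.0, p[i][t] / q[i][t]):
            out.append(t)
        else:
            residual = np.maximum(p[i] - q[i], 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out

    # 4. Every draft accepted: take one extra token from the target for free.
    out.append(int(rng.choice(VOCAB, p=p[K])))
    return out

tokens = [1, 2, 3]
tokens += speculative_step(tokens, K=4, rng=np.random.default_rng(0))
```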

This approach guarantees identical outputs to standard decoding while potentially generating several tokens per model call. As the original Google Research paper demonstrated, speculative decoding achieved 2x-3x acceleration on T5-XXL without any model modifications or retraining.1

DeepMind’s subsequent work on speculative sampling extended this to stochastic settings, achieving 2-2.5x speedups on Chinchilla 70B in distributed setups.2

💡 Key Insight: Scoring a short continuation from the draft model with the target model, in parallel, takes roughly as long as sampling a single token from the target model. This parallelism is what enables the speedup.

Why Does Speculative Decoding Matter?

Production LLM serving faces a fundamental tension: larger models produce higher quality outputs but require more computation per token. Speculative decoding breaks this tradeoff by using the large model only for verification while the small model handles the bulk of token generation.

Google Research reports remarkable speedups across products using this technique. The method has been particularly effective because modern TPUs and GPUs can perform hundreds of operations for every byte read from memory, leaving ample computational headroom for parallel draft evaluation.3
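
As a rough illustration of that headroom, the back-of-the-envelope calculation below uses approximate public specs for an NVIDIA A100 80GB SXM; the exact figures vary by SKU and clocks, so treat them as indicative only.

```python
# Approximate arithmetic-intensity estimate for an A100 80GB SXM.
peak_fp16_flops = 312e12   # ~312 TFLOP/s dense FP16 tensor-core throughput
hbm_bandwidth = 2.0e12     # ~2.0 TB/s HBM2e bandwidth, in bytes/s

print(f"~{peak_fp16_flops / hbm_bandwidth:.0f} FLOPs available per byte read from HBM")
# Decoding a single sequence needs on the order of 1 FLOP per FP16 weight byte,
# so most of that compute sits idle -- the headroom speculative decoding exploits.
```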

What is PagedAttention and Efficient Memory Management?

While speculative decoding addresses computational efficiency, PagedAttention tackles the memory bottleneck that limits batch sizes in LLM serving. Introduced by researchers at UC Berkeley in June 2023, PagedAttention is inspired by virtual memory and paging techniques from operating systems.4

The KV Cache Problem

During autoregressive generation, transformers cache key and value tensors (the KV cache) from previous tokens to avoid recomputing attention. This cache:

  • Consumes significant memory: up to 1.7GB for a single sequence in LLaMA-13B
  • Grows dynamically with sequence length
  • Suffers from fragmentation and over-reservation in traditional systems

Existing systems waste 60-80% of the memory reserved for the KV cache due to these inefficiencies, severely limiting batch sizes and throughput.4
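
As a sanity check on the 1.7GB figure above, the KV cache stores two tensors (keys and values) per layer per token; a rough estimate for a LLaMA-13B-like configuration (40 layers, hidden size 5120, FP16 — approximate values, the exact shape depends on the checkpoint) works out as follows:

```python
# Back-of-the-envelope KV cache size for a LLaMA-13B-like model in FP16.
n_layers, hidden, dtype_bytes = 40, 5120, 2      # hidden = 40 heads x 128 head_dim

bytes_per_token = 2 * n_layers * hidden * dtype_bytes      # keys + values
print(f"{bytes_per_token / 1e6:.2f} MB per token")         # ~0.82 MB

seq_len = 2048
print(f"{bytes_per_token * seq_len / 1e9:.2f} GB per sequence")  # ~1.68 GB
```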

How Does PagedAttention Work?

PagedAttention partitions the KV cache into fixed-size blocks that do not need to be contiguous in memory:

```
Traditional Attention:            PagedAttention:

  [AAAAAA]                        [A1][A2][A3]
  [BBBBBB]                        [B1][B2]    [B3]
  [CCCCCC]                        [C1][C2][C3]

  Contiguous memory required      Non-contiguous blocks
  Fragmentation: HIGH             Fragmentation: <4%
```

Key innovations include (a toy bookkeeping sketch follows the list):

  • Block-based storage: KV tensors are stored in fixed-size blocks managed via a block table
  • On-demand allocation: Physical blocks are allocated only as new tokens are generated
  • Memory sharing: Parallel sampling and beam search can share prompt blocks via reference counting and copy-on-write mechanisms
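
The sketch below is a toy version of this bookkeeping (not vLLM's actual implementation): each sequence holds a block table of logical-to-physical block indices, physical blocks are handed out on demand, and reference counts enable prompt sharing with copy-on-write.

```python
# Toy PagedAttention-style block manager; real engines store the KV tensors
# themselves in GPU memory and do this bookkeeping alongside the attention kernel.
BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.refcount = {}          # physical block id -> number of sequences using it
        self.block_tables = {}      # sequence id -> list of physical block ids

    def append_token(self, seq_id, tokens_so_far):
        """Allocate a new physical block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if tokens_so_far % BLOCK_SIZE == 0:       # all existing blocks are full
            block = self.free.pop()               # on-demand allocation
            table.append(block)
            self.refcount[block] = 1

    def fork(self, parent_id, child_id):
        """Share the parent's blocks (e.g. parallel sampling) by bumping refcounts."""
        self.block_tables[child_id] = list(self.block_tables[parent_id])
        for block in self.block_tables[child_id]:
            self.refcount[block] += 1

    def copy_on_write(self, seq_id, logical_idx):
        """Give a sequence a private copy of a shared block before it writes to it."""
        block = self.block_tables[seq_id][logical_idx]
        if self.refcount[block] > 1:
            new_block = self.free.pop()
            # (a real system also copies the KV data from `block` to `new_block`)
            self.refcount[block] -= 1
            self.refcount[new_block] = 1
            self.block_tables[seq_id][logical_idx] = new_block
```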

The vLLM serving engine built on PagedAttention achieves up to 24x higher throughput than HuggingFace Transformers and 3.5x higher throughput than the previous state-of-the-art, Text Generation Inference.4,5

Continuous Batching: Maximizing GPU Utilization

Traditional batching waits for all requests in a batch to complete before starting new ones, leaving GPUs idle when sequences finish at different times. Continuous batching (also called in-flight batching or iteration-level scheduling) addresses this, as sketched in the scheduling loop below, by:

  • Adding new requests to the batch as soon as others complete
  • Dynamically adjusting batch composition at every iteration
  • Maximizing GPU utilization across variable-length sequences
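
A minimal sketch of the scheduling loop follows; it is illustrative only, with step_fn and is_finished standing in for a real engine's decode step and stopping check.

```python
from collections import deque

def continuous_batching_loop(waiting: deque, max_batch_size: int, step_fn, is_finished):
    """Iteration-level scheduling: the batch is rebuilt at every decode step."""
    running = []
    while running or waiting:
        # Admit pending requests the moment slots free up (subject to the batch limit;
        # real engines also check KV cache memory before admitting).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One forward pass produces one new token for every running request.
        step_fn(running)

        # Finished requests leave immediately, freeing their slots for the next iteration.
        running = [r for r in running if not is_finished(r)]
```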

Anyscale benchmarks demonstrated that continuous batching delivers up to 23x throughput improvement over naive batching when combined with PagedAttention memory optimizations.6

ℹ️ Production Impact: LMSYS has used vLLM with continuous batching to serve Vicuna, Koala, and LLaMA models to millions of users with university-sponsored GPUs, achieving up to 30x higher throughput than their initial HuggingFace backend.4

Comparing the Two Approaches

| Aspect | Speculative Decoding | PagedAttention + Continuous Batching |
| --- | --- | --- |
| Primary Bottleneck Addressed | Compute parallelism | Memory efficiency |
| Speedup Achieved | 2-3x | Up to 24x throughput |
| Output Guarantee | Identical distribution | Identical outputs |
| Hardware Requirements | Requires draft model capacity | Works with existing models |
| Best Use Case | Latency-sensitive applications | High-throughput serving |
| Integration Complexity | Moderate (requires draft model) | Low (drop-in replacement) |
| Memory Reduction | None | 60-80% waste eliminated |

These techniques are complementary rather than competing. Production systems like TensorRT-LLM and vLLM now implement both approaches, combining speculative decoding’s latency reduction with PagedAttention’s memory efficiency for compound speedups.7

Why Do These Techniques Matter for Production?

The Economics of Inference

LLM serving dominates compute costs for production AI applications. When serving at scale, small efficiency gains translate to significant cost reductions:

  • TensorRT-LLM on H100: 8x performance speedup leads to 5.3x TCO reduction and 5.6x energy savings compared to A100 baselines7
  • Memory efficiency: Reducing KV cache waste from 60-80% to under 4% directly increases batchable sequences
  • Throughput vs. latency tradeoffs: Continuous batching enables higher throughput without sacrificing per-user latency

Real-World Deployment Metrics

Production systems track four key metrics for LLM serving:8

  1. Time To First Token (TTFT): How quickly users see initial output
  2. Time Per Output Token (TPOT): Perceived “speed” of generation
  3. Overall Latency: TTFT + (TPOT × output tokens)
  4. Throughput: Total output tokens per second across all users

Speculative decoding primarily improves TPOT by generating multiple tokens per forward pass. PagedAttention and continuous batching improve throughput by increasing batch sizes while maintaining latency targets.

⚠️ Important Caveat: Output length dominates overall response latency. As Databricks notes, “you can usually just take your expected/max output token length and multiply it by an overall average time per output token for the model” to estimate latency.8
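
A quick worked example of that rule, using made-up numbers purely for illustration:

```python
# Illustrative latency estimate: overall latency ≈ TTFT + TPOT * output_tokens.
ttft_s, tpot_s, n_out = 0.20, 0.04, 250   # hypothetical example values

print(f"{ttft_s + tpot_s * n_out:.1f} s end-to-end")   # 10.2 s, dominated by output length
```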

Implementation in Modern Frameworks

Both techniques have been integrated into mainstream inference engines:

| Framework | Speculative Decoding | PagedAttention | Continuous Batching |
| --- | --- | --- | --- |
| vLLM | ✅ Supported | ✅ Native | ✅ Native |
| TensorRT-LLM | ✅ Supported | ✅ Native | ✅ In-flight |
| TGI | ✅ Supported | ✅ Integrated | ✅ Native |
| PyTorch (gpt-fast) | ✅ Native | — | — |

The PyTorch team demonstrated that native PyTorch implementations using these optimizations can achieve nearly 10x speedup over baseline implementations in under 1,000 lines of code.9

Architectural Synergies

These inference optimizations work alongside architectural innovations to compound performance gains (a KV-cache size comparison follows the list):

  • FlashAttention: IO-aware exact attention that reduces memory reads/writes between GPU HBM and SRAM, providing 15% speedup on BERT-large and 3x on GPT-210
  • Multi-Query Attention (MQA): Shares keys and values across attention heads, reducing memory bandwidth requirements during decoding11
  • Grouped-Query Attention (GQA): Generalizes MQA with an intermediate number of key-value heads, achieving quality close to multi-head attention with speed comparable to MQA12
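
To make the bandwidth argument concrete, the snippet below compares per-token KV cache footprints for multi-head, grouped-query, and multi-query attention using LLaMA-2-70B-like layer dimensions (rounded, illustrative numbers):

```python
# Per-token KV cache size for MHA vs. GQA vs. MQA (FP16, illustrative dimensions).
n_layers, head_dim, dtype_bytes = 80, 128, 2

for name, n_kv_heads in [("MHA", 64), ("GQA (8 KV heads)", 8), ("MQA", 1)]:
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes   # keys + values
    print(f"{name:>17}: {kv_bytes / 1e6:.2f} MB of KV cache per token")
# MHA ~2.62 MB, GQA ~0.33 MB, MQA ~0.04 MB -> fewer KV heads, less bandwidth per step.
```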

Frequently Asked Questions

Q: Does speculative decoding change the model’s output quality? A: No. Speculative decoding uses a modified rejection sampling algorithm that guarantees identical output distribution to standard decoding. The technique only affects speed, not quality.1

Q: What hardware is required for these optimizations? A: Both techniques work on standard NVIDIA GPUs (A100, H100) and AMD GPUs. Speculative decoding requires sufficient GPU memory to run both draft and target models simultaneously. PagedAttention and continuous batching work with existing model serving infrastructure.5,7

Q: How much faster can these techniques make my LLM inference? A: Results vary by workload, but published benchmarks show 2-3x speedup from speculative decoding alone, 2-24x throughput improvement from PagedAttention with continuous batching, and up to 8x from combining both approaches.1,4,6

Q: Are these techniques compatible with quantized models? A: Yes. Both speculative decoding and PagedAttention work with quantized models (INT8, INT4, FP8, GPTQ, AWQ). Quantization further improves performance by reducing memory bandwidth requirements.5,7

Q: Can I use these techniques with any transformer model? A: Yes. These are model-agnostic optimizations. vLLM supports most popular open-source models including LLaMA, Falcon, BLOOM, GPT-NeoX, and Mistral. TensorRT-LLM supports 20+ architectures including LLaMA 2, GPT-3, and MPT.5,7


The combination of speculative decoding and efficient memory management represents a paradigm shift in LLM inference optimization. As these techniques mature and proliferate through frameworks like vLLM and TensorRT-LLM, production deployments can achieve the sub-100ms response times users expect while controlling infrastructure costs—a critical capability as AI applications scale to billions of users.

Footnotes

  1. Leviathan, Y., Kalman, M., & Matias, Y. (2022). Fast Inference from Transformers via Speculative Decoding. arXiv:2211.17192 [cs.LG]. ICML 2023 Oral.

  2. Chen, C., Borgeaud, S., Irving, G., Lespiau, J. B., Sifre, L., & Jumper, J. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318 [cs.CL].

  3. Google Research. (2024). Looking back at speculative decoding. Google Research Blog.

  4. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. vLLM Blog.

  5. vLLM GitHub Repository. vLLM Project.

  6. Anyscale. (2023). Achieve 23x LLM Inference Throughput & Reduce p50 Latency. Anyscale Blog.

  7. NVIDIA. (2023). NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs. NVIDIA Developer Blog.

  8. Databricks/MosaicML. LLM Inference Performance Engineering: Best Practices. Databricks Blog.

  9. PyTorch. (2023). Accelerating Generative AI with PyTorch II: GPT, Fast. PyTorch Blog.

  10. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG].

  11. Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150 [cs.NE].

  12. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., & Sanghai, S. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245 [cs.CL]. EMNLP 2023.
