LLM Infrastructure

Speculative Decoding

Speculative decoding is an inference acceleration technique that uses a smaller draft model to propose multiple tokens in parallel, then verifies them against the larger target model in a single forward pass.

What is Speculative Decoding?

Rather than generating one token per forward pass of the large model, speculative decoding lets a small draft model propose several tokens cheaply and has the target model verify them all at once. This typically yields 2-3x speedups without changing the output distribution of the target model.

The core insight is twofold: small models often predict the same token the large model would for easy positions (common words, predictable syntax), and verifying a proposed token is much cheaper than generating one. By batching multiple token verifications into one forward pass of the large model, speculative decoding amortizes the cost of the expensive model across several tokens at once.
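To see why verification batches so well, consider the sketch below, assuming a Hugging Face-style causal LM whose forward pass returns per-position logits (target_logits_for_draft is an illustrative name, not a library function):

    import torch

    def target_logits_for_draft(target_model, prefix_ids, draft_ids):
        # One forward pass over prefix + draft scores every draft position
        # (plus one extra), instead of k sequential calls to the target.
        input_ids = torch.cat([prefix_ids, draft_ids]).unsqueeze(0)
        with torch.no_grad():
            logits = target_model(input_ids).logits[0]
        # Logits at position i predict token i + 1, so the last k + 1 rows
        # are the target's distributions for the k draft tokens plus one
        # "bonus" token after them.
        k = draft_ids.shape[-1]
        return logits[-(k + 1):]

A single call like this replaces k sequential target-model steps, which is where the speedup comes from.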

In practice, a draft model with 1-2 billion parameters proposes 4-8 candidate tokens autoregressively. The target model then scores all candidates in a single forward pass, accepting each draft token with probability min(1, p/q), where p and q are the target's and draft's probabilities for that token; on the first rejection, a corrected token is sampled from the target and generation continues from that position. The acceptance rate depends on how well the draft model approximates the target, typically 70-90% for well-matched pairs.
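The acceptance rule sketched below follows the standard rejection-sampling scheme from the speculative decoding literature; speculative_step is an illustrative name, and p_target (shape (k+1, vocab)) and p_draft (shape (k, vocab)) are assumed to be precomputed softmax probabilities:

    import torch

    def speculative_step(p_target, p_draft, draft_tokens):
        accepted = []
        for i, tok in enumerate(draft_tokens):
            # Accept the draft token with probability min(1, p_target / p_draft).
            if torch.rand(()) < p_target[i, tok] / p_draft[i, tok]:
                accepted.append(int(tok))
                continue
            # On rejection, resample from the residual distribution
            # max(0, p_target - p_draft), renormalized. This correction is
            # what keeps the overall output distribution identical to the
            # target model's.
            residual = torch.clamp(p_target[i] - p_draft[i], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return accepted
        # All k drafts accepted: take a free "bonus" token from the target's
        # distribution at the position after the last draft.
        accepted.append(int(torch.multinomial(p_target[-1], 1)))
        return accepted

Because accepted prefixes are exact samples from the target and rejections are corrected by residual resampling, the procedure is lossless in distribution.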

Why does Speculative Decoding matter?

Speculative decoding cuts large language model inference latency by 2-3x without any quality degradation. Unlike quantization or pruning, its output distribution is mathematically identical to that of standard autoregressive decoding, making it safe for production deployments where output quality cannot be compromised.

How is Speculative Decoding used in practice?

vLLM and TensorRT-LLM both support speculative decoding in production serving configurations. A typical setup pairs a 70B-parameter target model with a 7B draft model, reducing inter-token latency and end-to-end generation time for latency-sensitive applications like conversational AI and code completion.
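A minimal offline sketch of such a pairing in vLLM might look like the following; the speculative decoding interface has changed across vLLM releases (older versions used separate speculative_model / num_speculative_tokens keyword arguments), and the model names are illustrative, so check the docs for the version in use:

    from vllm import LLM, SamplingParams

    # Illustrative 70B target / 8B draft pairing; adjust the configuration
    # to match your vLLM version's speculative decoding interface.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        speculative_config={
            "model": "meta-llama/Llama-3.1-8B-Instruct",  # draft model
            "num_speculative_tokens": 5,                  # tokens proposed per step
        },
    )
    outputs = llm.generate(
        ["Explain speculative decoding in one paragraph."],
        SamplingParams(temperature=0.0, max_tokens=128),
    )
    print(outputs[0].outputs[0].text)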

About the Author

Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.