Speculative Decoding
Speculative decoding is an LLM inference acceleration technique, introduced by Google researchers in 2022 (Leviathan et al.) and refined by DeepMind in 2023 (Chen et al.), that uses a small, fast "draft model" to propose multiple candidate tokens which are then verified in parallel by the large "target model". The draft model speculatively generates a sequence of K tokens (typically 4-8), and the target model scores them all in a single forward pass, accepting the longest prefix of drafted tokens that passes a probabilistic acceptance test (under greedy decoding, simply the longest prefix matching the target model's own choices). Successful speculation yields multiple tokens per target-model step, accelerating inference by 2x-3x with no quality loss, because the acceptance rule guarantees the output distribution exactly matches the target model's. The technique requires the draft model to share the target model's tokenizer and vocabulary, which in practice usually means a smaller model from the same family: Llama 3.1 8B as draft for Llama 3.1 70B, for example.
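The draft-verify-accept loop can be made concrete with a minimal sketch. The code below is illustrative and not taken from any paper or library: draft_dist, target_dist, and speculative_step are hypothetical names, and toy distributions stand in for real model forward passes. The acceptance rule is the standard rejection-sampling test (accept a drafted token x with probability min(1, q(x)/p(x)), where p is the draft distribution and q the target's, resampling from the residual on rejection), which is what preserves the target distribution exactly.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def draft_dist(prefix):
    # Toy stand-in for the small draft model: a uniform distribution.
    # In practice this is one cheap forward pass of the draft LLM.
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def target_dist(prefix):
    # Toy stand-in for the large target model: slightly prefers "cat".
    d = {t: 1.0 / len(VOCAB) for t in VOCAB}
    d["cat"] += 0.1
    total = sum(d.values())
    return {t: p / total for t, p in d.items()}

def sample(dist):
    return random.choices(list(dist), weights=dist.values())[0]

def speculative_step(prefix, k=4):
    """One target-model step: draft k tokens, verify, return accepted tokens."""
    # 1. Draft model proposes k tokens autoregressively (k cheap passes).
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = sample(draft_dist(ctx))
        drafted.append(tok)
        ctx.append(tok)
    # 2. Target model scores all k positions in ONE forward pass.
    #    (The toy recomputes distributions position by position instead.)
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        p, q = draft_dist(ctx), target_dist(ctx)
        # 3. Accept the drafted token with probability min(1, q/p); this
        #    rejection test makes the output exactly target-distributed.
        if random.random() < min(1.0, q[tok] / p[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On rejection, resample from the residual max(0, q - p)
            # and stop: later drafted tokens are conditioned on a
            # prefix that no longer holds.
            residual = {t: max(0.0, q[t] - p[t]) for t in VOCAB}
            z = sum(residual.values())
            accepted.append(sample({t: r / z for t, r in residual.items()}))
            return accepted
    # 4. All k accepted: the same verifying pass also yields a target
    #    distribution for position k+1, giving one bonus token.
    accepted.append(sample(target_dist(ctx)))
    return accepted

print(speculative_step(["the"]))
```

Note the bonus token in step 4: when all K drafts are accepted, the same target forward pass that verified them also produces a distribution for position K+1, so a fully successful step emits K+1 tokens from one expensive pass.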
vLLM, TensorRT-LLM, and DeepMind's reference implementation all support speculative decoding. Variants include Medusa (multi-head speculation), EAGLE, and Lookahead Decoding. AI governance teams encounter speculative decoding in inference infrastructure configuration; it does not affect output quality, since accepted tokens follow exactly the same distribution the target model would have produced on its own.
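In serving stacks, enabling the technique is a configuration choice rather than custom code. The sketch below assumes vLLM's offline LLM API; the argument names (speculative_model, num_speculative_tokens) match older vLLM releases, newer versions moved them into a speculative_config dict, and the model names are illustrative, so treat the exact spelling as version-dependent and check your release's documentation.

```python
from vllm import LLM, SamplingParams

# Sketch: target model served with a same-family draft model.
# Argument names vary across vLLM versions; consult your release's docs.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",             # target model
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",  # draft model
    num_speculative_tokens=5,                              # K tokens drafted per step
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```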
Speculative-decoding endpoints with Centralpoint: Centralpoint routes to inference endpoints using speculative decoding for faster response times, while consistently metering tokens at the target-model rate. The model-agnostic platform supports any backend — vLLM, TensorRT-LLM, hosted APIs — and deploys chatbots through one line of JavaScript on any portal.
Related Keywords:
Speculative Decoding