Draft Model
A draft model is the small, fast model used in speculative decoding to propose candidate tokens that the larger target model then verifies in parallel. The draft model must share the target's tokenizer and come from the same model family (with a similar architecture) so that its proposed tokens can be scored directly by the target. Typical draft-target pairings include Llama 3.1 8B with Llama 3.1 70B, and Llama 3.2 1B with Llama 3.1 8B.

The acceleration factor depends on how often draft proposals are accepted: well-aligned draft-target pairs achieve 60%-80% acceptance rates, producing 2x-3x speedups. Self-speculation techniques such as Medusa eliminate the separate draft model entirely by adding extra prediction heads to the target itself. Some implementations support greater speculation depth (proposing 8+ tokens at a time) for further acceleration when acceptance rates are high. AI governance teams document the draft model used in speculative decoding setups for compliance traceability, though the technique produces output identical in distribution to what the target model would generate alone.
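The propose-then-verify loop can be sketched with toy next-token distributions standing in for the two models. This is an illustrative sketch, not a real inference stack: `draft_probs`, `target_probs`, and the tiny vocabulary are all invented for the example, and the accept/reject rule is the standard modified rejection sampling that preserves the target's output distribution.

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_probs(context):
    # Toy stand-in for the small draft model's next-token distribution.
    h = hash(tuple(context)) % len(VOCAB)
    p = [1.0] * len(VOCAB)
    p[h] += 4.0
    s = sum(p)
    return [x / s for x in p]

def target_probs(context):
    # Toy stand-in for the large target model; similar but not identical,
    # so some draft proposals get rejected.
    h = (hash(tuple(context)) + 1) % len(VOCAB)
    p = [1.0] * len(VOCAB)
    p[h] += 3.0
    s = sum(p)
    return [x / s for x in p]

def speculative_step(context, k=4):
    """Propose k draft tokens, then verify them against the target."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = random.choices(range(len(VOCAB)), weights=q)[0]
        proposed.append((tok, q))
        ctx.append(tok)

    # 2. The target scores all k positions (a single parallel forward
    #    pass in a real system) and accepts each token with probability
    #    min(1, p_target / p_draft).
    accepted, ctx = [], list(context)
    for tok, q in proposed:
        p = target_probs(ctx)
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)  # target agrees: keep the draft token
            ctx.append(tok)
        else:
            # Rejected: resample from the residual max(p - q, 0)
            # distribution, which keeps the overall output distributed
            # exactly as the target alone would produce.
            residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
            s = sum(residual)
            weights = [r / s for r in residual] if s > 0 else p
            accepted.append(random.choices(range(len(VOCAB)), weights=weights)[0])
            break  # remaining draft tokens are discarded
    return accepted

tokens = speculative_step([0], k=4)
```

Each call emits between one and k tokens, which is where the speedup comes from: when the draft aligns well with the target, most steps commit several tokens for a single target forward pass.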
Draft-model acceleration in Centralpoint: Centralpoint routes requests to inference endpoints that use draft-model speculation while consistently metering tokens at the target-model rate. The model-agnostic platform supports any backend (vLLM, TensorRT-LLM, hosted APIs), keeps prompts local, and deploys chatbots through a single line of JavaScript on any portal.
Related Keywords:
Draft Model