Spec Dec Roofline Model

γ* = 0, 1.0x speedup

γ* = 16 (max), 1.6x speedup

Model

Hardware

Sequence length: 4,096 tok/seq (scale: 512, 2k, 8k, 32k, 131k)

Batch: 8 seqs (scale: 1, 8, 16, 24, 32)

Block size 1: acceptance probability 75%, relative cost per token 10%

Block size 16: acceptance probability 89%, relative cost per block 10%

This tool uses roofline analysis to estimate the speedup from speculative decoding as a function of draft length γ, for a chosen model running on chosen hardware. It is only a model! It tends to underestimate the benefit when overhead is a major contributor to latency, e.g. at small batch sizes on small models.
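
To make the idea concrete, here is a minimal sketch of a roofline-style speedup estimate. It is not this tool's implementation. It assumes a per-token drafter, an independent per-token acceptance probability, a forward pass whose time is the larger of compute time and memory time, and no attention FLOPs or fixed overhead. The names (`Setup`, `step_time`, `expected_tokens`, `speedup`, `best_gamma`) and all numbers in `demo` are made up for illustration and do not describe any real model or chip.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    params: float              # weights touched per forward pass (count)
    bytes_per_param: float     # e.g. 2 for 16-bit weights
    kv_bytes_per_token: float  # KV-cache bytes per cached token
    peak_flops: float          # peak compute, FLOP/s
    mem_bw: float              # memory bandwidth, bytes/s
    batch: int                 # sequences decoded together
    seq_len: int               # cached tokens per sequence


def step_time(s: Setup, tokens_per_seq: int) -> float:
    """Roofline time of one target forward pass over tokens_per_seq new tokens."""
    # Assumption: ~2 FLOPs per weight per token; attention FLOPs are ignored.
    flops = 2.0 * s.params * s.batch * tokens_per_seq
    # Weights and the KV cache are each read once per pass.
    bytes_moved = (s.params * s.bytes_per_param
                   + s.batch * s.seq_len * s.kv_bytes_per_token)
    return max(flops / s.peak_flops, bytes_moved / s.mem_bw)


def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verify step with gamma drafted tokens."""
    if alpha >= 1.0:
        return float(gamma + 1)
    # Geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(s: Setup, alpha: float, rel_draft_cost: float, gamma: int) -> float:
    """Estimated speedup over plain decoding for draft length gamma."""
    baseline = step_time(s, 1)
    # Assumption: each drafted token costs rel_draft_cost of a baseline step.
    spec = step_time(s, gamma + 1) + gamma * rel_draft_cost * baseline
    return expected_tokens(alpha, gamma) * baseline / spec


def best_gamma(s: Setup, alpha: float, rel_draft_cost: float,
               max_gamma: int = 16) -> int:
    """Draft length in [0, max_gamma] with the highest estimated speedup."""
    return max(range(max_gamma + 1),
               key=lambda g: speedup(s, alpha, rel_draft_cost, g))


# Hypothetical setup: every value here is a placeholder.
demo = Setup(params=8e9, bytes_per_param=2, kv_bytes_per_token=131072,
             peak_flops=1e15, mem_bw=3e12, batch=8, seq_len=4096)
g = best_gamma(demo, alpha=0.75, rel_draft_cost=0.10)
print(g, round(speedup(demo, 0.75, 0.10, g), 2))
```

In this sketch, verifying γ + 1 tokens costs no more than one token while the pass stays memory-bound, so longer drafts pay off until the compute roof or the drafter's cost takes over. With γ = 0 the estimate is exactly 1.0x by construction.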

The roofline model used here was inspired by the work of Fergus Finn of Doubleword. In particular, the implementation was derived using his DeepSeek-V4 Flash B200 optimal draft length estimator as a reference.