Spec Dec Roofline Model
[Interactive estimator readouts: γ*=0, 1.0x speedup; γ*=16 (max), 1.6x speedup]
This tool uses roofline analysis to estimate the speedup from speculative decoding as a function of draft length, model, and hardware. It is only a model! It tends to underestimate the benefit when overhead is a major contributor to latency, for example at small batch sizes on small models.
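To make the idea concrete, here is a minimal sketch of a roofline-style speedup estimate. It is not this tool's implementation. It assumes a dense model whose forward-pass time is the larger of its compute time (2 FLOPs per parameter per token) and the time to read its weights once, a fixed per-token acceptance rate `alpha`, and the usual geometric formula for expected tokens per verification step. KV-cache traffic, attention cost, MoE routing, and fixed overhead are all ignored. Every name here (`Hardware`, `Model`, `step_time`, `speedup`, `best_gamma`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


@dataclass
class Model:
    params: float           # parameter count (dense model assumed)
    bytes_per_param: float  # e.g. 2.0 for bf16 weights


def step_time(model: Model, hw: Hardware, tokens: int) -> float:
    """Roofline time for one forward pass over `tokens` tokens.

    Assumption: compute is 2 FLOPs per parameter per token, and memory
    traffic is a single read of the weights (no KV cache, no activations).
    """
    compute = 2.0 * model.params * tokens / hw.peak_flops
    memory = model.params * model.bytes_per_param / hw.mem_bandwidth
    return max(compute, memory)


def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step, assuming each draft
    token is accepted independently with probability `alpha`."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(target: Model, draft: Model, hw: Hardware,
            batch: int, alpha: float, gamma: int) -> float:
    """Estimated speedup over plain decoding for draft length `gamma`."""
    if gamma == 0:
        return 1.0  # no drafting: identical to the baseline
    baseline = step_time(target, hw, batch)
    # gamma sequential draft passes, then one target pass that verifies
    # gamma + 1 positions per sequence.
    spec = (gamma * step_time(draft, hw, batch)
            + step_time(target, hw, batch * (gamma + 1)))
    return expected_tokens(alpha, gamma) * baseline / spec


def best_gamma(target: Model, draft: Model, hw: Hardware,
               batch: int, alpha: float, max_gamma: int = 16) -> int:
    """Draft length in [0, max_gamma] with the highest estimated speedup."""
    return max(range(max_gamma + 1),
               key=lambda g: speedup(target, draft, hw, batch, alpha, g))


if __name__ == "__main__":
    # Illustrative numbers only; not taken from any datasheet.
    hw = Hardware(peak_flops=1e15, mem_bandwidth=1e12)
    target = Model(params=70e9, bytes_per_param=2.0)
    draft = Model(params=1e9, bytes_per_param=2.0)
    g = best_gamma(target, draft, hw, batch=1, alpha=0.8)
    print(g, speedup(target, draft, hw, 1, 0.8, g))
```

In this sketch the optimal draft length falls as the batch grows, because the verification pass over `batch * (gamma + 1)` tokens eventually crosses from the memory-bound to the compute-bound side of the roofline and stops being nearly free.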
The roofline model used here was inspired by the work of Fergus Finn of Doubleword. In particular, the implementation was derived using his DeepSeek-V4 Flash B200 optimal draft length estimator as a reference.
