Michael R. Metel
2024
Draft on the Fly: Adaptive Self-Speculative Decoding using Cosine Similarity
Michael R. Metel | Peng Lu | Boxing Chen | Mehdi Rezagholizadeh | Ivan Kobyzev
Findings of the Association for Computational Linguistics: EMNLP 2024
We present a simple on-the-fly method for faster inference of large language models. Unlike other (self-)speculative decoding techniques, our method does not require fine-tuning or black-box optimization to generate a fixed draft model; instead, it relies on simple rules to generate varying draft models adapted to the input context. We show empirically that our lightweight algorithm is competitive with the current state of the art (SOTA) for self-speculative decoding, while being a truly plug-and-play method.
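The abstract does not state the selection rules themselves, so the following is only an illustrative sketch of one way a cosine-similarity rule could pick layers to drop from a draft model. It assumes, hypothetically, that a layer whose output hidden state is nearly parallel to its input changes the representation little and is therefore a candidate for skipping. The names `cosine_similarity`, `select_layers_to_skip`, `layer_io`, and `max_skipped` are invented for this example and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_layers_to_skip(
    layer_io: List[Tuple[Vector, Vector]], max_skipped: int
) -> List[int]:
    """Return indices of up to max_skipped layers with the highest
    input/output cosine similarity (hypothetical skipping rule)."""
    scored = [
        (cosine_similarity(x_in, x_out), idx)
        for idx, (x_in, x_out) in enumerate(layer_io)
    ]
    # Highest similarity first; ties are broken by the lower layer index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return sorted(idx for _, idx in scored[:max_skipped])


# Toy (input, output) hidden states for four layers; values are made up.
layer_io = [
    ([1.0, 0.0], [1.0, 0.1]),   # nearly unchanged direction
    ([1.0, 0.0], [0.0, 1.0]),   # orthogonal output
    ([1.0, 1.0], [2.0, 2.0]),   # same direction, rescaled
    ([1.0, 0.0], [-1.0, 0.0]),  # opposite direction
]
print(select_layers_to_skip(layer_io, max_skipped=2))  # [0, 2]
```

Because the similarities are recomputed from the current hidden states, a rule of this kind would yield a different set of skipped layers for different inputs, which is the sense in which a draft model can adapt to the input context without any fine-tuning.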