Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models

Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Sarath Chandar


Abstract
Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever-growing sizes only increasing the barrier to use. One particular issue stems from the high latency associated with auto-regressive generation in LLMs, rendering the largest LLMs difficult to use without advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger expert model's generation, has helped alleviate this concern, but remains dependent on alignment between the two models. Thus, if the draft model is insufficiently capable on some domain of interest relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this decision-making problem, we frame it as a contextual bandit, where a policy must choose a draft model based on a context. We show that even without prior knowledge of the draft models, creating an offline dataset from only the outputs of independent draft/target models and training a policy over the alignment of these outputs can accelerate performance on multiple domains, as long as at least one individual draft model is effective. These results hold across various settings with multiple assisted decoding candidates, highlighting the flexibility of the approach and the advantageous role that such decision making can play.
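The selection problem described in the abstract can be sketched with a minimal contextual-bandit-style policy: log how well each draft model's outputs align with the target model (e.g., via token acceptance rates in speculative decoding) per context, then pick the draft with the best observed alignment for a new context. This is an illustrative simplification, not the paper's actual policy; the context representation, alignment measure, and class names here are hypothetical.

```python
from collections import defaultdict

class DraftSelector:
    """Toy contextual-bandit selector over black-box draft models.

    Chooses the draft whose proposals the target model has historically
    accepted most often for a given (discrete) context. Hypothetical
    sketch; the paper learns a policy over offline alignment data.
    """

    def __init__(self, drafts):
        self.drafts = drafts  # names of candidate draft models
        # stats[context][draft] = [accepted_tokens, proposed_tokens]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def update(self, context, draft, accepted, proposed):
        # Offline logging: how many of the draft's proposed tokens
        # the target model accepted for this context.
        entry = self.stats[context][draft]
        entry[0] += accepted
        entry[1] += proposed

    def select(self, context):
        # Pick the draft with the highest observed acceptance rate;
        # unseen contexts fall back to the first candidate.
        best, best_rate = self.drafts[0], -1.0
        for d in self.drafts:
            acc, tot = self.stats[context][d]
            rate = acc / tot if tot else 0.0
            if rate > best_rate:
                best, best_rate = d, rate
        return best

selector = DraftSelector(["code_draft", "chat_draft"])
selector.update("code", "code_draft", accepted=8, proposed=10)
selector.update("code", "chat_draft", accepted=2, proposed=10)
print(selector.select("code"))  # the better-aligned draft for "code"
```

In a real system the context would be a learned representation of the prompt and the policy a trained model, but the core decision, routing each query to the draft most aligned with the target there, is the same.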
Anthology ID:
2024.emnlp-main.332
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5817–5830
URL:
https://aclanthology.org/2024.emnlp-main.332
Cite (ACL):
Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, and Sarath Chandar. 2024. Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5817–5830, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models (Huang et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.332.pdf