@inproceedings{huang-etal-2024-context,
title = "Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models",
author = "Huang, Jerry and
Parthasarathi, Prasanna and
Rezagholizadeh, Mehdi and
Chandar, Sarath",
editor = "Al-Onaizan, Yaser and
Bansal, Mohit and
Chen, Yun-Nung",
booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2024",
address = "Miami, Florida, USA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.emnlp-main.332",
pages = "5817--5830",
abstract = "Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever growing sizes only increasing the barrier for use. One particular issue stems from the high latency associated with auto-regressive generation in LLMs, rendering the largest LLMs difficult to use without advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger expert model{'}s generation, has helped alleviate this concern, but remains dependent on alignment between the two models. Thus if the draft model is insufficiently capable on some domain of interest relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this decision making problem, we observe it as a contextual bandit, where a policy must choose a draft model based on a context. We show that even without prior knowledge of the draft models, creating an offline dataset from only outputs of independent draft/target models and training a policy over the alignment of these outputs can accelerate performance on multiple domains as long as an individual draft model is effective. We observe these results hold on various settings with multiple assisted decoding candidates, highlighting its flexibility and the advantageous role that such decision making can play.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="huang-etal-2024-context">
<titleInfo>
<title>Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Jerry</namePart>
<namePart type="family">Huang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Prasanna</namePart>
<namePart type="family">Parthasarathi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mehdi</namePart>
<namePart type="family">Rezagholizadeh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sarath</namePart>
<namePart type="family">Chandar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2024-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yaser</namePart>
<namePart type="family">Al-Onaizan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohit</namePart>
<namePart type="family">Bansal</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun-Nung</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Miami, Florida, USA</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever growing sizes only increasing the barrier for use. One particular issue stems from the high latency associated with auto-regressive generation in LLMs, rendering the largest LLMs difficult to use without advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger expert model’s generation, has helped alleviate this concern, but remains dependent on alignment between the two models. Thus if the draft model is insufficiently capable on some domain of interest relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this decision making problem, we observe it as a contextual bandit, where a policy must choose a draft model based on a context. We show that even without prior knowledge of the draft models, creating an offline dataset from only outputs of independent draft/target models and training a policy over the alignment of these outputs can accelerate performance on multiple domains as long as an individual draft model is effective. We observe these results hold on various settings with multiple assisted decoding candidates, highlighting its flexibility and the advantageous role that such decision making can play.</abstract>
<identifier type="citekey">huang-etal-2024-context</identifier>
<location>
<url>https://aclanthology.org/2024.emnlp-main.332</url>
</location>
<part>
<date>2024-11</date>
<extent unit="page">
<start>5817</start>
<end>5830</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models
%A Huang, Jerry
%A Parthasarathi, Prasanna
%A Rezagholizadeh, Mehdi
%A Chandar, Sarath
%Y Al-Onaizan, Yaser
%Y Bansal, Mohit
%Y Chen, Yun-Nung
%S Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, Florida, USA
%F huang-etal-2024-context
%X Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever growing sizes only increasing the barrier for use. One particular issue stems from the high latency associated with auto-regressive generation in LLMs, rendering the largest LLMs difficult to use without advanced computing infrastructure. Assisted decoding, where a smaller draft model guides a larger expert model’s generation, has helped alleviate this concern, but remains dependent on alignment between the two models. Thus if the draft model is insufficiently capable on some domain of interest relative to the target model, performance can degrade. Alternatively, one can leverage multiple draft models to better cover the expertise of the target, but when multiple black-box draft models are available, selecting an assistant without details about its construction can be difficult. To better understand this decision making problem, we observe it as a contextual bandit, where a policy must choose a draft model based on a context. We show that even without prior knowledge of the draft models, creating an offline dataset from only outputs of independent draft/target models and training a policy over the alignment of these outputs can accelerate performance on multiple domains as long as an individual draft model is effective. We observe these results hold on various settings with multiple assisted decoding candidates, highlighting its flexibility and the advantageous role that such decision making can play.
%U https://aclanthology.org/2024.emnlp-main.332
%P 5817-5830
Markdown (Informal)
[Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models](https://aclanthology.org/2024.emnlp-main.332) (Huang et al., EMNLP 2024)
ACL