Sujan Kumar Gonugondla
2024
BASS: Batched Attention-optimized Speculative Sampling
Haifeng Qian
|
Sujan Kumar Gonugondla
|
Sungsoo Ha
|
Mingyue Shang
|
Sanjay Krishna Gouda
|
Ramesh Nallapati
|
Sudipta Sengupta
|
Xiaofei Ma
|
Anoop Deoras
Findings of the Association for Computational Linguistics: ACL 2024
Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15× speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what’s feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3× the highest of that of regular decoding and around 10× of single-sequence speculative decoding.
Token Alignment via Character Matching for Subword Completion
Ben Athiwaratkun
|
Shiqi Wang
|
Mingyue Shang
|
Yuchen Tian
|
Zijian Wang
|
Sujan Kumar Gonugondla
|
Sanjay Krishna Gouda
|
Robert Kwiatkowski
|
Ramesh Nallapati
|
Parminder Bhatia
|
Bing Xiang
Findings of the Association for Computational Linguistics: ACL 2024
Generative models, widely utilized in various applications, can often struggle with prompts corresponding to partial tokens. This struggle stems from tokenization, where partial tokens fall out of distribution during inference, leading to incorrect or nonsensical outputs. This paper examines a technique to alleviate the tokenization artifact on text completion in generative models, maintaining performance even in regular non-subword cases. The method, termed token alignment, involves backtracking to the last complete tokens and ensuring the model’s generation aligns with the prompt. This approach showcases marked improvement across many partial token scenarios, including nuanced cases like space-prefix and partial indentation, with only a minor time increase. The technique and analysis detailed in this paper contribute to the continuous advancement of generative models in handling partial inputs, bearing relevance for applications like code completion and text.
Search
Fix data
Co-authors
- Sanjay Krishna Gouda 2
- Ramesh Nallapati 2
- Mingyue Shang 2
- Ben Athiwaratkun 1
- Parminder Bhatia 1
- show all...