2024
pdf
bib
abs
Lightweight reranking for language model generations
Siddhartha Jain
|
Xiaofei Ma
|
Anoop Deoras
|
Bing Xiang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) can exhibit considerable variation in the quality of their sampled outputs. Reranking and selecting the best generation from the sampled set is a popular way of obtaining strong gains in generation quality. In this paper, we present a novel approach for reranking LLM generations. Unlike other techniques that might involve additional inferences or training a specialized reranker, our approach relies on easy to compute pairwise statistics between the generations that have minimal compute overhead. We show that our approach can be formalized as an extension of self-consistency and analyze its performance in that framework, theoretically as well as via simulations. We show strong improvements for selecting the best k generations for code generation tasks as well as robust improvements for the best generation for the tasks of autoformalization, summarization, and translation. While our approach only assumes black-box access to LLMs, we show that additional access to token probabilities can improve performance even further.
pdf
bib
abs
BASS: Batched Attention-optimized Speculative Sampling
Haifeng Qian
|
Sujan Kumar Gonugondla
|
Sungsoo Ha
|
Mingyue Shang
|
Sanjay Krishna Gouda
|
Ramesh Nallapati
|
Sudipta Sengupta
|
Xiaofei Ma
|
Anoop Deoras
Findings of the Association for Computational Linguistics: ACL 2024
Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15× speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what’s feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3× the highest of that of regular decoding and around 10× of single-sequence speculative decoding.
pdf
bib
abs
CodeFort: Robust Training for Code Generation Models
Yuhao Zhang
|
Shiqi Wang
|
Haifeng Qian
|
Zijian Wang
|
Mingyue Shang
|
Linbo Liu
|
Sanjay Krishna Gouda
|
Baishakhi Ray
|
Murali Krishna Ramanathan
|
Xiaofei Ma
|
Anoop Deoras
Findings of the Association for Computational Linguistics: EMNLP 2024
Code generation models are not robust to small perturbations, which often lead to incorrect generations and significantly degrade the performance of these models. Although improving the robustness of code generation models is crucial to enhancing user experience in real-world applications, existing research efforts do not address this issue. To fill this gap, we propose CodeFort, a framework to improve the robustness of code generation models, generalizing a large variety of code perturbations to enrich the training data and enabling various robust training strategies, mixing data augmentation, batch augmentation, adversarial logits pairing, and contrastive learning, all carefully designed to support high-throughput training. Extensive evaluations show that we increase the average robust pass rates of baseline CodeGen models from 14.79 to 21.74. We notably decrease the robustness drop rate from 95.02% to 54.95% against code-syntax perturbations.
2011
pdf
bib
A Fast Re-scoring Strategy to Capture Long-Distance Dependencies
Anoop Deoras
|
Tomáš Mikolov
|
Kenneth Church
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing