Huanran Zheng


2024

SCA: Selective Compression Attention for Efficiently Extending the Context Window of Large Language Models
Huanran Zheng | Wei Zhu | Xiaoling Wang
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) have achieved impressive performance across various domains, but their limited context window and the high computational cost of processing long texts restrict their broader application. In this paper, we propose Selective Compression Attention (SCA), a general and effective method for expanding the context window and reducing the memory footprint by compressing the KV cache of LLMs. Specifically, through preliminary experiments we found that the KV cache contains many similar vectors, resulting in information redundancy that can be compressed by retaining representative vectors and discarding the rest. SCA therefore uses a greedy algorithm to continually select the most distinctive vectors to keep, reducing information loss during compression. Extensive experiments on various tasks verify the effectiveness of our method. Compared with existing methods, SCA significantly reduces the impact on model performance at the same compression ratio. Furthermore, SCA can efficiently expand the context window of LLMs without any training, even achieving better performance than specially fine-tuned long-context models.
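The abstract does not spell out the selection criterion beyond "a greedy algorithm that keeps the most distinctive vectors," but the core idea can be illustrated with farthest-point-style selection over cached key vectors. The following Python/PyTorch sketch is an assumption-laden illustration, not the authors' implementation: the function name greedy_compress_kv, the budget parameter, and the use of cosine similarity as the redundancy measure are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def greedy_compress_kv(keys: torch.Tensor, budget: int) -> torch.Tensor:
    """Greedily pick `budget` indices of mutually distinctive key vectors
    (farthest-point selection under cosine similarity; illustrative only).

    keys: (seq_len, head_dim) cached key vectors for one attention head.
    Returns the sorted indices of the vectors to keep.
    """
    normed = F.normalize(keys, dim=-1)                 # unit-norm rows
    # Seed with the vector least aligned with the mean direction.
    mean_dir = F.normalize(normed.mean(dim=0), dim=0)
    kept = [int(torch.argmin(normed @ mean_dir))]
    # max_sim[i] = highest similarity of vector i to any kept vector so far
    max_sim = normed @ normed[kept[0]]
    for _ in range(budget - 1):
        max_sim[kept] = float("inf")                   # never reselect kept vectors
        nxt = int(torch.argmin(max_sim))               # most distinctive remaining vector
        kept.append(nxt)
        max_sim = torch.maximum(max_sim, normed @ normed[nxt])
    return torch.tensor(sorted(kept))
```

A caller would apply the returned indices to both the key and value caches (e.g. keys[idx], values[idx]) whenever the cache exceeds its budget; the actual SCA method operates on the live KV cache during decoding and may use a different distinctiveness measure.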

MiLoRA: Efficient Mixture of Low-Rank Adaptation for Large Language Models Fine-tuning
Jingfan Zhang | Yi Zhao | Dan Chen | Xing Tian | Huanran Zheng | Wei Zhu
Findings of the Association for Computational Linguistics: EMNLP 2024

Low-rank adaptation (LoRA) and its mixture-of-experts (MoE) variants are highly effective parameter-efficient fine-tuning (PEFT) methods. However, they introduce significant latency in multi-tenant settings due to the LoRA modules and MoE routers added to multiple linear modules in the Transformer layer. To address this issue, we propose Mixture of Low-Rank Adaptation (MiLoRA), a novel and efficient LoRA variant. MiLoRA differs from previous MoE-style LoRA methods by considering each LoRA module as an expert and employing a prompt-aware routing mechanism. This mechanism calculates expert routing results once, before the first new token is generated, and reuses them for all subsequent tokens, reducing latency. Extensive experiments and analysis on commonsense reasoning tasks, math reasoning tasks, and widely used LLM evaluation benchmarks demonstrate that MiLoRA consistently outperforms strong PEFT baselines with comparable tunable parameter budgets. Additionally, MiLoRA significantly reduces latency in multi-tenant settings compared to previous LoRA-based methods.
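The key efficiency idea, routing once per prompt and reusing the result for every decoded token, can be sketched as follows. This is a minimal illustration under assumed details (mean-pooling the prompt, a softmax router over LoRA experts, and the class name PromptAwareLoRARouter are all hypothetical), not the released MiLoRA code:

```python
import torch
import torch.nn as nn

class PromptAwareLoRARouter(nn.Module):
    """Sketch of prompt-aware routing over LoRA experts (assumed details).

    Routing weights are computed once from the pooled prompt hidden states
    and cached, so decoding steps reuse them instead of re-routing per token.
    """

    def __init__(self, hidden: int, n_experts: int, rank: int):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts)
        # Each expert is a LoRA pair (A: hidden -> rank, B: rank -> hidden).
        self.A = nn.Parameter(torch.randn(n_experts, hidden, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, rank, hidden))
        self._cached_weights = None  # set once per request

    def route(self, prompt_hidden: torch.Tensor) -> None:
        # Pool the prompt (mean over tokens) and route once per request.
        pooled = prompt_hidden.mean(dim=1)           # (batch, hidden)
        self._cached_weights = torch.softmax(self.router(pooled), dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); reuse the cached routing for new tokens.
        assert self._cached_weights is not None, "call route() on the prompt first"
        w = self._cached_weights                     # (batch, n_experts)
        # Weighted mixture of experts: delta = sum_e w_e * (x @ A_e @ B_e)
        delta = torch.einsum("bsh,ehr,erk,be->bsk", x, self.A, self.B, w)
        return delta
```

In use, route() would be called once after the prefill pass and forward() at every decoding step; because the router and softmax run only once per request, the per-token overhead is just the mixed LoRA matmuls.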

2022

Candidate Soups: Fusing Candidate Results Improves Translation Quality for Non-Autoregressive Translation
Huanran Zheng | Wei Zhu | Pengfei Wang | Xiaoling Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Non-autoregressive translation (NAT) models achieve much faster inference than autoregressive translation (AT) models because they predict all tokens simultaneously during inference. However, their translation quality degrades compared to AT. Moreover, existing NAT methods focus only on improving the NAT model itself and do not fully utilize its outputs. In this paper, we propose a simple but effective method called “Candidate Soups,” which obtains high-quality translations while maintaining the inference speed of NAT models. Unlike previous approaches that pick a single result and discard the remainder, Candidate Soups (CDS) fully exploits the valuable information in the different candidate translations through model uncertainty. Extensive experiments on two benchmarks (WMT’14 EN–DE and WMT’16 EN–RO) demonstrate the effectiveness and generality of our proposed method, which significantly improves the translation quality of various base models. More notably, our best variant outperforms the AT model on three translation tasks with a 7.6× speedup.
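At its core, the method fuses token-level information from several candidate translations rather than keeping a single winner. A heavily simplified Python sketch of that idea, assuming all candidates were decoded to the same target length and that the model's per-token probability serves as the uncertainty signal (the real CDS algorithm is more involved, e.g. in how candidates of different lengths are handled), might look like this:

```python
import torch

def candidate_soup(candidates: torch.Tensor, probs: torch.Tensor) -> torch.Tensor:
    """Fuse NAT candidates token-wise by model confidence (illustrative only).

    candidates: (n_cand, seq_len) token ids, all decoded to the same length.
    probs:      (n_cand, seq_len) model probability assigned to each token.
    Returns a (seq_len,) fused sequence keeping, at each position, the token
    from the candidate the model was most confident about there.
    """
    best = probs.argmax(dim=0)                       # best candidate per position
    positions = torch.arange(candidates.shape[1])
    return candidates[best, positions]
```

The point of the sketch is the contrast with reranking: instead of scoring whole candidates and keeping one, confident fragments from different candidates are combined into a single output at no extra decoding cost.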