Yixuan Wang

2025

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
Xianzhen Luo | Yixuan Wang | Qingfu Zhu | Zhiming Zhang | Xuanyu Zhang | Qing Yang | Dongliang Xu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The rapid growth in the parameters of LLMs has made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based train-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. This approach stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires <2MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing train-free methods by 30% and even a training method by 25%.
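The Token Recycling abstract describes three moving parts: an adjacency matrix of candidate tokens, a BFS-like expansion into a draft tree, and an update with new candidates after each step. The following is a minimal sketch of those parts, not the authors' implementation. The class name `TokenRecycler`, the per-row top-k size `k`, and the fixed per-node `width` are illustrative assumptions. The paper's actual tree shape and update rule may differ, and tree-attention verification is omitted.

```python
from collections import deque


class TokenRecycler:
    """Hypothetical sketch: row t of the adjacency matrix stores up to k
    candidate successor tokens recycled from earlier decoding steps."""

    def __init__(self, vocab_size: int, k: int = 4):
        self.k = k
        # adjacency[t] holds at most k candidate next tokens for token t
        self.adjacency = [[] for _ in range(vocab_size)]

    def update(self, token: int, candidates: list) -> None:
        # Overwrite the row with the newest top-k candidates seen after `token`
        self.adjacency[token] = list(candidates[: self.k])

    def build_draft_tree(self, root: int, depth: int, width: int = 2) -> list:
        """BFS-like expansion from `root`. Returns (parent_index, token) pairs;
        index 0 is the root, whose parent index is -1."""
        nodes = [(-1, root)]
        frontier = deque([(0, 0)])  # (node index, tree level)
        while frontier:
            idx, level = frontier.popleft()
            if level == depth:
                continue  # do not expand beyond the requested depth
            token = nodes[idx][1]
            for child in self.adjacency[token][:width]:
                nodes.append((idx, child))
                frontier.append((len(nodes) - 1, level + 1))
        return nodes
```

Each row holds only `k` token ids, so the matrix size is `vocab_size * k` integers. That is consistent with the small storage footprint the abstract reports, although the exact figure depends on the vocabulary and `k`.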

Logical forms complement probability in understanding language model (and human) performance
Yixuan Wang | Freda Shi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the increasing interest in using large language models (LLMs) for planning in natural language, understanding their behaviors becomes an important research question. This work conducts a systematic investigation of LLMs’ ability to perform logical reasoning in natural language. We introduce a controlled dataset of hypothetical and disjunctive syllogisms in propositional and modal logic and use it as the testbed for understanding LLM performance. Our results lead to novel insights in predicting LLM behaviors: in addition to the probability of input, logical forms should be considered as important factors. In addition, we show similarities and discrepancies between the logical reasoning performances of humans and LLMs by collecting and comparing behavioral data from both.

Large language models (LLMs) rely on key-value cache (KV cache) to accelerate decoding by reducing redundant computations. However, the KV cache memory usage grows substantially with longer text sequences, posing challenges for efficient deployment. Existing KV cache eviction methods prune tokens using prefilling-stage attention scores, causing inconsistency with actual inference queries, especially under tight memory budgets. In this paper, we propose Lookahead Q-Cache (LAQ), a novel eviction framework that generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries. By using these lookahead queries as the observation window for importance estimation, LAQ achieves more consistent and accurate KV cache eviction aligned with real inference scenarios. Experimental results on LongBench and Needle-in-a-Haystack benchmarks show that LAQ outperforms existing methods across various budget levels, achieving a 1 to 4 point improvement on LongBench under limited cache budget. Moreover, LAQ is complementary to existing approaches and can be flexibly combined to yield further improvements.
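The eviction step described above scores each cached key by the attention it receives from an observation window of queries, then keeps the highest-scoring entries. The sketch below implements that scoring and top-k selection in pure Python. The function names and the `budget` parameter are hypothetical. How LAQ actually produces its pseudo lookahead queries is abstracted away; here they are simply passed in as vectors.

```python
import math


def attention_importance(keys, queries):
    """Sum of softmax attention weights each key receives from the
    observation-window queries (scaled dot-product, single head)."""
    d = len(keys[0])
    scores = [0.0] * len(keys)
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        for i, e in enumerate(exps):
            scores[i] += e / z
    return scores


def evict(keys, values, lookahead_queries, budget):
    """Keep the `budget` KV entries with the highest importance,
    preserving their original sequence order."""
    scores = attention_importance(keys, lookahead_queries)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [keys[i] for i in keep], [values[i] for i in keep], keep
```

Swapping prefill-stage queries for lookahead queries only changes the `lookahead_queries` argument. Under that framing, the abstract's point is that the choice of observation window, not the selection rule, drives eviction quality.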

Tag-Evol: Achieving Efficient Instruction Evolving via Tag Injection
Yixuan Wang | Shiqi Zhou | Chuanzhe Guo | Qingfu Zhu
Findings of the Association for Computational Linguistics: ACL 2025

Evol-Instruct has made significant improvements as a data synthesis method in several areas. Existing methods typically rely on a fixed set of strategies to evolve, which require manual design and are monolithic in form. In addition, iterative evolution also makes the acquisition of hard samples expensive. In view of this, we propose the Tag-Evol framework, a more diverse and efficient instruction evolving method. Specifically, Tag-Evol uses diverse and specific knowledge tags as strategies to achieve controlled evolution by injecting different combinations of tags into the original instructions. Experiments with multiple backbones in mathematical and code domain benchmarks show that the proposed method generates significantly better evolved data than other methods. Furthermore, we conduct a thorough analysis of the evolved data, demonstrating that Tag-Evol is not only efficient but also generates more diverse and challenging data.

2024

Existing speculative decoding methods typically require additional model structure and training processes to assist the model in draft token generation. This makes migrating acceleration methods to new models more costly and more demanding on device memory. To address this problem, we propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of the large language model. The training method simply introduces some noise at the input for the model to learn the denoising task. It significantly enhances the parallel decoding capability of the model without affecting the original task capability. In addition, we propose a tree-based retrieval-augmented Jacobi (TR-Jacobi) decoding strategy to further improve the inference speed of MSN models. Experiments in both the general and code domains have shown that MSN can improve inference speed by 2.3x to 2.7x without compromising model performance. The MSN model also achieves acceleration ratios comparable to the SOTA model with additional model structure on Spec-Bench.
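The MSN abstract says only that noise is introduced at the input so the model learns a denoising task. One generic way to build such a training pair is to corrupt some input tokens while keeping the next-token labels clean. The sketch below does this under stated assumptions: random-token replacement, the `noise_ratio` value, and the function name `add_input_noise` are all hypothetical, and MSN's actual noise scheme may differ.

```python
import random


def add_input_noise(token_ids, vocab_size, noise_ratio=0.15, seed=0):
    """Hypothetical denoising pair: corrupt a fraction of input positions with
    random vocabulary ids, but keep the causal-LM labels clean."""
    rng = random.Random(seed)
    noisy = list(token_ids)
    n_noise = int(len(token_ids) * noise_ratio)
    for pos in rng.sample(range(len(token_ids)), n_noise):
        noisy[pos] = rng.randrange(vocab_size)
    # Standard causal shift: input position t predicts the clean token at t + 1
    return noisy[:-1], list(token_ids[1:])
```

Because the labels stay clean, the model must predict correct continuations from partly wrong context. That is the property a parallel (Jacobi-style) decoder needs when its own draft tokens are imperfect.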

Nowadays, data augmentation through synthetic data has been widely used in the field of Grammatical Error Correction (GEC) to alleviate the problem of data scarcity. However, these synthetic data are mainly used in the pre-training phase rather than the data-limited fine-tuning phase due to inconsistent error distribution and noisy labels. In this paper, we propose a synthetic data construction method based on contextual augmentation, which can ensure an efficient augmentation of the original data with a more consistent error distribution. Specifically, we combine rule-based substitution with model-based generation, using the generation model to generate a richer context for the extracted error patterns. Besides, we also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data. Experiments on CoNLL14 and BEA19-Test show that our proposed augmentation method consistently and substantially outperforms strong baselines and achieves the state-of-the-art level with only a small amount of synthetic data.

LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction
Yixuan Wang | Baoxin Wang | Yijun Liu | Dayong Wu | Wanxiang Che
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Over-correction is a critical problem in the Chinese grammatical error correction (CGEC) task. Recent work using model ensemble methods based on voting can effectively mitigate over-correction and improve the precision of the GEC system. However, these methods still require the output of several GEC systems and inevitably lead to reduced error recall. In this light, we propose the LM-Combiner, a rewriting model that can directly modify the over-correction of GEC system outputs without a model ensemble. Specifically, we train the model on an over-correction dataset constructed through the proposed K-fold cross inference method, which allows it to directly generate filtered sentences by combining the original and the over-corrected text. In the inference stage, we directly take the original sentences and the output results of other systems as input and then obtain the filtered sentences through LM-Combiner. Experiments on the FCGEC dataset show that our proposed method effectively alleviates the over-correction of the original system (+18.2 Precision) while ensuring the error recall remains unchanged. Besides, we find that LM-Combiner still has good rewriting performance even with a small number of parameters and little training data, and thus can cost-effectively mitigate the over-correction of black-box GEC systems (e.g., ChatGPT).
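K-fold cross inference, as named in the abstract, lets every training sentence receive a system output from a model that never saw that sentence during training, so the outputs show realistic errors such as over-correction. The sketch below shows only that generic splitting logic. `train_fn` and `predict_fn` are hypothetical placeholders for training and running a GEC model, and the step that filters the outputs into an over-correction dataset is omitted.

```python
def k_fold_cross_inference(data, k, train_fn, predict_fn):
    """For each fold, train on the other k - 1 folds and predict the held-out
    fold. Returns (source, system_output, target) triples for all examples."""
    folds = [data[i::k] for i in range(k)]
    outputs = []
    for i, held_out in enumerate(folds):
        # Training split excludes the fold currently being predicted
        train_split = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(train_split)
        for source, target in held_out:
            outputs.append((source, predict_fn(model, source), target))
    return outputs
```

Under this assumption, each triple could become one LM-Combiner training example: original sentence plus system output as input, and the gold sentence as the target.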

2023

System Report for CCL23-Eval Task 8: Chinese Grammar Error Detection and Correction Using Multi-Granularity Information
Yixuan Wang | Yijun Liu | Bo Sun | Wanxiang Che
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)

This paper introduces our system for the CCL-2023 task Chinese Essay Fluency Evaluation (CEFE). The CEFE task aims to study the identification and correction of grammatical errors in test compositions written by primary and middle school students. The evaluation has three tracks, examining the recognition of wrong sentence types, character-level error correction, and wrong sentence rewriting. Based on the task characteristics and data distribution of each track, we propose a token-level discriminative model based on sequence labeling for the multi-label classification of wrong sentences, an auto-encoder model based on edit labels for character-level error correction, and a seq2seq model pre-trained on pseudo data and fine-tuned on labeled data for the wrong sentence rewriting task. In the final evaluation results, our method won first place in all three tracks according to the corresponding evaluation metrics.

Delving into Evaluation Metrics for Generation: A Thorough Assessment of How Metrics Generalize to Rephrasing Across Languages
Yixuan Wang | Qingyan Chen | Duygu Ataman
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems

Language generation has been an important task in natural language processing (NLP), with an increasing variety of applications, especially in recent years. The evaluation of generative language models typically relies on automatic heuristics that search for word- or phrase-level overlaps between generated outputs and hand-crafted reference texts in the given language, ranging from single sentences to entire documents. Language, on the other hand, is productive by nature, which means the same concept can potentially be expressed in many different lexical or phrasal forms, making the assessment of generated outputs very difficult. Many studies have pointed out potential hazards of the prevalent choice of heuristics that match generated language to selected references, and the limitations this setting imposes on developing robust generative models. This paper undertakes an in-depth analysis of evaluation metrics used for generative models, specifically investigating their responsiveness to various syntactic structures and how these characteristics vary across languages with different morphosyntactic typologies. Preliminary findings indicate that while certain metrics exhibit robustness in particular linguistic contexts, a discernible variance emerges in their performance across distinct syntactic forms. Through this exploration, we highlight the need for more nuanced and comprehensive evaluation strategies for generative models, advocating for metrics that are sensitive to the multifaceted nature of languages.

2022

Adaptive Unsupervised Self-training for Disfluency Detection
Zhongyuan Wang | Yixuan Wang | Shaolei Wang | Wanxiang Che
Proceedings of the 29th International Conference on Computational Linguistics

Supervised methods have achieved remarkable results in disfluency detection. However, in real-world scenarios, human-annotated data is difficult to obtain. Recent works try to handle disfluency detection with unsupervised self-training, which can exploit existing large-scale unlabeled data efficiently. However, these self-training-based methods suffer from the problems of selection bias and error accumulation. To tackle these problems, we propose an adaptive unsupervised self-training method for disfluency detection. Specifically, we re-weight the importance of each training example according to its grammatical feature and prediction confidence. Experiments on the Switchboard dataset show that our method improves by 2.3 points over the current SOTA unsupervised method. Moreover, our method is competitive with the SOTA supervised method.

Venues: Findings (3), ACL (2), COLING (2), EMNLP (2), CCL (1), Eval4NLP (1), LREC (1), WS (1)