Chenglong Wang - ACL Anthology

Chenglong Wang

2025

Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models
Kaiyan Chang | Yonghao Shi | Chenglong Wang | Hang Zhou | Chi Hu | Xiaoqian Liu | Yingfeng Luo | Yuan Ge | Tong Xiao | JingBo Zhu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Test-Time Scaling (TTS) is a promising approach to progressively elicit the model’s intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. On top of its effectiveness, we further combine it with other classical parallel scaling methods at the step level, to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that a hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.

HEAL: A Hypothesis-Based Preference-Aware Analysis Framework
Yifu Huo | Chenglong Wang | Qiren Zhu | Shunjie Xing | Tong Xiao | Chunliang Zhang | Tongran Liu | JingBo Zhu
Findings of the Association for Computational Linguistics: EMNLP 2025

Preference optimization methods like DPO have achieved remarkable performance in LLM alignment. However, the evaluation for these methods relies on a single response and overlooks other potential outputs, which could also be generated in real-world applications within this hypothetical space. To address this issue, this paper presents a Hypothesis-based PrEference-aware AnaLysis Framework (HEAL), a novel evaluation paradigm that formulates preference alignment as a re-ranking process within hypothesis spaces. The framework incorporates two complementary metrics: ranking accuracy for evaluating ordinal consistency and preference strength correlation for assessing continuous alignment. To facilitate this framework, we develop UniHypoBench, a unified hypothesis benchmark constructed from diverse instruction-response pairs. Through extensive experiments based on HEAL, with a particular focus on the intrinsic mechanisms of preference learning, we demonstrate that current preference learning methods can effectively capture preferences provided by proxy models while simultaneously suppressing negative samples. These findings contribute to preference learning research through two significant avenues. Theoretically, we introduce hypothesis space analysis as an innovative paradigm for understanding preference alignment. Practically, HEAL offers researchers robust diagnostic tools for refining preference optimization methods, while our empirical results identify promising directions for developing more advanced alignment algorithms capable of comprehensive preference capture.
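As a rough illustration of the two kinds of metrics named in this abstract, the sketch below computes a pairwise ranking accuracy and a correlation between two score lists over a set of hypotheses. The abstract does not give the exact definitions used by HEAL, so the function names, the pairwise formulation, the handling of ties, and the choice of Pearson correlation are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def ranking_accuracy(reference, predicted):
    """Fraction of hypothesis pairs that both score lists order the same way.

    Pairs tied in the reference scores are skipped (an assumption of this sketch).
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(reference)), 2):
        ref_diff = reference[i] - reference[j]
        if ref_diff == 0:
            continue
        total += 1
        if ref_diff * (predicted[i] - predicted[j]) > 0:
            agree += 1
    return agree / total if total else 0.0


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant scores
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for four hypotheses of one instruction.
proxy_scores = [0.9, 0.5, 0.1, 0.3]   # e.g. from a proxy preference model
policy_scores = [2.0, 1.0, 0.2, -0.5]  # e.g. from the model under analysis
accuracy = ranking_accuracy(proxy_scores, policy_scores)
strength = pearson(proxy_scores, policy_scores)
```

Here the first metric only looks at the order of hypotheses, while the second is also sensitive to how far apart the scores are, which mirrors the ordinal versus continuous distinction drawn in the abstract.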

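A Knowledge Editing Method Based on Associated Neuron Identification
Yuzhang Wu | Yongyu Mu | Chenglong Wang | Qiaozhi He | Tong Xiao | Anxiang Ma | Chunliang Zhang | JingBo Zhu
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

In recent years, large language models have shown an excellent ability to store and retrieve knowledge from training corpora. Correspondingly, their reliability is easily undermined by erroneous information in those corpora, which leads to problems such as outdated information and incorrect responses. Knowledge editing methods based on neuron identification modify a model's internal knowledge precisely by identifying and fine-tuning the knowledge neurons related to the target knowledge. However, this paper finds that the form in which knowledge is expressed significantly affects the result of knowledge neuron identification: for different expressions of the same knowledge, the neuron sets identified by existing methods have an average overlap rate of only 21.86%. As a result, editing knowledge through a single expression cannot cover all neurons related to that knowledge, so existing knowledge editing methods tend to have poor robustness. To identify all neurons related to a piece of knowledge comprehensively and accurately, this paper designs a lightweight associated neuron detector (Light weight Associated Neuron Detector, LAND). LAND learns the differences between the knowledge neuron sets identified from different expressions of the same knowledge, and thereby automatically fills in, during identification, the knowledge neurons that were missed because of differences in expression. Experimental results show that LAND raises the average overlap rate of knowledge neurons identified from differently expressed texts to above 96%, and improves the knowledge editing success rate across different sentence patterns by up to 10.83 percentage points more than the baseline methods.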

Defending against Indirect Prompt Injection by Instruction Detection
Tongyu Wen | Chenglong Wang | Xiyuan Yang | Haoyu Tang | Yueqi Xie | Lingjuan Lyu | Zhicheng Dou | Fangzhao Wu
Findings of the Association for Computational Linguistics: EMNLP 2025

The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration introduces vulnerabilities of Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that IPI attacks fundamentally rely on the presence of instructions embedded within external content, which can alter the behavioral states of LLMs. Can the effective detection of such state changes help us defend against IPI attacks? In this paper, we propose InstructDetector, a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. Specifically, we demonstrate the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark. The code is publicly available at https://github.com/MYVAE/Instruction-detection.

2024

Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms
ChuYuan Zhang | Jiangyan Yi | Jianhua Tao | Chenglong Wang | Xinrui Yan
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

Recent advancements in neural speech synthesis technologies have brought about widespread applications but have also raised concerns about potential misuse and abuse. Addressing these challenges is crucial, particularly in the realms of forensics and intellectual property protection. While previous research on source attribution of synthesized speech has its limitations, our study aims to fill these gaps by investigating the identification of sources in synthesized speech. We focus on analyzing speech synthesis model fingerprints in generated speech waveforms, emphasizing the roles of the acoustic model and vocoder. Our research, based on the multi-speaker LibriTTS dataset, reveals two key insights: (1) both vocoders and acoustic models leave distinct, model-specific fingerprints on generated waveforms, and (2) vocoder fingerprints, being more dominant, may obscure those from the acoustic model. These findings underscore the presence of model-specific fingerprints in both components, suggesting their potential significance in source identification applications.

EmoFake: An Initial Dataset for Emotion Fake Audio Detection
Yan Zhao | Jiangyan Yi | Jianhua Tao | Chenglong Wang | Yongfeng Dong
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

To enhance the effectiveness of fake audio detection techniques, researchers have developed multiple datasets such as those for the ASVspoof and ADD challenges. These datasets typically focus on capturing non-emotional characteristics in speech, such as the identity of the speaker and the authenticity of the content. However, they often overlook changes in the emotional state of the audio, which is another crucial dimension affecting the authenticity of speech. Therefore, this study reports our progress in developing EmoFake, an emotion fake audio detection dataset in which the emotional state of the original audio is changed. The audio samples in EmoFake are generated using open-source emotional voice conversion models, intended to simulate potential emotional tampering scenarios in real-world settings. We conducted a series of benchmark experiments on this dataset, and the results show that even advanced fake audio detection models trained on the ASVspoof 2019 LA dataset and the ADD 2022 track 3.2 dataset face challenges with EmoFake. EmoFake is now publicly available.

Hybrid Alignment Training for Large Language Models
Chenglong Wang | Hang Zhou | Kaiyan Chang | Bei Li | Yongyu Mu | Tong Xiao | Tongran Liu | JingBo Zhu
Findings of the Association for Computational Linguistics: ACL 2024

Alignment training is crucial for enabling large language models (LLMs) to cater to human intentions and preferences. It is typically performed in two stages with different objectives: instruction-following alignment and human-preference alignment. However, aligning LLMs with these objectives in sequence suffers from an inherent problem: the objectives may conflict, and the LLMs cannot be guaranteed to align well with both the instructions and human preferences at the same time. In response, we propose a Hybrid Alignment Training (Hbat) approach, based on alternating alignment and modified elastic weight consolidation methods. The basic idea is to alternate between different objectives during alignment training, so that better collaboration can be achieved between the two alignment tasks. We experiment with Hbat on summarization and dialogue tasks. Experimental results show that the proposed Hbat can significantly outperform all baselines. Notably, Hbat yields consistent performance gains over the traditional two-stage alignment training when using both proximal policy optimization and direct preference optimization.

Revealing the Parallel Multilingual Learning within Large Language Models
Yongyu Mu | Peinan Feng | Zhiquan Cao | Yuzhang Wu | Bei Li | Chenglong Wang | Tong Xiao | Kai Song | Tongran Liu | Chunliang Zhang | JingBo Zhu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) can handle multilingual and cross-lingual text within a single input; however, previous works leveraging multilingualism in LLMs primarily focus on using English as the pivot language to enhance language understanding and reasoning. Given that multiple languages can compensate for the losses caused by a single language’s limitations, it’s a natural next step to enrich the model’s learning context through the integration of the original input with its multiple translations. In this paper, we start by revealing that LLMs learn from parallel multilingual input (PMI). Our comprehensive evaluation shows that PMI enhances the model’s comprehension of the input, achieving better performance than conventional in-context learning (ICL). Furthermore, to explore how multilingual processing affects prediction, we examine the activated neurons in LLMs. Surprisingly, involving more languages in the input activates fewer neurons, leading to more focused and effective neural activation patterns. Also, this neural reaction coincidentally mirrors the neuroscience insight about synaptic pruning, highlighting a similarity between artificial and biological ‘brains’.

Prior Constraints-based Reward Model Training for Aligning Large Language Models
Hang Zhou | Chenglong Wang | Yimin Hu | Tong Xiao | Chunliang Zhang | Jingbo Zhu
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

Reinforcement learning with human feedback for aligning large language models (LLMs) trains a reward model typically using ranking loss with comparison pairs. However, the training procedure suffers from an inherent problem: the uncontrolled scaling of reward scores during reinforcement learning due to the lack of constraints while training the reward model. This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate this problem. PCRM incorporates prior constraints (specifically, length ratio and cosine similarity between outputs of each comparison pair) during reward model training to regulate optimization magnitude and control score margins. We comprehensively evaluate PCRM by examining its rank correlation with human preferences and its effectiveness in aligning LLMs via RL. Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling. As another bonus, our method is easily integrated into arbitrary rank-based alignment methods, such as direct preference optimization, and can yield consistent improvement. The code is available at https://github.com/wangclnlp/DeepSpeed-Chat-Extension/tree/PCRM.
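To make the ingredients named in this abstract concrete, the sketch below shows the commonly used pairwise ranking loss for a comparison pair, together with the two prior quantities the abstract mentions (length ratio and cosine similarity between the two outputs). It is not the PCRM objective: the abstract does not say how the priors enter the loss, so the sketch only computes them, and the length ratio definition (shorter over longer) and all function names are assumptions.

```python
from math import exp, log1p, sqrt


def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Standard pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected)).

    Written in a numerically stable form so large margins do not overflow.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return log1p(exp(-margin))
    return -margin + log1p(exp(margin))


def length_ratio(chosen_tokens, rejected_tokens):
    """Shorter length divided by longer length, in [0, 1] (assumed definition)."""
    shorter = min(len(chosen_tokens), len(rejected_tokens))
    longer = max(len(chosen_tokens), len(rejected_tokens))
    return shorter / longer if longer else 1.0


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical comparison pair: reward scores, token lists, and embeddings.
loss = pairwise_ranking_loss(1.2, 0.4)
ratio = length_ratio(["a", "b"], ["a", "b", "c", "d"])
similarity = cosine_similarity([1.0, 0.0], [1.0, 1.0])
```

Nothing in the plain ranking loss bounds the margin between the two reward scores, which is the unconstrained behavior the abstract identifies as the problem.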

2022

CodeExp: Explanatory Code Document Generation
Haotian Cui | Chenglong Wang | Junjie Huang | Jeevana Priya Inala | Todd Mytkowicz | Bo Wang | Jianfeng Gao | Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2022

Developing models that can automatically generate detailed code explanation can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries of code that do not capture implementation-level choices essential for these scenarios. To fill in this gap, we propose the code explanation generation task. We first conducted a human study to identify the criteria for high-quality explanatory docstring for code. Based on that, we collected and refined a large-scale code docstring corpus and formulated automatic evaluation metrics that best match human assessments. Finally, we present a multi-stage fine-tuning strategy and baseline models for the task. Our experiments show that (1) our refined training dataset lets models achieve better performance in the explanation generation tasks compared to larger-scale unrefined data (15x larger), and (2) fine-tuned models can generate well-structured long docstrings comparable to human-written ones. We envision our training dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy can boost future code explanation research. The code and annotated data are available at https://github.com/subercui/CodeExp.

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection
Chenglong Wang | Yi Lu | Yongyu Mu | Yimin Hu | Tong Xiao | Jingbo Zhu
Findings of the Association for Computational Linguistics: EMNLP 2022

Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, we typically have multiple types of knowledge extracted from the teacher model. The problem is to make full use of them to train the student model. Our preliminary study shows that: (1) not all of the knowledge is necessary for learning a good student model, and (2) knowledge distillation can benefit from certain knowledge at different training steps. In response to these, we propose an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation. In addition, we offer a refinement of the training algorithm to ease the computational burden. Experimental results on the GLUE datasets show that our method outperforms several strong knowledge distillation baselines significantly.

Execution-based Evaluation for Data Science Code Generation Models
Junjie Huang | Chenglong Wang | Jipeng Zhang | Cong Yan | Haotian Cui | Jeevana Priya Inala | Colin Clement | Nan Duan
Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances)

Code generation models can benefit data scientists’ productivity by automatically generating code from context and text descriptions. An important measure of the modeling progress is whether a model can generate code that can correctly execute to solve the task. However, due to the lack of an evaluation dataset that directly supports execution-based model evaluation, existing work relies on code surface form similarity metrics (e.g., BLEU, CodeBLEU) for model selection, which can be inaccurate. To remedy this, we introduce ExeDS, an evaluation dataset for execution evaluation for data science code generation tasks. ExeDS contains a set of 534 problems from Jupyter Notebooks, each consisting of code context, task description, reference program, and the desired execution output. With ExeDS, we evaluate the execution performance of five state-of-the-art code generation models that have achieved high surface-form evaluation scores. Our experiments show that models with high surface-form scores do not necessarily perform well on execution metrics, and execution-based metrics can better capture model code generation errors. All the code and data will be released upon acceptance.
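The idea of execution-based evaluation described in this abstract can be sketched as follows: run a generated program and compare what it prints with the desired output, instead of comparing its text with a reference program. This is a minimal sketch and not the ExeDS harness; the function names, the exact-match rule on stripped standard output, and the timeout value are assumptions.

```python
import subprocess
import sys


def run_python(code, timeout=5.0):
    """Run code in a fresh interpreter; return stripped stdout, or None on failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None  # the program raised an error
    return result.stdout.strip()


def execution_match(candidate_code, expected_output):
    """True if the candidate runs successfully and prints the expected output."""
    output = run_python(candidate_code)
    return output is not None and output == expected_output.strip()


# Two candidates with different surface forms but the same behavior,
# and one that looks similar to the first but fails at run time.
candidates = [
    "print(sum([1, 2, 3]))",
    "x = [1, 2, 3]\nprint(x[0] + x[1] + x[2])",
    "print(sum([1, 2, 3]) / 0)",
]
results = [execution_match(code, "6") for code in candidates]
```

The second candidate shares few tokens with the first yet passes, while the third is almost identical in surface form and fails, which is the mismatch between surface-form and execution metrics that the abstract reports. Generated code is untrusted, so a real harness would run it in a sandbox rather than directly on the host.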

2021

This paper describes the NiuTrans neural machine translation systems for the WMT 2021 news translation tasks. We made submissions to 9 language directions: English to and from Chinese, Japanese, Russian, and Icelandic, and English to Hausa. Our primary systems are built on several effective variants of Transformer, e.g., Transformer-DLCL, ODE-Transformer. We also utilize back-translation, knowledge distillation, post-ensemble, and iterative fine-tuning techniques to enhance the model performance further.

The NiuTrans System for the WMT 2021 Efficiency Task
Chenglong Wang | Chi Hu | Yongyu Mu | Zhongxiang Yan | Siming Wu | Yimin Hu | Hang Cao | Bei Li | Ye Lin | Tong Xiao | Jingbo Zhu
Proceedings of the Sixth Conference on Machine Translation

This paper describes the NiuTrans system for the WMT21 translation efficiency task. Following last year’s work, we explore various techniques to improve the efficiency while maintaining translation quality. We investigate the combinations of lightweight Transformer architectures and knowledge distillation strategies. Also, we improve the translation efficiency with graph optimization, low precision, dynamic batching, and parallel pre/post-processing. Putting these together, our system can translate 247,000 words per second on an NVIDIA A100, being 3× faster than our last year’s system. Our system is the fastest and has the lowest memory consumption on the GPU-throughput track. The code, model, and pipeline will be available at NiuTrans.NMT.

RankNAS: Efficient Neural Architecture Search by Pairwise Ranking
Chi Hu | Chenglong Wang | Xiangnan Ma | Xia Meng | Yinqiao Li | Tong Xiao | Jingbo Zhu | Changliang Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This paper addresses the efficiency challenge of Neural Architecture Search (NAS) by formulating the task as a ranking problem. Previous methods require numerous training examples to estimate the accurate performance of architectures, although the actual goal is to find the distinction between “good” and “bad” candidates. Here we do not resort to performance predictors. Instead, we propose a performance ranking method (RankNAS) via pairwise ranking. It enables efficient architecture search using much fewer training examples. Moreover, we develop an architecture selection method to prune the search space and concentrate on more promising candidates. Extensive experiments on machine translation and language modeling tasks show that RankNAS can design high-performance architectures while being orders of magnitude faster than state-of-the-art NAS systems.

2020

The NiuTrans System for WNGT 2020 Efficiency Task
Chi Hu | Bei Li | Yinqiao Li | Ye Lin | Yanyang Li | Chenglong Wang | Tong Xiao | Jingbo Zhu
Proceedings of the Fourth Workshop on Neural Generation and Translation

This paper describes the submissions of the NiuTrans Team to the WNGT 2020 Efficiency Shared Task. We focus on the efficient implementation of deep Transformer models (Wang et al., 2019; Li et al., 2019) using NiuTensor, a flexible toolkit for NLP tasks. We explored the combination of deep encoder and shallow decoder in Transformer models via model compression and knowledge distillation. The neural machine translation decoding also benefits from FP16 inference, attention caching, dynamic batching, and batch pruning. Our systems achieve promising results in both translation quality and efficiency, e.g., our fastest system can translate more than 40,000 tokens per second with an RTX 2080 Ti while maintaining 42.9 BLEU on newstest2018.

The NiuTrans System for the WMT20 Quality Estimation Shared Task
Chi Hu | Hui Liu | Kai Feng | Chen Xu | Nuo Xu | Zefan Zhou | Shiqin Yan | Yingfeng Luo | Chenglong Wang | Xia Meng | Tong Xiao | Jingbo Zhu
Proceedings of the Fifth Conference on Machine Translation

This paper describes the submissions of the NiuTrans Team to the WMT 2020 Quality Estimation Shared Task. We participated in all tasks and all language pairs. We explored the combination of transfer learning, multi-task learning and model ensemble. Results on multiple tasks show that deep transformer machine translation models and multilingual pretraining methods significantly improve translation quality estimation performance. Our system achieved remarkable results in multiple level tasks, e.g., our submissions obtained the best results on all tracks in the sentence-level Direct Assessment task.

2018

NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System
Xi Victoria Lin | Chenglong Wang | Luke Zettlemoyer | Michael D. Ernst
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Natural Language to Structured Query Generation via Meta-Learning
Po-Sen Huang | Chenglong Wang | Rishabh Singh | Wen-tau Yih | Xiaodong He
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

In conventional supervised training, a model is trained to fit all the training examples. However, having a monolithic model may not always be the best strategy, as examples could vary widely. In this work, we explore a different learning protocol that treats each example as a unique pseudo-task, by reducing the original learning problem to a few-shot meta-learning scenario with the help of a domain-dependent relevance function. When evaluated on the WikiSQL dataset, our approach leads to faster convergence and achieves 1.1%–5.4% absolute accuracy gains over the non-meta-learning counterparts.

Co-authors

Chunliang Zhang 4

Jeevana Priya Inala 2

Zhongxiang Yan 2

Colin Clement 1

Yongfeng Dong 1

Zhicheng Dou (窦志成) 1

Michael D. Ernst 1

Changliang Li 1

Xi Victoria Lin 1

Todd Mytkowicz 1

Rishabh Singh 1

Luke Zettlemoyer 1

Jingnan Zhang 1

ChuYuan Zhang 1

Venues