Wenpeng Lu - ACL Anthology

2025

CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
Yongheng Zhang | Xu Liu | Ruoxi Zhou | Qiguang Chen | Hao Fei | Wenpeng Lu | Libo Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance their large-scale deployment in real-world applications. Nevertheless, current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.

Constructing Your Model’s Value Distinction: Towards LLM Alignment with Anchor Words Tuning
Zhen Yang | Ping Jian | Chengzhi Li | Chenxu Wang | Xinyue Zhang | Wenpeng Lu
Findings of the Association for Computational Linguistics: EMNLP 2025

With the widespread application of large language models (LLMs), aligning LLMs with human values has emerged as a critical challenge. For alignment, we expect LLMs to be honest, positive, harmless, and so on. LLMs appear to be capable of generating the desired outputs after alignment tuning, such as preference tuning via reinforcement learning from human feedback (RLHF). However, this raises a question: **after alignment, do LLMs genuinely obtain a value distinction between positives and negatives, beyond the generation of positive outputs?** In this work, we start by investigating this question from the token distribution perspective. Our findings reveal that, compared to their unaligned versions, LLMs after alignment exhibit a larger logits gap between positive and negative tokens at each generation step, which suggests that LLMs do obtain a value distinction between positives and negatives after alignment. This also motivates us to achieve alignment by directly constructing such a value distinction, thus alleviating the excessive reliance on computational resources required by training-time alignment. Specifically, we propose a representation editing method that intervenes on the last hidden representation by amplifying the logits difference between positive and negative tokens (defined as anchor words). Experimental results demonstrate that the proposed method not only achieves effective alignment, but also requires fewer computational resources than training-time alignment methods.
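
As a rough illustration of the "logits gap" idea described in this abstract, the sketch below compares the gap between positive and negative anchor tokens for two toy logit vectors. Everything here is an assumption made for illustration: the function name `logits_gap`, the use of a mean over each anchor set, the token ids, and the numbers are all hypothetical, and the paper's exact definition may differ.

```python
def logits_gap(logits, positive_ids, negative_ids):
    """Mean logit of positive anchor tokens minus mean logit of negative ones.

    Averaging over each anchor set is an assumption for this sketch,
    not necessarily the definition used in the paper.
    """
    pos = sum(logits[i] for i in positive_ids) / len(positive_ids)
    neg = sum(logits[i] for i in negative_ids) / len(negative_ids)
    return pos - neg


# Toy logits over a 6-token vocabulary at one generation step (made-up numbers).
unaligned = [0.2, 1.0, 0.9, 0.1, 1.1, 0.8]
aligned = [0.2, 2.5, 2.1, 0.1, -0.4, -0.9]

# Hypothetical anchor-word token ids.
positive_ids = [1, 2]
negative_ids = [4, 5]

gap_unaligned = logits_gap(unaligned, positive_ids, negative_ids)
gap_aligned = logits_gap(aligned, positive_ids, negative_ids)
```

In this toy setting the aligned vector yields the larger gap, which is the kind of difference the abstract reports between aligned and unaligned models.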

RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models
Luyang Zhang | Shuaimin Li | Yishuo Li | Kunpeng Kang | Kaiyuan Zhang | Cong Wang | Wenpeng Lu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Accurately evaluating the word sense disambiguation (WSD) capabilities of large language models (LLMs) remains challenging, as existing studies primarily rely on single-task evaluations and classification-based metrics that overlook the fundamental differences between generative LLMs and traditional classification models. To bridge this gap, we propose RoDEval, the first comprehensive evaluation framework specifically tailored for assessing LLM-based WSD methods. RoDEval introduces four novel metrics: Disambiguation Scope, Disambiguation Robustness, Disambiguation Reliability, and Definition Generation Quality Score, enabling a multifaceted evaluation of LLMs’ WSD capabilities. Experimental results using RoDEval across five mainstream LLMs uncover significant limitations in their WSD performance. Specifically, incorrect definition selections in multiple-choice WSD tasks stem not from simply neglecting or forgetting the correct options, but rather from incomplete acquisition of all the senses of polysemous words. Meanwhile, disambiguation reliability is often compromised by the models’ persistent overconfidence. In addition, inherent biases continue to affect performance, and scaling up model parameters alone fails to meaningfully enhance their ability to generate accurate sense definitions. These findings provide actionable insights for enhancing LLMs’ WSD capabilities. The source code and evaluation scripts are open-sourced at https://github.com/DayDream405/RoDEval.

MADAWSD: Multi-Agent Debate Framework for Adversarial Word Sense Disambiguation
Kaiyuan Zhang | Qian Liu | Luyang Zhang | Chaoqun Zheng | Shuaimin Li | Bing Xu | Muyun Yang | Xinxiao Qiao | Wenpeng Lu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Word sense disambiguation (WSD) is a fundamental yet challenging task in natural language processing. In recent years, the advent of large language models (LLMs) has led to significant advancements in regular WSD tasks. However, most existing LLMs face two major issues that hinder their performance in WSD. Firstly, these models are often prone to misclassifying the correct meaning of an ambiguous word when confronted with contexts containing adversarial information. Secondly, there is a lack of sufficient adversarial WSD datasets, which severely limits the development and evaluation of adversarial WSD systems. To address these gaps, we propose a novel Multi-Agent Debate framework for Adversarial Word Sense Disambiguation (MADAWSD). The MADAWSD framework simulates a real-world debate environment where multiple agent roles, namely, the Debater, Moderator, Consensus-seeker, and Judge, engage in discussions about ambiguous words in the context of adversarial information. Through a collaborative mechanism among these agents, it achieves accurate WSD. Additionally, a novel dataset for Chinese adversarial WSD has been constructed, focusing on improving and evaluating the performance of WSD models in the Chinese language. Extensive experiments on both English and Chinese adversarial WSD datasets demonstrate that MADAWSD can seamlessly integrate with existing LLMs and significantly enhance their performance, showcasing broad generality and outstanding effectiveness.

Plan Dynamically, Express Rhetorically: A Debate-Driven Rhetorical Framework for Argumentative Writing
Xueguan Zhao | Wenpeng Lu | Chaoqun Zheng | Weiyu Zhang | Jiasheng Si | Deyu Zhou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Argumentative essay generation (AEG) is a complex task that requires advanced semantic understanding, logical reasoning, and organized integration of perspectives. Despite showing promising performance, current efforts often overlook the dynamic and hierarchical nature of structural argumentative planning and struggle with flexible rhetorical expression, leading to limited argument divergence and rhetorical optimization. Inspired by human debate behavior and Bitzer’s rhetorical situation theory, we propose a debate-driven rhetorical framework for argumentative writing. Its uniqueness lies in three aspects: (1) it dynamically assesses the divergence of viewpoints and progressively reveals the hierarchical outline of arguments based on a depth-then-breadth paradigm, improving the perspective divergence within argumentation; (2) it simulates human debate through iterative defender-attacker interactions, improving the logical coherence of arguments; (3) it incorporates Bitzer’s rhetorical situation theory to flexibly select appropriate rhetorical techniques, enabling flexible rhetorical expression. Experiments on four benchmarks validate that our approach significantly improves logical depth, argumentative diversity, and rhetorical persuasiveness over existing state-of-the-art models.

A Survey on Training-free Alignment of Large Language Models
Birong Pan | Yongqi Li | Weiyu Zhang | Wenpeng Lu | Mayi Xu | Shen Zhou | Yuanyuan Zhu | Ming Zhong | Tieyun Qian
Findings of the Association for Computational Linguistics: EMNLP 2025

The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques, which leverage in-context learning, decoding-time adjustments, and post-generation corrections, offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by the stages of **pre-decoding**, **in-decoding**, and **post-decoding**. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers guidance for practitioners and advances the development of safer and more reliable LLMs.

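CCL25-Eval Task 9 Overview: Evaluation of TCM Syndrome and Disease Differentiation and Chinese Herbal Prescription Generation
Cong Wang | Zhizhuo Zhao | Yishuo Li | Hongjiao Guan | Yifei Wang | Zhenyu Li | Wenpeng Lu
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

The evaluation task on traditional Chinese medicine (TCM) syndrome and disease differentiation and Chinese herbal prescription generation focuses on the TCM principle of "treatment based on syndrome differentiation". The task was jointly initiated by Qilu University of Technology (Shandong Academy of Sciences) and the Affiliated Hospital of Shandong University of Traditional Chinese Medicine. Based on real medical records, it builds TCM-TBOSD, a public dataset for the full "syndrome differentiation and treatment" workflow, covering 10 TCM syndrome types, 4 TCM diseases, and 381 common Chinese herbs. The evaluation comprises two subtasks, multi-label TCM syndrome and disease differentiation and Chinese herbal prescription recommendation, and aims to systematically assess the modeling and reasoning abilities of large models across the entire TCM diagnosis and treatment process. The evaluation drew wide attention from academia and industry: 123 teams participated, 35 teams advanced to the second round, and 8 high-quality technical reports were ultimately submitted. The results show that large language models exhibit good adaptability and development potential on TCM tasks, providing a feasible path and technical reference for intelligent TCM. Detailed information about the evaluation task is available on its website.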

A Chain-of-Task Framework for Instruction Tuning of LLMs Based on Chinese Grammatical Error Correction
Xinpeng Liu | Bing Xu | Muyun Yang | Hailong Cao | Conghui Zhu | Tiejun Zhao | Wenpeng Lu
Proceedings of the 31st International Conference on Computational Linguistics

Over-correction is a critical issue for large language models (LLMs) addressing the Grammatical Error Correction (GEC) task, especially for Chinese. This paper proposes a Chain-of-Task (CoTask) framework to reduce over-correction. The CoTask framework is applied as multi-task instruction tuning of LLMs by decomposing the process of grammatical error analysis to design auxiliary tasks and adjusting the types and combinations of training tasks. A supervised fine-tuning (SFT) strategy is also presented to enhance the performance of LLMs, together with an algorithm for automatic dataset annotation to avoid additional manual costs. Experimental results demonstrate that our method achieves new state-of-the-art results on both FCGEC (in-domain) and NaCGEC (out-of-domain) test sets.

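CCL25-Eval Task 8 Overview: ICD Diagnosis Coding for Chinese Electronic Medical Records
Zhenpeng Liang | Chuanlong Li | Ying Lian | Guoqiang Chen | Hongjiao Guan | Wenpeng Lu
Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025)

The evaluation on International Classification of Diseases (ICD) diagnosis coding for Chinese electronic medical records was held as part of the 24th China National Conference on Computational Linguistics (CCL). It focuses on applying natural language processing to intelligent healthcare, aiming to automatically analyze key clinical manifestations in real, de-identified electronic medical record text and to accurately predict and assign ICD codes for the primary diagnosis and other diagnoses, thereby helping clinicians and professional coders improve the accuracy and efficiency of coding work. The evaluation was run on the Alibaba Cloud Tianchi platform and drew broad attention and active participation from academia and industry. In total, 445 teams registered, of which 85 and 36 teams submitted valid results on leaderboards A and B, respectively. Finally, 8 top-performing teams were invited to write and share technical reports, contributing valuable experience toward technical progress and methodological innovation in this field. Detailed information about the evaluation is available on the relevant release pages.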

2024

CHECKWHY: Causal Fact Verification via Argument Structure
Jiasheng Si | Yibo Zhao | Yingjie Zhu | Haiyang Zhu | Wenpeng Lu | Deyu Zhou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the growing complexity of fact verification tasks, the concern with “thoughtful” reasoning capabilities is increasing. However, recent fact verification benchmarks mainly focus on checking a narrow scope of semantic factoids within claims and lack an explicit logical reasoning process. In this paper, we introduce CHECKWHY, a challenging dataset tailored to a novel causal fact verification task: checking the truthfulness of the causal relation within claims through rigorous reasoning steps. CHECKWHY consists of over 19K “why” claim-evidence-argument structure triplets with supports, refutes, and not enough info labels. Each argument structure is composed of connected evidence, representing the reasoning process that begins with foundational evidence and progresses toward claim establishment. Through extensive experiments on state-of-the-art models, we validate the importance of incorporating the argument structure for causal fact verification. Moreover, the automated and human evaluation of argument structure generation reveals the difficulty in producing satisfying argument structure by fine-tuned models or Chain-of-Thought prompted LLMs, leaving considerable room for future improvements.

Extractive Medical Entity Disambiguation with Memory Mechanism and Memorized Entity Information
Guobiao Zhang | Xueping Peng | Tao Shen | Guodong Long | Jiasheng Si | Libo Qin | Wenpeng Lu
Findings of the Association for Computational Linguistics: EMNLP 2024

Medical entity disambiguation (MED) aims to ground medical mentions in text with ontological entities in knowledge bases (KBs). A notable challenge of MED is that a long medical text usually contains mentions of multiple entities with intricate correlations. However, limited by computation overhead, many existing methods consider only a single candidate entity mention during the disambiguation process. As such, they focus only on a local optimum for MED and ignore that the disambiguation of a single mention can be boosted by richer context from the disambiguation of other mentions, thereby missing the global optimum over the entity combination in the text. Motivated by this, we propose a new approach called Extractive Medical Entity Disambiguation with Memory Mechanism and Memorized Entity Information (M3E). Specifically, we reformulate MED as a text extraction task, which simultaneously accepts the context of medical mentions, all possible candidate entities, and entity definitions, and it is then trained to extract the text span corresponding to the correct entity. Building on our new formulation, 1) to alleviate the computation overhead from the enriched context, we devise a memory mechanism module that performs memory caching, retrieval, fusion and cross-network residual; and 2) to utilize the disambiguation clues from other mentions, we design an auxiliary disambiguation module that employs a gating mechanism to assist the disambiguation of remaining mentions. Extensive experiments on two benchmark datasets demonstrate the superiority of M3E over the state-of-the-art MED methods on all metrics.

Self-Evaluation of Large Language Model based on Glass-box Features
Hui Huang | Yingqi Qu | Jing Liu | Muyun Yang | Bing Xu | Tiejun Zhao | Wenpeng Lu
Findings of the Association for Computational Linguistics: EMNLP 2024

The proliferation of open-source Large Language Models (LLMs) underscores the pressing need for evaluation methods. Existing works primarily rely on external evaluators, focusing on training and prompting strategies. However, a crucial aspect, model-aware glass-box features, is overlooked. In this study, we explore the utility of glass-box features under the scenario of self-evaluation, namely applying an LLM to evaluate its own output. We investigate various glass-box feature groups and discover that the softmax distribution serves as a reliable quality indicator for self-evaluation. Experimental results on public benchmarks validate the feasibility of self-evaluation of LLMs using glass-box features.
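
To make the idea of using the softmax distribution as a quality signal more concrete, here is a minimal sketch that scores a generated sequence by the mean log-probability of its chosen tokens. This particular feature, the function names, and the toy logits are assumptions chosen for illustration; the abstract does not specify which softmax-derived features the paper uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_token_log_prob(step_logits, chosen_ids):
    """Average log-probability the model assigned to the tokens it emitted.

    step_logits: one logit vector per generation step.
    chosen_ids:  index of the emitted token at each step.
    """
    log_probs = [
        math.log(softmax(logits)[token])
        for logits, token in zip(step_logits, chosen_ids)
    ]
    return sum(log_probs) / len(log_probs)


# Toy example: a peaked distribution versus a flat one (made-up numbers).
confident_score = mean_token_log_prob([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [0, 1])
uncertain_score = mean_token_log_prob([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 1])
```

A higher (less negative) score means the model was more confident in its own output, which is one simple way a softmax-based self-evaluation signal could be read.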

Medical Entity Disambiguation with Medical Mention Relation and Fine-grained Entity Knowledge
Wenpeng Lu | Guobiao Zhang | Xueping Peng | Hongjiao Guan | Shoujin Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Medical entity disambiguation (MED), the task of mapping ambiguous medical mentions to structured candidate medical entities from knowledge bases (KBs), plays a crucial role in natural language processing and biomedical domains. However, existing methods for MED often fail to fully utilize the knowledge within medical KBs and overlook essential interactions between medical mentions and candidate entities, resulting in knowledge- and interaction-inefficient modeling and suboptimal disambiguation performance. To address these limitations, this paper proposes a novel approach, MED with Medical Mention Relation and Fine-grained Entity Knowledge (MMR-FEK). Specifically, MMR-FEK incorporates a mention relation fusion module and an entity knowledge fusion module, followed by an interaction module. The former employs a relation graph convolutional network to fuse mention relation information between medical mentions to enhance mention representations, while the latter leverages an attention mechanism to fuse synonym and type information of candidate entities to enhance entity representations. Afterwards, an interaction module is designed to employ a bidirectional attention mechanism to capture interactions between mentions and entities to generate the matching representation. Extensive experiments on two publicly available real-world datasets demonstrate MMR-FEK’s superiority over state-of-the-art (SOTA) MED baselines across all metrics. Our source code is publicly available.

Denoising Rationalization for Multi-hop Fact Verification via Multi-granular Explainer
Jiasheng Si | Yingjie Zhu | Wenpeng Lu | Deyu Zhou
Findings of the Association for Computational Linguistics: EMNLP 2024

The success of deep learning models on multi-hop fact verification has prompted researchers to understand the behavior behind their veracity. One feasible way is erasure search: obtaining the rationale by entirely removing a subset of input without compromising verification accuracy. Despite extensive exploration, current rationalization methods struggle to discern nuanced composition within the correlated evidence, which inevitably leads to noise rationalization in multi-hop scenarios. To address this issue, this paper explores the multi-granular rationale extraction method, aiming to realize the denoising rationalization for multi-hop fact verification. Specifically, given a pretrained veracity prediction model, two independent external explainers are introduced and trained collaboratively to enhance the discriminating ability by imposing varied constraints. Meanwhile, three key properties (Fidelity, Consistency, Salience) are introduced to regularize the denoising and faithful rationalization process. Additionally, a new Noiselessness metric is proposed to measure the purity of the rationales. Experimental results on three multi-hop fact verification datasets show that the proposed approach outperforms 12 baselines.

Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information
Yongheng Zhang | Qiguang Chen | Jingxuan Zhou | Peng Wang | Jiasheng Si | Jin Wang | Wenpeng Lu | Libo Qin
Findings of the Association for Computational Linguistics: EMNLP 2024

Chain-of-Thought (CoT) has become a vital technique for enhancing the performance of Large Language Models (LLMs), attracting increasing attention from researchers. One stream of approaches focuses on the iterative enhancement of LLMs by continuously verifying and refining their reasoning outputs for desired quality. Despite its impressive results, this paradigm faces two critical issues: (1) Simple verification methods: The current paradigm relies solely on a single verification method. (2) Wrong Information Ignorance: Traditional paradigms directly ignore wrong information during reasoning and refine the logic paths from scratch each time. To address these challenges, we propose Wrong-of-Thought (WoT), which includes two core modules: (1) Multi-Perspective Verification: A multi-perspective verification method for accurately refining the reasoning process and result, and (2) Wrong Information Utilization: Utilizing wrong information to alert LLMs and reduce the probability of LLMs making the same mistakes. Experiments on 8 popular datasets and 5 LLMs demonstrate that WoT surpasses all previous baselines. In addition, WoT exhibits powerful capabilities in difficult computation tasks.

2022

Word Sense Disambiguation with Knowledge-Enhanced and Local Self-Attention-based Extractive Sense Comprehension
Guobiao Zhang | Wenpeng Lu | Xueping Peng | Shoujin Wang | Baoshuo Kan | Rui Yu
Proceedings of the 29th International Conference on Computational Linguistics

Word sense disambiguation (WSD), identifying the most suitable meaning of ambiguous words in the given contexts according to a predefined sense inventory, is one of the most classical and challenging tasks in natural language processing. Benefiting from the powerful ability of deep neural networks, WSD has achieved great advances in recent years. Reformulating WSD as a text span extraction task is an effective approach, which accepts a sentence context of an ambiguous word together with all definitions of its candidate senses simultaneously, and requires extracting the text span corresponding to the correct sense. However, the approach merely depends on a short definition to learn sense representation, which neglects abundant semantic knowledge from related senses and leads to data-inefficient learning and suboptimal WSD performance. To address these limitations, we propose a novel WSD method with Knowledge-Enhanced and Local Self-Attention-based Extractive Sense Comprehension (KELESC). Specifically, a knowledge-enhanced method is proposed to enrich semantic representation by incorporating additional examples and definitions of the related senses in WordNet. Then, in order to avoid the huge computing complexity induced by the additional information, a local self-attention mechanism is utilized to constrain attention to be local, which allows longer input texts without large-scale computing burdens. Extensive experimental results demonstrate that KELESC achieves better performance than baseline models on public benchmark datasets.

2020

Intra-Correlation Encoding for Chinese Sentence Intention Matching
Xu Zhang | Yifeng Li | Wenpeng Lu | Ping Jian | Guoqiang Zhang
Proceedings of the 28th International Conference on Computational Linguistics

Sentence intention matching is vital for natural language understanding. In the Chinese sentence intention matching task in particular, the ambiguity of Chinese words makes missing or confused semantics more likely during encoding. Although existing methods have enriched text representations through pre-trained word embeddings to address this problem, the particularities of Chinese text mean that different granularities of pre-trained word embeddings affect the semantic description of a piece of text. In this paper, we propose an effective approach that combines character-granularity and word-granularity features to perform sentence intention matching, and we utilize soft alignment attention to enhance the local information of sentences at the corresponding levels. The proposed method can capture sentence feature information from multiple perspectives, as well as correlation information between different levels of sentences. Evaluated on the BQ and LCQMC datasets, our model achieves remarkable results and demonstrates better or comparable performance relative to BERT-based models.

2017

QLUT at SemEval-2017 Task 2: Word Similarity Based on Word Embedding and Knowledge Base
Fanqing Meng | Wenpeng Lu | Yuteng Zhang | Ping Jian | Shumin Shi | Heyan Huang
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes our system submissions to Task 2 of SemEval-2017. We take part in Subtask 1 of this task, which is an English monolingual subtask. The task is designed to evaluate the semantic similarity of two linguistic items. Runs are assessed by standard Pearson and Spearman correlation against the official gold standard set. The best performance of our runs is 0.781 (Final). Our runs mainly rely on word embeddings and a knowledge-based method. The results demonstrate that the combined method is effective for computing word similarity, while the word embeddings and the knowledge-based technique each need further refinement in their details.
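
For readers unfamiliar with the two metrics named in this abstract, the following is a small self-contained sketch of Pearson and Spearman correlation between system scores and gold scores. The function names and the toy data are assumptions for illustration only; this is not the official SemEval scorer.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Toy data (made up): system scores are monotone in gold but not linear.
gold = [1.0, 2.0, 3.0, 4.0]
system = [1.0, 4.0, 9.0, 16.0]
pearson_score = pearson(gold, system)
spearman_score = spearman(gold, system)
```

On this toy data the Spearman score is perfect because the ordering is preserved, while the Pearson score falls slightly below 1 because the relationship is not linear.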

QLUT at SemEval-2017 Task 1: Semantic Textual Similarity Based on Word Embeddings
Fanqing Meng | Wenpeng Lu | Yuteng Zhang | Jinyong Cheng | Yuehan Du | Shuwang Han
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper reports the details of our submissions to Task 1 of SemEval-2017. This task aims at assessing the semantic textual similarity of two sentences or texts. We submit three unsupervised systems based on word embeddings. The runs differ in the preprocessing applied to the evaluation data. The best Pearson correlation achieved by these systems is 0.6887. Unsurprisingly, the results of our runs demonstrate that data preprocessing, such as tokenization, lemmatization, extraction of content words, and removal of stop words, is helpful and plays a significant role in improving model performance.

2016

BIT at SemEval-2016 Task 1: Sentence Similarity Based on Alignments and Vector with the Weight of Information Content
Hao Wu | Heyan Huang | Wenpeng Lu
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Co-authors

Muyun Yang (杨沐昀) 3

Guobiao Zhang 3

Qiguang Chen (陈麒光) 2

He-Yan Huang (黄河燕) 2

Yongheng Zhang 2

Kaiyuan Zhang 2

Tiejun Zhao (赵铁军) 2

Chaoqun Zheng 2

Guoqiang Chen 1

Jinyong Cheng 1

Zhenpeng Liang 1

Guoqiang Zhang 1

Jingxuan Zhou 1
