Wei Cheng - ACL Anthology

Wei Cheng

2026

DeepSieve: Information Sieving via LLM-as-a-Knowledge-Router
Minghao Guo | Qingcheng Zeng | Xujiang Zhao | Yanchi Liu | Wenchao Yu | Mengnan Du | Haifeng Chen | Wei Cheng
Findings of the Association for Computational Linguistics: EACL 2026

Large Language Models (LLMs) excel at many reasoning tasks but struggle with knowledge-intensive queries due to their inability to dynamically access up-to-date or domain-specific information. Retrieval-Augmented Generation (RAG) has emerged as a promising solution, enabling LLMs to ground their responses in external sources. However, existing RAG methods lack fine-grained control over both the query and source sides, often resulting in noisy retrieval and shallow reasoning. In this work, we introduce DeepSieve, an agentic RAG framework that incorporates information sieving via LLM-as-a-knowledge-router. DeepSieve decomposes complex queries into structured sub-questions and recursively routes each to the most suitable knowledge source, filtering irrelevant information through a multi-stage distillation process. Our design emphasizes modularity, transparency, and adaptability, leveraging recent advances in agentic system design. Experiments on multi-hop QA tasks across heterogeneous sources demonstrate improved reasoning depth, retrieval precision, and interpretability over conventional RAG approaches.

Multi-Agent Procedural Graph Extraction with Structural and Logical Refinement
Wangyang Ying | Yanchi Liu | Xujiang Zhao | Wei Cheng | Zhengzhang Chen | Wenchao Yu | Yanjie Fu | Haifeng Chen
Findings of the Association for Computational Linguistics: EACL 2026

Automatically extracting workflows as procedural graphs from natural language is a promising yet underexplored task that requires ensuring both structural validity and logical alignment. Recent advances in large language models (LLMs) show potential for graph extraction, but often yield ill-formed structures or misinterpret logical constructs such as gateways. We introduce a multi-agent framework that treats procedural graph extraction as a multi-round reasoning process with structural and logical refinement agents. The framework operates in three iterative stages: (1) an LLM-based graph extraction phase, (2) a structural feedback phase where a simulation agent diagnoses and explains structural issues, and (3) a logical feedback phase where a semantic agent aligns semantics between flow logic and linguistic cues in the source text. Important feedback is prioritized and expressed in natural language, which is injected into the next-round prompt, enabling interpretable and controllable refinement. This modular design allows agents to target distinct error types without supervision or parameter updates. Experiments demonstrate that the framework achieves substantial improvements in both structural correctness and logical consistency over strong baselines.

2025

Uncertainty Propagation on LLM Agent
Qiwei Zhao | Dong Li | Yanchi Liu | Wei Cheng | Yiyou Sun | Mika Oishi | Takao Osaki | Katsushi Matsuda | Huaxiu Yao | Chen Zhao | Haifeng Chen | Xujiang Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) integrated into multi-step agent systems enable complex decision-making processes across various applications. However, their outputs often lack reliability, making uncertainty estimation crucial. Existing uncertainty estimation methods primarily focus on final-step outputs, which fail to account for cumulative uncertainty over the multi-step decision-making process and the dynamic interactions between agents and their environments. To address these limitations, we propose SAUP (Situation Awareness Uncertainty Propagation), a novel framework that propagates uncertainty through each step of an LLM-based agent’s reasoning process. SAUP incorporates situational awareness by assigning situational weights to each step’s uncertainty during the propagation. Our method, compatible with various one-step uncertainty estimation techniques, provides a comprehensive and accurate uncertainty measure. Extensive experiments on benchmark datasets demonstrate that SAUP significantly outperforms existing state-of-the-art methods, achieving up to 20% improvement in AUROC.
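
The idea of weighting each step's uncertainty before combining the steps can be illustrated with a minimal sketch. This is not the SAUP formulation from the paper: the weighted-mean aggregation, the function name `propagate_uncertainty`, and the example values are all assumptions chosen for illustration.

```python
from typing import Sequence


def propagate_uncertainty(step_uncertainties: Sequence[float],
                          situational_weights: Sequence[float]) -> float:
    """Combine per-step uncertainties into one trajectory-level score.

    Assumption: a plain weighted mean stands in for whatever
    aggregation the actual method uses.
    """
    if len(step_uncertainties) != len(situational_weights):
        raise ValueError("one weight per step is required")
    total_weight = sum(situational_weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(u * w for u, w in zip(step_uncertainties, situational_weights))
    return weighted / total_weight


# Hypothetical three-step agent trajectory: the last step is judged
# more consequential, so it receives a larger situational weight.
agent_uncertainty = propagate_uncertainty([0.2, 0.4, 0.6], [1.0, 1.0, 2.0])
```

Any one-step uncertainty estimator could supply the values in `step_uncertainties`; only the aggregation is sketched here.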

Beyond Sequences: Two-dimensional Representation and Dependency Encoding for Code Generation
Xiangyu Zhang | Yu Zhou | Guang Yang | Wei Cheng | Taolue Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The advent of large language models has significantly advanced automatic code generation, transforming the way programmers write code. Inspired by natural language processing, mainstream code generation approaches represent code as a linear sequence of tokens. In this paper, we propose to represent code snippets as two-dimensional entities, where both code lines and tokens within lines are explicitly modeled. This representation allows us to capture the hierarchical and spatial structure of code, especially the dependencies between code lines. Our method CoDE introduces a dependency encoding approach that leverages dictionary learning to perform semantic matching between code lines. As such, it avoids the reliance on strict position indices, leading to better generalization to code with diverse context and lengths. We thoroughly evaluate CoDE based on four categories of tasks. The experimental results showcase its generalizability, context understanding and retrieval, as well as interpretability in code generation.

Position Really Matters: Towards a Holistic Approach for Prompt Tuning
Xianjun Yang | Wei Cheng | Xujiang Zhao | Wenchao Yu | Linda Ruth Petzold | Haifeng Chen
Findings of the Association for Computational Linguistics: NAACL 2025

Prompt tuning is highly effective in efficiently extracting knowledge from foundation models, encompassing language, vision, and vision-language models. However, the efficacy of employing fixed soft prompts with a predetermined position for concatenation with inputs for all instances, irrespective of their inherent disparities, remains uncertain. Variables such as the position, length, and representations of prompts across diverse instances and tasks can substantially influence the performance of prompt tuning. We first provide a theoretical analysis, revealing that optimizing the position of the prompt to encompass the input can capture additional semantic information that traditional prefix or postfix prompt tuning methods fail to capture. Then, we present a holistic parametric prompt tuning strategy that dynamically determines different factors of prompts based on specific tasks or instances. Experimental results underscore the significant performance improvement achieved by dynamic prompt tuning across a wide range of tasks, including NLP, vision recognition, and vision-language tasks. Furthermore, we establish the universal applicability of our approach under full-data, few-shot, and multitask settings.

MixLLM: Dynamic Routing in Mixed Large Language Models
Xinyuan Wang | Yanchi Liu | Wei Cheng | Xujiang Zhao | Zhengzhang Chen | Wenchao Yu | Yanjie Fu | Haifeng Chen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) have recently shown potential for artificial general intelligence; however, their usage is costly and incurs high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments that best trade off response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4’s quality at 24.18% of the cost under the time constraint).
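
The quality, cost, and latency trade-off in query-LLM assignment can be made concrete with a small sketch. It is not the MixLLM system: there is no contextual bandit, no learned predictor, and no continual training here. The `Candidate` type, the `route` function, the linear score `quality - cost_weight * cost`, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    predicted_quality: float   # higher is better
    predicted_cost: float      # e.g. dollars per query
    predicted_latency: float   # seconds


def route(candidates: List[Candidate],
          cost_weight: float,
          max_latency: float) -> Optional[Candidate]:
    """Pick the candidate with the best quality-minus-weighted-cost score
    among those predicted to answer within the latency budget."""
    feasible = [c for c in candidates if c.predicted_latency <= max_latency]
    if not feasible:
        return None  # no model meets the time constraint
    return max(feasible,
               key=lambda c: c.predicted_quality - cost_weight * c.predicted_cost)


# Hypothetical per-query predictions for three models.
pool = [
    Candidate("large", predicted_quality=0.90, predicted_cost=0.030, predicted_latency=4.0),
    Candidate("small", predicted_quality=0.80, predicted_cost=0.002, predicted_latency=1.0),
    Candidate("slow", predicted_quality=0.95, predicted_cost=0.050, predicted_latency=12.0),
]

cost_sensitive_choice = route(pool, cost_weight=5.0, max_latency=5.0)
quality_only_choice = route(pool, cost_weight=0.0, max_latency=5.0)
```

Raising `cost_weight` shifts the choice toward cheaper models, while `max_latency` acts as a hard constraint rather than a term in the score.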

2024

Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt
Peter Anderson | Mano Vikash Janardhanan | Jason He | Wei Cheng | Charlie Flanagan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Financial documents are filled with specialized terminology, arcane jargon, and curious acronyms that pose challenges for general-purpose text embeddings. Yet, few text embeddings specialized for finance have been reported in the literature, perhaps in part due to a lack of public datasets and benchmarks. We present BAM embeddings, a set of text embeddings finetuned on a carefully constructed dataset of 14.3M query-passage pairs including both public and proprietary financial documents. Demonstrating the benefits of domain-specific training, BAM embeddings achieve Recall@1 of 62.8% on a held-out test set, vs. only 39.2% for the best general-purpose text embedding from OpenAI. Further, BAM embeddings increase question answering accuracy by 8% on FinanceBench and show increased sensitivity to the finance-specific elements that are found in detailed, forward-looking and company and date-specific queries. To support further research we describe our approach in detail, quantify the importance of hard negative mining and dataset scale, and publicly release our embeddings.
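
For readers unfamiliar with the metric, Recall@1 can be computed as in the sketch below. It assumes exactly one relevant passage per query, which matches a query-passage pair setup but is an assumption about the evaluation; the function name and the toy data are hypothetical.

```python
from typing import Dict, List


def recall_at_1(rankings: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Fraction of queries whose top-ranked passage is the gold passage.

    rankings maps a query id to passage ids sorted by descending score;
    gold maps a query id to its single relevant passage id.
    """
    if not gold:
        raise ValueError("at least one labeled query is required")
    hits = 0
    for query_id, gold_passage in gold.items():
        ranked = rankings.get(query_id, [])
        if ranked and ranked[0] == gold_passage:
            hits += 1
    return hits / len(gold)


# Toy example: three of four queries rank the gold passage first.
toy_rankings = {
    "q1": ["p1", "p9"],
    "q2": ["p2", "p3"],
    "q3": ["p7", "p3"],
    "q4": ["p4"],
}
toy_gold = {"q1": "p1", "q2": "p2", "q3": "p3", "q4": "p4"}
toy_score = recall_at_1(toy_rankings, toy_gold)
```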

A Survey on Detection of LLMs-Generated Content
Xianjun Yang | Liangming Pan | Xuandong Zhao | Haifeng Chen | Linda Ruth Petzold | William Yang Wang | Wei Cheng
Findings of the Association for Computational Linguistics: EMNLP 2024

The burgeoning capabilities of advanced large language models (LLMs) such as ChatGPT have led to an increase in synthetic content generation with implications across a variety of sectors, including media, cybersecurity, public discourse, and education. As such, the ability to detect LLMs-generated content has become of paramount importance. We aim to provide a detailed overview of existing detection strategies and benchmarks, scrutinizing their differences and identifying key challenges and prospects in the field, advocating for more adaptable and robust models to enhance detection accuracy. We also posit the necessity for a multi-faceted approach to defend against various attacks to counter the rapidly advancing capabilities of LLMs. To the best of our knowledge, this work is the first comprehensive survey on the detection in the era of LLMs. We hope it will provide a broad understanding of the current landscape of LLMs-generated content detection, and we have maintained a website to consistently update the latest research as a guiding reference for researchers and practitioners.

TrustAgent: Towards Safe and Trustworthy LLM-based Agents
Wenyue Hua | Xianjun Yang | Mingyu Jin | Zelong Li | Wei Cheng | Ruixiang Tang | Yongfeng Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

The rise of LLM-based agents shows great potential to revolutionize task planning, capturing significant attention. Given that these agents will be integrated into high-stakes domains, ensuring their reliability and safety is crucial. This paper presents an Agent-Constitution-based agent framework, TrustAgent, with a particular focus on improving the safety of LLM-based agents. The proposed framework ensures strict adherence to the Agent Constitution through three strategic components: a pre-planning strategy which injects safety knowledge into the model before plan generation, an in-planning strategy which enhances safety during plan generation, and a post-planning strategy which ensures safety by post-planning inspection. Our experimental results demonstrate that the proposed framework can effectively enhance an LLM agent’s safety across multiple domains by identifying and mitigating potential dangers during planning. Further analysis reveals that the framework not only improves safety but also enhances the helpfulness of the agent. Additionally, we highlight the importance of the LLM reasoning ability in adhering to the Constitution. This paper sheds light on how to ensure the safe integration of LLM-based agents into human-centric environments. Data and code are available at https://anonymous.4open.science/r/TrustAgent-06DC.

Pruning as a Domain-specific LLM Extractor
Nan Zhang | Yanchi Liu | Xujiang Zhao | Wei Cheng | Runxue Bao | Rui Zhang | Prasenjit Mitra | Haifeng Chen
Findings of the Association for Computational Linguistics: NAACL 2024

Large Language Models (LLMs) have exhibited remarkable proficiency across a wide array of NLP tasks. However, the escalation in model size also engenders substantial deployment costs. While a few efforts have explored model pruning techniques to reduce the size of LLMs, they mainly center on general or task-specific weights. This leads to suboptimal performance due to lacking specificity on the target domain or generality on different tasks when applied to domain-specific challenges. This work introduces an innovative unstructured dual-pruning methodology, D-Pruner, for domain-specific compression of LLMs. It extracts a compressed, domain-specific, and task-agnostic LLM by identifying LLM weights that are pivotal for general capabilities, like linguistic capability and multi-task solving, and domain-specific knowledge. More specifically, we first assess general weight importance by quantifying the error incurred upon their removal with the help of an open-domain calibration dataset. Then, we utilize this general weight importance to refine the training loss, so that it preserves generality when fitting into a specific domain. Moreover, by efficiently approximating weight importance with the refined training loss on a domain-specific calibration dataset, we obtain a pruned model emphasizing generality and specificity. Our comprehensive experiments across various tasks in healthcare and legal domains show the effectiveness of D-Pruner in domain-specific compression. Our code is available at https://github.com/psunlpgroup/D-Pruner.

Large Language Models Can Be Contextual Privacy Protection Learners
Yijia Xiao | Yiqiao Jin | Yushi Bai | Yue Wu | Xianjun Yang | Xiao Luo | Wenchao Yu | Xujiang Zhao | Yanchi Liu | Quanquan Gu | Haifeng Chen | Wei Wang | Wei Cheng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The proliferation of Large Language Models (LLMs) has driven considerable interest in fine-tuning them with domain-specific data to create specialized language models. Nevertheless, such domain-specific fine-tuning data often contains contextually sensitive personally identifiable information (PII). Directly fine-tuning LLMs on this data without privacy protection poses a risk of leaking sensitive PII at inference time. To address this challenge, we introduce Contextual Privacy Protection Language Models (CPPLM), a novel paradigm for fine-tuning LLMs that effectively injects domain-specific knowledge while safeguarding inference-time data privacy. Our work offers a theoretical analysis for model design and delves into various techniques such as corpus curation, penalty-based unlikelihood in training loss, and instruction-based tuning. Extensive experiments across diverse datasets and scenarios demonstrate the effectiveness of our approaches. In particular, instruction tuning with both positive and negative examples stands out as a promising method, effectively protecting private data while enhancing the model’s knowledge. Our work underscores the potential for Large Language Models as robust contextual privacy protection learners.

InfuserKI: Enhancing Large Language Models with Knowledge Graphs via Infuser-Guided Knowledge Integration
Fali Wang | Runxue Bao | Suhang Wang | Wenchao Yu | Yanchi Liu | Wei Cheng | Haifeng Chen
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) have achieved exceptional capabilities in open generation across various domains, yet they encounter difficulties with tasks that require intensive knowledge. To address these challenges, methods for integrating knowledge have been developed, which augment LLMs with domain-specific knowledge graphs through external modules. These approaches, however, face data inefficiency issues as they necessitate the processing of both known and unknown knowledge for fine-tuning. Thus, our research focuses on a novel problem: efficiently integrating unknown knowledge into LLMs without unnecessary overlap of known knowledge. A risk of introducing new knowledge is the potential forgetting of existing knowledge. To mitigate this risk, we propose the innovative InfuserKI framework. This framework employs transformer internal states to determine when to enrich LLM outputs with additional information, effectively preventing knowledge forgetting. Performance evaluations using the UMLS-2.5k and MetaQA domain knowledge graphs reveal that InfuserKI not only successfully integrates new knowledge but also outperforms state-of-the-art baselines, reducing knowledge forgetting by 9% and 6%, respectively.

Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion
Wei Cheng | Yuhan Wu | Wei Hu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent years have witnessed the deployment of code language models (LMs) in various code intelligence tasks such as code completion. Yet, it is challenging for pre-trained LMs to generate correct completions in private repositories. Previous studies retrieve cross-file context based on import relations or text similarity, which is insufficiently relevant to completion targets. In this paper, we propose a dataflow-guided retrieval augmentation approach, called DraCo, for repository-level code completion. DraCo parses a private repository into code entities and establishes their relations through an extended dataflow analysis, forming a repo-specific context graph. Whenever triggering code completion, DraCo precisely retrieves relevant background knowledge from the repo-specific context graph and generates well-formed prompts to query code LMs. Furthermore, we construct a large Python dataset, ReccEval, with more diverse completion targets. Our experiments demonstrate the superior accuracy and applicable efficiency of DraCo, improving code exact match by 3.43% and identifier F1-score by 3.27% on average compared to the state-of-the-art approach.

Uncertainty Quantification for In-Context Learning of Large Language Models
Chen Ling | Xujiang Zhao | Xuchao Zhang | Wei Cheng | Yanchi Liu | Yiyou Sun | Mika Oishi | Takao Osaki | Katsushi Matsuda | Jie Ji | Guangji Bai | Liang Zhao | Haifeng Chen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs) and revolutionized various fields by providing a few task-relevant demonstrations in the prompt. However, trustworthiness issues with LLM responses, such as hallucination, have also been actively discussed. Existing works have been devoted to quantifying the uncertainty in LLM responses, but they often overlook the complex nature of LLMs and the uniqueness of in-context learning. In this work, we delve into the predictive uncertainty of LLMs associated with in-context learning, highlighting that such uncertainties may stem from both the provided demonstrations (aleatoric uncertainty) and ambiguities tied to the model’s configurations (epistemic uncertainty). We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties. The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion. Extensive experiments are conducted to demonstrate the effectiveness of the decomposition. The code and data are available at: https://github.com/lingchen0331/UQ_ICL.

2023

Open-ended Commonsense Reasoning with Unrestricted Answer Candidates
Chen Ling | Xuchao Zhang | Xujiang Zhao | Yanchi Liu | Wei Cheng | Mika Oishi | Takao Osaki | Katsushi Matsuda | Haifeng Chen | Liang Zhao
Findings of the Association for Computational Linguistics: EMNLP 2023

Open-ended Commonsense Reasoning is defined as solving a commonsense question without providing 1) a short list of answer candidates and 2) a pre-defined answer scope. Conventional ways of formulating the commonsense question into a question-answering form or utilizing external knowledge to learn retrieval-based methods are less applicable in the open-ended setting due to an inherent challenge. Without pre-defining an answer scope or a few candidates, open-ended commonsense reasoning entails predicting answers by searching over an extremely large search space. Moreover, most questions require implicit multi-hop reasoning, which presents even more challenges to our problem. In this work, we leverage pre-trained language models to iteratively retrieve reasoning paths on the external knowledge base, which does not require task-specific supervision. The reasoning paths can help to identify the most precise answer to the commonsense question. We conduct experiments on two commonsense benchmark datasets. Compared to other approaches, our proposed method achieves better performance both quantitatively and qualitatively.

Towards Robust Pruning: An Adaptive Knowledge-Retention Pruning Strategy for Language Models
Jianwei Li | Qi Lei | Wei Cheng | Dongkuan Xu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The pruning objective has recently extended beyond accuracy and sparsity to robustness in language models. Despite this, existing methods struggle to enhance robustness against adversarial attacks when continually increasing model sparsity and require a retraining process. As humans step into the era of large language models, these issues become increasingly prominent. This paper proposes that the robustness of language models is proportional to the extent of pre-trained knowledge they encompass. Accordingly, we introduce a post-training pruning strategy designed to faithfully replicate the embedding space and feature space of dense language models, aiming to conserve more pre-trained knowledge during the pruning process. In this setup, each layer’s reconstruction error not only originates from itself but also includes cumulative error from preceding layers, followed by an adaptive rectification. Compared to other state-of-the-art baselines, our approach demonstrates a superior balance between accuracy, sparsity, robustness, and pruning cost with BERT on datasets SST2, IMDB, and AGNews, marking a significant stride towards robust pruning in language models.

2021

Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction
Zeyu Li | Wei Cheng | Reema Kshetramade | John Houser | Haifeng Chen | Wei Wang
Findings of the Association for Computational Linguistics: EMNLP 2021

Compliments and concerns in reviews are valuable for understanding users’ shopping interests and their opinions with respect to specific aspects of certain items. Existing review-based recommenders favor large and complex language encoders that can only learn latent and uninterpretable text representations. They lack explicit user-attention and item-property modeling, which could, however, provide valuable information beyond the ability to recommend items. Therefore, we propose a tightly coupled two-stage approach, including an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). Unsupervised ASPE mines Aspect-Sentiment pairs (AS-pairs) and APRE predicts ratings using AS-pairs as concrete aspect-level evidence. Extensive experiments on seven real-world Amazon Review Datasets demonstrate that ASPE can effectively extract AS-pairs which enable APRE to deliver superior accuracy over the leading baselines.

Unsupervised Concept Representation Learning for Length-Varying Text Similarity
Xuchao Zhang | Bo Zong | Wei Cheng | Jingchao Ni | Yanchi Liu | Haifeng Chen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Measuring document similarity plays an important role in natural language processing tasks. Most existing document similarity approaches suffer from the information gap caused by context and vocabulary mismatches when comparing varying-length texts. In this paper, we propose an unsupervised concept representation learning approach to address the above issues. Specifically, we propose a novel Concept Generation Network (CGNet) to learn concept representations from the perspective of the entire text corpus. Moreover, a concept-based document matching method is proposed to leverage advances in the recognition of local phrase features and corpus-level concept features. Extensive experiments on real-world data sets demonstrate that the new method can achieve a considerable improvement in comparing length-varying texts. In particular, our model achieved a 6.5% better F1 score than the best of the baseline models on a concept-project benchmark dataset.

Co-authors

Katsushi Matsuda 3

Zhengzhang Chen 2

Linda Ruth Petzold 2

Liang Zhao (赵亮) 2

Peter Anderson 1

Charlie Flanagan 1

Mano Vikash Janardhanan 1

Reema Kshetramade 1

Prasenjit Mitra 1

Liangming Pan 1

Ruixiang Tang 1

William Yang Wang 1

Wangyang Ying 1

Qingcheng Zeng 1

Yongfeng Zhang 1

Xiangyu Zhang 1

Xuandong Zhao 1

Venues