Xiang Chen - ACL Anthology

Xiang Chen

2026

Skill Discovery for Software Scripting Automation via Offline Simulations with LLMs
Paiheng Xu | Gang Wu | Xiang Chen | Tong Yu | Chang Xiao | Franck Dernoncourt | Tianyi Zhou | Wei Ai | Viswanathan Swaminathan
Findings of the Association for Computational Linguistics: EACL 2026

Scripting interfaces enable users to automate tasks and customize software workflows, but creating scripts traditionally requires programming expertise and familiarity with specific APIs, posing barriers for many users. While Large Language Models (LLMs) can generate code from natural language queries, runtime code generation is severely limited due to unverified code, security risks, longer response times, and higher computational costs. To bridge the gap, we propose an offline simulation framework to curate a software-specific skillset—a collection of verified scripts—by exploiting LLMs and publicly available scripting guides. Our framework comprises two components: (1) task creation, using top-down functionality guidance and bottom-up API synergy exploration to generate helpful tasks; and (2) skill generation with trials, refining and validating scripts based on execution feedback. To efficiently navigate the extensive API landscape, we introduce a Graph Neural Network (GNN)-based link prediction model to capture API synergy, enabling the generation of skills involving underutilized APIs and expanding the skillset’s diversity. Experiments with Adobe Illustrator demonstrate that our framework significantly improves automation success rates, reduces response time, and saves runtime token costs compared to traditional runtime code generation. This is the first attempt to use software scripting interfaces as a testbed for LLM-based systems, highlighting the advantages of leveraging execution feedback in a controlled environment and offering valuable insights into aligning AI capabilities with user needs in specialized software domains.

2025

Small Language Models (SLMs) have become increasingly important due to their efficiency and performance to perform various language tasks with minimal computational resources, making them ideal for various settings including on-device, mobile, edge devices, among many others. In this article, we present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques. We propose a novel taxonomy for categorizing the methods used to optimize SLMs, including model compression, pruning, and quantization techniques. We summarize the benchmark datasets that are useful for benchmarking SLMs along with the evaluation metrics commonly used. Additionally, we highlight key open challenges that remain to be addressed. Our survey aims to serve as a valuable resource for researchers and practitioners interested in developing and deploying small yet efficient language models.

A Framework for Fine-Tuning LLMs Using Heterogeneous Feedback
Ryan Aponte | Ryan A. Rossi | Shunan Guo | Franck Dernoncourt | Tong Yu | Xiang Chen | Subrata Mitra | Nedim Lipka
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Large language models (LLMs) have been applied to a wide range of tasks, including text summarization, web navigation, and chat- bots. They have benefitted from supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) following an un- supervised pretraining. These datasets can be difficult to collect, limited in scope, and vary in sample quality. Additionally, datasets can vary extensively in supervision format, from numer- ical to binary as well as multi-dimensional with many different values. We present a framework for fine-tuning LLMs using heterogeneous feed- back, which has two main components. First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF. Next, given this unified feedback dataset, we extract a high- quality and diverse subset to obtain perfor- mance increases potentially exceeding the full dataset. We conduct extensive experiments to understand the effectiveness of these tech- niques for incorporating heterogeneous feed- back, and demonstrate improvements from us- ing a high-quality and diverse subset of the data. We find that our framework is able to improve models in multiple areas simultaneously, such as in instruction following and bias reduction.

Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics
Sheldon Yu | Yuxin Xiong | Junda Wu | Xintong Li | Tong Yu | Xiang Chen | Ritwik Sinha | Jingbo Shang | Julian McAuley
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent advances in chain-of-thought (CoT) prompting have demonstrated the ability of large language models (LLMs) to perform multi-step reasoning. While prior work focuses on improving CoT generation quality or attributing token-level importance, we propose a novel framework to structurally analyze the latent dynamics of CoT trajectories for interpretability. Our method segments generated CoT into discrete reasoning steps, abstracts each step into a spectral embedding based on the eigenvalues of token-level Gram matrices, and clusters these embeddings into semantically meaningful latent states. We model the global evolution of reasoning as a first-order Markov chain over latent clusters, yielding interpretable transition structures. Through t-SNE visualizations and Monte Carlo rollouts, we uncover consistent trajectories across tasks and models, supporting the hypothesis that LLM reasoning follows globally coherent yet abstract paths.

Multi-LLM Debiasing Framework
Deonna M. Owens | Ryan Rossi | Sungchul Kim | Tong Yu | Franck Dernoncourt | Xiang Chen | Ruiyi Zhang | Jiuxiang Gu | Hanieh Deilamsalehy | Nedim Lipka
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Large Language Models (LLMs) are powerful tools with the potential to benefit society immensely, yet, they have demonstrated biases that perpetuate societal inequalities. Despite significant advancements in bias mitigation techniques using data augmentation, zero-shot prompting, and model fine-tuning, biases continuously persist, including subtle biases that may elude human detection. Recent research has shown a growing interest in multi-LLM approaches, which have been demonstrated to be effective in improving the quality of reasoning and factuality in LLMs. Building on this approach, we propose a novel multi-LLM debiasing framework aimed at reducing bias in LLMs. Our work is the first to introduce and evaluate two distinct approaches within this framework for debiasing LLMs: a centralized method, where the conversation is facilitated by a single central LLM, and a decentralized method, where all models communicate directly. Our findings reveal that our multi-LLM framework significantly reduces bias in LLMs, outperforming the baseline method across several social groups.

Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.

CoMMIT: Coordinated Multimodal Instruction Tuning
Xintong Li | Junda Wu | Tong Yu | Rui Wang | Yu Wang | Xiang Chen | Jiuxiang Gu | Lina Yao | Julian McAuley | Jingbo Shang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Instruction tuning in multimodal large language models (MLLMs) generally involves cooperative learning between a backbone LLM and a feature encoder of non-text input modalities. The major challenge is how to efficiently find the synergy between the two modules so that LLMs can adapt their reasoning abilities to downstream tasks while feature encoders can adjust to provide more task-specific information about its modality. In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives, where we find the unbalanced learning between the feature encoder and the LLM can cause problems of oscillation and biased learning that lead to sub-optimal convergence. Inspired by our findings, we propose a Multimodal Balance Coefficient that enables quantitative measurement of the balance of learning. Based on this, we further design a dynamic learning scheduler that better coordinates the learning between the LLM and feature encoder, alleviating the problems of oscillation and biased learning. In addition, we introduce an auxiliary regularization on the gradient to promote updating with larger step sizes, which potentially allows for a more accurate estimation of the proposed MultiModal Balance Coefficient and further improves the training sufficiency. Our proposed approach is agnostic to the architecture of LLM and feature encoder, so it can be generically integrated with various MLLMs. We conduct experiments on multiple downstream tasks with various MLLMs, demonstrating that the proposed method is more effective than the baselines in MLLM instruction tuning.

Agentic Knowledgeable Self-awareness
Shuofei Qiao | Zhisong Qiu | Baochang Ren | Xiaobin Wang | Xiangyuan Ru | Ningyu Zhang | Xiang Chen | Yong Jiang | Pengjun Xie | Fei Huang | Huajun Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional approaches adopt a “flood irrigation” methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of self-awareness - the ability to dynamically assess situational demands and strategically employ resources during decision-making. We propose Agentic Knowledgeable Self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that applies agents with knowledgeable self-awareness like humans. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent’s self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that can outperform various strong baselines on different tasks and models with minimal use of external knowledge.

Graph-guided Cross-composition Feature Disentanglement for Compositional Zero-shot Learning
Yuxia Geng | Runkai Zhu | Jiaoyan Chen | Jintai Chen | Xiang Chen | Zhuo Chen | Shuofei Qiao | Yuxiang Wang | Xiaoliang Xu | Sheng-Jun Huang
Findings of the Association for Computational Linguistics: ACL 2025

Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL). However, due to the feature divergence of an attribute (resp. object) when combined with different objects (resp. attributes), it is challenging to learn disentangled primitive features that are general across different compositions. To this end, we propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions. More specifically, we leverage a compositional graph to define the overall primitive-sharing relationships between compositions, and build a task-specific architecture upon the recently successful large pre-trained vision-language model (VLM) CLIP, with dual cross-composition disentangling adapters (called L-Adapter and V-Adapter) inserted into CLIP’s frozen text and image encoders, respectively. Evaluation on three popular CZSL benchmarks shows that our proposed solution significantly improves the performance of CZSL, and its components have been verified by solid ablation studies. Our code and data are available at: https://github.com/zhurunkai/DCDA.

Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs
Yu Xia | Rui Wang | Xu Liu | Mingyan Li | Tong Yu | Xiang Chen | Julian McAuley | Shuai Li
Proceedings of the 31st International Conference on Computational Linguistics

Chain-of-Thought (CoT) has been a widely adopted prompting method, eliciting impressive reasoning abilities of Large Language Models (LLMs). Inspired by the sequential thought structure of CoT, a number of Chain-of-X (CoX) methods have been developed to address challenges across diverse domains and tasks. In this paper, we provide a comprehensive survey of Chain-of-X methods for LLMs in different contexts. Specifically, we categorize them by taxonomies of nodes, i.e., the X in CoX, and application tasks. We also discuss the findings and implications of existing CoX methods, as well as potential future directions. Our survey aims to serve as a detailed and up-to-date resource for researchers seeking to apply the idea of CoT to broader scenarios.

WebQuality: A Large-scale Multi-modal Web Page Quality Assessment Dataset with Multiple Scoring Dimensions
Tao Zhang | Yige Wang | Hangyu Zhu | Xin Li | Xiang Chen | Tianhua Zhou | Jin Ma
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The assessment of web page quality plays a critical role in a range of downstream applications, yet there is a notable absence of datasets for the evaluation of web page quality. This research presents the pioneering task of web page quality assessment and introduces the first comprehensive, multi-modal Chinese dataset named WebQuality specifically designed for this task. The dataset includes over 65,000 detailed an-notations spanning four sub-dimensions and incorporates elements such as HTML+CSS, text, and visual screenshot, facilitating in-depth modeling and assessment of web page quality. We performed evaluations using a variety of baseline models to demonstrate the complexity of the task. Additionally, we propose Hydra, an integrated multi-modal analysis model, and rigorously assess its performance and limitations through extensive ablation studies. To advance the field of web quality assessment, we offer unrestricted access to our dataset and codebase for the research community, available at https://github.com/incredible-smurf/WebQuality

Doc-React: Multi-page Heterogeneous Document Question-answering
Junda Wu | Yu Xia | Tong Yu | Xiang Chen | Sai Sree Harsha | Akash V Maharaj | Ruiyi Zhang | Victor Bursztyn | Sungchul Kim | Ryan A. Rossi | Julian McAuley | Yunyao Li | Ritwik Sinha
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Answering questions over multi-page, multimodal documents, including text and figures, is a critical challenge for applications that require answers to integrate information across multiple modalities and contextual dependencies. Existing methods, such as single-turn retrieval-augmented generation (RAG), struggle to retrieve fine-grained and contextually relevant information from large, heterogeneous documents, leading to suboptimal performance. Inspired by iterative frameworks like ReAct, which refine retrieval through feedback, we propose Doc-React, an adaptive iterative framework that balances information gain and uncertainty reduction at each step. Doc-React leverages InfoNCE-guided retrieval to approximate mutual information, enabling dynamic sub-query generation and refinement. A large language model (LLM) serves as both a judge and generator, providing structured feedback to iteratively improve retrieval. By combining mutual information optimization with entropy-aware selection, Doc-React systematically captures relevant multimodal content, achieving strong performance on complex QA tasks

Disambiguation in Conversational Question Answering in the Era of LLMs and Agents: A Survey
Mehrab Tanjim | Yeonjun In | Xiang Chen | Victor Bursztyn | Ryan A. Rossi | Sungchul Kim | Guang-Jie Ren | Vaishnavi Muppala | Shun Jiang | Yongsung Kim | Chanyoung Park
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Ambiguity remains a fundamental challenge in Natural Language Processing (NLP) due to the inherent complexity and flexibility of human language. With the advent of Large Language Models (LLMs), addressing ambiguity has become even more critical due to their expanded capabilities and applications. In the context of Conversational Question Answering (CQA), this paper explores the definition, forms, and implications of ambiguity for language driven systems, particularly in the context of LLMs. We define key terms and concepts, categorize various disambiguation approaches enabled by LLMs, and provide a comparative analysis of their advantages and disadvantages. We also explore publicly available datasets for benchmarking ambiguity detection and resolution techniques and highlight their relevance for ongoing research. Finally, we identify open problems and future research directions, especially in agentic settings, proposing areas for further investigation. By offering a comprehensive review of current research on ambiguities and disambiguation with LLMs, we aim to contribute to the development of more robust and reliable LLM-based systems.

Voice Interaction With Conversational AI Could Facilitate Thoughtful Reflection and Substantive Revision in Writing
Jiho Kim | Philippe Laban | Xiang Chen | Kenneth C. Arnold
Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025)

Writing well requires not only expressing ideas but also refining them through revision, a process facilitated by reflection. Prior research suggests that feedback delivered through dialogues, such as those in writing center tutoring sessions, can help writers reflect more thoughtfully on their work compared to static feedback. Recent advancements in multi-modal large language models (LLMs) now offer new possibilities for supporting interactive and expressive voice-based reflection in writing. In particular, we propose that LLM-generated static feedback can be repurposed as conversation starters, allowing writers to seek clarification, request examples, and ask follow-up questions, thereby fostering deeper reflection on their writing. We argue that voice-based interaction can naturally facilitate this conversational exchange, encouraging writers’ engagement with higher-order concerns, facilitating iterative refinement of their reflections, and reduce cognitive load compared to text-based interactions. To investigate these effects, we propose a formative study exploring how text vs. voice input influence writers’ reflection and subsequent revisions. Findings from this study will inform the design of intelligent and interactive writing tools, offering insights into how voice-based interactions with LLM-powered conversational agents can support reflection and revision.

GraPPI: A Retrieve-Divide-Solve GraphRAG Framework for Large-scale Protein-protein Interaction Exploration
Ziwen Li | Xiang Chen | Youngseung Jeon
Findings of the Association for Computational Linguistics: NAACL 2025

Drug discovery (DD) has tremendously contributed to maintaining and improving public health. Hypothesizing that inhibiting protein misfolding can slow disease progression, researchers focus on target identification (Target ID) to find protein structures for drug binding. While Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have accelerated drug discovery, integrating models into cohesive workflows remains challenging. We conducted a user study with drug discovery researchers to identify the applicability of LLMs and RAGs in Target ID. We identified two main findings: 1) an LLM should provide multiple Protein-Protein Interactions (PPIs) based on an initial protein and protein candidates that have a therapeutic impact; 2) the model must provide the PPI and relevant explanations for better understanding. Based on these observations, we identified three limitations on previous approaches for Target ID: 1) semantic ambiguity, 2) lack of explainability, and 3) short retrieval units. To address these issues, we propose GraPPI, a large-scale knowledge graph (KG)-based retrieve-divide-solve agent pipeline RAG framework to support large-scale PPI signaling pathway exploration in understanding therapeutic impacts by decomposing the analysis of entire PPI pathways into sub-tasks focused on the analysis of PPI edges.

2024

TAME-RD: Text Assisted Replication of Image Multi-Adjustments for Reverse Designing
Pooja Guhan | Uttaran Bhattacharya | Somdeb Sarkhel | Vahid Azizi | Xiang Chen | Saayan Mitra | Aniket Bera | Dinesh Manocha
Findings of the Association for Computational Linguistics: ACL 2024

Given a source and its edited version performed based on human instructions in natural language, how do we extract the underlying edit operations, to automatically replicate similar edits on other images? This is the problem of reverse designing, and we present TAME-RD, a model to solve this problem. TAME-RD automatically learns from the complex interplay of image editing operations and the natural language instructions to learn fully specified edit operations. It predicts both the underlying image edit operations as discrete categories and their corresponding parameter values in the continuous space.We accomplish this by mapping together the contextual information from the natural language text and the structural differences between the corresponding source and edited images using the concept of pre-post effect. We demonstrate the efficiency of our network through quantitative evaluations on multiple datasets. We observe improvements of 6-10% on various accuracy metrics and 1.01X-4X on the RMSE score and the concordance correlation coefficient for the corresponding parameter values on the benchmark GIER dataset. We also introduce I-MAD, a new two-part dataset: I-MAD-Dense, a collection of approximately 100K source and edited images, together with automatically generated text instructions and annotated edit operations, and I-MAD-Pro, consisting of about 1.6K source and edited images, together with text instructions and annotated edit operations provided by professional editors. On our dataset, we observe absolute improvements of 1-10% on the accuracy metrics and 1.14X–5X on the RMSE score.

DeCoT: Debiasing Chain-of-Thought for Knowledge-Intensive Tasks in Large Language Models via Causal Intervention
Junda Wu | Tong Yu | Xiang Chen | Haoliang Wang | Ryan Rossi | Sungchul Kim | Anup Rao | Julian McAuley
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) often require task-relevant knowledge to augment their internal knowledge through prompts. However, simply injecting external knowledge into prompts does not guarantee that LLMs can identify and use relevant information in the prompts to conduct chain-of-thought reasoning, especially when the LLM’s internal knowledge is derived from biased information on the pretraining data. In this paper, we propose a novel causal view to formally explain the internal knowledge bias of LLMs via a Structural Causal Model (SCM). We review the chain-of-thought (CoT) prompting from a causal perspective and discover that the biased information from pretrained models can impair LLMs’ reasoning abilities. When the CoT reasoning paths are misled by irrelevant information from prompts and are logically incorrect, simply editing factual information is insufficient to reach the correct answer. To estimate the confounding effect on CoT reasoning in LLMs, we use external knowledge as an instrumental variable. We further introduce CoT as a mediator to conduct front-door adjustment and generate logically correct CoTs where the spurious correlation between LLMs’ pretrained knowledge and task queries is reduced. With extensive experiments, we validate that our approach enables more accurate CoT reasoning and enhances LLM generation on knowledge-intensive tasks.

Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Mengru Wang | Yunzhi Yao | Ziwen Xu | Shuofei Qiao | Shumin Deng | Peng Wang | Xiang Chen | Jia-Chen Gu | Yong Jiang | Pengjun Xie | Fei Huang | Huajun Chen | Ningyu Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.

Unified Hallucination Detection for Multimodal Large Language Models
Xiang Chen | Chenxi Wang | Yida Xue | Ningyu Zhang | Xiaoyan Yang | Qiang Li | Yue Shen | Lei Liang | Jinjie Gu | Huajun Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite significant strides in multimodal tasks, Multimodal Large Language Models (MLLMs) are plagued by the critical issue of hallucination. The reliable detection of such hallucinations in MLLMs has, therefore, become a vital aspect of model evaluation and the safeguarding of practical application deployment. Prior research in this domain has been constrained by a narrow focus on singular tasks, an inadequate range of hallucination categories addressed, and a lack of detailed granularity. In response to these challenges, our work expands the investigative horizons of hallucination detection. We present a novel meta-evaluation benchmark, MHaluBench, meticulously crafted to facilitate the evaluation of advancements in hallucination detection methods. Additionally, we unveil a novel unified multimodal hallucination detection framework, UNIHD, which leverages a suite of auxiliary tools to validate the occurrence of hallucinations robustly. We demonstrate the effectiveness of UNIHD through meticulous evaluation and comprehensive analysis. We also provide strategic insights on the application of specific tools for addressing various categories of hallucinations.

OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
Jintian Zhang | Cheng Peng | Mengshu Sun | Xiang Chen | Lei Liang | Zhiqiang Zhang | Jun Zhou | Huajun Chen | Ningyu Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

Despite the recent advancements in Large Language Models (LLMs), which have significantly enhanced the generative capabilities for various NLP tasks, LLMs still face limitations in directly handling retrieval tasks. However, many practical applications demand the seamless integration of both retrieval and generation. This paper introduces a novel and efficient One-pass Generation and retrieval framework (OneGen), designed to improve LLMs’ performance on tasks that require both generation and retrieval. The proposed framework bridges the traditionally separate training approaches for generation and retrieval by incorporating retrieval tokens generated autoregressively. This enables a single LLM to handle both tasks simultaneously in a unified forward pass. We conduct experiments on two distinct types of composite tasks, RAG and Entity Linking, to validate the pluggability, effectiveness, and efficiency of OneGen in training and inference. Furthermore, our results show that integrating generation and retrieval within the same context preserves the generative capabilities of LLMs while improving retrieval performance. To the best of our knowledge, OneGen is the first to enable LLMs to conduct vector retrieval during the generation.

2023

INTELMO: Enhancing Models’ Adoption of Interactive Interfaces
Chunxu Yang | Chien-Sheng Wu | Lidiya Murakhovs’ka | Philippe Laban | Xiang Chen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This paper presents INTELMO, an easy-to-use library to help model developers adopt user-faced interactive interfaces and articles from real-time RSS sources for their language models. The library categorizes common NLP tasks and provides default style patterns, streamlining the process of creating interfaces with minimal code modifications while ensuring an intuitive user experience. Moreover, INTELMO employs a multi-granular hierarchical abstraction to provide developers with fine-grained and flexible control over user interfaces. INTELMO is under active development, with document available at https://intelmo.github.io.

Reasoning with Language Model Prompting: A Survey
Shuofei Qiao | Yixin Ou | Ningyu Zhang | Xiang Chen | Yunzhi Yao | Shumin Deng | Chuanqi Tan | Fei Huang | Huajun Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reasoning, as an essential ability for complex problem-solving, can provide back-end support for various real-world applications, such as medical diagnosis, negotiation, etc. This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting. We introduce research works with comparisons and summaries and provide systematic resources to help beginners. We also discuss the potential reasons for emerging such reasoning abilities and highlight future research directions. Resources are available at https://github.com/zjunlp/Prompt4ReasoningPapers (updated periodically).

Event-Centric Query Expansion in Web Search
Yanan Zhang | Weijie Cui | Yangfan Zhang | Xiaoling Bai | Zhe Zhang | Jin Ma | Xiang Chen | Tianhua Zhou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

In search engines, query expansion (QE) is a crucial technique to improve search experience. Previous studies often rely on long-term search log mining, which leads to slow updates and is sub-optimal for time-sensitive news searches. In this work, we present Event-Centric Query Expansion (EQE), the QE system used in a famous Chinese search engine. EQE utilizes a novel event retrieval framework that consists of four stages, i.e., event collection, event reformulation, semantic retrieval and online ranking, which can select the best expansion from a significant amount of potential events rapidly and accurately. Specifically, we first collect and filter news headlines from websites. Then we propose a generation model that incorporates contrastive learning and prompt-tuning techniques to reformulate these headlines to concise candidates. Additionally, we fine-tune a dual-tower semantic model to serve as an encoder for event retrieval and explore a two-stage contrastive training approach to enhance the accuracy of event retrieval. Finally, we rank the retrieved events and select the optimal one as QE, which is then used to improve the retrieval of event-related documents. Through offline analysis and online A/B testing, we observed that the EQE system has significantly improved many indicators compared to the baseline. The system has been deployed in a real production environment and serves hundreds of millions of users.

2022

LightNER: A Lightweight Tuning Paradigm for Low-resource NER via Pluggable Prompting
Xiang Chen | Lei Li | Shumin Deng | Chuanqi Tan | Changliang Xu | Fei Huang | Luo Si | Huajun Chen | Ningyu Zhang
Proceedings of the 29th International Conference on Computational Linguistics

Most NER methods rely on extensive labeled data for model training, which struggles in the low-resource scenarios with limited training data. Existing dominant approaches usually suffer from the challenge that the target domain has different label sets compared with a resource-rich source domain, which can be concluded as class transfer and domain transfer. In this paper, we propose a lightweight tuning paradigm for low-resource NER via pluggable prompting (LightNER). Specifically, we construct the unified learnable verbalizer of entity categories to generate the entity span sequence and entity categories without any label-specific classifiers, thus addressing the class transfer issue. We further propose a pluggable guidance module by incorporating learnable parameters into the self-attention layer as guidance, which can re-modulate the attention and adapt pre-trained weights. Note that we only tune those inserted module with the whole parameter of the pre-trained language model fixed, thus, making our approach lightweight and flexible for low-resource scenarios and can better transfer knowledge across domains. Experimental results show that LightNER can obtain comparable performance in the standard supervised setting and outperform strong baselines in low-resource settings.

Good Visual Guidance Make A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction
Xiang Chen | Ningyu Zhang | Lei Li | Yunzhi Yao | Shumin Deng | Chuanqi Tan | Fei Huang | Luo Si | Huajun Chen
Findings of the Association for Computational Linguistics: NAACL 2022

Multimodal named entity recognition and relation extraction (MNER and MRE) is a fundamental and crucial branch in information extraction. However, existing approaches for MNER and MRE usually suffer from error sensitivity when irrelevant object images incorporated in texts. To deal with these issues, we propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction, aiming to achieve more effective and robust performance. Specifically, we regard visual representation as pluggable visual prefix to guide the textual representation for error insensitive forecasting decision. We further propose a dynamic gated aggregation strategy to achieve hierarchical multi-scaled visual features as visual prefix for fusion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method, and achieve state-of-the-art performance.

Title2Event: Benchmarking Open Event Extraction with a Large-scale Chinese Title Dataset
Haolin Deng | Yanan Zhang | Yangfan Zhang | Wangyang Ying | Changlong Yu | Jun Gao | Wei Wang | Xiaoling Bai | Nan Yang | Jin Ma | Xiang Chen | Tianhua Zhou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Event extraction (EE) is crucial to downstream tasks such as new aggregation and event knowledge graph construction. Most existing EE datasets manually define fixed event types and design specific schema for each of them, failing to cover diverse events emerging from the online text. Moreover, news titles, an important source of event mentions, have not gained enough attention in current EE research. In this paper, we present Title2Event, a large-scale sentence-level dataset benchmarking Open Event Extraction without restricting event types. Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages. To the best of our knowledge, it is currently the largest manually annotated Chinese dataset for open event extraction. We further conduct experiments on Title2Event with different models and show that the characteristics of titles make it challenging for event extraction, addressing the significance of advanced study on this problem. The dataset and baseline codes are available at https://open-event-hub.github.io/title2event.

Discord Questions: A Computational Approach To Diversity Analysis in News Coverage
Philippe Laban | Chien-Sheng Wu | Lidiya Murakhovs’ka | Xiang Chen | Caiming Xiong
Findings of the Association for Computational Linguistics: EMNLP 2022

There are many potential benefits to news readers accessing diverse sources. Modern news aggregators do the hard work of organizing the news, offering readers a plethora of source options, but choosing which source to read remains challenging.We propose a new framework to assist readers in identifying source differences and gaining an understanding of news coverage diversity.The framework is based on the generation of Discord Questions: questions with a diverse answer pool, explicitly illustrating source differences.To assemble a prototype of the framework, we focus on two components: (1) discord question generation, the task of generating questions answered differently by sources, for which we propose an automatic scoring method, and create a model that improves performance from current question generation (QG) methods by 5%, (2) answer consolidation, the task of grouping answers to a question that are semantically similar, for which we collect data and repurpose a method that achieves 81% balanced accuracy on our realistic test set.We illustrate the framework’s feasibility through a prototype interface. Even though model performance at discord QG still lags human performance by more than 15%, generated questions are judged to be more interesting than factoid questions and can reveal differences in the level of detail, sentiment, and reasoning of sources in news coverage. Code is available at https://github.com/Salesforce/discord_questions.

DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population
Ningyu Zhang | Xin Xu | Liankuan Tao | Haiyang Yu | Hongbin Ye | Shuofei Qiao | Xin Xie | Xiang Chen | Zhoubo Li | Lei Li
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present an open-source and extensible knowledge extraction toolkit DeepKE, supporting complicated low-resource, document-level and multimodal scenarios in the knowledge base population. DeepKE implements various information extraction tasks, including named entity recognition, relation extraction and attribute extraction. With a unified framework, DeepKE allows developers and researchers to customize datasets and models to extract information from unstructured data according to their requirements. Specifically, DeepKE not only provides various functional modules and model implementation for different tasks and scenarios but also organizes all components by consistent frameworks to maintain sufficient modularity and extensibility. We release the source code at GitHub in https://github.com/zjunlp/DeepKE with Google Colab tutorials and comprehensive documents for beginners. Besides, we present an online system in http://deepke.openkg.cn/EN/re_doc_show.html for real-time extraction of various tasks, and a demo video.

ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select
Yuchen Zhuang | Yinghao Li | Junyang Zhang | Yue Yu | Yingjun Mou | Xiang Chen | Le Song | Chao Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We study the problem of extracting N-ary relation tuples from scientific articles. This task is challenging because the target knowledge tuples can reside in multiple parts and modalities of the document. Our proposed method ReSel decomposes this task into a two-stage procedure that first retrieves the most relevant paragraph/table and then selects the target entity from the retrieved component. For the high-level retrieval stage, ReSel designs a simple and effective feature set, which captures multi-level lexical and semantic similarities between the query and components. For the low-level selection stage, ReSel designs a cross-modal entity correlation graph along with a multi-view architecture, which models both semantic and document-structural relations between entities. Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.

Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study
Xin Xu | Xiang Chen | Ningyu Zhang | Xin Xie | Xi Chen | Huajun Chen
Findings of the Association for Computational Linguistics: EMNLP 2022

This paper presents an empirical study to build relation extraction systems in low-resource settings. Based upon recent pre-trained language models, we comprehensively investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; (iii) data augmentation technologies and self-training to generate more labeled in-domain data. We create a benchmark with 8 relation extraction (RE) datasets covering different languages, domains and contexts and perform extensive comparisons over the proposed schemes with combinations. Our experiments illustrate: (i) Though prompt-based tuning is beneficial in low-resource RE, there is still much potential for improvement, especially in extracting relations from cross-sentence contexts with multiple relational triples; (ii) Balancing methods are not always helpful for RE with long-tailed distribution; (iii) Data augmentation complements existing baselines and can bring much performance gain, while self-training may not consistently achieve advancement to low-resource RE. Code and datasets are in https://github.com/zjunlp/LREBench.

2021

WIND: Weighting Instances Differentially for Model-Agnostic Domain Adaptation
Xiang Chen | Yue Cao | Xiaojun Wan
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Few-Shot Intent Detection via Contrastive Pre-Training and Fine-Tuning
Jianguo Zhang | Trung Bui | Seunghyun Yoon | Xiang Chen | Zhiwei Liu | Congying Xia | Quan Hung Tran | Walter Chang | Philip Yu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In this work, we focus on a more challenging few-shot intent detection scenario where many intents are fine-grained and semantically similar. We present a simple yet effective few-shot intent detection schema via contrastive pre-training and fine-tuning. Specifically, we first conduct self-supervised contrastive pre-training on collected intent datasets, which implicitly learns to discriminate semantically similar utterances without using any labels. We then perform few-shot intent detection together with supervised contrastive learning, which explicitly pulls utterances from the same intent closer and pushes utterances across different intents farther. Experimental results show that our proposed method achieves state-of-the-art performance on three challenging intent detection datasets under 5-shot and 10-shot settings.

ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning
Xin Xie | Xiangnan Chen | Xiang Chen | Yong Wang | Ningyu Zhang | Shumin Deng | Huajun Chen
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents our systems for the three Subtasks of SemEval Task4: Reading Comprehension of Abstract Meaning (ReCAM). We explain the algorithms used to learn our models and the process of tuning the algorithms and selecting the best model. Inspired by the similarity of the ReCAM task and the language pre-training, we propose a simple yet effective technology, namely, negative augmentation with language model. Evaluation results demonstrate the effectiveness of our proposed approach. Our models achieve the 4th rank on both official test sets of Subtask 1 and Subtask 2 with an accuracy of 87.9% and an accuracy of 92.8%, respectively. We further conduct comprehensive model analysis and observe interesting error cases, which may promote future researches. The code and dataset used in our paper can be found at https://github.com/CheaSim/SemEval2021. The leaderboard can be found at https://competitions.codalab.org/competitions/26153.

Co-authors

Franck Dernoncourt 5

Ryan A. Rossi 5

Hanieh Deilamsalehy 3

Philippe Laban 3

Nesreen K. Ahmed 2

Victor Bursztyn 2

Puneet Mathur 2

Lidiya Murakhovs’ka 2

Thien Huu Nguyen 2

Chien-Sheng Wu 2

Seunghyun Yoon 2

Yangfan Zhang 2

Kenneth C. Arnold 1

Samyadeep Basu 1

Uttaran Bhattacharya 1

Xiangnan Chen 1

Sai Sree Harsha 1

Ting-Hao Huang 1

Sheng-Jun Huang 1

Youngseung Jeon 1

Sasidhar Kunapuli 1

Branislav Kveton 1

Akash V Maharaj 1

Dinesh Manocha 1

Subrata Mitra 1

Subhojyoti Mukherjee 1

Koyel Mukherjee 1

Vaishnavi Muppala 1

Chien Van Nguyen 1

Deonna M. Owens 1

Soumyabrata Pal 1

Chanyoung Park 1

Guang-Jie Ren 1

Michael Rimer 1

Somdeb Sarkhel 1

Viswanathan Swaminathan 1

Mehrab Tanjim 1

Quan Hung Tran 1

Haoliang Wang 1

Caiming Xiong 1

Changliang Xu 1

Wangyang Ying 1

Jianguo Zhang 1

Junyang Zhang 1

Jintian Zhang 1

Zhiqiang Zhang 1

Yuchen Zhuang 1

Venues