Xiang Lorraine Li - ACL Anthology

Xiang Lorraine Li

2025

Unveiling Confirmation Bias in Chain-of-Thought Reasoning
Yue Wan | Xiaowei Jia | Xiang Lorraine Li
Findings of the Association for Computational Linguistics: ACL 2025

Chain-of-thought (CoT) prompting has been widely adopted to enhance the reasoning capabilities of large language models (LLMs). However, the effectiveness of CoT reasoning is inconsistent across tasks with different reasoning types. This work presents a novel perspective to understand CoT behavior through the lens of confirmation bias in cognitive psychology. Specifically, we examine how model internal beliefs, approximated by direct question-answering probabilities, affect both reasoning generation (Q → R) and reasoning-guided answer prediction (QR → A) in CoT. By decomposing CoT into a two-stage process, we conduct a thorough correlation analysis in model beliefs, rationale attributes, and stage-wise performance. Our results provide strong evidence of confirmation bias in LLMs, such that model beliefs not only skew the reasoning process but also influence how rationales are utilized for answer prediction. Furthermore, the interplay between task vulnerability to confirmation bias and the strength of beliefs also provides explanations for CoT effectiveness across reasoning tasks and models. Overall, this study provides a valuable insight for the needs of better prompting strategies that mitigate confirmation bias to enhance reasoning performance. Code is available at https://github.com/yuewan2/biasedcot.

Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games
César Guerra-Solano | Zhuochun Li | Xiang Lorraine Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply “out-of-the-box thinking” to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose a task inspired by the New York Times Connections: GlobalGroup, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds – English, Spanish, Chinese, Hindi, and Arabic – in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find English modalities largely lead to better performance in this abstract reasoning task, and performance disparities between open- and closed-source models.

Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing
Bhiman Kumar Baghel | Emma Jordan | Zheyuan Ryan Shi | Xiang Lorraine Li
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) are widely deployed in downstream tasks, but keeping their knowledge up-to-date via retraining or fine-tuning is often computationally expensive. Model editing provides a more efficient alternative by updating a targeted subset of parameters, which often follows the locate-and-edit paradigm. Despite this efficiency, existing methods are limited: edits may fail to inject knowledge (UnderEdit) or unintentionally disrupt unrelated neighboring knowledge (OverEdit). To address these challenges, we propose two complementary methods: **iterative model editing**, which applies successive edits to mitigate UnderEdit, and **neighbor-assisted model editing**, which incorporates neighboring knowledge during editing to reduce OverEdit. Our extensive experiments show that these techniques improve editing performance across multiple LLMs, algorithms, and benchmarks, reducing UnderEdit by up to 38 percentage points and OverEdit by up to 6, while remaining broadly applicable to any locate-and-edit method.

Leveraging Large Models to Evaluate Novel Content: A Case Study on Advertisement Creativity
Zhaoyi Joey Hou | Adriana Kovashka | Xiang Lorraine Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Evaluating creativity is challenging, even for humans, not only because of its subjectivity but also because it involves complex cognitive processes. Inspired by work in marketing, we attempt to break down visual advertisement creativity into atypicality and originality. With fine-grained human annotations on these dimensions, we propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark, demonstrating both the promises and challenges of using VLMs for automatic creativity assessment.

Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025)
Vaibhav Adlakha | Alexandra Chronopoulou | Xiang Lorraine Li | Bodhisattwa Prasad Majumder | Freda Shi | Giorgos Vernikos
Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025)

2024

UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations
Wenting Zhao | Justin T. Chiu | Jena D. Hwang | Faeze Brahman | Jack Hessel | Sanjiban Choudhury | Yejin Choi | Xiang Lorraine Li | Alane Suhr
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Language technologies that accurately model the dynamics of events must perform commonsense reasoning. Existing work evaluating commonsense reasoning focuses on making inferences about common, everyday situations. To instead investigate the ability to model unusual, unexpected, and unlikely situations, we explore the task of uncommonsense abductive reasoning. Given a piece of context with an unexpected outcome, this task requires reasoning abductively to generate an explanation that makes the unexpected outcome more likely in the context. To this end, we curate and release a new English language corpus called UNcommonsense. We characterize the performance differences between human explainers and the best-performing large language models, finding that model-enhanced human-written explanations achieve the highest quality by trading off between specificity and diversity. Finally, we experiment with several imitation learning algorithms to train open and accessible language models on this task. When compared with the vanilla supervised fine-tuning approach, these methods consistently reduce lose rates on both common and uncommonsense abductive reasoning judged by human evaluators.

Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
Qi Cheng | Michael Boratko | Pranay Kumar Yelugam | Tim O’Gorman | Nalini Singh | Andrew McCallum | Xiang Lorraine Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic with multiple correct answers. The purpose of “boiling water” could be making tea, cooking but also could be killing germs. Existing tasks do not capture the probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task that evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation that strongly correlates with human judgments. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense.

In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search
Huihan Li | Yuting Ning | Zeyi Liao | Siyuan Wang | Xiang Lorraine Li | Ximing Lu | Wenting Zhao | Faeze Brahman | Yejin Choi | Xiang Ren
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

To effectively use large language models (LLMs) for real-world queries, it is imperative that they generalize to the long-tail distribution, i.e. rare examples where models exhibit low confidence. In this work, we take the first step towards evaluating LLMs in the long-tail distribution of inferential knowledge. We exemplify long-tail evaluation on the Natural Language Inference task. First, we introduce Logic-Induced-Knowledge-Search (LINK), a systematic long-tail data generation framework, to obtain factually-correct yet long-tail inferential statements. LINK uses variable-wise prompting grounded on symbolic rules to seek low-confidence statements while ensuring factual correctness. We then use LINK to curate Logic-Induced-Long-Tail (LINT), a large-scale long-tail inferential knowledge dataset that contains 108K statements spanning four domains. We evaluate popular LLMs on LINT; we find that state-of-the-art LLMs show significant performance drop (21% relative drop for GPT4) on long-tail data as compared to on head distribution data, and smaller models show even more generalization weakness. These results further underscore the necessity of long-tail evaluation in developing generalizable LLMs.

Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking
Mohamed Elaraby | Diane Litman | Xiang Lorraine Li | Ahmed Magooda
Findings of the Association for Computational Linguistics: EMNLP 2024

Generating free-text rationales is among the emergent capabilities of Large Language Models (LLMs). These rationales have been found to enhance LLM performance across various NLP tasks. Recently, there has been growing interest in using these rationales to provide insights for various important downstream tasks. In this paper, we analyze generated free-text rationales in tasks with subjective answers, emphasizing the importance of rationalization in such scenarios. We focus on pairwise argument ranking, a highly subjective task with significant potential for real-world applications, such as debate assistance. We evaluate the persuasiveness of rationales generated by nine LLMs to support their subjective choices. Our findings suggest that open-source LLMs, particularly Llama2-70B-chat, are capable of providing highly persuasive rationalizations, surpassing even GPT models. Additionally, our experiments demonstrate that the persuasiveness of the generated rationales can be enhanced by guiding their persuasive elements through prompting or self-refinement techniques.

2023

Editing Common Sense in Transformers
Anshita Gupta | Debanjan Mondal | Akshay Krishna Sheshadri | Wenlong Zhao | Xiang Lorraine Li | Sarah Wiegreffe | Niket Tandon
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Editing model parameters directly in Transformers makes updating open-source transformer-based models possible without re-training. However, these editing methods have only been evaluated on statements about encyclopedic knowledge with a single correct answer. Commonsense knowledge with multiple correct answers, e.g., an apple can be green or red but not transparent, has not been studied but is as essential for enhancing transformers’ reliability and usefulness. In this paper, we investigate whether commonsense judgments are causally associated with localized, editable parameters in Transformers, and we provide an affirmative answer. We find that directly applying the MEMIT editing algorithm results in sub-par performance and improve it for the commonsense domain by varying edit tokens and improving the layer selection strategy, i.e., MEMIT_CSK. GPT-2 Large and XL models edited using MEMIT_CSK outperform best-fine-tuned baselines by 10.97% and 10.73% F1 scores on PEP3k and 20Q datasets. In addition, we propose a novel evaluation dataset, PROBE\ SET, that contains unaffected and affected neighborhoods, affected paraphrases, and affected reasoning challenges. MEMIT_CSK performs well across the metrics while fine-tuning baselines show significant trade-offs between unaffected and affected metrics. These results suggest a compelling future direction for incorporating feedback about common sense into Transformers through direct model editing.

2022

Proceedings of the 7th Workshop on Representation Learning for NLP
Spandana Gella | He He | Bodhisattwa Prasad Majumder | Burcu Can | Eleonora Giunchiglia | Samuel Cahyawijaya | Sewon Min | Maximilian Mozes | Xiang Lorraine Li | Isabelle Augenstein | Anna Rogers | Kyunghyun Cho | Edward Grefenstette | Laura Rimell | Chris Dyer
Proceedings of the 7th Workshop on Representation Learning for NLP

A Systematic Investigation of Commonsense Knowledge in Large Language Models
Xiang Lorraine Li | Adhiguna Kuncoro | Jordan Hoffmann | Cyprien de Masson d’Autume | Phil Blunsom | Aida Nematzadeh
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Language models (LMs) trained on large amounts of data have shown impressive performance on many NLP tasks under the zero-shot and few-shot setup. Here we aim to better understand the extent to which such models learn commonsense knowledge — a critical component of many NLP applications. We conduct a systematic and rigorous zero-shot and few-shot commonsense evaluation of large pre-trained LMs, where we: (i) carefully control for the LMs’ ability to exploit potential surface cues and annotation artefacts, and (ii) account for variations in performance that arise from factors that are not related to commonsense knowledge. Our findings highlight the limitations of pre-trained LMs in acquiring commonsense knowledge without task-specific supervision; furthermore, using larger models or few-shot evaluation is insufficient to achieve human-level commonsense performance.

Word2Box: Capturing Set-Theoretic Semantics of Words using Box Embeddings
Shib Sankar Dasgupta | Michael Boratko | Siddhartha Mishra | Shriya Atmakuri | Dhruvesh Patel | Xiang Lorraine Li | Andrew McCallum
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning representations of words in a continuous space is perhaps the most fundamental task in NLP, however words interact in ways much richer than vector dot product similarity can provide. Many relationships between words can be expressed set-theoretically, for example, adjective-noun compounds (eg. “red cars”⊆“cars”) and homographs (eg. “tongue”∩“body” should be similar to “mouth”, while “tongue”∩“language” should be similar to “dialect”) have natural set-theoretic interpretations. Box embeddings are a novel region-based representation which provide the capability to perform these set-theoretic operations. In this work, we provide a fuzzy-set interpretation of box embeddings, and learn box representations of words using a set-theoretic training objective. We demonstrate improved performance on various word similarity tasks, particularly on less common words, and perform a quantitative and qualitative analysis exploring the additional unique expressivity provided by Word2Box.

2021

Probabilistic Box Embeddings for Uncertain Knowledge Graph Reasoning
Xuelu Chen | Michael Boratko | Muhao Chen | Shib Sankar Dasgupta | Xiang Lorraine Li | Andrew McCallum
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Knowledge bases often consist of facts which are harvested from a variety of sources, many of which are noisy and some of which conflict, resulting in a level of uncertainty for each triple. Knowledge bases are also often incomplete, prompting the use of embedding methods to generalize from known facts, however, existing embedding methods only model triple-level uncertainty, and reasoning results lack global consistency. To address these shortcomings, we propose BEUrRE, a novel uncertain knowledge graph embedding method with calibrated probabilistic semantics. BEUrRE models each entity as a box (i.e. axis-aligned hyperrectangle) and relations between two entities as affine transforms on the head and tail entity boxes. The geometry of the boxes allows for efficient calculation of intersections and volumes, endowing the model with calibrated probabilistic semantics and facilitating the incorporation of relational constraints. Extensive experiments on two benchmark datasets show that BEUrRE consistently outperforms baselines on confidence prediction and fact ranking due to its probabilistic calibration and ability to capture high-order dependencies among facts.

Looking Beyond Sentence-Level Natural Language Inference for Question Answering and Text Summarization
Anshuman Mishra | Dhruvesh Patel | Aparna Vijayakumar | Xiang Lorraine Li | Pavan Kapanipathi | Kartik Talamadupula
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Natural Language Inference (NLI) has garnered significant attention in recent years; however, the promise of applying NLI breakthroughs to other downstream NLP tasks has remained unfulfilled. In this work, we use the multiple-choice reading comprehension (MCRC) and checking factual correctness of textual summarization (CFCS) tasks to investigate potential reasons for this. Our findings show that: (1) the relatively shorter length of premises in traditional NLI datasets is the primary challenge prohibiting usage in downstream applications (which do better with longer contexts); (2) this challenge can be addressed by automatically converting resource-rich reading comprehension datasets into longer-premise NLI datasets; and (3) models trained on the converted, longer-premise datasets outperform those trained using short-premise traditional NLI datasets on downstream tasks primarily due to the difference in premise lengths.

Box-To-Box Transformations for Modeling Joint Hierarchies
Shib Sankar Dasgupta | Xiang Lorraine Li | Michael Boratko | Dongxu Zhang | Andrew McCallum
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

Learning representations of entities and relations in structured knowledge bases is an active area of research, with much emphasis placed on choosing the appropriate geometry to capture the hierarchical structures exploited in, for example, isa or haspart relations. Box embeddings (Vilnis et al., 2018; Li et al., 2019; Dasgupta et al., 2020), which represent concepts as n-dimensional hyperrectangles, are capable of embedding hierarchies when training on a subset of the transitive closure. In Patel et al., (2020), the authors demonstrate that only the transitive reduction is required and further extend box embeddings to capture joint hierarchies by augmenting the graph with new nodes. While it is possible to represent joint hierarchies with this method, the parameters for each hierarchy are decoupled, making generalization between hierarchies infeasible. In this work, we introduce a learned box-to-box transformation that respects the structure of each hierarchy. We demonstrate that this not only improves the capability of modeling cross-hierarchy compositional edges but is also capable of generalizing from a subset of the transitive reduction.

2020

Reading Comprehension as Natural Language Inference:A Semantic Analysis
Anshuman Mishra | Dhruvesh Patel | Aparna Vijayakumar | Xiang Lorraine Li | Pavan Kapanipathi | Kartik Talamadupula
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

In the recent past, Natural language Inference (NLI) has gained significant attention, particularly given its promise for downstream NLP tasks. However, its true impact is limited and has not been well studied. Therefore, in this paper, we explore the utility of NLI for one of the most prominent downstream tasks, viz. Question Answering (QA). We transform one of the largest available MRC dataset (RACE) to an NLI form, and compare the performances of a state-of-the-art model (RoBERTa) on both these forms. We propose new characterizations of questions, and evaluate the performance of QA and NLI models on these categories. We highlight clear categories for which the model is able to perform better when the data is presented in a coherent entailment form, and a structured question-answer concatenation form, respectively.

ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning
Michael Boratko | Xiang Lorraine Li | Tim O’Gorman | Rajarshi Das | Dan Le | Andrew McCallum
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Given questions regarding some prototypical situation — such as Name something that people usually do before they leave the house for work? — a human can easily answer them via acquired experiences. There can be multiple right answers for such questions, with some more common for a situation than others. This paper introduces a new question answering dataset for training and evaluating common sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played in a long-running international trivia game show – Family Feud. The hidden evaluation set is created by gathering answers for each question from 100 crowd-workers. We also propose a generative evaluation task where a model has to output a ranked list of answers, ideally covering all prototypical answers for a question. After presenting multiple competitive baseline models, we find that human performance still exceeds model scores on all evaluation metrics with a meaningful gap, supporting the challenging nature of the task.

Co-authors

Pavan Kapanipathi 2

Bodhisattwa Prasad Majumder 2

Anshuman Mishra 2

Tim O’Gorman 2

Kartik Talamadupula 2

Aparna Vijayakumar 2

Vaibhav Adlakha 1

Shriya Atmakuri 1

Isabelle Augenstein 1

Bhiman Kumar Baghel 1

Samuel Cahyawijaya 1

Justin T. Chiu 1

Kyunghyun Cho 1

Sanjiban Choudhury 1

Alexandra Chronopoulou 1

Mohamed Elaraby 1

Spandana Gella 1

Eleonora Giunchiglia 1

Edward Grefenstette 1

César Guerra-Solano 1

Anshita Gupta 1

Jordan Hoffmann 1

Zhaoyi Joey Hou 1

Jena D. Hwang 1

Adriana Kovashka 1

Adhiguna Kuncoro 1

Ahmed Magooda 1

Siddhartha Mishra 1

Debanjan Mondal 1

Maximilian Mozes 1

Aida Nematzadeh 1

Akshay Krishna Sheshadri 1

Zheyuan Ryan Shi 1

Giorgos Vernikos 1

Siyuan Wang (王思远) 1

Sarah Wiegreffe 1

Pranay Kumar Yelugam 1

Cyprien de Masson d’Autume 1

Venues