Sihao Chen - ACL Anthology

Sihao Chen

2025

LogiCoL: Logically-Informed Contrastive Learning for Set-based Dense Retrieval
Yanzhen Shen | Sihao Chen | Xueqiang Xu | Yunyi Zhang | Chaitanya Malaviya | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

While significant progress has been made with dual- and bi-encoder dense retrievers, they often struggle on queries with logical connectives, a use case that is often overlooked yet important in downstream applications. Current dense retrievers struggle with such queries, such that the retrieved results do not respect the logical constraints implied in the queries. To address this challenge, we introduce LogiCoL, a logically-informed contrastive learning objective for dense retrievers. LogiCoL builds upon in-batch supervised contrastive learning, and learns dense retrievers to respect the subset and mutually-exclusive set relation between query results via two sets of soft constraints expressed via t-norm in the learning objective. We evaluate the effectiveness of LogiCoL on the task of entity retrieval, where the model is expected to retrieve a set of entities in Wikipedia that satisfy the implicit logical constraints in the query. We show that models trained with LogiCoL yield improvement both in terms of retrieval performance and logical consistency in the results. We provide detailed analysis and insights to uncover why queries with logical connectives are challenging for dense retrievers and why LogiCoL is most effective.

On Reference (In-)Determinacy in Natural Language Inference
Sihao Chen | Chaitanya Malaviya | Alex Fabrikant | Hagai Taitelbaum | Tal Schuster | Senaka Buthpitiya | Dan Roth
Findings of the Association for Computational Linguistics: NAACL 2025

We revisit the reference determinacy (RD) assumption in the task of natural language inference (NLI), i.e., the premise and hypothesis are assumed to refer to the same context when human raters annotate a label. While RD is a practical assumption for constructing a new NLI dataset, we observe that current NLI models—which are typically trained solely on hypothesis-premise pairs created with the RD assumption—fail in downstream applications such as fact verification, where the input premise and hypothesis may refer to different contexts. To highlight the impact of this phenomenon in real-world use cases, we introduce RefNLI, a diagnostic benchmark for identifying reference ambiguity in NLI examples. In RefNLI, the premise is retrieved from a knowledge source (i.e. Wikipedia) and does not necessarily refer to the same context as the hypothesis. With RefNLI, we demonstrate that finetuned NLI models and few-shot prompted LLMs both fail to recognize context mismatch, leading to >80% false contradiction and >50% entailment predictions. We discover that the existence of reference ambiguity in NLI examples can in part explain the inherent human disagreements in NLI, and provide insight into how the RD assumption impacts NLI dataset creation process.

Talking Point based Ideological Discourse Analysis in News Events
Nishanth Sridhar Nakshatri | Nikhil Mehta | Siyi Liu | Sihao Chen | Daniel Hopkins | Dan Roth | Dan Goldwasser
Findings of the Association for Computational Linguistics: ACL 2025

Analyzing ideological discourse even in the age of LLMs remains a challenge, as these models often struggle to capture the key elements that shape real-world narratives. Specifically, LLMs fail to focus on characteristic elements driving dominant discourses and lack the ability to integrate contextual information required for understanding abstract ideological views. To address these limitations, we propose a framework motivated by the theory of ideological discourse analysis to analyze news articles related to real-world events. Our framework represents the news articles using a relational structure−talking points, which captures the interaction between entities, their roles, and media frames along with a topic of discussion. It then constructs a vocabulary of repeating themes−prominent talking points, that are used to generate ideology-specific viewpoints (or partisan perspectives). We evaluate our framework’s ability to generate these perspectives through automated tasks−ideology and partisan classification tasks, supplemented by human validation. Additionally, we demonstrate straightforward applicability of our framework in creating event snapshots, a visual way of interpreting event discourse. We release resulting dataset and model to the community to support further research.

Teaching Language Models To Gather Information Proactively
Tenghao Huang | Sihao Chen | Muhao Chen | Jonathan May | Longqi Yang | Mengting Wan | Pei Zhou
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) are increasingly expected to function as collaborative partners, engaging in back-and-forth dialogue to solve complex, ambiguous problems. However, current LLMs often falter in real-world settings, defaulting to passive responses or narrow clarifications when faced with incomplete or under-specified prompts—falling short of proactively gathering the missing information that is crucial for high-quality solutions. In this work, we introduce a new task paradigm: proactive information gathering, where LLMs must identify gaps in the provided context and strategically elicit implicit user knowledge through targeted questions. To systematically study and train this capability, we design a scalable framework that generates partially specified, real-world tasks, masking key information and simulating authentic ambiguity. Within this setup, our core innovation is a reinforcement finetuning strategy rewards questions that elicit genuinely new, implicit user information—such as hidden domain expertise or fine-grained requirements—that would otherwise remain unspoken. Experiments demonstrate that our trained Qwen-2.5-7B model significantly outperforms o3-mini by 18% on automatic evaluation metrics. More importantly, human evaluation reveals that clarification questions and final outlines generated by our model are favored by human annotators by 42% and 28% respectively. Together, these results highlight the value of proactive clarification in elevating LLMs from passive text generators to genuinely collaborative thought partners.

Minimal Evidence Group Identification for Claim Verification
Xiangci Li | Sihao Chen | Rajvi Kapadia | Jessica Ouyang | Fan Zhang
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)

When verifying a claim in real-world settings, e.g. against a large collection of candidate evidence text retrieved from the web, a model is typically expected to identify and aggregate a complete set of evidence pieces that collectively provide full support to a claim.The problem becomes particularly challenging as there might exist different sets of evidence that could be used to verify the claim from different perspectives. In this paper, we formally define and study the problem of identifying such minimal evidence groups (MEGs) for fact verification. We show that MEG identification can be reduced to a Set Cover-like problem, based on an entailment model which estimates whether a given evidence group provides full or partial support to a claim. Our proposed approach achieves 18.4% & 34.8% absolute improvements on WiCE and SciFact datasets over LLM prompting. Finally, we demonstrate the downstream benefit of MEGs in applications such as claim generation.

2024

MixGR: Enhancing Retriever Generalization for Scientific Domain through Complementary Granularity
Fengyu Cai | Xinran Zhao | Tong Chen | Sihao Chen | Hongming Zhang | Iryna Gurevych | Heinz Koeppl
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Dense X Retrieval: What Retrieval Granularity Should We Use?
Tong Chen | Hongwei Wang | Sihao Chen | Wenhao Yu | Kaixin Ma | Xinran Zhao | Hongming Zhang | Dong Yu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks. Moreover, constructing prompts with fine-grained retrieved units for retrieval-augmented language models improves the performance of downstream QA tasks given a specific computation budget.

The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts
Lingfeng Shen | Weiting Tan | Sihao Chen | Yunmo Chen | Jingyu Zhang | Haoran Xu | Boyuan Zheng | Philipp Koehn | Daniel Khashabi
Findings of the Association for Computational Linguistics: ACL 2024

As the influence of large language models (LLMs) spans across global communities, their safety challenges in multilingual settings become paramount for alignment research. This paper examines the variations in safety challenges faced by LLMs across different languages and discusses approaches to alleviating such concerns. By comparing how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages,we observe that (1) LLMs tend to generate unsafe responses much more often when a malicious prompt is written in a lower-resource language, and (2) LLMs tend to generate more irrelevant responses to malicious prompts in lower-resource languages. To understand where the discrepancy can be attributed, we study the effect of instruction tuning with reinforcement learning from human feedback (RLHF) or supervised finetuning (SFT) on the HH-RLHF dataset. Surprisingly, while training with high-resource languages improves model alignment, training in lower-resource languages yields minimal improvement. This suggests that the bottleneck of cross-lingual alignment is rooted in the pretraining stage. Our findings highlight the challenges in cross-lingual LLM safety, and we hope they inform future research in this direction.

Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations
Sihao Chen | Hongming Zhang | Tong Chen | Ben Zhou | Wenhao Yu | Dian Yu | Baolin Peng | Hongwei Wang | Dan Roth | Dong Yu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e. atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications, such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.

ExpertQA: Expert-Curated Questions and Attributed Answers
Chaitanya Malaviya | Subin Lee | Sihao Chen | Elizabeth Sieber | Mark Yatskar | Dan Roth
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

As language models are adopted by a more sophisticated and diverse set of users, the importance of guaranteeing that they provide factually correct information supported by verifiable sources is critical across fields of study. This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal consequences. Previous work studying attribution and factuality has not focused on analyzing these characteristics of language model outputs in domain-specific scenarios. In this work, we conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality, by bringing domain experts in the loop. Specifically, we collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. In addition, we ask experts to improve upon responses from language models. The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.

2023

PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition
Sihao Chen | Senaka Buthpitiya | Alex Fabrikant | Dan Roth | Tal Schuster
Findings of the Association for Computational Linguistics: ACL 2023

The widely studied task of Natural Language Inference (NLI) requires a system to recognize whether one piece of text is textually entailed by another, i.e. whether the entirety of its meaning can be inferred from the other. In current NLI datasets and models, textual entailment relations are typically defined on the sentence- or paragraph-level. However, even a simple sentence often contains multiple propositions, i.e. distinct units of meaning conveyed by the sentence. As these propositions can carry different truth values in the context of a given premise, we argue for the need to recognize the textual entailment relation of each proposition in a sentence individually. We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters. Our dataset structure resembles the tasks of (1) segmenting sentences within a document to the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document, i.e. documents describing the same event or entity. We establish strong baselines for the segmentation and entailment tasks. Through case studies on summary hallucination detection and document-level NLI, we demonstrate that our conceptual framework is potentially useful for understanding and explaining the compositionality of NLI labels.

Using LLM for Improving Key Event Discovery: Temporal-Guided News Stream Clustering with Event Summaries
Nishanth Nakshatri | Siyi Liu | Sihao Chen | Dan Roth | Dan Goldwasser | Daniel Hopkins
Findings of the Association for Computational Linguistics: EMNLP 2023

Understanding and characterizing the discus- sions around key events in news streams is important for analyzing political discourse. In this work, we study the problem of identification of such key events and the news articles associated with those events from news streams. We propose a generic framework for news stream clustering that analyzes the temporal trend of news articles to automatically extract the underlying key news events that draw significant media attention. We characterize such key events by generating event summaries, based on which we form document clusters in an unsupervised fashion. We evaluate our simple yet effective framework, and show that it produces more coherent event-focused clusters. To demonstrate the utility of our approach, and facilitate future research along the line, we use our framework to construct KeyEvents, a dataset of 40k articles with 611 key events from 11 topics.

2022

Design Challenges for a Multi-Perspective Search Engine
Sihao Chen | Siyi Liu | Xander Uyttendaele | Yi Zhang | William Bruno | Dan Roth
Findings of the Association for Computational Linguistics: NAACL 2022

Many users turn to document retrieval systems (e.g. search engines) to seek answers to controversial or open-ended questions. However, classical document retrieval systems fall short at delivering users a set of direct and diverse responses in such cases, which requires identifying responses within web documents in the context of the query, and aggregating the responses based on their different perspectives. The goal of this work is to survey and study the user information needs for building a multi-perspective search engine of such. We examine the challenges of synthesizing such language understanding objectives with document retrieval, and study a new perspective-oriented document retrieval paradigm. We discuss and assess the inherent natural language understanding challenges one needs to address in order to achieve the goal. Following the design challenges and principles, we propose and evaluate a practical prototype pipeline system. We use the prototype system to conduct a user survey in order to assess the utility of our paradigm, as well as understanding the user information needs when issuing controversial and open-ended queries to a search engine.

Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters
Tal Schuster | Sihao Chen | Senaka Buthpitiya | Alex Fabrikant | Donald Metzler
Findings of the Association for Computational Linguistics: EMNLP 2022

Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs. While early work identified certain biases in NLI models, recent advancements in modeling and datasets demonstrated promising performance.In this work, we further explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on. First, we analyze the robustness of these models to longer and out-of-domain inputs. Then, we develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset. Interestingly, we find NLI scores to provide strong retrieval signals, leading to more relevant evidence extractions compared to common similarity-based methods. Finally, we go further and investigate whole document clusters to identify both discrepancies and consensus among sources. In a test case, we find real inconsistencies between Wikipedia pages in different languages about the same topic.

2021

MultiOpEd: A Corpus of Multi-Perspective News Editorials
Siyi Liu | Sihao Chen | Xander Uyttendaele | Dan Roth
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose MultiOpEd, an open-domain news editorial corpus that supports various tasks pertaining to the argumentation structure in news editorials, focusing on automatic perspective discovery. News editorial is a genre of persuasive text, where the argumentation structure is usually implicit. However, the arguments presented in an editorial typically center around a concise, focused thesis, which we refer to as their perspective. MultiOpEd aims at supporting the study of multiple tasks relevant to automatic perspective discovery, where a system is expected to produce a single-sentence thesis statement summarizing the arguments presented. We argue that identifying and abstracting such natural language perspectives from editorials is a crucial step toward studying the implicit argumentation structure in news editorials. We first discuss the challenges and define a few conceptual tasks towards our goal. To demonstrate the utility of MultiOpEd and the induced tasks, we study the problem of perspective summarization in a multi-task learning setting, as a case study. We show that, with the induced tasks as auxiliary tasks, we can improve the quality of the perspective summary generated. We hope that MultiOpEd will be a useful resource for future studies on argumentation in the news editorial domain.

Improving Faithfulness in Abstractive Summarization with Contrast Candidate Generation and Selection
Sihao Chen | Fan Zhang | Kazoo Sone | Dan Roth
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Despite significant progress in neural abstractive summarization, recent studies have shown that the current models are prone to generating summaries that are unfaithful to the original context. To address the issue, we study contrast candidate generation and selection as a model-agnostic post-processing technique to correct the extrinsic hallucinations (i.e. information not present in the source text) in unfaithful summaries. We learn a discriminative correction model by generating alternative candidate summaries where named entities and quantities in the generated summary are replaced with ones with compatible semantic types from the source document. This model is then used to select the best candidate as the final output summary. Our experiments and analysis across a number of neural summarization systems show that our proposed method is effective in identifying and correcting extrinsic hallucinations. We analyze the typical hallucination phenomenon by different types of neural summarization systems, in hope to provide insights for future work on the direction.

2020

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

2019

Seeing Things from a Different Angle:Discovering Diverse Perspectives about Claims
Sihao Chen | Daniel Khashabi | Wenpeng Yin | Chris Callison-Burch | Dan Roth
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

One key consequence of the information revolution is a significant increase and a contamination of our information supply. The practice of fact checking won’t suffice to eliminate the biases in text data we observe, as the degree of factuality alone does not determine whether biases exist in the spectrum of opinions visible to us. To better understand controversial issues, one needs to view them from a diverse yet comprehensive set of perspectives. For example, there are many ways to respond to a claim such as “animals should have lawful rights”, and these responses form a spectrum of perspectives, each with a stance relative to this claim and, ideally, with evidence supporting it. Inherently, this is a natural language understanding task, and we propose to address it as such. Specifically, we propose the task of substantiated perspective discovery where, given a claim, a system is expected to discover a diverse set of well-corroborated perspectives that take a stance with respect to the claim. Each perspective should be substantiated by evidence paragraphs which summarize pertinent results and facts. We construct PERSPECTRUM, a dataset of claims, perspectives and evidence, making use of online debate websites to create the initial data collection, and augmenting it using search engines in order to expand and diversify our dataset. We use crowd-sourcing to filter out noise and ensure high-quality data. Our dataset contains 1k claims, accompanied with pools of 10k and 8k perspective sentences and evidence paragraphs, respectively. We provide a thorough analysis of the dataset to highlight key underlying language understanding challenges, and show that human baselines across multiple subtasks far outperform ma-chine baselines built upon state-of-the-art NLP techniques. This poses a challenge and opportunity for the NLP community to address.

PerspectroScope: A Window to the World of Diverse Perspectives
Sihao Chen | Daniel Khashabi | Chris Callison-Burch | Dan Roth
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

This work presents PerspectroScope, a web-based system which lets users query a discussion-worthy natural language claim, and extract and visualize various perspectives in support or against the claim, along with evidence supporting each perspective. The system thus lets users explore various perspectives that could touch upon aspects of the issue at hand. The system is built as a combination of retrieval engines and learned textual-entailment-like classifiers built using a few recent developments in natural language understanding. To make the system more adaptive, expand its coverage, and improve its decisions over time, our platform employs various mechanisms to get corrections from the users. PerspectroScope is available at github.com/CogComp/perspectroscope Web demo link: http://orwell.seas.upenn.edu:4002/ Link to demo video: https://www.youtube.com/watch?v=MXBTR1Sp3Bs

Co-authors

Alex Fabrikant 3

Chaitanya Malaviya 3

Hongming Zhang 3

Chris Callison-Burch 2

Dan Goldwasser 2

Daniel Hopkins 2

Xander Uyttendaele 2

Dong Yu (于东) 2

Victoria Basmov 1

Jonathan Berant 1

William Bruno 1

Pradeep Dasigi 1

Ananth Gottumukkala 1

Iryna Gurevych 1

Hannaneh Hajishirzi 1

Tenghao Huang 1

Gabriel Ilharco 1

Rajvi Kapadia 1

Philipp Koehn 1

Jiangming Liu 1

Nelson F. Liu 1

Donald Metzler 1

Phoebe Mulcaire 1

Nishanth Nakshatri 1

Nishanth Sridhar Nakshatri 1

Jessica Ouyang 1

Lingfeng Shen 1

Elizabeth Sieber 1

Noah A. Smith 1

Sanjay Subramanian 1

Hagai Taitelbaum 1

Reut Tsarfaty 1

Venues