Vidhisha Balachandran - ACL Anthology

Vidhisha Balachandran

2025

FACTS&EVIDENCE: An Interactive Tool for Transparent Fine-Grained Factual Verification of Machine-Generated Text
Varich Boonsanong | Vidhisha Balachandran | Xiaochuang Han | Shangbin Feng | Lucy Lu Wang | Yulia Tsvetkov
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

With the widespread consumption of AI-generated content, there has been an increased focus on developing automated tools to verify the factual accuracy of such content. However, prior research and tools developed for fact verification treat it as a binary classification or a linear regression problem. Although this is a useful mechanism as part of automatic guardrails in systems, we argue that such tools lack transparency in the prediction reasoning and diversity in source evidence to provide a trustworthy user experience.We develop FACTS&EVIDENCE—an interactive and transparent tool for user-driven verification of complex text. The tool facilitates the intricate decision-making involved in fact-verification, presenting its users a breakdown of complex input texts to visualize the credibility of individual claims along with explanation of model decisions and attribution to multiple, diverse evidence sources. FACTS&EVIDENCE aims to empower consumers of machine-generated text and give them agency to understand, verify, selectively trust and use such text.

Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers
Kabir Ahuja | Vidhisha Balachandran | Madhur Panwar | Tianxing He | Noah A. Smith | Navin Goyal | Yulia Tsvetkov
Transactions of the Association for Computational Linguistics, Volume 13

Transformers trained on natural language data have been shown to exhibit hierarchical generalization without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their training that could cause such preference for hierarchical generalization. We extensively experiment with transformers trained on five synthetic, controlled datasets using several training objectives and show that, while objectives such as sequence-to-sequence modeling, classification, etc., often fail to lead to hierarchical generalization, the language modeling objective consistently leads to transformers generalizing hierarchically. We then study how different generalization behaviors emerge during the training by conducting pruning experiments that reveal the joint existence of subnetworks within the model implementing different generalizations. Finally, we take a Bayesian perspective to understand transformers’ preference for hierarchical generalization: We establish a correlation between whether transformers generalize hierarchically on a dataset and if the simplest explanation of that dataset is provided by a hierarchical grammar compared to regular grammars exhibiting linear generalization. Overall, our work presents new insights on the origins of hierarchical generalization in transformers and provides a theoretical framework for studying generalization in language models.

2024

Don’t Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration
Shangbin Feng | Weijia Shi | Yike Wang | Wenxuan Ding | Vidhisha Balachandran | Yulia Tsvetkov
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite efforts to expand the knowledge of large language models (LLMs), knowledge gaps—missing or outdated information in LLMs—might always persist given the evolving nature of knowledge. In this work, we study approaches to identify LLM knowledge gaps and abstain from answering questions when knowledge gaps are present. We first adapt existing approaches to model calibration or adaptation through fine-tuning/prompting and analyze their ability to abstain from generating low-confidence outputs. Motivated by their failures in self-reflection and over-reliance on held-out sets, we propose two novel approaches that are based on model collaboration, i.e., LLMs probing other LLMs for knowledge gaps, either cooperatively or competitively. Extensive experiments with three LLMs on four QA tasks featuring diverse knowledge domains demonstrate that both cooperative and competitive approaches to unveiling LLM knowledge gaps achieve up to 19.3% improvements on abstain accuracy against the strongest baseline. Further analysis reveals that our abstention methods pinpoint failure cases in retrieval augmentation and knowledge gaps in multi-hop reasoning.

Knowledge Crosswords: Geometric Knowledge Reasoning with Large Language Models
Wenxuan Ding | Shangbin Feng | Yuhan Liu | Zhaoxuan Tan | Vidhisha Balachandran | Tianxing He | Yulia Tsvetkov
Findings of the Association for Computational Linguistics: ACL 2024

We propose Knowledge Crosswords, a geometric knowledge reasoning benchmark consisting of incomplete knowledge networks bounded by structured factual constraints, where LLMs are tasked with inferring the missing facts to meet all constraints. The novel setting of geometric knowledge reasoning necessitates new LM abilities beyond existing atomic/linear multi-hop QA, such as backtracking, verifying facts and constraints, reasoning with uncertainty, and more. Knowledge Crosswords contains 2,101 individual problems, covering diverse knowledge domains, and is further divided into three difficulty levels. We conduct extensive experiments to evaluate existing LLMs and approaches on Knowledge Crosswords. Results demonstrate that baseline approaches struggle with larger knowledge networks and semantically-equivalent entity distractors. In light of their limitations, we propose two new approaches, Staged Prompting and Verify-All, to augment LLMs’ abilities for error-aware backtracking and constraint verification. Our Verify-All significantly outperforms prior methods and is more robust towards problems in the hard subset. Further analysis shows that geometric knowledge reasoning poses new challenges to LLMs’ knowledge abilities, particularly in robustness towards varying option orders, complex structural constraints in knowledge networks, “none of the above” scenarios, and more.

P³Sum: Preserving Author’s Perspective in News Summarization with Diffusion Language Models
Yuhan Liu | Shangbin Feng | Xiaochuang Han | Vidhisha Balachandran | Chan Young Park | Sachin Kumar | Yulia Tsvetkov
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In this work, we take a first step towards designing summarization systems that are faithful to the author’s intent, not only the semantic content of the article. Focusing on a case study of preserving political perspectives in news summarization, we find that existing approaches alter the political opinions and stances of news articles in more than 50% of summaries, misrepresenting the intent and perspectives of the news authors. We thus propose P³Sum, a diffusion model-based summarization approach controlled by political perspective classifiers. In P³Sum, the political leaning of a generated summary is iteratively evaluated at each decoding step, and any drift from the article’s original stance incurs a loss back-propagated to the embedding layers, steering the political stance of the summary at inference time. Extensive experiments on three news summarization datasets demonstrate that P³Sum outperforms state-of-the-art summarization systems and large language models by up to 13.7% in terms of the success rate of stance preservation, with competitive performance on standard metrics of summarization quality. Our findings present a first analysis of preservation of pragmatic features in summarization, highlight the lacunae in existing summarization models—that even state-of-the-art models often struggle to preserve author’s intents—and develop new summarization systems that are more faithful to author’s perspectives.

Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Sachin Kumar | Vidhisha Balachandran | Chan Young Park | Weijia Shi | Shirley Anugrah Hayati | Yulia Tsvetkov | Noah Smith | Hannaneh Hajishirzi | Dongyeop Kang | David Jurgens
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

Teaching LLMs to Abstain across Languages via Multilingual Feedback
Shangbin Feng | Weijia Shi | Yike Wang | Wenxuan Ding | Orevaoghene Ahia | Shuyue Stella Li | Vidhisha Balachandran | Sunayana Sitaram | Yulia Tsvetkov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Multilingual LLMs often have knowledge disparities across languages, with larger gaps in under-resourced languages. Teaching LLMs to abstain in the face of knowledge gaps is thus a promising strategy to mitigate hallucinations in multilingual settings. However, previous studies on LLM abstention primarily focus on English; we find that directly applying existing solutions beyond English results in up to 20.5% performance gaps between high and low-resource languages, potentially due to LLMs’ drop in calibration and reasoning beyond a few resource-rich languages. To this end, we propose strategies to enhance LLM abstention by learning from multilingual feedback, where LLMs self-reflect on proposed answers in one language by generating multiple feedback items in related languages: we show that this helps identifying the knowledge gaps across diverse languages, cultures, and communities. Extensive experiments demonstrate that our multilingual feedback approach outperforms various strong baselines, achieving up to 9.2% improvement for low-resource languages across three black-box and open models on three datasets, featuring open-book, closed-book, and commonsense QA. Further analysis reveals that multilingual feedback is both an effective and a more equitable abstain strategy to serve diverse language speakers, and cultural factors have great impact on language selection and LLM abstention behavior, highlighting future directions for multilingual and multi-cultural reliable language modeling.

2023

Unsupervised Keyphrase Extraction via Interpretable Neural Networks
Rishabh Joshi | Vidhisha Balachandran | Emily Saldanha | Maria Glenski | Svitlana Volkova | Yulia Tsvetkov
Findings of the Association for Computational Linguistics: EACL 2023

Keyphrase extraction aims at automatically extracting a list of “important” phrases representing the key concepts in a document. Prior approaches for unsupervised keyphrase extraction resorted to heuristic notions of phrase importance via embedding clustering or graph centrality, requiring extensive domain expertise. Our work presents a simple alternative approach which defines keyphrases as document phrases that are salient for predicting the topic of the document. To this end, we propose INSPECT—an approach that uses self-explaining models for identifying influential keyphrases in a document by measuring the predictive impact of input phrases on the downstream task of the document topic classification. We show that this novel method not only alleviates the need for ad-hoc heuristics but also achieves state-of-the-art results in unsupervised keyphrase extraction in four datasets across two domains: scientific publications and news articles.

Mitigating Societal Harms in Large Language Models
Sachin Kumar | Vidhisha Balachandran | Lucille Njoo | Antonios Anastasopoulos | Yulia Tsvetkov
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Numerous recent studies have highlighted societal harms that can be caused by language technologies deployed in the wild. While several surveys, tutorials, and workshops have discussed the risks of harms in specific contexts – e.g., detecting and mitigating gender bias in NLP models – no prior work has developed a unified typology of technical approaches for mitigating harms of language generation models. Our tutorial is based on a survey we recently wrote that proposes such a typology. We will provide an overview of potential social issues in language generation, including toxicity, social biases, misinformation, factual inconsistency, and privacy violations. Our primary focus will be on how to systematically identify risks, and how eliminate them at various stages of model development, from data collection, to model development, to inference/language generation. Through this tutorial, we aim to equip NLP researchers and engineers with a suite of practical tools for mitigating safety risks from pretrained language generation models.

LEXPLAIN: Improving Model Explanations via Lexicon Supervision
Orevaoghene Ahia | Hila Gonen | Vidhisha Balachandran | Yulia Tsvetkov | Noah A. Smith
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

Model explanations that shed light on the model’s predictions are becoming a desired additional output of NLP models, alongside their predictions. Challenges in creating these explanations include making them trustworthy and faithful to the model’s predictions. In this work, we propose a novel framework for guiding model explanations by supervising them explicitly. To this end, our method, LEXplain, uses task-related lexicons to directly supervise model explanations. This approach consistently improves the model’s explanations without sacrificing performance on the task, as we demonstrate on sentiment analysis and toxicity detection. Our analyses show that our method also demotes spurious correlations (i.e., with respect to African American English dialect) when performing the task, improving fairness.

Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
Sachin Kumar | Vidhisha Balachandran | Lucille Njoo | Antonios Anastasopoulos | Yulia Tsvetkov
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Recent advances in the capacity of large language models to generate human-like text have resulted in their increased adoption in user-facing settings. In parallel, these improvements have prompted a heated discourse around the risks of societal harms they introduce, whether inadvertent or malicious. Several studies have explored these harms and called for their mitigation via development of safer, fairer models. Going beyond enumerating the risks of harms, this work provides a survey of practical methods for addressing potential threats and societal harms from language generation models. We draw on several prior works’ taxonomies of language model risks to present a structured overview of strategies for detecting and ameliorating different kinds of risks/harms of language generators. Bridging diverse strands of research, this survey aims to serve as a practical guide for both LM researchers and practitioners, with explanations of different strategies’ motivations, their limitations, and open problems for future research.

FactKB: Generalizable Factuality Evaluation using Language Models Enhanced with Factual Knowledge
Shangbin Feng | Vidhisha Balachandran | Yuyang Bai | Yulia Tsvetkov
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Evaluating the factual consistency of automatically generated summaries is essential for the progress and adoption of reliable summarization systems. Despite recent advances, existing factuality evaluation models are not robust, being especially prone to entity and relation errors in new domains. We propose FactKB—a simple new approach to factuality evaluation that is generalizable across domains, in particular with respect to entities and relations. FactKB is based on language models pretrained using facts extracted from external knowledge bases. We introduce three types of complementary factuality pretraining objectives based on entity-specific facts, facts extracted from auxiliary knowledge about entities, and facts constructed compositionally through knowledge base walks. The resulting factuality evaluation model achieves state-of-the-art performance on two in-domain news summarization benchmarks as well as on three out-of-domain scientific literature datasets. Further analysis of FactKB shows improved ability to detect erroneous entities and relations in summaries and is robust and easily generalizable across domains.

2022

Correcting Diverse Factual Errors in Abstractive Summarization via Post-Editing and Language Model Infilling
Vidhisha Balachandran | Hannaneh Hajishirzi | William Cohen | Yulia Tsvetkov
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Abstractive summarization models often generate inconsistent summaries containing factual errors or hallucinated content. Recent works focus on correcting factual errors in generated summaries via post-editing. Such correction models are trained using adversarial non-factual summaries constructed using heuristic rules for injecting errors. However, generating non-factual summaries using heuristics often does not generalize well to actual model errors. In this work, we propose to generate hard, representative synthetic examples of non-factual summaries through infilling language models. With this data, we train a more robust fact-correction model to post-edit the summaries to improve factual consistency. Through quantitative and qualitative experiments on two popular summarization datasets— CNN/DM and XSum—we show that our approach vastly outperforms prior methods in correcting erroneous summaries. Our model—FactEdit—improves factuality scores by over ~11 points on CNN/DM and over ~31 points on XSum on average across multiple summarization models, producing more factual summaries while maintaining competitive summarization quality.

2021

Investigating the Effect of Background Knowledge on Natural Questions
Vidhisha Balachandran | Bhuwan Dhingra | Haitian Sun | Michael Collins | William Cohen
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Existing work shows the benefits of integrating KBs with textual evidence for QA only on questions that are answerable by KBs alone (Sun et al., 2019). In contrast, real world QA systems often have to deal with questions that might not be directly answerable by KBs. Here, we investigate the effect of integrating background knowledge from KBs for the Natural Questions (NQ) task. We create a subset of the NQ data, Factual Questions (FQ), where the questions have evidence in the KB in the form of paths that link question entities to answer entities but still must be answered using text, to facilitate further research into KB integration methods. We propose and analyze a simple, model-agnostic approach for incorporating KB paths into text-based QA systems and establish a strong upper bound on FQ for our method using an oracle retriever. We show that several variants of Personalized PageRank based fact retrievers lead to a low recall of answer entities and consequently fail to improve QA performance. Our results suggest that fact retrieval is a bottleneck for integrating KBs into real world QA datasets

SELFEXPLAIN: A Self-Explaining Architecture for Neural Text Classifiers
Dheeraj Rajagopal | Vidhisha Balachandran | Eduard H Hovy | Yulia Tsvetkov
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We introduce SelfExplain, a novel self-explaining model that explains a text classifier’s predictions using phrase-based concepts. SelfExplain augments existing neural classifiers by adding (1) a globally interpretable layer that identifies the most influential concepts in the training set for a given sample and (2) a locally interpretable layer that quantifies the contribution of each local input concept by computing a relevance score relative to the predicted label. Experiments across five text-classification datasets show that SelfExplain facilitates interpretability without sacrificing performance. Most importantly, explanations from SelfExplain show sufficiency for model predictions and are perceived as adequate, trustworthy and understandable by human judges compared to existing widely-used baselines.

Simple and Efficient ways to Improve REALM
Vidhisha Balachandran | Ashish Vaswani | Yulia Tsvetkov | Niki Parmar
Proceedings of the 3rd Workshop on Machine Reading for Question Answering

Dense retrieval has been shown to be effective for Open Domain Question Answering, surpassing sparse retrieval methods like BM25. One such model, REALM, (Guu et al., 2020) is an end-to-end dense retrieval system that uses MLM based pretraining for improved downstream QA performance. However, the current REALM setup uses limited resources and is not comparable in scale to more recent systems, contributing to its lower performance. Additionally, it relies on noisy supervision for retrieval during fine-tuning. We propose REALM++, where we improve upon the training and inference setups and introduce better supervision signal for improving performance, without any architectural changes. REALM++ achieves ~5.5% absolute accuracy gains over the baseline while being faster to train. It also matches the performance of large models which have 3x more parameters demonstrating the efficiency of our setup.

Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics
Artidoro Pagnoni | Vidhisha Balachandran | Yulia Tsvetkov
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Modern summarization models generate highly fluent but often factually unreliable outputs. This motivated a surge of metrics attempting to measure the factuality of automatically generated summaries. Due to the lack of common benchmarks, these metrics cannot be compared. Moreover, all these methods treat factuality as a binary concept and fail to provide deeper insights on the kinds of inconsistencies made by different systems. To address these limitations, we devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets. Through these annotations we identify the proportion of different categories of factual errors and benchmark factuality metrics, showing their correlation with human judgement as well as their specific strengths and weaknesses.

StructSum: Summarization via Structured Representations
Vidhisha Balachandran | Artidoro Pagnoni | Jay Yoon Lee | Dheeraj Rajagopal | Jaime Carbonell | Yulia Tsvetkov
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Abstractive text summarization aims at compressing the information of a long source document into a rephrased, condensed summary. Despite advances in modeling techniques, abstractive summarization models still suffer from several key challenges: (i) layout bias: they overfit to the style of training corpora; (ii) limited abstractiveness: they are optimized to copying n-grams from the source rather than generating novel abstractive summaries; (iii) lack of transparency: they are not interpretable. In this work, we propose a framework based on document-level structure induction for summarization to address these challenges. To this end, we propose incorporating latent and explicit dependencies across sentences in the source document into end-to-end single-document summarization models. Our framework complements standard encoder-decoder summarization models by augmenting them with rich structure-aware document representations based on implicitly learned (latent) structures and externally-derived linguistic (explicit) structures. We show that our summarization framework, trained on the CNN/DM dataset, improves the coverage of content in the source documents, generates more abstractive summaries by generating more novel n-grams, and incorporates interpretable sentence-level structures, while performing on par with standard baselines.

2020

“A Little Birdie Told Me ... ” - Inductive Biases for Rumour Stance Detection on Social Media
Karthik Radhakrishnan | Tushar Kanakagiri | Sharanya Chakravarthy | Vidhisha Balachandran
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

The rise in the usage of social media has placed it in a central position for news dissemination and consumption. This greatly increases the potential for proliferation of rumours and misinformation. In an effort to mitigate the spread of rumours, we tackle the related task of identifying the stance (Support, Deny, Query, Comment) of a social media post. Unlike previous works, we impose inductive biases that capture platform specific user behavior. These biases, coupled with social media fine-tuning of BERT allow for better language understanding, thus yielding an F1 score of 58.7 on the SemEval 2019 task on rumour stance detection.

2018

Learning to Define Terms in the Software Domain
Vidhisha Balachandran | Dheeraj Rajagopal | Rose Catherine Kanjirathinkal | William Cohen
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

One way to test a person’s knowledge of a domain is to ask them to define domain-specific terms. Here, we investigate the task of automatically generating definitions of technical terms by reading text from the technical domain. Specifically, we learn definitions of software entities from a large corpus built from the user forum Stack Overflow. To model definitions, we train a language model and incorporate additional domain-specific information like word co-occurrence, and ontological category information. Our approach improves previous baselines by 2 BLEU points for the definition generation task. Our experiments also show the additional challenges associated with the task and the short-comings of language-model based architectures for definition generation.

Venues