Saloni Potdar - ACL Anthology

2026

Over-Searching in Search-Augmented Large Language Models
Roy Xie | Deepak Gopinath | David Qiu | Dong Lin | Haitian Sun | Saloni Potdar | Bhuwan Dhingra
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search, unnecessarily invoking a search tool even when it does not improve response quality, which leads to computational inefficiency and to hallucinations from incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our findings show: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA benchmark to foster continued research into efficient search-augmented LLMs.

2025

Do Large Language Models have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs
Yanzhu Guo | Simone Conia | Zelin Zhou | Min Li | Saloni Potdar | Henry Xiao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency towards English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method to improve the naturalness of an LLM in a target language and domain, achieving consistent improvements in naturalness without compromising the performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources and methods for the new wave of multilingual LLMs.

KG-TRICK: Unifying Textual and Relational Information Completion of Knowledge for Multilingual Knowledge Graphs
Zelin Zhou | Simone Conia | Daniel Lee | Min Li | Shenglei Huang | Umar Farooq Minhas | Saloni Potdar | Henry Xiao | Yunyao Li
Proceedings of the 31st International Conference on Computational Linguistics

Multilingual knowledge graphs (KGs) provide high-quality relational and textual information for various NLP applications, but they are often incomplete, especially in non-English languages. Previous research has shown that combining information from KGs in different languages aids either Knowledge Graph Completion (KGC), the task of predicting missing relations between entities, or Knowledge Graph Enhancement (KGE), the task of predicting missing textual information for entities. Although previous efforts have considered KGC and KGE as independent tasks, we hypothesize that they are interdependent and mutually beneficial. To this end, we introduce KG-TRICK, a novel sequence-to-sequence framework that unifies the tasks of textual and relational information completion for multilingual KGs. KG-TRICK demonstrates that: i) it is possible to unify the tasks of KGC and KGE into a single framework, and ii) combining textual information from multiple languages is beneficial to improve the completeness of a KG. As part of our contributions, we also introduce WikiKGE10++, the largest manually-curated benchmark for textual information completion of KGs, which features over 25,000 entities across 10 diverse languages.

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Saloni Potdar | Lina Rojas-Barahona | Sebastien Montella
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning
Yajie Li | Albert Galimov | Mitra Datta Ganapaneni | Pujitha Thejaswi | De Meng | Priyanshu Kumar | Saloni Potdar
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. The cases are then handled by a low-computational entity linker (e.g. ReFinED) and more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
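
The easy/hard routing step described in this abstract can be illustrated with a minimal sketch. It is not ARTER's implementation: the single score-margin signal, the `Mention` class, the threshold value, and the two handler callables are all hypothetical stand-ins for the complementary signals and linkers the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Mention:
    """A mention with hypothetical per-candidate scores (higher is better)."""
    text: str
    candidate_scores: List[float]


def is_easy(mention: Mention, margin_threshold: float = 0.2) -> bool:
    """Assumed signal: a mention is 'easy' when the top candidate clearly
    beats the runner-up. The real system combines several signals."""
    ranked = sorted(mention.candidate_scores, reverse=True)
    if len(ranked) < 2:
        # A single candidate is unambiguous; no candidates is treated as hard.
        return len(ranked) == 1
    return ranked[0] - ranked[1] >= margin_threshold


def route(
    mention: Mention,
    cheap_linker: Callable[[Mention], str],
    llm_reasoner: Callable[[Mention], str],
    margin_threshold: float = 0.2,
) -> str:
    """Send easy mentions to the cheap linker and hard ones to the LLM."""
    handler = cheap_linker if is_easy(mention, margin_threshold) else llm_reasoner
    return handler(mention)
```

Under this assumption, only mentions whose top two candidates score close together reach the expensive handler, which is where the token savings would come from.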

mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages
Hellina Hailu Nigatu | Min Li | Maartje Ter Hoeve | Saloni Potdar | Sarah Chasins
Findings of the Association for Computational Linguistics: ACL 2025

Knowledge Graphs represent real-world entities and the relationships between them. Multilingual Knowledge Graph Construction (mKGC) refers to the task of automatically constructing or predicting missing entities and links for knowledge graphs in a multilingual setting. In this work, we reformulate the mKGC task as a Question Answering (QA) task and introduce mRAKL: a Retrieval-Augmented Generation (RAG) based system to perform mKGC. We achieve this by using the head entity and linking relation in a question, and having our model predict the tail entity as an answer. Our experiments focus primarily on two low-resourced languages: Tigrinya and Amharic. We experiment with using higher-resourced languages, Arabic and English, to utilize cross-lingual transfer for mKGC. With a BM25 retriever, we find that the RAG-based approach improves performance over a no-context setting. Further, our ablation studies show that with an idealized retrieval system, mRAKL improves accuracy by up to 4.92 and 8.79 percentage points for Tigrinya and Amharic, respectively.
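
The reformulation of a knowledge graph triple as a question can be sketched as follows. This is an illustration under assumptions, not the mRAKL code: the English templates, the fallback template, the prompt layout, and the function names are hypothetical, and the retrieved passages are simply passed in as strings rather than fetched with BM25.

```python
from typing import Dict, List

# Hypothetical question templates keyed by relation name (illustrative only).
TEMPLATES: Dict[str, str] = {
    "capital": "What is the capital of {head}?",
    "country": "In which country is {head} located?",
}


def triple_to_question(head: str, relation: str) -> str:
    """Turn a (head, relation, ?) query into a question whose answer
    is the missing tail entity."""
    template = TEMPLATES.get(relation, "What is the {relation} of {head}?")
    return template.format(head=head, relation=relation)


def build_prompt(head: str, relation: str, passages: List[str]) -> str:
    """Prepend retrieved passages (if any) as context for the question.
    With no passages this corresponds to a no-context setting."""
    question = triple_to_question(head, relation)
    if not passages:
        return f"Question: {question}\nAnswer:"
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

For example, `build_prompt("Ethiopia", "capital", [])` yields a question-only prompt, while passing a list of retrieved passages places them in a context block ahead of the question.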

SemEval-2025 Task 2: Entity-Aware Machine Translation
Simone Conia | Min Li | Roberto Navigli | Saloni Potdar
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Translating text that contains complex or challenging named entities—e.g., cultural-specific book and movie titles, location names, proper nouns, food names, etc.—remains a difficult task for modern machine translation systems, including the latest large language models. To systematically study and advance progress in this area, we organized Entity-Aware Machine Translation, or EA-MT, a shared task that evaluates how well systems handle entity translation across 10 language pairs. With EA-MT, we introduce XC-Translate, a novel gold benchmark comprising over 50K manually-translated sentences with entity names that can deviate significantly from word-to-word translations in their target languages. This paper describes the creation process of XC-Translate, provides an overview of the approaches explored by our participants, presents the main evaluation findings, and points toward open research directions, such as contextual retrieval methods for low-resource entities and more robust evaluation metrics for entity correctness. We hope that our shared task will inspire further research in entity-aware machine translation and foster the development of more culturally-accurate translation systems.

2024

AGRaME: Any-Granularity Ranking with Multi-Vector Embeddings
Revanth Gangi Reddy | Omar Attia | Yunyao Li | Heng Ji | Saloni Potdar
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Ranking is a fundamental problem in search. However, existing ranking algorithms usually restrict the granularity of ranking to full passages or require a specific dense index for each desired level of granularity. Such lack of flexibility in granularity negatively affects many applications that can benefit from more granular ranking, such as sentence-level ranking for open-domain QA, or proposition-level ranking for attribution. In this work, we introduce the idea of any-granularity ranking, which leverages multi-vector embeddings to rank at varying levels of granularity while maintaining encoding at a single (coarser) level of granularity. We propose a multi-granular contrastive loss for training multi-vector approaches and validate its utility with both sentences and propositions as ranking units. Finally, we demonstrate the application of proposition-level ranking to post-hoc citation addition in retrieval-augmented generation, surpassing the performance of prompt-driven citation generation.

Towards Cross-Cultural Machine Translation with Retrieval-Augmented Generation from Multilingual Knowledge Graphs
Simone Conia | Daniel Lee | Min Li | Umar Farooq Minhas | Saloni Potdar | Yunyao Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Translating text that contains entity names is a challenging task, as cultural-related references can vary significantly across languages. These variations may also be caused by transcreation, an adaptation process that entails more than transliteration and word-for-word translation. In this paper, we address the problem of cross-cultural translation on two fronts: (i) we introduce XC-Translate, the first large-scale, manually-created benchmark for machine translation that focuses on text that contains potentially culturally-nuanced entity names, and (ii) we propose KG-MT, a novel end-to-end method to integrate information from a multilingual knowledge graph into a neural machine translation model by leveraging a dense retrieval mechanism. Our experiments and analyses show that current machine translation systems and large language models still struggle to translate texts containing entity names, whereas KG-MT outperforms state-of-the-art approaches by a large margin, obtaining a 129% and 62% relative improvement compared to NLLB-200 and GPT-4, respectively.

ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA Datasets with Large Language Models
Ronak Pradeep | Daniel Lee | Ali Mousavi | Jeffrey Pound | Yisi Sang | Jimmy Lin | Ihab Ilyas | Saloni Potdar | Mostafa Arefiyan | Yunyao Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

The rapid evolution of Large Language Models (LLMs) and conversational assistants necessitates dynamic, scalable, and configurable conversational datasets for training and evaluation. These datasets must accommodate diverse user interaction modes, including text and voice, each presenting unique modeling challenges. Knowledge Graphs (KGs), with their structured and evolving nature, offer an ideal foundation for current and precise knowledge. Although human-curated KG-based conversational datasets exist, they struggle to keep pace with the rapidly changing user information needs. We present ConvKGYarn, a scalable method for generating up-to-date and configurable conversational KGQA datasets. Qualitative psychometric analyses demonstrate ConvKGYarn’s effectiveness in producing high-quality data comparable to popular conversational KGQA datasets across various metrics. ConvKGYarn excels in adhering to human interaction configurations and operating at a significantly larger scale. We showcase ConvKGYarn’s utility by testing LLMs on diverse conversations, exploring model behavior on conversational KGQA sets with different configurations grounded in the same KG fact set. Our results highlight the ability of ConvKGYarn to improve KGQA foundations and evaluate parametric knowledge of LLMs, thus offering a robust solution to the constantly evolving landscape of conversational assistants.

Entity Disambiguation via Fusion Entity Decoding
Junxiong Wang | Ali Mousavi | Omar Attia | Ronak Pradeep | Saloni Potdar | Alexander Rush | Umar Farooq Minhas | Yunyao Li
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Entity disambiguation (ED), which links the mentions of ambiguous entities to their referent entities in a knowledge base, serves as a core component in entity linking (EL). Existing generative approaches demonstrate improved accuracy compared to classification approaches under the standardized ZELDA benchmark. Nevertheless, generative approaches suffer from the need for large-scale pre-training and inefficient generation. Most importantly, entity descriptions, which could contain crucial information to distinguish similar entities from each other, are often overlooked. We propose an encoder-decoder model to disambiguate entities with more detailed entity descriptions. Given text and candidate entities, the encoder learns interactions between the text and each candidate entity, producing representations for each entity candidate. The decoder then fuses the representations of entity candidates together and selects the correct entity. Our experiments, conducted on various entity disambiguation benchmarks, demonstrate the strong and robust performance of this model, particularly +1.5% in the ZELDA benchmark compared with GENRE. Furthermore, we integrate this approach into the retrieval/reader framework and observe +1.5% improvements in end-to-end entity linking in the GERBIL benchmark compared with EntQA.

2022

Distinguish Sense from Nonsense: Out-of-Scope Detection for Virtual Assistants
Cheng Qian | Haode Qi | Gengyu Wang | Ladislav Kunc | Saloni Potdar
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Out of Scope (OOS) detection in Conversational AI solutions enables a chatbot to handle a conversation gracefully when it is unable to make sense of the end-user query. Accurately tagging a query as out-of-domain is particularly hard in scenarios when the chatbot is not equipped to handle a topic which has semantic overlap with an existing topic it is trained on. We propose a simple yet effective OOS detection method that outperforms standard OOS detection methods in a real-world deployment of virtual assistants. We discuss the various design and deployment considerations for a cloud platform solution to train virtual assistants and deploy them at scale. Additionally, we propose a collection of datasets that replicates real-world scenarios and show comprehensive results in various settings using both offline and online evaluation metrics.

Benchmarking Language-agnostic Intent Classification for Virtual Assistant Platforms
Gengyu Wang | Cheng Qian | Lin Pan | Haode Qi | Ladislav Kunc | Saloni Potdar
Proceedings of the Workshop on Multilingual Information Access (MIA)

Current virtual assistant (VA) platforms are beholden to the limited number of languages they support. Every component, such as the tokenizer and intent classifier, is engineered for specific languages in these intricate platforms. Thus, supporting a new language in such platforms is a resource-intensive operation requiring expensive re-training and re-designing. In this paper, we propose a benchmark for evaluating language-agnostic intent classification, the most critical component of VA platforms. To ensure the benchmarking is challenging and comprehensive, we include 29 public and internal datasets across 10 low-resource languages and evaluate various training and testing settings with consideration of both accuracy and training time. The benchmarking result shows that Watson Assistant, among 7 commercial VA platforms and pre-trained multilingual language models (LMs), demonstrates close-to-best accuracy with the best accuracy-training time trade-off.

Fast and Light-Weight Answer Text Retrieval in Dialogue Systems
Hui Wan | Siva Sankalp Patel | J William Murdock | Saloni Potdar | Sachindra Joshi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track

Dialogue systems can benefit from being able to search through a corpus of text to find information relevant to user requests, especially when encountering a request for which no manually curated response is available. The state-of-the-art technology for neural dense retrieval or re-ranking involves deep learning models with hundreds of millions of parameters. However, it is difficult and expensive to get such models to operate at an industrial scale, especially for cloud services that often need to support a large number of individually customized dialogue systems, each with its own text corpus. We report our work on enabling advanced neural dense retrieval systems to operate effectively at scale on relatively inexpensive hardware. We compare with leading alternative industrial solutions and show that we can provide a solution that is effective, fast, and cost-efficient.

2021

Multilingual BERT Post-Pretraining Alignment
Lin Pan | Chung-Wei Hang | Haode Qi | Abhishek Shah | Saloni Potdar | Mo Yu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a simple method to align multilingual contextual embeddings as a post-pretraining step for improved cross-lingual transferability of the pretrained language models. Using parallel data, our method aligns embeddings on the word level through the recently proposed Translation Language Modeling objective as well as on the sentence level via contrastive learning and random input shuffling. We also perform sentence-level code-switching with English when finetuning on downstream tasks. On XNLI, our best model (initialized from mBERT) improves over mBERT by 4.7% in the zero-shot setting and achieves results comparable to XLM for translate-train while using less than 18% of the same parallel data and 31% fewer model parameters. On MLQA, our model outperforms XLM-R_Base, which has 57% more parameters than ours.

Benchmarking Commercial Intent Detection Services with Practice-Driven Evaluations
Haode Qi | Lin Pan | Atin Sood | Abhishek Shah | Ladislav Kunc | Mo Yu | Saloni Potdar
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Intent detection is a key component of modern goal-oriented dialog systems that accomplish a user task by predicting the intent of users’ text input. There are three primary challenges in designing robust and accurate intent detection models. First, typical intent detection models require a large amount of labeled data to achieve high accuracy. Unfortunately, in practical scenarios it is more common to find small, unbalanced, and noisy datasets. Secondly, even with large training data, the intent detection models can see a different distribution of test data when being deployed in the real world, leading to poor accuracy. Finally, a practical intent detection model must be computationally efficient in both training and single query inference so that it can be used continuously and re-trained frequently. We benchmark intent detection methods on a variety of datasets. Our results show that Watson Assistant’s intent detection model outperforms other commercial solutions and is comparable to large pretrained language models while requiring only a fraction of computational resources and training data. Watson Assistant demonstrates a higher degree of robustness when the training and test distributions differ.

Narrative Question Answering with Cutting-Edge Open-Domain QA Techniques: A Comprehensive Study
Xiangyang Mou | Chenghao Yang | Mo Yu | Bingsheng Yao | Xiaoxiao Guo | Saloni Potdar | Hui Su
Transactions of the Association for Computational Linguistics, Volume 9

Recent advancements in open-domain question answering (ODQA), that is, finding answers from a large open-domain corpus like Wikipedia, have led to human-level performance on many datasets. However, progress in QA over book stories (Book QA) lags despite its similar task formulation to ODQA. This work provides a comprehensive and quantitative analysis of the difficulty of Book QA: (1) We benchmark the research on the NarrativeQA dataset with extensive experiments with cutting-edge ODQA techniques. This quantifies the challenges Book QA poses, as well as advances the published state-of-the-art with a ∼7% absolute improvement on ROUGE-L. (2) We further analyze the detailed challenges in Book QA through human studies. Our findings indicate that the event-centric questions dominate this task, which exemplifies the inability of existing QA models to handle event-oriented scenarios.

2020

Frustratingly Hard Evidence Retrieval for QA Over Books
Xiangyang Mou | Mo Yu | Bingsheng Yao | Chenghao Yang | Xiaoxiao Guo | Saloni Potdar | Hui Su
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events

A lot of progress has been made to improve question answering (QA) in recent years, but the special problem of QA over narrative book stories has not been explored in depth. We formulate BookQA as an open-domain QA task given its similar dependency on evidence retrieval. We further investigate how state-of-the-art open-domain QA approaches can help BookQA. Besides achieving state-of-the-art on the NarrativeQA benchmark, our study also reveals the difficulty of evidence retrieval in books with a wealth of experiments and analysis, which necessitates future effort on novel solutions for evidence retrieval in BookQA.

2019

Out-of-Domain Detection for Low-Resource Text Classification Tasks
Ming Tan | Yang Yu | Haoyu Wang | Dakuo Wang | Saloni Potdar | Shiyu Chang | Mo Yu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Out-of-domain (OOD) detection for low-resource text classification is a realistic but understudied task. The goal is to detect the OOD cases with limited in-domain (ID) training data, since in machine learning applications we observe that training data is often insufficient. In this work, we propose an OOD-resistant Prototypical Network to tackle this zero-shot OOD detection and few-shot ID classification task. Evaluations on real-world datasets show that the proposed solution outperforms state-of-the-art methods on the zero-shot OOD detection task, while maintaining competitive performance on the ID classification task.

Context-Aware Conversation Thread Detection in Multi-Party Chat
Ming Tan | Dakuo Wang | Yupeng Gao | Haoyu Wang | Saloni Potdar | Xiaoxiao Guo | Shiyu Chang | Mo Yu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In multi-party chat, it is common for multiple conversations to occur concurrently, leading to intermingled conversation threads in chat logs. In this work, we propose a novel Context-Aware Thread Detection (CATD) model that automatically disentangles these conversation threads. We evaluate our model on four real-world datasets and demonstrate an overall improvement in thread detection accuracy over state-of-the-art benchmarks.

Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers
Haoyu Wang | Ming Tan | Mo Yu | Shiyu Chang | Dakuo Wang | Kun Xu | Xiaoxiao Guo | Saloni Potdar
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Many approaches to extract multiple relations from a paragraph require multiple passes over the paragraph. In practice, multiple passes are computationally expensive, and this makes it difficult to scale to longer paragraphs and larger text corpora. In this work, we focus on the task of multiple relation extraction by encoding the paragraph only once. We build our solution upon the pre-trained self-attentive models (Transformer), where we first add a structured prediction layer to handle extraction between multiple entity pairs, then enhance the paragraph embedding to capture multiple relational information associated with each entity with entity-aware attention. We show that our approach is not only scalable but also achieves state-of-the-art performance on the standard benchmark ACE 2005.

2018

Diverse Few-Shot Text Classification with Multiple Metrics
Mo Yu | Xiaoxiao Guo | Jinfeng Yi | Shiyu Chang | Saloni Potdar | Yu Cheng | Gerald Tesauro | Haoyu Wang | Bowen Zhou
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We study few-shot learning in natural language domains. Compared to many existing works that apply either metric-based or optimization-based meta-learning to the image domain with low inter-task variance, we consider a more realistic setting, where tasks are diverse. However, this imposes tremendous difficulties on existing state-of-the-art metric-based algorithms, since a single metric is insufficient to capture complex task variations in the natural language domain. To alleviate the problem, we propose an adaptive metric learning approach that automatically determines the best weighted combination from a set of metrics obtained from meta-training tasks for a newly seen few-shot task. Extensive quantitative evaluations on real-world sentiment analysis and dialog intent classification datasets demonstrate that the proposed method performs favorably against state-of-the-art few-shot learning algorithms in terms of predictive accuracy. We make our code and data available for further study.

2014

Identifying Student Leaders from MOOC Discussion Forums through Language Influence
Seungwhan Moon | Saloni Potdar | Lara Martin
Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs

Co-authors

Ladislav Kunc 3

Umar Farooq Minhas 3

Xiangyang Mou 2

Ronak Pradeep 2

Abhishek Shah 2

Chenghao Yang 2

Bingsheng Yao 2

Mostafa Arefiyan 1

Sarah Chasins 1

Bhuwan Dhingra 1

Albert Galimov 1

Mitra Datta Ganapaneni 1

Revanth Gangi Reddy 1

Deepak Gopinath 1

Chung-Wei Hang 1

Shenglei Huang 1

Sachindra Joshi 1

Priyanshu Kumar 1

Sebastien Montella 1

Seungwhan Moon 1

J. William Murdock 1

Roberto Navigli 1

Hellina Hailu Nigatu 1

Siva Sankalp Patel 1

Jeffrey Pound 1

Lina M. Rojas Barahona 1

Alexander M. Rush 1

Maartje Ter Hoeve 1

Gerald Tesauro 1

Pujitha Thejaswi 1

Junxiong Wang 1

Venues