Avirup Sil - ACL Anthology

Avirup Sil

Also published as: Avi Sil

2025

SMART: Self-Aware Agent for Tool Overuse Mitigation
Cheng Qian | Emre Can Acikgoz | Hongru Wang | Xiusi Chen | Avirup Sil | Dilek Hakkani-Tür | Gokhan Tur | Heng Ji
Findings of the Association for Computational Linguistics: ACL 2025

Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to **Tool Overuse**, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce **SMART** (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent’s self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce **SMART-ER**, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop **SMARTAgent**, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match its 70B counterpart and GPT-4. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.

ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
Cheng Qian | Hongyi Du | Hongru Wang | Xiusi Chen | Yuji Zhang | Avirup Sil | ChengXiang Zhai | Kathleen McKeown | Heng Ji
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect real-world complexity, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce **ModelingBench**, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. To solve these challenges, we present **ModelingAgent**, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges. All the codes are released for future research.

ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models
Biddut Sarker Bijoy | Mohammad Saqib Hasan | Pegah Alipoormolabashi | Avirup Sil | Aruna Balasubramanian | Niranjan Balasubramanian
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Multi-agent systems with smaller language models (SLMs) present a viable alternative to single agent systems powered by large language models (LLMs) for addressing complex problems. In this work, we study how these alternatives compare in terms of both effectiveness and efficiency. To study this trade-off, we instantiate single and multi-agent systems for the complex problems in the AppWorld environment using different sized language models.We find that difficulties with long-trajectory learning in smaller language models (SLMs) limit their performance. Even when trained for specialized roles, SLMs fail to learn all subtasks effectively. To address this issue, we introduce a simple progressive sub-task training strategy, which introduces new sub-tasks progressively in each training epoch. We find that this novel strategy, analogous to instance level curriculum learning, consistently improves the effectiveness of multi-agents at all configurations. Our Pareto analysis shows that fine-tuned multi-agent systems yield better effectiveness-efficiency trade-offs. Additional ablations and analyses shows the importance of our progressive training strategy and its ability to reduce subtask error rates.

CLAPnq: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems
Sara Rosenthal | Avirup Sil | Radu Florian | Salim Roukos
Transactions of the Association for Computational Linguistics, Volume 13

Retrieval Augmented Generation (RAG) has become a popular application for large language models. It is preferable that successful RAG systems provide accurate answers that are supported by being grounded in a passage without any hallucinations. While considerable work is required for building a full RAG pipeline, being able to benchmark performance is also necessary. We present CLAPnq, a benchmark Long-form Question Answering dataset for the full RAG pipeline. CLAPnq includes long answers with grounded gold passages from Natural Questions (NQ) and a corpus to perform either retrieval, generation, or the full RAG pipeline. The CLAPnq answers are concise, 3x smaller than the full passage, and cohesive, meaning that the answer is composed fluently, often by integrating multiple pieces of the passage that are not contiguous. RAG models must adapt to these properties to be successful at CLAPnq. We present baseline experiments and analysis for CLAPnq that highlight areas where there is still significant room for improvement in grounded RAG. CLAPnq is publicly available at https://github.com/primeqa/clapnq.

2024

FIRST: Faster Improved Listwise Reranking with Single Token Decoding
Revanth Gangi Reddy | JaeHyeok Doo | Yifei Xu | Md Arafat Sultan | Deevya Swain | Avirup Sil | Heng Ji
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have significantly advanced the field of information retrieval, particularly for reranking. Listwise LLM rerankers have showcased superior performance and generalizability compared to existing supervised approaches. However, conventional listwise LLM reranking methods lack efficiency as they provide ranking output in the form of a generated ordered sequence of candidate passage identifiers. Further, they are trained with the typical language modeling objective, which treats all ranking errors uniformly–potentially at the cost of misranking highly relevant passages. Addressing these limitations, we introduce FIRST, a novel listwise LLM reranking approach leveraging the output logits of the first generated identifier to directly obtain a ranked ordering of the candidates. Further, we incorporate a learning-to-rank loss during training, prioritizing ranking accuracy for the more relevant passages. Empirical results demonstrate that FIRST accelerates inference by 50% while maintaining a robust ranking performance with gains across the BEIR benchmark. Finally, to illustrate the practical effectiveness of listwise LLM rerankers, we investigate their application in providing relevance feedback for retrievers during inference. Our results show that LLM rerankers can provide a stronger distillation signal compared to cross-encoders, yielding substantial improvements in retriever recall after relevance feedback.

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
Yi Yang | Aida Davani | Avi Sil | Anoop Kumar
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)

2023

The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PrimeQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PrimeQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PrimeQA is available at: https://github.com/primeqa.

UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers
Jon Saad-Falcon | Omar Khattab | Keshav Santhanam | Radu Florian | Martin Franz | Salim Roukos | Avirup Sil | Md Sultan | Christopher Potts
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Many information retrieval tasks require large labeled datasets for fine-tuning. However, such datasets are often unavailable, and their utility for real-world applications can diminish quickly due to domain shifts. To address this challenge, we develop and motivate a method for using large language models (LLMs) to generate large numbers of synthetic queries cheaply. The method begins by generating a small number of synthetic queries using an expensive LLM. After that, a much less expensive one is used to create large numbers of synthetic queries, which are used to fine-tune a family of reranker models. These rerankers are then distilled into a single efficient retriever for use in the target domain. We show that this technique boosts zero-shot accuracy in long-tail domains and achieves substantially lower latency than standard reranking methods.

Muted: Multilingual Targeted Offensive Speech Identification and Visualization
Christoph Tillmann | Aashka Trivedi | Sara Rosenthal | Santosh Borse | Rong Zhang | Avirup Sil | Bishwaranjan Bhattacharjee
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Offensive language such as hate, abuse, and profanity (HAP) occurs in various content on the web. While previous work has mostly dealt with sentence level annotations, there have been a few recent attempts to identify offensive spans as well. We build upon this work and introduce MUTED, a system to identify multilingual HAP content by displaying offensive arguments and their targets using heat maps to indicate their intensity. MUTED can leverage any transformer-based HAP-classification model and its attention mechanism out-of-the-box to identify toxic spans, without further fine-tuning. In addition, we use the spaCy library to identify the specific targets and arguments for the words predicted by the attention heatmaps. We present the model’s performance on identifying offensive spans and their targets in existing datasets and present new annotations on German text. Finally, we demonstrate our proposed visualization tool on multilingual inputs.

Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
Keshav Santhanam | Jon Saad-Falcon | Martin Franz | Omar Khattab | Avi Sil | Radu Florian | Md Arafat Sultan | Salim Roukos | Matei Zaharia | Christopher Potts
Findings of the Association for Computational Linguistics: ACL 2023

Neural information retrieval (IR) systems have progressed rapidly in recent years, in large part due to the release of publicly available benchmarking tasks. Unfortunately, some dimensions of this progress are illusory: the majority of the popular IR benchmarks today focus exclusively on downstream task accuracy and thus conceal the costs incurred by systems that trade away efficiency for quality. Latency, hardware cost, and other efficiency considerations are paramount to the deployment of IR systems in user-facing settings. We propose that IR benchmarks structure their evaluation methodology to include not only metrics of accuracy, but also efficiency considerations such as a query latency and the corresponding cost budget for a reproducible hardware setting. For the popular IR benchmarks MS MARCO and XOR-TyDi, we show how the best choice of IR system varies according to how these efficiency considerations are chosen and weighed. We hope that future benchmarks will adopt these guidelines toward more holistic IR evaluation.

Towards Effective Long-Form QA with Evidence Augmentation
Mengxia Yu | Sara Rosenthal | Mihaela Bornea | Avi Sil
Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

In this study, we focus on the challenge of improving Long-form Question Answering (LFQA) by extracting and effectively utilizing knowledge from a large set of retrieved passages. We first demonstrate the importance of accurate evidence retrieval for LFQA, showing that optimal extracted knowledge from passages significantly benefits the generation. We also show that the choice of generative models impacts the system’s ability to leverage the evidence and produce answers that are grounded in the retrieved passages. We propose a Mixture of Experts (MoE) model as an alternative to the Fusion in Decoder (FiD) used in state-of-the-art LFQA systems and we compare these two models in our experiments.

2022

On The Ingredients of an Effective Zero-shot Semantic Parser
Pengcheng Yin | John Wieting | Avirup Sil | Graham Neubig
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Semantic parsers map natural language utterances into meaning representations (e.g., programs). Such models are typically bottlenecked by the paucity of training data due to the required laborious annotation efforts. Recent studies have performed zero-shot learning by synthesizing training examples of canonical utterances and programs from a grammar, and further paraphrasing these utterances to improve linguistic diversity. However, such synthetic examples cannot fully capture patterns in real data. In this paper we analyze zero-shot parsers through the lenses of the language and logical gaps (Herzig and Berant, 2019), which quantify the discrepancy of language and programmatic patterns between the canonical examples and real-world user-issued ones. We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods using canonical examples that most likely reflect real user intents. Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.

Towards Robust Neural Retrieval with Source Domain Synthetic Pre-Finetuning
Revanth Gangi Reddy | Vikas Yadav | Md Arafat Sultan | Martin Franz | Vittorio Castelli | Heng Ji | Avirup Sil
Proceedings of the 29th International Conference on Computational Linguistics

Research on neural IR has so far been focused primarily on standard supervised learning settings, where it outperforms traditional term matching baselines. Many practical use cases of such models, however, may involve previously unseen target domains. In this paper, we propose to improve the out-of-domain generalization of Dense Passage Retrieval (DPR) - a popular choice for neural IR - through synthetic data augmentation only in the source domain. We empirically show that pre-finetuning DPR with additional synthetic data in its source domain (Wikipedia), which we generate using a fine-tuned sequence-to-sequence generator, can be a low-cost yet effective first step towards its generalization. Across five different test sets, our augmented model shows more robust performance than DPR in both in-domain and zero-shot out-of-domain evaluation.

Task Transfer and Domain Adaptation for Zero-Shot Question Answering
Xiang Pan | Alex Sheng | David Shimshoni | Aditya Singhal | Sara Rosenthal | Avirup Sil
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Pretrained language models have shown success in various areas of natural language processing, including reading comprehension tasks. However, when applying machine learning methods to new domains, labeled data may not always be available. To address this, we use supervised pretraining on source-domain data to reduce sample complexity on domainspecific downstream tasks. We evaluate zeroshot performance on domain-specific reading comprehension tasks by combining task transfer with domain adaptation to fine-tune a pretrained model with no labelled data from the target task. Our approach outperforms DomainAdaptive Pretraining on downstream domainspecific reading comprehension tasks in 3 out of 4 domains.

Not to Overfit or Underfit the Source Domains? An Empirical Study of Domain Generalization in Question Answering
Md Arafat Sultan | Avi Sil | Radu Florian
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Machine learning models are prone to overfitting their training (source) domains, which is commonly believed to be the reason why they falter in novel target domains. Here we examine the contrasting view that multi-source domain generalization (DG) is first and foremost a problem of mitigating source domain underfitting: models not adequately learning the signal already present in their multi-domain training data. Experiments on a reading comprehension DG benchmark show that as a model learns its source domains better—using familiar methods such as knowledge distillation (KD) from a bigger model—its zero-shot out-of-domain utility improves at an even faster pace. Improved source domain learning also demonstrates superior out-of-domain generalization over three popular existing DG approaches that aim to limit overfitting. Our implementation of KD-based domain generalization is available via PrimeQA at: https://ibm.biz/domain-generalization-with-kd.

Zero-Shot Dynamic Quantization for Transformer Inference
Yousef El-kurdi | Jerry Quinn | Avi Sil
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

We introduce a novel run-time method for significantly reducing the accuracy loss associated with quantizing BERT-like models to 8-bit integers. Existing methods for quantizing models either modify the training procedure, or they require an additional calibration step to adjust parameters that also requires a selected held-out dataset. Our method permits taking advantage of quantization without the need for these adjustments. We present results on several NLP tasks demonstrating the usefulness of this technique.

Learning Cross-Lingual IR from an English Retriever
Yulong Li | Martin Franz | Md Arafat Sultan | Bhavani Iyer | Young-Suk Lee | Avirup Sil
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation), a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR relies on a highly effective but computationally expensive two-stage inference process consisting of query translation and monolingual IR, while the student, DR.DECR, executes a single CLIR step. We teach DR.DECR powerful multilingual representations as well as CLIR by optimizing two corresponding KD objectives. Learning useful representations of non-English text from an English-only retriever is accomplished through a cross-lingual token alignment algorithm that relies on the representation capabilities of the underlying multilingual encoders. In both in-domain and zero-shot out-of-domain evaluation, DR.DECR demonstrates far superior accuracy over direct fine-tuning with labeled CLIR data. It is also the best single-model retriever on the XOR-TyDi benchmark at the time of this writing.

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations
Hannaneh Hajishirzi | Qiang Ning | Avi Sil
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

2021

InfoSurgeon: Cross-Media Fine-grained Information Consistency Checking for Fake News Detection
Yi Fung | Christopher Thomas | Revanth Gangi Reddy | Sandeep Polisetty | Heng Ji | Shih-Fu Chang | Kathleen McKeown | Mohit Bansal | Avi Sil
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

To defend against machine-generated fake news, an effective mechanism is urgently needed. We contribute a novel benchmark for fake news detection at the knowledge element level, as well as a solution for this task which incorporates cross-media consistency checking to detect the fine-grained knowledge elements making news articles misinformative. Due to training data scarcity, we also formulate a novel data synthesis method by manipulating knowledge elements within the knowledge graph to generate noisy training data with specific, hard to detect, known inconsistencies. Our detection approach outperforms the state-of-the-art (up to 16.8% accuracy gain), and more critically, yields fine-grained explanations.

VAULT: VAriable Unified Long Text Representation for Machine Reading Comprehension
Haoyang Wen | Anthony Ferritto | Heng Ji | Radu Florian | Avi Sil
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Existing models on Machine Reading Comprehension (MRC) require complex model architecture for effectively modeling long texts with paragraph representation and classification, thereby making inference computationally inefficient for production use. In this work, we propose VAULT: a light-weight and parallel-efficient paragraph representation for MRC based on contextualized representation from long document input, trained using a new Gaussian distribution-based objective that pays close attention to the partially correct instances that are close to the ground-truth. We validate our VAULT architecture showing experimental results on two benchmark MRC datasets that require long context modeling; one Wikipedia-based (Natural Questions (NQ)) and the other on TechNotes (TechQA). VAULT can achieve comparable performance on NQ with a state-of-the-art (SOTA) complex document modeling approach while being 16 times faster, demonstrating the efficiency of our proposed model. We also demonstrate that our model can also be effectively adapted to a completely different domain – TechQA – with large improvement over a model fine-tuned on a previously published large PLM.

Multi-Domain Multilingual Question Answering
Sebastian Ruder | Avi Sil
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Question answering (QA) is one of the most challenging and impactful tasks in natural language processing. Most research in QA, however, has focused on the open-domain or monolingual setting while most real-world applications deal with specific domains or languages. In this tutorial, we attempt to bridge this gap. Firstly, we introduce standard benchmarks in multi-domain and multilingual QA. In both scenarios, we discuss state-of-the-art approaches that achieve impressive performance, ranging from zero-shot transfer learning to out-of-the-box training with open-domain QA systems. Finally, we will present open research problems that this new research agenda poses such as multi-task learning, cross-lingual transfer learning, domain adaptation and training large scale pre-trained multilingual language models.

Event Time Extraction and Propagation via Graph Attention Networks
Haoyang Wen | Yanru Qu | Heng Ji | Qiang Ning | Jiawei Han | Avi Sil | Hanghang Tong | Dan Roth
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Grounding events into a precise timeline is important for natural language understanding but has received limited attention in recent work. This problem is challenging due to the inherent ambiguity of language and the requirement for information propagation over inter-related events. This paper first formulates this problem based on a 4-tuple temporal representation used in entity slot filling, which allows us to represent fuzzy time spans more conveniently. We then propose a graph attention network-based approach to propagate temporal information over document-level event graphs constructed by shared entity arguments and temporal relations. To better evaluate our approach, we present a challenging new benchmark on the ACE2005 corpus, where more than 78% of events do not have time spans mentioned explicitly in their local contexts. The proposed approach yields an absolute gain of 7.0% in match rate over contextualized embedding approaches, and 16.3% higher match rate compared to sentence-level manual event time argument annotation.

Capturing Row and Column Semantics in Transformer Based Question Answering over Tables
Michael Glass | Mustafa Canim | Alfio Gliozzo | Saneem Chemmengath | Vishwajeet Kumar | Rishav Chakravarti | Avi Sil | Feifei Pan | Samarth Bharadwaj | Nicolas Rodolfo Fauceglia
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Transformer based architectures are recently used for the task of answering questions over tables. In order to improve the accuracy on this task, specialized pre-training techniques have been developed and applied on millions of open-domain web tables. In this paper, we propose two novel approaches demonstrating that one can achieve superior performance on table QA task without even using any of these specialized pre-training techniques. The first model, called RCI interaction, leverages a transformer based architecture that independently classifies rows and columns to identify relevant cells. While this model yields extremely high accuracy at finding cell values on recent benchmarks, a second model we propose, called RCI representation, provides a significant efficiency advantage for online QA systems over tables by materializing embeddings for existing tables. Experiments on recent benchmarks prove that the proposed methods can effectively locate cell values on tables (up to ~98% Hit@1 accuracy on WikiSQL lookup questions). Also, the interaction model outperforms the state-of-the-art transformer based approaches, pre-trained on very large table corpora (TAPAS and TaBERT), achieving ~3.4% and ~18.86% additional precision improvement on the standard WikiSQL benchmark.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations
Avi Sil | Xi Victoria Lin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

2020

We introduce TECHQA, a domain-adaptation question answering dataset for the technical support domain. The TECHQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size – 600 training, 310 dev, and 490 evaluation question/answer pairs – thus reflecting the cost of creating large labeled datasets with actual data. Hence, TECHQA is meant to stimulate research in domain adaptation rather than as a resource to build QA systems from scratch. TECHQA was obtained by crawling the IBMDeveloper and DeveloperWorks forums for questions with accepted answers provided in an IBM Technote—a technical document that addresses a specific technical issue. We also release a collection of the 801,998 Technotes available on the web as of April 4, 2019 as a companion resource that can be used to learn representations of the IT domain language.

Span Selection Pre-training for Question Answering
Michael Glass | Alfio Gliozzo | Rishav Chakravarti | Anthony Ferritto | Lin Pan | G P Shrivatsa Bhargav | Dinesh Garg | Avi Sil
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

BERT (Bidirectional Encoder Representations from Transformers) and related pre-trained Transformers have provided large gains across many language understanding tasks, achieving a new state-of-the-art (SOTA). BERT is pretrained on two auxiliary tasks: Masked Language Model and Next Sentence Prediction. In this paper we introduce a new pre-training task inspired by reading comprehension to better align the pre-training from memorization to understanding. Span Selection PreTraining (SSPT) poses cloze-like training instances, but rather than draw the answer from the model’s parameters, it is selected from a relevant passage. We find significant and consistent improvements over both BERT-BASE and BERT-LARGE on multiple Machine Reading Comprehension (MRC) datasets. Specifically, our proposed model has strong empirical evidence as it obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT-LARGE by 3 F1 points on short answer prediction. We also show significant impact in HotpotQA, improving answer prediction F1 by 4 points and supporting fact prediction F1 by 1 point and outperforming the previous best system. Moreover, we show that our pre-training approach is particularly effective when training data is limited, improving the learning curve by a large amount.

A Multilingual Reading Comprehension System for more than 100 Languages
Anthony Ferritto | Sara Rosenthal | Mihaela Bornea | Kazi Hasan | Rishav Chakravarti | Salim Roukos | Radu Florian | Avi Sil
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

This paper presents M-GAAMA, a Multilingual Question Answering architecture and demo system. This is the first multilingual machine reading comprehension (MRC) demo which is able to answer questions in over 100 languages. M-GAAMA answers questions from a given passage in the same or different language. It incorporates several existing multilingual models that can be used interchangeably in the demo such as M-BERT and XLM-R. The M-GAAMA demo also improves language accessibility by incorporating the IBM Watson machine translation widget to provide additional capabilities to the user to see an answer in their desired language. We also show how M-GAAMA can be used in downstream tasks by incorporating it into an END-TO-END-QA system using CFO (Chakravarti et al., 2019). We experiment with our system architecture on the Multi-Lingual Question Answering (MLQA) and the COVID-19 CORD (Wang et al., 2020; Tang et al., 2020) datasets to provide insights into the performance of the system.

Towards building a Robust Industry-scale Question Answering System
Rishav Chakravarti | Anthony Ferritto | Bhavani Iyer | Lin Pan | Radu Florian | Salim Roukos | Avi Sil
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

Industry-scale NLP systems necessitate two features. 1. Robustness: “zero-shot transfer learning” (ZSTL) performance has to be commendable and 2. Efficiency: systems have to train efficiently and respond instantaneously. In this paper, we introduce the development of a production model called GAAMA (Go Ahead Ask Me Anything) which possess the above two characteristics. For robustness, it trains on the recently introduced Natural Questions (NQ) dataset. NQ poses additional challenges over older datasets like SQuAD: (a) QA systems need to read and comprehend an entire Wikipedia article rather than a small passage, and (b) NQ does not suffer from observation bias during construction, resulting in less lexical overlap between the question and the article. GAAMA consists of Attention-over-Attention, diversity among attention heads, hierarchical transfer learning, and synthetic data augmentation while being computationally inexpensive. Building on top of the powerful BERTQA model, GAAMA provides a ∼2.0% absolute boost in F1 over the industry-scale state-of-the-art (SOTA) system on NQ. Further, we show that GAAMA transfers zero-shot to unseen real life and important domains as it yields respectable performance on two benchmarks: the BioASQ and the newly introduced CovidQA datasets.

Multi-Stage Pre-training for Low-Resource Domain Adaptation
Rong Zhang | Revanth Gangi Reddy | Md Arafat Sultan | Vittorio Castelli | Anthony Ferritto | Radu Florian | Efsun Sarioglu Kayi | Salim Roukos | Avi Sil | Todd Ward
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Transfer learning techniques are particularly useful for NLP tasks where a sizable amount of high-quality annotated data is difficult to obtain. Current approaches directly adapt a pretrained language model (LM) on in-domain text before fine-tuning to downstream tasks. We show that extending the vocabulary of the LM with domain-specific terms leads to further gains. To a bigger effect, we utilize structure in the unlabeled data to create auxiliary synthetic tasks, which helps the LM transfer to downstream tasks. We apply these approaches incrementally on a pretrained Roberta-large LM and show considerable performance gain on three tasks in the IT domain: Extractive Reading Comprehension, Document Ranking and Duplicate Question Detection.

ARES: A Reading Comprehension Ensembling Service
Anthony Ferritto | Lin Pan | Rishav Chakravarti | Salim Roukos | Radu Florian | J. William Murdock | Avi Sil
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce ARES (A Reading Comprehension Ensembling Service): a novel Machine Reading Comprehension (MRC) demonstration system which utilizes an ensemble of models to increase F1 by 2.3 points. While many of the top leaderboard submissions in popular MRC benchmarks such as the Stanford Question Answering Dataset (SQuAD) and Natural Questions (NQ) use model ensembles, the accompanying papers do not publish their ensembling strategies. In this work, we detail and evaluate various ensembling strategies using the NQ dataset. ARES leverages the CFO (Chakravarti et al., 2019) and ReactJS distributed frameworks to provide a scalable interactive Question Answering experience that capitalizes on the agreement (or lack thereof) between models to improve the answer visualization experience.

Answer Span Correction in Machine Reading Comprehension
Revanth Gangi Reddy | Md Arafat Sultan | Efsun Sarioglu Kayi | Rong Zhang | Vittorio Castelli | Avi Sil
Findings of the Association for Computational Linguistics: EMNLP 2020

Answer validation in machine reading comprehension (MRC) consists of verifying an extracted answer against an input context and question pair. Previous work has looked at re-assessing the “answerability” of the question given the extracted answer. Here we address a different problem: the tendency of existing MRC systems to produce partially correct answers when presented with answerable questions. We explore the nature of such errors and propose a post-processing correction method that yields statistically significant performance improvements over state-of-the-art MRC systems in both monolingual and multilingual evaluation.

Cross-lingual Structure Transfer for Zero-resource Event Extraction
Di Lu | Ananya Subburathinam | Heng Ji | Jonathan May | Shih-Fu Chang | Avi Sil | Clare Voss
Proceedings of the Twelfth Language Resources and Evaluation Conference

Most of the current cross-lingual transfer learning methods for Information Extraction (IE) have been only applied to name tagging. To tackle more complex tasks such as event extraction we need to transfer graph structures (event trigger linked to multiple arguments with various roles) across languages. We develop a novel share-and-transfer framework to reach this goal with three steps: (1) Convert each sentence in any language to language-universal graph structures; in this paper we explore two approaches based on universal dependency parses and complete graphs, respectively. (2) Represent each node in the graph structure with a cross-lingual word embedding so that all sentences in multiple languages can be represented with one shared semantic space. (3) Using this common semantic space, train event extractors from English training data and apply them to languages that do not have any event annotations. Experimental results on three languages (Spanish, Russian and Ukrainian) without any annotations show this framework achieves comparable performance to a state-of-the-art supervised model trained from more than 1,500 manually annotated event mentions.

2019

Cross-lingual Structure Transfer for Relation and Event Extraction
Ananya Subburathinam | Di Lu | Heng Ji | Jonathan May | Shih-Fu Chang | Avirup Sil | Clare Voss
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The identification of complex semantic structures such as events and entity relations, already a challenging Information Extraction task, is doubly difficult from sources written in under-resourced and under-annotated languages. We investigate the suitability of cross-lingual structure transfer techniques for these tasks. We exploit relation- and event-relevant language-universal features, leveraging both symbolic (including part-of-speech and dependency path) and distributional (including type representation and contextualized representation) information. By representing all entity mentions, event triggers, and contexts into this complex and structured multilingual common space, using graph convolutional networks, we can train a relation or event extractor from source language annotations and apply it to the target language. Extensive experiments on cross-lingual relation and event transfer among English, Chinese, and Arabic demonstrate that our approach achieves performance comparable to state-of-the-art supervised models trained on up to 3,000 manually annotated mentions: up to 62.6% F-score for Relation Extraction, and 63.1% F-score for Event Argument Role Labeling. The event argument role labeling model transferred from English to Chinese achieves similar performance as the model trained from Chinese. We thus find that language-universal symbolic and distributional representations are complementary for cross-lingual structure transfer.

CFO: A Framework for Building Production NLP Systems
Rishav Chakravarti | Cezar Pendus | Andrzej Sakrajda | Anthony Ferritto | Lin Pan | Michael Glass | Vittorio Castelli | J. William Murdock | Radu Florian | Salim Roukos | Avi Sil
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

This paper introduces a novel orchestration framework, called CFO (Computation Flow Orchestrator), for building, experimenting with, and deploying interactive NLP (Natural Language Processing) and IR (Information Retrieval) systems to production environments. We then demonstrate a question answering system built using this framework which incorporates state-of-the-art BERT based MRC (Machine Reading Com- prehension) with IR components to enable end-to-end answer retrieval. Results from the demo system are shown to be high quality in both academic and industry domain specific settings. Finally, we discuss best practices when (pre-)training BERT based MRC models for production systems. Screencast links: - Short video (< 3 min): http: //ibm.biz/gaama_demo - Supplementary long video (< 13 min): http://ibm.biz/gaama_cfo_demo

2018

Neural Cross-Lingual Coreference Resolution And Its Application To Entity Linking
Gourab Kundu | Avi Sil | Radu Florian | Wael Hamza
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We propose an entity-centric neural crosslingual coreference model that builds on multi-lingual embeddings and language independent features. We perform both intrinsic and extrinsic evaluations of our model. In the intrinsic evaluation, we show that our model, when trained on English and tested on Chinese and Spanish, achieves competitive results to the models trained directly on Chinese and Spanish respectively. In the extrinsic evaluation, we show that our English model helps achieve superior entity linking accuracy on Chinese and Spanish test sets than the top 2015 TAC system without using any annotated data from Chinese or Spanish.

Multi-lingual Entity Discovery and Linking
Avi Sil | Heng Ji | Dan Roth | Silviu-Petru Cucerzan
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

The primary goals of this tutorial are to review the framework of cross-lingual EL and motivate it as a broad paradigm for the Information Extraction task. We will start by discussing the traditional EL techniques and metrics and address questions relevant to the adequacy of these to across domains and languages. We will then present more recent approaches such as Neural EL, discuss the basic building blocks of a state-of-the-art neural EL system and analyze some of the current results on English EL. We will then proceed to Cross-lingual EL and discuss methods that work across languages. In particular, we will discuss and compare multiple methods that make use of multi-lingual word embeddings. We will also present EL methods that work for both name tagging and linking in very low resource languages. Finally, we will discuss the uses of cross-lingual EL in a variety of applications like search engines and commercial product selling applications. Also, contrary to the 2014 EL tutorial, we will also focus on Entity Discovery which is an essential component of EL.

Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP
Georgiana Dinu | Miguel Ballesteros | Avirup Sil | Sam Bowman | Wael Hamza | Anders Sogaard | Tahira Naseem | Yoav Goldberg
Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP

2017

Improving Slot Filling Performance with Attentive Neural Networks on Dependency Structures
Lifu Huang | Avirup Sil | Heng Ji | Radu Florian
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Slot Filling (SF) aims to extract the values of certain types of attributes (or slots, such as person:cities_of_residence) for a given entity from a large collection of source documents. In this paper we propose an effective DNN architecture for SF with the following new strategies: (1). Take a regularized dependency graph instead of a raw sentence as input to DNN, to compress the wide contexts between query and candidate filler; (2). Incorporate two attention mechanisms: local attention learned from query and candidate filler, and global attention learned from external knowledge bases, to guide the model to better select indicative contexts to determine slot type. Experiments show that this framework outperforms state-of-the-art on both relation extraction (16% absolute F-score gain) and slot filling validation for each individual system (up to 8.5% absolute F-score gain).

2016

Liberal Event Extraction and Event Schema Induction
Lifu Huang | Taylor Cassidy | Xiaocheng Feng | Heng Ji | Clare R. Voss | Jiawei Han | Avirup Sil
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

One for All: Towards Language Independent Named Entity Linking
Avirup Sil | Radu Florian
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2014

Towards Temporal Scoping of Relational Facts based on Wikipedia Data
Avirup Sil | Silviu-Petru Cucerzan
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

2012

Linking Named Entities to Any Database
Avirup Sil | Ernest Cronin | Penghai Nie | Yinfei Yang | Ana-Maria Popescu | Alexander Yates
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Automatic Grading of Scientific Inquiry
Avirup Sil | Angela Shelton | Diane Jass Ketelhut | Alexander Yates
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

2011

Extracting STRIPS Representations of Actions and Events
Avirup Sil | Alexander Yates
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

Machine Reading Between the Lines: A Simple Evaluation Framework for Extracted Knowledge Bases
Avirup Sil | Alexander Yates
Proceedings of the RANLP 2011 Workshop on Information Extraction and Knowledge Acquisition

Co-authors

Rishav Chakravarti 7

Sara Rosenthal 6

Vittorio Castelli 5

Revanth Gangi Reddy 5

Alexander Yates 4

Mihaela Bornea 3

Shih-Fu Chang 3

Michael Glass 3

Silviu Cucerzan 2

Alfio Gliozzo 2

Vishwajeet Kumar 2

J. Scott McCarley 2

Kathleen McKeown 2

J. William Murdock 2

Christopher Potts 2

Jon Saad-Falcon 2

Andrzej Sakrajda 2

Keshav Santhanam 2

Efsun Sarioglu Kayi 2

Ananya Subburathinam 2

Emre Can Acikgoz 1

Pegah Alipoormolabashi 1

Aruna Balasubramanian 1

Niranjan Balasubramanian 1

Miguel Ballesteros 1

Samarth Bharadwaj 1

G P Shrivatsa Bhargav 1

Bishwaranjan Bhattacharjee 1

Biddut Sarker Bijoy 1

Santosh Borse 1

Samuel Bowman 1

Juergen Bross 1

Mustafa Canim 1

Taylor Cassidy 1

Saneem Chemmengath 1

Ernest Cronin 1

Georgiana Dinu 1

Yousef El-Kurdi 1

Kshitij Fadnis 1

Nicolas Rodolfo Fauceglia 1

Xiaocheng Feng 1

Yoav Goldberg 1

Hannaneh Hajishirzi 1

Dilek Hakkani-Tur 1

Mohammad Saqib Hasan 1

Diane Jass Ketelhut 1

Dinesh Khandelwal 1

Young-Suk Lee 1

Xi Victoria Lin 1

Michael McCawley 1

Tahira Naseem 1

Graham Neubig 1

John F. Pitrelli 1

Sandeep Polisetty 1

Ana-Maria Popescu 1

Saurabh Pujar 1

Sebastian Ruder 1

Angela Shelton 1

David Shimshoni 1

Aditya Singhal 1

Anders Søgaard 1

Christoph Tillmann 1

Hanghang Tong 1

Aashka Trivedi 1

Rosario Uceda-Sosa 1

Pengcheng Yin 1

Matei Zaharia 1

ChengXiang Zhai 1

Venues