Md Arafat Sultan - ACL Anthology

Md Arafat Sultan

Also published as: Md. Sultan, Md. Arafat Sultan, Md Sultan

2024

FIRST: Faster Improved Listwise Reranking with Single Token Decoding
Revanth Gangi Reddy | JaeHyeok Doo | Yifei Xu | Md Arafat Sultan | Deevya Swain | Avirup Sil | Heng Ji
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) have significantly advanced the field of information retrieval, particularly for reranking. Listwise LLM rerankers have showcased superior performance and generalizability compared to existing supervised approaches. However, conventional listwise LLM reranking methods lack efficiency as they provide ranking output in the form of a generated ordered sequence of candidate passage identifiers. Further, they are trained with the typical language modeling objective, which treats all ranking errors uniformly–potentially at the cost of misranking highly relevant passages. Addressing these limitations, we introduce FIRST, a novel listwise LLM reranking approach leveraging the output logits of the first generated identifier to directly obtain a ranked ordering of the candidates. Further, we incorporate a learning-to-rank loss during training, prioritizing ranking accuracy for the more relevant passages. Empirical results demonstrate that FIRST accelerates inference by 50% while maintaining a robust ranking performance with gains across the BEIR benchmark. Finally, to illustrate the practical effectiveness of listwise LLM rerankers, we investigate their application in providing relevance feedback for retrievers during inference. Our results show that LLM rerankers can provide a stronger distillation signal compared to cross-encoders, yielding substantial improvements in retriever recall after relevance feedback.

Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation
Jiachen Zhao | Wenlong Zhao | Andrew Drozdov | Benjamin Rozonoyer | Md Arafat Sultan | Jay-Yoon Lee | Mohit Iyyer | Andrew McCallum
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We study semi-supervised sequence generation tasks, where the few labeled examples are too scarce to finetune a model, and meanwhile, few-shot prompted large language models (LLMs) exhibit room for improvement. In this paper, we present the discovery that a student model distilled from a few-shot prompted LLM can commonly generalize better than its teacher to unseen examples on such tasks. We find that the student is able to learn a general pattern from the high-quality pseudolabels produced by the teacher during knowledge distillation (KD), and favorably not a general pattern from the low-quality pseudolabels. Leveraging this discovery, we propose a new method, Multistage Collaborative Knowledge Distillation from an LLM (MCKD), for these tasks. MCKD first few-shot prompts an LLM to produce pseudolabels for unlabeled data. Then at each stage of an iterative KD process, a new pair of students is trained on disjoint partitions of the pseudolabeled data, and produces new and improved pseudolabels for their unseen partitions. We conduct extensive experiments on four syntactic and semantic parsing datasets and show the effectiveness of MCKD for low-resource semi-supervised sequence generation. On CRAFT biomedical parsing, for example, 3-stage MCKD with 50 labeled examples outperforms an LLM teacher and vanilla KD by 7.5% and 3.7% parsing F1, respectively, and matches the performance of supervised finetuning with 500 labeled examples.

Structured Chain-of-Thought Prompting for Few-Shot Generation of Content-Grounded QA Conversations
Md Arafat Sultan | Jatin Ganhotra | Ramón Fernandez Astudillo
Findings of the Association for Computational Linguistics: EMNLP 2024

We introduce a structured chain-of-thought (SCoT) prompting approach to generating content-grounded multi-turn question-answer conversations with a pre-trained large language model (LLM). At the core of our proposal is a structured breakdown of the complex task into a number of states in a state machine, so that actions corresponding to various subtasks, e.g., content reading and utterance generation, can be executed in their own dedicated states. Each state leverages a unique set of resources, including prompts and (optionally) additional tools, to augment the generation process. Automatic evaluation shows that SCoT prompting with designated states for hallucination mitigation can increase agent faithfulness to grounding documents by up to 16.8%. When used as training data, our open-domain conversations synthesized from only 6 Wikipedia-based seed demonstrations train strong conversational QA agents. In out-of-domain evaluation, for example, we observe improvements of up to 13.9% in F1-score against ground truth over target domain gold data when the latter is augmented with our generated examples.

2023

Knowledge Distillation ≈ Label Smoothing: Fact or Fallacy?
Md Sultan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Originally proposed as a method for knowledge transfer from one model to another, some recent studies have suggested that knowledge distillation (KD) is in fact a form of regularization. Perhaps the strongest argument of all for this new perspective comes from its apparent similarities with label smoothing (LS). Here we re-examine this stated equivalence between the two methods by comparing the predictive confidences of the models they train. Experiments on four text classification tasks involving models of different sizes show that: (a) In most settings, KD and LS drive model confidence in completely opposite directions, and (b) In KD, the student inherits not only its knowledge but also its confidence from the teacher, reinforcing the classical knowledge transfer view.

The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PrimeQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PrimeQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PrimeQA is available at: https://github.com/primeqa.

UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers
Jon Saad-Falcon | Omar Khattab | Keshav Santhanam | Radu Florian | Martin Franz | Salim Roukos | Avirup Sil | Md Sultan | Christopher Potts
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Many information retrieval tasks require large labeled datasets for fine-tuning. However, such datasets are often unavailable, and their utility for real-world applications can diminish quickly due to domain shifts. To address this challenge, we develop and motivate a method for using large language models (LLMs) to generate large numbers of synthetic queries cheaply. The method begins by generating a small number of synthetic queries using an expensive LLM. After that, a much less expensive one is used to create large numbers of synthetic queries, which are used to fine-tune a family of reranker models. These rerankers are then distilled into a single efficient retriever for use in the target domain. We show that this technique boosts zero-shot accuracy in long-tail domains and achieves substantially lower latency than standard reranking methods.

Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
Keshav Santhanam | Jon Saad-Falcon | Martin Franz | Omar Khattab | Avi Sil | Radu Florian | Md Arafat Sultan | Salim Roukos | Matei Zaharia | Christopher Potts
Findings of the Association for Computational Linguistics: ACL 2023

Neural information retrieval (IR) systems have progressed rapidly in recent years, in large part due to the release of publicly available benchmarking tasks. Unfortunately, some dimensions of this progress are illusory: the majority of the popular IR benchmarks today focus exclusively on downstream task accuracy and thus conceal the costs incurred by systems that trade away efficiency for quality. Latency, hardware cost, and other efficiency considerations are paramount to the deployment of IR systems in user-facing settings. We propose that IR benchmarks structure their evaluation methodology to include not only metrics of accuracy, but also efficiency considerations such as a query latency and the corresponding cost budget for a reproducible hardware setting. For the popular IR benchmarks MS MARCO and XOR-TyDi, we show how the best choice of IR system varies according to how these efficiency considerations are chosen and weighed. We hope that future benchmarks will adopt these guidelines toward more holistic IR evaluation.

Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs
Young-Suk Lee | Md Sultan | Yousef El-Kurdi | Tahira Naseem | Asim Munawar | Radu Florian | Salim Roukos | Ramón Astudillo
Findings of the Association for Computational Linguistics: EMNLP 2023

Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they resort to very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B–40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) Our proposed method yields higher-quality instruction tuning data than Self-Instruct, (2) It improves performances of both vanilla and instruction-tuned LMs by significant margins, and (3) Smaller instruction-tuned LMs generate more useful examples than their larger un-tuned counterparts.

2022

Not to Overfit or Underfit the Source Domains? An Empirical Study of Domain Generalization in Question Answering
Md Arafat Sultan | Avi Sil | Radu Florian
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Machine learning models are prone to overfitting their training (source) domains, which is commonly believed to be the reason why they falter in novel target domains. Here we examine the contrasting view that multi-source domain generalization (DG) is first and foremost a problem of mitigating source domain underfitting: models not adequately learning the signal already present in their multi-domain training data. Experiments on a reading comprehension DG benchmark show that as a model learns its source domains better—using familiar methods such as knowledge distillation (KD) from a bigger model—its zero-shot out-of-domain utility improves at an even faster pace. Improved source domain learning also demonstrates superior out-of-domain generalization over three popular existing DG approaches that aim to limit overfitting. Our implementation of KD-based domain generalization is available via PrimeQA at: https://ibm.biz/domain-generalization-with-kd.

Learning Cross-Lingual IR from an English Retriever
Yulong Li | Martin Franz | Md Arafat Sultan | Bhavani Iyer | Young-Suk Lee | Avirup Sil
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation), a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR relies on a highly effective but computationally expensive two-stage inference process consisting of query translation and monolingual IR, while the student, DR.DECR, executes a single CLIR step. We teach DR.DECR powerful multilingual representations as well as CLIR by optimizing two corresponding KD objectives. Learning useful representations of non-English text from an English-only retriever is accomplished through a cross-lingual token alignment algorithm that relies on the representation capabilities of the underlying multilingual encoders. In both in-domain and zero-shot out-of-domain evaluation, DR.DECR demonstrates far superior accuracy over direct fine-tuning with labeled CLIR data. It is also the best single-model retriever on the XOR-TyDi benchmark at the time of this writing.

Towards Robust Neural Retrieval with Source Domain Synthetic Pre-Finetuning
Revanth Gangi Reddy | Vikas Yadav | Md Arafat Sultan | Martin Franz | Vittorio Castelli | Heng Ji | Avirup Sil
Proceedings of the 29th International Conference on Computational Linguistics

Research on neural IR has so far been focused primarily on standard supervised learning settings, where it outperforms traditional term matching baselines. Many practical use cases of such models, however, may involve previously unseen target domains. In this paper, we propose to improve the out-of-domain generalization of Dense Passage Retrieval (DPR) - a popular choice for neural IR - through synthetic data augmentation only in the source domain. We empirically show that pre-finetuning DPR with additional synthetic data in its source domain (Wikipedia), which we generate using a fine-tuned sequence-to-sequence generator, can be a low-cost yet effective first step towards its generalization. Across five different test sets, our augmented model shows more robust performance than DPR in both in-domain and zero-shot out-of-domain evaluation.

2020

On the Importance of Diversity in Question Generation for QA
Md Arafat Sultan | Shubham Chandel | Ramón Fernandez Astudillo | Vittorio Castelli
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Automatic question generation (QG) has shown promise as a source of synthetic training data for question answering (QA). In this paper we ask: Is textual diversity in QG beneficial for downstream QA? Using top-p nucleus sampling to derive samples from a transformer-based question generator, we show that diversity-promoting QG indeed provides better QA training than likelihood maximization approaches such as beam search. We also show that standard QG evaluation metrics such as BLEU, ROUGE and METEOR are inversely correlated with diversity, and propose a diversity-aware intrinsic measure of overall QG quality that correlates well with extrinsic evaluation on QA.

Multi-Stage Pre-training for Low-Resource Domain Adaptation
Rong Zhang | Revanth Gangi Reddy | Md Arafat Sultan | Vittorio Castelli | Anthony Ferritto | Radu Florian | Efsun Sarioglu Kayi | Salim Roukos | Avi Sil | Todd Ward
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Transfer learning techniques are particularly useful for NLP tasks where a sizable amount of high-quality annotated data is difficult to obtain. Current approaches directly adapt a pretrained language model (LM) on in-domain text before fine-tuning to downstream tasks. We show that extending the vocabulary of the LM with domain-specific terms leads to further gains. To a bigger effect, we utilize structure in the unlabeled data to create auxiliary synthetic tasks, which helps the LM transfer to downstream tasks. We apply these approaches incrementally on a pretrained Roberta-large LM and show considerable performance gain on three tasks in the IT domain: Extractive Reading Comprehension, Document Ranking and Duplicate Question Detection.

GPT-too: A Language-Model-First Approach for AMR-to-Text Generation
Manuel Mager | Ramón Fernandez Astudillo | Tahira Naseem | Md Arafat Sultan | Young-Suk Lee | Radu Florian | Salim Roukos
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Abstract Meaning Representations (AMRs) are broad-coverage sentence-level semantic graphs. Existing approaches to generating text from AMR have focused on training sequence-to-sequence or graph-to-sequence models on AMR annotated data only. In this paper, we propose an alternative approach that combines a strong pre-trained language model with cycle consistency-based re-scoring. Despite the simplicity of the approach, our experimental results show these models outperform all previous techniques on the English LDC2017T10 dataset, including the recent use of transformer architectures. In addition to the standard evaluation metrics, we provide human evaluation experiments that further substantiate the strength of our approach.

Answer Span Correction in Machine Reading Comprehension
Revanth Gangi Reddy | Md Arafat Sultan | Efsun Sarioglu Kayi | Rong Zhang | Vittorio Castelli | Avi Sil
Findings of the Association for Computational Linguistics: EMNLP 2020

Answer validation in machine reading comprehension (MRC) consists of verifying an extracted answer against an input context and question pair. Previous work has looked at re-assessing the “answerability” of the question given the extracted answer. Here we address a different problem: the tendency of existing MRC systems to produce partially correct answers when presented with answerable questions. We explore the nature of such errors and propose a post-processing correction method that yields statistically significant performance improvements over state-of-the-art MRC systems in both monolingual and multilingual evaluation.

2019

Cross-Task Knowledge Transfer for Query-Based Text Summarization
Elozino Egonmwan | Vittorio Castelli | Md Arafat Sultan
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

We demonstrate the viability of knowledge transfer between two related tasks: machine reading comprehension (MRC) and query-based text summarization. Using an MRC model trained on the SQuAD1.1 dataset as a core system component, we first build an extractive query-based summarizer. For better precision, this summarizer also compresses the output of the MRC model using a novel sentence compression technique. We further leverage pre-trained machine translation systems to abstract our extracted summaries. Our models achieve state-of-the-art results on the publicly available CNN/Daily Mail and Debatepedia datasets, and can serve as simple yet powerful baselines for future systems. We also hope that these results will encourage research on transfer learning from large MRC corpora to query-based summarization.

2016

DLS@CU at SemEval-2016 Task 1: Supervised Models of Sentence Similarity
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Bayesian Supervised Domain Adaptation for Short Text Similarity
Md Arafat Sultan | Jordan Boyd-Graber | Tamara Sumner
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Fast and Easy Short Answer Grading with High Accuracy
Md Arafat Sultan | Cristobal Salazar | Tamara Sumner
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Joint Model for Answer Sentence Ranking and Answer Extraction
Md Arafat Sultan | Vittorio Castelli | Radu Florian
Transactions of the Association for Computational Linguistics, Volume 4

Answer sentence ranking and answer extraction are two key challenges in question answering that have traditionally been treated in isolation, i.e., as independent tasks. In this article, we (1) explain how both tasks are related at their core by a common quantity, and (2) propose a simple and intuitive joint probabilistic model that addresses both via joint computation but task-specific application of that quantity. In our experiments with two TREC datasets, our joint model substantially outperforms state-of-the-art systems in both tasks.

2015

Feature-Rich Two-Stage Logistic Regression for Monolingual Alignment
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

DLS@CU: Sentence Similarity from Word Alignment and Semantic Vector Composition
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Transactions of the Association for Computational Linguistics, Volume 2

We present a simple, easy-to-replicate monolingual aligner that demonstrates state-of-the-art performance while relying on almost no supervision and a very small number of external resources. Based on the hypothesis that words with similar meanings represent potential pairs for alignment if located in similar contexts, we propose a system that operates by finding such pairs. In two intrinsic evaluations on alignment test data, our system achieves F1 scores of 88–92%, demonstrating 1–3% absolute improvement over the previous best system. Moreover, in two extrinsic evaluations our aligner outperforms existing aligners, and even a naive application of the aligner approaches state-of-the-art performance in each extrinsic task.

DLS@CU: Sentence Similarity from Word Alignment
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2013

DLS@CU-CORE: A Simple Machine Learning Model of Semantic Textual Similarity
Md. Sultan | Steven Bethard | Tamara Sumner
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

Identifying science concepts and student misconceptions in an interactive essay writing tutor
Steven Bethard | Ifeyinwa Okoye | Md. Arafat Sultan | Haojie Hang | James H. Martin | Tamara Sumner
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

Co-authors

Venues