Kevin Small


2024

pdf bib
EVEDIT: Event-based Knowledge Editing for Deterministic Knowledge Propagation
Jiateng Liu | Pengfei Yu | Yuji Zhang | Sha Li | Zixuan Zhang | Ruhi Sarikaya | Kevin Small | Heng Ji
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The dynamic nature of real-world information necessitates knowledge editing (KE) in large language models (LLMs). The edited knowledge should propagate and facilitate the deduction of new information based on existing model knowledge. We term the existing related knowledge in LLM serving as the origination of knowledge propagation as ”deduction anchors”. However, current KE approaches, which only operate on (subject, relation, object) triple. We both theoretically and empirically observe that this simplified setting often leads to uncertainty when determining the deduction anchors, causing low confidence in their answers. To mitigate this issue, we propose a novel task of event-based knowledge editing that pairs facts with event descriptions. This task manifests not only a closer simulation of real-world editing scenarios but also a more logically sound setting, implicitly defining the deduction anchor and enabling LLMs to propagate knowledge confidently. We curate a new benchmark dataset Evedit derived from the CounterFact dataset and validate its superiority in improving model confidence. Moreover, while we observe that the event-based setting is significantly challenging for existing approaches, we propose a novel approach Self-Edit that showcases stronger performance, achieving 55.6% consistency improvement while maintaining the naturalness of generation.

pdf bib
Unified Embeddings for Multimodal Retrieval via Frozen LLMs
Ziyang Wang | Heba Elfardy | Markus Dreyer | Kevin Small | Mohit Bansal
Findings of the Association for Computational Linguistics: EACL 2024

In this work, We present Unified Embeddings for Multimodal Retrieval (UniMuR), a simple but effective approach that embeds multimodal inputs and retrieves visual and textual outputs via frozen Large Language Models (LLMs). Specifically, UniMuR jointly retrieves multimodal outputs via a unified multimodal embedding and applies dual alignment training to account for both visual and textual semantics. Thus, unlike previous approaches, UniMuR significantly reduces LLM’s modality bias towards generating text-only outputs. Meanwhile, the proposed unified multimodal embedding mitigates the inconsistency between visual and textual outputs and provides coherent multimodal outputs. Furthermore, benefiting from the joint training of visual and textual semantics, UniMuR also achieves strong image/text retrieval ability. Compared to existing approaches, UniMuR achieves better zero-shot multimodal response retrieval performance on MMDialog, improving the overall R@1 by 6.5% while boosting the image retrieval rate and having better cross-modal consistency on multimodal outputs. UniMuR also achieves 2.4% and 3.9% improvement on context-based image retrieval tasks on MMDialog and VisDial respectively when compared to previous approaches, validating its generalization ability across multiple tasks.

pdf bib
Towards Better Generalization in Open-Domain Question Answering by Mitigating Context Memorization
Zixuan Zhang | Revanth Gangi Reddy | Kevin Small | Tong Zhang | Heng Ji
Findings of the Association for Computational Linguistics: NAACL 2024

Open-domain Question Answering (OpenQA) aims at answering factual questions with an external large-scale knowledge corpus. However, real-world knowledge is not static; it updates and evolves continually. Such a dynamic characteristic of knowledge poses a vital challenge for these models, as the trained models need to constantly adapt to the latest information to make sure that the answers remain accurate. In addition, it is still unclear how well an OpenQA model can transfer to completely new knowledge domains. In this paper, we investigate the generalization performance of a retrieval-augmented QA model in two specific scenarios: 1) adapting to updated versions of the same knowledge corpus; 2) switching to completely different knowledge domains. We observe that the generalization challenges of OpenQA models stem from the reader’s over-reliance on memorizing the knowledge from the external corpus, which hinders the model from generalizing to a new knowledge corpus. We introduce Corpus-Invariant Tuning (CIT), a simple but effective training strategy, to mitigate the knowledge over-memorization by controlling the likelihood of retrieved contexts during training. Extensive experimental results on multiple OpenQA benchmarks show that CIT achieves significantly better generalizability without compromising the model’s performance in its original corpus and domain.

pdf bib
Learning When to Retrieve, What to Rewrite, and How to Respond in Conversational QA
Nirmal Roy | Leonardo F. R. Ribeiro | Rexhina Blloshmi | Kevin Small
Findings of the Association for Computational Linguistics: EMNLP 2024

Augmenting Large Language Models (LLMs) with information retrieval capabilities (i.e., Retrieval-Augmented Generation (RAG)) has proven beneficial for knowledge-intensive tasks. However, understanding users’ contextual search intent when generating responses is an understudied topic for conversational question answering (QA). This conversational extension leads to additional concerns when compared to single-turn QA as it is more challenging for systems to comprehend conversational context and manage retrieved passages over multiple turns. In this work, we propose a method for enabling LLMs to decide when to retrieve in RAG settings given a conversational context. When retrieval is deemed necessary, the LLM then rewrites the conversation for passage retrieval and judges the relevance of returned passages before response generation. Operationally, we build on the single-turn SELF-RAG framework (Asai et al., 2023) and propose SELF-multi-RAG for conversational settings. SELF-multi-RAG demonstrates improved capabilities over single-turn variants with respect to retrieving relevant passages (by using summarized conversational context) and assessing the quality of generated responses. Experiments on three conversational QA datasets validate the enhanced response generation capabilities of SELF-multi-RAG with improvements of ~13% measured by human annotation.

2023

pdf bib
PLAtE: A Large-scale Dataset for List Page Web Extraction
Aidan San | Yuan Zhuang | Jan Bakus | Colin Lockard | David Ciemiewicz | Sandeep Atluri | Kevin Small | Yangfeng Ji | Heba Elfardy
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track)

Recently, neural models have been leveraged to significantly improve the performance of information extraction from semi-structured websites. However, a barrier for continued progress is the small number of datasets large enough to train these models. In this work, we introduce the PLAtE (Pages of Lists Attribute Extraction) benchmark dataset as a challenging new web extraction task. PLAtE focuses on shopping data, specifically extractions from product review pages with multiple items encompassing the tasks of: (1) finding product list segmentation boundaries and (2) extracting attributes for each product. PLAtE is composed of 52,898 items collected from 6,694 pages and 156,014 attributes, making it the first large-scale list page web extraction dataset. We use a multi-stage approach to collect and annotate the dataset and adapt three state-of-the-art web extraction models to the two tasks comparing their strengths and weaknesses both quantitatively and qualitatively.

pdf bib
Background Summarization of Event Timelines
Adithya Pratapa | Kevin Small | Markus Dreyer
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Generating concise summaries of news events is a challenging natural language processing task. While journalists often curate timelines to highlight key sub-events, newcomers to a news event face challenges in catching up on its historical context. In this paper, we address this need by introducing the task of background news summarization, which complements each timeline update with a background summary of relevant preceding events. We construct a dataset by merging existing timeline datasets and asking human annotators to write a background summary for each timestep of each news event. We establish strong baseline performance using state-of-the-art summarization systems and propose a query-focused variant to generate background summaries. To evaluate background summary quality, we present a question-answering-based evaluation metric, Background Utility Score (BUS), which measures the percentage of questions about a current event timestep that a background summary answers. Our experiments show the effectiveness of instruction fine-tuned systems such as Flan-T5, in addition to strong zero-shot performance using GPT-3.5.

pdf bib
Enhancing Multi-Document Summarization with Cross-Document Graph-based Information Extraction
Zixuan Zhang | Heba Elfardy | Markus Dreyer | Kevin Small | Heng Ji | Mohit Bansal
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Information extraction (IE) and summarization are closely related, both tasked with presenting a subset of the information contained in a natural language text. However, while IE extracts structural representations, summarization aims to abstract the most salient information into a generated text summary – thus potentially encountering the technical limitations of current text generation methods (e.g., hallucination). To mitigate this risk, this work uses structured IE graphs to enhance the abstractive summarization task. Specifically, we focus on improving Multi-Document Summarization (MDS) performance by using cross-document IE output, incorporating two novel components: (1) the use of auxiliary entity and event recognition systems to focus the summary generation model; (2) incorporating an alignment loss between IE nodes and their text spans to reduce inconsistencies between the IE graphs and text representations. Operationally, both the IE nodes and corresponding text spans are projected into the same embedding space and pairwise distance is minimized. Experimental results on multiple MDS benchmarks show that summaries generated by our model are more factually consistent with the source documents than baseline models while maintaining the same level of abstractiveness.

2022

pdf bib
Answer Consolidation: Formulation and Benchmarking
Wenxuan Zhou | Qiang Ning | Heba Elfardy | Kevin Small | Muhao Chen
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Current question answering (QA) systems primarily consider the single-answer scenario, where each question is assumed to be paired with one correct answer. However, in many real-world QA applications, multiple answer scenarios arise where consolidating answers into a comprehensive and non-redundant set of answers is a more efficient user interface. In this paper, we formulate the problem of answer consolidation, where answers are partitioned into multiple groups, each representing different aspects of the answer set. Then, given this partitioning, a comprehensive and non-redundant set of answers can be constructed by picking one answer from each group. To initiate research on answer consolidation, we construct a dataset consisting of 4,699 questions and 24,006 sentences and evaluate multiple models. Despite a promising performance achieved by the best-performing supervised models, we still believe this task has room for further improvements.

pdf bib
NewsClaims: A New Benchmark for Claim Detection from News with Attribute Knowledge
Revanth Gangi Reddy | Sai Chetan Chinthakindi | Zhenhailong Wang | Yi Fung | Kathryn Conger | Ahmed ELsayed | Martha Palmer | Preslav Nakov | Eduard Hovy | Kevin Small | Heng Ji
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Claim detection and verification are crucial for news understanding and have emerged as promising technologies for mitigating misinformation and disinformation in the news. However, most existing work has focused on claim sentence analysis while overlooking additional crucial attributes (e.g., the claimer and the main object associated with the claim).In this work, we present NewsClaims, a new benchmark for attribute-aware claim detection in the news domain. We extend the claim detection problem to include extraction of additional attributes related to each claim and release 889 claims annotated over 143 news articles. NewsClaims aims to benchmark claim detection systems in emerging scenarios, comprising unseen topics with little or no training data. To this end, we see that zero-shot and prompt-based baselines show promising performance on this benchmark, while still considerably behind human performance.

pdf bib
Building a Dataset for Automatically Learning to Detect Questions Requiring Clarification
Ivano Lauriola | Kevin Small | Alessandro Moschitti
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Question Answering (QA) systems aim to return correct and concise answers in response to user questions. QA research generally assumes all questions are intelligible and unambiguous, which is unrealistic in practice as questions frequently encountered by virtual assistants are ambiguous or noisy. In this work, we propose to make QA systems more robust via the following two-step process: (1) classify if the input question is intelligible and (2) for such questions with contextual ambiguity, return a clarification question. We describe a new open-domain clarification corpus containing user questions sampled from Quora, which is useful for building machine learning approaches to solving these tasks.

pdf bib
A Zero-Shot Claim Detection Framework Using Question Answering
Revanth Gangi Reddy | Sai Chetan Chinthakindi | Yi R. Fung | Kevin Small | Heng Ji
Proceedings of the 29th International Conference on Computational Linguistics

In recent years, there has been an increasing interest in claim detection as an important building block for misinformation detection. This involves detecting more fine-grained attributes relating to the claim, such as the claimer, claim topic, claim object pertaining to the topic, etc. Yet, a notable bottleneck of existing claim detection approaches is their portability to emerging events and low-resource training data settings. In this regard, we propose a fine-grained claim detection framework that leverages zero-shot Question Answering (QA) using directed questions to solve a diverse set of sub-tasks such as topic filtering, claim object detection, and claimer detection. We show that our approach significantly outperforms various zero-shot, few-shot and task-specific baselines on the NewsClaims benchmark (Reddy et al., 2021).

2021

pdf bib
Summary-Oriented Question Generation for Informational Queries
Xusen Yin | Li Zhou | Kevin Small | Jonathan May
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)

Users frequently ask simple factoid questions for question answering (QA) systems, attenuating the impact of myriad recent works that support more complex questions. Prompting users with automatically generated suggested questions (SQs) can improve user understanding of QA system capabilities and thus facilitate more effective use. We aim to produce self-explanatory questions that focus on main document topics and are answerable with variable length passages as appropriate. We satisfy these requirements by using a BERT-based Pointer-Generator Network trained on the Natural Questions (NQ) dataset. Our model shows SOTA performance of SQ generation on the NQ dataset (20.1 BLEU-4). We further apply our model on out-of-domain news articles, evaluating with a QA system due to the lack of gold questions and demonstrate that our model produces better SQs for news articles – with further confirmation via a human evaluation.

pdf bib
Generating Self-Contained and Summary-Centric Question Answer Pairs via Differentiable Reward Imitation Learning
Li Zhou | Kevin Small | Yong Zhang | Sandeep Atluri
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Motivated by suggested question generation in conversational news recommendation systems, we propose a model for generating question-answer pairs (QA pairs) with self-contained, summary-centric questions and length-constrained, article-summarizing answers. We begin by collecting a new dataset of news articles with questions as titles and pairing them with summaries of varying length. This dataset is used to learn a QA pair generation model producing summaries as answers that balance brevity with sufficiency jointly with their corresponding questions. We then reinforce the QA pair generation process with a differentiable reward function to mitigate exposure bias, a common problem in natural language generation. Both automatic metrics and human evaluation demonstrate these QA pairs successfully capture the central gists of the articles and achieve high answer accuracy.

2020

pdf bib
Fluent Response Generation for Conversational Question Answering
Ashutosh Baheti | Alan Ritter | Kevin Small
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Question answering (QA) is an important aspect of open-domain conversational agents, garnering specific research focus in the conversational QA (ConvQA) subtask. One notable limitation of recent ConvQA efforts is the response being answer span extraction from the target corpus, thus ignoring the natural language generation (NLG) aspect of high-quality conversational agents. In this work, we propose a method for situating QA responses within a SEQ2SEQ NLG approach to generate fluent grammatical answer responses while maintaining correctness. From a technical perspective, we use data augmentation to generate training data for an end-to-end system. Specifically, we develop Syntactic Transformations (STs) to produce question-specific candidate answer responses and rank them using a BERT-based classifier (Devlin et al., 2019). Human evaluation on SQuAD 2.0 data (Rajpurkar et al., 2018) demonstrate that the proposed model outperforms baseline CoQA and QuAC models in generating conversational responses. We further show our model’s scalability by conducting tests on the CoQA dataset. The code and data are available at https://github.com/abaheti95/QADialogSystem.

2010

pdf bib
Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing
Burr Settles | Kevin Small | Katrin Tomanek
Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing

pdf bib
Object Search: Supporting Structured Queries in Web Search Engines
Kim Pham | Nicholas Rizzolo | Kevin Small | Kevin Chen-Chuan Chang | Dan Roth
Proceedings of the NAACL HLT 2010 Workshop on Semantic Search

2009

pdf bib
Interactive Feature Space Construction using Semantic Information
Dan Roth | Kevin Small
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

2007

pdf bib
All links are not the same: evaluating word alignments for statistical machine translation
Paul C. Davis | Zhuli Xie | Kevin Small
Proceedings of Machine Translation Summit XI: Papers