Jaewoo Kang


2024

pdf bib
CompAct: Compressing Retrieved Documents Actively for Question Answering
Chanwoong Yoon | Taewhoo Lee | Hyeon Hwang | Minbyul Jeong | Jaewoo Kang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Retrieval-augmented generation supports language models to strengthen their factual groundings by providing external contexts. However, language models often face challenges when given extensive information, diminishing their effectiveness in solving questions. Context compression tackles this issue by filtering out irrelevant information, but current methods still struggle in realistic scenarios where crucial information cannot be captured with a single-step approach. To overcome this limitation, we introduce CompAct, a novel framework that employs an active strategy to condense extensive documents without losing key information. Our experiments demonstrate that CompAct brings significant improvements in both performance and compression rate on multi-hop question-answering benchmarks. CompAct flexibly operates as a cost-efficient plug-in module with various off-the-shelf retrievers or readers, achieving exceptionally high compression rates (47x).

pdf bib
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
Hyunjae Kim | Seunghyun Yoon | Trung Bui | Handong Zhao | Quan Tran | Franck Dernoncourt | Jaewoo Kang
Findings of the Association for Computational Linguistics: EACL 2024

Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 7.6% and 9.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.

pdf bib
KU-DMIS at MEDIQA-CORR 2024: Exploring the Reasoning Capabilities of Small Language Models in Medical Error Correction
Hyeon Hwang | Taewhoo Lee | Hyunjae Kim | Jaewoo Kang
Proceedings of the 6th Clinical Natural Language Processing Workshop

Recent advancements in large language models (LM) like OpenAI’s GPT-4 have shown promise in healthcare, particularly in medical question answering and clinical applications. However, their deployment raises privacy concerns and their size limits use in resource-constrained environments.Smaller open-source LMs have emerged as alternatives, but their reliability in medicine remains underexplored.This study evaluates small LMs in the medical field using the MEDIQA-CORR 2024 task, which assesses the ability of models to identify and correct errors in clinical notes. Initially, zero-shot inference and simple fine-tuning of small models resulted in poor performance. When fine-tuning with chain-of-thought (CoT) reasoning using synthetic data generated by GPT-4, their performance significantly improved. Meerkat-7B, a small LM trained with medical CoT reasoning, demonstrated notable performance gains. Our model outperforms other small non-commercial LMs and some larger models, achieving a 73.36 aggregate score on MEDIQA-CORR 2024.

pdf bib
KU-DMIS at EHRSQL 2024 : Generating SQL query via question templatization in EHR
Hajung Kim | Chanhwi Kim | Hoonick Lee | Kyochul Jang | Jiwoo Lee | Kyungjae Lee | Gangwoo Kim | Jaewoo Kang
Proceedings of the 6th Clinical Natural Language Processing Workshop

Transforming natural language questions into SQL queries is crucial for precise data retrieval from electronic health record (EHR) databases. A significant challenge in this process is detecting and rejecting unanswerable questions that request information outside the database’s scope or exceed the system’s capabilities. In this paper, we introduce a novel text-to-SQL framework that focuses on standardizing the structure of questions into a templated format. Our framework begins by fine-tuning GPT-3.5-turbo, a powerful large language model (LLM), with detailed prompts involving the table schemas of the EHR database system. Our approach shows promising results on the EHRSQL-2024 benchmark dataset, part of the ClinicalNLP shared task. Although fine-tuning GPT achieves third place on the development set, it struggled with the diverse questions in the test set. With our framework, we improve our system’s adaptability and achieve fourth position in the official leaderboard of the EHRSQL-2024 challenge.

pdf bib
CookingSense: A Culinary Knowledgebase with Multidisciplinary Assertions
Donghee Choi | Mogan Gim | Donghyeon Park | Mujeen Sung | Hyunjae Kim | Jaewoo Kang | Jihun Choi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper introduces CookingSense, a descriptive collection of knowledge assertions in the culinary domain extracted from various sources, including web data, scientific papers, and recipes, from which knowledge covering a broad range of aspects is acquired. CookingSense is constructed through a series of dictionary-based filtering and language model-based semantic filtering techniques, which results in a rich knowledgebase of multidisciplinary food-related assertions. Additionally, we present FoodBench, a novel benchmark to evaluate culinary decision support systems. From evaluations with FoodBench, we empirically prove that CookingSense improves the performance of retrieval augmented language models. We also validate the quality and variety of assertions in CookingSense through qualitative analysis.

2023

pdf bib
Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations
Hyunjae Kim | Jaehyo Yoo | Seunghyun Yoon | Jaewoo Kang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most weakly supervised named entity recognition (NER) models rely on domain-specific dictionaries provided by experts. This approach is infeasible in many domains where dictionaries do not exist. While a phrase retrieval model was used to construct pseudo-dictionaries with entities retrieved from Wikipedia automatically in a recent study, these dictionaries often have limited coverage because the retriever is likely to retrieve popular entities rather than rare ones. In this study, we present a novel framework, HighGEN, that generates NER datasets with high-coverage pseudo-dictionaries. Specifically, we create entity-rich dictionaries with a novel search method, called phrase embedding search, which encourages the retriever to search a space densely populated with various entities. In addition, we use a new verification process based on the embedding distance between candidate entity mentions and entity types to reduce the false-positive noise in weak labels generated by high-coverage dictionaries. We demonstrate that HighGEN outperforms the previous best model by an average F1 score of 4.7 across five NER benchmark datasets.

pdf bib
KU-DMIS-MSRA at RadSum23: Pre-trained Vision-Language Model for Radiology Report Summarization
Gangwoo Kim | Hajung Kim | Lei Ji | Seongsu Bae | Chanhwi Kim | Mujeen Sung | Hyunjae Kim | Kun Yan | Eric Chang | Jaewoo Kang
The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

In this paper, we introduce CheXOFA, a new pre-trained vision-language model (VLM) for the chest X-ray domain. Our model is initially pre-trained on various multimodal datasets within the general domain before being transferred to the chest X-ray domain. Following a prominent VLM, we unify various domain-specific tasks into a simple sequence-to-sequence schema. It enables the model to effectively learn the required knowledge and skills from limited resources in the domain. Demonstrating superior performance on the benchmark datasets provided by the BioNLP shared task (Delbrouck et al., 2023), our model benefits from its training across multiple tasks and domains. With subtle techniques including ensemble and factual calibration, our system achieves first place on the RadSum23 leaderboard for the hidden test set.

pdf bib
Optimizing Test-Time Query Representations for Dense Retrieval
Mujeen Sung | Jungsoo Park | Jaewoo Kang | Danqi Chen | Jinhyuk Lee
Findings of the Association for Computational Linguistics: ACL 2023

Recent developments of dense retrieval rely on quality representations of queries and contexts from pre-trained query and context encoders. In this paper, we introduce TOUR (Test-Time Optimization of Query Representations), which further optimizes instance-level query representations guided by signals from test-time retrieval results. We leverage a cross-encoder re-ranker to provide fine-grained pseudo labels over retrieval results and iteratively optimize query representations with gradient descent. Our theoretical analysis reveals that TOUR can be viewed as a generalization of the classical Rocchio algorithm for pseudo relevance feedback, and we present two variants that leverage pseudo-labels as hard binary or soft continuous labels. We first apply TOUR on phrase retrieval with our proposed phrase re-ranker, and also evaluate its effectiveness on passage retrieval with an off-the-shelf re-ranker. TOUR greatly improves end-to-end open-domain question answering accuracy, as well as passage retrieval performance. TOUR also consistently improves direct re-ranking by up to 2.0% while running 1.3–2.4x faster with an efficient implementation.

pdf bib
Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models
Gangwoo Kim | Sungdong Kim | Byeongguk Jeon | Joonsuk Park | Jaewoo Kang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Questions in open-domain question answering are often ambiguous, allowing multiple interpretations. One approach to handling them is to identify all possible interpretations of the ambiguous question (AQ) and to generate a long-form answer addressing them all, as suggested by Stelmakh et al., (2022). While it provides a comprehensive response without bothering the user for clarification, considering multiple dimensions of ambiguity and gathering corresponding knowledge remains a challenge. To cope with the challenge, we propose a novel framework, Tree of Clarifications (ToC): It recursively constructs a tree of disambiguations for the AQ—via few-shot prompting leveraging external knowledge—and uses it to generate a long-form answer. ToC outperforms existing baselines on ASQA in a few-shot setup across the metrics, while surpassing fully-supervised baselines trained on the whole training set in terms of Disambig-F1 and Disambig-ROUGE. Code is available at https://github.com/gankim/tree-of-clarifications.

2022

pdf bib
Consistency Training with Virtual Adversarial Discrete Perturbation
Jungsoo Park | Gyuwan Kim | Jaewoo Kang
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Consistency training regularizes a model by enforcing predictions of original and perturbed inputs to be similar. Previous studies have proposed various augmentation methods for the perturbation but are limited in that they are agnostic to the training model. Thus, the perturbed samples may not aid in regularization due to their ease of classification from the model. In this context, we propose an augmentation method of adding a discrete noise that would incur the highest divergence between predictions. This virtual adversarial discrete noise obtained by replacing a small portion of tokens while keeping original semantics as much as possible efficiently pushes a training model’s decision boundary. Experimental results show that our proposed method outperforms other consistency training baselines with text editing, paraphrasing, or a continuous noise on semi-supervised text classification tasks and a robustness benchmark.

pdf bib
FaVIQ: FAct Verification from Information-seeking Questions
Jungsoo Park | Sewon Min | Jaewoo Kang | Luke Zettlemoyer | Hannaneh Hajishirzi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite significant interest in developing general purpose fact checking models, it is challenging to construct a large-scale fact verification dataset with realistic real-world claims. Existing claims are either authored by crowdworkers, thereby introducing subtle biases thatare difficult to control for, or manually verified by professional fact checkers, causing them to be expensive and limited in scale. In this paper, we construct a large-scale challenging fact verification dataset called FAVIQ, consisting of 188k claims derived from an existing corpus of ambiguous information-seeking questions. The ambiguities in the questions enable automatically constructing true and false claims that reflect user confusions (e.g., the year of the movie being filmed vs. being released). Claims in FAVIQ are verified to be natural, contain little lexical bias, and require a complete understanding of the evidence for verification. Our experiments show that the state-of-the-art models are far from solving our new task. Moreover, training on our data helps in professional fact-checking, outperforming models trained on the widely used dataset FEVER or in-domain data by up to 17% absolute. Altogether, our data will serve as a challenging benchmark for natural language understanding and support future progress in professional fact checking.

pdf bib
Generating Information-Seeking Conversations from Unlabeled Documents
Gangwoo Kim | Sungdong Kim | Kang Min Yoo | Jaewoo Kang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Synthesizing datasets for conversational question answering (CQA) from unlabeled documents remains challenging due to its interactive nature.Moreover, while modeling information needs is an essential key, only few studies have discussed it.In this paper, we introduce a novel framework, **SimSeek**, (**Sim**ulating information-**Seek**ing conversation from unlabeled documents), and compare its two variants.In our baseline, **SimSeek-sym**, a questioner generates follow-up questions upon the predetermined answer by an answerer.On the contrary, **SimSeek-asym** first generates the question and then finds its corresponding answer under the conversational context.Our experiments show that they can synthesize effective training resources for CQA and conversational search tasks.As a result, conversations from **SimSeek-asym** not only make more improvements in our experiments but also are favorably reviewed in a human evaluation.We finally release a large-scale resource of synthetic conversations, **Wiki-SimSeek**, containing 2 million CQA pairs built upon Wikipedia documents.With the dataset, our CQA model achieves the state-of-the-art performance on a recent CQA benchmark, QuAC.The code and dataset are available at https://github.com/naver-ai/simseek

pdf bib
Simple Questions Generate Named Entity Recognition Datasets
Hyunjae Kim | Jaehyo Yoo | Seunghyun Yoon | Jinhyuk Lee | Jaewoo Kang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Recent named entity recognition (NER) models often rely on human-annotated datasets requiring the vast engagement of professional knowledge on the target domain and entities. This work introduces an ask-to-generate approach, which automatically generates NER datasets by asking simple natural language questions to an open-domain question answering system (e.g., “Which disease?”). Despite using fewer training resources, our models solely trained on the generated datasets largely outperform strong low-resource models by 19.5 F1 score across six popular NER benchmarks. Our models also show competitive performance with rich-resource models that additionally leverage in-domain dictionaries provided by domain experts. In few-shot NER, we outperform the previous best model by 5.2 F1 score on three benchmarks and achieve new state-of-the-art performance.

pdf bib
Biomedical NER for the Enterprise with Distillated BERN2 and the Kazu Framework
Wonjin Yoon | Richard Jackson | Elliot Ford | Vladimir Poroshin | Jaewoo Kang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

In order to assist the drug discovery/development process, pharmaceutical companies often apply biomedical NER and linking techniques over internal and public corpora. Decades of study of the field of BioNLP has produced a plethora of algorithms, systems and datasets. However, our experience has been that no single open source system meets all the requirements of a modern pharmaceutical company. In this work, we describe these requirements according to our experience of the industry, and present Kazu, a highly extensible, scalable open source framework designed to support BioNLP for the pharmaceutical sector. Kazu is a built around a computationally efficient version of the BERN2 NER model (TinyBERN2), and subsequently wraps several other BioNLP technologies into one coherent system.

pdf bib
KU_ED at SocialDisNER: Extracting Disease Mentions in Tweets Written in Spanish
Antoine Lain | Wonjin Yoon | Hyunjae Kim | Jaewoo Kang | Ian Simpson
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper describes our system developed for the Social Media Mining for Health (SMM4H) 2022 SocialDisNER task. We used several types of pre-trained language models, which are trained on Spanish biomedical literature or Spanish Tweets. We showed the difference in performance depending on the quality of the tokenization as well as introducing silver standard annotations when training the model. Our model obtained a strict F1 of 80.3% on the test set, which is an improvement of +12.8% F1 (24.6 std) over the average results across all submissions to the SocialDisNER challenge.

2021

pdf bib
Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering
Gangwoo Kim | Hyunjae Kim | Jungsoo Park | Jaewoo Kang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

One of the main challenges in conversational question answering (CQA) is to resolve the conversational dependency, such as anaphora and ellipsis. However, existing approaches do not explicitly train QA models on how to resolve the dependency, and thus these models are limited in understanding human dialogues. In this paper, we propose a novel framework, ExCorD (Explicit guidance on how to resolve Conversational Dependency) to enhance the abilities of QA models in comprehending conversational context. ExCorD first generates self-contained questions that can be understood without the conversation history, then trains a QA model with the pairs of original and self-contained questions using a consistency-based regularizer. In our experiments, we demonstrate that ExCorD significantly improves the QA models’ performance by up to 1.2 F1 on QuAC, and 5.2 F1 on CANARD, while addressing the limitations of the existing approaches.

pdf bib
Learning Dense Representations of Phrases at Scale
Jinhyuk Lee | Mujeen Sung | Jaewoo Kang | Danqi Chen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Open-domain question answering can be reformulated as a phrase retrieval problem, without the need for processing documents on-demand during inference (Seo et al., 2019). However, current phrase retrieval models heavily depend on sparse representations and still underperform retriever-reader approaches. In this work, we show for the first time that we can learn dense representations of phrases alone that achieve much stronger performance in open-domain QA. We present an effective method to learn phrase representations from the supervision of reading comprehension tasks, coupled with novel negative sampling methods. We also propose a query-side fine-tuning strategy, which can support transfer learning and reduce the discrepancy between training and inference. On five popular open-domain QA datasets, our model DensePhrases improves over previous phrase retrieval models by 15%-25% absolute accuracy and matches the performance of state-of-the-art retriever-reader models. Our model is easy to parallelize due to pure dense representations and processes more than 10 questions per second on CPUs. Finally, we directly use our pre-indexed dense phrase representations for two slot filling tasks, showing the promise of utilizing DensePhrases as a dense knowledge base for downstream tasks.

pdf bib
“Killing Me” Is Not a Spoiler: Spoiler Detection Model using Graph Neural Networks with Dependency Relation-Aware Attention Mechanism
Buru Chang | Inggeol Lee | Hyunjae Kim | Jaewoo Kang
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Several machine learning-based spoiler detection models have been proposed recently to protect users from spoilers on review websites. Although dependency relations between context words are important for detecting spoilers, current attention-based spoiler detection models are insufficient for utilizing dependency relations. To address this problem, we propose a new spoiler detection model called SDGNN that is based on syntax-aware graph neural networks. In the experiments on two real-world benchmark datasets, we show that our SDGNN outperforms the existing spoiler detection models.

pdf bib
Can Language Models be Biomedical Knowledge Bases?
Mujeen Sung | Jinhyuk Lee | Sean Yi | Minji Jeon | Sungdong Kim | Jaewoo Kang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pre-trained language models (LMs) have become ubiquitous in solving various natural language processing (NLP) tasks. There has been increasing interest in what knowledge these LMs contain and how we can extract that knowledge, treating LMs as knowledge bases (KBs). While there has been much work on probing LMs in the general domain, there has been little attention to whether these powerful LMs can be used as domain-specific KBs. To this end, we create the BioLAMA benchmark, which is comprised of 49K biomedical factual knowledge triples for probing biomedical LMs. We find that biomedical LMs with recently proposed probing methods can achieve up to 18.51% Acc@5 on retrieving biomedical knowledge. Although this seems promising given the task difficulty, our detailed analyses reveal that most predictions are highly correlated with prompt templates without any subjects, hence producing similar results on each relation and hindering their capabilities to be used as domain-specific KBs. We hope that BioLAMA can serve as a challenging benchmark for biomedical factual probing.

2020

pdf bib
Contextualized Sparse Representations for Real-Time Open-Domain Question Answering
Jinhyuk Lee | Minjoon Seo | Hannaneh Hajishirzi | Jaewoo Kang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Open-domain question answering can be formulated as a phrase retrieval problem, in which we can expect huge scalability and speed benefit but often suffer from low accuracy due to the limitation of existing phrase representation models. In this paper, we aim to improve the quality of each phrase embedding by augmenting it with a contextualized sparse representation (Sparc). Unlike previous sparse vectors that are term-frequency-based (e.g., tf-idf) or directly learned (only few thousand dimensions), we leverage rectified self-attention to indirectly learn sparse vectors in n-gram vocabulary space. By augmenting the previous phrase retrieval model (Seo et al., 2019) with Sparc, we show 4%+ improvement in CuratedTREC and SQuAD-Open. Our CuratedTREC score is even better than the best known retrieve & read model with at least 45x faster inference speed.

pdf bib
Biomedical Entity Representations with Synonym Marginalization
Mujeen Sung | Hwisang Jeon | Jinhyuk Lee | Jaewoo Kang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Biomedical named entities often play important roles in many biomedical text mining tools. However, due to the incompleteness of provided synonyms and numerous variations in their surface forms, normalization of biomedical entities is very challenging. In this paper, we focus on learning representations of biomedical entities solely based on the synonyms of entities. To learn from the incomplete synonyms, we use a model-based candidate selection and maximize the marginal likelihood of the synonyms present in top candidates. Our model-based candidates are iteratively updated to contain more difficult negative samples as our model evolves. In this way, we avoid the explicit pre-selection of negative samples from more than 400K candidates. On four biomedical entity normalization datasets having three different entity types (disease, chemical, adverse reaction), our model BioSyn consistently outperforms previous state-of-the-art models almost reaching the upper bound on each dataset.

pdf bib
Answering Questions on COVID-19 in Real-Time
Jinhyuk Lee | Sean S. Yi | Minbyul Jeong | Mujeen Sung | WonJin Yoon | Yonghwa Choi | Miyoung Ko | Jaewoo Kang
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The recent outbreak of the novel coronavirus is wreaking havoc on the world and researchers are struggling to effectively combat it. One reason why the fight is difficult is due to the lack of information and knowledge. In this work, we outline our effort to contribute to shrinking this knowledge vacuum by creating covidAsk, a question answering (QA) system that combines biomedical text mining and QA techniques to provide answers to questions in real-time. Our system also leverages information retrieval (IR) approaches to provide entity-level answers that are complementary to QA models. Evaluation of covidAsk is carried out by using a manually created dataset called COVID-19 Questions which is based on information from various sources, including the CDC and the WHO. We hope our system will be able to aid researchers in their search for knowledge and information not only for COVID-19, but for future pandemics as well.

pdf bib
Adversarial Subword Regularization for Robust Neural Machine Translation
Jungsoo Park | Mujeen Sung | Jinhyuk Lee | Jaewoo Kang
Findings of the Association for Computational Linguistics: EMNLP 2020

Exposing diverse subword segmentations to neural machine translation (NMT) models often improves the robustness of machine translation as NMT models can experience various subword candidates. However, the diversification of subword segmentations mostly relies on the pre-trained subword language models from which erroneous segmentations of unseen words are less likely to be sampled. In this paper, we present adversarial subword regularization (ADVSR) to study whether gradient signals during training can be a substitute criterion for exposing diverse subword segmentations. We experimentally show that our model-based adversarial samples effectively encourage NMT models to be less sensitive to segmentation errors and improve the performance of NMT models in low-resource and out-domain datasets.

pdf bib
Look at the First Sentence: Position Bias in Question Answering
Miyoung Ko | Jinhyuk Lee | Hyunjae Kim | Gangwoo Kim | Jaewoo Kang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Many extractive question answering models are trained to predict start and end positions of answers. The choice of predicting answers as positions is mainly due to its simplicity and effectiveness. In this study, we hypothesize that when the distribution of the answer positions is highly skewed in the training set (e.g., answers lie only in the k-th sentence of each passage), QA models predicting answers as positions can learn spurious positional cues and fail to give answers in different positions. We first illustrate this position bias in popular extractive QA models such as BiDAF and BERT and thoroughly examine how position bias propagates through each layer of BERT. To safely deliver position information without position bias, we train models with various de-biasing methods including entropy regularization and bias ensembling. Among them, we found that using the prior distribution of answer positions as a bias model is very effective at reducing position bias, recovering the performance of BERT from 37.48% to 81.64% when trained on a biased SQuAD dataset.

2018

pdf bib
Ranking Paragraphs for Improving Answer Recall in Open-Domain Question Answering
Jinhyuk Lee | Seongjun Yun | Hyunjae Kim | Miyoung Ko | Jaewoo Kang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recently, open-domain question answering (QA) has been combined with machine comprehension models to find answers in a large knowledge source. As open-domain QA requires retrieving relevant documents from text corpora to answer questions, its performance largely depends on the performance of document retrievers. However, since traditional information retrieval systems are not effective in obtaining documents with a high probability of containing answers, they lower the performance of QA systems. Simply extracting more documents increases the number of irrelevant documents, which also degrades the performance of QA systems. In this paper, we introduce Paragraph Ranker which ranks paragraphs of retrieved documents for a higher answer recall with less noise. We show that ranking paragraphs and aggregating answers using Paragraph Ranker improves performance of open-domain QA pipeline on the four open-domain QA datasets by 7.8% on average.