RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports
Sarvesh Soni | Meghana Gudala | Atieh Pajouhi | Kirk Roberts
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present a radiology question answering dataset, RadQA, with 3074 questions posed against radiology reports and annotated by physicians with their corresponding answer spans (resulting in a total of 6148 question-answer evidence pairs). The questions are manually created using the clinical referral section of the reports, so that they reflect the actual information needs of ordering physicians and are free of bias from seeing the answer context (this also organically creates unanswerable questions). The answer spans are marked within the Findings and Impressions sections of a report. The dataset aims to satisfy complex clinical requirements by including complete (yet concise) answer phrases, not just entities, that can span multiple lines. We conduct a thorough analysis of the proposed dataset by examining the broad categories of disagreement in annotation (providing insights into the errors made by humans) and the reasoning required to answer a question (uncovering a heavy dependence on medical knowledge). The best advanced transformer language model achieves an F1 score of 63.55 on the test set; the best human performance, however, is 90.31 (with an average of 84.52). This gap demonstrates the challenging nature of RadQA and leaves ample scope for future research on methods.
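As an illustration of how an F1 score for answer spans is commonly computed, the sketch below implements token-overlap F1 in the style of SQuAD evaluation. The abstract does not state which F1 variant RadQA uses, so this is an assumption; the function name `token_f1` and the example strings are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer span.

    Assumption: lowercased, whitespace-split tokens, as in SQuAD-style
    scoring. The RadQA abstract does not specify its exact F1 definition.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Unanswerable questions: both spans empty counts as a perfect match,
    # exactly one empty span counts as a complete miss.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans, for illustration only.
score = token_f1("small pleural effusion", "pleural effusion")
```

A dataset-level score under this scheme would be the mean of the per-question values, usually reported on a 0 to 100 scale.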
Evaluation of Dataset Selection for Pre-Training and Fine-Tuning Transformer Language Models for Clinical Question Answering
Sarvesh Soni | Kirk Roberts
Proceedings of the Twelfth Language Resources and Evaluation Conference
We evaluate the performance of various Transformer language models, pre-trained and fine-tuned on different combinations of open-domain, biomedical, and clinical corpora, on two clinical question answering (QA) datasets (CliCR and emrQA). We perform our evaluations on the task of machine reading comprehension, which involves training the model to answer a question given an unstructured context paragraph. We conduct a total of 48 experiments on different combinations of the large open-domain and domain-specific corpora. We find that an initial fine-tuning on an open-domain dataset, SQuAD, consistently improves clinical QA performance across all the model variants.
A Paraphrase Generation System for EHR Question Answering
Sarvesh Soni | Kirk Roberts
Proceedings of the 18th BioNLP Workshop and Shared Task
This paper proposes a dataset and method for automatically generating paraphrases of clinical questions relating to patient-specific information in electronic health records (EHRs). Crowdsourcing is used to collect 10,578 unique questions across 946 semantically distinct paraphrase clusters. This corpus is then used with a deep learning-based question paraphrasing method that uses a variational autoencoder and an LSTM encoder/decoder. The ultimate use of such a method is to improve the performance of automatic question answering methods for EHRs.