Data Augmentation for Biomedical Factoid Question Answering

Dimitris Pappas, Prodromos Malakasiotis, Ion Androutsopoulos


Abstract
We study the effect of seven data augmentation (DA) methods in factoid question answering, focusing on the biomedical domain, where obtaining training instances is particularly difficult. We experiment with data from the BIOASQ challenge, which we augment with training instances obtained from an artificial biomedical machine reading comprehension dataset, or via back-translation, information retrieval, word substitution based on WORD2VEC embeddings or on masked language modeling, question generation, or extending the given passage with additional context. We show that DA can lead to very significant performance gains, even when using large pre-trained Transformers, contributing to a broader discussion of whether and when DA benefits large pre-trained models. One of the simplest DA methods, WORD2VEC-based word substitution, performed best and is recommended. We release our artificial training instances and code.
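To make the recommended method concrete, the following is a minimal sketch of embedding-based word substitution, not the authors' released implementation. It assumes a tiny hand-made embedding table (`TOY_EMBEDDINGS`, with invented values standing in for real WORD2VEC vectors) and hypothetical helper names (`nearest_neighbor`, `augment`). The choice to leave answer tokens untouched, so that the factoid answer still appears in the augmented text, is likewise an assumption of this sketch.

```python
import math
import random

# Toy embedding table with invented values (NOT real WORD2VEC vectors).
TOY_EMBEDDINGS = {
    "protein": [0.9, 0.1, 0.0],
    "enzyme": [0.85, 0.15, 0.05],
    "gene": [0.1, 0.9, 0.0],
    "allele": [0.15, 0.85, 0.05],
    "disease": [0.0, 0.1, 0.9],
    "syndrome": [0.05, 0.15, 0.85],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(word, embeddings):
    """Return the most similar other word, or None if `word` is out of vocabulary."""
    if word not in embeddings:
        return None
    candidates = [
        (cosine(embeddings[word], vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    return max(candidates)[1] if candidates else None


def augment(tokens, embeddings, answer_tokens, prob=0.5, rng=None):
    """Replace in-vocabulary tokens with their nearest neighbor with probability `prob`.

    Tokens listed in `answer_tokens` are never replaced (an assumption of this
    sketch, so the answer remains present in the augmented instance).
    """
    rng = rng or random.Random(0)
    protected = {t.lower() for t in answer_tokens}
    augmented = []
    for tok in tokens:
        key = tok.lower()
        neighbor = nearest_neighbor(key, embeddings)
        if key not in protected and neighbor is not None and rng.random() < prob:
            augmented.append(neighbor)
        else:
            augmented.append(tok)
    return augmented


question = ["Which", "gene", "causes", "the", "disease"]
new_question = augment(question, TOY_EMBEDDINGS, answer_tokens={"gene"}, prob=1.0)
```

With real pre-trained vectors, the lookup in `nearest_neighbor` would typically be replaced by a nearest-neighbor query against the trained embedding model; the substitution probability `prob` is a tunable parameter of this sketch, not a value taken from the paper.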
Anthology ID:
2022.bionlp-1.6
Volume:
Proceedings of the 21st Workshop on Biomedical Language Processing
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Dina Demner-Fushman, Kevin Bretonnel Cohen, Sophia Ananiadou, Junichi Tsujii
Venue:
BioNLP
Publisher:
Association for Computational Linguistics
Pages:
63–81
URL:
https://aclanthology.org/2022.bionlp-1.6
DOI:
10.18653/v1/2022.bionlp-1.6
Cite (ACL):
Dimitris Pappas, Prodromos Malakasiotis, and Ion Androutsopoulos. 2022. Data Augmentation for Biomedical Factoid Question Answering. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 63–81, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Data Augmentation for Biomedical Factoid Question Answering (Pappas et al., BioNLP 2022)
PDF:
https://aclanthology.org/2022.bionlp-1.6.pdf
Video:
https://aclanthology.org/2022.bionlp-1.6.mp4
Code:
dpappas/Data-Augmentation-for-Biomedical-Factoid-Question-Answering
Data:
BIOMRC, BioASQ, SQuAD