An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering

Shayne Longpre, Yi Lu, Zhucheng Tu, Chris DuBois


Abstract
To produce a domain-agnostic question answering model for the Machine Reading Question Answering (MRQA) 2019 Shared Task, we investigate the relative benefits of large pre-trained language models, various data sampling strategies, and query and context paraphrases generated by back-translation. We find a simple negative sampling technique to be particularly effective, even though it is typically used for datasets that include unanswerable questions, such as SQuAD 2.0. When applied in conjunction with per-domain sampling, our XLNet (Yang et al., 2019)-based submission achieved the second-best Exact Match and F1 in the MRQA leaderboard competition.
Anthology ID:
D19-5829
Volume:
Proceedings of the 2nd Workshop on Machine Reading for Question Answering
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, Danqi Chen
Venue:
WS
Publisher:
Association for Computational Linguistics
Pages:
220–227
URL:
https://aclanthology.org/D19-5829
DOI:
10.18653/v1/D19-5829
Cite (ACL):
Shayne Longpre, Yi Lu, Zhucheng Tu, and Chris DuBois. 2019. An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 220–227, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
An Exploration of Data Augmentation and Sampling Techniques for Domain-Agnostic Question Answering (Longpre et al., 2019)
PDF:
https://aclanthology.org/D19-5829.pdf
Data
BioASQ, DROP, DuoRC, HotpotQA, Natural Questions, NewsQA, RACE, SQuAD, SearchQA, TriviaQA