Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data

Dian Yu, Kai Sun, Dong Yu, Claire Cardie


Abstract
Despite considerable progress, most machine reading comprehension (MRC) tasks still lack sufficient training data to fully exploit powerful deep neural network models with millions of parameters, and it is laborious, expensive, and time-consuming to create large-scale, high-quality MRC data through crowdsourcing. This paper focuses on generating more training data for MRC tasks by leveraging existing question-answering (QA) data. We first collect a large-scale multi-subject multiple-choice QA dataset for Chinese, ExamQA. We next use incomplete, yet relevant snippets returned by a web search engine as the context for each QA instance to convert it into a weakly-labeled MRC instance. To better use the weakly-labeled data to improve a target MRC task, we evaluate and compare several methods and further propose a self-teaching paradigm. Experimental results show that, upon state-of-the-art MRC baselines, we can obtain +5.1% in accuracy on a multiple-choice Chinese MRC dataset, C³, and +3.8% in exact match on an extractive Chinese MRC dataset, CMRC 2018, demonstrating the usefulness of the generated QA-based weakly-labeled data for different types of MRC tasks as well as the effectiveness of self-teaching. ExamQA will be available at https://dataset.org/examqa/.
Anthology ID:
2021.findings-emnlp.6
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
56–68
URL:
https://aclanthology.org/2021.findings-emnlp.6
DOI:
10.18653/v1/2021.findings-emnlp.6
Cite (ACL):
Dian Yu, Kai Sun, Dong Yu, and Claire Cardie. 2021. Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 56–68, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data (Yu et al., Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.6.pdf
Video:
https://aclanthology.org/2021.findings-emnlp.6.mp4
Data
C³, CMRC, CMRC 2018, DRCD, HeadQA, JEC-QA, MedQA