Synthetic Data Generation for Multilingual Domain-Adaptable Question Answering Systems

Alina Kramchaninova, Arne Defauw


Abstract
Deep learning models have significantly advanced the state of the art of question answering systems. However, the majority of datasets available for training such models have been annotated by humans, are open-domain, and are composed primarily in English. To deal with these limitations, we introduce a pipeline that creates synthetic data from natural text. To illustrate the domain-adaptability of our approach, as well as its multilingual potential, we use our pipeline to obtain synthetic data in English and Dutch. We combine the synthetic data with non-synthetic data (SQuAD 2.0) and evaluate multilingual BERT models on the question answering task. Models trained with synthetically augmented data demonstrate a clear improvement in performance when evaluated on the domain-specific test set, compared to the models trained exclusively on SQuAD 2.0. We expect our work to be beneficial for training domain-specific question-answering systems when the amount of available data is limited.
Anthology ID: 2022.eamt-1.18
Volume: Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
Month: June
Year: 2022
Address: Ghent, Belgium
Editors: Helena Moniz, Lieve Macken, Andrew Rufener, Loïc Barrault, Marta R. Costa-jussà, Christophe Declercq, Maarit Koponen, Ellie Kemp, Spyridon Pilos, Mikel L. Forcada, Carolina Scarton, Joachim Van den Bogaert, Joke Daems, Arda Tezcan, Bram Vanroy, Margot Fonteyne
Venue: EAMT
Publisher: European Association for Machine Translation
Pages: 151–160
URL: https://aclanthology.org/2022.eamt-1.18
Cite (ACL): Alina Kramchaninova and Arne Defauw. 2022. Synthetic Data Generation for Multilingual Domain-Adaptable Question Answering Systems. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 151–160, Ghent, Belgium. European Association for Machine Translation.
Cite (Informal): Synthetic Data Generation for Multilingual Domain-Adaptable Question Answering Systems (Kramchaninova & Defauw, EAMT 2022)
PDF: https://aclanthology.org/2022.eamt-1.18.pdf
Data: SQuAD