Building a Swedish Question-Answering Model

Hannes von Essen, Daniel Hesslow


Abstract
High-quality datasets for question answering exist in a few languages, but far from all. Producing such datasets for new languages requires extensive manual labour. In this work, we examine different methods for using existing datasets to train question-answering models in languages that lack such datasets. We show that machine translation followed by cross-lingual projection is a viable way to create a full question-answering dataset in a new language. We introduce new methods both for bitext alignment, using optimal transport, and for direct cross-lingual projection, utilizing multilingual BERT. We show that our methods produce good Swedish question-answering models without any manual work. Finally, we apply our proposed methods to Spanish and evaluate them on the XQuAD and MLQA benchmarks, where we achieve new state-of-the-art values of 80.4 F1 and 62.9 Exact Match (EM) points on the Spanish XQuAD corpus and 70.8 F1 and 53.0 EM on the Spanish MLQA corpus, showing that the technique is readily applicable to other languages.
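The F1 and Exact Match (EM) figures quoted above are the standard span-level metrics for extractive question answering. As a rough illustration only, the sketch below computes both for a single prediction. It assumes a simplified normalization (lowercasing and whitespace splitting); the official XQuAD and MLQA evaluation scripts apply additional, partly language-specific normalization, and the helper names `tokens`, `exact_match`, and `f1_score` are hypothetical, not taken from the paper's code.

```python
from collections import Counter


def tokens(text: str) -> list[str]:
    # Assumption: simplified normalization (lowercase + whitespace split).
    # The official benchmark scripts also strip punctuation and articles.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized token sequences are identical, else 0.0.
    return float(tokens(prediction) == tokens(reference))


def f1_score(prediction: str, reference: str) -> float:
    pred, ref = tokens(prediction), tokens(reference)
    if not pred or not ref:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred == ref)
    # Multiset intersection counts tokens shared by prediction and reference.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A prediction with one extra token still earns partial F1 credit.
    print(exact_match("i Göteborg", "Göteborg"))
    print(round(f1_score("i Göteborg", "Göteborg"), 3))
```

Corpus-level scores such as those reported above are typically obtained by taking, for each question, the maximum score over all reference answers and then averaging over questions.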
Anthology ID:
2020.pam-1.16
Volume:
Proceedings of the Probability and Meaning Conference (PaM 2020)
Month:
June
Year:
2020
Address:
Gothenburg
Editors:
Christine Howes, Stergios Chatzikyriakidis, Adam Ek, Vidya Somashekarappa
Venue:
PaM
Publisher:
Association for Computational Linguistics
Pages:
117–127
URL:
https://aclanthology.org/2020.pam-1.16
Cite (ACL):
Hannes von Essen and Daniel Hesslow. 2020. Building a Swedish Question-Answering Model. In Proceedings of the Probability and Meaning Conference (PaM 2020), pages 117–127, Gothenburg. Association for Computational Linguistics.
Cite (Informal):
Building a Swedish Question-Answering Model (von Essen & Hesslow, PaM 2020)
PDF:
https://aclanthology.org/2020.pam-1.16.pdf
Code
 vottivott/building-a-swedish-qa-model
Data
MLQA, XQuAD