Quasi: a synthetic Question-Answering dataset in Swedish using GPT-3 and zero-shot learning

Dmytro Kalpakchi, Johan Boye


Abstract
This paper describes the creation and evaluation of a synthetic dataset of Swedish multiple-choice questions (MCQs) for reading comprehension using GPT-3. Although GPT-3 is trained mostly on English data, with only 0.11% of Swedish texts in its training material, the model still managed to generate MCQs in Swedish. About 44% of the generated MCQs turned out to be of sufficient quality, i.e. they were grammatically correct and relevant, with exactly one answer alternative being correct and the others being plausible but wrong. We provide a detailed analysis of the errors and shortcomings of the rejected MCQs, as well as an analysis of the level of difficulty of the accepted MCQs. In addition to giving insights into GPT-3, the synthetic dataset could be used for training and evaluation of special-purpose MCQ-generating models.
Anthology ID:
2023.nodalida-1.48
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
477–491
URL:
https://aclanthology.org/2023.nodalida-1.48
Cite (ACL):
Dmytro Kalpakchi and Johan Boye. 2023. Quasi: a synthetic Question-Answering dataset in Swedish using GPT-3 and zero-shot learning. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 477–491, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Quasi: a synthetic Question-Answering dataset in Swedish using GPT-3 and zero-shot learning (Kalpakchi & Boye, NoDaLiDa 2023)
PDF:
https://aclanthology.org/2023.nodalida-1.48.pdf