A Question Answering Benchmark Database for Hungarian

Attila Novák, Borbála Novák, Tamás Zombori, Gergő Szabó, Zsolt Szántó, Richárd Farkas


Abstract
In the research presented in this article, we created MILQA, a new question answering benchmark database for Hungarian. In creating the dataset, we largely followed the principles of the English SQuAD 2.0. However, as in some more recent English question answering datasets, we introduced a number of innovations beyond SQuAD, e.g., yes/no questions, list-like answers consisting of several text spans, long answers, questions requiring calculation, and other question types where the answer cannot simply be copied from the text. For all these non-extractive question types, the pragmatically adequate form of the answer was also added to make it possible to train generative models. We implemented and evaluated a set of baseline retrieval and answer span extraction models on the dataset. BM25 performed better than any vector-based solution for retrieval. Cross-lingual transfer from English significantly improved span extraction models.
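For readers unfamiliar with the BM25 retrieval baseline named in the abstract, the following is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the authors' implementation: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the whitespace tokenization, and the toy passages are all assumptions made for illustration, and the smoothed IDF is one common variant among several.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy passages invented for illustration; they are not taken from MILQA.
passages = [
    "budapest is the capital of hungary".split(),
    "the danube flows through budapest".split(),
    "squad is an english question answering dataset".split(),
]
query = "capital of hungary".split()
scores = bm25_scores(query, passages)
best = max(range(len(passages)), key=scores.__getitem__)
```

Here `best` is the index of the highest-scoring passage. Passages that share no term with the query receive a score of zero, which is the purely lexical behaviour that distinguishes BM25 from vector-based retrieval.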
Anthology ID:
2023.law-1.19
Volume:
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Jakob Prange, Annemarie Friedrich
Venue:
LAW
Publisher:
Association for Computational Linguistics
Pages:
188–198
URL:
https://aclanthology.org/2023.law-1.19
DOI:
10.18653/v1/2023.law-1.19
Cite (ACL):
Attila Novák, Borbála Novák, Tamás Zombori, Gergő Szabó, Zsolt Szántó, and Richárd Farkas. 2023. A Question Answering Benchmark Database for Hungarian. In Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII), pages 188–198, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
A Question Answering Benchmark Database for Hungarian (Novák et al., LAW 2023)
PDF:
https://aclanthology.org/2023.law-1.19.pdf