WikiOmnia: filtration and evaluation of the generated QA corpus on the whole Russian Wikipedia

Dina Pisarevskaya, Tatiana Shavrina


Abstract
Methodology in general-domain QA has largely developed with reference to the Stanford Question Answering Dataset (SQuAD) as its main benchmark. Compiling factual question datasets requires manual annotation, which limits the potential size of the training data. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generation and filtration pipeline. To ensure high quality of the generated QA pairs, diverse manual and automated evaluation techniques were applied. The WikiOmnia pipeline is available as open source and has also been tested for creating SQuAD-formatted QA in other domains, such as news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).
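For readers unfamiliar with the SQuAD layout mentioned in the abstract, the following is a minimal sketch of how one generated QA pair and its paragraph could be wrapped in a SQuAD v1.1-style record. The function name `to_squad_record`, the `qa_id` value, and the example text are hypothetical and are not taken from the WikiOmnia pipeline or data; the span check is only a simple illustration, not the filtration procedure described in the paper.

```python
import json


def to_squad_record(title, context, question, answer, qa_id):
    """Wrap one QA pair in a SQuAD v1.1-style article entry (hypothetical helper)."""
    # SQuAD answers are spans of the context, located by character offset.
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer must appear verbatim in the context")
    return {
        "title": title,
        "paragraphs": [
            {
                "context": context,
                "qas": [
                    {
                        "id": qa_id,
                        "question": question,
                        "answers": [{"text": answer, "answer_start": start}],
                    }
                ],
            }
        ],
    }


# Illustrative text only; not a sample from WikiOmnia.
record = to_squad_record(
    title="Moscow",
    context="Moscow is the capital of Russia.",
    question="What is the capital of Russia?",
    answer="Moscow",
    qa_id="example-0",
)
dataset = {"version": "1.1", "data": [record]}
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Passing `ensure_ascii=False` keeps Cyrillic text readable in the serialized JSON, which matters for Russian-language paragraphs.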
Anthology ID:
2022.gem-1.10
Volume:
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Antoine Bosselut, Khyathi Chandu, Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Yacine Jernite, Jekaterina Novikova, Laura Perez-Beltrachini
Venue:
GEM
SIG:
SIGGEN
Publisher:
Association for Computational Linguistics
Pages:
125–135
URL:
https://aclanthology.org/2022.gem-1.10
DOI:
10.18653/v1/2022.gem-1.10
Cite (ACL):
Dina Pisarevskaya and Tatiana Shavrina. 2022. WikiOmnia: filtration and evaluation of the generated QA corpus on the whole Russian Wikipedia. In Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 125–135, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
WikiOmnia: filtration and evaluation of the generated QA corpus on the whole Russian Wikipedia (Pisarevskaya & Shavrina, GEM 2022)
PDF:
https://aclanthology.org/2022.gem-1.10.pdf