Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora

Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, Krister Lindén


Abstract
This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task. We describe the ULI shared task and how the test set was constructed using the Wanca 2017 corpora and texts in different languages from the Leipzig corpora collection. We also provide the results of a baseline language identification experiment conducted using the ULI 2020 dataset.
Anthology ID:
2020.vardial-1.16
Volume:
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venues:
COLING | VarDial
Publisher:
International Committee on Computational Linguistics (ICCL)
Pages:
173–185
URL:
https://aclanthology.org/2020.vardial-1.16
Cite (ACL):
Tommi Jauhiainen, Heidi Jauhiainen, Niko Partanen, and Krister Lindén. 2020. Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 173–185, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
Cite (Informal):
Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora (Jauhiainen et al., VarDial 2020)
PDF:
https://aclanthology.org/2020.vardial-1.16.pdf