Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus

Eleni Metheniti, Guenter Neumann


Abstract
Multilingual inflectional corpora are a scarce resource in the NLP community, especially corpora with annotated morpheme boundaries. We evaluate a multilingual inflectional corpus with morpheme boundaries, generated from the English Wiktionary (Metheniti and Neumann, 2018), against the largest multilingual, high-quality inflectional corpus, that of the UniMorph project (Kirov et al., 2018). We confirm that the generated Wikinflection corpus does not match the quality of UniMorph, but we were able to extract a significant number of words from the intersection of the two corpora. Our resulting Wikinflection corpus benefits from the morpheme segmentations of Wiktionary/Wikinflection and from the manually evaluated morphological feature tags of the UniMorph project, and has 216K lemmas and 5.4M word forms in a total of 68 languages.
Anthology ID:
2020.lrec-1.481
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3905–3912
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.481
Cite (ACL):
Eleni Metheniti and Guenter Neumann. 2020. Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3905–3912, Marseille, France. European Language Resources Association.
Cite (Informal):
Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus (Metheniti & Neumann, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.481.pdf