Creating a POS Gold Standard Corpus of Modern Ukrainian

Vasyl Starko, Andriy Rysin


Abstract
This paper presents an ongoing project to create the Ukrainian Brown Corpus (BRUK), a disambiguated corpus of Modern Ukrainian. Inspired by and loosely based on the original Brown University corpus, BRUK contains one million words, spans 11 years (2010–2020), and represents edited written Ukrainian. Using stratified random sampling, we have selected fragments of texts from multiple sources to ensure maximum variety, fill nine predefined categories, and produce a balanced corpus. BRUK has been automatically POS-tagged with the help of our tools (a large morphological dictionary of Ukrainian and a tagger). A manually disambiguated and validated subset of BRUK (450,000 words) has been made available online. This gold standard, the biggest of its kind for Ukrainian, fills a critical need in the NLP ecosystem for this language. The ultimate goal is to produce a fully disambiguated one-million-word corpus of Modern Ukrainian.
Anthology ID:
2023.unlp-1.11
Volume:
Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editor:
Mariana Romanyshyn
Venue:
UNLP
Publisher:
Association for Computational Linguistics
Pages:
91–95
URL:
https://aclanthology.org/2023.unlp-1.11
DOI:
10.18653/v1/2023.unlp-1.11
Cite (ACL):
Vasyl Starko and Andriy Rysin. 2023. Creating a POS Gold Standard Corpus of Modern Ukrainian. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), pages 91–95, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Creating a POS Gold Standard Corpus of Modern Ukrainian (Starko & Rysin, UNLP 2023)
PDF:
https://aclanthology.org/2023.unlp-1.11.pdf
Video:
https://aclanthology.org/2023.unlp-1.11.mp4