Creating a POS Gold Standard Corpus of Modern Ukrainian
Vasyl Starko
Andriy Rysin
Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)
This paper presents an ongoing project to create the Ukrainian Brown Corpus (BRUK), a disambiguated corpus of Modern Ukrainian. Inspired by and loosely based on the original Brown University corpus, BRUK contains one million words, spans 11 years (2010–2020), and represents edited written Ukrainian. Using stratified random sampling, we have selected fragments of texts from multiple sources to ensure maximum variety, fill nine predefined categories, and produce a balanced corpus. BRUK has been automatically POS-tagged with the help of our tools (a large morphological dictionary of Ukrainian and a tagger). A manually disambiguated and validated subset of BRUK (450,000 words) has been made available online. This gold standard, the biggest of its kind for Ukrainian, fills a critical need in the NLP ecosystem for this language. The ultimate goal is to produce a fully disambiguated one-million-word corpus of Modern Ukrainian.
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Bogdan Babych
Olga Kanishcheva
Preslav Nakov
Jakub Piskorski
Lidia Pivovarova
Vasyl Starko
Josef Steinberger
Roman Yangarber
Michał Marcińczuk
Senja Pollak
Pavel Přibáň
Marko Robnik-Šikonja
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski
Bogdan Babych
Zara Kancheva
Olga Kanishcheva
Maria Lebedeva
Michał Marcińczuk
Preslav Nakov
Petya Osenova
Lidia Pivovarova
Senja Pollak
Pavel Přibáň
Ivaylo Radev
Marko Robnik-Šikonja
Vasyl Starko
Josef Steinberger
Roman Yangarber
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.