Introducing UberText 2.0: A Corpus of Modern Ukrainian at Scale

Dmytro Chaplynskyi


Abstract
This paper addresses the need for massive corpora for a low-resource language. It presents the publicly available UberText 2.0 corpus for the Ukrainian language and discusses the methodology of its construction. While collecting and maintaining such a corpus is largely a data extraction and data engineering task, the corpus itself provides a solid foundation for natural language processing tasks. It can enable the creation of contemporary language models and word embeddings, resulting in better performance on numerous downstream tasks for the Ukrainian language. In addition, the paper and the software developed can serve as guidance and a model solution for other low-resource languages. The resulting corpus is available for download on the project page. It has 3.274 billion tokens, consists of 8.59 million texts, and takes up 32 gigabytes of space.
Anthology ID:
2023.unlp-1.1
Volume:
Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP)
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editor:
Mariana Romanyshyn
Venue:
UNLP
Publisher:
Association for Computational Linguistics
Pages:
1–10
URL:
https://aclanthology.org/2023.unlp-1.1
DOI:
10.18653/v1/2023.unlp-1.1
Cite (ACL):
Dmytro Chaplynskyi. 2023. Introducing UberText 2.0: A Corpus of Modern Ukrainian at Scale. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), pages 1–10, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Introducing UberText 2.0: A Corpus of Modern Ukrainian at Scale (Chaplynskyi, UNLP 2023)
PDF:
https://aclanthology.org/2023.unlp-1.1.pdf
Video:
https://aclanthology.org/2023.unlp-1.1.mp4