A Bit of a Problem: Measurement Disparities in Dataset Sizes across Languages

Catherine Arnett, Tyler A. Chang, Benjamin Bergen


Abstract
How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.
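The definition in the abstract can be illustrated with a minimal sketch: given content-matched (parallel) lines in two languages, the byte premium is the ratio of their total UTF-8 byte counts. The function names `utf8_size` and `byte_premium` and the tiny Russian/English sample are assumptions made for illustration only; this is not the tool released with the paper, and a two-line sample gives no meaningful estimate.

```python
from typing import Sequence


def utf8_size(text: str) -> int:
    """Return the number of bytes needed to encode `text` as UTF-8."""
    return len(text.encode("utf-8"))


def byte_premium(lines_a: Sequence[str], lines_b: Sequence[str]) -> float:
    """Ratio of UTF-8 bytes in language A to language B for parallel lines.

    Hypothetical helper: assumes lines_a[i] and lines_b[i] carry the same
    content, so the two corpora are content-matched.
    """
    if len(lines_a) != len(lines_b):
        raise ValueError("parallel corpora must have the same number of lines")
    bytes_a = sum(utf8_size(line) for line in lines_a)
    bytes_b = sum(utf8_size(line) for line in lines_b)
    if bytes_b == 0:
        raise ValueError("reference corpus is empty")
    return bytes_a / bytes_b


# Illustrative sample only: Cyrillic letters take 2 bytes each in UTF-8,
# while ASCII letters take 1 byte each.
russian = ["Привет", "Спасибо"]
english = ["Hello", "Thank you"]

if __name__ == "__main__":
    print(f"byte premium (Russian relative to English): "
          f"{byte_premium(russian, english):.2f}")
```

Under this definition, a premium above 1 means the first language needs more bytes than the second to express the same content, so dividing a dataset's byte size by its premium gives a rough content-equivalent size in the reference language.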
Anthology ID:
2024.sigul-1.1
Volume:
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venues:
SIGUL | WS
Publisher:
ELRA and ICCL
Pages:
1–9
URL:
https://aclanthology.org/2024.sigul-1.1
Cite (ACL):
Catherine Arnett, Tyler A. Chang, and Benjamin Bergen. 2024. A Bit of a Problem: Measurement Disparities in Dataset Sizes across Languages. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 1–9, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Bit of a Problem: Measurement Disparities in Dataset Sizes across Languages (Arnett et al., SIGUL-WS 2024)
PDF:
https://aclanthology.org/2024.sigul-1.1.pdf