A Bit of a Problem: Measurement Disparities in Dataset Sizes across Languages

Catherine Arnett; Tyler A. Chang; Benjamin Bergen

Abstract

How should text dataset sizes be compared across languages? Even for content-matched (parallel) corpora, UTF-8 encoded text can require a dramatically different number of bytes for different languages. In our work, we define the byte premium between two languages as the ratio of bytes used to encode content-matched text in those languages. We compute byte premiums for 1155 languages, and we use linear regressions to estimate byte premiums for other languages. We release a tool to obtain byte premiums for any two languages, enabling comparisons of dataset sizes across languages for more equitable multilingual model development and data practices.
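
As a rough illustration of the definition above, the byte premium can be sketched as a ratio of UTF-8 byte counts over aligned text. This is a minimal sketch, not the released tool: the function names `utf8_size` and `byte_premium` and the toy sentence pair are assumptions made for illustration.

```python
from typing import Iterable


def utf8_size(lines: Iterable[str]) -> int:
    """Total number of bytes needed to encode the given lines as UTF-8."""
    return sum(len(line.encode("utf-8")) for line in lines)


def byte_premium(lines_a: Iterable[str], lines_b: Iterable[str]) -> float:
    """Ratio of UTF-8 bytes for language A to language B on content-matched text."""
    size_b = utf8_size(lines_b)
    if size_b == 0:
        raise ValueError("reference text is empty")
    return utf8_size(lines_a) / size_b


# Toy pair (hypothetical, and far too small for a real estimate).
russian = ["Привет"]  # 6 Cyrillic letters, 2 bytes each in UTF-8
english = ["Hello"]   # 5 ASCII letters, 1 byte each in UTF-8
print(byte_premium(russian, english))  # 2.4
```

Under this definition, one plausible way to compare dataset sizes would be to divide a dataset's byte count by its byte premium relative to a chosen reference language, giving an approximate size in reference-language-equivalent bytes.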

Anthology ID: 2024.sigul-1.1
Volume: Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Maite Melero, Sakriani Sakti, Claudia Soria
Venues: SIGUL | WS
Publisher: ELRA and ICCL
Pages: 1–9
URL: https://aclanthology.org/2024.sigul-1.1/
Cite (ACL): Catherine Arnett, Tyler A. Chang, and Benjamin Bergen. 2024. A Bit of a Problem: Measurement Disparities in Dataset Sizes across Languages. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 1–9, Torino, Italia. ELRA and ICCL.
Cite (Informal): A Bit of a Problem: Measurement Disparities in Dataset Sizes across Languages (Arnett et al., SIGUL 2024)
PDF: https://aclanthology.org/2024.sigul-1.1.pdf