Building Data Infrastructure for Low-Resource Languages

Sarah Luger; Rafael Mosquera-Gómez; Alex Miłowski; Thom Vaughan; Sara Hincapie-Monsalve; Pedro Ortiz Suarez; Kurt Bollacker

doi:10.18653/v1/2025.loresmt-1.14

Abstract

The MLCommons Datasets Working Group presents a comprehensive initiative to advance the development and accessibility of artificial intelligence (AI) training and testing resources. This paper introduces three key projects aimed at addressing critical gaps in the AI data ecosystem: the Unsupervised People’s Speech Dataset, containing over 821,000 hours of speech across 89+ languages; a strategic collaboration with Common Crawl to enhance web crawling capabilities for low-resource languages; and a framework for knowledge graph extraction evaluation. By focusing on languages other than English (LOTE) and creating permissively licensed, high-quality datasets, these initiatives aim to democratize AI development and improve model performance across diverse linguistic contexts. This work represents a significant step toward more inclusive and capable AI systems that can serve global communities.

Anthology ID: 2025.loresmt-1.14
Volume: Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)
Month: May
Year: 2025
Address: Albuquerque, New Mexico, U.S.A.
Editors: Atul Kr. Ojha, Chao-hong Liu, Ekaterina Vylomova, Flammie Pirinen, Jonathan Washington, Nathaniel Oco, Xiaobing Zhao
Venues: LoResMT | WS
Publisher: Association for Computational Linguistics
Pages: 154–160
URL: https://aclanthology.org/2025.loresmt-1.14/
DOI: 10.18653/v1/2025.loresmt-1.14
Cite (ACL): Sarah Luger, Rafael Mosquera-Gómez, Alex Miłowski, Thom Vaughan, Sara Hincapie-Monsalve, Pedro Ortiz Suarez, and Kurt Bollacker. 2025. Building Data Infrastructure for Low-Resource Languages. In Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025), pages 154–160, Albuquerque, New Mexico, U.S.A. Association for Computational Linguistics.
Cite (Informal): Building Data Infrastructure for Low-Resource Languages (Luger et al., LoResMT 2025)
PDF: https://aclanthology.org/2025.loresmt-1.14.pdf
