HPLT’s First Release of Data and Models

Nikolay Arefyev, Mikko Aulamo, Pinzhen Chen, Ona De Gibert Bonet, Barry Haddow, Jindřich Helcl, Bhavitvya Malik, Gema Ramírez-Sánchez, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Jaume Zaragoza-Bernabeu


Abstract
The High Performance Language Technologies (HPLT) project is a three-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release comprises monolingual data in 75 languages totalling 5.6T tokens and parallel data for 18 language pairs totalling 96M pairs, derived from 1.8 petabytes of web crawls. Building on automated and transparent pipelines, the first machine translation (MT) models and large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.
Anthology ID:
2024.eamt-2.27
Volume:
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)
Month:
June
Year:
2024
Address:
Sheffield, UK
Editors:
Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Mikel Forcada, Helena Moniz
Venue:
EAMT
Publisher:
European Association for Machine Translation (EAMT)
Pages:
53–54
URL:
https://aclanthology.org/2024.eamt-2.27
Cite (ACL):
Nikolay Arefyev, Mikko Aulamo, Pinzhen Chen, Ona De Gibert Bonet, Barry Haddow, Jindřich Helcl, Bhavitvya Malik, Gema Ramírez-Sánchez, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, and Jaume Zaragoza-Bernabeu. 2024. HPLT’s First Release of Data and Models. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2), pages 53–54, Sheffield, UK. European Association for Machine Translation (EAMT).
Cite (Informal):
HPLT’s First Release of Data and Models (Arefyev et al., EAMT 2024)
PDF:
https://aclanthology.org/2024.eamt-2.27.pdf