HPLT’s Second Data Release
Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Laurie Burchell, Pinzhen Chen, Mariia Fedorova, Ona de Gibert, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Andrey Kutuzov, Veronika Laippala, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Jaume Zaragoza-Bernabeu
Correct Metadata for
Abstract
We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.- Anthology ID:
- 2025.mtsummit-2.21
- Volume:
- Proceedings of Machine Translation Summit XX: Volume 2
- Month:
- June
- Year:
- 2025
- Address:
- Geneva, Switzerland
- Editors:
- Pierrette Bouillon, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Samuel Läubli, Martin Volk, Miquel Esplà-Gomis, Vincent Vandeghinste, Helena Moniz, Sara Szoc
- Venue:
- MTSummit
- SIG:
- Publisher:
- European Association for Machine Translation
- Note:
- Pages:
- 101–102
- Language:
- URL:
- https://aclanthology.org/2025.mtsummit-2.21/
- DOI:
- Bibkey:
- Cite (ACL):
- Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Laurie Burchell, Pinzhen Chen, Mariia Fedorova, Ona de Gibert, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Andrey Kutuzov, Veronika Laippala, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, and Jaume Zaragoza-Bernabeu. 2025. HPLT’s Second Data Release. In Proceedings of Machine Translation Summit XX: Volume 2, pages 101–102, Geneva, Switzerland. European Association for Machine Translation.
- Cite (Informal):
- HPLT’s Second Data Release (Arefyev et al., MTSummit 2025)
- Copy Citation:
- PDF:
- https://aclanthology.org/2025.mtsummit-2.21.pdf
Export citation
@inproceedings{arefyev-etal-2025-hplts,
title = "{HPLT}{'}s Second Data Release",
author = {Arefyev, Nikolay and
Aulamo, Mikko and
Ba{\~n}{\'o}n, Marta and
Burchell, Laurie and
Chen, Pinzhen and
Fedorova, Mariia and
de Gibert, Ona and
Guillou, Liane and
Haddow, Barry and
Haji{\v{c}}, Jan and
Helcl, Jind{\v{r}}ich and
Henriksson, Erik and
Kutuzov, Andrey and
Laippala, Veronika and
Malik, Bhavitvya and
Mehryary, Farrokh and
Mikhailov, Vladislav and
Myntti, Amanda and
O{'}Brien, Dayy{\'a}n and
Oepen, Stephan and
Pyysalo, Sampo and
Ram{\'i}rez-S{\'a}nchez, Gema and
Samuel, David and
Stepachev, Pavel and
Tiedemann, J{\"o}rg and
Vari{\v{s}}, Du{\v{s}}an and
Zaragoza-Bernabeu, Jaume},
editor = {Bouillon, Pierrette and
Gerlach, Johanna and
Girletti, Sabrina and
Volkart, Lise and
Rubino, Raphael and
Sennrich, Rico and
L{\"a}ubli, Samuel and
Volk, Martin and
Espl{\`a}-Gomis, Miquel and
Vandeghinste, Vincent and
Moniz, Helena and
Szoc, Sara},
booktitle = "Proceedings of Machine Translation Summit XX: Volume 2",
month = jun,
year = "2025",
address = "Geneva, Switzerland",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2025.mtsummit-2.21/",
pages = "101--102",
ISBN = "978-2-9701897-1-8",
abstract = "We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="arefyev-etal-2025-hplts">
<titleInfo>
<title>HPLT’s Second Data Release</title>
</titleInfo>
<name type="personal">
<namePart type="given">Nikolay</namePart>
<namePart type="family">Arefyev</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mikko</namePart>
<namePart type="family">Aulamo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Marta</namePart>
<namePart type="family">Bañón</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Laurie</namePart>
<namePart type="family">Burchell</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pinzhen</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mariia</namePart>
<namePart type="family">Fedorova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ona</namePart>
<namePart type="family">de Gibert</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Liane</namePart>
<namePart type="family">Guillou</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Barry</namePart>
<namePart type="family">Haddow</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jan</namePart>
<namePart type="family">Hajič</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jindřich</namePart>
<namePart type="family">Helcl</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Erik</namePart>
<namePart type="family">Henriksson</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Andrey</namePart>
<namePart type="family">Kutuzov</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Veronika</namePart>
<namePart type="family">Laippala</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bhavitvya</namePart>
<namePart type="family">Malik</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Farrokh</namePart>
<namePart type="family">Mehryary</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Vladislav</namePart>
<namePart type="family">Mikhailov</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Amanda</namePart>
<namePart type="family">Myntti</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dayyán</namePart>
<namePart type="family">O’Brien</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Stephan</namePart>
<namePart type="family">Oepen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sampo</namePart>
<namePart type="family">Pyysalo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Gema</namePart>
<namePart type="family">Ramírez-Sánchez</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">David</namePart>
<namePart type="family">Samuel</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pavel</namePart>
<namePart type="family">Stepachev</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jörg</namePart>
<namePart type="family">Tiedemann</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dušan</namePart>
<namePart type="family">Variš</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jaume</namePart>
<namePart type="family">Zaragoza-Bernabeu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-06</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of Machine Translation Summit XX: Volume 2</title>
</titleInfo>
<name type="personal">
<namePart type="given">Pierrette</namePart>
<namePart type="family">Bouillon</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Johanna</namePart>
<namePart type="family">Gerlach</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sabrina</namePart>
<namePart type="family">Girletti</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lise</namePart>
<namePart type="family">Volkart</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Raphael</namePart>
<namePart type="family">Rubino</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rico</namePart>
<namePart type="family">Sennrich</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Samuel</namePart>
<namePart type="family">Läubli</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Martin</namePart>
<namePart type="family">Volk</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Miquel</namePart>
<namePart type="family">Esplà-Gomis</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Vincent</namePart>
<namePart type="family">Vandeghinste</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Helena</namePart>
<namePart type="family">Moniz</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sara</namePart>
<namePart type="family">Szoc</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>European Association for Machine Translation</publisher>
<place>
<placeTerm type="text">Geneva, Switzerland</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">978-2-9701897-1-8</identifier>
</relatedItem>
<abstract>We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.</abstract>
<identifier type="citekey">arefyev-etal-2025-hplts</identifier>
<location>
<url>https://aclanthology.org/2025.mtsummit-2.21/</url>
</location>
<part>
<date>2025-06</date>
<extent unit="page">
<start>101</start>
<end>102</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings %T HPLT’s Second Data Release %A Arefyev, Nikolay %A Aulamo, Mikko %A Bañón, Marta %A Burchell, Laurie %A Chen, Pinzhen %A Fedorova, Mariia %A de Gibert, Ona %A Guillou, Liane %A Haddow, Barry %A Hajič, Jan %A Helcl, Jindřich %A Henriksson, Erik %A Kutuzov, Andrey %A Laippala, Veronika %A Malik, Bhavitvya %A Mehryary, Farrokh %A Mikhailov, Vladislav %A Myntti, Amanda %A O’Brien, Dayyán %A Oepen, Stephan %A Pyysalo, Sampo %A Ramírez-Sánchez, Gema %A Samuel, David %A Stepachev, Pavel %A Tiedemann, Jörg %A Variš, Dušan %A Zaragoza-Bernabeu, Jaume %Y Bouillon, Pierrette %Y Gerlach, Johanna %Y Girletti, Sabrina %Y Volkart, Lise %Y Rubino, Raphael %Y Sennrich, Rico %Y Läubli, Samuel %Y Volk, Martin %Y Esplà-Gomis, Miquel %Y Vandeghinste, Vincent %Y Moniz, Helena %Y Szoc, Sara %S Proceedings of Machine Translation Summit XX: Volume 2 %D 2025 %8 June %I European Association for Machine Translation %C Geneva, Switzerland %@ 978-2-9701897-1-8 %F arefyev-etal-2025-hplts %X We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence. %U https://aclanthology.org/2025.mtsummit-2.21/ %P 101-102
Markdown (Informal)
[HPLT’s Second Data Release](https://aclanthology.org/2025.mtsummit-2.21/) (Arefyev et al., MTSummit 2025)
- HPLT’s Second Data Release (Arefyev et al., MTSummit 2025)
ACL
- Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Laurie Burchell, Pinzhen Chen, Mariia Fedorova, Ona de Gibert, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Andrey Kutuzov, Veronika Laippala, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, and Jaume Zaragoza-Bernabeu. 2025. HPLT’s Second Data Release. In Proceedings of Machine Translation Summit XX: Volume 2, pages 101–102, Geneva, Switzerland. European Association for Machine Translation.