HPLT’s Second Data Release
Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Laurie Burchell, Pinzhen Chen, Mariia Fedorova, Ona de Gibert, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Andrey Kutuzov, Veronika Laippala, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Jaume Zaragoza-Bernabeu
Abstract
We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.
- Anthology ID:
- 2025.mtsummit-2.21
- Volume:
- Proceedings of Machine Translation Summit XX: Volume 2
- Month:
- June
- Year:
- 2025
- Address:
- Geneva, Switzerland
- Editors:
- Pierrette Bouillon, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Samuel Läubli, Martin Volk, Miquel Esplà-Gomis, Vincent Vandeghinste, Helena Moniz, Sara Szoc
- Venue:
- MTSummit
- Publisher:
- European Association for Machine Translation
- Pages:
- 101–102
- URL:
- https://aclanthology.org/2025.mtsummit-2.21/
- Bibkey:
- arefyev-etal-2025-hplts
- Cite (ACL):
- Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Laurie Burchell, Pinzhen Chen, Mariia Fedorova, Ona de Gibert, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Andrey Kutuzov, Veronika Laippala, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, and Jaume Zaragoza-Bernabeu. 2025. HPLT’s Second Data Release. In Proceedings of Machine Translation Summit XX: Volume 2, pages 101–102, Geneva, Switzerland. European Association for Machine Translation.
- Cite (Informal):
- HPLT’s Second Data Release (Arefyev et al., MTSummit 2025)
- PDF:
- https://aclanthology.org/2025.mtsummit-2.21.pdf
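The abstract's figure of 1,275 MultiHPLT pairs is consistent with the 51 released language pairs if one assumes that each pair links a distinct language to a single shared pivot language; the abstract does not state this, so treat it as an assumption. Under that assumption, cross-combining the 51 non-pivot languages gives every unordered pair among them, which the following minimal Python sketch counts. The language labels are hypothetical placeholders, not the released language codes.

```python
from itertools import combinations
from math import comb

# Assumption (not stated in the abstract): each of the 51 released pairs
# links one distinct language to a single shared pivot language.
NUM_PIVOT_PAIRS = 51

# Hypothetical placeholder labels; the real language codes are not listed here.
languages = [f"lang{i:02d}" for i in range(NUM_PIVOT_PAIRS)]

# Cross-combination: every unordered pair of distinct non-pivot languages.
cross_pairs = list(combinations(languages, 2))

# The enumeration agrees with the closed form n * (n - 1) / 2.
assert len(cross_pairs) == comb(NUM_PIVOT_PAIRS, 2) == 51 * 50 // 2

print(len(cross_pairs))  # 1275
```

The count matches the 1,275 pairs reported in the abstract, which is what makes the shared-pivot reading plausible; it does not confirm how the pairs were actually constructed.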
Export citation
@inproceedings{arefyev-etal-2025-hplts,
    title = "{HPLT}{'}s Second Data Release",
    author = {Arefyev, Nikolay and
      Aulamo, Mikko and
      Ba{\~n}{\'o}n, Marta and
      Burchell, Laurie and
      Chen, Pinzhen and
      Fedorova, Mariia and
      de Gibert, Ona and
      Guillou, Liane and
      Haddow, Barry and
      Haji{\v{c}}, Jan and
      Helcl, Jind{\v{r}}ich and
      Henriksson, Erik and
      Kutuzov, Andrey and
      Laippala, Veronika and
      Malik, Bhavitvya and
      Mehryary, Farrokh and
      Mikhailov, Vladislav and
      Myntti, Amanda and
      O{'}Brien, Dayy{\'a}n and
      Oepen, Stephan and
      Pyysalo, Sampo and
      Ram{\'i}rez-S{\'a}nchez, Gema and
      Samuel, David and
      Stepachev, Pavel and
      Tiedemann, J{\"o}rg and
      Vari{\v{s}}, Du{\v{s}}an and
      Zaragoza-Bernabeu, Jaume},
    editor = {Bouillon, Pierrette and
      Gerlach, Johanna and
      Girletti, Sabrina and
      Volkart, Lise and
      Rubino, Raphael and
      Sennrich, Rico and
      L{\"a}ubli, Samuel and
      Volk, Martin and
      Espl{\`a}-Gomis, Miquel and
      Vandeghinste, Vincent and
      Moniz, Helena and
      Szoc, Sara},
    booktitle = "Proceedings of Machine Translation Summit XX: Volume 2",
    month = jun,
    year = "2025",
    address = "Geneva, Switzerland",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2025.mtsummit-2.21/",
    pages = "101--102",
    ISBN = "978-2-9701897-1-8",
    abstract = "We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="arefyev-etal-2025-hplts">
    <titleInfo>
        <title>HPLT’s Second Data Release</title>
    </titleInfo>
    <name type="personal"><namePart type="given">Nikolay</namePart><namePart type="family">Arefyev</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Mikko</namePart><namePart type="family">Aulamo</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Marta</namePart><namePart type="family">Bañón</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Laurie</namePart><namePart type="family">Burchell</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Pinzhen</namePart><namePart type="family">Chen</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Mariia</namePart><namePart type="family">Fedorova</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Ona</namePart><namePart type="family">de Gibert</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Liane</namePart><namePart type="family">Guillou</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Barry</namePart><namePart type="family">Haddow</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Jan</namePart><namePart type="family">Hajič</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Jindřich</namePart><namePart type="family">Helcl</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Erik</namePart><namePart type="family">Henriksson</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Andrey</namePart><namePart type="family">Kutuzov</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Veronika</namePart><namePart type="family">Laippala</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Bhavitvya</namePart><namePart type="family">Malik</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Farrokh</namePart><namePart type="family">Mehryary</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Vladislav</namePart><namePart type="family">Mikhailov</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Amanda</namePart><namePart type="family">Myntti</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Dayyán</namePart><namePart type="family">O’Brien</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Stephan</namePart><namePart type="family">Oepen</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Sampo</namePart><namePart type="family">Pyysalo</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Gema</namePart><namePart type="family">Ramírez-Sánchez</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">David</namePart><namePart type="family">Samuel</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Pavel</namePart><namePart type="family">Stepachev</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Jörg</namePart><namePart type="family">Tiedemann</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Dušan</namePart><namePart type="family">Variš</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <name type="personal"><namePart type="given">Jaume</namePart><namePart type="family">Zaragoza-Bernabeu</namePart><role><roleTerm authority="marcrelator" type="text">author</roleTerm></role></name>
    <originInfo>
        <dateIssued>2025-06</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of Machine Translation Summit XX: Volume 2</title>
        </titleInfo>
        <name type="personal"><namePart type="given">Pierrette</namePart><namePart type="family">Bouillon</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Johanna</namePart><namePart type="family">Gerlach</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Sabrina</namePart><namePart type="family">Girletti</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Lise</namePart><namePart type="family">Volkart</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Raphael</namePart><namePart type="family">Rubino</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Rico</namePart><namePart type="family">Sennrich</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Samuel</namePart><namePart type="family">Läubli</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Martin</namePart><namePart type="family">Volk</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Miquel</namePart><namePart type="family">Esplà-Gomis</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Vincent</namePart><namePart type="family">Vandeghinste</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Helena</namePart><namePart type="family">Moniz</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <name type="personal"><namePart type="given">Sara</namePart><namePart type="family">Szoc</namePart><role><roleTerm authority="marcrelator" type="text">editor</roleTerm></role></name>
        <originInfo>
            <publisher>European Association for Machine Translation</publisher>
            <place>
                <placeTerm type="text">Geneva, Switzerland</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">978-2-9701897-1-8</identifier>
    </relatedItem>
    <abstract>We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.</abstract>
    <identifier type="citekey">arefyev-etal-2025-hplts</identifier>
    <location>
        <url>https://aclanthology.org/2025.mtsummit-2.21/</url>
    </location>
    <part>
        <date>2025-06</date>
        <extent unit="page">
            <start>101</start>
            <end>102</end>
        </extent>
    </part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T HPLT’s Second Data Release
%A Arefyev, Nikolay
%A Aulamo, Mikko
%A Bañón, Marta
%A Burchell, Laurie
%A Chen, Pinzhen
%A Fedorova, Mariia
%A de Gibert, Ona
%A Guillou, Liane
%A Haddow, Barry
%A Hajič, Jan
%A Helcl, Jindřich
%A Henriksson, Erik
%A Kutuzov, Andrey
%A Laippala, Veronika
%A Malik, Bhavitvya
%A Mehryary, Farrokh
%A Mikhailov, Vladislav
%A Myntti, Amanda
%A O’Brien, Dayyán
%A Oepen, Stephan
%A Pyysalo, Sampo
%A Ramírez-Sánchez, Gema
%A Samuel, David
%A Stepachev, Pavel
%A Tiedemann, Jörg
%A Variš, Dušan
%A Zaragoza-Bernabeu, Jaume
%Y Bouillon, Pierrette
%Y Gerlach, Johanna
%Y Girletti, Sabrina
%Y Volkart, Lise
%Y Rubino, Raphael
%Y Sennrich, Rico
%Y Läubli, Samuel
%Y Volk, Martin
%Y Esplà-Gomis, Miquel
%Y Vandeghinste, Vincent
%Y Moniz, Helena
%Y Szoc, Sara
%S Proceedings of Machine Translation Summit XX: Volume 2
%D 2025
%8 June
%I European Association for Machine Translation
%C Geneva, Switzerland
%@ 978-2-9701897-1-8
%F arefyev-etal-2025-hplts
%X We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.
%U https://aclanthology.org/2025.mtsummit-2.21/
%P 101-102
Markdown (Informal)
[HPLT’s Second Data Release](https://aclanthology.org/2025.mtsummit-2.21/) (Arefyev et al., MTSummit 2025)