Spanish Corpus and Provenance with Computer-Aided Translation for the WMT24 OLDI Shared Task

Jose Cols


Abstract
This paper presents the Seed-CAT submission to the WMT24 Open Language Data Initiative shared task. We detail our data collection method, which involves a computer-aided translation tool developed specifically for translating Seed corpora. We release a professionally translated Spanish corpus and a provenance dataset documenting the translation process. We validated the quality of the data on the FLORES+ benchmark with English-Spanish neural machine translation models, which achieved an average chrF++ score of 34.9.
Anthology ID:
2024.wmt-1.50
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
624–635
URL:
https://aclanthology.org/2024.wmt-1.50
Cite (ACL):
Jose Cols. 2024. Spanish Corpus and Provenance with Computer-Aided Translation for the WMT24 OLDI Shared Task. In Proceedings of the Ninth Conference on Machine Translation, pages 624–635, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Spanish Corpus and Provenance with Computer-Aided Translation for the WMT24 OLDI Shared Task (Cols, WMT 2024)
PDF:
https://aclanthology.org/2024.wmt-1.50.pdf