Findings of the WMT 2023 Shared Task on Parallel Data Curation

Steve Sloto, Brian Thompson, Huda Khayrallah, Tobias Domhan, Thamme Gowda, Philipp Koehn


Abstract
Building upon prior WMT shared tasks in document alignment and sentence filtering, we posed the open-ended shared task of finding the best subset of possible training data from a collection of Estonian-Lithuanian web data. Participants could focus on any portion of the end-to-end data curation pipeline, including alignment and filtering. We evaluated results based on downstream machine translation quality. We release processed Common Crawl data, along with various intermediate states from a strong baseline system, which we believe will enable future research on this topic.
Anthology ID:
2023.wmt-1.5
Volume:
Proceedings of the Eighth Conference on Machine Translation
Month:
December
Year:
2023
Address:
Singapore
Editors:
Philipp Koehn, Barry Haddow, Tom Kocmi, Christof Monz
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Pages:
95–102
URL:
https://aclanthology.org/2023.wmt-1.5
DOI:
10.18653/v1/2023.wmt-1.5
Cite (ACL):
Steve Sloto, Brian Thompson, Huda Khayrallah, Tobias Domhan, Thamme Gowda, and Philipp Koehn. 2023. Findings of the WMT 2023 Shared Task on Parallel Data Curation. In Proceedings of the Eighth Conference on Machine Translation, pages 95–102, Singapore. Association for Computational Linguistics.
Cite (Informal):
Findings of the WMT 2023 Shared Task on Parallel Data Curation (Sloto et al., WMT 2023)
PDF:
https://aclanthology.org/2023.wmt-1.5.pdf