Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages

Philipp Koehn


Abstract
We introduce neural methods and a toxicity filtering step to the hierarchical web mining approach of Paracrawl (Bañón et al., 2020), showing large improvements. We apply these methods to web-scale parallel corpus mining for 9 South and East Asian national languages, creating training resources for machine translation that yield better translation quality for most of these languages than existing publicly available datasets in OPUS. Our methods also generally lead to better results than the global mining approach of Schwenk et al. (2021).
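To illustrate the kind of neural scoring that sentence-pair mining of this sort relies on, below is a minimal sketch that scores candidate sentence pairs with a multilingual sentence encoder and keeps pairs above a cosine-similarity threshold. This is an illustrative assumption, not the paper's actual pipeline: the model choice (LaBSE via sentence-transformers), the mine_pairs helper, the 0.7 threshold, and the toy example sentences are all hypothetical.

```python
# Illustrative sketch only: greedy cosine-similarity matching of sentences
# from a source/target document pair. Model, threshold, and helper names
# are assumptions, not the method described in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_pairs(src_sentences, tgt_sentences, threshold=0.7):
    """Return (score, src, tgt) tuples for the best target match of each source sentence."""
    # With normalized embeddings, the dot product equals cosine similarity.
    src_emb = model.encode(src_sentences, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sentences, normalize_embeddings=True)
    sims = src_emb @ tgt_emb.T  # (num_src, num_tgt) similarity matrix

    pairs = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))       # best-matching target sentence
        if row[j] >= threshold:       # keep only confident matches
            pairs.append((float(row[j]), src_sentences[i], tgt_sentences[j]))
    return pairs

# Toy usage with a hypothetical English/Hindi document pair.
english = ["The weather is nice today.", "Parallel data improves translation quality."]
hindi = ["समानांतर डेटा अनुवाद की गुणवत्ता बढ़ाता है।", "आज मौसम अच्छा है।"]
for score, src, tgt in mine_pairs(english, hindi):
    print(f"{score:.2f}\t{src}\t{tgt}")
```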
Anthology ID: 2024.wmt-1.132
Volume: Proceedings of the Ninth Conference on Machine Translation
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 1454–1466
URL: https://aclanthology.org/2024.wmt-1.132
Cite (ACL): Philipp Koehn. 2024. Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages. In Proceedings of the Ninth Conference on Machine Translation, pages 1454–1466, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages (Koehn, WMT 2024)
PDF: https://aclanthology.org/2024.wmt-1.132.pdf