CDA: a Cost Efficient Content-based Multilingual Web Document Aligner

Thuy Vu, Alessandro Moschitti


Abstract
We introduce a Content-based Document Alignment approach (CDA), an efficient method for aligning multilingual web documents by content in order to create parallel training data for machine translation (MT) systems operating at an industrial level. CDA works in two steps: (i) projecting documents of a web domain into a shared multilingual space; then (ii) aligning them based on the similarity of their representations in that space. We leverage lexical translation models to build vector representations using TF×IDF. CDA achieves performance comparable with state-of-the-art systems on the WMT-16 Bilingual Document Alignment Shared Task benchmark while operating in a multilingual space. In addition, we created two web-scale datasets to examine the robustness of CDA in an industrial setting involving up to 28 languages and millions of documents. The experiments show that CDA is robust, cost-effective, and significantly superior at (i) processing large and noisy web data and (ii) scaling to new and low-resource languages.
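To make the two-step pipeline concrete, below is a minimal sketch of the idea described in the abstract: project documents into a shared (English) vocabulary via a word-level translation table, build TF×IDF vectors, and pair documents by cosine similarity. The names (`lex_table`, `to_shared_space`, the greedy one-to-one matching) and the dictionary-based projection are my simplifying assumptions for illustration, not the authors' actual implementation.

```python
# Sketch of content-based document alignment (illustrative only, not the paper's code).
# Assumes a lexical translation table `lex_table` mapping source-language tokens
# to English tokens, and pre-tokenized documents (lists of tokens) per language.
import math
from collections import Counter

def to_shared_space(tokens, lex_table):
    """Project a tokenized document into the shared (English) vocabulary."""
    return [lex_table.get(t, t) for t in tokens]

def tfidf_vectors(docs):
    """Build sparse TF*IDF vectors (dicts) for a list of token lists."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def align(src_docs, tgt_docs, lex_table):
    """Greedy one-to-one alignment of source documents to target documents."""
    shared = [to_shared_space(d, lex_table) for d in src_docs] + list(tgt_docs)
    vecs = tfidf_vectors(shared)
    src_vecs, tgt_vecs = vecs[:len(src_docs)], vecs[len(src_docs):]
    pairs, used = [], set()
    for i, sv in enumerate(src_vecs):
        candidates = [(cosine(sv, tv), j)
                      for j, tv in enumerate(tgt_vecs) if j not in used]
        if candidates:
            score, j = max(candidates)
            if score > 0.0:
                pairs.append((i, j, score))
                used.add(j)
    return pairs
```

The greedy matching above is one plausible way to realize "aligning based on similarity"; the paper itself may use a different matching strategy or scoring refinements.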
Anthology ID:
2021.eacl-main.266
Volume:
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Month:
April
Year:
2021
Address:
Online
Editors:
Paola Merlo, Jörg Tiedemann, Reut Tsarfaty
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
3053–3061
URL:
https://aclanthology.org/2021.eacl-main.266
DOI:
10.18653/v1/2021.eacl-main.266
Cite (ACL):
Thuy Vu and Alessandro Moschitti. 2021. CDA: a Cost Efficient Content-based Multilingual Web Document Aligner. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3053–3061, Online. Association for Computational Linguistics.
Cite (Informal):
CDA: a Cost Efficient Content-based Multilingual Web Document Aligner (Vu & Moschitti, EACL 2021)
PDF:
https://aclanthology.org/2021.eacl-main.266.pdf