Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance

Ahmed El-Kishky, Francisco Guzmán


Abstract
Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or are translations of each other. Such aligned data can be used for a variety of NLP tasks, from training cross-lingual representations to mining parallel data for machine translation. In this paper, we develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages. These semantic distances then guide a document alignment algorithm that pairs cross-lingual web documents across a variety of low-, mid-, and high-resource language pairs. Recognizing that our proposed scoring function and other state-of-the-art methods are computationally intractable for long web documents, we utilize a more tractable greedy algorithm that performs comparably. We experimentally demonstrate that our distance metric yields better alignment than current baselines, outperforming them by 7% on high-resource language pairs, 15% on mid-resource language pairs, and 22% on low-resource language pairs.
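To make the abstract's two ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes each document is already represented as a list of sentence embedding vectors (toy 2-D vectors here, standing in for real cross-lingual sentence embeddings), that every sentence carries equal mass, and that distance is Euclidean. The function names `greedy_movers_distance` and `greedy_align` are hypothetical. The first approximates a mover's-style distance by greedily transporting mass between the closest sentence pairs; the second pairs documents one-to-one in order of increasing distance.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def greedy_movers_distance(src_vecs, tgt_vecs):
    """Greedy approximation of a sentence-level mover's distance.

    Assumption (not from the source): uniform mass per sentence.
    """
    n, m = len(src_vecs), len(tgt_vecs)
    src_left = [1.0 / n] * n  # mass still to be moved out of each source sentence
    tgt_left = [1.0 / m] * m  # capacity still open at each target sentence
    # Visit every sentence pair, closest pairs first.
    pairs = sorted(
        (euclidean(src_vecs[i], tgt_vecs[j]), i, j)
        for i in range(n)
        for j in range(m)
    )
    total = 0.0
    for dist, i, j in pairs:
        flow = min(src_left[i], tgt_left[j])
        if flow <= 0.0:
            continue
        total += flow * dist
        src_left[i] -= flow
        tgt_left[j] -= flow
    return total


def greedy_align(src_docs, tgt_docs):
    """Pair documents one-to-one, taking the lowest-distance pairs first."""
    scored = sorted(
        (greedy_movers_distance(s_vecs, t_vecs), s_id, t_id)
        for s_id, s_vecs in src_docs.items()
        for t_id, t_vecs in tgt_docs.items()
    )
    used_src, used_tgt, aligned = set(), set(), []
    for _, s_id, t_id in scored:
        if s_id in used_src or t_id in used_tgt:
            continue  # each document may appear in at most one pair
        aligned.append((s_id, t_id))
        used_src.add(s_id)
        used_tgt.add(t_id)
    return aligned


# Toy data: document ids and vectors are made up for illustration.
src_docs = {"en/a": [[0.0, 0.0], [1.0, 0.0]], "en/b": [[5.0, 5.0]]}
tgt_docs = {"fr/a": [[0.0, 0.1], [1.0, 0.1]], "fr/b": [[5.0, 5.1]]}
aligned = greedy_align(src_docs, tgt_docs)
```

On this toy input, `aligned` pairs `en/a` with `fr/a` and `en/b` with `fr/b`. The greedy transport step sorts all sentence pairs once and makes a single pass over them, rather than solving an optimal transport problem exactly. This is the kind of trade-off the abstract describes for long web documents.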
Anthology ID:
2020.aacl-main.62
Volume:
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Venue:
AACL
Publisher:
Association for Computational Linguistics
Pages:
616–625
URL:
https://aclanthology.org/2020.aacl-main.62
Cite (ACL):
Ahmed El-Kishky and Francisco Guzmán. 2020. Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 616–625, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance (El-Kishky & Guzmán, AACL 2020)
PDF:
https://aclanthology.org/2020.aacl-main.62.pdf