Majority Voting with Bidirectional Pre-translation For Bitext Retrieval

Alexander Jones, Derry Tanti Wijaya


Abstract
Obtaining high-quality parallel corpora is of paramount importance for training NMT systems. However, as many language pairs lack adequate gold-standard training data, a popular approach has been to mine so-called “pseudo-parallel” sentences from paired documents in two languages. In this paper, we outline some drawbacks with current methods that rely on an embedding similarity threshold, and propose a heuristic method in its place. Our method involves translating both halves of a paired corpus before mining, and then performing a majority vote on sentence pairs mined in three ways: after translating documents in language x to language y, after translating language y to x, and using the original documents in languages x and y. We demonstrate success with this novel approach on the Tatoeba similarity search benchmark in 64 low-resource languages, and on NMT in Kazakh and Gujarati. We also uncover the effect of resource-related factors (i.e. how much monolingual/bilingual data is available for a given language) on the optimal choice of bitext mining method, demonstrating that there is currently no one-size-fits-all approach for this task. We make the code and data used in our experiments publicly available.
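The majority-vote heuristic described in the abstract can be sketched in a few lines: run bitext mining three times (on the original document pair, after translating x→y, and after translating y→x), then keep only the sentence pairs retrieved in at least two of the three runs. The function name and inputs below are illustrative assumptions, not the authors' released code; each input is assumed to be an iterable of (source, target) sentence-index pairs produced by some mining run.

```python
from collections import Counter

def majority_vote(pairs_orig, pairs_x2y, pairs_y2x):
    """Keep sentence pairs retrieved by at least two of three mining runs.

    Each argument is an iterable of (src, tgt) pairs from one mining pass:
    original documents, x-translated-to-y, and y-translated-to-x.
    (Hypothetical sketch of the paper's majority-vote heuristic.)
    """
    votes = Counter()
    for run in (pairs_orig, pairs_x2y, pairs_y2x):
        # Deduplicate within a run so each run casts at most one vote per pair.
        votes.update(set(run))
    return {pair for pair, n in votes.items() if n >= 2}
```

For example, a pair found only in the y→x run would be discarded, while a pair found in both the original and x→y runs would be kept; this filters out spurious matches that depend on one particular translation direction.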
Anthology ID:
2021.bucc-1.7
Volume:
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Month:
September
Year:
2021
Address:
Online (Virtual Mode)
Venue:
BUCC
Publisher:
INCOMA Ltd.
Pages:
46–59
URL:
https://aclanthology.org/2021.bucc-1.7
Cite (ACL):
Alexander Jones and Derry Tanti Wijaya. 2021. Majority Voting with Bidirectional Pre-translation For Bitext Retrieval. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 46–59, Online (Virtual Mode). INCOMA Ltd.
Cite (Informal):
Majority Voting with Bidirectional Pre-translation For Bitext Retrieval (Jones & Wijaya, BUCC 2021)
PDF:
https://aclanthology.org/2021.bucc-1.7.pdf
Code
 AlexJonesNLP/alt-bitexts
Data
BUCCTatoeba