Majority Voting with Bidirectional Pre-translation For Bitext Retrieval

Alexander Jones, Derry Tanti Wijaya


Abstract
Obtaining high-quality parallel corpora is of paramount importance for training NMT systems. However, as many language pairs lack adequate gold-standard training data, a popular approach has been to mine so-called “pseudo-parallel” sentences from paired documents in two languages. In this paper, we outline some drawbacks with current methods that rely on an embedding similarity threshold, and propose a heuristic method in its place. Our method involves translating both halves of a paired corpus before mining, and then performing a majority vote on sentence pairs mined in three ways: after translating documents in language x to language y, after translating language y to x, and using the original documents in languages x and y. We demonstrate success with this novel approach on the Tatoeba similarity search benchmark in 64 low-resource languages, and on NMT in Kazakh and Gujarati. We also uncover the effect of resource-related factors (i.e. how much monolingual/bilingual data is available for a given language) on the optimal choice of bitext mining method, demonstrating that there is currently no one-size-fits-all approach for this task. We make the code and data used in our experiments publicly available.
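The majority-vote heuristic described in the abstract can be sketched in a few lines: run bitext mining three times (on the original document pair, after translating x→y, and after translating y→x), then keep only the sentence pairs retrieved in at least two of the three runs. The function name and inputs below are illustrative assumptions, not the authors' released code; each input is assumed to be an iterable of (source, target) sentence-index pairs produced by some mining run.

```python
from collections import Counter

def majority_vote(pairs_orig, pairs_x2y, pairs_y2x):
    """Keep sentence pairs retrieved by at least two of three mining runs.

    Each argument is an iterable of (src, tgt) pairs from one mining pass:
    original documents, x-translated-to-y, and y-translated-to-x.
    (Hypothetical sketch of the paper's majority-vote heuristic.)
    """
    votes = Counter()
    for run in (pairs_orig, pairs_x2y, pairs_y2x):
        # Deduplicate within a run so each run casts at most one vote per pair.
        votes.update(set(run))
    return {pair for pair, n in votes.items() if n >= 2}
```

For example, a pair found only in the y→x run would be discarded, while a pair found in both the original and x→y runs would be kept; this filters out spurious matches that depend on one particular translation direction.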
Anthology ID:
2021.bucc-1.7
Volume:
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Month:
September
Year:
2021
Address:
Online (Virtual Mode)
Venue:
BUCC
Publisher:
INCOMA Ltd.
Pages:
46–59
URL:
https://aclanthology.org/2021.bucc-1.7
Cite (ACL):
Alexander Jones and Derry Tanti Wijaya. 2021. Majority Voting with Bidirectional Pre-translation For Bitext Retrieval. In Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021), pages 46–59, Online (Virtual Mode). INCOMA Ltd.
Cite (Informal):
Majority Voting with Bidirectional Pre-translation For Bitext Retrieval (Jones & Wijaya, BUCC 2021)
PDF:
https://aclanthology.org/2021.bucc-1.7.pdf
Code
 AlexJonesNLP/alt-bitexts
Data
BUCCTatoeba