Unsupervised Parallel Sentence Extraction from Comparable Corpora

Viktor Hangya, Fabienne Braune, Yuliya Kalasouskaya, Alexander Fraser


Abstract
Mining parallel sentences from comparable corpora is of great interest for many downstream tasks. In the BUCC 2017 shared task, systems performed well by training on gold-standard parallel sentences. However, we often want to mine parallel sentences without bilingual supervision. We present a simple approach relying on bilingual word embeddings trained in an unsupervised fashion. We incorporate orthographic similarity in order to handle words with similar surface forms. In addition, we propose a dynamic threshold method to decide whether a candidate sentence pair is parallel, which eliminates the need to fine-tune a static value for different datasets. Since we do not employ any language-specific engineering, our approach is highly generic. We show that our approach is effective on three language pairs without the use of any bilingual signal, which is important because parallel sentence mining is most useful in low-resource scenarios.
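The abstract does not spell out how sentences are scored or how the dynamic threshold is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that a sentence is represented by the average of its word vectors in a shared bilingual space, that candidates are compared with cosine similarity, and that the threshold is the mean of the best-candidate scores plus `lam` times their standard deviation. The names `sentence_vector`, `mine_pairs`, and `lam`, and the toy embeddings, are hypothetical; the orthographic similarity component is left out.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def mine_pairs(src_sents, tgt_sents, embeddings, dim, lam=0.0):
    """Keep each source sentence's best target if it clears a data-dependent threshold.

    Assumption (not from the abstract): threshold = mean + lam * std of the
    best-candidate scores, so no fixed cut-off is tuned per dataset.
    """
    if not src_sents or not tgt_sents:
        return [], 0.0
    tgt_vecs = [sentence_vector(s, embeddings, dim) for s in tgt_sents]
    best = []
    for i, sent in enumerate(src_sents):
        vec = sentence_vector(sent, embeddings, dim)
        scores = [cosine(vec, t) for t in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        best.append((i, j, scores[j]))
    values = [score for _, _, score in best]
    threshold = statistics.mean(values) + lam * statistics.pstdev(values)
    return [p for p in best if p[2] >= threshold], threshold


# Toy shared embedding space (made-up vectors, for illustration only).
toy_embeddings = {
    "dog": [1.0, 0.0], "hund": [0.9, 0.1],
    "cat": [0.0, 1.0], "katze": [0.1, 0.9],
    "car": [0.7, 0.7],
}
src = [["dog"], ["cat"], ["car"]]
tgt = [["hund"], ["katze"]]
pairs, threshold = mine_pairs(src, tgt, toy_embeddings, dim=2)
print([(i, j) for i, j, _ in pairs])
```

In this toy run the third source sentence has no close counterpart, so its best score falls below the threshold derived from the score distribution and it is dropped. Raising `lam` would make the filter stricter.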
Anthology ID:
2018.iwslt-1.2
Volume:
Proceedings of the 15th International Conference on Spoken Language Translation
Month:
October 29-30
Year:
2018
Address:
Brussels
Editors:
Marco Turchi, Jan Niehues, Marcello Federico
Venue:
IWSLT
SIG:
SIGSLT
Publisher:
International Conference on Spoken Language Translation
Pages:
7–13
URL:
https://aclanthology.org/2018.iwslt-1.2
Cite (ACL):
Viktor Hangya, Fabienne Braune, Yuliya Kalasouskaya, and Alexander Fraser. 2018. Unsupervised Parallel Sentence Extraction from Comparable Corpora. In Proceedings of the 15th International Conference on Spoken Language Translation, pages 7–13, Brussels. International Conference on Spoken Language Translation.
Cite (Informal):
Unsupervised Parallel Sentence Extraction from Comparable Corpora (Hangya et al., IWSLT 2018)
PDF:
https://aclanthology.org/2018.iwslt-1.2.pdf
Data
BUCC