Parallel Sentence Mining by Constrained Decoding

Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, Faheem Kirefu


Abstract
We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system by itself can be a sentence similarity scorer and it efficiently approximates pairwise comparison with a modified beam search. When benchmarked on the BUCC shared task, our method achieves results comparable to other submissions.
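The core idea in the abstract, decoding restricted to a prefix tree built from the other corpus, can be illustrated with a minimal sketch. It is not the authors' implementation. The names `build_trie`, `constrained_beam_search`, and `toy_log_prob` are hypothetical, and `toy_log_prob` is a hand-written stand-in for a real neural translation model's next-token log-probability. The paper's modified beam search may differ in details the abstract does not state, such as pruning or score normalization.

```python
import math
from typing import Callable, List, Sequence, Tuple

EOS = "</s>"  # assumed end-of-sentence marker


def build_trie(sentences: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict prefix tree over tokenized target-side sentences."""
    root: dict = {}
    for tokens in sentences:
        node = root
        for tok in list(tokens) + [EOS]:
            node = node.setdefault(tok, {})
    return root


def constrained_beam_search(
    log_prob: Callable[[Tuple[str, ...], str], float],
    trie: dict,
    beam_size: int = 2,
    max_len: int = 50,
) -> List[Tuple[float, Tuple[str, ...]]]:
    """Beam search that expands only tokens permitted by the trie.

    Every finished hypothesis is therefore a sentence from the target corpus,
    and its score is the model's log-probability of that sentence.
    """
    beam = [(0.0, (), trie)]  # (score, token prefix, current trie node)
    finished: List[Tuple[float, Tuple[str, ...]]] = []
    for _ in range(max_len):
        candidates = []
        for score, prefix, node in beam:
            for tok, child in node.items():
                new_score = score + log_prob(prefix, tok)
                if tok == EOS:
                    finished.append((new_score, prefix))
                else:
                    candidates.append((new_score, prefix + (tok,), child))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    finished.sort(key=lambda f: f[0], reverse=True)
    return finished


def toy_log_prob(prefix: Tuple[str, ...], tok: str) -> float:
    """Hypothetical scorer that favors the translation 'the cat sleeps'."""
    preferred = ("the", "cat", "sleeps", EOS)
    want = preferred[len(prefix)] if len(prefix) < len(preferred) else EOS
    return math.log(0.9) if tok == want else math.log(0.05)


target_corpus = [s.split() for s in ["the cat sleeps", "the dog barks", "a cat sleeps"]]
matches = constrained_beam_search(toy_log_prob, build_trie(target_corpus))
```

Because the beam follows only trie edges, the search never scores every source-target pair explicitly. That is the sense in which constrained decoding approximates pairwise comparison.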
Anthology ID:
2020.acl-main.152
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1672–1678
URL:
https://aclanthology.org/2020.acl-main.152
DOI:
10.18653/v1/2020.acl-main.152
PDF:
https://aclanthology.org/2020.acl-main.152.pdf
Video:
http://slideslive.com/38929223