BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora

Sainik Mahata, Dipankar Das, Sivaji Bandyopadhyay


Abstract
A Statistical Machine Translation (SMT) system is typically trained on a large parallel corpus to produce effective translations. Such corpora are scarce, and building them involves considerable manual labor and cost. A parallel corpus can instead be prepared from comparable corpora, in which a pair of corpora in two different languages covers the same domain. In the present work, we try to build a parallel corpus for the French-English language pair from a given comparable corpus. The data and the problem set were provided as part of the shared task organized by BUCC 2017. We propose a system that first translates the sentences, relying heavily on Moses, and then groups the sentences based on sentence length similarity. Finally, one-to-one sentence selection is performed using cosine similarity.
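
The last two stages described in the abstract (grouping by sentence length, then one-to-one selection by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the French sentences have already been translated into English (for example with Moses), uses whitespace tokenization with bag-of-words counts, and the function names, the length ratio `max_ratio`, and the score `threshold` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors of two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def length_compatible(a, b, max_ratio=1.5):
    """True if the token counts of a and b differ by at most max_ratio (assumed value)."""
    la, lb = len(a.split()), len(b.split())
    if la == 0 or lb == 0:
        return False
    return max(la, lb) / min(la, lb) <= max_ratio


def select_pairs(translated, english, max_ratio=1.5, threshold=0.5):
    """For each translated sentence, pick the best length-compatible English sentence.

    Returns (translated_index, english_index, score) tuples whose score is at
    least `threshold` (assumed value, not taken from the paper).
    """
    pairs = []
    for i, t in enumerate(translated):
        best_j, best_score = None, 0.0
        for j, e in enumerate(english):
            if not length_compatible(t, e, max_ratio):
                continue  # skip candidates with very different lengths
            score = cosine_similarity(t, e)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


if __name__ == "__main__":
    # Toy data: machine-translated source sentences and target-side candidates.
    translated = [
        "the cat sat on the mat",
        "parliament approved the budget yesterday",
    ]
    english = [
        "the cat sat on the mat",
        "stock markets fell sharply today after the announcement by the central bank",
        "the parliament approved the budget yesterday",
    ]
    for i, j, score in select_pairs(translated, english):
        print(i, j, round(score, 3))
```

The length filter runs before the similarity computation so that candidates of very different lengths are never scored, which mirrors the ordering given in the abstract; a real system would likely need a better tokenizer and tuned values for both parameters.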
Anthology ID:
W17-2511
Volume:
Proceedings of the 10th Workshop on Building and Using Comparable Corpora
Month:
August
Year:
2017
Address:
Vancouver, Canada
Editors:
Serge Sharoff, Pierre Zweigenbaum, Reinhard Rapp
Venue:
BUCC
Publisher:
Association for Computational Linguistics
Pages:
56–59
URL:
https://aclanthology.org/W17-2511/
DOI:
10.18653/v1/W17-2511
Cite (ACL):
Sainik Mahata, Dipankar Das, and Sivaji Bandyopadhyay. 2017. BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pages 56–59, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora (Mahata et al., BUCC 2017)
PDF:
https://aclanthology.org/W17-2511.pdf