The Challenges of Optimizing Machine Translation for Low Resource Cross-Language Information Retrieval

Constantine Lignos; Daniel Cohen; Yen-Chieh Lien; Pratik Mehta; W. Bruce Croft; Scott Miller

doi:10.18653/v1/D19-1353

Abstract

When performing cross-language information retrieval (CLIR) for lower-resourced languages, a common approach is to retrieve over the output of machine translation (MT). However, there is no established guidance on how to optimize the resulting MT-IR system. In this paper, we examine the relationship between the performance of MT systems and both neural and term frequency-based IR models to identify how CLIR performance can be best predicted from MT quality. We explore performance at varying amounts of MT training data, byte pair encoding (BPE) merge operations, and across two IR collections and retrieval models. We find that the choice of IR collection can substantially affect the predictive power of MT tuning decisions and evaluation, potentially introducing dissociations between MT-only and overall CLIR performance.

Anthology ID: D19-1353
Volume: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Month: November
Year: 2019
Address: Hong Kong, China
Editors: Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
Venues: EMNLP | IJCNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 3497–3502
URL: https://aclanthology.org/D19-1353/
DOI: 10.18653/v1/D19-1353
Cite (ACL): Constantine Lignos, Daniel Cohen, Yen-Chieh Lien, Pratik Mehta, W. Bruce Croft, and Scott Miller. 2019. The Challenges of Optimizing Machine Translation for Low Resource Cross-Language Information Retrieval. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3497–3502, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal): The Challenges of Optimizing Machine Translation for Low Resource Cross-Language Information Retrieval (Lignos et al., EMNLP-IJCNLP 2019)
PDF: https://aclanthology.org/D19-1353.pdf
Attachment: D19-1353.Attachment.zip
