Translation of Multiword Expressions Using Parallel Suffix Arrays

Paul McNamee, James Mayfield


Abstract
Accurately translating multiword expressions is important for good performance in machine translation, cross-language information retrieval, and other multilingual tasks in human language technology. Existing approaches to inducing translation equivalents of multiword units have focused on agglomerating individual words or on aligning words in a statistical machine translation system. We present a different approach based on information-theoretic heuristics and the exact counting of frequencies of occurrence of multiword strings in aligned parallel corpora. We apply a technique introduced by Yamamoto and Church that uses suffix arrays and longest common prefix arrays. We evaluated the method on multiple language pairs, using bilingual lexicons of domain-specific terminology as a gold standard. We found that performance of 50-70%, as measured by mean reciprocal rank, can be obtained for terms that occur more than about 10 times.
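The counting step and the evaluation metric named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it works on a single tokenized text instead of aligned parallel corpora, uses naive construction that is only suitable for small inputs, and all function names (`build_suffix_array`, `build_lcp`, `count_phrase`, `mean_reciprocal_rank`) are hypothetical names chosen for illustration.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted by token sequence.

    Naive construction (sorting list slices); fine for a small illustration only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def build_lcp(tokens, sa):
    """lcp[k] is the number of leading tokens shared by suffixes sa[k-1] and sa[k]."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while (a + n < len(tokens) and b + n < len(tokens)
               and tokens[a + n] == tokens[b + n]):
            n += 1
        lcp[k] = n
    return lcp


def count_phrase(tokens, sa, phrase):
    """Exact frequency of a multiword phrase (a list of tokens).

    Suffixes that begin with the phrase form one contiguous run in the
    suffix array, so two binary searches give the count.
    """
    m = len(phrase)
    # Truncating sorted suffixes to m tokens keeps them in sorted order.
    prefixes = [tokens[i:i + m] for i in sa]
    return bisect_right(prefixes, phrase) - bisect_left(prefixes, phrase)


def mean_reciprocal_rank(ranked_candidates, gold):
    """Average of 1/rank of the gold answer; a missing answer contributes 0."""
    total = 0.0
    for candidates, answer in zip(ranked_candidates, gold):
        if answer in candidates:
            total += 1.0 / (candidates.index(answer) + 1)
    return total / len(gold)


if __name__ == "__main__":
    # Toy text, invented for this example.
    text = ("the european union and the european parliament "
            "met the european union").split()
    sa = build_suffix_array(text)
    print(count_phrase(text, sa, ["the", "european"]))
    print(count_phrase(text, sa, ["european", "union"]))
```

Under these assumptions, a phrase count is the width of a run of adjacent suffix-array entries, and the longest common prefix array records how many tokens neighboring entries share, which is the information needed to group suffixes by shared multiword prefixes without enumerating every substring.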
Anthology ID:
2006.amta-papers.12
Volume:
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers
Month:
August 8-12
Year:
2006
Address:
Cambridge, Massachusetts, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
100–109
URL:
https://aclanthology.org/2006.amta-papers.12
Cite (ACL):
Paul McNamee and James Mayfield. 2006. Translation of Multiword Expressions Using Parallel Suffix Arrays. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 100–109, Cambridge, Massachusetts, USA. Association for Machine Translation in the Americas.
Cite (Informal):
Translation of Multiword Expressions Using Parallel Suffix Arrays (McNamee & Mayfield, AMTA 2006)
PDF:
https://aclanthology.org/2006.amta-papers.12.pdf