Corpus Variations for Translation Lexicon Induction

Rebecca Hwa; Carol Nichols; Khalil Sima’an

Abstract

Lexical mappings (word translations) between languages are an invaluable resource for multilingual processing. While the problem of extracting lexical mappings from parallel corpora is well-studied, the task is more challenging when the language samples are from non-parallel corpora. The goal of this work is to investigate one such scenario: finding lexical mappings between dialects of a diglossic language, in which people conduct their written communications in a prestigious formal dialect, but they communicate verbally in a colloquial dialect. Because the two dialects serve different socio-linguistic functions, parallel corpora do not naturally exist between them. An example of a diglossic dialect pair is Modern Standard Arabic (MSA) and Levantine Arabic. In this paper, we evaluate the applicability of a standard algorithm for inducing lexical mappings between comparable corpora (Rapp, 1999) to such diglossic corpora pairs. The focus of the paper is an in-depth error analysis, exploring the notion of relatedness in diglossic corpora and scrutinizing the effects of various dimensions of relatedness (such as mode, topic, style, and statistics) on the quality of the resulting translation lexicon.

Anthology ID: 2006.amta-papers.9
Volume: Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers
Month: August 8-12
Year: 2006
Address: Cambridge, Massachusetts, USA
Venue: AMTA
Publisher: Association for Machine Translation in the Americas
Pages: 74–81
URL: https://aclanthology.org/2006.amta-papers.9/
Cite (ACL): Rebecca Hwa, Carol Nichols, and Khalil Sima’an. 2006. Corpus Variations for Translation Lexicon Induction. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, pages 74–81, Cambridge, Massachusetts, USA. Association for Machine Translation in the Americas.
Cite (Informal): Corpus Variations for Translation Lexicon Induction (Hwa et al., AMTA 2006)
PDF: https://aclanthology.org/2006.amta-papers.9.pdf
