Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation

Márton Miháltz; Gábor Pohl

Abstract

In this paper we present an experiment to automatically generate annotated training corpora for a supervised word sense disambiguation module operating in an English-Hungarian and a Hungarian-English machine translation system. Training examples for the WSD module of the MT system are produced by annotating ambiguous lexical items in the source language (words having several possible translations) with their proper target language translations. Since manually annotating training examples is very costly, we are experimenting with a method to produce examples automatically from parallel corpora. Our algorithm relies on monolingual and bilingual lexicons and dictionaries in addition to statistical methods in order to annotate examples extracted from a large English-Hungarian parallel corpus accurately aligned at sentence level. In the paper, we present an experiment with the English noun "state", where we categorized the different occurrences in the Hunglish parallel corpus. For this noun, most of the examples were covered by multiword lexical items originating from our lexical sources.
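The general idea of labelling an ambiguous source word with the translation found in its aligned target sentence can be illustrated with a minimal sketch. This is not the authors' algorithm: the toy lexicon, the function names, and the crude prefix match standing in for Hungarian morphological analysis are all assumptions made for illustration, and the statistical methods and multiword handling mentioned in the abstract are left out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: English noun -> candidate Hungarian translation stems.
LEXICON: Dict[str, List[str]] = {
    "state": ["állam", "állapot"],
}


def label_occurrence(
    source_word: str, target_sentence: str, lexicon: Dict[str, List[str]]
) -> Optional[str]:
    """Return the single candidate translation found in the target sentence.

    Assumption: a token "contains" a translation if it starts with the stem,
    a rough stand-in for proper handling of Hungarian suffixes.
    Returns None when zero or several candidates match (ambiguous evidence).
    """
    candidates = lexicon.get(source_word, [])
    tokens = target_sentence.lower().split()
    found = [c for c in candidates if any(t.startswith(c) for t in tokens)]
    return found[0] if len(found) == 1 else None


def build_training_examples(
    pairs: List[Tuple[str, str]], source_word: str, lexicon: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Collect (source sentence, translation label) pairs for one ambiguous word."""
    examples = []
    for source_sentence, target_sentence in pairs:
        # Skip sentence pairs where the ambiguous word does not occur.
        if source_word not in source_sentence.lower().split():
            continue
        label = label_occurrence(source_word, target_sentence, lexicon)
        if label is not None:
            examples.append((source_sentence, label))
    return examples


# Invented sentence pairs, used only to exercise the sketch.
PAIRS = [
    ("The state collects taxes", "Az állam adót szed"),
    ("The patient is in a stable state", "A beteg állapota stabil"),
    ("The state of the state", "Az állam állapota"),
    ("Nothing relevant here", "Semmi"),
]

EXAMPLES = build_training_examples(PAIRS, "state", LEXICON)
```

Sentence pairs in which no candidate or more than one candidate appears are discarded rather than guessed, which trades coverage for label precision.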

Anthology ID: L06-1402
Volume: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Month: May
Year: 2006
Address: Genoa, Italy
Editors: Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, Daniel Tapias
Venue: LREC
Publisher: European Language Resources Association (ELRA)
URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/649_pdf.pdf
Cite (ACL): Márton Miháltz and Gábor Pohl. 2006. Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
Cite (Informal): Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation (Miháltz & Pohl, LREC 2006)
