Data Issues in English-to-Hindi Machine Translation

Ondřej Bojar; Pavel Straňák; Daniel Zeman

Data Issues in English-to-Hindi Machine Translation

Ondřej Bojar, Pavel Straňák, Daniel Zeman

Abstract

Statistical machine translation to morphologically richer languages is a challenging task and more so if the source and target languages differ in word order. Current state-of-the-art MT systems thus deliver mediocre results. Adding more parallel data often helps improve the results; if it doesn't, it may be caused by various problems such as different domains, bad alignment or noise in the new data. In this paper we evaluate the English-to-Hindi MT task from this data perspective. We discuss several available parallel data sources and provide cross-evaluation results on their combinations using two freely available statistical MT systems. We demonstrate various problems encountered in the data and describe automatic methods of data cleaning and normalization. We also show that the contents of two independently distributed data sets can unexpectedly overlap, which negatively affects translation quality. Together with the error analysis, we also present a new tool for viewing aligned corpora, which makes it easier to detect difficult parts in the data even for a developer not speaking the target language.

Anthology ID:: L10-1524
Volume:: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:: May
Year:: 2010
Address:: Valletta, Malta
Editors:: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:: LREC
SIG:
Publisher:: European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:: http://www.lrec-conf.org/proceedings/lrec2010/pdf/756_Paper.pdf
DOI:
Bibkey:
Cite (ACL):: Ondřej Bojar, Pavel Straňák, and Daniel Zeman. 2010. Data Issues in English-to-Hindi Machine Translation. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):: Data Issues in English-to-Hindi Machine Translation (Bojar et al., LREC 2010)
Copy Citation:
PDF:: http://www.lrec-conf.org/proceedings/lrec2010/pdf/756_Paper.pdf

PDF Cite Search Fix data