Same domain different discourse style - A case study on Language Resources for data-driven Machine Translation

Monica Gavrila, Walther v. Hahn, Cristina Vertan


Abstract
Data-driven machine translation (MT) approaches have become very popular in recent years, especially for language pairs for which it is difficult to find specialists to develop transfer rules. Statistical (SMT) or example-based (EBMT) systems can provide reasonable translation quality for assimilation purposes, as long as a large amount of training data is available. SMT systems in particular rely on parallel aligned corpora, which have to be statistically relevant for the given language pair. The construction of large domain-specific parallel corpora is time-consuming and costly; current practice relies on one or two such large corpora per language pair. Recently developed strategies ensure a certain portability to other domains through specialized lexicons or small domain-specific corpora. In this paper we discuss the influence of different discourse styles on statistical machine translation systems. We investigate how a pure SMT system performs when training and test data belong to the same domain but the discourse style varies.
Anthology ID:
L12-1596
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3441–3446
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/1003_Paper.pdf
Cite (ACL):
Monica Gavrila, Walther v. Hahn, and Cristina Vertan. 2012. Same domain different discourse style - A case study on Language Resources for data-driven Machine Translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3441–3446, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Same domain different discourse style - A case study on Language Resources for data-driven Machine Translation (Gavrila et al., LREC 2012)