Building a Corpus of Errors and Quality in Machine Translation: Experiments on Error Impact

Ângela Costa, Rui Correia, Luísa Coheur


Abstract
In this paper we describe a corpus of automatic translations annotated with both error type and quality. The 300 sentences that we selected were generated by Google Translate, Systran, and two in-house Machine Translation systems based on Moses technology. The errors present in the translations were annotated with an error taxonomy that divides errors into five main linguistic categories (Orthography, Lexis, Grammar, Semantics, and Discourse), reflecting the language level at which each error is located. After the error annotation process, we assessed the translation quality of each sentence using a comprehension scale from 1 to 5. Both the error and quality annotation tasks were performed by two different annotators, achieving good levels of inter-annotator agreement. This corpus then served as training data for a translation quality classifier. We drew conclusions about error severity by observing the outputs of two machine learning classifiers: a decision tree and a regression model.
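The classifier setup described above can be sketched as follows. This is a minimal illustrative example, not the authors' actual experiment: the feature layout (one error count per linguistic category), the toy data, and the choice of scikit-learn's DecisionTreeClassifier are all assumptions made for the sketch.

```python
# Hypothetical sketch: predicting sentence-level quality (1-5) from
# per-category error counts. The data below is invented for illustration
# and is NOT taken from the corpus described in the paper.
from sklearn.tree import DecisionTreeClassifier

# Each row: error counts for one translated sentence, in the order
# [Orthography, Lexis, Grammar, Semantics, Discourse]
X = [
    [0, 0, 0, 0, 0],  # clean translation
    [0, 1, 0, 0, 0],
    [1, 0, 2, 0, 0],
    [0, 2, 1, 1, 0],
    [2, 1, 3, 2, 1],  # heavily errorful translation
]
y = [5, 4, 3, 2, 1]  # quality labels on the 1-5 scale

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# A sentence with no annotated errors should receive a high quality score.
print(clf.predict([[0, 0, 0, 0, 0]])[0])
```

Inspecting the fitted tree (e.g. via `sklearn.tree.export_text`) shows which error categories drive the splits, which is one way such a model can surface error severity.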
Anthology ID:
L16-1044
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
288–292
URL:
https://aclanthology.org/L16-1044
Cite (ACL):
Ângela Costa, Rui Correia, and Luísa Coheur. 2016. Building a Corpus of Errors and Quality in Machine Translation: Experiments on Error Impact. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 288–292, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Building a Corpus of Errors and Quality in Machine Translation: Experiments on Error Impact (Costa et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1044.pdf