PE2rr Corpus: Manual Error Annotation of Automatically Pre-annotated MT Post-edits

Maja Popović, Mihael Arčan


Abstract
We present a freely available corpus containing source language texts from different domains along with their automatically generated translations into several distinct morphologically rich languages, their post-edited versions, and error annotations of the performed post-edit operations. We believe that the corpus will be useful for many different applications. The main advantage of the approach used to create the corpus is the fusion of post-editing and error classification, which have usually been treated as two independent tasks although they are naturally related. We also show the benefits of coupling automatic and manual error classification, which facilitates the complex manual error annotation task as well as the development of automatic error classification tools. In addition, the approach facilitates the annotation of language-pair-related issues.
Anthology ID:
L16-1005
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
27–32
URL:
https://aclanthology.org/L16-1005
Cite (ACL):
Maja Popović and Mihael Arčan. 2016. PE2rr Corpus: Manual Error Annotation of Automatically Pre-annotated MT Post-edits. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 27–32, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
PE2rr Corpus: Manual Error Annotation of Automatically Pre-annotated MT Post-edits (Popović & Arčan, LREC 2016)
PDF:
https://aclanthology.org/L16-1005.pdf