DiHuTra: a Parallel Corpus to Analyse Differences between Human Translations

Ekaterina Lapshinova-Koltunski, Maja Popović, Maarit Koponen


Abstract
This paper describes a new corpus of human translations that contains both professional and student translations. The data consists of English source texts (news and reviews) and their translations into Russian and Croatian, together with a subcorpus containing translations of the review texts into Finnish. All target languages are mid-resourced and have so far received little or only moderate research attention. The corpus will be valuable for studying variation in translation, as it allows a direct comparison between human translations of the same source texts. It will also be a valuable resource for evaluating machine translation systems. We believe that this resource will facilitate the understanding and improvement of quality issues in both human and machine translation. In the paper, we describe how the data was collected, provide information on the translator groups, and summarise the differences between the human translations at hand, based on our preliminary results with shallow features.
Anthology ID:
2022.lrec-1.186
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1751–1760
URL:
https://aclanthology.org/2022.lrec-1.186
Cite (ACL):
Ekaterina Lapshinova-Koltunski, Maja Popović, and Maarit Koponen. 2022. DiHuTra: a Parallel Corpus to Analyse Differences between Human Translations. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1751–1760, Marseille, France. European Language Resources Association.
Cite (Informal):
DiHuTra: a Parallel Corpus to Analyse Differences between Human Translations (Lapshinova-Koltunski et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.186.pdf
Code:
katjakaterina/dihutra