Automatic Discrimination of Human and Neural Machine Translation: A Study with Multiple Pre-Trained Models and Longer Context

Tobias van der Werff, Rik van Noord, Antonio Toral


Abstract
We address the task of automatically distinguishing between human-translated (HT) and machine-translated (MT) texts. Following recent work, we fine-tune pre-trained language models (LMs) to perform this task. Our work differs in that we use state-of-the-art pre-trained LMs, and in that we use the test sets of the WMT news shared tasks as training data, to ensure the sentences were not seen during training of the MT system itself. Moreover, we analyse performance for a number of different experimental setups, such as adding translationese data, going beyond the sentence level, and normalizing punctuation. We show that (i) choosing a state-of-the-art LM can make quite a difference: our best baseline system (DeBERTa) outperforms both BERT and RoBERTa by over 3% accuracy, (ii) adding translationese data is only beneficial if there is not much data available, (iii) considerable improvements can be obtained by classifying at the document level, and (iv) normalizing punctuation, and thus avoiding (some) shortcuts, has no impact on model performance.
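Finding (iv) concerns punctuation normalization, which removes surface cues (for instance curly versus straight quotes) that a classifier might otherwise use as a shortcut. The abstract does not state which normalization rules were applied, so the Python sketch below is only an illustration under assumed rules: the `PUNCT_MAP` table and the `normalize_punctuation` function are hypothetical names, not taken from the paper or its code repository.

```python
# Illustrative sketch only: the mapping below is an assumption, not the
# normalization rule set used by the authors.
PUNCT_MAP = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u00ab": '"',    # left-pointing guillemet
    "\u00bb": '"',    # right-pointing guillemet
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# Build the translation table once; str.translate does the per-character mapping.
_TABLE = str.maketrans(PUNCT_MAP)


def normalize_punctuation(text: str) -> str:
    """Map typographic punctuation variants to plain ASCII equivalents."""
    return text.translate(_TABLE)
```

Under this assumed setup, the same function would be applied to both HT and MT sentences before they reach the classifier, so that any remaining differences between the two classes are not due to these punctuation variants.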
Anthology ID:
2022.eamt-1.19
Volume:
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
Month:
June
Year:
2022
Address:
Ghent, Belgium
Editors:
Helena Moniz, Lieve Macken, Andrew Rufener, Loïc Barrault, Marta R. Costa-jussà, Christophe Declercq, Maarit Koponen, Ellie Kemp, Spyridon Pilos, Mikel L. Forcada, Carolina Scarton, Joachim Van den Bogaert, Joke Daems, Arda Tezcan, Bram Vanroy, Margot Fonteyne
Venue:
EAMT
Publisher:
European Association for Machine Translation
Pages:
161–170
URL:
https://aclanthology.org/2022.eamt-1.19
Cite (ACL):
Tobias van der Werff, Rik van Noord, and Antonio Toral. 2022. Automatic Discrimination of Human and Neural Machine Translation: A Study with Multiple Pre-Trained Models and Longer Context. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 161–170, Ghent, Belgium. European Association for Machine Translation.
Cite (Informal):
Automatic Discrimination of Human and Neural Machine Translation: A Study with Multiple Pre-Trained Models and Longer Context (van der Werff et al., EAMT 2022)
PDF:
https://aclanthology.org/2022.eamt-1.19.pdf
Code:
tobiasvanderwerff/HT-vs-MT