Large Language Models Are State-of-the-Art Evaluators of Translation Quality

Tom Kocmi; Christian Federmann

Abstract

We describe GEMBA, a GPT-based metric for assessing translation quality that works both with and without a reference translation. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes that differ in whether a reference is available. We investigate seven versions of GPT models, including ChatGPT. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Compared with results from the WMT22 Metrics shared task, our method achieves state-of-the-art accuracy in both modes when evaluated against MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.
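
As a rough illustration of the two modes mentioned in the abstract, the sketch below builds a zero-shot scoring prompt with or without a reference and parses a numeric score from a model reply. The template wording, the 0 to 100 scale, the function names, and the averaging of segment scores into a system score are all assumptions made for this example; they are not the prompt templates or code released with the paper, and no model API is called.

```python
import re
from typing import List, Optional

# Hypothetical template wording; NOT the prompts released by the authors.
TEMPLATE_HEAD = (
    "Score the following translation from {src_lang} to {tgt_lang} "
    "on a scale from 0 to 100, where 0 means no meaning is preserved "
    "and 100 means a perfect translation.\n\n"
    '{src_lang} source: "{source}"\n'
)
REFERENCE_LINE = '{tgt_lang} human reference: "{reference}"\n'
TEMPLATE_TAIL = '{tgt_lang} translation: "{hypothesis}"\nScore: '


def build_prompt(src_lang: str, tgt_lang: str, source: str,
                 hypothesis: str, reference: Optional[str] = None) -> str:
    """Build a zero-shot prompt; the reference line is added only if given."""
    parts = [TEMPLATE_HEAD]
    if reference is not None:
        parts.append(REFERENCE_LINE)  # reference-based mode
    parts.append(TEMPLATE_TAIL)       # shared by both modes
    return "".join(parts).format(
        src_lang=src_lang, tgt_lang=tgt_lang, source=source,
        hypothesis=hypothesis, reference=reference,
    )


def parse_score(reply: str) -> Optional[float]:
    """Return the first number in the reply if it lies in [0, 100], else None."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 100.0 else None


def system_score(segment_scores: List[Optional[float]]) -> Optional[float]:
    """Average the valid segment scores (assumed aggregation for this sketch)."""
    valid = [s for s in segment_scores if s is not None]
    return sum(valid) / len(valid) if valid else None


if __name__ == "__main__":
    # Reference-free mode: only source and hypothesis are shown to the model.
    print(build_prompt("English", "German", "Good morning.", "Guten Morgen."))
    # A reply such as "Score: 85" would be parsed to 85.0.
    print(parse_score("Score: 85"))
```

Sending the prompt to a model is left out on purpose, since the client library and model name would depend on the setup. Replies that contain no usable number are mapped to `None` so that they can be handled separately rather than silently counted as a score.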

Anthology ID: 2023.eamt-1.19
Volume: Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Month: June
Year: 2023
Address: Tampere, Finland
Editors: Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra Escartín, Mikel Forcada, Maja Popovic, Carolina Scarton, Helena Moniz
Venue: EAMT
Publisher: European Association for Machine Translation
Pages: 193–203
URL: https://aclanthology.org/2023.eamt-1.19/
Cite (ACL): Tom Kocmi and Christian Federmann. 2023. Large Language Models Are State-of-the-Art Evaluators of Translation Quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193–203, Tampere, Finland. European Association for Machine Translation.
Cite (Informal): Large Language Models Are State-of-the-Art Evaluators of Translation Quality (Kocmi & Federmann, EAMT 2023)
PDF: https://aclanthology.org/2023.eamt-1.19.pdf
