Kirill Semenov


2023

Findings of the WMT 2023 Shared Task on Machine Translation with Terminologies
Kirill Semenov | Vilém Zouhar | Tom Kocmi | Dongdong Zhang | Wangchunshu Zhou | Yuchen Eleanor Jiang
Proceedings of the Eighth Conference on Machine Translation

The WMT 2023 Terminology Shared Task investigates progress in machine translation of texts with specialized vocabulary. The participants were given the source text and segment-level terminology dictionaries for three language pairs: Chinese→English, English→Czech, and German→English. We evaluate 21 submissions from 7 teams on two main criteria: general translation quality and the effectiveness of translating specialized terminology. Systems took varied approaches: incorporating terminology at inference time, or weakly supervised training that uses terminology access. While incorporating terminology dictionaries leads to improvement in translation quality, incorporating an equal amount of information from the reference leads to similar results. This challenges the position that terminologies are the crux of meaning in translation, though it can also be explained by inadequate metrics that are not terminology-centric.

2022

Automated Evaluation Metric for Terminology Consistency in MT
Kirill Semenov | Ondřej Bojar
Proceedings of the Seventh Conference on Machine Translation (WMT)

The most widely used metrics for machine translation tackle sentence-level evaluation. However, at least for professional domains such as legal texts, it is crucial to measure the consistency of the translation of the terms throughout the whole text. This paper introduces an automated metric for evaluating term consistency in machine translation (MT). To demonstrate the metric’s performance, we used the Czech-to-English translated texts from the ELITR 2021 agreement corpus and the outputs of the MT systems that took part in the WMT21 News Task. We show different modes of our evaluation algorithm and try to interpret the differences in the ranking of the translation systems based on sentence-level metrics and our approach. We also demonstrate that the proposed metric’s scores differ significantly from the scores of widespread automated metrics and correlate with human assessment.