Tom Kocmi

2025

Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgment. However, these results are often obtained by averaging predictions across large test sets without any insights into the strengths and weaknesses of these metrics across different error types. Challenge sets are used to probe specific dimensions of metric behavior but there are very few such datasets and they either focus on a limited number of phenomena or a limited number of language pairs. We introduce ACES, a contrastive challenge set spanning 146 language pairs, aimed at discovering whether metrics can identify 68 translation accuracy errors. These phenomena range from basic alterations at the word/character level to more intricate errors based on discourse and real-world knowledge. We conducted a large-scale study by benchmarking ACES on 47 metrics submitted to the WMT 2022 and WMT 2023 metrics shared tasks. We also measure their sensitivity to a range of linguistic phenomena. We further investigate claims that large language models (LLMs) are effective as MT evaluators, addressing the limitations of previous studies by using a dataset that covers a range of linguistic phenomena and language pairs and includes both low- and medium-resource languages. Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods are unreliable. We expose a number of major flaws with existing methods: Most metrics ignore the source sentence; metrics tend to prefer surface level overlap; and over-reliance on language-agnostic representations leads to confusion when the target language is similar to the source language. To further encourage detailed evaluation beyond singular scores, we expand ACES to include error span annotations, denoted as SPAN-ACES, and we use this dataset to evaluate span-based error metrics, showing that these metrics also need considerable improvement. 
Based on our observations, we provide a set of recommendations for building better MT metrics, including focusing on error labels instead of scores, ensembling, designing metrics to explicitly focus on the source sentence, focusing on semantic content rather than relying on the lexical overlap, and choosing the right pre-trained model for obtaining representations.

Estimating Machine Translation Difficulty
Lorenzo Proietti | Stefano Perrella | Vilém Zouhar | Roberto Navigli | Tom Kocmi
Findings of the Association for Computational Linguistics: EMNLP 2025

Machine translation quality has steadily improved over the years, achieving near-perfect translations in recent benchmarks. These high-quality outputs make it difficult to distinguish between state-of-the-art models and to identify areas for future improvement. In this context, automatically identifying texts where machine translation systems struggle holds promise for developing more discriminative evaluations and guiding future research. In this work, we address this gap by formalizing the task of translation difficulty estimation, defining a text’s difficulty based on the expected quality of its translations. We introduce a new metric to evaluate difficulty estimators and use it to assess both baselines and novel approaches. Finally, we demonstrate the practical utility of difficulty estimators by using them to construct more challenging benchmarks for machine translation. Our results show that dedicated models outperform both heuristic-based methods and LLM-as-a-judge approaches, with sentinel-src achieving the best performance. Thus, we release two improved models for difficulty estimation, sentinel-src-24 and sentinel-src-25, which can be used to scan large collections of texts and select those most likely to challenge contemporary machine translation systems.

AI-Assisted Human Evaluation of Machine Translation
Vilém Zouhar | Tom Kocmi | Mrinmaya Sachan
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Proceedings of the Tenth Conference on Machine Translation
Barry Haddow | Tom Kocmi | Philipp Koehn | Christof Monz
Proceedings of the Tenth Conference on Machine Translation

This paper presents the results of the General Machine Translation Task organized as part of the 2025 Conference on Machine Translation (WMT). Participants were invited to build systems for any of 30 language pairs. For half of these pairs, we conducted a human evaluation on test sets spanning four to five different domains. We evaluated 60 systems in total: 36 submitted by participants and 24 for which we collected translations from large language models (LLMs) and popular online translation providers. This year, we focused on creating challenging test sets by developing a difficulty sampling technique and using more complex source data. We evaluated system outputs with professional annotators using the Error Span Annotation (ESA) protocol, except for two language pairs, for which we used Multidimensional Quality Metrics (MQM) instead. We continued the trend of increasingly moving towards document-level translation, providing the source texts as whole documents containing multiple paragraphs.

The WMT25 Multilingual Instruction Shared Task (MIST) introduces a benchmark to evaluate large language models (LLMs) across 30 languages. The benchmark covers five types of problems: machine translation, linguistic reasoning, open-ended generation, cross-lingual summarization, and LLM-as-a-judge. We provide automatic evaluation and collect human annotations, which highlight the limitations of automatic evaluation and allow further research into metric meta-evaluation. We run a diverse set of open- and closed-weight LLMs on our benchmark, providing a broad assessment of the multilingual capabilities of current LLMs. Results highlight substantial variation across sub-tasks and languages, revealing persistent challenges in reasoning, cross-lingual generation, and evaluation reliability. This work establishes a standardized framework for measuring future progress in multilingual LLM development.

The WMT25 Shared Task on Automated Translation Evaluation Systems evaluates metrics and quality estimation systems that assess the quality of language translation systems. This task unifies and consolidates the separate WMT shared tasks on Machine Translation Evaluation Metrics and Quality Estimation from previous years. Our primary goal is to encourage the development and assessment of new state-of-the-art translation quality evaluation systems. The shared task this year consisted of three subtasks: (1) segment-level quality score prediction, (2) span-level translation error annotation, and (3) quality-informed segment-level error correction. The evaluation data for the shared task were provided by the General MT shared task and were complemented by “challenge sets” from both the organizers and participants. Task 1 results indicate the strong performance of large LLMs at the system level, while reference-based baseline metrics outperform LLMs at the segment level. Task 2 results indicate that accurate error detection and balancing precision and recall are persistent challenges. Task 3 results show that minimal editing is challenging even when informed by quality indicators. Robustness across the broad diversity of languages remains a major challenge across all three subtasks.

We present Command A Translate, an LLM-based machine translation model built off Cohere’s Command A. It reaches state-of-the-art machine translation quality via direct preference optimization. Our meticulously designed data preparation pipeline emphasizes robust quality control and a novel difficulty filtering – a key innovation that distinguishes Command A Translate. Furthermore, we extend our model and participate at WMT with a system (CommandA-WMT) that uses two models and post-editing steps of step-by-step reasoning and limited Minimum Bayes Risk decoding.

2024

Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
Tom Kocmi | Vilém Zouhar | Christian Federmann | Matt Post
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the “dynamic range” of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask “what point difference x in metric y is required between two systems for humans to notice?”. We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values with regard to test set size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.
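Pairwise system accuracy, the measure named in this abstract, is commonly understood as the share of system pairs that a metric orders the same way as human judgement. The following is a minimal sketch under that assumption, not the authors' code; the function name, the system names, and all scores are hypothetical.

```python
from itertools import combinations


def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of system pairs that the metric orders the same way as humans.

    Both arguments map a system name to its system-level score.
    Pairs tied under either scoring count as disagreements in this sketch.
    """
    agree = 0
    total = 0
    for a, b in combinations(sorted(metric_scores), 2):
        metric_delta = metric_scores[a] - metric_scores[b]
        human_delta = human_scores[a] - human_scores[b]
        total += 1
        # Same sign of both deltas means the metric agrees with the humans.
        if metric_delta * human_delta > 0:
            agree += 1
    return agree / total if total else 0.0


# Hypothetical scores for three systems.
metric = {"sysA": 0.82, "sysB": 0.79, "sysC": 0.75}
human = {"sysA": 71.0, "sysB": 73.5, "sysC": 60.2}

# The metric and the humans disagree only on the sysA/sysB pair.
print(pairwise_accuracy(metric, human))
```

Restricting the loop to pairs whose metric delta falls in a given range would give an accuracy per delta bucket, which is one plausible way to relate score differences to human-noticeable differences.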

Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that utilizing LLMs for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we conduct an investigation into several prompting designs, and propose a new prompting method called Error Analysis Prompting (EAPrompt) by combining Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2023). This technique emulates the commonly accepted human evaluation framework, Multidimensional Quality Metrics (MQM; Freitag et al., 2021), and produces explainable and reliable MT evaluations at both the system and segment level. Experimental results from the WMT22 metrics shared task validate the effectiveness of EAPrompt on various LLMs with different structures. Further analysis confirms that EAPrompt effectively distinguishes major errors from minor ones, while also sharing a similar distribution of the number of errors with MQM. These findings highlight the potential of EAPrompt as a human-like evaluator prompting technique for MT evaluation. We will release our code and scripts to facilitate the community.

Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References
Tianyi Tang | Hongyuan Lu | Yuchen Jiang | Haoyang Huang | Dongdong Zhang | Xin Zhao | Tom Kocmi | Furu Wei
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Most research about natural language generation (NLG) relies on evaluation benchmarks with limited references for a sample, which may result in poor correlations with human judgements. The underlying reason is that one semantic meaning can actually be expressed in different forms, and the evaluation with a single or few references may not accurately reflect the quality of the model’s hypotheses. To address this issue, this paper presents a simple and effective method, named **Div-Ref**, to enhance existing evaluation benchmarks by enriching the number of references. We leverage large language models (LLMs) to diversify the expression of a single reference into multiple high-quality ones to cover the semantic space of the reference sentence as much as possible. We conduct comprehensive experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation. This idea is compatible with recent LLM-based evaluation which can similarly derive advantages from incorporating multiple references. *We strongly encourage future generation benchmarks to include more references, even if they are generated by LLMs, since this is a one-time effort.* We release all the code and data at https://github.com/RUCAIBox/Div-Ref to facilitate research.

SLIDE: Reference-free Evaluation for Machine Translation using a Sliding Document Window
Vikas Raunak | Tom Kocmi | Matt Post
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Reference-based metrics that operate at the sentence-level typically outperform quality estimation metrics, which have access only to the source and system output. This is unsurprising, since references resolve ambiguities that may be present in the source. In this paper, we investigate whether additional source context can effectively substitute for a reference. We present a metric named SLIDE (SLIding Document Evaluator), which operates on blocks of sentences. SLIDE leverages a moving window that slides over each document in the test set, feeding each chunk of sentences into an unmodified, off-the-shelf quality estimation model. We find that SLIDE obtains significantly higher pairwise system accuracy than its sentence-level baseline, in some cases even eliminating the gap with reference-based metrics. This suggests that source context may provide the same information as a human reference in disambiguating source ambiguities. This finding is especially pertinent for reference-free document-level evaluation, wherein SLIDE could provide higher-quality pairwise system assessments while only requiring document boundary annotations.
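The moving-window idea in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `sliding_chunks`, `score_document`, and `length_ratio_qe`, the window and stride values, and the averaging of chunk scores are all hypothetical. The toy scorer only stands in for the off-the-shelf quality estimation model, which the abstract does not name.

```python
def sliding_chunks(sentences, window=3, stride=1):
    """Join consecutive sentences into overlapping chunks of `window` sentences.

    Documents no longer than the window become a single chunk. With a stride
    above 1, trailing sentences that do not fill a window are dropped here.
    """
    if len(sentences) <= window:
        return [" ".join(sentences)]
    last_start = len(sentences) - window
    return [" ".join(sentences[i:i + window])
            for i in range(0, last_start + 1, stride)]


def score_document(src_sents, hyp_sents, qe_model, window=3, stride=1):
    """Score aligned source/hypothesis chunks and average the chunk scores.

    `qe_model` is any callable (source_text, hypothesis_text) -> float.
    Averaging is an assumption made for this sketch.
    """
    src_chunks = sliding_chunks(src_sents, window, stride)
    hyp_chunks = sliding_chunks(hyp_sents, window, stride)
    scores = [qe_model(s, h) for s, h in zip(src_chunks, hyp_chunks)]
    return sum(scores) / len(scores)


def length_ratio_qe(source, hypothesis):
    """Toy stand-in for a real QE model: ratio of shorter to longer length."""
    longer = max(len(source), len(hypothesis))
    return min(len(source), len(hypothesis)) / longer if longer else 1.0


src = ["Er kam an.", "Sie ging.", "Es regnete.", "Dann war es still."]
hyp = ["He arrived.", "She left.", "It rained.", "Then it was quiet."]
print(sliding_chunks(src, window=2))
print(score_document(src, hyp, length_ratio_qe, window=2))
```

Because each chunk is scored as a single unit, only document boundaries are needed to build the chunks; the scoring model itself stays unmodified.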

Proceedings of the Ninth Conference on Machine Translation
Barry Haddow | Tom Kocmi | Philipp Koehn | Christof Monz
Proceedings of the Ninth Conference on Machine Translation

This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).

The WMT24 Metrics Shared Task evaluated the performance of automatic metrics for machine translation (MT), with a major focus on LLM-based translations that were generated as part of the WMT24 General MT Shared Task. As LLMs become increasingly popular in MT, it is crucial to determine whether existing evaluation metrics can accurately assess the output of these systems. To provide a robust benchmark for this evaluation, human assessments were collected using Multidimensional Quality Metrics (MQM), continuing the practice from recent years. Furthermore, building on the success of the previous year, a challenge set subtask was included, requiring participants to design contrastive test suites that specifically target a metric’s ability to identify and penalize different types of translation errors. Finally, the meta-evaluation procedure was refined to better reflect real-world usage of MT metrics, focusing on pairwise accuracy at both the system- and segment-levels. We present an extensive analysis on how well metrics perform on three language pairs: English to Spanish (Latin America), Japanese to Chinese, and English to German. The results strongly confirm the results reported last year, that fine-tuned neural metrics continue to perform well, even when used to evaluate LLM-based translation systems.

High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.

2023

Poor Man’s Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
Vilém Zouhar | Shehzaad Dhuliawala | Wangchunshu Zhou | Nico Daheim | Tom Kocmi | Yuchen Eleanor Jiang | Mrinmaya Sachan
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements, yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME) where one predicts the automated metric scores also without the reference. We show that even without access to the reference, our model can estimate automated metrics (ρ = 60% for BLEU, ρ = 51% for other metrics) at the sentence-level. Because automated metrics correlate with human judgements, we can leverage the ME task for pre-training a QE model. For the QE task, we find that pre-training on TER is better (ρ = 23%) than training from scratch (ρ = 20%).

Large Language Models Are State-of-the-Art Evaluators of Translation Quality
Tom Kocmi | Christian Federmann
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate seven versions of GPT models, including ChatGPT. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Comparing to results from WMT22’s Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.

Overview of the Second Shared Task on Automatic Minuting (AutoMin) at INLG 2023
Tirthankar Ghosal | Ondřej Bojar | Marie Hledíková | Tom Kocmi | Anna Nedoluzhko
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

In this article, we report the findings of the second shared task on Automatic Minuting (AutoMin) held as a Generation Challenge at the 16th International Natural Language Generation (INLG) Conference 2023. The second Automatic Minuting shared task is a successor to the first AutoMin which took place in 2021. The primary objective of the AutoMin shared task is to garner participation of the speech and natural language processing and generation community to create automatic methods for generating minutes from multi-party meetings. Five teams from diverse backgrounds participated in the shared task this year. A lot has changed in the Generative AI landscape since the last AutoMin, especially with the emergence and wide adoption of Large Language Models (LLMs) for different downstream tasks. Most of the contributions are based on some form of an LLM and we are also adding current outputs of GPT-4 as a benchmark. Furthermore, we examine the applicability of GPT-4 for automatic scoring of minutes. Compared to the previous instance of AutoMin, we also add another domain, the minutes for EU Parliament sessions, and we experiment with a more fine-grained manual evaluation. More details on the event can be found at https://ufal.github.io/automin-2023/.

Proceedings of the Eighth Conference on Machine Translation
Philipp Koehn | Barry Haddow | Tom Kocmi | Christof Monz
Proceedings of the Eighth Conference on Machine Translation

This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (corresponding to 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).

This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). Following last year’s success, we also included a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score by weighted averaging across the various tasks. We present an extensive analysis on how well metrics perform on three language pairs: Chinese-English, Hebrew-English on the sentence-level and English-German on the paragraph-level. The results strongly confirm the results reported last year, that neural-based metrics are significantly better than non-neural metrics in their levels of correlation with human judgments. Further, we investigate the impact of bad reference translations on the correlations of metrics with human judgment. We present a novel approach for generating synthetic reference translations based on the collection of MT system outputs and their corresponding MQM ratings, which has the potential to mitigate bad reference issues we observed this year for some language pairs. Finally, we also study the connections between the magnitude of metric differences and their expected significance in human evaluation, which should help the community to better understand and adopt new metrics.

The WMT 2023 Terminology Shared Task investigates progress in machine translation of texts with specialized vocabulary. The participants were given the source text and segment-level terminology dictionaries for three language pairs: Chinese→English, English→Czech, and German→English. We evaluate 21 submissions from 7 teams on two main criteria: general translation quality and the effectiveness of translating specialized terminology. Systems took varied approaches: incorporating terminology at inference time or weakly supervised training that uses terminology access. While incorporating terminology dictionaries leads to improvement in the translation quality, incorporating an equal amount of information from the reference leads to similar results. This challenges the position of terminologies being the crux of meaning in translation; it can also be explained by inadequate metrics which are not terminology-centric.

eBLEU: Unexpectedly Good Machine Translation Evaluation Using Simple Word Embeddings
Muhammad ElNokrashy | Tom Kocmi
Proceedings of the Eighth Conference on Machine Translation

We propose eBLEU, a metric inspired by the BLEU metric that uses embedding similarities instead of string matches. We introduce meaning diffusion vectors to enable matching n-grams of semantically similar words in a BLEU-like algorithm, using efficient, non-contextual word embeddings like fastText. On WMT23 data, eBLEU beats BLEU and ChrF by around 3.8% system-level score, approaching BERTScore at −0.9% absolute difference. In WMT22 scenarios, eBLEU outperforms f101spBLEU and ChrF in MQM by 2.2%−3.6%. Curiously, on MTurk evaluations, eBLEU surpasses past methods by 3.9%−8.2% (f200spBLEU, COMET-22). eBLEU presents an interesting middle-ground between traditional metrics and pretrained metrics.

Cometoid: Distilling Strong Reference-based Machine Translation Metrics into Even Stronger Quality Estimation Metrics
Thamme Gowda | Tom Kocmi | Marcin Junczys-Dowmunt
Proceedings of the Eighth Conference on Machine Translation

This paper describes our submissions to the 2023 Conference on Machine Translation (WMT-23) Metrics shared task. Knowledge distillation is commonly used to create smaller student models that mimic a larger teacher model while reducing the model size and hence inference cost in production. In this work, we apply knowledge distillation to machine translation evaluation metrics and distill existing reference-based teacher metrics into reference-free (quality estimation; QE) student metrics. We mainly focus on students of Unbabel’s COMET22 reference-based metric. When evaluating on the official WMT-22 Metrics evaluation task, our distilled Cometoid QE metrics outperform all other QE metrics on that set while matching or out-performing the reference-based teacher metric. Our metrics never see the human ground-truth scores directly – only the teacher metric was trained on human scores by its original creators. We also distill ChrF sentence-level scores into a neural QE metric and find that our reference-free (and fully human-score-free) student metric ChrFoid outperforms its teacher metric by over 7% pairwise accuracy on the same WMT-22 task, rivaling other existing QE metrics.

GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4
Tom Kocmi | Christian Federmann
Proceedings of the Eighth Conference on Machine Translation

This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect translation quality errors, specifically for the quality estimation setting without the need for human reference translations. Based on the power of large language models (LLM), GEMBA-MQM employs a fixed three-shot prompting technique, querying the GPT-4 model to mark error quality spans. Compared to previous works, our method has language-agnostic prompts, thus avoiding the need for manual prompt preparation for new languages. While preliminary results indicate that GEMBA-MQM achieves state-of-the-art accuracy for system ranking, we advise caution when using it in academic works to demonstrate improvements over other methods due to its dependence on the proprietary, black-box GPT model.

Evaluating Metrics for Document-context Evaluation in Machine Translation
Vikas Raunak | Tom Kocmi | Matt Post
Proceedings of the Eighth Conference on Machine Translation

We describe our submission of a new metric, SLIDE (Raunak et al., 2023), to the WMT 2023 metrics task. SLIDE is a reference-free quality-estimation metric that works by constructing a fixed sentence-length window over the documents in a test set, concatenating chunks and then sending them for scoring as a single unit by COMET (Rei et al., 2022). We find that SLIDE improves dramatically over its context-less counterpart on the two WMT22 evaluation campaigns (MQM and DA+SQM).

2022

NTREX-128 – News Test References for MT Evaluation of 128 Languages
Christian Federmann | Tom Kocmi | Ying Xin
Proceedings of the First Workshop on Scaling Up Multilingual Evaluation

This paper presents the results of the General Machine Translation Task organised as part of the Conference on Machine Translation (WMT) 2022. In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of four different domains. We evaluate system outputs with human annotators using two different techniques: reference-based direct assessment (DA) and a combination of DA and scalar quality metric (DA+SQM).

This paper presents the results of the WMT22 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT22 News Translation Task on four different domains: news, social, ecommerce, and chat. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages, among other things: (i) expert-based evaluation is more reliable, (ii) we extended the pool of translations by 5 additional translations based on MBR decoding or rescoring which are challenging for current metrics. In addition, we initiated a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Finally, we present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. The results demonstrate the superiority of neural-based learned metrics and demonstrate again that overlap metrics like Bleu, spBleu or chrf correlate poorly with human ratings. The results also reveal that neural-based metrics are remarkably robust across different domains and challenges.

Searching for a Higher Power in the Human Evaluation of MT
Johnny Wei | Tom Kocmi | Christian Federmann
Proceedings of the Seventh Conference on Machine Translation (WMT)

In MT evaluation, pairwise comparisons are conducted to identify the better system. In conducting the comparison, the experimenter must allocate a budget to collect Direct Assessment (DA) judgments. We provide a cost effective way to spend the budget, but show that typical budget sizes often do not allow for solid comparison. Taking the perspective that the basis of solid comparison is in achieving statistical significance, we study the power (rate of achieving significance) on a large collection of pairwise DA comparisons. Due to the nature of statistical estimation, power is low for differentiating less than 1-2 DA points, and to achieve a notable increase in power requires at least 2-3x more samples. Applying variance reduction alone will not yield these gains, so we must face the reality of undetectable differences and spending increases. In this context, we propose interim testing, an “early stopping” collection procedure that yields more power per judgment collected, which adaptively focuses the budget on pairs that are borderline significant. Interim testing can achieve up to a 27% efficiency gain when spending 3x the current budget, or 18% savings at the current evaluation power.

MS-COMET: More and Better Human Judgements Improve Metric Performance
Tom Kocmi | Hitokazu Matsushita | Christian Federmann
Proceedings of the Seventh Conference on Machine Translation (WMT)

We develop two new metrics that build on top of the COMET architecture. The main contribution is collecting a ten-times larger corpus of human judgements than COMET and investigating how to filter out problematic human judgements. We propose filtering out human judgements where the human reference is statistically worse than the machine translation. Furthermore, we average the scores of all identical segments that were evaluated multiple times. The results comparing automatic metrics on source-based DA and MQM-style human judgement show state-of-the-art performance on system-level pairwise system ranking. We release both of our metrics for public use.

2021

On User Interfaces for Large-Scale Document-Level Human Evaluation of Machine Translation Outputs
Roman Grundkiewicz | Marcin Junczys-Dowmunt | Christian Federmann | Tom Kocmi
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Recent studies emphasize the need for document context in human evaluation of machine translations, but little research has been done on the impact of user interfaces on annotator productivity and the reliability of assessments. In this work, we compare human assessment data from the last two WMT evaluation campaigns collected via two different methods for document-level evaluation. Our analysis shows that a document-centric approach to evaluation, where the annotator is presented with the entire document context on a screen, leads to higher quality segment and document level assessments. It improves the correlation between segment and document scores and increases inter-annotator agreement for document scores but is considerably more time consuming for annotators.

This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation.

To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation
Tom Kocmi | Christian Federmann | Roman Grundkiewicz | Marcin Junczys-Dowmunt | Hitokazu Matsushita | Arul Menezes
Proceedings of the Sixth Conference on Machine Translation

Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system’s quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metrics correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on – to the best of our knowledge – the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.
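The pairwise ranking accuracy mentioned in this abstract can be sketched as the fraction of system pairs for which a metric and the human judgements agree on which system is better. The snippet below is a minimal illustration under assumptions: the function name `pairwise_accuracy`, the choice to skip pairs tied in the human scores, and all scores are hypothetical and not taken from the paper.

```python
from itertools import combinations


def pairwise_accuracy(human_scores, metric_scores):
    """Share of system pairs where the metric ranks the pair like the humans do.

    Both arguments map a system name to its system-level score.
    """
    systems = sorted(set(human_scores) & set(metric_scores))
    agreements = 0
    total = 0
    for sys_a, sys_b in combinations(systems, 2):
        human_delta = human_scores[sys_a] - human_scores[sys_b]
        metric_delta = metric_scores[sys_a] - metric_scores[sys_b]
        if human_delta == 0:
            continue  # assumption: pairs tied in the human scores are skipped
        total += 1
        # Agreement means both differences have the same sign.
        if human_delta * metric_delta > 0:
            agreements += 1
    return agreements / total if total else float("nan")


# Made-up system-level scores for three systems.
human = {"sysA": 72.1, "sysB": 70.4, "sysC": 65.0}
metric = {"sysA": 0.31, "sysB": 0.33, "sysC": 0.25}
print(pairwise_accuracy(human, metric))
```

In this toy example the metric disagrees with the human ranking only on the pair sysA and sysB, so two of the three pairs are ranked consistently.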

2020

Efficiently Reusing Old Models Across Languages via Transfer Learning
Tom Kocmi | Ondřej Bojar
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Recent progress in neural machine translation (NMT) is directed towards larger neural networks trained on an increasing amount of hardware resources. As a result, NMT models are costly to train, both financially, due to the electricity and hardware cost, and environmentally, due to the carbon footprint. This is especially true in transfer learning because of the additional cost of training the “parent” model before transferring knowledge and training the desired “child” model. In this paper, we propose a simple method of re-using an already trained model for different language pairs where there is no need for modifications in model architecture. Our approach does not need a separate parent model for each investigated language pair, as is typical in NMT transfer learning. To show the applicability of our method, we recycle a Transformer model trained by different researchers and use it to seed models for different language pairs. We achieve better translation quality and shorter convergence times than when training from random initialization.

This paper presents the results of the news translation task and the similar language translation task, both organised alongside the Conference on Machine Translation (WMT) 2020. In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation. In the similar language translation task, participants built machine translation systems for translating between closely related pairs of languages.

CUNI Submission for the Inuktitut Language in WMT News 2020
Tom Kocmi
Proceedings of the Fifth Conference on Machine Translation

This paper describes the CUNI submission to the WMT 2020 News Translation Shared Task for the low-resource scenario Inuktitut–English in both translation directions. Our system combines transfer learning from a Czech–English high-resource language pair and backtranslation. We notice surprising behaviour when using synthetic data, which can possibly be attributed to the narrow domain of the training and test data. We use the Transformer model in a constrained submission.

Gender Coreference and Bias Evaluation at WMT 2020
Tom Kocmi | Tomasz Limisiewicz | Gabriel Stanovsky
Proceedings of the Fifth Conference on Machine Translation

Gender bias in machine translation can manifest when choosing gender inflections based on spurious gender correlations, for example, always translating doctors as men and nurses as women. This can be particularly harmful as models become more popular and are deployed within commercial systems. Our work presents the largest evidence for the phenomenon in more than 19 systems submitted to the WMT over four diverse target languages: Czech, German, Polish, and Russian. To achieve this, we use WinoMT, a recent automatic test suite which examines gender coreference and bias when translating from English to languages with grammatical gender. We extend WinoMT to handle two new languages tested in WMT: Polish and Czech. We find that all systems consistently use spurious correlations in the data rather than meaningful contextual information.

CUNI Systems for the Unsupervised and Very Low Resource Translation Task in WMT20
Ivana Kvapilíková | Tom Kocmi | Ondřej Bojar
Proceedings of the Fifth Conference on Machine Translation

This paper presents a description of CUNI systems submitted to the WMT20 task on unsupervised and very low-resource supervised machine translation between German and Upper Sorbian. We experimented with training on synthetic data and pre-training on a related language pair. In the fully unsupervised scenario, we achieved 25.5 and 23.7 BLEU translating from and into Upper Sorbian, respectively. Our low-resource systems relied on transfer learning from German-Czech parallel data and achieved 57.4 BLEU and 56.1 BLEU, which is an improvement of 10 BLEU points over the baseline trained only on the available small German-Upper Sorbian parallel corpus.

2019

CUNI Submission for Low-Resource Languages in WMT News 2019
Tom Kocmi | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the CUNI submission to the WMT 2019 News Translation Shared Task for the low-resource languages: Gujarati-English and Kazakh-English. We participated in both language pairs in both translation directions. Our system combines transfer learning from a different high-resource language pair followed by training on backtranslated monolingual data. Thanks to the simultaneous training in both directions, we can iterate the backtranslation process. We are using the Transformer model in a constrained submission.

2018

CUNI Basque-to-English Submission in IWSLT18
Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the 15th International Conference on Spoken Language Translation

We present our submission to the IWSLT18 Low Resource task focused on translation from Basque to English. Our submission is based on the current state-of-the-art self-attentive neural network architecture, Transformer. We further improve this strong baseline by exploiting available monolingual data using the back-translation technique. We also present further improvements gained by transfer learning, a technique that trains a model using a high-resource language pair (Czech-English) and then fine-tunes the model using the target low-resource language pair (Basque-English).

SumeCzech: Large Czech News-Based Summarization Dataset
Milan Straka | Nikita Mediankin | Tom Kocmi | Zdeněk Žabokrtský | Vojtěch Hudeček | Jan Hajič
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Trivial Transfer Learning for Low-Resource Neural Machine Translation
Tom Kocmi | Ondřej Bojar
Proceedings of the Third Conference on Machine Translation: Research Papers

Transfer learning has been proven to be an effective technique for neural machine translation under low-resource conditions. Existing methods require a common target language, language relatedness, or specific training tricks and regimes. We present a simple transfer learning method, where we first train a “parent” model for a high-resource language pair and then continue the training on a low-resource pair only by replacing the training corpus. This “child” model performs significantly better than the baseline trained for the low-resource pair only. We are the first to show this for targeting different languages, and we observe the improvements even for unrelated languages with different alphabets.

CUNI Submissions in WMT18
Tom Kocmi | Roman Sudarikov | Ondřej Bojar
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We participated in the WMT 2018 shared news translation task in three language pairs: English-Estonian, English-Finnish, and English-Czech. Our main focus was the low-resource language pair of Estonian and English for which we utilized Finnish parallel data in a simple method. We first train a “parent model” for the high-resource language pair followed by adaptation on the related low-resource language pair. This approach brings a substantial performance boost over the baseline system trained only on Estonian-English parallel data. Our systems are based on the Transformer architecture. For English-to-Czech translation, we evaluated our models from last year, based on the hybrid phrase-based approach and on neural machine translation, mainly for comparison purposes.

CUNI NMT System for WAT 2018 Translation Tasks
Tom Kocmi | Shantipriya Parida | Ondřej Bojar
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

2017

LanideNN: Multilingual Language Identification on Character Window
Tom Kocmi | Ondřej Bojar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In language identification, a common first step in natural language processing, we want to automatically determine the language of some input text. Monolingual language identification assumes that the given document is written in one language. In multilingual language identification, the document is usually in two or three languages and we just want their names. We aim one step further and propose a method for textual language identification where languages can change arbitrarily and the goal is to identify the spans of each of the languages. Our method is based on Bidirectional Recurrent Neural Networks and it performs well in monolingual and multilingual language identification tasks on six datasets covering 131 languages. The method also maintains its accuracy for short documents and across domains, so it is ideal for off-the-shelf use without preparation of training data.

Curriculum Learning and Minibatch Bucketing in Neural Machine Translation
Tom Kocmi | Ondřej Bojar
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

We examine the effects of particular orderings of sentence pairs on the on-line training of neural machine translation (NMT). We focus on two types of such orderings: (1) ensuring that each minibatch contains sentences similar in some aspect and (2) gradual inclusion of some sentence types as the training progresses (so-called “curriculum learning”). In our English-to-Czech experiments, the internal homogeneity of minibatches has no effect on the training, but some of our “curricula” achieve a small improvement over the baseline.

CUNI submission in WMT17: Chimera goes neural
Roman Sudarikov | David Mareček | Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the Second Conference on Machine Translation

Results of the WMT17 Neural MT Training Task
Ondřej Bojar | Jindřich Helcl | Tom Kocmi | Jindřich Libovický | Tomáš Musil
Proceedings of the Second Conference on Machine Translation

CUNI NMT System for WAT 2017 Translation Tasks
Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

The paper presents this year’s CUNI submissions to the WAT 2017 Translation Task focusing on Japanese-English translation, namely the Scientific papers subtask, the Patents subtask and the Newswire subtask. We compare two neural network architectures, the standard sequence-to-sequence with attention (Seq2Seq) and an architecture using a convolutional sentence encoder (FBConv2Seq), both implemented in the NMT framework Neural Monkey that we currently participate in developing. We also compare various types of preprocessing of the source Japanese sentences and their impact on the overall results. Furthermore, we include the results of our experiments with out-of-domain data obtained by combining the corpora provided for each subtask.

An Exploration of Word Embedding Initialization in Deep-Learning Tasks
Tom Kocmi | Ondřej Bojar
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

2016

UFAL Submissions to the IWSLT 2016 MT Track
Ondřej Bojar | Ondřej Cífka | Jindřich Helcl | Tom Kocmi | Roman Sudarikov
Proceedings of the 13th International Conference on Spoken Language Translation

We present our submissions to the IWSLT 2016 machine translation task, as our first attempt to translate subtitles and one of our early experiments with neural machine translation (NMT). We focus primarily on the English→Czech translation direction but also perform basic adaptation experiments for NMT with German and the reverse direction. Three MT systems are tested: (1) our Chimera, a tight combination of phrase-based MT and deep linguistic processing, (2) Neural Monkey, our implementation of an NMT system in TensorFlow and (3) Nematus, an established NMT system.