Mariya Shmatova
2025
Findings of the WMT25 General Machine Translation Shared Task: Time to Stop Evaluating on Easy Test Sets
Tom Kocmi | Ekaterina Artemova | Eleftherios Avramidis | Rachel Bawden | Ondřej Bojar | Konstantin Dranch | Anton Dvorkovich | Sergey Dukanov | Mark Fishel | Markus Freitag | Thamme Gowda | Roman Grundkiewicz | Barry Haddow | Marzena Karpinska | Philipp Koehn | Howard Lakougna | Jessica Lundin | Christof Monz | Kenton Murray | Masaaki Nagata | Stefano Perrella | Lorenzo Proietti | Martin Popel | Maja Popović | Parker Riley | Mariya Shmatova | Steinthór Steingrímsson | Lisa Yankovskaya | Vilém Zouhar
Proceedings of the Tenth Conference on Machine Translation
This paper presents the results of the General Machine Translation Task organized as part of the 2025 Conference on Machine Translation (WMT). Participants were invited to build systems for any of 30 language pairs. For half of these pairs, we conducted a human evaluation on test sets spanning four to five different domains. We evaluated 60 systems in total: 36 submitted by participants and 24 for which we collected translations from large language models (LLMs) and popular online translation providers. This year, we focused on creating challenging test sets by developing a difficulty sampling technique and using more complex source data. We evaluated system outputs with professional annotators using the Error Span Annotation (ESA) protocol, except for two language pairs, for which we used Multidimensional Quality Metrics (MQM) instead. We continued the trend of increasingly moving towards document-level translation, providing the source texts as whole documents containing multiple paragraphs.
Findings of the WMT25 Multilingual Instruction Shared Task: Persistent Hurdles in Reasoning, Generation, and Evaluation
Tom Kocmi | Sweta Agrawal | Ekaterina Artemova | Eleftherios Avramidis | Eleftheria Briakou | Pinzhen Chen | Marzieh Fadaee | Markus Freitag | Roman Grundkiewicz | Yupeng Hou | Philipp Koehn | Julia Kreutzer | Saab Mansour | Stefano Perrella | Lorenzo Proietti | Parker Riley | Eduardo Sánchez | Patricia Schmidtova | Mariya Shmatova | Vilém Zouhar
Proceedings of the Tenth Conference on Machine Translation
The WMT25 Multilingual Instruction Shared Task (MIST) introduces a benchmark to evaluate large language models (LLMs) across 30 languages. The benchmark covers five types of problems: machine translation, linguistic reasoning, open-ended generation, cross-lingual summarization, and LLM-as-a-judge. We provide automatic evaluation and collect human annotations, which highlight the limitations of automatic evaluation and allow further research into metric meta-evaluation. We run a diverse set of open- and closed-weight LLMs on our benchmark, providing a broad assessment of the multilingual capabilities of current LLMs. Results highlight substantial variation across sub-tasks and languages, revealing persistent challenges in reasoning, cross-lingual generation, and evaluation reliability. This work establishes a standardized framework for measuring future progress in multilingual LLM development.
2024
Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet
Tom Kocmi | Eleftherios Avramidis | Rachel Bawden | Ondřej Bojar | Anton Dvorkovich | Christian Federmann | Mark Fishel | Markus Freitag | Thamme Gowda | Roman Grundkiewicz | Barry Haddow | Marzena Karpinska | Philipp Koehn | Benjamin Marie | Christof Monz | Kenton Murray | Masaaki Nagata | Martin Popel | Maja Popović | Mariya Shmatova | Steinthór Steingrímsson | Vilém Zouhar
Proceedings of the Ninth Conference on Machine Translation
This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotation (ESA).
Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
Tom Kocmi | Vilém Zouhar | Eleftherios Avramidis | Roman Grundkiewicz | Marzena Karpinska | Maja Popović | Mrinmaya Sachan | Mariya Shmatova
Proceedings of the Ninth Conference on Machine Translation
High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.
2023
Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet
Tom Kocmi | Eleftherios Avramidis | Rachel Bawden | Ondřej Bojar | Anton Dvorkovich | Christian Federmann | Mark Fishel | Markus Freitag | Thamme Gowda | Roman Grundkiewicz | Barry Haddow | Philipp Koehn | Benjamin Marie | Christof Monz | Makoto Morishita | Kenton Murray | Masaaki Nagata | Toshiaki Nakazawa | Martin Popel | Maja Popović | Mariya Shmatova | Jun Suzuki
Proceedings of the Eighth Conference on Machine Translation
This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (corresponding to 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).
2016
YSDA Participation in the WMT’16 Quality Estimation Shared Task
Anna Kozlova | Mariya Shmatova | Anton Frolov
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
2014
Measuring the Impact of Spelling Errors on the Quality of Machine Translation
Irina Galinskaya | Valentin Gusev | Elena Mescheryakova | Mariya Shmatova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper we show how different types of spelling errors influence the quality of machine translation. We also propose a method to evaluate the impact of spelling errors correction on translation quality without expensive manual work of providing reference translations.
Co-authors
- Eleftherios Avramidis 5
- Roman Grundkiewicz 5
- Tom Kocmi 5
- Markus Freitag 4
- Philipp Koehn 4
- Maja Popović 4
- Vilém Zouhar 4
- Rachel Bawden 3
- Ondřej Bojar 3
- Anton Dvorkovich 3
- Mark Fishel 3
- Thamme Gowda 3
- Barry Haddow 3
- Marzena Karpinska 3
- Christof Monz 3
- Kenton Murray 3
- Masaaki Nagata 3
- Martin Popel 3
- Ekaterina Artemova 2
- Christian Federmann 2
- Benjamin Marie 2
- Stefano Perrella 2
- Lorenzo Proietti 2
- Parker Riley 2
- Steinthór Steingrímsson 2
- Sweta Agrawal 1
- Eleftheria Briakou 1
- Pinzhen Chen 1
- Konstantin Dranch 1
- Sergey Dukanov 1
- Marzieh Fadaee 1
- Anton Frolov 1
- Irina Galinskaya 1
- Valentin Gusev 1
- Yupeng Hou 1
- Anna Kozlova 1
- Julia Kreutzer 1
- Howard Lakougna 1
- Jessica Lundin 1
- Saab Mansour 1
- Elena Mescheryakova 1
- Makoto Morishita 1
- Toshiaki Nakazawa 1
- Mrinmaya Sachan 1
- Patrícia Schmidtová 1
- Jun Suzuki 1
- Eduardo Sánchez 1
- Lisa Yankovskaya 1