Eleftherios Avramidis


2024

Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet
Tom Kocmi | Eleftherios Avramidis | Rachel Bawden | Ondřej Bojar | Anton Dvorkovich | Christian Federmann | Mark Fishel | Markus Freitag | Thamme Gowda | Roman Grundkiewicz | Barry Haddow | Marzena Karpinska | Philipp Koehn | Benjamin Marie | Christof Monz | Kenton Murray | Masaaki Nagata | Martin Popel | Maja Popović | Mariya Shmatova | Steinthór Steingrímsson | Vilém Zouhar
Proceedings of the Ninth Conference on Machine Translation

This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).

Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task
Markus Freitag | Nitika Mathur | Daniel Deutsch | Chi-Kiu Lo | Eleftherios Avramidis | Ricardo Rei | Brian Thompson | Frederic Blain | Tom Kocmi | Jiayi Wang | David Ifeoluwa Adelani | Marianna Buchicchio | Chrysoula Zerva | Alon Lavie
Proceedings of the Ninth Conference on Machine Translation

The WMT24 Metrics Shared Task evaluated the performance of automatic metrics for machine translation (MT), with a major focus on LLM-based translations that were generated as part of the WMT24 General MT Shared Task. As LLMs become increasingly popular in MT, it is crucial to determine whether existing evaluation metrics can accurately assess the output of these systems. To provide a robust benchmark for this evaluation, human assessments were collected using Multidimensional Quality Metrics (MQM), continuing the practice from recent years. Furthermore, building on the success of the previous year, a challenge set subtask was included, requiring participants to design contrastive test suites that specifically target a metric’s ability to identify and penalize different types of translation errors. Finally, the meta-evaluation procedure was refined to better reflect real-world usage of MT metrics, focusing on pairwise accuracy at both the system- and segment-levels. We present an extensive analysis on how well metrics perform on three language pairs: English to Spanish (Latin America), Japanese to Chinese, and English to German. The results strongly confirm the results reported last year, that fine-tuned neural metrics continue to perform well, even when used to evaluate LLM-based translation systems.
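
As an illustration of the system-level pairwise accuracy mentioned in this abstract, the following is a minimal sketch, not the shared task's actual implementation. It counts how often a metric orders a pair of systems the same way as the human scores do. The function name, the toy scores, and the choice to skip pairs that humans rate equally are assumptions made for this example; the task's own handling of ties and significance is not reproduced here.

```python
from itertools import combinations


def pairwise_accuracy(human, metric):
    """Fraction of system pairs ranked in the same order by metric and humans.

    `human` and `metric` map system names to scores (higher is better).
    """
    agree = 0
    total = 0
    for a, b in combinations(sorted(human), 2):
        human_delta = human[a] - human[b]
        metric_delta = metric[a] - metric[b]
        if human_delta == 0:
            continue  # assumption: pairs with equal human scores are skipped
        total += 1
        if human_delta * metric_delta > 0:
            agree += 1  # both differences have the same sign
    return agree / total if total else 0.0


# Hypothetical scores for three systems.
human_scores = {"sysA": 0.82, "sysB": 0.75, "sysC": 0.60}
metric_scores = {"sysA": 0.71, "sysB": 0.74, "sysC": 0.52}
accuracy = pairwise_accuracy(human_scores, metric_scores)
```

In this toy case the metric disagrees with the human ranking only on the pair (sysA, sysB), so two of the three pairs agree.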

Occiglot at WMT24: European Open-source Large Language Models Evaluated on Translation
Eleftherios Avramidis | Annika Grützner-Zahn | Manuel Brack | Patrick Schramowski | Pedro Ortiz Suarez | Malte Ostendorff | Fabio Barth | Shushen Manakhimova | Vivien Macketanz | Georg Rehm | Kristian Kersting
Proceedings of the Ninth Conference on Machine Translation

This document describes the submission of the very first version of the Occiglot open-source large language model to the General MT Shared Task of the 9th Conference of Machine Translation (WMT24). Occiglot is an open-source, community-based LLM based on Mistral-7B, which went through language-specific continual pre-training and subsequent instruction tuning, including instructions relevant to machine translation. We examine the automatic metric scores for translating the WMT24 test set and provide a detailed linguistically-motivated analysis. Despite Occiglot performing worse than many of the other system submissions, we observe that it performs better than Mistral-7B, on which it is based, which indicates the positive effect of the language-specific continual pre-training and instruction tuning. We see the submission of this very early version of the model as a motivation to unite community forces and pursue future LLM research on the translation task.

Investigating the Linguistic Performance of Large Language Models in Machine Translation
Shushen Manakhimova | Vivien Macketanz | Eleftherios Avramidis | Ekaterina Lapshinova-Koltunski | Sergei Bagdasarov | Sebastian Möller
Proceedings of the Ninth Conference on Machine Translation

This paper summarizes the results of our test suite evaluation on 39 machine translation systems submitted at the Shared Task of the Ninth Conference of Machine Translation (WMT24). It offers a fine-grained linguistic evaluation of machine translation outputs for English–German and English–Russian, resulting from significant manual linguistic effort. Based on our results, LLMs are inferior to NMT in English–German, both in overall scores and when translating specific linguistic phenomena, such as punctuation, complex future verb tenses, and stripping. LLMs show quite a competitive performance in English-Russian, although top-performing systems might struggle with some cases of named entities and terminology, function words, mediopassive voice, and semantic roles. Additionally, some LLMs generate very verbose or empty outputs, posing challenges to the evaluation process.

Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems
Eleftherios Avramidis | Shushen Manakhimova | Vivien Macketanz | Sebastian Möller
Proceedings of the Ninth Conference on Machine Translation

This year’s MT metrics challenge set submission by DFKI expands on previous years’ linguistically motivated challenge sets. It includes 137,000 items extracted from 100 MT systems for the two language directions (English to German, English to Russian), covering more than 100 linguistically motivated phenomena organized into 14 linguistic categories. The metrics with the statistically significant best performance in our linguistically motivated analysis are MetricX-24-Hybrid and MetricX-24 for English to German, and MetricX-24 for English to Russian. Metametrics and XCOMET are in the next ranking positions in both language pairs. Metrics are more accurate in detecting linguistic errors in translations by large language models (LLMs) than in translations based on the encoder-decoder neural machine translation (NMT) architecture. Some of the most difficult phenomena for the metrics to score are the transitive past progressive, multiple connectors, and the ditransitive simple future I for English to German, and pseudogapping, contact clauses, and cleft sentences for English to Russian. Despite its overall low performance, the LLM-based metric Gemba performs best in scoring German negation errors.

Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
Tom Kocmi | Vilém Zouhar | Eleftherios Avramidis | Roman Grundkiewicz | Marzena Karpinska | Maja Popović | Mrinmaya Sachan | Mariya Shmatova
Proceedings of the Ninth Conference on Machine Translation

High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.

Exploring the Potential of Large Language Models in Adaptive Machine Translation for Generic Text and Subtitles
Abdelhadi Soudi | Mohamed Hannani | Kristof Van Laerhoven | Eleftherios Avramidis
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024

DGS-Fabeln-1: A Multi-Angle Parallel Corpus of Fairy Tales between German Sign Language and German Text
Fabrizio Nunnari | Eleftherios Avramidis | Cristina España-Bonet | Marco González | Anna Hennes | Patrick Gebhard
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the acquisition process and the data of DGS-Fabeln-1, a parallel corpus of German text and videos containing German fairy tales interpreted into the German Sign Language (DGS) by a native DGS signer. The corpus contains 573 segments of videos with a total duration of 1 hour and 32 minutes, corresponding to 1428 written sentences. It is the first corpus of semi-naturally expressed DGS that has been filmed from 7 angles, and one of the few sign language (SL) corpora globally which have been filmed from more than 3 angles and where the listener has been simultaneously filmed. The corpus aims at aiding research in SL linguistics, SL machine translation and affective computing, and is freely available for research purposes at the following address: https://doi.org/10.5281/zenodo.10822097.

2023

Neural Machine Translation Methods for Translating Text to Sign Language Glosses
Dele Zhu | Vera Czehmann | Eleftherios Avramidis
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

State-of-the-art techniques common to low resource Machine Translation (MT) are applied to improve MT of spoken language text to Sign Language (SL) glosses. In our experiments, we improve the performance of the transformer-based models via (1) data augmentation, (2) semi-supervised Neural Machine Translation (NMT), (3) transfer learning and (4) multilingual NMT. The proposed methods are implemented progressively on two German SL corpora containing gloss annotations. Multilingual NMT combined with data augmentation appears to be the most successful setting, yielding statistically significant improvements as measured by three automatic metrics (up to over 6 points BLEU), and confirmed via human evaluation. Our best setting outperforms all previous work that reports on the same test-set and is also confirmed on a corpus of the American Sign Language (ASL).

Proceedings of the Second International Workshop on Automatic Translation for Signed and Spoken Languages
Dimitar Shterionov | Mirella De Sisto | Mathias Muller | Davy Van Landuyt | Rehana Omardeen | Shaun Oboyle | Annelies Braffort | Floris Roelofsen | Fred Blain | Bram Vanroy | Eleftherios Avramidis
Proceedings of the Second International Workshop on Automatic Translation for Signed and Spoken Languages

First WMT Shared Task on Sign Language Translation (WMT-SLT22)
Mathias Müller | Sarah Ebling | Eleftherios Avramidis | Alessia Battisti | Michèle Berger | Richard Bowden | Annelies Braffort | Necati Cihan Camgoz | Cristina España-Bonet | Roman Grundkiewicz | Zifan Jiang | Oscar Koller | Amit Moryossef | Regula Perrollaz | Sabine Reinhard | Annette Rios Gonzales | Dimitar Shterionov | Sandra Sidler-Miserez | Katja Tissi | Davy Van Landuyt
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This paper is a brief summary of the First WMT Shared Task on Sign Language Translation (WMT-SLT22), a project partly funded by EAMT. The focus of this shared task is automatic translation between signed and spoken languages. Details can be found on our website (https://www.wmt-slt.com/) or in the findings paper (Müller et al., 2022).

Findings of the 2023 Conference on Machine Translation (WMT23): LLMs Are Here but Not Quite There Yet
Tom Kocmi | Eleftherios Avramidis | Rachel Bawden | Ondřej Bojar | Anton Dvorkovich | Christian Federmann | Mark Fishel | Markus Freitag | Thamme Gowda | Roman Grundkiewicz | Barry Haddow | Philipp Koehn | Benjamin Marie | Christof Monz | Makoto Morishita | Kenton Murray | Makoto Nagata | Toshiaki Nakazawa | Martin Popel | Maja Popović | Mariya Shmatova
Proceedings of the Eighth Conference on Machine Translation

This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (corresponding to 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).

Findings of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23)
Mathias Müller | Malihe Alikhani | Eleftherios Avramidis | Richard Bowden | Annelies Braffort | Necati Cihan Camgöz | Sarah Ebling | Cristina España-Bonet | Anne Göhring | Roman Grundkiewicz | Mert Inan | Zifan Jiang | Oscar Koller | Amit Moryossef | Annette Rios | Dimitar Shterionov | Sandra Sidler-Miserez | Katja Tissi | Davy Van Landuyt
Proceedings of the Eighth Conference on Machine Translation

This paper presents the results of the Second WMT Shared Task on Sign Language Translation (WMT-SLT23; https://www.wmt-slt.com/). This shared task is concerned with automatic translation between signed and spoken languages. The task is unusual in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT). The task offers four tracks involving the following languages: Swiss German Sign Language (DSGS), French Sign Language of Switzerland (LSF-CH), Italian Sign Language of Switzerland (LIS-CH), German, French and Italian. Four teams (including one working on a baseline submission) participated in this second edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora and reproducible baseline systems. Finally, the task also resulted in publicly available sets of system outputs and more human evaluation scores for sign language translation.

Linguistically Motivated Evaluation of the 2023 State-of-the-art Machine Translation: Can ChatGPT Outperform NMT?
Shushen Manakhimova | Eleftherios Avramidis | Vivien Macketanz | Ekaterina Lapshinova-Koltunski | Sergei Bagdasarov | Sebastian Möller
Proceedings of the Eighth Conference on Machine Translation

This paper offers a fine-grained analysis of the machine translation outputs in the context of the Shared Task at the 8th Conference of Machine Translation (WMT23). Building on the foundation of previous test suite efforts, our analysis includes Large Language Models and an updated test set featuring new linguistic phenomena. To our knowledge, this is the first fine-grained linguistic analysis for the GPT-4 translation outputs. Our evaluation spans German-English, English-German, and English-Russian language directions. Some of the phenomena with the lowest accuracies for German-English are idioms and resultative predicates. For English-German, these include mediopassive voice and noun formation (er). As for English-Russian, these include idioms and semantic roles. GPT-4 performs equally or comparably to the best systems in German-English and English-German but falls in the second significance cluster for English-Russian.

Results of WMT23 Metrics Shared Task: Metrics Might Be Guilty but References Are Not Innocent
Markus Freitag | Nitika Mathur | Chi-kiu Lo | Eleftherios Avramidis | Ricardo Rei | Brian Thompson | Tom Kocmi | Frederic Blain | Daniel Deutsch | Craig Stewart | Chrysoula Zerva | Sheila Castilho | Alon Lavie | George Foster
Proceedings of the Eighth Conference on Machine Translation

This paper presents the results of the WMT23 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT23 News Translation Task. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). Following last year’s success, we also included a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Furthermore, we improved our meta-evaluation procedure by considering fewer tasks and calculating a global score by weighted averaging across the various tasks. We present an extensive analysis on how well metrics perform on three language pairs: Chinese-English, Hebrew-English on the sentence-level and English-German on the paragraph-level. The results strongly confirm the results reported last year, that neural-based metrics are significantly better than non-neural metrics in their levels of correlation with human judgments. Further, we investigate the impact of bad reference translations on the correlations of metrics with human judgment. We present a novel approach for generating synthetic reference translations based on the collection of MT system outputs and their corresponding MQM ratings, which has the potential to mitigate bad reference issues we observed this year for some language pairs. Finally, we also study the connections between the magnitude of metric differences and their expected significance in human evaluation, which should help the community to better understand and adopt new metrics.

Challenging the State-of-the-art Machine Translation Metrics from a Linguistic Perspective
Eleftherios Avramidis | Shushen Manakhimova | Vivien Macketanz | Sebastian Möller
Proceedings of the Eighth Conference on Machine Translation

We employ a linguistically motivated challenge set in order to evaluate the state-of-the-art machine translation metrics submitted to the Metrics Shared Task of the 8th Conference for Machine Translation. The challenge set includes about 21,000 items extracted from 155 machine translation systems for three language directions, covering more than 100 linguistically-motivated phenomena organized in 14 categories. The metrics that have the best performance with regard to our linguistically motivated analysis are the Cometoid22-wmt23 (a trained metric based on distillation) for German-English and MetricX-23-c (based on a fine-tuned mT5 encoder-decoder language model) for English-German and English-Russian. Some of the most difficult phenomena are passive voice for German-English, named entities, terminology and measurement units for English-German, and focus particles, adverbial clause and stripping for English-Russian.

Semi-supervised Learning for Quality Estimation of Machine Translation
Tarun Bhatia | Martin Kraemer | Eduardo Vellasques | Eleftherios Avramidis
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

We investigate whether using semi-supervised learning (SSL) methods can be beneficial for the task of word-level Quality Estimation of Machine Translation in low resource conditions. We show that the Mean Teacher network can provide equal or significantly better MCC scores (up to +12%) than supervised methods when a limited amount of labeled data is available. Additionally, following previous work on SSL, we investigate Pseudo-Labeling in combination with SSL, which nevertheless does not provide consistent improvements.
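
For readers unfamiliar with the MCC score reported in this abstract, the sketch below computes the Matthews correlation coefficient from its standard definition over word-level quality tags. The "OK"/"BAD" tag names, the function name, and the toy tag sequences are assumptions for illustration and are not taken from the paper.

```python
import math


def mcc(gold, pred, positive="BAD"):
    """Matthews correlation coefficient for two equally long tag sequences."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical word-level tags for a six-token translation.
gold_tags = ["OK", "OK", "BAD", "BAD", "OK", "BAD"]
pred_tags = ["OK", "BAD", "BAD", "BAD", "OK", "OK"]
score = mcc(gold_tags, pred_tags)
```

Unlike plain accuracy, MCC stays informative when one tag is much more frequent than the other, which is why it is a common choice for word-level quality estimation.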

2022

Fine-tuning of Convolutional Neural Networks for the Recognition of Facial Expressions in Sign Language Video Samples
Neha Deshpande | Fabrizio Nunnari | Eleftherios Avramidis
Proceedings of the 7th International Workshop on Sign Language Translation and Avatar Technology: The Junction of the Visual and the Textual: Challenges and Perspectives

In this paper, we investigate the capability of convolutional neural networks to recognize in sign language video frames the six basic Ekman facial expressions for ‘fear’, ‘disgust’, ‘surprise’, ‘sadness’, ‘happiness’, ‘anger’ along with the ‘neutral’ class. Given the limited amount of annotated facial expression data for the sign language domain, we started from a model pre-trained on general-purpose facial expression datasets and we applied various machine learning techniques such as fine-tuning, data augmentation, class balancing, as well as image preprocessing to reach a better accuracy. The models were evaluated using K-fold cross-validation to get more accurate conclusions. It is experimentally demonstrated that fine-tuning a pre-trained model, along with data augmentation by horizontally flipping images and image normalization, helps in providing the best accuracy on the sign language dataset. The best setting achieves satisfactory classification accuracy, comparable to state-of-the-art systems in generic facial expression recognition. Experiments were performed using different combinations of the above-mentioned techniques based on two different architectures, namely MobileNet and EfficientNet, and it is deemed that both architectures are equally suitable for the purpose of fine-tuning, whereas class balancing is discouraged.

Using Neural Machine Translation Methods for Sign Language Translation
Galina Angelova | Eleftherios Avramidis | Sebastian Möller
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

We examine methods and techniques, proven to be helpful for the text-to-text translation of spoken languages in the context of gloss-to-text translation systems, where the glosses are the written representation of the signs. We present one of the first works that include experiments on both parallel corpora of the German Sign Language (PHOENIX14T and the Public DGS Corpus). We experiment with two NMT architectures with optimization of their hyperparameters, several tokenization methods and two data augmentation techniques (back-translation and paraphrasing). Through our investigation we achieve a substantial improvement of 5.0 and 2.2 BLEU scores for the models trained on the two corpora respectively. Our RNN models outperform our Transformer models, and the segmentation method we achieve best results with is BPE, whereas back-translation and paraphrasing lead to minor but not significant improvements.

A Linguistically Motivated Test Suite to Semi-Automatically Evaluate German–English Machine Translation Output
Vivien Macketanz | Eleftherios Avramidis | Aljoscha Burchardt | He Wang | Renlong Ai | Shushen Manakhimova | Ursula Strohriegel | Sebastian Möller | Hans Uszkoreit
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents a fine-grained test suite for the language pair German–English. The test suite is based on a number of linguistically motivated categories and phenomena and the semi-automatic evaluation is carried out with regular expressions. We describe the creation and implementation of the test suite in detail, providing a full list of all categories and phenomena. Furthermore, we present various exemplary applications of our test suite that have been implemented in the past years, like contributions to the Conference of Machine Translation, the usage of the test suite and MT outputs for quality estimation, and the expansion of the test suite to the language pair Portuguese–English. We describe how we tracked the development of the performance of various MT systems over the years with the help of the test suite and which categories and phenomena are prone to resulting in MT errors. For the first time, we also make a large part of our test suite publicly available to the research community.
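
The semi-automatic, regular-expression-based evaluation described in this abstract could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' tooling: the `judge` function, the three verdict labels, the order in which the patterns are checked, and the idiom test item with its patterns are all hypothetical. The idea is that a translation matching a known-wrong pattern fails, one matching a known-correct pattern passes, and anything else is left for a human annotator.

```python
import re


def judge(translation, positive, negative):
    """Classify one MT output for one test item as pass, fail, or manual."""
    if re.search(negative, translation, flags=re.IGNORECASE):
        return "fail"  # matches a pattern known to be a wrong rendering
    if re.search(positive, translation, flags=re.IGNORECASE):
        return "pass"  # matches a pattern known to be a correct rendering
    return "manual"  # undecided: needs manual inspection


# Hypothetical test item for the German idiom "Er hat ins Gras gebissen".
POSITIVE = r"\b(died|kicked the bucket|bit the dust)\b"
NEGATIVE = r"\bbit(ten)? (into )?the grass\b"

verdicts = [
    judge("He kicked the bucket.", POSITIVE, NEGATIVE),
    judge("He bit into the grass.", POSITIVE, NEGATIVE),
    judge("He passed away.", POSITIVE, NEGATIVE),
]
```

The third output is a valid translation that neither pattern anticipates, which is the kind of case where the manual step is still needed.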

Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust
Markus Freitag | Ricardo Rei | Nitika Mathur | Chi-kiu Lo | Craig Stewart | Eleftherios Avramidis | Tom Kocmi | George Foster | Alon Lavie | André F. T. Martins
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper presents the results of the WMT22 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT22 News Translation Task on four different domains: news, social, ecommerce, and chat. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages, among other things: (i) expert-based evaluation is more reliable, (ii) we extended the pool of translations by 5 additional translations based on MBR decoding or rescoring which are challenging for current metrics. In addition, we initiated a challenge set subtask, where participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Finally, we present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. The results demonstrate the superiority of neural-based learned metrics and demonstrate again that overlap metrics like Bleu, spBleu or chrf correlate poorly with human ratings. The results also reveal that neural-based metrics are remarkably robust across different domains and challenges.

Linguistically Motivated Evaluation of the 2022 State-of-the-art Machine Translation Systems for Three Language Directions
Vivien Macketanz | Shushen Manakhimova | Eleftherios Avramidis | Ekaterina Lapshinova-koltunski | Sergei Bagdasarov | Sebastian Möller
Proceedings of the Seventh Conference on Machine Translation (WMT)

This document describes a fine-grained linguistically motivated analysis of 29 machine translation systems submitted at the Shared Task of the 7th Conference of Machine Translation (WMT22). This submission expands the test suite work of previous years by adding the language direction of English–Russian. As a result, evaluation takes place for the language directions of German–English, English–German, and English–Russian. We find that the German–English systems suffer in translating idioms, some tenses of modal verbs, and resultative predicates, the English–German ones in idioms, transitive-past progressive, and middle voice, whereas the English–Russian ones in pseudogapping and idioms.

Linguistically Motivated Evaluation of Machine Translation Metrics Based on a Challenge Set
Eleftherios Avramidis | Vivien Macketanz
Proceedings of the Seventh Conference on Machine Translation (WMT)

We employ a linguistically motivated challenge set in order to evaluate the state-of-the-art machine translation metrics submitted to the Metrics Shared Task of the 7th Conference for Machine Translation. The challenge set includes about 20,000 items extracted from 145 MT systems for two language directions (German-English, English-German), covering more than 100 linguistically-motivated phenomena organized in 14 categories. The best performing metrics are YiSi-1, BERTScore and COMET-22 for German-English, and UniTE, UniTE-ref, XL-DA and xxl-DA19 for English-German. Metrics in both directions are performing worst when it comes to named-entities & terminology and particularly measuring units. Particularly in German-English they are weak at detecting issues at punctuation, polar questions, relative clauses, dates and idioms. In English-German, they perform worst at present progressive of transitive verbs, future II progressive of intransitive verbs, simple present perfect of ditransitive verbs and focus particles.

Findings of the First WMT Shared Task on Sign Language Translation (WMT-SLT22)
Mathias Müller | Sarah Ebling | Eleftherios Avramidis | Alessia Battisti | Michèle Berger | Richard Bowden | Annelies Braffort | Necati Cihan Camgöz | Cristina España-bonet | Roman Grundkiewicz | Zifan Jiang | Oscar Koller | Amit Moryossef | Regula Perrollaz | Sabine Reinhard | Annette Rios | Dimitar Shterionov | Sandra Sidler-miserez | Katja Tissi
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper presents the results of the First WMT Shared Task on Sign Language Translation (WMT-SLT22). This shared task is concerned with automatic translation between signed and spoken languages. The task is novel in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT). The task featured two tracks, translating from Swiss German Sign Language (DSGS) to German and vice versa. Seven teams participated in this first edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora, reproducible baseline systems and new protocols and software for human evaluation. Finally, the task also resulted in the first publicly available set of system outputs and human evaluation scores for sign language translation.

Experimental Machine Translation of the Swiss German Sign Language via 3D Augmentation of Body Keypoints
Lorenz Hufe | Eleftherios Avramidis
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes the participation of DFKI-SLT at the Sign Language Translation Task of the Seventh Conference of Machine Translation (WMT22). The system focuses on the translation direction from the Swiss German Sign Language (DSGS) to written German. The original videos of the sign language were analyzed with computer vision models to provide 3D body keypoints. A deep-learning sequence-to-sequence model is trained on a parallel corpus of these body keypoints aligned to written German sentences. Geometric data augmentation occurs during the training process. The body keypoints are augmented by artificial rotation in the three dimensional space. The 3D-transformation is calculated with different angles on every batch of the training process.
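
The geometric augmentation described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the DFKI-SLT implementation: it rotates only around the vertical (y) axis, draws one random angle per batch from a hypothetical range of ±15 degrees, and represents a batch as a list of frames, each a list of (x, y, z) keypoints. All function and parameter names are invented for this example.

```python
import math
import random


def rotation_matrix_y(angle):
    """3x3 matrix rotating points by `angle` radians around the y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotate(points, matrix):
    """Apply a 3x3 rotation matrix to every (x, y, z) point."""
    return [
        tuple(sum(matrix[r][k] * p[k] for k in range(3)) for r in range(3))
        for p in points
    ]


def augment_batch(batch, max_degrees=15.0, rng=random):
    """Rotate all frames of a batch by one angle drawn anew for this batch."""
    angle = math.radians(rng.uniform(-max_degrees, max_degrees))
    matrix = rotation_matrix_y(angle)
    return [rotate(frame, matrix) for frame in batch]


# One batch with a single frame of two hypothetical keypoints.
batch = [[(1.0, 2.0, 3.0), (0.5, 1.5, -0.2)]]
augmented = augment_batch(batch, rng=random.Random(0))
```

Because a rotation preserves distances, the augmented skeleton keeps its proportions; calling `augment_batch` once per training batch gives each batch a different viewing angle.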

2021

Observing the Learning Curve of NMT Systems With Regard to Linguistic Phenomena
Patrick Stadler | Vivien Macketanz | Eleftherios Avramidis
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

In this paper we present our observations and evaluations of the linguistic performance of various English-to-German Neural Machine Translation models at several steps of the training process. The linguistic performance is measured through a semi-automatic process using a test suite. Among several linguistic observations, we find that the translation quality of some linguistic categories decreased within the recorded iterations. Additionally, we notice some drops of the translation quality of certain categories when using a larger corpus.

Automatic generation of a 3D sign language avatar on AR glasses given 2D videos of human signers
Lan Thao Nguyen | Florian Schicktanz | Aeneas Stankowski | Eleftherios Avramidis
Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL)

In this paper we present a prototypical implementation of a pipeline that allows the automatic generation of a German Sign Language avatar from 2D video material. The presentation is accompanied by the source code. We record human pose movements during signing with computer vision models. The joint coordinates of hands and arms are imported as landmarks to control the skeleton of our avatar. From the anatomically independent landmarks, we create another skeleton based on the avatar’s skeletal bone architecture to calculate the bone rotation data. This data is then used to control our human 3D avatar. The avatar is displayed on AR glasses and can be placed virtually in the room, so that it can be perceived simultaneously with the verbal speaker. In future work, we aim to enhance it with speech recognition and machine translation methods so that it can serve as a sign language interpreter. The prototype has been shown to people of the deaf and hard-of-hearing community to assess its comprehensibility. Problems emerged with the transferred hand rotations: hand gestures were hard to recognize on the avatar due to deformations like twisted finger meshes.

pdf bib
Linguistic Evaluation for the 2021 State-of-the-art Machine Translation Systems for German to English and English to German
Vivien Macketanz | Eleftherios Avramidis | Shushen Manakhimova | Sebastian Möller
Proceedings of the Sixth Conference on Machine Translation

We use a semi-automated test suite to provide a fine-grained linguistic evaluation for state-of-the-art machine translation systems. The evaluation includes 18 German to English and 18 English to German systems, submitted to the Translation Shared Task of the 2021 Conference on Machine Translation. Our submission adds to the submissions of the previous years by creating and applying a wide-range test suite for English to German as a new language pair. The fine-grained evaluation allows spotting significant differences between systems that cannot be distinguished by the direct assessment of the human evaluation campaign. We find that most of the systems achieve good accuracies in the majority of linguistic phenomena but there are a few phenomena with lower accuracy, such as the idioms, the modal pluperfect and the German resultative predicates. Two systems have significantly better test suite accuracy in macro-average in each language direction: Online-W and Facebook-AI for German to English, and VolcTrans and Online-W for English to German. The systems show a steady improvement as compared to previous years.

2020

pdf bib
Fine-grained linguistic evaluation for state-of-the-art Machine Translation
Eleftherios Avramidis | Vivien Macketanz | Ursula Strohriegel | Aljoscha Burchardt | Sebastian Möller
Proceedings of the Fifth Conference on Machine Translation

This paper describes a test suite submission providing detailed statistics of linguistic performance for the state-of-the-art German-English systems of the Fifth Conference of Machine Translation (WMT20). The analysis covers 107 phenomena organized in 14 categories based on about 5,500 test items, including a manual annotation effort of 45 person hours. Two systems (Tohoku and Huoshan) appear to have significantly better test suite accuracy than the others, although the best system of WMT20 is not significantly better than the one from WMT19 in a macro-average. Additionally, we identify some linguistic phenomena where all systems suffer (such as idioms, resultative predicates and pluperfect), but we are also able to identify particular weaknesses for individual systems (such as quotation marks, lexical ambiguity and sluicing). Most of the systems of WMT19 which submitted new versions this year show improvements.

2019

pdf bib
Train, Sort, Explain: Learning to Diagnose Translation Models
Robert Schwarzenberg | David Harbecke | Vivien Macketanz | Eleftherios Avramidis | Sebastian Möller
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Evaluating translation models is a trade-off between effort and detail. On the one end of the spectrum there are automatic count-based methods such as BLEU, on the other end linguistic evaluations by humans, which arguably are more informative but also require a disproportionately high effort. To narrow the spectrum, we propose a general approach on how to automatically expose systematic differences between human and machine translations to human experts. Inspired by adversarial settings, we train a neural text classifier to distinguish human from machine translations. A classifier that performs and generalizes well after training should recognize systematic differences between the two classes, which we uncover with neural explainability methods. Our proof-of-concept implementation, DiaMaT, is open source. Applied to a dataset translated by a state-of-the-art neural Transformer model, DiaMaT achieves a classification accuracy of 75% and exposes meaningful differences between humans and the Transformer, amidst the current discussion about human parity.

pdf bib
Linguistic Evaluation of German-English Machine Translation Using a Test Suite
Eleftherios Avramidis | Vivien Macketanz | Ursula Strohriegel | Hans Uszkoreit
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We present the results of the application of a grammatical test suite for German-to-English MT on the systems submitted at WMT19, with a detailed analysis for 107 phenomena organized in 14 categories. On average, the systems still mistranslate one out of four test items. Low performance is indicated for idioms, modals, pseudo-clefts, multi-word expressions and verb valency. When compared to last year, there has been an improvement in function words, non-verbal agreement and punctuation. More detailed conclusions about particular systems and phenomena are also presented.

2018

pdf bib
Fine-grained evaluation of Quality Estimation for Machine translation based on a linguistically motivated Test Suite
Eleftherios Avramidis | Vivien Macketanz | Arle Lommel | Hans Uszkoreit
Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing

pdf bib
Fine-grained evaluation of German-English Machine Translation based on a Test Suite
Vivien Macketanz | Eleftherios Avramidis | Aljoscha Burchardt | Hans Uszkoreit
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present an analysis of 16 state-of-the-art MT systems on German-English based on a linguistically-motivated test suite. The test suite has been devised manually by a team of language professionals in order to cover a broad variety of linguistic phenomena that MT often fails to translate properly. It contains 5,000 test sentences covering 106 linguistic phenomena in 14 categories, with an increased focus on verb tenses, aspects and moods. The MT outputs are evaluated in a semi-automatic way through regular expressions that focus only on the part of the sentence that is relevant to each phenomenon. Through our analysis, we are able to compare systems based on their performance on these categories. Additionally, we reveal strengths and weaknesses of particular systems and we identify grammatical phenomena where the overall performance of MT is relatively low.
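The semi-automatic, regular-expression-based evaluation mentioned in this abstract can be sketched roughly as follows. This is an illustrative assumption about how such a check might look, not the test suite's actual code: the item fields (`phenomenon`, `positive`, `negative`) and both function names are hypothetical. Each pattern targets only the part of the output relevant to the phenomenon, and outputs that match neither pattern are left for manual inspection.

```python
import re
from collections import defaultdict


def judge(output, item):
    """Return True (correct), False (incorrect) or None (needs manual review)."""
    if re.search(item["positive"], output, flags=re.IGNORECASE):
        return True
    negative = item.get("negative")
    if negative and re.search(negative, output, flags=re.IGNORECASE):
        return False
    return None  # no pattern matched: a human annotator has to decide


def accuracy_by_phenomenon(items, outputs):
    """Compute accuracy per phenomenon over the automatically judged items."""
    counts = defaultdict(lambda: [0, 0])  # phenomenon -> [correct, judged]
    for item, output in zip(items, outputs):
        verdict = judge(output, item)
        if verdict is None:
            continue  # skipped here; would be resolved manually
        counts[item["phenomenon"]][1] += 1
        counts[item["phenomenon"]][0] += int(verdict)
    return {name: correct / judged for name, (correct, judged) in counts.items()}
```

Keeping the undecided case separate from the incorrect case is what makes the process semi-automatic: the patterns settle the clear cases, and only the remainder requires annotation effort.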

2017

pdf bib
Sentence-level quality estimation by predicting HTER as a multi-component metric
Eleftherios Avramidis
Proceedings of the Second Conference on Machine Translation

2016

pdf bib
DFKI’s system for WMT16 IT-domain task, including analysis of systematic errors
Eleftherios Avramidis | Aljoscha Burchardt | Vivien Macketanz | Ankit Srivastava
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Deeper Machine Translation and Evaluation for German
Eleftherios Avramidis | Vivien Macketanz | Aljoscha Burchardt | Jindrich Helcl | Hans Uszkoreit
Proceedings of the 2nd Deep Machine Translation Workshop

pdf bib
Tools and Guidelines for Principled Machine Translation Development
Nora Aranberri | Eleftherios Avramidis | Aljoscha Burchardt | Ondřej Klejch | Martin Popel | Maja Popović
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This work addresses the need to aid Machine Translation (MT) development cycles with a complete workflow of MT evaluation methods. Our aim is to assess, compare and improve MT system variants. We hereby report on novel tools and practices that support various measures, developed in order to support a principled and informed approach of MT development. Our toolkit for automatic evaluation showcases quick and detailed comparison of MT system variants through automatic metrics and n-gram feedback, along with manual evaluation via edit-distance, error annotation and task-based feedback.

2015

pdf bib
Poor man’s lemmatisation for automatic error classification
Maja Popovic | Mihael Arcan | Eleftherios Avramidis | Aljoscha Burchardt
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
DFKI’s experimental hybrid MT system for WMT 2015
Eleftherios Avramidis | Maja Popović | Aljoscha Burchardt
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Poor man’s lemmatisation for automatic error classification
Maja Popović | Mihael Arčan | Eleftherios Avramidis | Aljoscha Burchardt | Arle Lommel
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Towards Deeper MT - A Hybrid System for German
Eleftherios Avramidis | Aljoscha Burchardt | Maja Popović | Hans Uszkoreit
Proceedings of the 1st Deep Machine Translation Workshop

2014

pdf bib
The tara corpus of human-annotated machine translations
Eleftherios Avramidis | Aljoscha Burchardt | Sabine Hunsicker | Maja Popović | Cindy Tscherwinka | David Vilar | Hans Uszkoreit
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Human translators are the key to evaluating machine translation (MT) quality and also to addressing the so far unanswered question of when and how to use MT in professional translation workflows. This paper describes the corpus developed as a result of a detailed large-scale human evaluation consisting of three tightly connected tasks: ranking, error classification and post-editing.

pdf bib
Efforts on Machine Learning over Human-mediated Translation Edit Rate
Eleftherios Avramidis
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Correlating decoding events with errors in Statistical Machine Translation
Eleftherios Avramidis | Maja Popović
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
Using a new analytic measure for the annotation and analysis of MT errors on real data
Arle Lommel | Aljoscha Burchardt | Maja Popović | Kim Harris | Eleftherios Avramidis | Hans Uszkoreit
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

pdf bib
Relations between different types of post-editing operations, cognitive effort and temporal effort
Maja Popović | Arle Lommel | Aljoscha Burchardt | Eleftherios Avramidis | Hans Uszkoreit
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

2013

pdf bib
Selecting Feature Sets for Comparative and Time-Oriented Quality Estimation of Machine Translation Output
Eleftherios Avramidis | Maja Popović
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
A CCG-based Quality Estimation Metric for Statistical Machine Translation Learning from Human Judgments of Machine Translation Output
Maja Popovic | Eleftherios Avramidis | Aljoscha Burchardt | Sabine Hunsicker | Sven Schmeier | Cindy Tscherwinka | David Vilar
Proceedings of Machine Translation Summit XIV: Posters

pdf bib
Learning from Human Judgments of Machine Translation Output
Maja Popovic | Eleftherios Avramidis | Aljoscha Burchardt | Sabine Hunsicker | Sven Schmeier | Cindy Tscherwinka | David Vilar
Proceedings of Machine Translation Summit XIV: Posters

pdf bib
What can we learn about the selection mechanism for post-editing?
Maja Popović | Eleftherios Avramidis | Aljoscha Burchardt | David Vilar | Hans Uszkoreit
Proceedings of the 2nd Workshop on Post-editing Technology and Practice

2012

pdf bib
Quality estimation for Machine Translation output using linguistic analysis and decoding features
Eleftherios Avramidis
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
Comparative Quality Estimation: Automatic Sentence-Level Ranking of Multiple Machine Translation Outputs
Eleftherios Avramidis
Proceedings of COLING 2012

pdf bib
Involving Language Professionals in the Evaluation of Machine Translation
Eleftherios Avramidis | Aljoscha Burchardt | Christian Federmann | Maja Popović | Cindy Tscherwinka | David Vilar
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Significant breakthroughs in machine translation only seem possible if human translators are taken into the loop. While automatic evaluation and scoring mechanisms such as BLEU have enabled the fast development of systems, it is not clear how systems can meet real-world (quality) requirements in industrial translation scenarios today. The taraXÜ project paves the way for wide usage of hybrid machine translation outputs through various feedback loops in system development. In a consortium of research and industry partners, the project integrates human translators into the development process for rating and post-editing of machine translation outputs thus collecting feedback for possible improvements.

pdf bib
A Richly Annotated, Multilingual Parallel Corpus for Hybrid Machine Translation
Eleftherios Avramidis | Marta R. Costa-jussà | Christian Federmann | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In recent years, machine translation (MT) research has focused on investigating how hybrid machine translation as well as system combination approaches can be designed so that the resulting hybrid translations show an improvement over the individual “component” translations. As a first step towards achieving this objective we have developed a parallel corpus with source text and the corresponding translation output from a number of machine translation engines, annotated with metadata information, capturing aspects of the translation process performed by the different MT systems. This corpus aims to serve as a basic resource for further research on whether hybrid machine translation algorithms and system combination techniques can benefit from additional (linguistically motivated, decoding, and runtime) information provided by the different systems involved. In this paper, we describe the annotated corpus we have created. We provide an overview on the component MT systems and the XLIFF-based annotation format we have developed. We also report on first experiments with the ML4HMT corpus data.

pdf bib
The ML4HMT Workshop on Optimising the Division of Labour in Hybrid Machine Translation
Christian Federmann | Eleftherios Avramidis | Marta R. Costa-jussà | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the “Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation” (ML4HMT) which aims to foster research on improved system combination approaches for machine translation (MT). Participants of the challenge are requested to build hybrid translations by combining the output of several MT systems of different types. We first describe the ML4HMT corpus used in the shared task, then explain the XLIFF-based annotation format we have designed for it, and briefly summarize the participating systems. Using both automated metrics scores and extensive manual evaluation, we discuss the individual performance of the various systems. An interesting result from the shared task is the fact that we were able to observe different systems winning according to the automated metrics scores when compared to the results from the manual evaluation. We conclude by summarising the first edition of the challenge and by giving an outlook to future work.

2011

pdf bib
DFKI’s SC and MT submissions to IWSLT 2011
David Vilar | Eleftherios Avramidis | Maja Popović | Sabine Hunsicker
Proceedings of the 8th International Workshop on Spoken Language Translation: Evaluation Campaign

We describe DFKI’s submission to the System Combination and Machine Translation tracks of the 2011 IWSLT Evaluation Campaign. We focus on a sentence selection mechanism which chooses the (hopefully) best sentence among a set of candidates. The rationale behind it is to take advantage of the strengths of each system, especially given a heterogeneous dataset like the one in this evaluation campaign, composed of TED Talks on very different topics. We focus on using features that correlate well with human judgement and, while our primary system still focuses on optimizing the BLEU score on the development set, our goal is to move towards directly optimizing the correlation with human judgement. This kind of system is still under development and was used as a secondary submission.

pdf bib
Evaluate with Confidence Estimation: Machine ranking of translation outputs using grammatical features
Eleftherios Avramidis | Maja Popovic | David Vilar | Aljoscha Burchardt
Proceedings of the Sixth Workshop on Statistical Machine Translation

pdf bib
Evaluation without references: IBM1 scores as evaluation metrics
Maja Popović | David Vilar | Eleftherios Avramidis | Aljoscha Burchardt
Proceedings of the Sixth Workshop on Statistical Machine Translation

2008

pdf bib
Enriching Morphologically Poor Languages for Statistical Machine Translation
Eleftherios Avramidis | Philipp Koehn
Proceedings of ACL-08: HLT
