Christian Federmann - ACL Anthology

Christian Federmann

2025

TASER: Translation Assessment via Systematic Evaluation and Reasoning
Monishwaran Maheswaran | Marco Carini | Christian Federmann | Tony Diaz
Proceedings of the Tenth Conference on Machine Translation

We introduce TASER (Translation Assessment via Systematic Evaluation and Reasoning), a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. TASER harnesses the explicit reasoning capabilities of LRMs to conduct systematic, step-by-step evaluation of translation quality. We evaluate TASER on the WMT24 Metrics Shared Task across both reference-based and reference-free scenarios, demonstrating state-of-the-art performance. In system-level evaluation, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings, outperforming all existing metrics. At the segment level, TASER maintains competitive performance with our reference-free variant ranking as the top-performing metric among all reference-free approaches. Our experiments reveal that structured prompting templates yield superior results with LRMs compared to the open-ended approaches that proved optimal for traditional LLMs. We evaluate o3, a large reasoning model from OpenAI, with varying reasoning efforts, providing insights into the relationship between reasoning depth and evaluation quality. The explicit reasoning process in LRMs offers interpretability and visibility, addressing a key limitation of existing automated metrics. Our results demonstrate that Large Reasoning Models show a measurable advancement in translation quality assessment, combining improved accuracy with transparent evaluation across diverse language pairs.

2024

This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).

Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
Tom Kocmi | Vilém Zouhar | Christian Federmann | Matt Post
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the “dynamic range” of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask “what point difference x in metric y is required between two systems for humans to notice?”. We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values in regards to testset size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.

Findings of the WMT 2024 Shared Task of the Open Language Data Initiative
Laurie Burchell | Jean Maillard | Antonios Anastasopoulos | Christian Federmann | Philipp Koehn | Skyler Wang
Proceedings of the Ninth Conference on Machine Translation

We present the results of the WMT 2024 shared task of the Open Language Data Initiative. Participants were invited to contribute to the FLORES+ and MT Seed multilingual datasets, two foundational open resources that facilitate the organic expansion of language technology’s reach. We accepted ten submissions covering 16 languages, which extended the range of languages included in the datasets and improved the quality of existing data.

2023

Large Language Models Are State-of-the-Art Evaluators of Translation Quality
Tom Kocmi | Christian Federmann
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with a reference translation and without. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate seven versions of GPT models, including ChatGPT. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Comparing to results from WMT22’s Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.

This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (corresponding to 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).

GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4
Tom Kocmi | Christian Federmann
Proceedings of the Eighth Conference on Machine Translation

This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect translation quality errors, specifically for the quality estimation setting without the need for human reference translations. Based on the power of large language models (LLM), GEMBA-MQM employs a fixed three-shot prompting technique, querying the GPT-4 model to mark error quality spans. Compared to previous works, our method has language-agnostic prompts, thus avoiding the need for manual prompt preparation for new languages. While preliminary results indicate that GEMBA-MQM achieves state-of-the-art accuracy for system ranking, we advise caution when using it in academic works to demonstrate improvements over other methods due to its dependence on the proprietary, black-box GPT model.

2022

NTREX-128 – News Test References for MT Evaluation of 128 Languages
Christian Federmann | Tom Kocmi | Ying Xin
Proceedings of the First Workshop on Scaling Up Multilingual Evaluation

We present the results of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages. The shared task included both a data and a systems track, along with additional innovations, such as a focus on African languages and extensive human evaluation of submitted systems. We received 14 system submissions from 8 teams, as well as 6 data track contributions. We report large progress in the quality of translation for African languages since the last iteration of this shared task: there is an increase of about 7.5 BLEU points across 72 language pairs, and the average BLEU scores went from 15.09 to 22.60.

MS-COMET: More and Better Human Judgements Improve Metric Performance
Tom Kocmi | Hitokazu Matsushita | Christian Federmann
Proceedings of the Seventh Conference on Machine Translation (WMT)

We develop two new metrics that build on top of the COMET architecture. The main contribution is collecting a ten-times larger corpus of human judgements than COMET and investigating how to filter out problematic human judgements. We propose filtering human judgements where human reference is statistically worse than machine translation. Furthermore, we average scores of all equal segments evaluated multiple times. The results comparing automatic metrics on source-based DA and MQM-style human judgement show state-of-the-art performance on a system-level pair-wise system ranking. We release both of our metrics for public use.

Searching for a Higher Power in the Human Evaluation of MT
Johnny Wei | Tom Kocmi | Christian Federmann
Proceedings of the Seventh Conference on Machine Translation (WMT)

In MT evaluation, pairwise comparisons are conducted to identify the better system. In conducting the comparison, the experimenter must allocate a budget to collect Direct Assessment (DA) judgments. We provide a cost effective way to spend the budget, but show that typical budget sizes often do not allow for solid comparison. Taking the perspective that the basis of solid comparison is in achieving statistical significance, we study the power (rate of achieving significance) on a large collection of pairwise DA comparisons. Due to the nature of statistical estimation, power is low for differentiating less than 1-2 DA points, and to achieve a notable increase in power requires at least 2-3x more samples. Applying variance reduction alone will not yield these gains, so we must face the reality of undetectable differences and spending increases. In this context, we propose interim testing, an “early stopping” collection procedure that yields more power per judgment collected, which adaptively focuses the budget on pairs that are borderline significant. Interim testing can achieve up to a 27% efficiency gain when spending 3x the current budget, or 18% savings at the current evaluation power.

This paper presents the results of the General Machine Translation Task organised as part of the Conference on Machine Translation (WMT) 2022. In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of four different domains. We evaluate system outputs with human annotators using two different techniques: reference-based direct assessment (DA) and a combination of DA and scalar quality metric (DA+SQM).

Findings of the IWSLT 2022 Evaluation Campaign
Antonios Anastasopoulos | Loïc Barrault | Luisa Bentivogli | Marcely Zanon Boito | Ondřej Bojar | Roldano Cattoni | Anna Currey | Georgiana Dinu | Kevin Duh | Maha Elbayad | Clara Emmanuel | Yannick Estève | Marcello Federico | Christian Federmann | Souhir Gahbiche | Hongyu Gong | Roman Grundkiewicz | Barry Haddow | Benjamin Hsu | Dávid Javorský | Vĕra Kloudová | Surafel Lakew | Xutai Ma | Prashant Mathur | Paul McNamee | Kenton Murray | Maria Nǎdejde | Satoshi Nakamura | Matteo Negri | Jan Niehues | Xing Niu | John Ortega | Juan Pino | Elizabeth Salesky | Jiatong Shi | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Marco Turchi | Yogesh Virkar | Alexander Waibel | Changhan Wang | Shinji Watanabe
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation. A total of 27 teams participated in at least one of the shared tasks. This paper details, for each shared task, the purpose of the task, the data that were released, the evaluation metrics that were applied, the submissions that were received and the results that were achieved.

2021

On User Interfaces for Large-Scale Document-Level Human Evaluation of Machine Translation Outputs
Roman Grundkiewicz | Marcin Junczys-Dowmunt | Christian Federmann | Tom Kocmi
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Recent studies emphasize the need of document context in human evaluation of machine translations, but little research has been done on the impact of user interfaces on annotator productivity and the reliability of assessments. In this work, we compare human assessment data from the last two WMT evaluation campaigns collected via two different methods for document-level evaluation. Our analysis shows that a document-centric approach to evaluation where the annotator is presented with the entire document context on a screen leads to higher quality segment and document level assessments. It improves the correlation between segment and document scores and increases inter-annotator agreement for document scores but is considerably more time consuming for annotators.

The JHU-Microsoft Submission for WMT21 Quality Estimation Shared Task
Shuoyang Ding | Marcin Junczys-Dowmunt | Matt Post | Christian Federmann | Philipp Koehn
Proceedings of the Sixth Conference on Machine Translation

This paper presents the JHU-Microsoft joint submission for WMT 2021 quality estimation shared task. We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on the target-side word-level quality estimation. The techniques we experimented with include Levenshtein Transformer training and data augmentation with a combination of forward, backward, round-trip translation, and pseudo post-editing of the MT output. We demonstrate the competitiveness of our system compared to the widely adopted OpenKiwi-XLM baseline. Our system is also the top-ranking system on the MT MCC metric for the English-German language pair.

To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation
Tom Kocmi | Christian Federmann | Roman Grundkiewicz | Marcin Junczys-Dowmunt | Hitokazu Matsushita | Arul Menezes
Proceedings of the Sixth Conference on Machine Translation

Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system’s quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metrics correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on – to the best of our knowledge – the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.

This paper presents the results of the news translation task, the multilingual low-resource translation for Indo-European languages, the triangular translation task, and the automatic post-editing task organised as part of the Conference on Machine Translation (WMT) 2021. In the news task, participants were asked to build machine translation systems for any of 10 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation.

2020

This paper presents the results of the news translation task and the similar language translation task, both organised alongside the Conference on Machine Translation (WMT) 2020. In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation. In the similar language translation task, participants built machine translation systems for translating between closely related pairs of languages.

The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2020) featured this year six challenge tracks: (i) Simultaneous speech translation, (ii) Video speech translation, (iii) Offline speech translation, (iv) Conversational speech translation, (v) Open domain translation, and (vi) Non-native speech translation. A total of teams participated in at least one of the tracks. This paper introduces each track’s goal, data and evaluation metrics, and reports the results of the received submissions.

The COVID-19 pandemic is the worst pandemic to strike the world in over a century. Crucial to stemming the tide of the SARS-CoV-2 virus is communicating to vulnerable populations the means by which they can protect themselves. To this end, the collaborators forming the Translation Initiative for COvid-19 (TICO-19) have made test and development data available to AI and MT researchers in 35 different languages in order to foster the development of tools and resources for improving access to information about COVID-19 in these languages. In addition to 9 high-resourced, “pivot” languages, the team is targeting 26 lesser resourced languages, in particular languages of Africa, South Asia and South-East Asia, whose populations may be the most vulnerable to the spread of the virus. The same data is translated into all of the languages represented, meaning that testing or development can be done for any pairing of languages in the set. Further, the team is converting the test and development data into translation memories (TMXs) that can be used by localizers from and to any of the languages.

Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Michael Denkowski | Christian Federmann
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Assessing Human-Parity in Machine Translation on the Segment Level
Yvette Graham | Christian Federmann | Maria Eskevich | Barry Haddow
Findings of the Association for Computational Linguistics: EMNLP 2020

Recent machine translation shared tasks have shown top-performing systems to tie or in some cases even outperform human translation. Such conclusions about system and human performance are, however, based on estimates aggregated from scores collected over large test sets of translations and unfortunately leave some remaining questions unanswered. For instance, simply because a system significantly outperforms the human translator on average may not necessarily mean that it has done so for every translation in the test set. Firstly, are there remaining source segments present in evaluation test sets that cause significant challenges for top-performing systems and can such challenging segments go unnoticed due to the opacity of current human evaluation procedures? To provide insight into these issues we carefully inspect the outputs of top-performing systems in the most recent WMT-19 news translation shared task for all language pairs in which a system either tied or outperformed human translation. Our analysis provides a new method of identifying the remaining segments for which either machine or human perform poorly. For example, in our close inspection of WMT-19 English to German and German to English we discover the segments that disjointly proved a challenge for human and machine. For English to Russian, there were no segments included in our sample of translations that caused a significant challenge for the human translator, while we again identify the set of segments that caused issues for the top-performing system.

2019

Multilingual Whispers: Generating Paraphrases with Translation
Christian Federmann | Oussama Elachqar | Chris Quirk
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

Naturally occurring paraphrase data, such as multiple news stories about the same event, is a useful but rare resource. This paper compares translation-based paraphrase gathering using human, automatic, or hybrid techniques to monolingual paraphrasing by experts and non-experts. We gather translations, paraphrases, and empirical human quality assessments of these approaches. Neural machine translation techniques, especially when pivoting through related languages, provide a relatively robust source of paraphrases with diversity comparable to expert human paraphrases. Surprisingly, human translators do not reliably outperform neural systems. The resulting data release will not only be a useful test set, but will also allow additional explorations in translation and paraphrase quality assessments and relationships.

Domain Adaptation of Document-Level NMT in IWSLT19
Martin Popel | Christian Federmann
Proceedings of the 16th International Conference on Spoken Language Translation

We describe our four NMT systems submitted to the IWSLT19 shared task in English→Czech text-to-text translation of TED talks. The goal of this study is to understand the interactions between document-level NMT and domain adaptation. All our systems are based on the Transformer model implemented in the Tensor2Tensor framework. Two of the systems serve as baselines, which are not adapted to the TED talks domain: SENTBASE is trained on single sentences, DOCBASE on multi-sentence (document-level) sequences. The other two submitted systems are adapted to TED talks: SENTFINE is fine-tuned on single sentences, DOCFINE is fine-tuned on multi-sentence sequences. We present both automatic-metrics evaluation and manual analysis of the translation quality, focusing on the differences between the four systems.

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

Findings of the WMT 2019 Shared Tasks on Quality Estimation
Erick Fonseca | Lisa Yankovskaya | André F. T. Martins | Mark Fishel | Christian Federmann
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We report the results of the WMT19 shared task on Quality Estimation, i.e. the task of predicting the quality of the output of machine translation systems given just the source text and the hypothesis translations. The task includes estimation at three granularity levels: word, sentence and document. A novel addition is evaluating sentence-level QE against human judgments: in other words, designing MT metrics that do not need a reference translation. This year we include three language pairs, produced solely by neural machine translation systems. Participating teams from eleven institutions submitted a variety of systems to different task variants and language pairs.

Findings of the WMT 2019 Shared Task on Automatic Post-Editing
Rajen Chatterjee | Christian Federmann | Matteo Negri | Marco Turchi
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We present the results from the 5th round of the WMT task on MT Automatic Post-Editing. The task consists in automatically correcting the output of a “black-box” machine translation system by learning from human corrections. Keeping the same general evaluation setting of the previous four rounds, this year we focused on two language pairs (English-German and English-Russian) and on domain-specific data (Information Technology). For both the language directions, MT outputs were produced by neural systems unknown to participants. Seven teams participated in the English-German task, with a total of 18 submitted runs. The evaluation, which was performed on the same test set used for the 2018 round, shows a slight progress in APE technology: 4 teams achieved better results than last year’s winning system, with improvements up to -0.78 TER and +1.23 BLEU points over the baseline. Two teams participated in the English-Russian task submitting 2 runs each. On this new language direction, characterized by a higher quality of the original translations, the task proved to be particularly challenging. None of the submitted runs improved the very high results of the strong system used to produce the initial translations (16.16 TER, 76.20 BLEU).

2018

Machine Translation Human Evaluation: an investigation of evaluation based on Post-Editing and its relation with Direct Assessment
Luisa Bentivogli | Mauro Cettolo | Marcello Federico | Christian Federmann
Proceedings of the 15th International Conference on Spoken Language Translation

In this paper we present an analysis of the two most prominent methodologies used for the human evaluation of MT quality, namely evaluation based on Post-Editing (PE) and evaluation based on Direct Assessment (DA). To this purpose, we exploit a publicly available large dataset containing both types of evaluations. We first focus on PE and investigate how sensitive TER-based evaluation is to the type and number of references used. Then, we carry out a comparative analysis of PE and DA to investigate the extent to which the evaluation results obtained by methodologies addressing different human perspectives are similar. This comparison sheds light not only on PE but also on the so-called reference bias related to monolingual DA. Also, we analyze if and how the two methodologies can complement each other’s weaknesses.

Iterative Data Augmentation for Neural Machine Translation: a Low Resource Case Study for English-Telugu
Sandipan Dandapat | Christian Federmann
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Telugu is the fifteenth most commonly spoken language in the world with an estimated reach of 75 million people in the Indian subcontinent. At the same time, it is a severely low resourced language. In this paper, we present work on English–Telugu general domain machine translation (MT) systems using small amounts of parallel data. The baseline statistical (SMT) and neural MT (NMT) systems do not yield acceptable translation quality, mostly due to limited resources. However, the use of synthetic parallel data (generated using back translation, based on an NMT engine) significantly improves translation quality and allows NMT to outperform SMT. We extend back translation and propose a new, iterative data augmentation (IDA) method. Filtering of synthetic data and IDA both further boost translation quality of our final NMT systems, as measured by BLEU scores on all test sets and based on state-of-the-art human evaluation.

Findings of the 2018 Conference on Machine Translation (WMT18)
Ondřej Bojar | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Philipp Koehn | Christof Monz
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2018. Participants were asked to build machine translation systems for any of 7 language pairs in both directions, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. This year, we also opened up the task to additional test sets to probe specific aspects of translation.

Appraise Evaluation Framework for Machine Translation
Christian Federmann
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

We present Appraise, an open-source framework for crowd-based annotation tasks, notably for evaluation of machine translation output. This is the software used to run the yearly evaluation campaigns for shared tasks at the WMT Conference on Machine Translation. It has also been used at IWSLT 2017 and, recently, to measure human parity for machine translation for Chinese to English news text. The demo will present the full end-to-end lifecycle of an Appraise evaluation campaign, from task creation to annotation and interpretation of results.

2017

The Microsoft Speech Language Translation (MSLT) Corpus for Chinese and Japanese: Conversational Test data for Machine Translation and Speech Recognition
Christian Federmann | William D. Lewis
Proceedings of Machine Translation Summit XVI: Research Track

Proceedings of the Second Conference on Machine Translation
Ondřej Bojar | Christian Buck | Rajen Chatterjee | Christian Federmann | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Julia Kreutzer
Proceedings of the Second Conference on Machine Translation

Overview of the IWSLT 2017 Evaluation Campaign
Mauro Cettolo | Marcello Federico | Luisa Bentivogli | Jan Niehues | Sebastian Stüker | Katsuhito Sudoh | Koichiro Yoshino | Christian Federmann
Proceedings of the 14th International Conference on Spoken Language Translation

The IWSLT 2017 evaluation campaign has organised three tasks. The Multilingual task, which is about training machine translation systems handling many-to-many language directions, including so-called zero-shot directions. The Dialogue task, which calls for the integration of context information in machine translation, in order to resolve anaphoric references that typically occur in human-human dialogue turns. And, finally, the Lecture task, which offers the challenge of automatically transcribing and translating real-life university lectures. Following the tradition of these reports, we describe all tasks in detail and present the results of all runs submitted by their participants.

2016

Microsoft Speech Language Translation (MSLT) Corpus: The IWSLT 2016 release for English, French and German
Christian Federmann | William D. Lewis
Proceedings of the 13th International Conference on Spoken Language Translation

We describe the Microsoft Speech Language Translation (MSLT) corpus, which was created in order to evaluate end-to-end conversational speech translation quality. The corpus was created from actual conversations over Skype, and we provide details on the recording setup and the different layers of associated text data. The corpus release includes Test and Dev sets with reference transcripts for speech recognition. Additionally, cleaned up transcripts and reference translations are available for evaluation of machine translation quality. The IWSLT 2016 release described here includes the source audio, raw transcripts, cleaned up transcripts, and translations to or from English for both French and German.

2015

Applying cross-entropy difference for selecting parallel training data from publicly available sources for conversational machine translation
William Lewis | Christian Federmann | Ying Xin
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

Findings of the 2015 Workshop on Statistical Machine Translation
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Barry Haddow | Matthias Huck | Chris Hokamp | Philipp Koehn | Varvara Logacheva | Christof Monz | Matteo Negri | Matt Post | Carolina Scarton | Lucia Specia | Marco Turchi
Proceedings of the Tenth Workshop on Statistical Machine Translation

Proceedings of the Tenth Workshop on Statistical Machine Translation
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Barry Haddow | Chris Hokamp | Matthias Huck | Varvara Logacheva | Pavel Pecina
Proceedings of the Tenth Workshop on Statistical Machine Translation

2014

Findings of the 2014 Workshop on Statistical Machine Translation
Ondřej Bojar | Christian Buck | Christian Federmann | Barry Haddow | Philipp Koehn | Johannes Leveling | Christof Monz | Pavel Pecina | Matt Post | Herve Saint-Amand | Radu Soricut | Lucia Specia | Aleš Tamchyna
Proceedings of the Ninth Workshop on Statistical Machine Translation

Proceedings of the Ninth Workshop on Statistical Machine Translation
Ondřej Bojar | Christian Buck | Christian Federmann | Barry Haddow | Philipp Koehn | Christof Monz | Matt Post | Lucia Specia
Proceedings of the Ninth Workshop on Statistical Machine Translation

2013

Findings of the 2013 Workshop on Statistical Machine Translation
Ondřej Bojar | Christian Buck | Chris Callison-Burch | Christian Federmann | Barry Haddow | Philipp Koehn | Christof Monz | Matt Post | Radu Soricut | Lucia Specia
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

Can Machine Learning Algorithms Improve Phrase Selection in Hybrid Machine Translation?
Christian Federmann
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

Involving Language Professionals in the Evaluation of Machine Translation
Eleftherios Avramidis | Aljoscha Burchardt | Christian Federmann | Maja Popović | Cindy Tscherwinka | David Vilar
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Significant breakthroughs in machine translation only seem possible if human translators are taken into the loop. While automatic evaluation and scoring mechanisms such as BLEU have enabled the fast development of systems, it is not clear how systems can meet real-world (quality) requirements in industrial translation scenarios today. The taraXÜ project paves the way for wide usage of hybrid machine translation outputs through various feedback loops in system development. In a consortium of research and industry partners, the project integrates human translators into the development process for rating and post-editing of machine translation outputs thus collecting feedback for possible improvements.

Using Domain-specific and Collaborative Resources for Term Translation
Mihael Arcan | Christian Federmann | Paul Buitelaar
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT
Josef van Genabith | Toni Badia | Christian Federmann | Maite Melero | Marta R. Costa-jussà | Tsuyoshi Okita
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

The ML4HMT Workshop on Optimising the Division of Labour in Hybrid Machine Translation
Christian Federmann | Eleftherios Avramidis | Marta R. Costa-jussà | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation (ML4HMT) which aims to foster research on improved system combination approaches for machine translation (MT). Participants of the challenge are requested to build hybrid translations by combining the output of several MT systems of different types. We first describe the ML4HMT corpus used in the shared task, then explain the XLIFF-based annotation format we have designed for it, and briefly summarize the participating systems. Using both automated metrics scores and extensive manual evaluation, we discuss the individual performance of the various systems. An interesting result from the shared task is the fact that we were able to observe different systems winning according to the automated metrics scores when compared to the results from the manual evaluation. We conclude by summarising the first edition of the challenge and by giving an outlook to future work.

A Richly Annotated, Multilingual Parallel Corpus for Hybrid Machine Translation
Eleftherios Avramidis | Marta R. Costa-jussà | Christian Federmann | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In recent years, machine translation (MT) research has focused on investigating how hybrid machine translation as well as system combination approaches can be designed so that the resulting hybrid translations show an improvement over the individual component translations. As a first step towards achieving this objective we have developed a parallel corpus with source text and the corresponding translation output from a number of machine translation engines, annotated with metadata information, capturing aspects of the translation process performed by the different MT systems. This corpus aims to serve as a basic resource for further research on whether hybrid machine translation algorithms and system combination techniques can benefit from additional (linguistically motivated, decoding, and runtime) information provided by the different systems involved. In this paper, we describe the annotated corpus we have created. We provide an overview on the component MT systems and the XLIFF-based annotation format we have developed. We also report on first experiments with the ML4HMT corpus data.

META-SHARE v2: An Open Network of Repositories for Language Resources including Data and Tools
Christian Federmann | Ioanna Giannopoulou | Christian Girardi | Olivier Hamon | Dimitris Mavroeidis | Salvatore Minutoli | Marc Schröder
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe META-SHARE, which aims at providing an open, distributed, secure, and interoperable infrastructure for the exchange of language resources, including both data and tools. The application has been designed and is developed as part of the T4ME Network of Excellence. We explain the underlying motivation for such a distributed repository for metadata storage and give a detailed overview of the META-SHARE application and its various components. This includes a discussion of the technical architecture of the system as well as a description of the component-based metadata schema format which has been developed in parallel. Development of the META-SHARE infrastructure adopts state-of-the-art technology and follows an open-source approach, allowing the general community to participate in the development process. The META-SHARE software package including full source code was released to the public in March 2012. We look forward to presenting an up-to-date version of the META-SHARE software at the conference.

Results from the ML4HMT-12 Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation
Christian Federmann | Tsuyoshi Okita | Maite Melero | Marta R. Costa-Jussa | Toni Badia | Josef van Genabith
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

Machine Learning for Hybrid Machine Translation
Sabine Hunsicker | Yu Chen | Christian Federmann
Proceedings of the Seventh Workshop on Statistical Machine Translation

Experiments with Term Translation
Mihael Arcan | Christian Federmann | Paul Buitelaar
Proceedings of COLING 2012

Hybrid Machine Translation Using Joint, Binarised Feature Vectors
Christian Federmann
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

We present an approach for Hybrid Machine Translation, based on a Machine-Learning framework. Our method combines output from several source systems. We first define an extensible, total order on translations and use it to estimate a ranking on the sentence level for a given set of systems. We introduce and define the notion of joint, binarised feature vectors. We train an SVM-based classifier and show how its classification results can be used to create hybrid translations. We describe a series of oracle experiments on data sets from the WMT11 translation task in order to find an upper bound regarding the achievable level of translation quality. We also present results from first experiments with an implemented version of our system. Evaluation using NIST and BLEU metrics indicates that the proposed method can outperform its individual source systems. An interesting finding is that our approach makes it possible to leverage good translations from otherwise bad systems, as the translation quality estimation is based on sentence-level phenomena rather than corpus-level metrics. We conclude by summarising our findings and by giving an outlook to future work.

System Combination Using Joint, Binarised Feature Vectors
Christian Federmann
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

2011

From Statistical Term Extraction to Hybrid Machine Translation
Petra Wolf | Ulrike Bernardi | Christian Federmann | Sabine Hunsicker
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

Stochastic Parse Tree Selection for an Existing RBMT System
Christian Federmann | Sabine Hunsicker
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

Appraise: An Open-Source Toolkit for Manual Phrase-Based Evaluation of Translations
Christian Federmann
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a focused effort to investigate the performance of phrase-based, human evaluation of machine translation output achieving a high annotator agreement. We define phrase-based evaluation and describe the implementation of Appraise, a toolkit that supports the manual evaluation of machine translation results. Phrase ranking can be done using either a fine-grained six-way scoring scheme that allows annotators to differentiate between "much better" and "slightly better", or a reduced subset of ranking choices. Afterwards we discuss kappa values for both scoring models from several experiments conducted with human annotators. Our results show that phrase-based evaluation can be used for fast evaluation obtaining significant agreement among annotators. The granularity of ranking choices should, however, not be too fine-grained as this seems to confuse annotators and thus reduces the overall agreement. The work reported in this paper confirms previous work in the field and illustrates that the usage of human evaluation in machine translation should be reconsidered. The Appraise toolkit is available as open-source and can be downloaded from the author's website.

Further Experiments with Shallow Hybrid MT Systems
Christian Federmann | Andreas Eisele | Yu Chen | Sabine Hunsicker | Jia Xu | Hans Uszkoreit
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Extraction, Merging, and Monitoring of Company Data from Heterogeneous Sources
Christian Federmann | Thierry Declerck
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe the implementation of an enterprise monitoring system that builds on an ontology-based information extraction (OBIE) component applied to heterogeneous data sources. The OBIE component consists of several IE modules - each extracting on a regular temporal basis a specific fraction of company data from a given data source - and a merging tool, which is used to aggregate all the extracted information about a company. The full set of information about companies, which is to be extracted and merged by the OBIE component, is given in the schema of a domain ontology, which is guiding the information extraction process. The monitoring system, in case it detects changes in the extracted and merged information on a company with respect to the actual state of the knowledge base of the underlying ontology, ensures the update of the population of the ontology. As we are using an ontology extended with temporal information, the system is able to assign time intervals to any of the object instances. Additionally, detected changes can be communicated to end-users, who can validate and possibly correct the resulting updates in the knowledge base.

2009

Translation Combination using Factored Word Substitution
Christian Federmann | Silke Theison | Andreas Eisele | Hans Uszkoreit | Yu Chen | Michael Jellinghaus | Sabine Hunsicker
Proceedings of the Fourth Workshop on Statistical Machine Translation

Combining Multi-Engine Translations with Moses
Yu Chen | Michael Jellinghaus | Andreas Eisele | Yi Zhang | Sabine Hunsicker | Silke Theison | Christian Federmann | Hans Uszkoreit
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

Using Moses to Integrate Multiple Rule-Based Machine Translation Engines into a Hybrid System
Andreas Eisele | Christian Federmann | Hervé Saint-Amand | Michael Jellinghaus | Teresa Herrmann | Yu Chen
Proceedings of the Third Workshop on Statistical Machine Translation

Hybrid machine translation architectures within and beyond the EuroMatrix project
Andreas Eisele | Christian Federmann | Hans Uszkoreit | Hervé Saint-Amand | Martin Kay | Michael Jellinghaus | Sabine Hunsicker | Teresa Herrmann | Yu Chen
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

Extracting and Querying Relations in Scientific Papers on Language Technology
Ulrich Schäfer | Hans Uszkoreit | Christian Federmann | Torsten Marek | Yajing Zhang
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We describe methods for extracting interesting factual relations from scientific texts in computational linguistics and language technology taken from the ACL Anthology. We use a hybrid NLP architecture with shallow preprocessing for increased robustness and domain-specific, ontology-based named entity recognition, followed by a deep HPSG parser running the English Resource Grammar (ERG). The extracted relations in the MRS (minimal recursion semantics) format are simplified and generalized using WordNet. The resulting quriples are stored in a database from where they can be retrieved (again using abstraction methods) by relation-based search. The query interface is embedded in a web browser-based application we call the Scientists Workbench. It supports researchers in editing and online-searching scientific papers.

2007

Multi-Engine Machine Translation with an Open-Source SMT Decoder
Yu Chen | Andreas Eisele | Christian Federmann | Eva Hasler | Michael Jellinghaus | Silke Theison
Proceedings of the Second Workshop on Statistical Machine Translation

Co-authors

Rajen Chatterjee 17

Yvette Graham 17

Matteo Negri 16

Marco Turchi 15

Antonio Jimeno Yepes 12

Marta R. Costa-jussà 11

Roman Grundkiewicz 10

Lucia Specia 10

Aurelie Neveol 9

Mariana Neves 9

Karin Verspoor 8

Sabine Hunsicker 7

André F. T. Martins 7

Makoto Morishita 7

Masaaki Nagata 7

Loic Barrault 6

Christian Buck 6

Andreas Eisele 6

Toshiaki Nakazawa 6

Eleftherios Avramidis 5

Marcello Federico 5

Markus Freitag 5

Michael Jellinghaus 5

Hans Uszkoreit 5

Marcos Zampieri 5

Antonios Anastasopoulos 4

Varvara Logacheva 4

Maja Popović 4

Josef van Genabith 4

Rachel Bawden 3

Luisa Bentivogli 3

Fethi Bougares 3

Anton Dvorkovich 3

Alexander Fraser 3

Marcin Junczys-Dowmunt 3

William Lewis 3

Kenton Murray 3

Herve Saint-Amand 3

Sebastian Stüker 3

Silke Theison 3

Magdalena Biesialska 2

Paul Buitelaar 2

Roldano Cattoni 2

Mauro Cettolo 2

Liane Guillou 2

Francisco Guzmán 2

Teresa Herrmann 2

Jean Maillard 2

Benjamin Marie 2

Hitokazu Matsushita 2

Tsuyoshi Okita 2

Raphael Rubino 2

Elizabeth Salesky 2

Carolina Scarton 2

Mariya Shmatova 2

Katsuhito Sudoh 2

Jörg Tiedemann 2

Changhan Wang 2

Vilém Zouhar 2

David Ifeoluwa Adelani 1

Farhad Akhbardeh 1

Md Mahfuz Ibn Alam 1

Kwabena Amponsah-Kaakyire 1

Ebrahim Ansari 1

Arkady Arkhangorodsky 1

Amittai Axelrod 1

Ulrike Bernardi 1

Akshita Bhagia 1

Aljoscha Burchardt 1

Laurie Burchell 1

Chris Callison-Burch 1

Alessandro Cattelan 1

Vishrav Chaudhary 1

Sandipan Dandapat 1

Thierry Declerck 1

Michael Denkowski 1

Shuoyang Ding 1

Georgiana Dinu 1

Nadir Durrani 1

Oussama Elachqar 1

Clara Emmanuel 1

Maria Eskevich 1

Cristina España-Bonet 1

Yannick Estève 1

Natalia Fedorova 1

Erick Fonseca 1

Souhir Gahbiche 1

Dmitriy Genzel 1

Ioanna Giannopoulou 1

Christian Girardi 1

Olivier Hamon 1

Leonie Harter 1

Kenneth Heafield 1

Christopher Homan 1

Shujian Huang (书剑黄) 1

Macduff Hughes 1

Dávid Javorský 1

Marzena Karpinska 1

Daniel Khashabi 1

Věra Kloudová 1

Rebecca Knowles 1

Sergey Koshelev 1

Julia Kreutzer 1

Surafel Lakew 1

Johannes Leveling 1

Nikola Ljubešić 1

Nicholas Lourie 1

Monishwaran Maheswaran 1

Shervin Malmasi 1

Torsten Marek 1

Vukosi Marivate 1

Prashant Mathur 1

Dimitris Mavroeidis 1

Jonathan Mbuya 1

Salvatore Minutoli 1

Alexandre Mourachko 1

Mathias Müller 1

Maria Nadejde 1

Satoshi Nakamura 1

Graham Neubig 1

Michal Novák 1

Safiyyah Saleem 1

Marc Schröder 1

Holger Schwenk 1

Ulrich Schäfer 1

Matthias Sperber 1

Steinþór Steingrímsson 1

Aleš Tamchyna 1

Allahsera Auguste Tapo 1

Cindy Tscherwinka 1

Yogesh Virkar 1

Valentin Vydrin 1

Shinji Watanabe 1

Guillaume Wenzek 1

Lisa Yankovskaya 1

Koichiro Yoshino 1

Marcely Zanon Boito 1

Venues