Aleš Tamchyna
2025
CUNI and Phrase at WMT25 MT Evaluation Task
Miroslav Hrabal | Ondrej Glembek | Aleš Tamchyna | Almut Silja Hildebrand | Alan Eckhard | Miroslav Štola | Sergio Penkale | Zuzana Šimečková | Ondřej Bojar | Alon Lavie | Craig Stewart
Proceedings of the Tenth Conference on Machine Translation
This paper describes the joint effort of Phrase a.s. and Charles University’s Institute of Formal and Applied Linguistics (CUNI/UFAL) on the WMT25 Automated Translation Quality Evaluation Systems Shared Task. The two teams participated both collaboratively and competitively, i.e. each submitted a system of its own as well as a contrastive joint system ensemble. In Task 1, we show that such ensembling, if chosen in a clever way, can lead to a performance boost. We present an analysis of various kinds of systems comprising both a “traditional” NN-based approach and different flavours of LLMs: off-the-shelf commercial models, their fine-tuned versions, and in-house, custom-trained alternative models. In Tasks 2 and 3, we show Phrase’s approach to tackling the tasks via various GPT models: Error Span Annotation via the complete MQM solution using non-reasoning models (including fine-tuned versions) in Task 2, and using reasoning models in Task 3.
2023
Bad MT Systems are Good for Quality Estimation
Iryna Tryhubyshyn | Aleš Tamchyna | Ondřej Bojar
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track
Quality estimation (QE) is the task of predicting quality of outputs produced by machine translation (MT) systems. Currently, the highest-performing QE systems are supervised and require training on data with golden quality scores. In this paper, we investigate the impact of the quality of the underlying MT outputs on the performance of QE systems. We find that QE models trained on datasets with lower-quality translations often outperform those trained on higher-quality data. We also demonstrate that good performance can be achieved by using a mix of data from different MT systems.
2021
Neural Machine Translation Quality and Post-Editing Performance
Vilém Zouhar | Martin Popel | Ondřej Bojar | Aleš Tamchyna
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
We test the natural expectation that using MT in professional translation saves human processing time. The last such study was carried out by Sanchez-Torron and Koehn (2016) with phrase-based MT, artificially reducing the translation quality. In contrast, we focus on neural MT (NMT) of high quality, which has become the state-of-the-art approach since then and also got adopted by most translation companies. Through an experimental study involving over 30 professional translators for English -> Czech translation, we examine the relationship between NMT performance and post-editing time and quality. Across all models, we found that better MT systems indeed lead to fewer changes in the sentences in this industry setting. The relation between system quality and post-editing time is however not straightforward and, contrary to the results on phrase-based MT, BLEU is definitely not a stable predictor of the time or final output quality.
Deploying MT Quality Estimation on a large scale: Lessons learned and open questions
Aleš Tamchyna
Proceedings of Machine Translation Summit XVIII: Users and Providers Track
This talk will focus on Memsource’s experience implementing MT Quality Estimation on a large scale within a translation management system. We will cover the whole development journey: from our early experimentation and the challenges we faced adapting academic models for a real world setting, all the way through to the practical implementation. Since the launch of this feature, we’ve accumulated a significant amount of experience and feedback, which has informed our subsequent development. Lastly we will discuss several open questions regarding the future role of quality estimation in translation.
2020
Selection of MT Systems in Translation Workflows
Aleš Tamchyna
Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)
2017
Producing Unseen Morphological Variants in Statistical Machine Translation
Matthias Huck | Aleš Tamchyna | Ondřej Bojar | Alexander Fraser
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers
Translating into morphologically rich languages is difficult. Although the coverage of lemmas may be reasonable, many morphological variants cannot be learned from the training data. We present a statistical translation system that is able to produce these inflected word forms. Different from most previous work, we do not separate morphological prediction from lexical choice into two consecutive steps. Our approach is novel in that it is integrated in decoding and takes advantage of context information from both the source language and the target language sides.
Modeling Target-Side Inflection in Neural Machine Translation
Aleš Tamchyna | Marion Weller-Di Marco | Alexander Fraser
Proceedings of the Second Conference on Machine Translation
2016
Manual and Automatic Paraphrases for MT Evaluation
Aleš Tamchyna | Petra Barančíková
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Paraphrasing of reference translations has been shown to improve the correlation with human judgements in automatic evaluation of machine translation (MT) outputs. In this work, we present a new dataset for evaluating English-Czech translation based on automatic paraphrases. We compare this dataset with an existing set of manually created paraphrases and find that even automatic paraphrases can improve MT evaluation. We also propose and evaluate several criteria for selecting suitable reference translations from a larger set.
Target-Side Context for Discriminative Models in Statistical Machine Translation
Aleš Tamchyna | Alexander Fraser | Ondřej Bojar | Marcin Junczys-Dowmunt
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
UFAL at SemEval-2016 Task 5: Recurrent Neural Networks for Sentence Classification
Aleš Tamchyna | Kateřina Veselovská
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
A Framework for Discriminative Rule Selection in Hierarchical Moses
Fabienne Braune | Alexander Fraser | Hal Daumé III | Aleš Tamchyna
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers
The QT21/HimL Combined Machine Translation System
Jan-Thorsten Peter | Tamer Alkhouli | Hermann Ney | Matthias Huck | Fabienne Braune | Alexander Fraser | Aleš Tamchyna | Ondřej Bojar | Barry Haddow | Rico Sennrich | Frédéric Blain | Lucia Specia | Jan Niehues | Alex Waibel | Alexandre Allauzen | Lauriane Aufrant | Franck Burlot | Elena Knyazeva | Thomas Lavergne | François Yvon | Mārcis Pinnis | Stella Frank
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
CUNI-LMU Submissions in WMT2016: Chimera Constrained and Beaten
Aleš Tamchyna | Roman Sudarikov | Ondřej Bojar | Alexander Fraser
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers
2015
CUNI in WMT15: Chimera Strikes Again
Ondřej Bojar | Aleš Tamchyna
Proceedings of the Tenth Workshop on Statistical Machine Translation
A Discriminative Model for Semantics-to-String Translation
Aleš Tamchyna | Chris Quirk | Michel Galley
Proceedings of the 1st Workshop on Semantics-Driven Statistical Machine Translation (S2MT 2015)
What a Transfer-Based System Brings to the Combination with PBMT
Aleš Tamchyna | Ondřej Bojar
Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)
2014
HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation
Ondřej Bojar | Vojtěch Diatka | Pavel Rychlý | Pavel Straňák | Vít Suchomel | Aleš Tamchyna | Daniel Zeman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both the corpora are freely available for non-commercial research and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.
Improving Evaluation of English-Czech MT through Paraphrasing
Petra Barančíková | Rudolf Rosa | Aleš Tamchyna
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we present a method of improving the accuracy of machine translation evaluation of Czech sentences. Given a reference sentence, our algorithm transforms it by targeted paraphrasing into a new synthetic reference sentence that is closer in wording to the machine translation output, but at the same time preserves the meaning of the original reference sentence. Grammatical correctness of the new reference sentence is provided by applying Depfix on newly created paraphrases. Depfix is a system for post-editing English-to-Czech machine translation outputs. We adjusted it to fix the errors in paraphrased sentences. Due to a noisy source of our paraphrases, we experiment with adding word alignment. However, the alignment reduces the number of paraphrases found and the best results were achieved by a simple greedy method with only one-word paraphrases thanks to their intensive filtering. BLEU scores computed using these new reference sentences show significantly higher correlation with human judgment than scores computed on the original reference sentences.
ÚFAL: Using Hand-crafted Rules in Aspect Based Sentiment Analysis on Parsed Data
Kateřina Veselovská | Aleš Tamchyna
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)
Findings of the 2014 Workshop on Statistical Machine Translation
Ondřej Bojar | Christian Buck | Christian Federmann | Barry Haddow | Philipp Koehn | Johannes Leveling | Christof Monz | Pavel Pecina | Matt Post | Herve Saint-Amand | Radu Soricut | Lucia Specia | Aleš Tamchyna
Proceedings of the Ninth Workshop on Statistical Machine Translation
CUNI in WMT14: Chimera Still Awaits Bellerophon
Aleš Tamchyna | Martin Popel | Rudolf Rosa | Ondřej Bojar
Proceedings of the Ninth Workshop on Statistical Machine Translation
Machine Translation of Medical Texts in the Khresmoi Project
Ondřej Dušek | Jan Hajič | Jaroslava Hlaváčová | Michal Novák | Pavel Pecina | Rudolf Rosa | Aleš Tamchyna | Zdeňka Urešová | Daniel Zeman
Proceedings of the Ninth Workshop on Statistical Machine Translation
2013
Deepfix: Statistical Post-editing of Statistical Machine Translation Using Deep Syntactic Analysis
Rudolf Rosa | David Mareček | Aleš Tamchyna
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop
Chimera – Three Heads for English-to-Czech Translation
Ondřej Bojar | Rudolf Rosa | Aleš Tamchyna
Proceedings of the Eighth Workshop on Statistical Machine Translation
2012
The Joy of Parallelism with CzEng 1.0
Ondřej Bojar | Zdeněk Žabokrtský | Ondřej Dušek | Petra Galuščáková | Martin Majliš | David Mareček | Jiří Maršík | Michal Novák | Martin Popel | Aleš Tamchyna
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the amount of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain text representation, but also automatic morphological tags, surface syntactic as well as deep syntactic dependency parse trees and automatic co-reference links in both English and Czech. This paper describes key properties of the released resource including the distribution of text domains, the corpus data formats, and a toolkit to handle the provided rich annotation. We also summarize the procedure of the rich annotation (incl. co-reference resolution) and of the automatic filtering. Finally, we provide some suggestions on exploiting such an automatically annotated sentence-parallel corpus.
Selecting Data for English-to-Czech Machine Translation
Aleš Tamchyna | Petra Galuščáková | Amir Kamran | Miloš Stanojević | Ondřej Bojar
Proceedings of the Seventh Workshop on Statistical Machine Translation
2011
Co-authors
- Ondřej Bojar 16
- Alexander Fraser 6
- Rudolf Rosa 5
- Martin Popel 3
- Petra Barančíková 2
- Fabienne Braune 2
- Ondřej Dušek 2
- Petra Galuščáková 2
- Barry Haddow 2
- Matthias Huck 2
- David Mareček 2
- Michal Novák 2
- Pavel Pecina 2
- Lucia Specia 2
- Kateřina Veselovská 2
- Daniel Zeman 2
- Tamer Alkhouli 1
- Alexandre Allauzen 1
- Lauriane Aufrant 1
- Frédéric Blain 1
- Christian Buck 1
- Franck Burlot 1
- Hal Daumé III 1
- Vojtěch Diatka 1
- Alan Eckhard 1
- Christian Federmann 1
- Stella Frank 1
- Michel Galley 1
- Ondrej Glembek 1
- Jan Hajič 1
- Almut Silja Hildebrand 1
- Jaroslava Hlaváčová 1
- Miroslav Hrabal 1
- Marcin Junczys-Dowmunt 1
- Amir Kamran 1
- Elena Knyazeva 1
- Philipp Koehn 1
- Thomas Lavergne 1
- Alon Lavie 1
- Johannes Leveling 1
- Martin Majliš 1
- Jiří Maršík 1
- Christof Monz 1
- Hermann Ney 1
- Jan Niehues 1
- Sergio Penkale 1
- Jan-Thorsten Peter 1
- Mārcis Pinnis 1
- Matt Post 1
- Chris Quirk 1
- Pavel Rychlý 1
- Herve Saint-Amand 1
- Rico Sennrich 1
- Radu Soricut 1
- Miloš Stanojević 1
- Craig Stewart 1
- Pavel Straňák 1
- Vít Suchomel 1
- Roman Sudarikov 1
- Iryna Tryhubyshyn 1
- Zdeňka Urešová 1
- Alex Waibel 1
- Marion Weller-Di Marco 1
- François Yvon 1
- Vilém Zouhar 1
- Zuzana Šimečková 1
- Miroslav Štola 1
- Zdeněk Žabokrtský 1