Miriam Exel

2025

pdf bib abs
Quality-Aware Decoding: Unifying Quality Estimation and Decoding
Sai Koneru | Matthias Huck | Miriam Exel | Jan Niehues
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

Quality Estimation (QE) models for Neural Machine Translation (NMT) predict the quality of the hypothesis without having access to the reference. An emerging research direction in NMT involves the use of QE models, which have demonstrated high correlations with human judgment and can enhance translations through Quality-Aware Decoding. Although several approaches have been proposed based on sampling multiple candidate translations and picking the best candidate, none have integrated these models directly into the decoding process. In this paper, we address this by proposing a novel token-level QE model capable of reliably scoring partial translations. We build a uni-directional QE model for this, as decoder models are inherently trained and efficient on partial sequences. We then present a decoding strategy that integrates the QE model for Quality-Aware decoding and demonstrate that the translation quality improves when compared to the N-best list re-ranking with state-of-the-art QE models (up to 1.39 XCOMET-XXL). Finally, we show that our approach provides significant benefits in document translation tasks, where the quality of N-best lists is typically suboptimal. Code can be found at https://github.com/SAP-samples/quality-aware-decoding-translation.

2024

pdf bib abs
How Effective is Synthetic Data and Instruction Fine-tuning for Translation with Markup using LLMs?
Raj Dabre | Haiyue Song | Miriam Exel | Bianka Buschbeck | Johannes Eschbach-Dymanus | Hideki Tanaka
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Recent works have shown that prompting large language models (LLMs) is effective for translation with markup where LLMs can simultaneously transfer markup tags while ensuring that the content, both inside and outside tag pairs is correctly translated. However, these works make a rather unrealistic assumption of the existence of high-quality parallel sentences with markup for prompting. Furthermore, the impact of instruction fine-tuning (IFT) in this setting is unknown. In this paper, we provide a study, the first of its kind, focusing on the effectiveness of synthetically created markup data and IFT for translation with markup using LLMs. We focus on translation from English to five European languages, German, French, Dutch, Finnish and Russian, where we show that regardless of few-shot prompting or IFT, synthetic data created via word alignments, while leading to inferior markup transfer compared to using original data with markups, does not negatively impact the translation quality. Furthermore, IFT mainly impacts the translation quality compared to few-shot prompting and has slightly better markup transfer capabilities than the latter. We hope our work will help practitioners make effective decisions on modeling choices for LLM based translation with markup.

pdf bib abs
Exploring the Effectiveness of LLM Domain Adaptation for Business IT Machine Translation
Johannes Eschbach-Dymanus | Frank Essenberger | Bianka Buschbeck | Miriam Exel
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

In this paper, we study the translation abilities of Large Language Models (LLMs) for business IT texts.We are strongly interested in domain adaptation of translation systems, which is essential for accurate and lexically appropriate translation of such texts.Among the open-source models evaluated in a zero- and few-shot setting, we find Llama-2 13B the most promising for domain-specific translation fine-tuning.We investigate the full range of adaptation techniques for LLMs: from prompting, over parameter-efficient fine-tuning to full fine-tuning, and compare to classic neural machine translation (MT) models trained internally at SAP.We provide guidance how to use training budget most effectively for different fine-tuning approaches.We observe that while LLMs can translate on-par with SAP’s MT models on general domain data, it is difficult to close the gap on SAP’s domain-specific data, even with extensive training and carefully curated data.

pdf bib abs
Prompting Large Language Models with Human Error Markings for Self-Correcting Machine Translation
Nathaniel Berger | Stefan Riezler | Miriam Exel | Matthias Huck
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

While large language models (LLMs) pre-trained on massive amounts of unpaired language data have reached the state-of-the-art in machine translation (MT) of general domain texts, post-editing (PE) is still required to correct errors and to enhance term translation quality in specialized domains. In this paper we present a pilot study of enhancing translation memories (TM) produced by PE (source segments, machine translations, and reference translations, henceforth called PE-TM) for the needs of correct and consistent term translation in technical domains. We investigate a light-weight two-step scenario where at inference time, a human translator marks errors in the first translation step, and in a second step a few similar examples are extracted from the PE-TM to prompt an LLM. Our experiment shows that the additional effort of augmenting translations with human error markings guides the LLM to focus on a correction of the marked errors, yielding consistent improvements over automatic PE (APE) and MT from scratch.

pdf bib abs
Contextual Refinement of Translations: Large Language Models for Sentence and Document-Level Post-Editing
Sai Koneru | Miriam Exel | Matthias Huck | Jan Niehues
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large language models (LLMs) have demonstrated considerable success in various natural language processing tasks, but open-source LLMs have yet to attain state-of-the-art performance in Neural Machine Translation (NMT). Nevertheless, their significant performance in tasks demanding a broad understanding and contextual processing shows their potential for translation. To exploit these abilities, we investigate using LLMs for MT and explore recent parameter-efficient fine-tuning techniques. Surprisingly, our initial experiments found that fine-tuning with Q-LoRA for translation purposes led to performance improvements in terms of BLEU but degradation in COMET compared to in-context learning. To overcome this, we propose an alternative approach: adapting LLMs as Automatic Post-Editors (APE) rather than direct translators. Building on the ability of the LLM to handle long sequences, we also propose extending our approach to document-level translation. We show that leveraging Low-Rank-Adapter fine-tuning for APE can yield significant improvements across both sentence and document-level metrics while generalizing to out-of-domain data. Most notably, we achieve a state-of-the-art accuracy rate of 88.7% on the ContraPro test set, which assesses the model’s ability to resolve pronoun ambiguities when translating from English to German. Lastly, during manual post-editing for document-level translation, the source sentences are iteratively annotated, which can be used to refine further translations in the document. Here, we demonstrate that leveraging human corrections can significantly reduce the number of edits required for subsequent translations.

pdf bib abs
Post-edits Are Preferences Too
Nathaniel Berger | Miriam Exel | Matthias Huck | Stefan Riezler
Proceedings of the Ninth Conference on Machine Translation

Preference Optimization (PO) techniques are currently one of the state of the art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreuzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings.We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment, while for post-editing, editors create $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit like hypotheses and away from machine translation-like hypotheses. Furthermore, we show that best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit like hypotheses to the top output ranks.

pdf bib abs
Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking across Diverse Vocabularies
Sai Koneru | Matthias Huck | Miriam Exel | Jan Niehues
Proceedings of the Ninth Conference on Machine Translation

Recent advancements in NLP have resulted in models with specialized strengths, such as processing multimodal inputs or excelling in specific domains. However, real-world tasks, like multimodal translation, often require a combination of these strengths, such as handling both translation and image processing. While individual translation and vision models are powerful, they typically lack the ability to perform both tasks in a single system. Combining these models poses challenges, particularly due to differences in their vocabularies, which limit the effectiveness of traditional ensemble methods to post-generation techniques like N-best list re-ranking. In this work, we propose a novel zero-shot ensembling strategy that allows for the integration of different models during the decoding phase without the need for additional training. Our approach re-ranks beams during decoding by combining scores at the word level, using heuristics to predict when a word is completed. We demonstrate the effectiveness of this method in machine translation scenarios, showing that it enables the generation of translations that are both speech- and image-aware while also improving overall translation quality.

2023

pdf bib abs
Analyzing Challenges in Neural Machine Translation for Software Localization
Sai Koneru | Matthias Huck | Miriam Exel | Jan Niehues
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Advancements in Neural Machine Translation (NMT) greatly benefit the software localization industry by decreasing the post-editing time of human annotators. Although the volume of the software being localized is growing significantly, techniques for improving NMT for user interface (UI) texts are lacking. These UI texts have different properties than other collections of texts, presenting unique challenges for NMT. For example, they are often very short, causing them to be ambiguous and needing additional context (button, title text, a table item, etc.) for disambiguation. However, no such UI data sets are readily available with contextual information for NMT models to exploit. This work aims to provide a first step in improving UI translations and highlight its challenges. To achieve this, we provide a novel multilingual UI corpus collection (∼ 1.3M for English ↔ German) with a targeted test set and analyze the limitations of state-of-the-art methods on this challenging task. Specifically, we present a targeted test set for disambiguation from English to German to evaluate reliably and emphasize UI translation challenges. Furthermore, we evaluate several state-of-the-art NMT techniques from domain adaptation and document-level NMT on this challenging task. All the scripts to replicate the experiments and data sets are available here.ˆ,

pdf bib abs
Enhancing Supervised Learning with Contrastive Markings in Neural Machine Translation Training
Nathaniel Berger | Miriam Exel | Matthias Huck | Stefan Riezler
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Supervised learning in Neural Machine Translation (NMT) standardly follows a teacher forcing paradigm where the conditioning context in the model’s prediction is constituted by reference tokens, instead of its own previous predictions. In order to alleviate this lack of exploration in the space of translations, we present a simple extension of standard maximum likelihood estimation by a contrastive marking objective. The additional training signals are extracted automatically from reference translations by comparing the system hypothesis against the reference, and used for up/down-weighting correct/incorrect tokens. The proposed new training procedure requires one additional translation pass over the training set, and does not alter the standard inference setup. We show that training with contrastive markings yields improvements on top of supervised learning, and is especially useful when learning from postedits where contrastive markings indicate human error corrections to the original hypotheses.

pdf bib abs
A Study on the Effectiveness of Large Language Models for Translation with Markup
Raj Dabre | Bianka Buschbeck | Miriam Exel | Hideki Tanaka
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

In this paper we evaluate the utility of large language models (LLMs) for translation of text with markup in which the most important and challenging aspect is to correctly transfer markup tags while ensuring that the content, both, inside and outside tags is correctly translated. While LLMs have been shown to be effective for plain text translation, their effectiveness for structured document translation is not well understood. To this end, we experiment with BLOOM and BLOOMZ, which are open-source multilingual LLMs, using zero, one and few-shot prompting, and compare with a domain-specific in-house NMT system using a detag-and-project approach for markup tags. We observe that LLMs with in-context learning exhibit poorer translation quality compared to the domain-specific NMT system, however, they are effective in transferring markup tags, especially the large BLOOM model (176 billion parameters). This is further confirmed by our human evaluation which also reveals the types of errors of the different tag transfer techniques. While LLM-based approaches come with the risk of losing, hallucinating and corrupting tags, they excel at placing them correctly in the translation.

2022

pdf bib abs
“Hi, how can I help you?” Improving Machine Translation of Conversational Content in a Business Context
Bianka Buschbeck | Jennifer Mell | Miriam Exel | Matthias Huck
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

This paper addresses the automatic translation of conversational content in a business context, for example support chat dialogues. While such use cases share characteristics with other informal machine translation scenarios, translation requirements with respect to technical and business-related expressions are high. To succeed in such scenarios, we experimented with curating dedicated training and test data, injecting noise to improve robustness, and applying sentence weighting schemes to carefully manage the influence of the different corpora. We show that our approach improves the performance of our models on conversational content for all 18 investigated language pairs while preserving translation quality on other domains - an indispensable requirement to integrate these developments into our MT engines at SAP.

Translation of structured content is an important application of machine translation, but the scarcity of evaluation data sets, especially for Asian languages, limits progress. In this paper we present a novel multilingual multiway evaluation data set for the translation of structured documents of the Asian languages Japanese, Korean and Chinese. We describe the data set, its creation process and important characteristics, followed by establishing and evaluating baselines using the direct translation as well as detag-project approaches. Our data set is well suited for multilingual evaluation, and it contains richer annotation tag sets than existing data sets. Our results show that massively multilingual translation models like M2M-100 and mBART-50 perform surprisingly well despite not being explicitly trained to handle structured content. The data set described in this paper and used in our experiments is released publicly.

2020

pdf bib abs
Incorporating External Annotation to improve Named Entity Translation in NMT
Maciej Modrzejewski | Miriam Exel | Bianka Buschbeck | Thanh-Le Ha | Alexander Waibel
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The correct translation of named entities (NEs) still poses a challenge for conventional neural machine translation (NMT) systems. This study explores methods incorporating named entity recognition (NER) into NMT with the aim to improve named entity translation. It proposes an annotation method that integrates named entities and inside–outside–beginning (IOB) tagging into the neural network input with the use of source factors. Our experiments on English→German and English→ Chinese show that just by including different NE classes and IOB tagging, we can increase the BLEU score by around 1 point using the standard test set from WMT2019 and achieve up to 12% increase in NE translation rates over a strong baseline.

pdf bib abs
Terminology-Constrained Neural Machine Translation at SAP
Miriam Exel | Bianka Buschbeck | Lauritz Brandt | Simona Doneva
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper examines approaches to bias a neural machine translation model to adhere to terminology constraints in an industrial setup. In particular, we investigate variations of the approach by Dinu et al. (2019), which uses inline annotation of the target terms in the source segment plus source factor embeddings during training and inference, and compare them to constrained decoding. We describe the challenges with respect to terminology in our usage scenario at SAP and show how far the investigated methods can help to overcome them. We extend the original study to a new language pair and provide an in-depth evaluation including an error classification and a human evaluation.

pdf bib abs
A Parallel Evaluation Data Set of Software Documentation with Document Structure Annotation
Bianka Buschbeck | Miriam Exel
Proceedings of the 7th Workshop on Asian Translation

This paper accompanies the software documentation data set for machine translation, a parallel evaluation data set of data originating from the SAP Help Portal, that we released to the machine translation community for research purposes. It offers the possibility to tune and evaluate machine translation systems in the domain of corporate software documentation and contributes to the availability of a wider range of evaluation scenarios. The data set comprises of the language pairs English to Hindi, Indonesian, Malay and Thai, and thus also increases the test coverage for the many low-resource language pairs. Unlike most evaluation data sets that consist of plain parallel text, the segments in this data set come with additional metadata that describes structural information of the document context. We provide insights into the origin and creation, the particularities and characteristics of the data set as well as machine translation results.