Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Rachel Bawden, Víctor M Sánchez-Cartagena, Patrick Cadwell, Ekaterina Lapshinova-Koltunski, Vera Cabarrão, Konstantinos Chatzitheodorou, Mary Nurminen, Diptesh Kanojia, Helena Moniz (Editors)


Anthology ID:
2024.eamt-1
Month:
June
Year:
2024
Address:
Sheffield, UK
Venue:
EAMT
SIG:
Publisher:
European Association for Machine Translation (EAMT)
URL:
https://aclanthology.org/2024.eamt-1
PDF:
https://aclanthology.org/2024.eamt-1.pdf

Thesis Award

Direct Speech Translation Toward High-Quality, Inclusive, and Augmented Systems
Marco Gaido

When this PhD started, the translation of speech into text in a different language was mainly tackled with a cascade of automatic speech recognition (ASR) and machine translation (MT) models, as the emerging direct speech translation (ST) models were not yet competitive. To close this gap, part of the PhD has been devoted to improving the quality of direct models, both in the simplified condition of test sets where the audio is split into well-formed sentences, and in the realistic condition in which the audio is automatically segmented. First, we investigated how to transfer knowledge from MT models trained on large corpora. Then, we defined encoder architectures that give different weights to the vectors in the input sequence, reflecting the variability of the amount of information over time in speech. Finally, we reduced the adverse effects caused by the suboptimal automatic audio segmentation in two ways: on one side, we created models robust to this condition; on the other, we enhanced the audio segmentation itself. The good results achieved in terms of overall translation quality allowed us to investigate specific behaviors of direct ST systems, which are crucial to satisfy real users’ needs. On one side, driven by the ethical goal of inclusive systems, we disclosed that established technical choices geared toward high general performance (statistical word segmentation of the target text, knowledge distillation from MT) cause an exacerbation of the gender representational disparities in the training data. Along this line of work, we proposed mitigation techniques that reduce the gender bias of ST models, and showed how gender-specific systems can be used to control the translation of gendered words related to the speakers, regardless of their vocal traits. 
On the other side, motivated by the practical needs of interpreters and translators, we evaluated the potential of direct ST systems in the “augmented translation” scenario, focusing on the translation and recognition of named entities (NEs). Along this line of work, we proposed solutions to cope with the major weakness of ST models (handling person names), and introduced direct models that jointly perform ST and NE recognition, showing their superiority over a pipeline of dedicated tools for the two tasks. Overall, we believe that this thesis moves a step forward toward adopting direct ST systems in real applications, increasing the awareness of their strengths and weaknesses compared to the traditional cascade paradigm.

Streaming Neural Speech Translation
Javier Iranzo-Sánchez

EAMT 2023 Thesis Award submission for Javier Iranzo-Sánchez.

Thesis: Model-based Evaluation of Multilinguality
Jannis Vamvas

The aim of this thesis was to extend the methodological toolbox for evaluating the ability of natural language processing systems to handle multiple languages. Neural machine translation (NMT) took the central role in this endeavour: NMT is inherently cross-lingual, and multilingual NMT systems, which translate from many source languages into many target languages, embody the concept of multilinguality in a very tangible way. In addition, NMT and specifically the perplexity of NMT systems can themselves be used as a tool for evaluating multilinguality.

Research: Technical

Promoting Target Data in Context-aware Neural Machine Translation
Harritxu Gete | Thierry Etchegoyhen

Standard context-aware neural machine translation (NMT) typically relies on parallel document-level data, exploiting both source and target contexts. Concatenation-based approaches in particular, still a strong baseline for document-level NMT, prepend source and/or target context sentences to the sentences to be translated, with model variants that exploit equal amounts of source and target data on each side achieving state-of-the-art results. In this work, we investigate whether target data should be further promoted within standard concatenation-based approaches, as most document-level phenomena rely on information that is present on the target language side. We evaluate novel concatenation-based variants where the target context is prepended to the source language, either in isolation or in combination with the source context. Experimental results in English-Russian and Basque-Spanish show that including target context in the source leads to large improvements on target language phenomena. On source-dependent phenomena, using only target language context in the source achieves parity with state-of-the-art concatenation approaches, or slightly underperforms, whereas combining source and target context on the source side leads to significant gains across the board.

A Human Perspective on GPT-4 Translations: Analysing Faroese to English News and Blog Text Translations
Annika Simonsen | Hafsteinn Einarsson

This study investigates the potential of Generative Pre-trained Transformer models, specifically GPT-4, to generate machine translation resources for the low-resource language, Faroese. Given the scarcity of high-quality, human-translated data for such languages, Large Language Models’ capabilities to produce native-sounding text offer a practical solution. This approach is particularly valuable for generating paired translation examples where one is in natural, authentic Faroese as opposed to traditional approaches that went from English to Faroese, addressing a common limitation in such approaches. By creating such a synthetic parallel dataset and evaluating it through the Multidimensional Quality Metrics framework, this research assesses the translation quality offered by GPT-4. The findings reveal GPT-4’s strengths in general translation tasks, while also highlighting its limitations in capturing cultural nuances.

ReSeTOX: Re-learning attention weights for toxicity mitigation in machine translation
Javier García Gilabert | Carlos Escolano | Marta Costa-jussà

Our proposed method, RESETOX (REdoSEarch if TOXic), addresses the issue of Neural Machine Translation (NMT) generating translation outputs that contain toxic words not present in the input. The objective is to mitigate the introduction of toxic language without the need for re-training. In the case of identified added toxicity during the inference process, RESETOX dynamically adjusts the key-value self-attention weights and re-evaluates the beam search hypotheses. Experimental results demonstrate that RESETOX achieves a remarkable 57% reduction in added toxicity while maintaining an average translation quality of 99.5% across 164 languages. Our code is available at: https://github.com

Using Machine Translation to Augment Multilingual Classification
Adam King

An all-too-present bottleneck for text classification model development is the need to annotate training data and this need is multiplied for multilingual classifiers. Fortunately, contemporary machine translation models are both easily accessible and have dependable translation quality, making it possible to translate labeled training data from one language into another. Here, we explore the effects of using machine translation to fine-tune a multilingual model for a classification task across multiple languages. We also investigate the benefits of using a novel technique, originally proposed in the field of image captioning, to account for potential negative effects of tuning models on translated data. We show that translated data are of sufficient quality to tune multilingual classifiers and that this novel loss technique is able to offer some improvement over models tuned without it.

Recovery Should Never Deviate from Ground Truth: Mitigating Exposure Bias in Neural Machine Translation
Jianfei He | Shichao Sun | Xiaohua Jia | Wenjie Li

In Neural Machine Translation, models are often trained with teacher forcing and suffer from exposure bias due to the discrepancy between training and inference. Current token-level solutions, such as scheduled sampling, aim to maximize the model’s capability to recover from errors. Their loss functions have a side effect: a sequence with errors may have a larger probability than the ground truth. The consequence is that the generated sequences may recover too much and deviate from the ground truth. This side effect is verified in our experiments. To address this issue, we propose using token-level contrastive learning to coordinate three training objectives: the usual MLE objective, an objective for recovery from errors, and a new objective to explicitly constrain the recovery in a scope that does not impact the ground truth. Our empirical analysis shows that this method effectively achieves these objectives in training and reduces the frequency with which the third objective is violated. We conduct experiments on three language pairs: German-English, Russian-English, and English-Russian. Results show that our method outperforms the vanilla Transformer and other methods addressing the exposure bias.

Chasing COMET: Leveraging Minimum Bayes Risk Decoding for Self-Improving Machine Translation
Kamil Guttmann | Mikołaj Pokrywka | Adrian Charkiewicz | Artur Nowakowski

This paper explores Minimum Bayes Risk (MBR) decoding for self-improvement in machine translation (MT), particularly for domain adaptation and low-resource languages. We implement the self-improvement process by fine-tuning the model on its MBR-decoded forward translations. By employing COMET as the MBR utility metric, we aim to achieve the reranking of translations that better aligns with human preferences. The paper explores the iterative application of this approach and the potential need for language-specific MBR utility metrics. The results demonstrate significant enhancements in translation quality for all examined language pairs, including successful application to domain-adapted models and generalisation to low-resource settings. This highlights the potential of COMET-guided MBR for efficient MT self-improvement in various scenarios.
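
As a rough illustration of the MBR selection step mentioned in this abstract, the sketch below picks the candidate with the highest average utility against the other candidates. It is not the authors' implementation: the names `overlap_f1` and `mbr_select` are hypothetical, and a toy unigram F1 stands in for COMET, which the paper uses as the utility metric.

```python
from collections import Counter


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy utility (stand-in for COMET): unigram F1 on whitespace tokens."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())  # clipped count of shared tokens
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=overlap_f1):
    """Return the candidate with the highest mean utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        # Treat every other candidate as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical sampled translations for one source sentence.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog stood there",
]
selected = mbr_select(samples)
```

In a self-improvement loop of the kind the abstract describes, the selected outputs would then serve as fine-tuning targets. That step is omitted here.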

Mitra: Improving Terminologically Constrained Translation Quality with Backtranslations and Flag Diacritics
Iikka Hauhio | Théo Friberg

Terminologically constrained machine translation is a hot topic in the field of neural machine translation. One major way to categorize constrained translation methods is to divide them into “hard” constraints that are forced into the target language sentence using a special decoding algorithm, and “soft” constraints that are included in the input given to the model. We present a constrained translation pipeline that combines soft and hard constraints while being completely model-agnostic, i.e. our method can be used with any NMT or LLM model. In the “soft” part, we substitute the source language terms in the input sentence for the backtranslations of their target language equivalents. This causes the source sentence to be more similar to the intended translation, thus making it easier to translate for the model. In the “hard” part, we use a novel nondeterministic finite state transducer-based (NDFST) constraint recognition algorithm utilizing flag diacritics to force the model to use the desired target language terms. We test our model with both Finnish–English and English–Finnish real-world vocabularies. We find that our methods consistently improve the translation quality when compared to previous constrained decoding algorithms, while the improvement over unconstrained translations depends on the familiarity of the model over the subject vocabulary and the quality of the vocabulary.

Bootstrapping Pre-trained Word Embedding Models for Sign Language Gloss Translation
Euan McGill | Luis Chiruzzo | Horacio Saggion

This paper explores a novel method to modify existing pre-trained word embedding models of spoken languages for Sign Language glosses. These newly-generated embeddings are described, visualised, and then used in the encoder and/or decoder of models for the Text2Gloss and Gloss2Text task of machine translation. In two translation settings (one including data augmentation-based pre-training and a baseline), we find that bootstrapped word embeddings for glosses improve translation across four Signed/spoken language pairs. Many improvements are statistically significant, including those where the bootstrapped gloss embedding models are used. Languages included: American Sign Language, Finnish Sign Language, Spanish Sign Language, Sign Language of The Netherlands.

Quality Estimation with k-nearest Neighbors and Automatic Evaluation for Model-specific Quality Estimation
Tu Dinh | Tobias Palzer | Jan Niehues

Providing quality scores along with Machine Translation (MT) output, so-called reference-free Quality Estimation (QE), is crucial to inform users about the reliability of the translation. We propose a model-specific, unsupervised QE approach, termed kNN-QE, that extracts information from the MT model’s training data using k-nearest neighbors. Measuring the performance of model-specific QE approaches is not straightforward: since they provide quality scores on their own MT output, they cannot be evaluated using benchmark QE test sets containing human quality scores on premade MT output. Therefore, we propose an automatic evaluation method that uses quality scores from reference-based metrics as gold standard instead of human-generated ones. We are the first to conduct detailed analyses and conclude that this automatic method is sufficient, and the reference-based MetricX-23 is best for the task.

SubMerge: Merging Equivalent Subword Tokenizations for Subword Regularized Models in Neural Machine Translation
Haiyue Song | Francois Meyer | Raj Dabre | Hideki Tanaka | Chenhui Chu | Sadao Kurohashi

Subword regularized models leverage multiple subword tokenizations of one target sentence during training. However, selecting one tokenization during inference leads to the underutilization of knowledge learned about multiple tokenizations. We propose the SubMerge algorithm to rescue the ignored Subword tokenizations through merging equivalent ones during inference. SubMerge is a nested search algorithm where the outer beam search treats the word as the minimal unit, and the inner beam search provides a list of word candidates and their probabilities, merging equivalent subword tokenizations. SubMerge estimates the probability of the next word more precisely, providing better guidance during inference. Experimental results on six low-resource to high-resource machine translation datasets show that SubMerge utilizes a greater proportion of a model’s probability weight during decoding (lower word perplexities for hypotheses). It also improves BLEU and chrF++ scores for many translation directions, most reliably for low-resource scenarios. We investigate the effect of different beam sizes, training set sizes, dropout rates, and whether it is effective on non-regularized models.
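
The idea of merging equivalent subword tokenizations can be sketched as summing the probabilities of all tokenizations that spell the same word. The snippet below covers only that merging step, not the nested beam search of SubMerge. The function name `merge_equivalent`, the candidate list, and its probabilities are hypothetical, and word-boundary markers used by real subword vocabularies are ignored for simplicity.

```python
import math
from collections import defaultdict


def merge_equivalent(tokenizations):
    """Merge tokenizations that spell the same word.

    tokenizations: iterable of (tuple_of_subwords, log_probability).
    Returns a dict mapping each surface word to the log of the summed
    probability of all its tokenizations (computed with log-sum-exp).
    """
    grouped = defaultdict(list)
    for pieces, logp in tokenizations:
        grouped["".join(pieces)].append(logp)

    merged = {}
    for word, logps in grouped.items():
        m = max(logps)  # subtract the max for numerical stability
        merged[word] = m + math.log(sum(math.exp(lp - m) for lp in logps))
    return merged


# Hypothetical inner-search output for the next word.
candidates = [
    (("un", "able"), math.log(0.30)),
    (("u", "nable"), math.log(0.10)),
    (("unable",), math.log(0.20)),
    (("enable",), math.log(0.25)),
]
word_logprobs = merge_equivalent(candidates)
best_word = max(word_logprobs, key=word_logprobs.get)
```

In this toy case, the best single tokenization of "unable" (0.30) barely beats "enable" (0.25), whereas the merged mass for "unable" (0.60) gives a clearer word-level estimate.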

FAME-MT Dataset: Formality Awareness Made Easy for Machine Translation Purposes
Dawid Wisniewski | Zofia Rostek | Artur Nowakowski

People use language for various purposes. Apart from sharing information, individuals may use it to express emotions or to show respect for another person. In this paper, we focus on the formality level of machine-generated translations and present FAME-MT – a dataset consisting of 11.2 million translations between 15 European source languages and 8 European target languages, classified into formal and informal classes according to target sentence formality. This dataset can be used to fine-tune machine translation models to ensure a given formality level for the 8 European target languages considered. We describe the dataset creation procedure and the analysis of the dataset’s quality, showing that FAME-MT is a reliable source of language register information, and we construct a publicly available proof-of-concept machine translation model that uses the dataset to steer the formality level of the translation. Currently, it is the largest dataset of formality annotations, with examples expressed in 112 European language pairs. The dataset is made available online.

Iterative Translation Refinement with Large Language Models
Pinzhen Chen | Zhicheng Guo | Barry Haddow | Kenneth Heafield

We propose iteratively prompting a large language model to self-correct a translation, with inspiration from their strong language capability as well as a human-like translation approach. Interestingly, multi-turn querying reduces the output’s string-based metric scores, but neural metrics suggest comparable or improved quality after two or more iterations. Human evaluations indicate better fluency and naturalness compared to initial translations and even human references, all while maintaining quality. Ablation studies underscore the importance of anchoring the refinement to the source and a reasonable seed translation for quality considerations. We also discuss the challenges in evaluation and relation to human performance and translationese.

Detector–Corrector: Edit-Based Automatic Post Editing for Human Post Editing
Hiroyuki Deguchi | Masaaki Nagata | Taro Watanabe

Post-editing is crucial in the real world because neural machine translation (NMT) sometimes makes errors. Automatic post-editing (APE) attempts to correct the outputs of an MT model for better translation quality. However, many APE models are based on sequence generation, and thus their decisions are harder to interpret for actual users. In this paper, we propose “detector–corrector”, an edit-based post-editing model, which breaks the editing process into two steps, error detection and error correction. The detector model tags each MT output token according to whether it should be corrected and/or reordered, while the corrector model generates corrected words for the spans identified as errors by the detector. Experiments on the WMT’20 English–German and English–Chinese APE tasks showed that our detector–corrector improved the translation edit rate (TER) compared to the previous edit-based model and a black-box sequence-to-sequence APE model. In addition, our model is more explainable because it is based on edit operations.

Assessing Translation Capabilities of Large Language Models involving English and Indian Languages
Vandan Mujadia | Ashok Urlana | Yash Bhaskar | Penumalla Aditya Pavani | Kukkapalli Shravya | Parameswari Krishnamurthy | Dipti Sharma

Generative Large Language Models (LLMs) have achieved remarkable advances in various NLP tasks. In this work, our aim is to explore the multilingual capabilities of large language models by using machine translation as a task involving English and 22 Indian languages. We first investigate the translation capabilities of raw large-language models, followed by exploring the in-context learning capabilities of the same raw models. We fine-tune these large language models using parameter-efficient fine-tuning methods such as LoRA and additionally with full fine-tuning. Through our study, we have identified the model that performs best among the large language models available for the translation task. Our results demonstrate significant progress, with average BLEU scores of 13.42, 15.93, 12.13, 12.30, and 12.07, as well as chrF scores of 43.98, 46.99, 42.55, 42.42, and 45.39, respectively, using two-stage fine-tuned LLaMA-13b for English to Indian languages on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest, and newstest2019 testsets. Similarly, for Indian languages to English, we achieved average BLEU scores of 14.03, 16.65, 16.17, 15.35 and 12.55 along with chrF scores of 36.71, 40.44, 40.26, 39.51, and 36.20, respectively, using fine-tuned LLaMA-13b on IN22 (conversational), IN22 (general), flores200-dev, flores200-devtest and newstest2019 testsets. Overall, our findings highlight the potential and strength of large language models for machine translation capabilities, including languages that are currently underrepresented in LLMs.

Improving NMT from a Low-Resource Source Language: A Use Case from Catalan to Chinese via Spanish
Yongjian Chen | Antonio Toral | Zhijian Li | Mireia Farrús

The effectiveness of neural machine translation is markedly constrained in low-resource scenarios, where the scarcity of parallel data hampers the development of robust models. This paper focuses on the scenario where the source language is low-resource and there exists a related high-resource language, for which we introduce a novel approach that combines pivot translation and multilingual training. As a use case we tackle the automatic translation from Catalan to Chinese, using Spanish as an additional language. Our evaluation, conducted on the FLORES-200 benchmark, compares our new approach against a vanilla baseline alongside other models representing various low-resource techniques in the Catalan-to-Chinese context. Experimental results highlight the efficacy of our proposed method, which outperforms existing models, notably demonstrating significant improvements both in translation quality and in lexical diversity.

A Case Study on Context-Aware Neural Machine Translation with Multi-Task Learning
Ramakrishna Appicharla | Baban Gain | Santanu Pal | Asif Ekbal | Pushpak Bhattacharyya

In document-level neural machine translation (DocNMT), multi-encoder approaches are common in encoding context and source sentences. Recent studies (CITATION) have shown that the context encoder generates noise and makes the model robust to the choice of context. This paper further investigates this observation by explicitly modelling context encoding through multi-task learning (MTL) to make the model sensitive to the choice of context. We conduct experiments on cascade MTL architecture, which consists of one encoder and two decoders. Generation of the source from the context is considered an auxiliary task, and generation of the target from the source is the main task. We experimented with German–English language pairs on News, TED, and Europarl corpora. Evaluation results show that the proposed MTL approach performs better than concatenation-based and multi-encoder DocNMT models in low-resource settings and is sensitive to the choice of context. However, we observe that the MTL models are failing to generate the source from the context. These observations align with the previous studies, and this might suggest that the available document-level parallel corpora are not context-aware, and a robust sentence-level model can outperform the context-aware models.

Aligning Neural Machine Translation Models: Human Feedback in Training and Inference
Miguel Ramos | Patrick Fernandes | António Farinhas | Andre Martins

Reinforcement learning from human feedback (RLHF) is a recent technique to improve the quality of the text generated by a language model, making it closer to what humans would generate. A core ingredient in RLHF’s success in aligning and improving large language models (LLMs) is its reward model, trained using human feedback on model outputs. In machine translation (MT), where metrics trained from human annotations can readily be used as reward models, recent methods using minimum Bayes risk decoding and reranking have succeeded in improving the final quality of translation. In this study, we comprehensively explore and compare techniques for integrating quality metrics as reward models into the MT pipeline. This includes using the reward model for data filtering, during the training phase through RL, and at inference time by employing reranking techniques, and we assess the effects of combining these in a unified approach. Our experimental results, conducted across multiple translation tasks, underscore the crucial role of effective data filtering, based on estimated quality, in harnessing the full potential of RL in enhancing MT quality. Furthermore, our findings demonstrate the effectiveness of combining RL training with reranking techniques, showcasing substantial improvements in translation quality.

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain
Dimitris Roussis | Sokratis Sofianopoulos | Stelios Piperidis

The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora from the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the research domains of: Energy Research, Neuroscience, Cancer and Transportation. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.

Towards Tailored Recovery of Lexical Diversity in Literary Machine Translation
Esther Ploeger | Huiyuan Lai | Rik Van Noord | Antonio Toral

Machine translations are found to be lexically poorer than human translations. The loss of lexical diversity through MT poses an issue in the automatic translation of literature, where it matters not only what is written, but also how it is written. Current methods for increasing lexical diversity in MT are rigid. Yet, as we demonstrate, the degree of lexical diversity can vary considerably across different novels. Thus, rather than aiming for the rigid increase of lexical diversity, we reframe the task as recovering what is lost in the machine translation process. We propose a novel approach that consists of reranking translation candidates with a classifier that distinguishes between original and translated text. We evaluate our approach on 31 English-to-Dutch book translations, and find that, for certain books, our approach retrieves lexical diversity scores that are close to human translation.

Enhancing Gender-Inclusive Machine Translation with Neomorphemes and Large Language Models
Andrea Piergentili | Beatrice Savoldi | Matteo Negri | Luisa Bentivogli

Machine translation (MT) models are known to suffer from gender bias, especially when translating into languages with extensive gendered morphology. Accordingly, they still fall short in using gender-inclusive language, also representative of non-binary identities. In this paper, we look at gender-inclusive neomorphemes, neologistic elements that avoid binary gender markings as an approach towards fairer MT. In this direction, we explore prompting techniques with large language models (LLMs) to translate from English into Italian using neomorphemes. So far, this area has been under-explored due to its novelty and the lack of publicly available evaluation resources. We fill this gap by releasing NEO-GATE, a resource designed to evaluate gender-inclusive en→it translation with neomorphemes. With NEO-GATE, we assess four LLMs of different families and sizes and different prompt formats, identifying strengths and weaknesses of each on this novel task for MT.

Research: Translators & Users

Prompting ChatGPT for Translation: A Comparative Analysis of Translation Brief and Persona Prompts
Sui He

Prompt engineering has shown potential for improving translation quality in LLMs. However, the possibility of using translation concepts in prompt design remains largely underexplored. Against this backdrop, the current paper discusses the effectiveness of incorporating the conceptual tool of “translation brief” and the personas of “translator” and “author” into prompt design for translation tasks in ChatGPT. Findings suggest that, although certain elements are constructive in facilitating human-to-human communication for translation tasks, their effectiveness is limited for improving translation quality in ChatGPT. This accentuates the need for explorative research on how translation theorists and practitioners can develop the current set of conceptual tools rooted in the human-to-human communication paradigm for translation purposes in this emerging workflow involving human-machine interaction, and how translation concepts developed in translation studies can inform the training of GPT models for translation tasks.

Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation
Claudio Fantinuoli | Xiaoman Wang

Assessing the performance of interpreting services is a complex task, given the nuanced nature of spoken language translation, the strategies that interpreters apply, and the diverse expectations of users. The complexity of this task becomes even more pronounced when automated evaluation methods are applied. This is particularly true because interpreted texts exhibit less linearity between the source and target languages due to the strategies employed by the interpreter. This study aims to assess the reliability of automatic metrics in evaluating simultaneous interpretations by analyzing their correlation with human evaluations. We focus on a particular feature of interpretation quality, namely translation accuracy or faithfulness. As a benchmark we use human assessments performed by language experts, and evaluate how well sentence embeddings and Large Language Models correlate with them. We quantify semantic similarity between the source and translated texts without relying on a reference translation. The results suggest GPT models, particularly GPT-3.5 with direct prompting, demonstrate the strongest correlation with human judgment in terms of semantic similarity between source and target texts, even when evaluating short textual segments. Additionally, the study reveals that the size of the context window has a notable impact on this correlation.

MTUncertainty: Assessing the Need for Post-editing of Machine Translation Outputs by Fine-tuning OpenAI LLMs
Serge Gladkoff | Lifeng Han | Gleb Erofeev | Irina Sorokina | Goran Nenadic

Translation Quality Evaluation (TQE) is an essential step of the modern translation production process. TQE is critical in assessing both machine translation (MT) and human translation (HT) quality without reference translations. The ability to evaluate or even simply estimate the quality of translation automatically may open up significant efficiency gains through process optimisation. This work examines whether the state-of-the-art large language models (LLMs) can be used for this uncertainty estimation of MT output quality. We take OpenAI models as an example technology and approach TQE as a binary classification task. On eight language pairs including English to Italian, German, French, Japanese, Dutch, Portuguese, Turkish, and Chinese, our experimental results show that fine-tuned gpt3.5 can demonstrate good performance on translation quality prediction tasks, i.e. whether the translation needs to be edited. Another finding, from comparing three different versions of OpenAI models (curie, davinci, and gpt3.5, with 13B, 175B, and 175B parameters, respectively), is that simply increasing the size of LLMs does not lead to apparently better performance on this task.

pdf bib
Translators’ perspectives on machine translation uses and impacts in the Swiss Confederation: Navigating technological change in an institutional setting
Paolo Canavese | Patrick Cadwell

New language technologies are driving major changes in the language services of institutions worldwide, including the Swiss Confederation. Based on a definition of change management as a combination of adaptation measures at both the organisation and individual levels, this study used a survey to gather unprecedented quantitative data on the use and qualitative data on the perceptions of machine translation (MT) by federal in-house translators. The results show that more than half of the respondents use MT regularly and that translators are largely free to use it as they see fit. In terms of perceptions, they mostly anticipate negative evolutions along five dimensions: work processes, translators, translated texts, the future of their language services and job, and the place of translators within their institution and society. Their apprehensions concern MT per se, but even more the way it is seen and used within their organisation. However, positive perspectives regarding efficiency gains or usefulness of MT as a translation aid were also discussed. Building on these human factors is key to successful change management. Academic research has a contribution to make, and the coming together of translation and organisation studies offers promising avenues for further research.

pdf bib
Added Toxicity Mitigation at Inference Time for Multimodal and Massively Multilingual Translation
Marta Costa-jussà | David Dale | Maha Elbayad | Bokai Yu

Machine translation models sometimes lead to added toxicity: translated outputs may contain more toxic content than the original input. In this paper, we introduce MinTox, a novel pipeline to automatically identify and mitigate added toxicity at inference time, without further model training. MinTox leverages a multimodal (speech and text) toxicity classifier that can scale across languages. We demonstrate the capabilities of MinTox when applied to SEAMLESSM4T, a multi-modal and massively multilingual machine translation system. MinTox significantly reduces added toxicity: across all domains, modalities and language directions, 25% to 95% of added toxicity is successfully filtered out, while preserving translation quality.

pdf bib
LLMs in Post-Translation Workflows: Comparing Performance in Post-Editing and Error Analysis
Celia Uguet | Fred Bane | Mahmoud Aymo | João Torres | Anna Zaretskaya | Tània Blanch Miró

This study conducts a comprehensive comparison of three leading LLMs—GPT-4, Claude 3, and Gemini—in two translation-related tasks: automatic post-editing and MQM error annotation, across four languages. Utilizing the pharmaceutical EMEA corpus to maintain domain specificity and minimize data contamination, the research examines the models’ performance in these two tasks. Our findings reveal the nuanced capabilities of LLMs in handling MTPE and MQM tasks, hinting at the potential of these models in streamlining and optimizing translation workflows. Future directions include fine-tuning LLMs for task-specific improvements and exploring the integration of style guides for enhanced translation quality.

pdf bib
Post-editors as Gatekeepers of Lexical and Syntactic Diversity: Comparative Analysis of Human Translation and Post-editing in Professional Settings
Lise Volkart | Pierrette Bouillon

This paper presents a comparative analysis between human translation (HT) and post-edited machine translation (PEMT) from a lexical and syntactic perspective to verify whether the tendency of neural machine translation (NMT) systems to produce lexically and syntactically poorer translations shines through after post-editing (PE). The analysis focuses on three datasets collected in professional contexts containing translations from English into French and German into French. Through a comparison of word translation entropy (HTRa) scores, we observe a lower degree of lexical diversity in PEMT compared to HT. Additionally, metrics of syntactic equivalence indicate that PEMT is more likely to mirror the syntactic structure of the source text in contrast to HT. By incorporating raw machine translation (MT) output into our analysis, we underline the important role post-editors play in adding lexical and syntactic diversity to MT output. Our findings provide relevant input for MT users and decision-makers in language services as well as for MT and PE trainers and advisers.
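
As an illustration of the lexical diversity measure mentioned above, word translation entropy can be sketched as the Shannon entropy of the translation choices observed for a single source word. This is a minimal sketch assuming the usual entropy formulation; the function name and the toy French data are hypothetical, and the paper's exact computation may differ.

```python
import math
from collections import Counter


def word_translation_entropy(translations):
    """Shannon entropy (in bits) of the translation choices seen for one source word.

    `translations` is a list with one target-language rendering per translator
    or per occurrence (an assumption about how the data is organised).
    """
    counts = Counter(translations)
    total = sum(counts.values())
    # Sum p * log2(1/p) over the distinct renderings.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Toy, invented data: four renderings of the same source word.
varied = ["maison", "maison", "domicile", "foyer"]
uniform = ["maison", "maison", "maison", "maison"]

print(word_translation_entropy(varied))   # higher value: more lexical variation
print(word_translation_entropy(uniform))  # zero: every rendering is identical
```

Under this formulation, a lower average entropy across source words corresponds to the lower lexical diversity the abstract reports for PEMT.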

pdf bib
Exploring NMT Explainability for Translators Using NMT Visualising Tools
Gabriela Gonzalez-Saez | Mariam Nakhle | James Turner | Fabien Lopez | Nicolas Ballier | Marco Dinarelli | Emmanuelle Esperança-Rodier | Sui He | Raheel Qader | Caroline Rossi | Didier Schwab | Jun Yang

This paper describes work in progress on visualisation tools to foster collaborations between translators and computational scientists. We aim to describe how visualisation features can be used to explain translation and NMT outputs. We tested several visualisation functionalities with three NMT models based on Chinese-English, Spanish-English and French-English language pairs. We created three demos containing different visualisation tools and analysed them within the framework of performance-explainability, focusing on the translator’s perspective.

pdf bib
Mitigating Translationese with GPT-4: Strategies and Performance
Maria Kunilovskaya | Koel Dutta Chowdhury | Heike Przybyl | Cristina España-Bonet | Josef Genabith

Translations differ in systematic ways from texts originally authored in the same language. These differences, collectively known as translationese, can pose challenges in cross-lingual natural language processing: models trained or tested on translated input might struggle when presented with non-translated language. Translationese mitigation can alleviate this problem. This study investigates the generative capacities of GPT-4 to reduce translationese in human-translated texts. The task is framed as a rewriting process aimed at modified translations indistinguishable from the original text in the target language. Our focus is on prompt engineering that tests the utility of linguistic knowledge as part of the instruction for GPT-4. Through a series of prompt design experiments, we show that GPT-4-generated revisions are more similar to originals in the target language when the prompts incorporate specific linguistic instructions instead of relying solely on the model’s internal knowledge. Furthermore, we release the segment-aligned bidirectional German-English data built from the Europarl corpus that underpins this study.

pdf bib
Translate your Own: a Post-Editing Experiment in the NLP domain
Rachel Bawden | Ziqian Peng | Maud Bénard | Éric Clergerie | Raphaël Esamotunu | Mathilde Huguin | Natalie Kübler | Alexandra Mestivier | Mona Michelot | Laurent Romary | Lichao Zhu | François Yvon

The improvements in neural machine translation make translation and post-editing pipelines ever more effective for a wider range of applications. In this paper, we evaluate the effectiveness of such a pipeline for the translation of scientific documents (limited here to article abstracts). Using a dedicated interface, we collect, then analyse the post-edits of approximately 350 abstracts (English→French) in the Natural Language Processing domain for two groups of post-editors: domain experts (academics encouraged to post-edit their own articles) on the one hand and trained translators on the other. Our results confirm that such pipelines can be effective, at least for high-resource language pairs. They also highlight the difference in the post-editing strategy of the two subgroups. Finally, they suggest that working on term translation is the most pressing issue to improve fully automatic translations, but that in a post-editing setup, other error types can be equally annoying for post-editors.

pdf bib
Pre-task perceptions of MT influence quality and productivity: the importance of better translator-computer interactions and implications for training
Vicent Briva-Iglesias | Sharon O’Brien

This paper presents a user study with 11 professional English-Spanish translators in the legal domain. We analysed whether negative or positive translators’ pre-task perceptions of machine translation (MT) being an aid or a threat had any relationship with final translation quality and productivity in a post-editing workflow. Pre-task perceptions of MT were collected in a questionnaire before translators conducted post-editing tasks and were then correlated with translation productivity and translation quality after an Adequacy-Fluency evaluation. Each participant translated 13 texts over two consecutive weeks, accounting for 120,102 words in total. Results show that translators who had higher levels of trust in MT and thought that MT was not a threat to the translation profession reported higher translation quality and productivity. These results have critical implications: improving translator-computer interactions and fostering MT literacy in translation training may be crucial to reducing negative translators’ pre-task perceptions, resulting in better translation productivity and quality, especially adequacy.

pdf bib
Bayesian Hierarchical Modelling for Analysing the Effect of Speech Synthesis on Post-Editing Machine Translation
Miguel Rios | Justus Brockmann | Claudia Wiesinger | Raluca Chereji | Alina Secară | Dragoș Ciobanu

Automatic speech synthesis has seen rapid development and integration in domains as diverse as accessibility services, translation, or language learning platforms. We analyse its integration in a post-editing machine translation (PEMT) environment and the effect this has on quality, productivity, and cognitive effort. We use Bayesian hierarchical modelling to analyse eye-tracking, time-tracking, and error annotation data resulting from an experiment involving 21 professional translators post-editing from English into German in a customised cloud-based CAT environment and listening to the source and/or target texts via speech synthesis. Using speech synthesis in a PEMT task has a non-substantial positive effect on quality, a substantial negative effect on productivity, and a substantial negative effect on the cognitive effort expended on the target text, signifying that participants need to allocate less cognitive effort to the target text.

pdf bib
Evaluation of intralingual machine translation for health communication
Silvana Deilen | Ekaterina Lapshinova-Koltunski | Sergio Garrido | Julian Hörner | Christiane Maaß | Vanessa Theel | Sophie Ziemer

In this paper, we describe results of a study on evaluation of intralingual machine translation. The study focuses on machine translations of medical texts into Plain German. The automatically simplified texts were compared with manually simplified texts (i.e., simplified by human experts) as well as with the underlying, unsimplified source texts. We analyse the quality of outputs from three models based on different criteria, such as correctness, readability, and syntactic complexity. We compare the outputs of the three models under analysis between each other, as well as with the existing human translations. The study revealed that system performance depends on the evaluation criteria used and that only one of the three models showed strong similarities to the human translations. Furthermore, we identified various types of errors in all three models. These included not only grammatical mistakes and misspellings, but also incorrect explanations of technical terms and false statements, which in turn led to serious content-related mistakes.

pdf bib
Using Machine Learning to Validate a Novel Taxonomy of Phenomenal Translation States
Michael Carl | Sheng Lu | Ali Al-Ramadan

We report an experiment in which we use machine learning to validate the empirical objectivity of a novel annotation taxonomy for behavioral translation data. The HOF taxonomy defines three translation states according to which a human translator can be in a state of Orientation (O), Hesitation (H) or in a Flow state (F). We aim at validating the taxonomy based on a manually annotated dataset that consists of six English-Spanish translation sessions (approx. 900 words) and 1813 HOF-annotated Activity Units (AUs). Two annotators annotated the data and obtained a high average inter-annotator accuracy of 0.76 (kappa 0.88). We trained two classifiers, a Multi-layer Perceptron (MLP) and a Random Forest (RF), on the annotated data and tested them on held-out data. The classifiers perform well on the annotated data and thus confirm the epistemological objectivity of the annotation taxonomy. Interestingly, inter-classifier accuracy scores are higher than between the two human annotators.

pdf bib
Perceptions of Educators on MTQA Curriculum and Instruction
João Camargo | Sheila Castilho | Joss Moorkens

This paper reports the preliminary results of a survey aimed at identifying and exploring the attitudes and recommendations of machine translation quality assessment (MTQA) educators. Drawing upon elements from the literature on MTQA teaching, the survey explores themes that may pose a challenge or lead to successful implementation of human evaluation, as the literature shows that there has not been enough design and reporting. Results show educators’ awareness of the topic, awareness stemming from the recommendations of the literature on MT evaluation, and report new challenges and issues.

pdf bib
Comparative Quality Assessment of Human and Machine Translation with Best-Worst Scaling
Bettina Hiebl | Dagmar Gromann

Translation quality and its assessment are of great importance in the context of human as well as machine translation. Methods range from human annotation and assessment to quality metrics and estimation, where the former are rather time-consuming. Furthermore, assessing translation quality is a subjective process. Best-Worst Scaling (BWS) represents a time-efficient annotation method to obtain subjective preferences, the best and the worst in a given set and their ratings. In this paper, we propose to use BWS for a comparative translation quality assessment of one human and three machine translations to German of the same source text in English. As a result, ten participants with a translation background selected the human translation most frequently and rated it overall as best, closely followed by DeepL. Participants showed an overall positive attitude towards this assessment method.
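
For readers unfamiliar with Best-Worst Scaling, the sketch below shows the common counting approach, in which each item is scored as the number of times it was chosen as best minus the number of times it was chosen as worst, divided by the number of times it was shown. This is a minimal sketch of that generic scheme, not the paper's analysis; the function name, the item labels and the judgements are invented for illustration.

```python
from collections import Counter


def bws_scores(judgements):
    """Counting-based Best-Worst Scaling scores in the range [-1, 1].

    `judgements` is a list of (items_shown, best, worst) tuples, one per
    annotation (a hypothetical data layout chosen for this example).
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, chosen_best, chosen_worst in judgements:
        shown.update(items)
        best[chosen_best] += 1
        worst[chosen_worst] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Invented toy judgements over four anonymised translations of one segment.
candidates = ("T1", "T2", "T3", "T4")
toy_judgements = [
    (candidates, "T1", "T4"),
    (candidates, "T2", "T4"),
    (candidates, "T1", "T3"),
]

for item, score in sorted(bws_scores(toy_judgements).items()):
    print(item, round(score, 2))
```

Each annotator makes only two choices per set, which is why the method is often described as time-efficient compared with rating every item separately.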

pdf bib
Quantifying the Contribution of MWEs and Polysemy in Translation Errors for English–Igbo MT
Adaeze Ohuoba | Serge Sharoff | Callum Walker

In spite of recent successes in improving Machine Translation (MT) quality overall, MT engines require a large amount of resources, which leads to markedly lower quality for lesser-resourced languages. This study explores the case of translation from English into Igbo, a very low resource language spoken by about 45 million speakers. With the aim of improving MT quality in this scenario, we investigate methods for guided detection of critical/harmful MT errors, more specifically those caused by non-compositional multi-word expressions and polysemy. We have designed diagnostic tests for these cases and applied them to collections of medical texts from CDC, Cochrane, NCDC, NHS and WHO.

pdf bib
Analysis of the Annotations from a Crowd MT Evaluation Initiative: Case Study for the Spanish-Basque Pair
Nora Aranberri

With the advent and success of trainable automatic evaluation metrics, creating annotated machine translation evaluation data sets is increasingly relevant. However, for low-resource languages, gathering such data can be challenging and further insights into evaluation design for opportunistic scenarios are necessary. In this work we explore an evaluation initiative that targets the Spanish-Basque language pair to study the impact of design decisions and the reliability of volunteer contributions. To do that, we compare the work carried out by volunteers and a translation professional in terms of evaluation results and evaluator agreement and examine the control measures used to ensure reliability. Results show similar behaviour regarding general quality assessment but underscore the need for more informative working environments to make evaluation processes more reliable as well as the need for carefully crafted control cases.

pdf bib
Implementations & Case Studies

pdf bib
A Case Study on Contextual Machine Translation in a Professional Scenario of Subtitling
Sebastian Vincent | Charlotte Prescott | Chris Bayliss | Chris Oakley | Carolina Scarton

Incorporating extra-textual context such as film metadata into the machine translation (MT) pipeline can enhance translation quality, as indicated by automatic evaluation in recent work. However, the positive impact of such systems in industry remains unproven. We report on an industrial case study carried out to investigate the benefit of MT in a professional scenario of translating TV subtitles with a focus on how leveraging extra-textual context impacts post-editing. We found that post-editors marked significantly fewer context-related errors when correcting the outputs of MTCue, the context-aware model, as opposed to non-contextual models. We also present the results of a survey of the employed post-editors, which highlights contextual inadequacy as a significant gap consistently observed in MT. Our findings strengthen the motivation for further work within fully contextual MT.

pdf bib
Training an NMT system for legal texts of a low-resource language variety South Tyrolean German - Italian
Antoni Oliver | Sergi Alvarez-Vidal | Egon Stemle | Elena Chiocchetti

This paper illustrates the process of training and evaluating NMT systems for a language pair that includes a low-resource language variety. A parallel corpus of legal texts for Italian and South Tyrolean German has been compiled, with South Tyrolean German being the low-resourced language variety. As the size of the compiled corpus is insufficient for the training, we have combined the corpus with several parallel corpora using data weighting at sentence level. We then performed an evaluation of each combination and of two popular commercial systems.

pdf bib
Implementing Gender-Inclusivity in MT Output using Automatic Post-Editing with LLMs
Mara Nunziatini | Sara Diego

This paper investigates the effectiveness of combining machine translation (MT) systems and large language models (LLMs) to produce gender-inclusive translations from English to Spanish. The study uses a multi-step approach where a translation is first generated by an MT engine and then reviewed by an LLM. The results suggest that while LLMs, particularly GPT-4, are successful in generating gender-inclusive post-edited translations and show potential in enhancing fluency, they often introduce unnecessary changes and inconsistencies. The findings underscore the continued necessity for human review in the translation process, highlighting the current limitations of AI systems in handling nuanced tasks like gender-inclusive translation. Also, the study highlights that while the combined approach can improve translation fluency, the effectiveness and reliability of the post-edited translations can vary based on the language of the prompts used.

pdf bib
CantonMT: Cantonese to English NMT Platform with Fine-Tuned Models using Real and Synthetic Back-Translation Data
Kung Hong | Lifeng Han | Riza Batista-Navarro | Goran Nenadic

Neural Machine Translation (NMT) for low-resource languages remains a challenge for many NLP researchers. In this work, we deploy a standard data augmentation methodology by back-translation to a new language translation direction, i.e., Cantonese-to-English. We present the models we fine-tuned using the limited amount of real data and the synthetic data we generated using back-translation by three models: OpusMT, NLLB, and mBART. We carried out automatic evaluation using a range of different metrics including those that are lexical-based and embedding-based. Furthermore, we create a user-friendly interface for the models we included in this project, CantonMT, and make it available to facilitate Cantonese-to-English MT research. Researchers can add more models to this platform via our open-source CantonMT toolkit, available at https://github.com/kenrickkung/CantoneseTranslation.

pdf bib
Advancing Digital Language Equality in Europe: A Market Study and Open-Source Solutions for Multilingual Websites
Andrejs Vasiljevs | Rinalds Vīksna | Neil Vacheva | Andis Lagzdiņš

The paper presents findings from a comprehensive market study commissioned by the European Commission, aimed at analysing multilinguality of European websites and automated website translation services across various sectors. The findings show that the majority of websites offer content in one or two languages, while fewer than 25% of European websites provide content in 3 or more languages. Additionally, we introduce Web-T, a collection of open-source solutions facilitating automated website translation with the help of the free MT service eTranslation provided by the European Commission and the possibility to integrate other MT providers. Web-T solutions include local plug-ins for Content Management Systems, universal plug-ins, and an MT API Integrator, thus contributing to the broader goal of digital language equality in Europe.

pdf bib
Exploring the Effectiveness of LLM Domain Adaptation for Business IT Machine Translation
Johannes Eschbach-Dymanus | Frank Essenberger | Bianka Buschbeck | Miriam Exel

In this paper, we study the translation abilities of Large Language Models (LLMs) for business IT texts. We are strongly interested in domain adaptation of translation systems, which is essential for accurate and lexically appropriate translation of such texts. Among the open-source models evaluated in a zero- and few-shot setting, we find Llama-2 13B the most promising for domain-specific translation fine-tuning. We investigate the full range of adaptation techniques for LLMs: from prompting, over parameter-efficient fine-tuning to full fine-tuning, and compare to classic neural machine translation (MT) models trained internally at SAP. We provide guidance on how to use training budget most effectively for different fine-tuning approaches. We observe that while LLMs can translate on par with SAP’s MT models on general domain data, it is difficult to close the gap on SAP’s domain-specific data, even with extensive training and carefully curated data.

pdf bib
Creating and Evaluating a Multilingual Corpus of UN General Assembly Debates
Hannah Bechara | Krishnamoorthy Manohara | Slava Jankin

This paper presents a multilingual aligned corpus of political debates from the United Nations (UN) General Assembly sessions between 1978 and 2021, which covers five of the six official UN languages (the six being Arabic, Chinese, English, French, Russian, and Spanish). We explain the preprocessing steps we applied to the corpus. We align the sentences by using word vectors to numerically represent the meaning of each sentence and then calculating the Euclidean distance between them. To validate our alignment methods, we conducted an evaluation study with crowd-sourced human annotators using Scale AI, an online platform for data labelling. The final dataset consists of around 300,000 aligned sentences for En-Es, En-Fr, En-Zh and En-Ru. It is publicly available for download.
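
The alignment idea described above (sentence vectors compared by Euclidean distance) can be sketched as follows. This is a minimal illustration only: it assumes the sentence vectors have already been computed in a shared cross-lingual space, and the greedy nearest-neighbour matching, the `max_dist` threshold and the toy vectors are assumptions made for the example rather than details taken from the paper.

```python
import math


def align_sentences(src_vecs, tgt_vecs, max_dist):
    """Pair each source sentence with its nearest target sentence.

    Returns (source_index, target_index, distance) triples and drops pairs
    whose Euclidean distance exceeds `max_dist` (a hypothetical cut-off).
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        # math.dist computes the Euclidean distance between two points.
        j, d = min(
            ((j, math.dist(u, v)) for j, v in enumerate(tgt_vecs)),
            key=lambda candidate: candidate[1],
        )
        if d <= max_dist:
            pairs.append((i, j, d))
    return pairs


# Toy 2-dimensional "sentence vectors", invented for illustration.
source = [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]]
target = [[1.0, 1.1], [0.1, 0.0], [5.0, 5.0]]

for i, j, d in align_sentences(source, target, max_dist=0.5):
    print(f"source {i} <-> target {j} (distance {d:.2f})")
```

In this toy run the third source sentence stays unaligned because no target vector lies within the threshold, which is one simple way to avoid forcing a match where no translation exists.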

pdf bib
Generating subject-matter expertise assessment questions with GPT-4: a medical translation use-case
Diana Silveira | Marina Torrón | Helena Moniz

This paper examines the suitability of a large language model (LLM), GPT-4, for generating multiple choice questions (MCQs) aimed at assessing subject matter expertise (SME) in the domain of medical translation. The main objective of these questions is to model the skills of potential subject matter experts in a human-in-the-loop machine translation (MT) flow, to ensure that tasks are matched to the individuals with the right skill profile. The investigation was conducted at Unbabel, an artificial intelligence-powered human translation platform. Two medical translation experts evaluated the GPT-4-generated questions and answers, one focusing on English–European Portuguese, and the other on English–German. We present a methodology for creating prompts to elicit high-quality GPT-4 outputs for this use case, as well as for designing evaluation scorecards for human review of such output. Our findings suggest that GPT-4 has the potential to generate suitable items for subject matter expertise tests, providing a more efficient approach compared to relying solely on humans. Furthermore, we propose recommendations for future research to build on our approach and refine the quality of the outputs generated by LLMs.

pdf bib
Prompting Large Language Models with Human Error Markings for Self-Correcting Machine Translation
Nathaniel Berger | Stefan Riezler | Miriam Exel | Matthias Huck

While large language models (LLMs) pre-trained on massive amounts of unpaired language data have reached the state-of-the-art in machine translation (MT) of general domain texts, post-editing (PE) is still required to correct errors and to enhance term translation quality in specialized domains. In this paper we present a pilot study of enhancing translation memories (TM) produced by PE (source segments, machine translations, and reference translations, henceforth called PE-TM) for the needs of correct and consistent term translation in technical domains. We investigate a light-weight two-step scenario where at inference time, a human translator marks errors in the first translation step, and in a second step a few similar examples are extracted from the PE-TM to prompt an LLM. Our experiment shows that the additional effort of augmenting translations with human error markings guides the LLM to focus on a correction of the marked errors, yielding consistent improvements over automatic PE (APE) and MT from scratch.

pdf bib
Estonian-Centric Machine Translation: Data, Models, and Challenges
Elizaveta Korotkova | Mark Fishel

Machine translation (MT) research is most typically English-centric. In recent years, massively multilingual translation systems have also been increasingly popular. However, efforts purposefully focused on less-resourced languages are less widespread. In this paper, we focus on MT from and into the Estonian language. First, emphasizing the importance of data availability, we generate and publicly release a back-translation corpus of over 2 billion sentence pairs. Second, using these novel data, we create MT models covering 18 translation directions, all either from or into Estonian. We re-use the encoder of the NLLB multilingual model and train modular decoders separately for each language, surpassing the original NLLB quality. Our resulting MT models largely outperform other open-source MT systems, including previous Estonian-focused efforts, and are released as part of this submission.