Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

Carolina Scarton, Charlotte Prescott, Chris Bayliss, Chris Oakley, Joanna Wright, Stuart Wrigley, Xingyi Song, Edward Gow-Smith, Mikel Forcada, Helena Moniz (Editors)


Anthology ID: 2024.eamt-2
Month: June
Year: 2024
Address: Sheffield, UK
Venue: EAMT
Publisher: European Association for Machine Translation (EAMT)
URL: https://aclanthology.org/2024.eamt-2
PDF: https://aclanthology.org/2024.eamt-2.pdf

Products & Projects

Transitude: Machine Translation on Social Media: MT as a potential tool for opinion (mis)formation
Khetam Sharou | Joss Moorkens

Misinformation on social media is a concern for content creators, consumers and regulators alike. Transitude looks at misinformation generated by machine translation (MT) through distortion of the intention and sentiment of text. It is the first study of MT’s impact on the formation of users’ views of society through refugees in Ireland. It extends current MT evaluation methods with a new quality evaluation framework, producing the first dataset annotated for information distortion. It provides insights into the risks of relying on MT, with recommendations for users, developers, and policymakers.

Lightweight neural translation technologies for low-resource languages
Felipe Sánchez-Martínez | Juan Antonio Pérez-Ortiz | Víctor Sánchez-Cartagena | Andrés Lou | Cristian García-Romero | Aarón Galiano-Jiménez | Miquel Esplà-Gomis

The LiLowLa (“Lightweight neural translation technologies for low-resource languages”) project aims to enhance machine translation (MT) and translation memory (TM) technologies, particularly for low-resource language pairs, where adequate linguistic resources are scarce. The project started in September 2022 and will run until August 2025.

MaTIAS: Machine Translation to Inform Asylum Seekers
Lieve Macken | Ella Hest | Arda Tezcan | Michaël Lumingu | Katrijn Maryns | July Wilde

This project aims to develop a multilingual notification system for asylum reception centres in Belgium using machine translation. The system will allow staff to communicate practical messages to residents in their own language. Ethnographically inspired fieldwork is being conducted in reception centres to understand current communication practices and ensure that the technology meets user needs. The quality and suitability of machine translation will be evaluated for three MT systems supporting all target languages. Automatic and manual evaluation methods will be used to assess translation quality, and terms of use, privacy and data protection conditions will be analysed.

SmartBiC: Smart Harvesting of Bilingual Corpora from the Internet
Gema Ramírez-Sánchez | Sergio Ortiz Rojas | Alicia Núñez Alcover | Tudor Mateiu | Mikel Forcada | Pedro Orzas | Almudena Carrillo | Giuseppe Nolasco | Noelia Listón

SmartBiC, an 18-month innovation project funded by the Spanish Government, aims at improving the full process of collecting, filtering and selecting in-domain parallel content to be used for machine translation and language model tuning purposes in industrial settings. Based on state-of-the-art technology in the free/open-source parallel web corpora harvester Bitextor, SmartBiC develops a web-based application around it including novel components such as a language- and domain-focused crawler and a domain-specific corpora selector. SmartBiC also addresses specific industrial use cases for individual components of the Bitextor pipeline, such as parallel data cleaning. Relevant improvements to the current Bitextor pipeline will be publicly released.

An Eye-Tracking Study on the Use of Machine Translation Post-Editing and Automatic Speech Recognition in Translations for the Medical Domain
Raluca Chereji

This EAMT-funded eye-tracking study investigates the impact of Machine Translation Post-Editing and Automatic Speech Recognition on English-Romanian translations of patient-facing medical texts. This paper provides an overview of the study objectives, setup and preliminary results.

The MAKE-NMTViz Project: Meaningful, Accurate and Knowledge-limited Explanations of NMT Systems for Translators
Gabriela Gonzalez-Saez | Fabien Lopez | Mariam Nakhle | James Turner | Nicolas Ballier | Marco Dinarelli | Emmanuelle Esperança-Rodier | Sui He | Caroline Rossi | Didier Schwab | Jun Yang

This paper describes MAKE-NMTViz, a project designed to help translators visualize neural machine translation outputs using explainable artificial intelligence visualization tools initially developed for computer vision.

MULTILINGTOOL, Development of an Automatic Multilingual Subtitling and Dubbing System
Xabier Saralegi | Ander Corral | Igor Leturia | Xabier Sarasola | Josu Murua | Iker Manterola | Itziar Cortes

In this paper, we present the MULTILINGTOOL project, led by the Elhuyar Foundation and funded by the European Commission under the CREA-MEDIA2022-INNOVBUSMOD call. The aim of the project is to develop an advanced platform for automatic multilingual subtitling and dubbing. It will provide support for Spanish, English, and French, as well as the co-official languages of Spain, namely Basque, Catalan, and Galician.

ERC Advanced Grant Project CALCULUS: Extending the Boundary of Machine Translation
Jingyuan Sun | Mingxiao Li | Ruben Cartuyvels | Marie-Francine Moens

The CALCULUS project, drawing on human capabilities of imagination and commonsense for natural language understanding (NLU), aims to advance machine-based NLU by integrating traditional AI concepts with contemporary machine learning techniques. It focuses on developing anticipatory event representations from both textual and visual data, connecting language structure to visual spatial organization and incorporating broad knowledge domains. Through testing these models in NLU tasks and evaluating their ability to predict untrained spatial and temporal details using real-world metrics, CALCULUS employs machine learning methods, including Bayesian techniques and neural networks, especially in data-sparse scenarios. The project’s culmination involves creating demonstrators that transform written stories into dynamic videos, showcasing the interdisciplinary expertise of the project leader in natural language processing, language and visual data analysis, information retrieval, and machine learning, all vital for the project’s achievements. In the CALCULUS project, our exploration of machine translation extends beyond the conventional text-to-text framework. We are broadening the horizons of machine translation by delving into the essence of transforming the formats of data distribution while keeping the meaning. This innovative approach involves converting information from one modality into another, transcending traditional linguistic boundaries. Our project includes novel work on translating text into images and videos, and brain signals into images and videos.

GAMETRAPP project in progress: Designing a gamified environment for post-editing research abstracts
Laura Noriega-Santiáñez | Cristina Toledo-Báez

The «App for post-editing neural machine translation using gamification» (GAMETRAPP) project (TED2021-129789B-I00), funded by the Spanish Ministry of Science and Innovation (2022–2024), has been in progress for a year. This paper presents its main goals and the analysis carried out of neural machine translation and post-editing errors in research abstracts. That analysis informs the design of the gamified environment, which is currently under construction.

RCnum: A Semantic and Multilingual Online Edition of the Geneva Council Registers from 1545 to 1550
Pierrette Bouillon | Christophe Chazalon | Sandra Coram-Mekkey | Gilles Falquet | Johanna Gerlach | Stephane Marchand-Maillet | Laurent Moccozet | Jonathan Mutal | Raphael Rubino | Marco Sorbi

The RCnum project is funded by the Swiss National Science Foundation and aims at producing a multilingual and semantically rich online edition of the Registers of Geneva Council from 1545 to 1550. Combining multilingual NLP, history and paleography, this collaborative project will clear hurdles inherent to texts manually written in 16th-century Middle French while allowing for easy access and interactive consultation of these archives.

MTPE quality evaluation in translator education: the postedit.me app
Marie-Aude Lefer | Romane Bodart | Justine Piette | Adam Obrusník

This article presents the main functionality of the postedit.me app. Postedit.me is a software program that supports machine translation post-editing training in translator education, with special emphasis on standardized quality evaluation of post-edited texts produced by students. The app is made freely available to universities for teaching and research purposes.

Boosting Machine Translation with AI-powered terminology features
Marek Sabo | Judith Klein | Giorgio Bernardinello

Artificial intelligence (AI) is quickly becoming an exciting new technology for the translation industry in the form of large language models (LLMs). AI-based functionality could be used to improve the output of neural machine translation (NMT). One main issue that impacts MT quality and reliability is incorrect terminology. This is why STAR is making AI-powered terminology control a priority for its translation products: the gains to be made are significant, greatly improving the quality of MT output, reducing post-editing (PE) costs and effort, and thereby boosting overall translation productivity.

Automatic detection of (potential) factors in the source text leading to gender bias in machine translation
Janiça Hackenbuchner | Arda Tezcan | Joke Daems

This research project aims to develop a comprehensive methodology to help make machine translation (MT) systems more gender-inclusive for society. The goal is the creation of a detection system, a machine learning (ML) model trained on manual annotations, that can automatically analyse source data and detect and highlight words and phrases that influence the gender bias inflection in target translations. The main research outputs will be (1) a manually annotated dataset, (2) a taxonomy, and (3) a fine-tuned model.

INCREC: Uncovering the creative process of translated content using machine translation
Ana Guerberof-Arenas

The INCREC project aims to uncover professional translators’ creative stages to understand how technology can be best applied to the translation of literary and audio-visual texts, and to analyse the impact of these processes on readers and viewers. To better understand this process, INCREC triangulates data from eye-tracking, retrospective think-aloud interviews, translated material, and questionnaires from professional translators and users.

SMUGRI-MT - Machine Translation System for Low-Resource Finno-Ugric Languages
Taido Purason | Aleksei Ivanov | Lisa Yankovskaya | Mark Fishel

We introduce SMUGRI-MT, an online neural machine translation system that covers 20 low-resource Finno-Ugric languages, along with seven high-resource languages.

plain X: 4-in-1 multilingual adaptation platform
Peggy Kreeft | Mirko Lorenz | Carlos Amaral

plain X is a 4-in-1 solution for language adaptation. The software is an outcome of European HLT research and is now in use as the major artificial-intelligence-powered human language processing platform at Deutsche Welle. plain X is a one-stop shop for automated transcription, translation, subtitling and voice-over, with human correction options at all stages. We demonstrate how the platform works and show new features and developments of the platform in the framework of the SELMA project.

The BridgeAI Project
Helena Moniz | Joana Lamego | Nuno André | António Novais | Bruno Silva | Maria Henriques | Mariana Dalblon | Paulo Dimas | Pedro Gonçalves

This paper describes the project “BridgeAI: Boosting Regulatory Implementation with Data-driven insights, Global expertise, and Ethics for AI”, a one-year science-for-policy research project funded by the Portuguese Foundation for Science and Technology (FCT). The project aims to provide decision-makers in Portugal with the best context to implement the EU Artificial Intelligence (AI) Act and bridge the gap between AI research and policy. Although not exclusively on machine translation, the project pertains to natural language processing in general and ultimately to each of us as citizens.

GeFMT: Gender-Fair Language in German Machine Translation
Manuel Lardelli | Anne Lauscher | Giuseppe Attanasio

Research on gender bias in Machine Translation (MT) predominantly focuses on binary gender or a few languages. In this project, we investigate the ability of commercial MT systems and neural models to translate using gender-fair language (GFL) from English into German. We enrich a community-created GFL dictionary, and sample multi-sentence test instances from encyclopedic text and parliamentary speeches. We translate our resources with different MT systems and open-weights models. We also plan to post-edit biased outputs with professionals and share them publicly. The outcome will constitute a new resource for automatic evaluation and modeling of gender-fair EN-DE MT.

ExU: AI Models for Examining Multilingual Disinformation Narratives and Understanding their Spread
Jake Vasilakes | Zhixue Zhao | Michal Gregor | Ivan Vykopal | Martin Hyben | Carolina Scarton

Addressing online disinformation requires analysing narratives across languages to help fact-checkers and journalists sift through large amounts of data. The ExU project focuses on developing AI-based models for multilingual disinformation analysis, addressing the tasks of rumour stance classification and claim retrieval. We describe the ExU project proposal and summarise the results of a user requirements survey regarding the design of tools to support fact-checking.

Multilinguality in the VIGILANT project
Brendan Spillane | Carolina Scarton | Robert Moro | Petar Ivanov | Andrey Tagarev | Jakub Simko | Ibrahim Abu Farha | Gary Munnelly | Filip Uhlárik | Freddy Heppell

VIGILANT (Vital IntelliGence to Investigate ILlegAl DisiNformaTion) is a three-year Horizon Europe project that will equip European Law Enforcement Agencies (LEAs) with advanced disinformation detection and analysis tools to investigate and prevent criminal activities linked to disinformation. These include disinformation instigating violence towards minorities, promoting false medical cures, and increasing tensions between groups causing civil unrest and violent acts. VIGILANT’s four LEAs require support for English, Spanish, Catalan, Greek, Estonian, Romanian and Russian. Therefore, multilinguality is a major challenge and we present the current status of our tools and our plans to improve their performance.

Evaluating Machine Translation for Emotion-loaded User Generated Content (TransEval4Emo-UGC)
Shenbin Qian | Constantin Orasan | Félix Do Carmo | Diptesh Kanojia

This paper presents a dataset for evaluating the machine translation of emotion-loaded user generated content. It contains human-annotated quality evaluation data and post-edited reference translations. The dataset is available at our GitHub repository.

Community-driven machine translation for the Catalan language at Softcatalà
Xavi Ivars-Ribes | Jordi Mas | Marc Riera | Jaume Ortolà | Mikel Forcada | David Cànovas

Among the services provided by Softcatalà, a non-profit 25-year-old grassroots organization that localizes software into Catalan and develops software to ease the generation of Catalan content, one of the most used is its machine translation (MT) service, which provides both rule-based MT and neural MT between Catalan and twelve other languages. Development occurs in a community-supported, transparent way by using free/open-source software and open language resources. This paper briefly describes the MT services at Softcatalà: the offered functionalities, the data, and the software used to provide them.

The MTxGames Project: Creative Video Games and Machine Translation – Different Post-Editing Methods in the Translation Process
Judith Brenner

MTxGames is a doctoral research project examining three different machine translation (MT) post-editing (PE) methods in the context of translating creative texts from video games, focusing on translation speed, cognitive effort, quality, and translators’ preferences. This is a mixed-methods study, eliciting quantitative data through keylogging, eye-tracking, and error evaluation as well as qualitative data through interviews. To create realistic experimental conditions, data elicitation takes place at the workplaces of freelancing professional game translators.

SignON – a Co-creative Machine Translation for Sign and Spoken Languages (end-of-project results, contributions and lessons learned)
Dimitar Shterionov | Vincent Vandeghinste | Mirella Sisto | Aoife Brady | Mathieu De Coster | Lorraine Leeson | Andy Way | Josep Blat | Frankie Picron | Davy Landuyt | Marcello Scipioni | Aditya Parikh | Louis Bosch | John O’Flaherty | Joni Dambre | Caro Brosens | Jorn Rijckaert | Víctor Ubieto | Bram Vanroy | Santiago Gomez | Ineke Schuurman | Gorka Labaka | Adrián Núñez-Marcos | Irene Murtagh | Euan McGill | Horacio Saggion

SignON, a 3-year Horizon 2020 project addressing the lack of technology and services for MT between sign languages (SLs) and spoken languages (SpLs), ended in December 2023. SignON was unprecedented. Not only did it address the wider complexity of the aforementioned problem (from research and development of recognition, translation and synthesis, through development of easy-to-use mobile applications and a cloud-based framework to do the “heavy lifting”, to establishing ethical, privacy and inclusiveness policies and operation guidelines), but it also engaged with the deaf and hard of hearing communities in an effective co-creation approach where these main stakeholders drove the development in the right direction and had the final say. Currently we are witnessing advances in natural language processing for SLs, including MT. SignON was one of the largest projects that contributed to this surge, with 17 partners and more than 60 consortium members, working in parallel with other international and European initiatives, such as project EASIER and others.

The Use of MT by humanitarian NGOs in Hong Kong
Marija Todorova | Rachel Hang Yi Liu

In the relief operations of international humanitarian organisations, non-governmental organisations (NGOs) often encounter language needs when delivering services (Tesseur 2022). This project examines the language needs of humanitarian NGOs working from Hong Kong and the solutions they adopted to overcome the language barriers when delivering international humanitarian relief to other countries.

HPLT’s First Release of Data and Models
Nikolay Arefyev | Mikko Aulamo | Pinzhen Chen | Ona De Gibert Bonet | Barry Haddow | Jindřich Helcl | Bhavitvya Malik | Gema Ramírez-Sánchez | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza-Bernabeu

The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data in 75 languages at 5.6T tokens and parallel data in 18 language pairs at 96M pairs, derived from 1.8 petabytes of web crawls. Building upon automated and transparent pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.

Literacy in Digital Environments and Resources (LT-LiDER)
Joss Moorkens | Pilar Sánchez-Gijón | Esther Simon | Mireia Urpí | Nora Aranberri | Dragoș Ciobanu | Ana Guerberof-Arenas | Janiça Hackenbuchner | Dorothy Kenny | Ralph Krüger | Miguel Rios | Isabel Ginel | Caroline Rossi | Alina Secară | Antonio Toral

LT-LiDER is an Erasmus+ cooperation project with two main aims. The first is to map the landscape of technological capabilities required to work as a language and/or translation expert in the digitalised and datafied language industry. The second is to generate training outputs that will help language and translation trainers improve their skills and adopt appropriate pedagogical approaches and strategies for integrating data-driven technology into their language or translation classrooms, with a focus on digital and AI literacy.

Cultural Transcreation with LLMs as a new product
Beatriz Silva | Helena Wu | Yan Jingxuan | Vera Cabarrão | Helena Moniz | Sara Guerreiro de Sousa | João Almeida | Malene Sjørslev Søholm | Ana Farinha | Paulo Dimas

We present how at Unbabel we have been using Large Language Models to apply a Cultural Transcreation (CT) product on customer support (CS) emails and how we have been testing the quality and potential of this product. We discuss our preliminary evaluation of the performance of different MT models in the task of translating rephrased content and the quality of the translation outputs. Furthermore, we introduce the live pilot programme and the corresponding relevant findings, showing that transcreated content is not only culturally adequate but it is also of high rephrasing and translation quality.

AI4Culture: Towards Multilingual Access for Cultural Heritage Data
Tom Vanallemeersch | Sara Szoc | Laurens Meeus

The AI4Culture project (2023-2025), funded by the European Commission, and involving a 12-partner consortium led by the National Technical University of Athens, develops a platform serving as an online capacity building hub for AI technologies in the cultural heritage (CH) sector, enabling multilingual access to CH data. It offers access to AI-related resources, including openly labelled datasets for model training and testing, deployable and reusable tools, and capacity building materials. The tools are aimed at optical character recognition (OCR) for printed and handwritten documents, subtitle generation and validation, machine translation (MT), and metadata enrichment via image information extraction and semantic linking. The project also customises these tools to enhance interface and component usability. We illustrate this with technology that corrects OCR output using language models and adapts it for MT.

The Center for Responsible AI Project
Maria Ana Henriques | Ana Farinha | Nuno André | António Novais | Sara Guerreiro de Sousa | Bruno Prezado Silva | Ana Oliveira | Helena Moniz | Andre Martins | Paulo Dimas

This paper describes the project “NextGenAI: Center for Responsible AI”, a 39-month Mobilizing and Green Agenda for Business Innovation funded by the Portuguese Recovery and Resilience Plan, under the Recovery and Resilience Facility (RRF). The project aims to create a new Center for Responsible AI in Portugal, capable of delivering more than 20 AI products in crucial areas like “Life Sciences”, many of which use generative AI, particularly NLP models such as those for Machine Translation, contributing to translating into legislation the European Law included in the EU AI Act, and creating a critical mass in the development of responsible AI technologies. To accomplish this mission, the Center for Responsible AI is formed by an ecosystem of startups and research institutions driving research in a virtuous way by addressing real market needs and opportunities in Responsible AI.