Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 2: Presentations)

Marianna Martindale, Janice Campbell, Konstantin Savenkov, Shivali Goel (Editors)


Anthology ID: 2024.amta-presentations
Month: September
Year: 2024
Address: Chicago, USA
Venue: AMTA
Publisher: Association for Machine Translation in the Americas
URL: https://aclanthology.org/2024.amta-presentations


Staying in the Loop with Gen AI: AI/Gen AI-Powered HLT for Public Sector
Konstantine Boukhvalov

With the development of Generative AI (GAI) capabilities and new applications of GAI emerging every day, many are debating what role, if any, there will be for human involvement in various tasks, from translation to translation-related services (TRS) to project management. Large organizations, such as language service providers and their customers, are concerned with what their companies will look like in the new GAI world. During our presentation, ManpowerGroup Public Sector (MGPS) will outline its vision for the future role of “humans-in-the-loop” in machine translation for the public sector and how we are transforming our organization to meet the new demands of GAI technology and workflows. We will outline five focus areas: corpus building; corpus curation and quality control; security; workflow adjustments; and output quality evaluation, including fact-checking and domain-specific expertise.

The Evolving Path to LLM-based MT
Kirti Vashee

This session will explore the challenges and obstacles we face in transitioning from current SOTA NMT models to an LLM-based MT landscape for enterprise use cases. NMT models are now pervasive and are used in many production scenarios, from eCommerce and eDiscovery to Customer Service & Support. While LLM MT shows promise for high-resource language translation, there are significant latency, throughput, and adaptation challenges to resolve. The session will address key questions: Can LLM MT scale to the same levels as current NMT technology? What innovation can we expect from LLM MT to further the SOTA? What other impact will GenAI have on localization production practices? Will there be an interim hybrid period in which NMT and GenAI work together in production workflows? Will LLM MT be able to address low-resource language requirements? And how will multilingual LLMs being developed around the world affect the Big Tech and English-centric dominance we see in GenAI today?

Enhancing Translation Accuracy and Consistency through Large Language Models
Mei Chai Zheng

Recent advancements in neural machine translation (NMT) have significantly improved translation accuracy from one language to another. However, hurdles such as adherence to translation memories, context-specific terminology, and consistent formality register remain pervasive. This presentation explores the integration of Large Language Models (LLMs) into the MT pipeline to address these specific issues, demonstrating substantial improvements in translation quality and contextual appropriateness.
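
By way of illustration, a minimal sketch of one common pattern for this kind of LLM integration: the MT hypothesis is passed to an instruction-tuned model together with the required terminology and register. The presentation does not publish its pipeline; the model name, glossary format, and prompt wording below are assumptions.

    # A minimal sketch, assuming an OpenAI-style chat API; not the
    # presenter's actual pipeline.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def enforce_constraints(source, mt_output, glossary, register="formal"):
        terms = "\n".join(f"- {src} -> {tgt}" for src, tgt in glossary.items())
        prompt = (
            "Revise the draft translation so it uses the required terminology "
            f"and keeps a consistently {register} register.\n"
            f"Required terminology:\n{terms}\n"
            f"Source text: {source}\n"
            f"Draft translation: {mt_output}\n"
            "Return only the revised translation."
        )
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumed model; any instruction-tuned LLM works
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()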

Is AI the new “Human evaluator”?
Aneta Sapeta

The AI tide has been rising in the localization industry for many years now, and even with the big hype around it, it is still trying to find its place. Some are trying to use it as a replacement for current NMT models on the market, others as a helping tool for evaluating NMT output with less human input in assessing MT quality. In our experience, we still depend on human evaluation for assessment, but how good an evaluator can AI be? Our tests show that evaluating MT quality with AI can be challenging (even though we have seen significant progress in recent years): it requires the system to understand the meaning of both the source and the target, to judge quality by assessing more or less visible errors, and to remain unbiased in its assessment. In this presentation, we share our insights on the reliability of AI for MT evaluation and on whether we can exclude humans from the evaluation loop.

PREDICT Methodology - Machine Translation Eligibility Criteria
Paula Manzur

Enterprises in the localization sector handle diverse content types, requiring precise localization solutions. Options range from raw machine translation to transcreation. But how can they ensure the best match between content and localization method? Traditionally, the decision has relied mostly on human judgment. The PREDICT Methodology, crafted by Booking.com’s localization central team, offers a systematic framework for assessing MT suitability, aligning content type with the optimal localization solution. By integrating risk-tolerance weights into binary queries about the source content and use case, PREDICT produces a score and a recommended solution, from raw MT to human-only translation. This approach enables our business to provide the right quality for each specific content type, boost translation efficiency, and reduce costs. Looking ahead, the methodology envisions integrating LLMs for automation and guidance, using prompts to identify risk-mitigating strategies.
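
The abstract describes the scoring mechanics only at a high level; the sketch below illustrates the general shape of a weighted binary questionnaire. The questions, weights, and thresholds are hypothetical placeholders, not Booking.com’s actual criteria.

    # Hypothetical PREDICT-style scoring: binary answers about the content
    # and use case, each weighted by risk tolerance. Questions, weights,
    # and thresholds are invented placeholders.
    QUESTIONS = {
        "is_legally_binding":    0.40,
        "is_brand_sensitive":    0.25,
        "is_customer_facing":    0.20,
        "has_short_shelf_life": -0.15,  # ephemeral content tolerates raw MT
    }

    def predict_score(answers):
        """answers: dict mapping question name to True/False."""
        return sum(weight for q, weight in QUESTIONS.items() if answers.get(q))

    def recommend(score):
        if score >= 0.6:
            return "human-only translation"
        if score >= 0.3:
            return "MT + full post-editing"
        if score >= 0.1:
            return "MT + light post-editing"
        return "raw MT"

    answers = {"is_customer_facing": True, "has_short_shelf_life": True}
    print(recommend(predict_score(answers)))  # 0.20 - 0.15 = 0.05 -> "raw MT"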

The Multi-Range Theory of Translation Quality Measurement: MQM scoring models and Statistical Quality Control
Arle Lommel | Serge Gladkoff | Alan Melby | Sue Ellen Wright | Ingemar Strandvik | Katerina Gasova | Angelika Vaasa | Andy Benzo | Romina Marazzato Sparano | Monica Foresi | Johani Innis | Lifeng Han | Goran Nenadic

The year 2024 marks the 10th anniversary of the Multidimensional Quality Metrics (MQM) framework for analytic translation quality evaluation. The MQM error typology has been widely used by practitioners in the translation and localization industry and has served as the basis for many derivative projects. The annual Conference on Machine Translation (WMT) has used the MQM error typology in its shared tasks on both human and automatic translation quality evaluation. The metric stands on two pillars: the error typology and the scoring model. The scoring model calculates the quality score from annotation data, detailing how to convert error type and severity counts into numeric scores to determine whether the content meets specifications. Previously, only the raw scoring model had been published. This April, the MQM Council published the Linear Calibrated Scoring Model, officially presented herein, along with the Non-Linear Scoring Model, which had not been published before.
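
For orientation, a minimal sketch of the raw scoring model mentioned above, which converts severity-weighted error counts into a per-word penalty. The severity weights shown are commonly cited defaults; real deployments calibrate them against the applicable specifications.

    # Sketch of the raw MQM scoring model: severity-weighted error counts
    # normalized per word and subtracted from a perfect score. Weights are
    # common defaults (some specifications use other values).
    SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 25}

    def mqm_raw_score(error_counts, evaluation_word_count, max_score=100):
        """error_counts: dict such as {"minor": 4, "major": 1}."""
        penalty_total = sum(SEVERITY_WEIGHTS[sev] * n
                            for sev, n in error_counts.items())
        return max_score * (1 - penalty_total / evaluation_word_count)

    # 4 minor + 1 major error in a 500-word sample: penalty 9/500 -> 98.2
    print(mqm_raw_score({"minor": 4, "major": 1}, 500))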

Automating Idiom Translation with Cross-Lingual Natural Language Generation Grounded In Semantic Analyses Using Large Language Models
Ming Qian

Idioms exhibit varying degrees of semantic transparency, making their translation challenging; cross-language differences in idiom usage and connotation add further complexity. Using a large language model (LLM) approach, we automate Chinese-to-English idiom translation in three steps: (1) semantic analysis of the Chinese idiom using an ontology or FrameNet to identify key concepts and relationships such as action, purpose, outcome, and context; (2) generation of multi-word English expressions reflecting these concepts; and (3) selection of the top English idiom candidate that most closely matches the Chinese idiom’s meaning. Applied to examples like ‘破釜沉舟’, ‘刀山火海’, and ‘抛砖引玉’, our method performs on par with human experts. The semantic reasoning approach enhances the transparency of LLM decisions, simulating logical inference over the semantic framework.
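
A minimal sketch of the three-step pipeline, with each step expressed as an LLM prompt. The prompt wording and the llm callable are illustrative assumptions; the authors ground step (1) in ontology or FrameNet analyses.

    # Sketch of the three-step pipeline; `llm` is any callable mapping a
    # prompt string to the model's reply. Prompt wording is illustrative.
    def translate_idiom(chinese_idiom, llm):
        # Step 1: semantic analysis into key concepts and relationships
        analysis = llm(
            "Analyze the Chinese idiom below and list its action, purpose, "
            f"outcome, and typical context of use.\nIdiom: {chinese_idiom}"
        )
        # Step 2: generate candidate multi-word English expressions
        candidates = llm(
            "Given this semantic analysis, list five English idioms or "
            f"multi-word expressions with the same meaning:\n{analysis}"
        )
        # Step 3: select the candidate closest to the idiom's meaning
        return llm(
            f"Idiom: {chinese_idiom}\nCandidates:\n{candidates}\n"
            "Pick the single best match and briefly justify the choice."
        )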

Enhancing Localization Workflows with GenAI-Based Solutions: A Deep Dive into Automated Post-Editing and Translation Error Detection
Maciej Modrzejewski

The advent of Large Language Models (LLMs) has significantly transformed the localization sector. This presentation examines the integration of Generative AI (GenAI) solutions into translation and localization workflows, focusing on Automated Post-Editing (APE) and Automated Translation Error Detection. For the English-German and English-Japanese language pairs, APE consistently enhances translation quality by an average of 2-5 BLEU and 0.1-0.25 COMET points over strong generic baselines. In specialized domains, APE reduces post-editing time by 40% for the worst-performing outputs from encoder-decoder-based MT systems. Combining APE with our in-house reference-free Quality Estimation (QE) model yields additional improvement. Through detailed methodologies, human evaluation results, and industrial applications, we demonstrate the transformative potential of these technologies in enhancing accuracy, reducing costs, and optimizing localization processes.
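
One plausible reading of the APE-plus-QE combination is a gating loop: a reference-free QE model scores each segment, and only low-scoring segments are sent to the LLM for post-editing. The sketch below uses the public CometKiwi checkpoint and an arbitrary threshold as stand-ins for the authors’ in-house model.

    # Sketch of QE-gated APE: score segments with a public reference-free
    # QE model and post-edit only those below a threshold. The checkpoint
    # and threshold stand in for the authors' in-house QE model.
    from comet import download_model, load_from_checkpoint

    qe_model = load_from_checkpoint(
        download_model("Unbabel/wmt22-cometkiwi-da")  # reference-free QE
    )

    def ape_weak_segments(pairs, post_edit, threshold=0.75):
        """pairs: list of {"src": ..., "mt": ...}; post_edit: APE callable."""
        scores = qe_model.predict(pairs, batch_size=8).scores
        return [post_edit(p["src"], p["mt"]) if s < threshold else p["mt"]
                for p, s in zip(pairs, scores)]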

CantonMT: Cantonese-English Neural Machine Translation Looking into Evaluations
Kung Yin Hong | Lifeng Han | Riza Batista-Navarro | Goran Nenadic

Cantonese-English is a low-resource language pair for machine translation (MT) studies, despite the vast amount of English content publicly available online and the large number of native Cantonese speakers. Building on our previous work on CANTONMT (Hong et al., 2024), in which we created open-source fine-tuned systems for Cantonese-English neural MT (NMT) from the base models NLLB, OpusMT, and mBART, along with corpus collection and creation, in this paper we report extended experiments on model training and comparison. In particular, we incorporate human evaluation by native Cantonese speakers who are also fluent in English. We designed a modified version of the HOPE metric from Gladkoff and Han (2022) for categorized error analysis and severity-level statistics (named HOPES). The models selected for human evaluation are the fine-tuned NLLB and mBART systems and two commercial translators, Bing and GPT-4.

Leveraging AI Technologies for Enhanced Multimedia Localization
Ashley Mondello | Sahil Rasane | Alina Karakanta | Laura Casanellas

As demand for multilingual video content rises, multimedia localization is becoming crucial for Language Service Providers (LSPs), offering revenue growth and new business opportunities. To cope with labor-intensive multimedia workflows and rising client demand for cheaper and faster multimedia localization services, LSPs are starting to leverage advanced AI applications to streamline the localization process. However, workflows and tools adopted by media service providers may not be suitable for LSPs, and the plethora of available solutions makes it hard for LSPs to choose the ones that most effectively optimize their workflows. In this presentation, we assess AI technologies that offer efficiency gains and cost reduction in the traditionally human-driven workflows of transcription, translation, voice-over (VO), and subtitling, with the goal of providing recommendations for LSPs on how to evaluate which tools work best for their processes.

Open-source LLMs vs. NMT Systems: Translating Spatial Language in EN-PT-br Subtitles
Rafael Fernandes | Marcos Lopes

This research investigates the challenges of translating spatial language using open-source LLMs versus traditional NMT systems. Focusing on spatial prepositions such as ACROSS, INTO, ONTO, and THROUGH, which are particularly challenging for the EN-PT-br pair, the study evaluates translations using the BLEU, METEOR, BERTScore, COMET, and TER metrics, along with manual error analysis. The findings reveal that moderate-sized LLMs, such as Llama-3-8B and Mixtral-8x7B, achieve accuracy comparable to NMT systems like DeepL. However, the LLMs frequently exhibit mistranslation errors, including interlanguage/code-switching and anglicisms, while the NMT systems demonstrate better fluency. Both LLMs and NMT systems struggle with spatial errors, including syntactic projections and polysemy. The study concludes that significant hurdles remain in accurately translating spatial language and suggests that future research focus on enhancing training datasets, refining models, and developing more sophisticated evaluation metrics.
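
For reference, a minimal sketch of the string-based part of such an evaluation (BLEU and TER) using the sacrebleu library; BERTScore and COMET come from their own packages (bert-score, unbabel-comet). The example sentences are invented.

    # String-based metrics (BLEU, TER) with sacrebleu; example strings are
    # invented, not from the study's test set.
    import sacrebleu

    hyps = ["Ela atravessou a rua correndo."]   # system output
    refs = [["Ela correu através da rua."]]     # one reference stream
    print(sacrebleu.corpus_bleu(hyps, refs).score)
    print(sacrebleu.corpus_ter(hyps, refs).score)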

Comparative Evaluation of Large Language Models for Linguistic Quality Assessment in Machine Translation
Daria Sinitsyna | Konstantin Savenkov

Building on our GPT-4 LQA research in MT, this study identifies the top-performing LLMs for an LQA pipeline comprising up to three models. LLMs including GPT-4, GPT-4o, GPT-4 Turbo, Google Vertex, Anthropic’s Claude 3, and Llama-3 are prompted using the MQM error typology. These models generate segment-wise outputs describing translation errors, scored by severity and DQF-MQM penalties. The study evaluates four language pairs: English-Spanish, English-Chinese, English-German, and English-Portuguese, using datasets from our 2024 State of MT Report across eight domains. LLM outputs are correlated with human judgments, ranking the models by alignment with human assessments for penalty score, issue presence, type, and severity. The resulting pipeline weights each model by output quality, highlighting LLMs’ potential to enhance MT review processes and improve translation quality.
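
A minimal sketch of the aggregation idea: each model’s segment-level MQM penalty is combined under weights reflecting its alignment with human judgments. The weights below are placeholders; the study derives them empirically.

    # Sketch of the multi-model aggregation: segment-level MQM penalties
    # combined under per-model weights. The weights are placeholders; the
    # study derives them from correlation with human judgments.
    MODEL_WEIGHTS = {"gpt-4o": 0.45, "claude-3": 0.35, "llama-3": 0.20}

    def combined_penalty(segment_penalties):
        """segment_penalties: dict mapping model name -> MQM penalty."""
        total = sum(MODEL_WEIGHTS[m] for m in segment_penalties)
        return sum(MODEL_WEIGHTS[m] * p
                   for m, p in segment_penalties.items()) / total

    # (0.45*2.0 + 0.35*3.0 + 0.20*1.0) / 1.0 = 2.15
    print(combined_penalty({"gpt-4o": 2.0, "claude-3": 3.0, "llama-3": 1.0}))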

Evaluating End-to-End Speech-to-Speech Translation for Dubbing: Challenges and New Metrics
Fred Bane

The advent of end-to-end speech-to-speech translation (S2ST) systems in recent years marks a significant advancement over traditional cascaded approaches. These novel systems offer a direct translation pathway from spoken input to spoken output without relying on intermediate text. However, evaluation methods for this task, such as ASR-BLEU, are often still compartmentalized and text-based. We suggest that the quality of the resulting speech must be measured too: naturalness, similarity of the target voice to the original, reflection of accents, and rhythm are all important. We argue that new evaluation metrics are needed in response to this watershed change. Our presentation approaches the topic through the lens of dubbing, with a particular focus on voice-over. We begin with a critical examination of existing metrics, then discuss key features of S2ST that they inadequately capture, and finally propose new directions for the evaluation of S2ST systems.

Enhancing Consistency Through Prompt-Tuning for Style Guide Adaptation
Ming Qian | Zidian Guo

This presentation explores the use of Prompt-Tuning (PT) to improve brand and language consistency in localization by teaching Large Language Models (LLMs) to develop and apply style guides from minimal examples. PT allows for the automatic enforcement of style guides for specific projects, potentially enhancing translation quality across varied tasks. Our approach involves defining key style guide components such as domain, audience, and formatting standards for acronyms, dates, and measurements, and creating prompts that instruct LLMs to extract and apply these standards in new translation tasks. We conducted extensive tests to evaluate the effectiveness of PT, documenting the process to ensure replicability. The expected results include improved consistency and translation performance, advancing the use of AI in localization and setting a foundation for future innovation in the field.
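
As a rough illustration of the approach, a sketch of how key style-guide components might be serialized into a translation prompt. The guide contents and prompt wording are invented examples, not the authors’ prompts.

    # Invented example of serializing style-guide components into a prompt;
    # the actual guides and prompt wording are not published here.
    STYLE_GUIDE = {
        "domain": "consumer electronics",
        "audience": "general public, non-technical",
        "acronyms": "spell out on first use, acronym in parentheses",
        "dates": "DD Month YYYY",
        "measurements": "metric units, one decimal place",
    }

    def style_guide_prompt(source_text, target_lang):
        rules = "\n".join(f"- {k}: {v}" for k, v in STYLE_GUIDE.items())
        return (f"Translate the text into {target_lang}, strictly following "
                f"this style guide:\n{rules}\n\nText:\n{source_text}")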

An Evaluation of English to Spanish Medical Translation by Large Language Models
Nicholas Riina | Likhitha Patlolla | Camilo Hernandez Joya | Roger Bautista | Melissa Olivar-Villanueva | Anish Kumar

Machine translation (MT) with Large Language Models (LLMs) holds promise as a clinical translation tool with more capabilities than a traditional MT model. This work compares the quality of English-to-Spanish translation by three LLMs, ChatGPT-3.5 Turbo, ChatGPT-4o, and Aguila, against Google Translate. The test set used in this study is drawn from MedlinePlus, a parallel dataset of educational health information in English and Spanish developed by the National Library of Medicine. ChatGPT-4o and Google Translate performed similarly in both automated scoring (BLEU, METEOR, and BERTScore) and human evaluation, with ChatGPT-3.5 Turbo not far behind. Aguila, the only LLM intended primarily for Spanish and Catalan use, surprisingly performed much worse than the other models. However, qualitative analysis of Aguila’s results revealed Spanish word choices that may reach a broader audience.

From “Comment allez-vous?” to “Comment ça va?”: Leveraging Large Language Models to Automate Formality Adaptation in Translation
Vera Senderowicz

The evolution of machine translation (MT) has brought significant advancements in data cleaning and post-editing methodologies, but many cases requiring semantic comprehension still necessitated human intervention, until the emergence of Large Language Models (LLMs). In our research, we explored an innovative application of Generative AI (Gen AI): adapting the target segments of bilingual content from a formal to an informal register in scenarios where the source language lacks explicit grammatical markers for formality and is thus grammatically bivalent in that sense. In this session, we will demonstrate how LLMs, enhanced by supplementary methodologies such as fine-tuning and combined with other, legacy language models, can perform this formality adaptation task efficiently. We aim to showcase best practices for leveraging Gen AI to adapt the register of bilingual content, highlighting the potential for cost reduction and quality enhancement in translation processes.

Academia & Business: How Quality Assurance can Merge Two Rivals
Patry Muñoz Andrés

As a general rule in many industries, but especially in ours, academia tends to go its own route, in many instances separating itself from the business environment where the outcomes of its research could be applied. This talk portrays the journey of quality assurance in translation, from the first logical steps within an LSP, ISO certifications and automated QA, to more sophisticated tools such as machine translation (MT) and Large Language Models (LLMs). It is a combined journey in which business and academia merge to achieve a common goal: quality. Rather than simply compiling research, this session aims to show how such research can be used by LSPs to achieve the highest possible quality in their translation services.

Language Technology for All: Industry Initiatives to Serve Low Resource Languages
Blaise Hylak

In an increasingly globalized world, language localization tools have become indispensable. However, there is a glaring disparity in the distribution of these resources: while English and other dominant languages benefit from advanced machine translation (MT) technologies and Large Language Models (LLMs), many languages remain marginalized. Fortunately, some initiatives are underway to address this concern. This research explores the development of language technology tools for low-resource languages. The study evaluates organizations’ efforts to develop language resource data and tools for low-resource languages with regard to MT and speech-to-speech translation (S2ST), and considers what the outlook may be for the future.

Impact of Syntactic Complexity on the Processes and Performance of Large Language Models-leveraged Post-editing
Longhui Zou | Michael Carl | Shaghayegh Momtaz | Mehdi Mirzapour

This research explores the interaction between human translators and Large Language Models (LLMs) during post-editing (PE). The study examines the impact of syntactic complexity on PE processes and performance, specifically when working with raw translation output generated by GPT-4. We selected four English source texts (STs) from previous American Translators Association (ATA) certification examinations, each comprising about 10 segments and roughly 250 words. GPT-4 was employed to translate the four STs from English into Simplified Chinese. The empirical experiment simulated an authentic PE work environment using the professional computer-assisted translation (CAT) tool Trados. The experiment involved 46 participants with different levels of translation expertise (30 student translators and 16 expert translators), who produced altogether 2,162 post-edited segments. We implemented five syntactic complexity metrics in the context of PE for quantitative analysis.
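
The five metrics are not listed in the abstract; as an illustration of the kind of measure involved, the sketch below computes one widely used syntactic complexity metric, mean dependency distance, with spaCy.

    # Mean dependency distance (MDD) over an English segment with spaCy
    # (model: python -m spacy download en_core_web_sm). One example of a
    # syntactic complexity metric; not necessarily one of the study's five.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def mean_dependency_distance(text):
        doc = nlp(text)
        dists = [abs(tok.i - tok.head.i) for tok in doc
                 if tok.dep_ != "ROOT" and not tok.is_punct]
        return sum(dists) / len(dists) if dists else 0.0

    print(mean_dependency_distance(
        "The contract that the parties signed last week remains binding."
    ))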

Labels on Translation Output: a triple win
Alan Melby

In the 2023 edition of the ASTM International translation standard (F2575), the labels BRT and UMT were standardized. The label BRT stands for ‘Bilingually Reviewed Translation, by a qualified language professional’. The label UMT covers everything else, from raw machine translation, to MT where only the target text is checked, to human translation that does not involve a qualified professional; thus, UMT can be expanded as ‘Unreviewed or Missing-qualifications Translation’. This presentation will argue that the use of the labels BRT and UMT is a triple win: the ‘consumers’ (end users) of a translation win because they gain useful information for risk analysis (harm from errors); MT developers win because they gain useful metadata when selecting training material; and professional translators win by increasing their visibility to the public. The presentation will give a history of the two labels and enlist the help of the entire AMTA community in promoting their use.