Maja Popović

Also published as: Maja Popovic

2025

GENDER1PERSON: Test Suite for Estimating Gender Bias of First-person Singular Forms
Maja Popović | Ekaterina Lapshinova-Koltunski
Proceedings of the Tenth Conference on Machine Translation

The gender1person test suite is designed to measure gender bias in translating singular first-person forms from English into two Slavic languages, Russian and Serbian. The test suite consists of 1,000 Amazon product reviews, uniformly distributed over 10 different product categories. Bias is measured through a gender score ranging from -100 (all reviews are feminine) to 100 (all reviews are masculine). The test suite shows that the majority of the systems participating in the WMT-2025 task for these two target languages prefer the masculine writer’s gender. There is no single system which is biased towards the feminine variant. Furthermore, for each language pair, there are seven systems that are considered balanced, having the gender scores between -10 and 10.Finally, the analysis of different products showed that the choice of the writer’s gender depends to a large extent on the product. Moreover, it is demonstrated that even the systems with overall balanced scores are actually biased, but in different ways for different product categories.

This paper presents the results of the General Machine Translation Task organized as part of the 2025 Conference on Machine Translation (WMT). Participants were invited to build systems for any of 30 language pairs. For half of these pairs, we conducted a human evaluation on test sets spanning four to five different domains.We evaluated 60 systems in total: 36 submitted by participants and 24 for which we collected translations from large language models (LLMs) and popular online translation providers.This year, we focused on creating challenging test sets by developing a difficulty sampling technique and using more complex source data. We evaluated system outputs with professional annotators using the Error Span Annotation (ESA) protocol, except for two language pairs, for which we used Multidimensional Quality Metrics (MQM) instead.We continued the trend of increasingly moving towards document-level translation, providing the source texts as whole documents containing multiple paragraphs.

pdf bib abs

Did I (she) or I (he) buy this? Or rather I (she/he)? Towards first-person gender neutral translation by LLMs
Maja Popović | Ekaterina Lapshinova-Koltunski | Anastasiia Göldner
Proceedings of the 3rd Workshop on Gender-Inclusive Translation Technologies (GITT 2025)

This paper presents an analysis of gender in first-person mentions translated from English into two Slavic languages with the help of three LLMs and two different prompts. We explore if LLMs are able to generate Amazon product reviews with gender neutral first person forms. Apart from the overall question about the ability to produce gender neutral translations, we look into the impact of a prompt with a specific instruction which is supposed to reduce the gender bias in LLMs output translations. Our results show that although we are able to achieve a reduction in gender bias, our specific prompt cause also a number of errors. Analysing those emerging problems qualitatively, we formulate suggestions that could be helpful for the development of better prompting strategies in the future work on gender bias reduction.

Machine Translation (MT) tools are widely used today, often in contexts where professional translators are not present. Despite progress in MT technology, a gap persists between system development and real-world usage, particularly for non-expert users who may struggle to assess translation reliability.This paper advocates for a human-centered approach to MT, emphasizing the alignment of system design with diverse communicative goals and contexts of use. We survey the literature in Translation Studies and Human-Computer Interaction to recontextualize MT evaluation and design to address the diverse real-world scenarios in which MT is used today.

2024

This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).

pdf bib abs

Effects of different types of noise in user-generated reviews on human and machine translations including ChatGPT
Maja Popovic | Ekaterina Lapshinova-Koltunski | Maarit Koponen
Proceedings of the Ninth Workshop on Noisy and User-generated Text (W-NUT 2024)

This paper investigates effects of noisy source texts (containing spelling and grammar errors, informal words or expressions, etc.) on human and machine translations, namely whether the noisy phenomena are kept in the translations, corrected, or caused errors. The analysed data consists of English user reviews of Amazon products translated into Croatian, Russian and Finnish by professional translators, translation students, machine translation (MT) systems, and ChatGPT language model. The results show that overall, ChatGPT and professional translators mostly correct/standardise those parts, while students are often keeping them. Furthermore, MT systems are most prone to errors while ChatGPT is more robust, but notably less robust than human translators. Finally, some of the phenomena are particularly challenging both for MT systems and for ChatGPT, especially spelling errors and informal constructions.

pdf bib abs

Gender and bias in Amazon review translations: by humans, MT systems and ChatGPT
Maja Popovic | Ekaterina Lapshinova-Koltunski
Proceedings of the 2nd International Workshop on Gender-Inclusive Translation Technologies

This paper presents an analysis of first-person gender in five different translation variants of Amazon product reviews:those produced by professional translators, by translation students, with different machine translation (MT) systems andwith ChatGPT. The analysis revealed that the majority of the reviews were translated into the masculine first-person gender, both by humans as well as by machines. Further inspection revealed that the choice of the gender in a translation is not related to the actual gender of the translator. Finally, the analysis of different products showed that there are certain bias tendencies, because the distribution of genders notably differ for different products.

pdf bib abs

High-quality Machine Translation (MT) evaluation relies heavily on human judgments.Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited especially for low-resource languages.On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable.In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM.We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.

2023

pdf bib abs

Medical Concept Mention Identification in Social Media Posts Using a Small Number of Sample References
Vasudevan Nedumpozhimana | Sneha Rautmare | Meegan Gower | Nishtha Jain | Maja Popović | Patricia Buffini | John Kelleher
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Identification of mentions of medical concepts in social media text can provide useful information for caseload prediction of diseases like Covid-19 and Measles. We propose a simple model for the automatic identification of the medical concept mentions in the social media text. We validate the effectiveness of the proposed model on Twitter, Reddit, and News/Media datasets.

This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (corresponding to 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).

pdf bib abs

Exploring Variation of Results from Different Experimental Conditions
Maja Popović | Mohammad Arvan | Natalie Parde | Anya Belz
Findings of the Association for Computational Linguistics: ACL 2023

It might reasonably be expected that running multiple experiments for the same task using the same data and model would yield very similar results. Recent research has, however, shown this not to be the case for many NLP experiments. In this paper, we report extensive coordinated work by two NLP groups to run the training and testing pipeline for three neural text simplification models under varying experimental conditions, including different random seeds, run-time environments, and dependency versions, yielding a large number of results for each of the three models using the same data and train/dev/test set splits. From one perspective, these results can be interpreted as shedding light on the reproducibility of evaluation results for the three NTS models, and we present an in-depth analysis of the variation observed for different combinations of experimental conditions. From another perspective, the results raise the question of whether the averaged score should be considered the ‘true’ result for each model.

pdf bib abs

Using MT for multilingual covid-19 case load prediction from social media texts
Maja Popovic | Vasudevan Nedumpozhimana | Meegan Gower | Sneha Rautmare | Nishtha Jain | John Kelleher
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

In the context of an epidemiological study involving multilingual social media, this paper reports on the ability of machine translation systems to preserve content relevant for a document classification task designed to determine whether the social media text is related to covid. The results indicate that machine translation does provide a feasible basis for scaling epidemiological social media surveillance to multiple languages. Moreover, a qualitative error analysis revealed that the majority of classification errors are not caused by MT errors.

pdf bib abs

Computational analysis of different translations: by professionals, students and machines
Maja Popovic | Ekaterina Lapshinova-Koltunski | Maarit Koponen
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

In this work, we analyse different translated texts in terms of various text features. We compare two types of human translations, professional and students’, and machine translation outputs in terms of lexical and grammatical variety, sentence length,as well as frequencies of different POS tags and POS-trigrams. Our experimentsare carried out on parallel translations into three languages, Croatian, Finnish andRussian, all originating from the same source English texts. Our results indicatethat machine translations are closest to the source text, followed by student translations. Also, student translations are similar both to professional as well as to MT, sometimes even more to MT. Furthermore, we identify sets of features which are convenient for distinguishing machine from human translations.

Maja Popović

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2009

2007

2006

2005

2004

Co-authors

Venues