Miquel Esplà-Gomis

Also published as: Miquel Esplà

2025

pdf bib abs

Quality Beyond A Glance: Revealing Large Quality Differences Between Web-Crawled Parallel Corpora
Rik van Noord | Miquel Esplà-Gomis | Malina Chichirau | Gema Ramírez-Sánchez | Antonio Toral
Proceedings of the 31st International Conference on Computational Linguistics

Parallel corpora play a vital role in advanced multilingual natural language processing tasks, notably in machine translation (MT). The recent emergence of numerous large parallel corpora, often extracted from multilingual documents on the Internet, has expanded the available resources. Nevertheless, the quality of these corpora remains largely unexplored, while there are large differences in how the corpora are constructed. Moreover, how the potential differences affect the performance of neural MT (NMT) systems has also received limited attention. This study addresses this gap by manually and automatically evaluating four well-known publicly available parallel corpora across eleven language pairs. Our findings are quite concerning: all corpora contain a substantial amount of noisy sentence pairs, with CCMatrix and CCAligned having well below of 50% reasonably clean pairs. MaCoCu and ParaCrawl generally have higher quality texts, though around a third of the texts still have clear issues. While corpus size impacts NMT models’ performance, our study highlights the critical role of quality: higher-quality corpora consistently yield better-performing NMT models when controlling for size.

pdf bib abs

DeMINT: Automated Language Debriefing for English Learners via AI Chatbot Analysis of Meeting Transcripts
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz
Proceedings of Machine Translation Summit XX: Volume 2

The objective of the DeMINT project is to develop a conversational tutoring system aimed at enhancing non-native English speakers’ language skills through post-meeting analysis of the transcriptions of video conferences in which they have participated. This paper describes the model developed and the results obtained through a human evaluation conducted with learners of English as a second language.

pdf bib abs

FLORES+ Mayas: Generating Textual Resources to Foster the Development of Language Technologies for Mayan Languages
Andrés Lou | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena
Proceedings of Machine Translation Summit XX: Volume 2

A significant percentage of the population of Guatemala and Mexico belongs to various Mayan indigenous communities, for whom language barriers lead to social, economic, and digital exclusion. The Mayan languages spoken by these communities remain severely underrepresented in terms of digital resources, which prevents them from leveraging the latest advances in artificial intelligence. This project addresses that problem by means of: 1) the digitisation and release of multiple printed linguistic resources; 2) the development of a high-quality parallel machine translation (MT) evaluation corpus for six Mayan languages. In doing so, we are paving the way for the development of MT systems that will facilitate the access for Mayan speakers to essential services such as healthcare or legal aid. The resources are produced with the essential participation of indigenous communities, whereby native speakers provide the necessary translation services, QA, and linguistic expertise. The project is funded by the Google Academic Research Awards and carried out in collaboration with the Proyecto Lingüístico Francisco Marroquín Foundation in Guatemala.

Miquel Esplà-Gomis

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2009

Co-authors

Venues