Juan Antonio Pérez-Ortiz - ACL Anthology

Juan Antonio Pérez-Ortiz

Also published as: Juan Antonio Pérez Ortiz, Juan Antonio Perez-Ortiz

2025

Beyond the Mode: Sequence-Level Distillation of Multilingual Translation Models for Low-Resource Language Pairs
Aarón Galiano-Jiménez | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena
Findings of the Association for Computational Linguistics: NAACL 2025

This paper delves into sequence-level knowledge distillation (KD) of multilingual pre-trained translation models. We posit that, beyond the approximated mode obtained via beam search, the whole output distribution of the teacher contains valuable insights for students. We explore the potential of n-best lists from beam search to guide the student’s learning and then investigate alternative decoding methods to address observed issues like low variability and under-representation of infrequent tokens. Our research in data-limited scenarios reveals that although sampling methods can slightly compromise the translation quality of the teacher output compared to beam-search-based methods, they enrich the generated corpora with increased variability and lexical richness, ultimately enhancing student model performance and reducing the gender bias amplification commonly associated with KD.

ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
Lucile Favero | Daniel Frases | Juan Antonio Pérez-Ortiz | Tanja Käser | Nuria Oliver
Proceedings of the 12th Argument mining Workshop

The widespread adoption of chat interfaces based on Large Language Models (LLMs) raises concerns about promoting superficial learning and undermining the development of critical thinking skills. Instead of relying on LLMs purely for retrieving factual information, this work explores their potential to foster deeper reasoning by generating critical questions that challenge unsupported or vague claims in debate interventions. This study is part of a shared task of the 12th Workshop on Argument Mining, co-located with ACL 2025, focused on automatic critical question generation. We propose a two-step framework involving two small-scale open source language models: a Questioner that generates multiple candidate questions and a Judge that selects the most relevant ones. Our system ranked first in the shared task competition, demonstrating the potential of the proposed LLM-based approach to encourage critical engagement with argumentative texts.

FLORES+ Mayas: Generating Textual Resources to Foster the Development of Language Technologies for Mayan Languages
Andrés Lou | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena
Proceedings of Machine Translation Summit XX: Volume 2

A significant percentage of the population of Guatemala and Mexico belongs to various Mayan indigenous communities, for whom language barriers lead to social, economic, and digital exclusion. The Mayan languages spoken by these communities remain severely underrepresented in terms of digital resources, which prevents them from leveraging the latest advances in artificial intelligence. This project addresses that problem by means of: 1) the digitisation and release of multiple printed linguistic resources; 2) the development of a high-quality parallel machine translation (MT) evaluation corpus for six Mayan languages. In doing so, we are paving the way for the development of MT systems that will facilitate Mayan speakers’ access to essential services such as healthcare or legal aid. The resources are produced with the essential participation of indigenous communities, whereby native speakers provide the necessary translation services, QA, and linguistic expertise. The project is funded by the Google Academic Research Awards and carried out in collaboration with the Proyecto Lingüístico Francisco Marroquín Foundation in Guatemala.

DeMINT: Automated Language Debriefing for English Learners via AI Chatbot Analysis of Meeting Transcripts
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz
Proceedings of Machine Translation Summit XX: Volume 2

The objective of the DeMINT project is to develop a conversational tutoring system aimed at enhancing non-native English speakers’ language skills through post-meeting analysis of the transcriptions of video conferences in which they have participated. This paper describes the model developed and the results obtained through a human evaluation conducted with learners of English as a second language.

2024

Lightweight neural translation technologies for low-resource languages
Felipe Sánchez-Martínez | Juan Antonio Pérez-Ortiz | Víctor Sánchez-Cartagena | Andrés Lou | Cristian García-Romero | Aarón Galiano-Jiménez | Miquel Esplà-Gomis
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

The LiLowLa (“Lightweight neural translation technologies for low-resource languages”) project aims to enhance machine translation (MT) and translation memory (TM) technologies, particularly for low-resource language pairs, where adequate linguistic resources are scarce. The project started in September 2022 and will run till August 2025.

Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian
Juan Antonio Perez-Ortiz | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Miquel Esplà-Gomis | Aaron Galiano Jimenez | Antoni Oliver | Claudi Aventín-Boya | Alejandro Pardos | Cristina Valdés | Jusèp Loís Sans Socasau | Juan Pablo Martínez
Proceedings of the Ninth Conference on Machine Translation

In this paper, we describe the process of creating the FLORES+ datasets for several Romance languages spoken in Spain, namely Aragonese, Aranese, Asturian, and Valencian. The Aragonese and Aranese datasets are entirely new additions to the FLORES+ multilingual benchmark. An initial version of the Asturian dataset was already available in FLORES+, and our work focused on a thorough revision. Similarly, FLORES+ included a Catalan dataset, which we adapted to the Valencian variety spoken in the Valencian Community. The development of the Aragonese, Aranese, and revised Asturian FLORES+ datasets was undertaken as part of a WMT24 shared task on translation into low-resource languages of Spain.

A Conversational Intelligent Tutoring System for Improving English Proficiency of Non-Native Speakers via Debriefing of Online Meeting Transcriptions
Juan Antonio Pérez-Ortiz | Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Felipe Sánchez-Martínez | Roman Chernysh | Gabriel Mora-Rodríguez | Lev Berezhnoy
Proceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning

Universitat d’Alacant’s Submission to the WMT 2024 Shared Task on Translation into Low-Resource Languages of Spain
Aaron Galiano Jimenez | Víctor M. Sánchez-Cartagena | Juan Antonio Perez-Ortiz | Felipe Sánchez-Martínez
Proceedings of the Ninth Conference on Machine Translation

This paper describes the submissions of the Transducens group of the Universitat d’Alacant to the WMT 2024 Shared Task on Translation into Low-Resource Languages of Spain; in particular, the task focuses on the translation from Spanish into Aragonese, Aranese and Asturian. Our submissions use parallel and monolingual data to fine-tune the NLLB-1.3B model and to investigate the effectiveness of synthetic corpora and transfer-learning between related languages such as Catalan, Galician and Valencian. We also present a many-to-many multilingual neural machine translation model focused on the Romance languages of Spain.

Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars
Andrés Lou | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Víctor Sánchez-Cartagena
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The Mayan languages comprise a language family with an ancient history, millions of speakers, and immense cultural value, that, nevertheless, remains severely underrepresented in terms of resources and global exposure. In this paper we develop, curate, and publicly release a set of corpora in several Mayan languages spoken in Guatemala and Southern Mexico, which we call MayanV. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language. As such, and according to our dialectometric analysis, they differ in register from most other available resources. Additionally, we present neural machine translation models, trained on as many resources and Mayan languages as possible, and evaluated exclusively on our datasets. We observe lexical divergences between the dialects of Spanish in our resources and the more widespread written standard of Spanish, and that resources other than the ones we present do not seem to improve translation performance, indicating that many such resources may not accurately capture common, real-life language usage. The MayanV dataset is available at https://github.com/transducens/mayanv.

Findings of the WMT 2024 Shared Task Translation into Low-Resource Languages of Spain: Blending Rule-Based and Neural Systems
Felipe Sánchez-Martínez | Juan Antonio Perez-Ortiz | Aaron Galiano Jimenez | Antoni Oliver
Proceedings of the Ninth Conference on Machine Translation

This paper presents the results of the Ninth Conference on Machine Translation (WMT24) Shared Task “Translation into Low-Resource Languages of Spain”. The task focused on the development of machine translation systems for three language pairs: Spanish-Aragonese, Spanish-Aranese, and Spanish-Asturian. 17 teams participated in the shared task with a total of 87 submissions. The baseline system for all language pairs was Apertium, a rule-based machine translation system that still performs competitively well, even in an era dominated by more advanced non-symbolic approaches. We report and discuss the results of the submitted systems, highlighting the strengths of both neural and rule-based approaches.

2023

Exploiting large pre-trained models for low-resource neural machine translation
Aarón Galiano-Jiménez | Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Pre-trained models have drastically changed the field of natural language processing by providing a way to leverage large-scale language representations to various tasks. Some pre-trained models offer general-purpose representations, while others are specialized in particular tasks, like neural machine translation (NMT). Multilingual NMT-targeted systems are often fine-tuned for specific language pairs, but there is a lack of evidence-based best-practice recommendations to guide this process. Moreover, the trend towards even larger pre-trained models has made it challenging to deploy them in the computationally restrictive environments typically found in developing regions where low-resource languages are usually spoken. We propose a pipeline to tune the mBART50 pre-trained model to 8 diverse low-resource language pairs, and then distil the resulting system to obtain lightweight and more sustainable models. Our pipeline conveniently exploits back-translation, synthetic corpus filtering, and knowledge distillation to deliver efficient, yet powerful bilingual translation models 13 times smaller than the original pre-trained ones, but with close performance in terms of BLEU.

2022

MultitraiNMT Erasmus+ project: Machine Translation Training for multilingual citizens (multitrainmt.eu)
Mikel L. Forcada | Pilar Sánchez-Gijón | Dorothy Kenny | Felipe Sánchez-Martínez | Juan Antonio Pérez Ortiz | Riccardo Superbo | Gema Ramírez Sánchez | Olga Torres-Hostench | Caroline Rossi
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

The MultitraiNMT Erasmus+ project has developed an open innovative syllabus in machine translation, focusing on neural machine translation (NMT) and targeting both language learners and translators. The training materials include an open access coursebook with more than 250 activities and a pedagogical NMT interface called MutNMT that allows users to learn how neural machine translation works. These materials will allow students to develop the technical and ethical skills and competences required to become informed, critical users of machine translation in their own language learning and translation practice. The project started in July 2019 and it will end in July 2022.

Cross-lingual neural fuzzy matching for exploiting target-language monolingual corpora in computer-aided translation
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Computer-aided translation (CAT) tools based on translation memories (TMs) play a prominent role in the translation workflow of professional translators. However, the reduced availability of in-domain TMs, as compared to in-domain monolingual corpora, limits their adoption for a number of translation tasks. In this paper, we introduce a novel neural approach aimed at overcoming this limitation by exploiting not only TMs, but also in-domain target-language (TL) monolingual corpora, and still enabling a similar functionality to that offered by conventional TM-based CAT tools. Our approach relies on cross-lingual sentence embeddings to retrieve translation proposals from TL monolingual corpora, and on a neural model to estimate their post-editing effort. The paper presents an automatic evaluation of these techniques on four language pairs that shows that our approach can successfully exploit monolingual texts in a TM-based CAT environment, increasing the number of useful translation proposals, and that our neural model for estimating the post-editing effort enables the combination of translation proposals obtained from monolingual corpora and from TMs in the usual way. A human evaluation performed on a single language pair confirms the results of the automatic evaluation and seems to indicate that the translation proposals retrieved with our approach are more useful than what the automatic evaluation shows.

2021

Rethinking Data Augmentation for Low-Resource Neural Machine Translation: A Multi-Task Learning Approach
Víctor M. Sánchez-Cartagena | Miquel Esplà-Gomis | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In the context of neural machine translation, data augmentation (DA) techniques may be used for generating additional training samples when the available parallel data are scarce. Many DA approaches aim at expanding the support of the empirical data distribution by generating new sentence pairs that contain infrequent words, thus making it closer to the true data distribution of parallel sentences. In this paper, we propose to follow a completely different approach and present a multi-task DA approach in which we generate new sentence pairs with transformations, such as reversing the order of the target sentence, which produce unfluent target sentences. During training, these augmented sentences are used as auxiliary tasks in a multi-task framework with the aim of providing new contexts where the target prefix is not informative enough to predict the next word. This strengthens the encoder and forces the decoder to pay more attention to the source representations of the encoder. Experiments carried out on six low-resource translation tasks show consistent improvements over the baseline and over DA methods aiming at extending the support of the empirical data distribution. The systems trained with our approach rely more on the source tokens, are more robust against domain shift and suffer from fewer hallucinations.

In the media industry, the focus of global reporting can shift overnight. There is a compelling need to be able to develop new machine translation systems in a short period of time, in order to more efficiently cover quickly developing stories. As part of the EU project GoURMET, which focusses on low-resource machine translation, our media partners selected a surprise language for which a machine translation system had to be built and evaluated in two months (February and March 2021). The language selected was Pashto, an Indo-Iranian language spoken in Afghanistan, Pakistan and India. In this period we completed the full pipeline of development of a neural machine translation system: data crawling, cleaning, aligning, creating test sets, developing and testing models, and delivering them to the user partners. In this paper we describe rapid data creation and experiments with transfer learning and pretraining for this low-resource language pair. We find that starting from an existing large model pre-trained on 50 languages leads to far better BLEU scores than pretraining on one high-resource language pair with a smaller model. We also present human evaluation of our systems, which indicates that the resulting systems perform better than a freely available commercial system when translating from English into Pashto, and similarly when translating from Pashto into English.

MultiTraiNMT: Training Materials to Approach Neural Machine Translation from Scratch
Gema Ramírez-Sánchez | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Caroline Rossi | Dorothy Kenny | Riccardo Superbo | Pilar Sánchez-Gijón | Olga Torres-Hostench
Proceedings of the Translation and Interpreting Technology Online Conference

The MultiTraiNMT Erasmus+ project aims at developing an open innovative syllabus in neural machine translation (NMT) for language learners and translators as multilingual citizens. Machine translation is seen as a resource that can support citizens in their attempt to acquire and develop language skills if they are trained in an informed and critical way. Machine translation could thus help tackle the mismatch between the desired EU aim of having multilingual citizens who speak at least two foreign languages and the current situation in which citizens generally fall far short of this objective. The training materials consist of an open-access coursebook, an open-source NMT web application called MutNMT for training purposes, and corresponding activities.

2020

An English-Swahili parallel corpus and its use for neural machine translation in the news domain
Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz | Mikel L. Forcada | Miquel Esplà-Gomis | Andrew Secker | Susie Coleman | Julie Wall
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper describes our approach to create a neural machine translation system to translate between English and Swahili (both directions) in the news domain, as well as the process we followed to crawl the necessary parallel corpora from the Internet. We report the results of a pilot human evaluation performed by the news media organisations participating in the H2020 EU-funded project GoURMET.

Understanding the effects of word-level linguistic annotations in under-resourced neural machine translation
Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez
Proceedings of the 28th International Conference on Computational Linguistics

This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation, for which there is incomplete evidence in the literature. The study covers eight language pairs, different training corpus sizes, two architectures, and three types of annotation: dummy tags (with no linguistic information at all), part-of-speech tags, and morpho-syntactic description tags, which consist of part of speech and morphological features. These linguistic annotations are interleaved in the input or output streams as a single tag placed before each word. In order to measure the performance under each scenario, we use automatic evaluation metrics and perform automatic error classification. Our experiments show that, in general, source-language annotations are helpful and morpho-syntactic descriptions outperform part of speech for some language pairs. On the contrary, when words are annotated in the target language, part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics, even though the use of morpho-syntactic description tags improves the grammaticality of the output. We provide a detailed analysis of the reasons behind this result.

2019

The Universitat d’Alacant Submissions to the English-to-Kazakh News Translation Task at WMT 2019
Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the two submissions of Universitat d’Alacant to the English-to-Kazakh news translation task at WMT 2019. Our submissions take advantage of monolingual data and parallel data from other language pairs by means of iterative backtranslation, pivot backtranslation and transfer learning. They also use linguistic information in two ways: morphological segmentation of Kazakh text, and integration of the output of a rule-based machine translation system. Our systems were ranked second in terms of chrF++ despite being built from an ensemble of only 2 independent training runs.

Global Under-Resourced Media Translation (GoURMET)
Alexandra Birch | Barry Haddow | Ivan Tito | Antonio Valerio Miceli Barone | Rachel Bawden | Felipe Sánchez-Martínez | Mikel L. Forcada | Miquel Esplà-Gomis | Víctor Sánchez-Cartagena | Juan Antonio Pérez-Ortiz | Wilker Aziz | Andrew Secker | Peggy van der Kreeft
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

Proceedings of the 21st Annual Conference of the European Association for Machine Translation
Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Miquel Esplà-Gomis | Maja Popović | Celia Rico | André Martins | Joachim Van den Bogaert | Mikel L. Forcada
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

2016

Ranking suggestions for black-box interactive translation prediction systems with multilayer perceptrons
Daniel Torregrosa | Juan Antonio Pérez-Ortiz | Mikel Forcada
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

The objective of interactive translation prediction (ITP), a paradigm of computer-aided translation, is to assist professional translators by offering context-based computer-generated suggestions as they type. While most state-of-the-art ITP systems are tightly coupled to a machine translation (MT) system (often created ad-hoc for this purpose), our proposal follows a resource-agnostic approach, one that does not need access to the inner workings of the bilingual resources (MT systems or any other bilingual resources) used to generate the suggestions, thus allowing new resources to be included almost seamlessly. As we do not expect the user to tolerate more than a few proposals each time, the set of potential suggestions needs to be filtered and ranked; the resource-agnostic approach has been evaluated before using a set of intuitive length-based and position-based heuristics designed to determine which suggestions to show, achieving promising results. In this paper, we propose a more principled suggestion ranking approach using a regressor (a multilayer perceptron) that achieves significantly better results.

Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution
Mikel L. Forcada | Miquel Esplà-Gomis | Juan Antonio Pérez-Ortiz
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

2015

Evaluating machine translation for assimilation via a gap-filling task
Ekaterina Ageeva | Francis M. Tyers | Mikel L. Forcada | Juan Antonio Pérez-Ortiz
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

Evaluating machine translation for assimilation via a gap-filling task
Ekaterina Ageeva | Mikel L. Forcada | Francis M. Tyers | Juan Antonio Pérez-Ortiz
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

2014

The UA-Prompsit hybrid machine translation system for the 2014 Workshop on Statistical Machine Translation
Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez
Proceedings of the Ninth Workshop on Statistical Machine Translation

An efficient method to assist non-expert users in extending dictionaries by assigning stems and inflectional paradigms to unknown words
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Felipe Sánchez-Martínez | Rafael C. Carrasco | Mikel L. Forcada | Juan Antonio Pérez-Ortiz
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

Black-box integration of heterogeneous bilingual resources into an interactive translation system
Juan Antonio Pérez-Ortiz | Daniel Torregrosa | Mikel Forcada
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation

2012

Source-Language Dictionaries Help Non-Expert Users to Enlarge Target-Language Dictionaries for Machine Translation
Víctor M. Sánchez-Cartagena | Miquel Esplà-Gomis | Juan Antonio Pérez-Ortiz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, previous work on the enlargement of monolingual dictionaries of rule-based machine translation systems by non-expert users is extended to tackle the complete task of adding both source-language and target-language words to the monolingual dictionaries and the bilingual dictionary. In the original method, users validate whether some suffix variations of the word to be inserted are correct in order to find the most appropriate inflection paradigm. This method is now improved by taking advantage of the strong correlation detected between paradigms in both languages to reduce the search space of the target-language paradigm once the source-language paradigm is known. Results show that, when the source-language word has already been inserted, the system is able to predict the right target-language paradigm more accurately, and the number of queries posed to users is significantly reduced. Experiments also show that, when the source language and the target language are not closely related, it is only the source-language part-of-speech category, and not the rest of the information provided by the source-language paradigm, that helps to correctly classify the target-language word.

2011

Integrating shallow-transfer rules into phrase-based statistical machine translation
Víctor M. Sánchez-Cartagena | Felipe Sánchez-Martínez | Juan Antonio Pérez-Ortiz
Proceedings of Machine Translation Summit XIII: Papers

Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation
Felipe Sánchez-Martinez | Juan Antonio Pérez-Ortiz
Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation

Multimodal Building of Monolingual Dictionaries for Machine Translation by Non-Expert Users
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz
Proceedings of Machine Translation Summit XIII: Papers

Enriching a statistical machine translation system trained on small parallel corpora with rule-based bilingual phrases
Víctor M. Sánchez-Cartagena | Felipe Sánchez-Martínez | Juan Antonio Pérez-Ortiz
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

The Universitat d’Alacant hybrid machine translation system for WMT 2011
Víctor M. Sánchez-Cartagena | Felipe Sánchez-Martínez | Juan Antonio Pérez-Ortiz
Proceedings of the Sixth Workshop on Statistical Machine Translation

Enlarging Monolingual Dictionaries for Machine Translation with Active Learning and Non-Expert Users
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2009

An open-source highly scalable web service architecture for the Apertium machine translation engine
Victor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

Some machine translation services like Google Ajax Language API have become very popular as they make the collaboratively created contents of the web 2.0 available to speakers of many languages. One of the keys to their success is the combination of a clear and easy-to-use application programming interface (API) with a scalable and reliable service. This paper describes a highly scalable implementation of an Apertium-based translation web service that aims to make contents available to speakers of lesser-resourced languages. The API of this service is compatible with Google’s, and the scalability of the system is achieved by a new architecture that allows servers to be added or removed at any time; to that end, an application placement algorithm is designed that decides which language pairs should be translated on which servers. Our experiments show how the resulting architecture improves the translation rate in comparison to existing Apertium-based servers.

Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation
Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martinez | Francis M. Tyers
Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation

2005

An open-source shallow-transfer machine translation engine for the Romance languages of Spain
Antonio M. Corbi-Bellot | Mikel L. Forcada | Sergio Ortíz-Rojas | Juan Antonio Pérez-Ortiz | Gema Ramírez-Sánchez | Felipe Sánchez-Martínez | Iñaki Alegria | Aingeru Mayor | Kepa Sarasola
Proceedings of the 10th EAMT Conference: Practical applications of machine translation

An Open-Source Shallow-Transfer Machine Translation Toolbox: Consequences of Its Release and Availability
Carme Armentano-Oller | Antonio M. Corbí-Bellot | Mikel L. Forcada | Mireia Ginestí-Rosell | Boyan Bonev | Sergio Ortiz-Rojas | Juan Antonio Pérez-Ortiz | Gema Ramírez-Sánchez | Felipe Sánchez-Martínez
Workshop on open-source machine translation

By the time Machine Translation Summit X is held in September 2005, our group will have released an open-source machine translation toolbox as part of a large government-funded project involving four universities and three linguistic technology companies from Spain. The machine translation toolbox, which will most likely be released under a GPL-like license, includes (a) the open-source engine itself, a modular shallow-transfer machine translation engine suitable for related languages and largely based upon that of systems we have already developed, such as interNOSTRUM for Spanish–Catalan and Traductor Universia for Spanish–Portuguese, (b) extensive documentation (including document type declarations) specifying the XML format of all linguistic (dictionaries, rules) and document format management files, (c) compilers converting these data into the high-speed (tens of thousands of words a second) format used by the engine, and (d) pilot linguistic data for Spanish–Catalan and Spanish–Galician and format management specifications for the HTML, RTF and plain text formats. After briefly describing this toolbox, this paper explores possible consequences of the availability of this architecture, including the community-driven development of machine translation systems for languages lacking this kind of linguistic technology.

2004

Cooperative unsupervised training of the part-of-speech taggers in a bidirectional machine translation system
Felipe Sánchez-Martínez | Juan Antonio Pérez-Ortiz | Mikel L. Forcada
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

2001

Discovering machine translation strategies beyond word-for-word translation: a laboratory assignment
Juan Antonio Pérez-Ortiz | Mikel L. Forcada
Workshop on Teaching Machine Translation

It is a common misconception that machine translation programs translate word-for-word, but real systems follow strategies which are much more complex. This paper proposes a laboratory assignment to study the way in which some commercial machine translation programs translate whole sentences and how the translation differs from a word-for-word translation. Students are expected to infer some of these extra strategies by observing the outcome of real systems when translating a set of sentences designed on purpose. The assignment also makes students aware of the difficulty of constructing such programs while bringing some technological light into the apparent “magic” of machine translation.

Co-authors

Gema Ramírez-Sánchez 4

Víctor Sánchez-Cartagena 4

Francis Tyers 3

Ekaterina Ageeva 2

Alexandra Birch 2

Antonio M. Corbí-Bellot 2

Dorothy Kenny 2

Antonio Valerio Miceli-Barone 2

Antoni Oliver 2

Sergio Ortiz Rojas 2

Caroline Rossi 2

Andrew Secker 2

Riccardo Superbo 2

Pilar Sánchez-Gijón 2

Daniel Torregrosa 2

Olga Torres-Hostench 2

Peggy van der Kreeft 2

Iñaki Alegría 1

Carme Armentano-Oller 1

Claudi Aventín-Boya 1

Rachel Bawden 1

Lev Berezhnoy 1

Rafael C. Carrasco 1

Roman Chernysh 1

Susie Coleman 1

Lucile Favero 1

Daniel Frases 1

Cristian García-Romero 1

Mireia Ginestí-Rosell 1

Jindřich Helcl 1

Kay Macquarrie 1

André F. T. Martins 1

Juan Pablo Martínez 1

Aingeru Mayor 1

Gabriel Mora-Rodríguez 1

Nuria M Oliver 1

Alejandro Pardos 1

Maja Popović 1

Jusèp Loís Sans Socasau 1

Kepa Sarasola 1

Sevi Sariisik 1

Víctor M. Sánchez-Cartegna 1

Cristina Valdés 1

Joachim Van Den Bogaert 1

Jonas Waldendorf 1
