Miquel Esplà-Gomis

Also published as: Miquel Esplà


2021

pdf bib
Surprise Language Challenge: Developing a Neural Machine Translation System between Pashto and English in Two Months
Alexandra Birch | Barry Haddow | Antonio Valerio Miceli Barone | Jindrich Helcl | Jonas Waldendorf | Felipe Sánchez Martínez | Mikel Forcada | Víctor Sánchez Cartagena | Juan Antonio Pérez-Ortiz | Miquel Esplà-Gomis | Wilker Aziz | Lina Murady | Sevi Sariisik | Peggy van der Kreeft | Kay Macquarrie
Proceedings of Machine Translation Summit XVIII: Research Track

In the media industry and the focus of global reporting can shift overnight. There is a compelling need to be able to develop new machine translation systems in a short period of time and in order to more efficiently cover quickly developing stories. As part of the EU project GoURMET and which focusses on low-resource machine translation and our media partners selected a surprise language for which a machine translation system had to be built and evaluated in two months(February and March 2021). The language selected was Pashto and an Indo-Iranian language spoken in Afghanistan and Pakistan and India. In this period we completed the full pipeline of development of a neural machine translation system: data crawling and cleaning and aligning and creating test sets and developing and testing models and and delivering them to the user partners. In this paperwe describe rapid data creation and experiments with transfer learning and pretraining for this low-resource language pair. We find that starting from an existing large model pre-trained on 50languages leads to far better BLEU scores than pretraining on one high-resource language pair with a smaller model. We also present human evaluation of our systems and which indicates that the resulting systems perform better than a freely available commercial system when translating from English into Pashto direction and and similarly when translating from Pashto into English.

2020

pdf bib
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

pdf bib
An English-Swahili parallel corpus and its use for neural machine translation in the news domain
Felipe Sánchez-Martínez | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz | Mikel L. Forcada | Miquel Esplà-Gomis | Andrew Secker | Susie Coleman | Julie Wall
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

This paper describes our approach to create a neural machine translation system to translate between English and Swahili (both directions) in the news domain, as well as the process we followed to crawl the necessary parallel corpora from the Internet. We report the results of a pilot human evaluation performed by the news media organisations participating in the H2020 EU-funded project GoURMET.

pdf bib
Bicleaner at WMT 2020: Universitat d’Alacant-Prompsit’s submission to the parallel corpus filtering shared task
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Jaume Zaragoza-Bernabeu | Felipe Sánchez-Martínez
Proceedings of the Fifth Conference on Machine Translation

This paper describes the joint submission of Universitat d’Alacant and Prompsit Language Engineering to the WMT 2020 shared task on parallel corpus filtering. Our submission, based on the free/open-source tool Bicleaner, enhances it with Extremely Randomised Trees and lexical similarity features that account for the frequency of the words in the parallel sentences to determine if two sentences are parallel. To train this classifier we used the clean corpora provided for the task and synthetic noisy parallel sentences. In addition we re-score the output of Bicleaner using character-level language models and n-gram saturation.

2019

pdf bib
ParaCrawl: Web-scale parallel corpora for the languages of the EU
Miquel Esplà | Mikel Forcada | Gema Ramírez-Sánchez | Hieu Hoang
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

pdf bib
Global Under-Resourced Media Translation (GoURMET)
Alexandra Birch | Barry Haddow | Ivan Tito | Antonio Valerio Miceli Barone | Rachel Bawden | Felipe Sánchez-Martínez | Mikel L. Forcada | Miquel Esplà-Gomis | Víctor Sánchez-Cartagena | Juan Antonio Pérez-Ortiz | Wilker Aziz | Andrew Secker | Peggy van der Kreeft
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

2018

pdf bib
UAlacant machine translation quality estimation at WMT 2018: a simple approach using phrase tables and feed-forward neural networks
Felipe Sánchez-Martínez | Miquel Esplà-Gomis | Mikel L. Forcada
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We describe the Universitat d’Alacant submissions to the word- and sentence-level machine translation (MT) quality estimation (QE) shared task at WMT 2018. Our approach to word-level MT QE builds on previous work to mark the words in the machine-translated sentence as OK or BAD, and is extended to determine if a word or sequence of words need to be inserted in the gap after each word. Our sentence-level submission simply uses the edit operations predicted by the word-level approach to approximate TER. The method presented ranked first in the sub-task of identifying insertions in gaps for three out of the six datasets, and second in the rest of them.

2016

pdf bib
Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair
Nikola Ljubešić | Miquel Esplà-Gomis | Antonio Toral | Sergio Ortiz Rojas | Filip Klubička
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain “.hr” and the Slovene top-level domain “.si”, and extrinsically on the English-Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English-Croatian, English-Finnish, English-Serbian and English-Slovene language pairs.

pdf bib
Bitextor’s participation in WMT’16: shared task on document alignment
Miquel Esplà-Gomis | Mikel Forcada | Sergio Ortiz-Rojas | Jorge Ferrández-Tordera
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
UAlacant word-level and phrase-level machine translation quality estimation systems at WMT 2016
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Mikel Forcada
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Stand-off Annotation of Web Content as a Legally Safer Alternative to Crawling for Distribution
Mikel L. Forcada | Miquel Esplà-Gomis | Juan Antonio Pérez-Ortiz
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

2015

pdf bib
Abu-MaTran at WMT 2015 Translation Task: Morphological Segmentation and Web Crawling
Raphael Rubino | Tommi Pirinen | Miquel Esplà-Gomis | Nikola Ljubešić | Sergio Ortiz-Rojas | Vassilis Papavassiliou | Prokopis Prokopidis | Antonio Toral
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
UAlacant word-level machine translation quality estimation system at WMT 2015
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Mikel Forcada
Proceedings of the Tenth Workshop on Statistical Machine Translation

pdf bib
Using on-line available sources of bilingual information for word-level machine translation quality estimation
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Abu-MaTran: Automatic building of Machine Translation
Antonio Toral | Tommi A. Pirinen | Andy Way | Gema Ramírez-Sánchez | Sergio Ortiz Rojas | Raphael Rubino | Miquel Esplà | Mikel L. Forcada | Vassilis Papavassiliou | Prokopis Prokopidis | Nikola Ljubešić
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Using on-line available sources of bilingual information for word-level machine translation quality estimation
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Abu-MaTran: Automatic building of Machine Translation
Antonio Toral | Tommi A Pirinen | Andy Way | Gema Ramírez-Sánchez | Sergio Ortiz Rojas | Raphael Rubino | Miquel Esplà | Mikel Forcada | Vassilis Papavassiliou | Prokopis Prokopidis | Nikola Ljubešić
Proceedings of the 18th Annual Conference of the European Association for Machine Translation

pdf bib
Predicting Inflectional Paradigms and Lemmata of Unknown Words for Semi-automatic Expansion of Morphological Lexicons
Nikola Ljubešić | Miquel Esplà-Gomis | Filip Klubička | Nives Mikelić Preradović
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

pdf bib
Comparing two acquisition systems for automatically building an English—Croatian parallel corpus from multilingual websites
Miquel Esplà-Gomis | Filip Klubička | Nikola Ljubešić | Sergio Ortiz-Rojas | Vassilis Papavassiliou | Prokopis Prokopidis
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English―Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manually examined and the success rate was computed on the collection of pairs of documents detected by each setting. We compare the performance of the settings and the amount of different corpora detected by each setting. In addition, we describe the resource obtained, both by the settings and through the human evaluation, which has been released as a high-quality parallel corpus.

pdf bib
An efficient method to assist non-expert users in extending dictionaries by assigning stems and inflectional paradigms to unknknown words
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartegna | Felipe Sánchez-Martínez | Rafael C. Carrasco | Mikel L. Forcada | Juan Antonio Pérez-Ortiz
Proceedings of the 17th Annual conference of the European Association for Machine Translation

pdf bib
Extrinsic evaluation of web-crawlers in machine translation: a study on Croatian-English for the tourism domain
Antonio Toral | Raphael Rubino | Miquel Esplà-Gomis | Tommi Pirinen | Andy Way | Gema Ramírez-Sánchez
Proceedings of the 17th Annual conference of the European Association for Machine Translation

2013

pdf bib
Exploiting Qualitative Information from Automatic Word Alignment for Cross-lingual NLP Tasks
José G.C. de Souza | Miquel Esplà-Gomis | Marco Turchi | Matteo Negri
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
UAlacant: Using Online Machine Translation for Cross-Lingual Textual Entailment
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Mikel L. Forcada
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Source-Language Dictionaries Help Non-Expert Users to Enlarge Target-Language Dictionaries for Machine Translation
Víctor M. Sánchez-Cartagena | Miquel Esplà-Gomis | Juan Antonio Pérez-Ortiz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, a previous work on the enlargement of monolingual dictionaries of rule-based machine translation systems by non-expert users is extended to tackle the complete task of adding both source-language and target-language words to the monolingual dictionaries and the bilingual dictionary. In the original method, users validate whether some suffix variations of the word to be inserted are correct in order to find the most appropriate inflection paradigm. This method is now improved by taking advantage from the strong correlation detected between paradigms in both languages to reduce the search space of the target-language paradigm once the source-language paradigm is known. Results show that, when the source-language word has already been inserted, the system is able to more accurately predict which is the right target-language paradigm, and the number of queries posed to users is significantly reduced. Experiments also show that, when the source language and the target language are not closely related, it is only the source-language part-of-speech category, but not the rest of information provided by the source-language paradigm, which helps to correctly classify the target-language word.

2011

pdf bib
Using word alignments to assist computer-aided translation users by marking which target-side words to change or keep unedited
Miquel Esplà | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of the 15th Annual conference of the European Association for Machine Translation

pdf bib
Multimodal Building of Monolingual Dictionaries for Machine Translation by Non-Expert Users
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Using machine translation in computer-aided translation to suggest the target-side words to change
Miquel Esplà-Gomis | Felipe Sánchez-Martínez | Mikel L. Forcada
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Enlarging Monolingual Dictionaries for Machine Translation with Active Learning and Non-Expert Users
Miquel Esplà-Gomis | Víctor M. Sánchez-Cartagena | Juan Antonio Pérez-Ortiz
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2009

pdf bib
Bitextor: a Free/Open-source Software to Harvest Translation Memories from Multilingual Websites
Miquel Esplà-Gomis
Beyond Translation Memories: New Tools for Translators Workshop