Maite Melero


2022

Unsupervised Machine Translation in Real-World Scenarios
Ona de Gibert Bonet | Iakes Goenaga | Jordi Armengol-Estapé | Olatz Perez-de-Viñaspre | Carla Parra Escartín | Marina Sanchez | Mārcis Pinnis | Gorka Labaka | Maite Melero
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this work, we present the work that has been carried out in the MT4All CEF project and the resources that it has generated by leveraging recent research in the field of unsupervised learning. In the course of the project, 18 monolingual corpora for specific domains and languages have been collected, and 12 bilingual dictionaries and translation models have been generated. As part of the research, the unsupervised MT methodology based only on monolingual corpora (Artetxe et al., 2017) has been tested on a variety of languages and domains. Results show that in specialised domains, when there is enough monolingual in-domain data, unsupervised results are comparable to those of general domain supervised translation, and that, at any rate, unsupervised techniques can be used to boost results whenever very little data is available.

On the Multilingual Capabilities of Very Large-Scale English Language Models
Jordi Armengol-Estapé | Ona de Gibert Bonet | Maite Melero
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Generative Pre-trained Transformers (GPTs) have recently been scaled to unprecedented sizes in the history of machine learning. These models, solely trained on the language modeling objective, have been shown to exhibit outstanding zero-, one-, and few-shot learning capabilities in a number of different tasks. Nevertheless, aside from anecdotal experiences, little is known regarding their multilingual capabilities, given the fact that the pre-training corpus is almost entirely composed of English text. In this work, we investigate their potential and limits in three tasks: extractive question-answering, text summarization and natural language generation for five different languages, as well as the effect of scale in terms of model size. Our results show that GPT-3 can be almost as useful for many languages as it is for English, with room for improvement if optimization of the tokenization is addressed.

Spanish Datasets for Sensitive Entity Detection in the Legal Domain
Ona de Gibert Bonet | Aitor García Pablos | Montse Cuadros | Maite Melero
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The de-identification of sensitive data, also known as automatic textual anonymisation, is essential for data sharing and reuse, both for research and commercial purposes. The first step for data anonymisation is the detection of sensitive entities. In this work, we present four new datasets for named entity detection in Spanish in the legal domain. These datasets have been generated in the framework of the MAPA project: three smaller datasets have been manually annotated and one large dataset has been automatically annotated, with an estimated error rate of around 14%. In order to assess the quality of the generated datasets, we have used them to fine-tune a battery of entity-detection models, using as foundation different pre-trained language models: one multilingual, two general-domain monolingual and one in-domain monolingual. We compare the results obtained, which validate the datasets as a valuable resource to fine-tune models for the task of named entity detection. We further explore the proposed methodology by applying it to a real use case scenario.

Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages
Maite Melero | Sakriani Sakti | Claudia Soria
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

Quality versus Quantity: Building Catalan-English MT Resources
Ona de Gibert Bonet | Ksenia Kharitonova | Blanca Calvo Figueras | Jordi Armengol-Estapé | Maite Melero
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

In this work, we make the case for quality over quantity when training an MT system for a medium-to-low-resource language pair, namely Catalan-English. We compile our training corpus out of existing resources of varying quality and a new high-quality corpus. We also provide new evaluation translation datasets in three different domains. In the process of building Catalan-English parallel resources, we evaluate the impact of drastically filtering alignments on the resulting MT engines. Our results show that even when resources are limited, as in this case, it is worth filtering for quality. We further explore the cross-lingual transfer learning capabilities of the proposed model for parallel corpus filtering by applying it to other languages. All resources generated in this work are released under an open license to encourage the development of language technology in Catalan.

2021

Transfer Learning with Shallow Decoders: BSC at WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task
Ksenia Kharitonova | Ona de Gibert Bonet | Jordi Armengol-Estapé | Mar Rodriguez i Alvarez | Maite Melero
Proceedings of the Sixth Conference on Machine Translation

This paper describes the participation of the BSC team in WMT2021’s Multilingual Low-Resource Translation for Indo-European Languages Shared Task. The system addresses Subtask 2: Wikipedia cultural heritage articles, which involves translation in four Romance languages: Catalan, Italian, Occitan and Romanian. The submitted system is a multilingual semi-supervised machine translation model. It is based on a pre-trained language model, namely XLM-RoBERTa, that is later fine-tuned with parallel data obtained mostly from OPUS. Unlike other works, we only use XLM to initialize the encoder and randomly initialize a shallow decoder. The reported results are robust and perform well for all tested languages.

Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
Jordi Armengol-Estapé | Casimiro Pio Carrino | Carlos Rodriguez-Penagos | Ona de Gibert Bonet | Carme Armentano-Oller | Aitor Gonzalez-Agirre | Maite Melero | Marta Villegas
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Neural Translation for European Union (NTEU)
Mercedes García-Martínez | Laurent Bié | Aleix Cerdà | Amando Estela | Manuel Herranz | Rihards Krišlauks | Maite Melero | Tony O’Dowd | Sinead O’Gorman | Marcis Pinnis | Artūrs Stafanovič | Riccardo Superbo | Artūrs Vasiļevskis
Proceedings of Machine Translation Summit XVIII: Users and Providers Track

The Neural Translation for the European Union (NTEU) engine farm enables direct machine translation for all 24 official languages of the European Union without the necessity to use a high-resourced language as a pivot. This amounts to a total of 552 translation engines for all combinations of the 24 languages. We have collected parallel data for all the language combinations publicly shared in elrc-share.eu. The translation engines have been customized to domain, for the use of the European public administrations. The delivered engines will be published in the European Language Grid. In addition to the usual automatic metrics, all the engines have been evaluated by humans based on the direct assessment methodology. For this purpose, we built an open-source platform called MTET. The evaluation shows that most of the engines reach high quality and get better scores compared to an external machine translation service in a blind evaluation setup.

2020

ELRI: A Decentralised Network of National Relay Stations to Collect, Prepare and Share Language Resources
Thierry Etchegoyhen | Borja Anza Porras | Andoni Azpeitia | Eva Martínez Garcia | José Luis Fonseca | Patricia Fonseca | Paulo Vale | Jane Dunne | Federico Gaspari | Teresa Lynn | Helen McHugh | Andy Way | Victoria Arranz | Khalid Choukri | Hervé Pusset | Alexandre Sicard | Rui Neto | Maite Melero | David Perez | António Branco | Ruben Branco | Luís Gomes
Proceedings of the 1st International Workshop on Language Technology Platforms

We describe the European Language Resource Infrastructure (ELRI), a decentralised network to help collect, prepare and share language resources. The infrastructure was developed within a project co-funded by the Connecting Europe Facility Programme of the European Union, and has been deployed in the four Member States participating in the project, namely France, Ireland, Portugal and Spain. ELRI provides sustainable and flexible means to collect and share language resources via National Relay Stations, to which members of public institutions can freely subscribe. The infrastructure includes fully automated data processing engines to facilitate the preparation, sharing and wider reuse of useful language resources that can help optimise human and automated translation services in the European Union.

Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)
Maite Melero
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)

The Multilingual Anonymisation Toolkit for Public Administrations (MAPA) Project
Ēriks Ajausks | Victoria Arranz | Laurent Bié | Aleix Cerdà-i-Cucó | Khalid Choukri | Montse Cuadros | Hans Degroote | Amando Estela | Thierry Etchegoyhen | Mercedes García-Martínez | Aitor García-Pablos | Manuel Herranz | Alejandro Kohan | Maite Melero | Mike Rosner | Roberts Rozis | Patrick Paroubek | Artūrs Vasiļevskis | Pierre Zweigenbaum
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

We describe the MAPA project, funded under the Connecting Europe Facility programme, whose goal is the development of an open-source de-identification toolkit for all official European Union languages. It will be developed from January 2020 until December 2021.

Neural Translation for the European Union (NTEU) Project
Laurent Bié | Aleix Cerdà-i-Cucó | Hans Degroote | Amando Estela | Mercedes García-Martínez | Manuel Herranz | Alejandro Kohan | Maite Melero | Tony O’Dowd | Sinéad O’Gorman | Mārcis Pinnis | Roberts Rozis | Riccardo Superbo | Artūrs Vasiļevskis
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

The Neural Translation for the European Union (NTEU) project aims to build a neural engine farm with all European official language combinations for eTranslation, without the necessity to use a high-resourced language as a pivot. NTEU started in September 2019 and will run until August 2021.

2018

ELRI - European Language Resources Infrastructure
Thierry Etchegoyhen | Borja Anza Porras | Andoni Azpeitia | Eva Martínez Garcia | Paulo Vale | José Luis Fonseca | Teresa Lynn | Jane Dunne | Federico Gaspari | Andy Way | Victoria Arranz | Khalid Choukri | Vladimir Popescu | Pedro Neiva | Rui Neto | Maite Melero | David Perez Fernandez | Antonio Branco | Ruben Branco | Luis Gomes
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

We describe the European Language Resources Infrastructure project, whose main aim is the provision of an infrastructure to help collect, prepare and share language resources that can in turn improve translation services in Europe.

2016

Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries
Marta Villegas | Maite Melero | Núria Bel | Jorge Gracia
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The experiments presented here exploit the properties of the Apertium RDF Graph, principally cycle density and nodes’ degree, to automatically generate new translation relations between words, and therefore to enrich existing bilingual dictionaries with new entries. Currently, the Apertium RDF Graph includes data from 22 Apertium bilingual dictionaries and constitutes a large unified array of linked lexical entries and translations that are available and accessible on the Web (http://linguistic.linkeddata.es/apertium/). In particular, its graph structure allows for interesting exploitation opportunities, some of which are addressed in this paper. Two ‘massive’ experiments are reported: in the first one, the original EN-ES translation set was removed from the Apertium RDF Graph and a new EN-ES version was generated. The results were compared against the previously removed EN-ES data and against the Concise Oxford Spanish Dictionary. In the second experiment, a new non-existent EN-FR translation set was generated. In this case the results were compared against a converted Wiktionary English-French file. The results we got are really good and perform well for the extreme case of correlated polysemy. This led us to address the possibility of using cycles and nodes’ degree to identify potential oddities in the source data. If cycle density proves efficient when considering potential targets, we can assume that in dense graphs nodes with low degree may indicate potential errors.

2014

The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field.

Metadata as Linked Open Data: mapping disparate XML metadata registries into one RDF/OWL registry.
Marta Villegas | Maite Melero | Núria Bel
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The proliferation of different metadata schemas and models poses serious problems of interoperability. Maintaining isolated repositories with overlapping data is costly in terms of time and effort. In this paper, we describe how we have achieved a Linked Open Data version of metadata descriptions coming from heterogeneous sources, originally encoded in XML. The resulting model is much simpler than the original XSD schema and avoids problems typical of XML syntax, such as semantic ambiguity and order constraint. Moreover, the open world assumption of RDF/OWL allows us to naturally integrate objects from different schemas and to add further extensions, facilitating merging of different models as well as linking to external data. Apart from the advantages in terms of interoperability and maintainability, the merged repository enables end-users to query multiple sources using a unified schema and is able to present them with implicit knowledge derived from the linked data. The approach we present here is easily scalable to any number of sources and schemas.

EUMSSI: a Platform for Multimodal Analysis and Recommendation using UIMA
Jens Grivolla | Maite Melero | Toni Badia | Cosmin Cabulea | Yannick Estève | Eelco Herder | Jean-Marc Odobez | Susanne Preuß | Raúl Marín
Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT

2012

Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT
Josef van Genabith | Toni Badia | Christian Federmann | Maite Melero | Marta R. Costa-jussà | Tsuyoshi Okita
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

Results from the ML4HMT-12 Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation
Christian Federmann | Tsuyoshi Okita | Maite Melero | Marta R. Costa-Jussa | Toni Badia | Josef van Genabith
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

A Richly Annotated, Multilingual Parallel Corpus for Hybrid Machine Translation
Eleftherios Avramidis | Marta R. Costa-jussà | Christian Federmann | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In recent years, machine translation (MT) research has focused on investigating how hybrid machine translation as well as system combination approaches can be designed so that the resulting hybrid translations show an improvement over the individual “component” translations. As a first step towards achieving this objective we have developed a parallel corpus with source text and the corresponding translation output from a number of machine translation engines, annotated with metadata information, capturing aspects of the translation process performed by the different MT systems. This corpus aims to serve as a basic resource for further research on whether hybrid machine translation algorithms and system combination techniques can benefit from additional (linguistically motivated, decoding, and runtime) information provided by the different systems involved. In this paper, we describe the annotated corpus we have created. We provide an overview on the component MT systems and the XLIFF-based annotation format we have developed. We also report on first experiments with the ML4HMT corpus data.

Holaaa!! writin like u talk is kewl but kinda hard 4 NLP
Maite Melero | Marta R. Costa-Jussà | Judith Domingo | Montse Marquina | Martí Quixal
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present work in progress aiming to build tools for the normalization of User-Generated Content (UGC). As we will see, the task requires the revisiting of the initial steps of NLP processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics, and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging, and finally we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.

The ML4HMT Workshop on Optimising the Division of Labour in Hybrid Machine Translation
Christian Federmann | Eleftherios Avramidis | Marta R. Costa-jussà | Josef van Genabith | Maite Melero | Pavel Pecina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the “Shared Task on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid Machine Translation” (ML4HMT) which aims to foster research on improved system combination approaches for machine translation (MT). Participants of the challenge are requested to build hybrid translations by combining the output of several MT systems of different types. We first describe the ML4HMT corpus used in the shared task, then explain the XLIFF-based annotation format we have designed for it, and briefly summarize the participating systems. Using both automated metrics scores and extensive manual evaluation, we discuss the individual performance of the various systems. An interesting result from the shared task is the fact that we were able to observe different systems winning according to the automated metrics scores when compared to the results from the manual evaluation. We conclude by summarising the first edition of the challenge and by giving an outlook to future work.

2010

Language Technology Challenges of a ‘Small’ Language (Catalan)
Maite Melero | Gemma Boleda | Montse Cuadros | Cristina España-Bonet | Lluís Padró | Martí Quixal | Carlos Rodríguez | Roser Saurí
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present a brief snapshot of the state of affairs in computational processing of Catalan and the initiatives that are starting to take place in an effort to bring the field a step forward, by making a better and more efficient use of the already existing resources and tools, by bridging the gap between research and market, and by establishing periodical meeting points for the community. In particular, we present the results of the First Workshop on the Computational Processing of Catalan, which succeeded in putting together a fair representation of the research in the area, and received attention from both the industry and the administration. Aside from facilitating communication among researchers and between developers and users, the Workshop provided the organizers with valuable information about existing resources, tools, developers and providers. This information has allowed us to go a step further by setting up a “harvesting” procedure which will hopefully build the seed of a portal-catalogue-observatory of language resources and technologies in Catalan.

2008

Rapid Deployment of a New METIS Language Pair: Catalan-English
Toni Badia | Maite Melero | Oriol Valentín
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We show here the viability of a rapid deployment of a new language pair within the METIS architecture. To do so, we have benefited from the approach of our existing Spanish-English system, which is particularly generation intensive. Contrary to other SMT or EBMT systems, the METIS architecture allows us to forgo parallel texts, which for many language pairs, such as Catalan-English, are hard to obtain. In this experiment, we have successfully built a Catalan-English prototype by simply plugging a POS tagger for Catalan and a bilingual Catalan-English dictionary into the English generation part of the system already developed for other language pairs.

Evaluation of a Machine Translation System for Low Resource Languages: METIS-II
Vincent Vandeghinste | Peter Dirix | Ineke Schuurman | Stella Markantonatou | Sokratis Sofianopoulos | Marina Vassiliou | Olga Yannoutsou | Toni Badia | Maite Melero | Gemma Boleda | Michael Carl | Paul Schmidt
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evaluation on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are expectably lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.

2007

Demonstration of the Spanish to English METIS-II MT system
Maite Melero | Toni Badia
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

2005

An n-gram Approach to Exploiting a Monolingual Corpus for Machine Translation
Toni Badia | Gemma Boleda | Maite Melero | Antoni Oliver
Workshop on example-based machine translation

2002

Combining Machine Learning and Rule-based Approaches in Spanish and Japanese Sentence Realization
Maite Melero | Takako Aikawa | Lee Schwartz
Proceedings of the International Natural Language Generation Conference

2001

Generation for multilingual MT
Takako Aikawa | Maite Melero | Lee Schwartz | Andi Wu
Proceedings of Machine Translation Summit VIII

This paper presents an overview of the broad-coverage, application-independent natural language generation component of the NLP system being developed at Microsoft Research. It demonstrates how this component functions within a multilingual Machine Translation system (MSR-MT), using the languages that we are currently working on (English, Spanish, Japanese, and Chinese). Section 1 provides a system description of MSR-MT. Section 2 focuses on the generation component and its set of core rules. Section 3 describes an additional layer of generation rules with examples that address issues specific to MT. Section 4 presents evaluation results in the context of MSR-MT. Section 5 addresses generation issues outside of MT.

Multilingual Sentence Generation
Takako Aikawa | Maite Melero | Lee Schwartz | Andi Wu
Proceedings of the ACL 2001 Eighth European Workshop on Natural Language Generation (EWNLG)
