Mireia Farrús


2024

TEMA: Token Embeddings Mapping for Enriching Low-Resource Language Models
Rodolfo Zevallos | Núria Bel | Mireia Farrús
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The objective of the research we present is to remedy the problem of the low quality of language models for low-resource languages. We introduce an algorithm, the Token Embedding Mapping Algorithm (TEMA), that maps the token embeddings of a richly pre-trained model L1 to a poorly trained model L2, thus creating a richer L2’ model. Our experiments show that the L2’ model reduces perplexity with respect to the original monolingual model L2, and that for downstream tasks, including SuperGLUE, the results are state-of-the-art or better for the most semantic tasks. The models obtained with TEMA are also competitive with or better than multilingual or extended models proposed as solutions for mitigating the low-resource language problems.

Improving NMT from a Low-Resource Source Language: A Use Case from Catalan to Chinese via Spanish
Yongjian Chen | Antonio Toral | Zhijian Li | Mireia Farrús
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

The effectiveness of neural machine translation is markedly constrained in low-resource scenarios, where the scarcity of parallel data hampers the development of robust models. This paper focuses on the scenario where the source language is low-resource and there exists a related high-resource language, for which we introduce a novel approach that combines pivot translation and multilingual training. As a use case we tackle the automatic translation from Catalan to Chinese, using Spanish as an additional language. Our evaluation, conducted on the FLORES-200 benchmark, compares our new approach against a vanilla baseline alongside other models representing various low-resource techniques in the Catalan-to-Chinese context. Experimental results highlight the efficacy of our proposed method, which outperforms existing models, notably demonstrating significant improvements both in translation quality and in lexical diversity.

2023

Frequency Balanced Datasets Lead to Better Language Models
Rodolfo Zevallos | Mireia Farrús | Núria Bel
Findings of the Association for Computational Linguistics: EMNLP 2023

This paper reports on experiments aimed at improving our understanding of the role of the amount of data required for training attention-based transformer language models. Specifically, we investigate the impact of reducing the immense amounts of required pre-training data through sampling strategies that identify and reduce high-frequency tokens, as different studies have indicated that the existence of very high-frequency tokens in pre-training data might bias learning, causing undesired effects. In this light, we describe our sampling algorithm that iteratively assesses token frequencies and removes sentences that still contain high-frequency tokens, eventually delivering a balanced, linguistically correct dataset. We evaluate the results in terms of model perplexity and fine-tuning linguistic probing tasks, NLP downstream tasks, as well as more semantic SuperGLUE tasks. The results show that pre-training with the resulting balanced dataset allows the pre-training data to be reduced up to threefold.

2022

Recycle Your Wav2Vec2 Codebook: A Speech Perceiver for Keyword Spotting
Guillermo Cámbara | Jordi Luque | Mireia Farrús
Proceedings of the 29th International Conference on Computational Linguistics

Speech information in a pretrained wav2vec2.0 model is usually leveraged through its encoder, which has at least 95M parameters, making it poorly suited to small-footprint Keyword Spotting. In this work, we show an efficient way of profiting from wav2vec2.0’s linguistic knowledge, by recycling the phonetic information encoded in its latent codebook, which has typically been thrown away after pretraining. We do so by transferring the codebook as weights for the latent bottleneck of a Keyword Spotting Perceiver, thus initializing the model with phonetic embeddings. The Perceiver design relies on cross-attention between these embeddings and input data to generate better representations. Our method delivers accuracy gains compared to random initialization, at no latency cost. In addition, we show that the phonetic embeddings can easily be downsampled with k-means clustering, speeding up inference by 3.5 times with only a slight accuracy penalty.

2018

Compilation of Corpora for the Study of the Information Structure–Prosody Interface
Alicia Burga | Mónica Domínguez | Mireia Farrús | Leo Wanner
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Automatic Extraction of Parallel Speech Corpora from Dubbed Movies
Alp Öktem | Mireia Farrús | Leo Wanner
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper presents a methodology to extract parallel speech corpora based on any language pair from dubbed movies, together with an application framework in which some corresponding prosodic parameters are extracted. The obtained parallel corpora are especially suitable for speech-to-speech translation applications when a prosody transfer between source and target languages is desired.

2016

An Automatic Prosody Tagger for Spontaneous Speech
Mónica Domínguez | Mireia Farrús | Leo Wanner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Speech prosody is known to be central in advanced communication technologies. However, despite the advances of theoretical studies in speech prosody, so far, no large scale prosody annotated resources that would facilitate empirical research and the development of empirical computational approaches are available. This is to a large extent due to the fact that current common prosody annotation conventions offer a descriptive framework of intonation contours and phrasing based on labels. This makes it difficult to reach a satisfactory inter-annotator agreement during the annotation of gold standard annotations and, subsequently, to create consistent large scale annotations. To address this problem, we present an annotation schema for prominence and boundary labeling of prosodic phrases based upon acoustic parameters and a tagger for prosody annotation at the prosodic phrase level. Evaluation proves that inter-annotator agreement reaches satisfactory values, from 0.60 to 0.80 Cohen’s kappa, while the prosody tagger achieves acceptable recall and f-measure figures for five spontaneous samples used in the evaluation of monologue and dialogue formats in English and Spanish. The work presented in this paper is a first step towards a semi-automatic acquisition of large corpora for empirical prosodic analysis.

Praat on the Web: An Upgrade of Praat for Semi-Automatic Speech Annotation
Mónica Domínguez | Iván Latorre | Mireia Farrús | Joan Codina-Filbà | Leo Wanner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

This paper presents an implementation of the widely used speech analysis tool Praat as a web application with an extended functionality for feature annotation. In particular, Praat on the Web addresses some of the central limitations of the original Praat tool and provides (i) enhanced visualization of annotations in a dedicated window for feature annotation at interval and point segments, (ii) a dynamic scripting composition exemplified with a modular prosody tagger, and (iii) portability and an operational web interface. Speech annotation tools with such a functionality are key for exploring large corpora and designing modular pipelines.

2010

Linguistic-based Evaluation Criteria to identify Statistical Machine Translation Errors
Mireia Farrús | Marta R. Costa-jussà | José B. Mariño | José A. R. Fonollosa
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

Automatic and Human Evaluation Study of a Rule-based and a Statistical Catalan-Spanish Machine Translation Systems
Marta R. Costa-jussà | Mireia Farrús | José B. Mariño | José A. R. Fonollosa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Machine translation systems can be classified into rule-based and corpus-based approaches, in terms of their core technology. Since both paradigms have been widely used in recent years, one of the aims in the research community is to know how these systems differ in terms of translation quality. To this end, this paper reports a study and comparison of a rule-based and a corpus-based (specifically, statistical) Catalan-Spanish machine translation system, both of them freely available on the web. The translation quality analysis is performed in two different domains: journalistic and medical. The systems are evaluated by using standard automatic measures, as well as by native human evaluators. Automatic results show that the statistical system performs better than the rule-based system. Human judgements show that in the Spanish-to-Catalan direction the statistical system also performs better than the rule-based system, while in the Catalan-to-Spanish direction it is the other way round. Although the statistical system obtains the best automatic scores, its errors tend to be more penalized by human judgements than the errors of the rule-based system. This can be explained by the fact that statistical errors are usually unexpected and do not follow any pattern.

2009

Improving a Catalan-Spanish Statistical Translation System using Morphosyntactic Knowledge
Mireia Farrús | Marta R. Costa-jussà | Marc Poch | Adolfo Hernández | José B. Mariño
Proceedings of the 13th Annual Conference of the European Association for Machine Translation