Rodrigo Agerri

Also published as: R. Agerri


2022

pdf bib
BasqueGLUE: A Natural Language Understanding Benchmark for Basque
Gorka Urbizu | Iñaki San Vicente | Xabier Saralegi | Rodrigo Agerri | Aitor Soroa
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.

pdf bib
BasqueParl: A Bilingual Corpus of Basque Parliamentary Transcriptions
Nayla Escribano | Jon Ander Gonzalez | Julen Orbegozo-Terradillos | Ainara Larrondo-Ureta | Simón Peña-Fernández | Olatz Perez-de-Viñaspre | Rodrigo Agerri
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Parliamentary transcripts provide a valuable resource to understand the reality and know about the most important facts that occur over time in our societies. Furthermore, the political debates captured in these transcripts facilitate research on political discourse from a computational social science perspective. In this paper we release the first version of a newly compiled corpus from Basque parliamentary transcripts. The corpus is characterized by heavy Basque-Spanish code-switching, and represents an interesting resource to study political discourse in contrasting languages such as Basque and Spanish. We enrich the corpus with metadata related to relevant attributes of the speakers and speeches (language, gender, party...) and process the text to obtain named entities and lemmas. The obtained metadata is then used to perform a detailed corpus analysis which provides interesting insights about the language use of the Basque political representatives across time, parties and gender.

pdf bib
SemEval 2022 Task 10: Structured Sentiment Analysis
Jeremy Barnes | Laura Oberlaender | Enrica Troiano | Andrey Kutuzov | Jan Buchmann | Rodrigo Agerri | Lilja Øvrelid | Erik Velldal
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In this paper, we introduce the first SemEval shared task on Structured Sentiment Analysis, for which participants are required to predict all sentiment graphs in a text, where a single sentiment graph is composed of a sentiment holder, target, expression and polarity. This new shared task includes two subtracks (monolingual and cross-lingual) with seven datasets available in five languages, namely Norwegian, Catalan, Basque, Spanish and English. Participants submitted their predictions on a held-out test set and were evaluated on Sentiment Graph F1 . Overall, the task received over 200 submissions from 32 participating teams. We present the results of the 15 teams that provided system descriptions and our own expanded analysis of the test predictions.

pdf bib
A Semantics-Aware Approach to Automated Claim Verification
Blanca Calvo Figueras | Montse Oller | Rodrigo Agerri
Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER)

The influence of fake news in the perception of reality has become a mainstream topic in the last years due to the fast propagation of misleading information. In order to help in the fight against misinformation, automated solutions to fact-checking are being actively developed within the research community. In this context, the task of Automated Claim Verification is defined as assessing the truthfulness of a claim by finding evidence about its veracity. In this work we empirically demonstrate that enriching a BERT model with explicit semantic information such as Semantic Role Labelling helps to improve results in claim verification as proposed by the FEVER benchmark. Furthermore, we perform a number of explainability tests that suggest that the semantically-enriched model is better at handling complex cases, such as those including passive forms or multiple propositions.

2021

pdf bib
Benchmarking Meta-embeddings: What Works and What Does Not
Iker García-Ferrero | Rodrigo Agerri | German Rigau
Findings of the Association for Computational Linguistics: EMNLP 2021

In the last few years, several methods have been proposed to build meta-embeddings. The general aim was to obtain new representations integrating complementary knowledge from different source pre-trained embeddings thereby improving their overall quality. However, previous meta-embeddings have been evaluated using a variety of methods and datasets, which makes it difficult to draw meaningful conclusions regarding the merits of each approach. In this paper we propose a unified common framework, including both intrinsic and extrinsic tasks, for a fair and objective meta-embeddings evaluation. Furthermore, we present a new method to generate meta-embeddings, outperforming previous work on a large number of intrinsic evaluation benchmarks. Our evaluation framework also allows us to conclude that previous extrinsic evaluations of meta-embeddings have been overestimated.

pdf bib
Multilingual Counter Narrative Type Classification
Yi-Ling Chung | Marco Guerini | Rodrigo Agerri
Proceedings of the 8th Workshop on Argument Mining

The growing interest in employing counter narratives for hatred intervention brings with it a focus on dataset creation and automation strategies. In this scenario, learning to recognize counter narrative types from natural text is expected to be useful for applications such as hate speech countering, where operators from non-governmental organizations are supposed to answer to hate with several and diverse arguments that can be mined from online sources. This paper presents the first multilingual work on counter narrative type classification, evaluating SoTA pre-trained language models in monolingual, multilingual and cross-lingual settings. When considering a fine-grained annotation of counter narrative classes, we report strong baseline classification results for the majority of the counter narrative types, especially if we translate every language to English before cross-lingual prediction. This suggests that knowledge about counter narratives can be successfully transferred across languages.

2020

pdf bib
Multilingual Stance Detection in Tweets: The Catalonia Independence Corpus
Elena Zotova | Rodrigo Agerri | Manuel Nuñez | German Rigau
Proceedings of the Twelfth Language Resources and Evaluation Conference

Stance detection aims to determine the attitude of a given text with respect to a specific topic or claim. While stance detection has been fairly well researched in the last years, most the work has been focused on English. This is mainly due to the relative lack of annotated data in other languages. The TW-10 referendum Dataset released at IberEval 2018 is a previous effort to provide multilingual stance-annotated data in Catalan and Spanish. Unfortunately, the TW-10 Catalan subset is extremely imbalanced. This paper addresses these issues by presenting a new multilingual dataset for stance detection in Twitter for the Catalan and Spanish languages, with the aim of facilitating research on stance detection in multilingual and cross-lingual settings. The dataset is annotated with stance towards one topic, namely, the ndependence of Catalonia. We also provide a semi-automatic method to annotate the dataset based on a categorization of Twitter users. We experiment on the new corpus with a number of supervised approaches, including linear classifiers and deep learning methods. Comparison of our new corpus with the with the TW-1O dataset shows both the benefits and potential of a well balanced corpus for multilingual and cross-lingual research on stance detection. Finally, we establish new state-of-the-art results on the TW-10 dataset, both for Catalan and Spanish.

pdf bib
Give your Text Representation Models some Love: the Case for Basque
Rodrigo Agerri | Iñaki San Vicente | Jon Ander Campos | Ander Barrena | Xabier Saralegi | Aitor Soroa | Eneko Agirre
Proceedings of the Twelfth Language Resources and Evaluation Conference

Word embeddings and pre-trained language models allow to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state-of-the-art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

2019

pdf bib
Doris Martin at SemEval-2019 Task 4: Hyperpartisan News Detection with Generic Semi-supervised Features
Rodrigo Agerri
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper we describe our participation to the Hyperpartisan News Detection shared task at SemEval 2019. Motivated by the late arrival of Doris Martin, we test a previously developed document classification system which consists of a combination of clustering features implemented on top of some simple shallow local features. We show how leveraging distributional features obtained from large in-domain unlabeled data helps to easily and quickly develop a reasonably good performing system for detecting hyperpartisan news. The system and models generated for this task are publicly available.

2018

pdf bib
Developing New Linguistic Resources and Tools for the Galician Language
Rodrigo Agerri | Xavier Gómez Guinovart | German Rigau | Miguel Anxo Solla Portela
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Annotating Abstract Meaning Representations for Spanish
Noelia Migueles-Abraira | Rodrigo Agerri | Arantza Diaz de Ilarraza
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Building Named Entity Recognition Taggers via Parallel Corpora
Rodrigo Agerri | Yiling Chung | Itziar Aldabe | Nora Aranberri | Gorka Labaka | German Rigau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2015

pdf bib
EliXa: A Modular and Flexible ABSA Platform
Iñaki San Vicente | Xabier Saralegi | Rodrigo Agerri
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf bib
IXA pipeline: Efficient and Ready to Use Multilingual NLP tools
Rodrigo Agerri | Josu Bermudez | German Rigau
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

IXA pipeline is a modular set of Natural Language Processing tools (or pipes) which provide easy access to NLP technology. It offers robust and efficient linguistic annotation to both researchers and non-NLP experts with the aim of lowering the barriers of using NLP technology either for research purposes or for small industrial developers and SMEs. IXA pipeline can be used “as is” or exploit its modularity to pick and change different components. Given its open-source nature, it can also be modified and extended for it to work with other languages. This paper describes the general data-centric architecture of IXA pipeline and presents competitive results in several NLP annotations for English and Spanish.

pdf bib
Generating Polarity Lexicons with WordNet propagation in 5 languages
Isa Maks | Ruben Izquierdo | Francesca Frontini | Rodrigo Agerri | Piek Vossen | Andoni Azpeitia
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we focus on the creation of general-purpose (as opposed to domain-specific) polarity lexicons in five languages: French, Italian, Dutch, English and Spanish using WordNet propagation. WordNet propagation is a commonly used method to generate these lexicons as it gives high coverage of general purpose language and the semantically rich WordNets where concepts are organised in synonym , antonym and hyperonym/hyponym structures seem to be well suited to the identification of positive and negative words. However, WordNets of different languages may vary in many ways such as the way they are compiled, the number of synsets, number of synonyms and number of semantic relations they include. In this study we investigate whether this variability translates into differences of performance when these WordNets are used for polarity propagation. Although many variants of the propagation method are developed for English, little is known about how they perform with WordNets of other languages. We implemented a propagation algorithm and designed a method to obtain seed lists similar with respect to quality and size, for each of the five languages. We evaluated the results against gold standards also developed according to a common method in order to achieve as less variance as possible between the different languages.

pdf bib
Simple, Robust and (almost) Unsupervised Generation of Polarity Lexicons for Multiple Languages
Iñaki San Vicente | Rodrigo Agerri | German Rigau
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Multilingual, Efficient and Easy NLP Processing with IXA Pipeline
Rodrigo Agerri | Josu Bermudez | German Rigau
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

2012

pdf bib
SUMAT: Data Collection and Parallel Corpus Compilation for Machine Translation of Subtitles
Volha Petukhova | Rodrigo Agerri | Mark Fishel | Sergio Penkale | Arantza del Pozo | Mirjam Sepesy Maučec | Andy Way | Panayota Georgakopoulou | Martin Volk
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Subtitling and audiovisual translation have been recognized as areas that could greatly benefit from the introduction of Statistical Machine Translation (SMT) followed by post-editing, in order to increase efficiency of subtitle production process. The FP7 European project SUMAT (An Online Service for SUbtitling by MAchine Translation: http://www.sumat-project.eu) aims to develop an online subtitle translation service for nine European languages, combined into 14 different language pairs, in order to semi-automate the subtitle translation processes of both freelance translators and subtitling companies on a large scale. In this paper we discuss the data collection and parallel corpus compilation for training SMT systems, which includes several procedures such as data partition, conversion, formatting, normalization and alignment. We discuss in detail each data pre-processing step using various approaches. Apart from the quantity (around 1 million subtitles per language pair), the SUMAT corpus has a number of very important characteristics. First of all, high quality both in terms of translation and in terms of high-precision alignment of parallel documents and their contents has been achieved. Secondly, the contents are provided in one consistent format and encoding. Finally, additional information such as type of content in terms of genres and domain is available.

2010

pdf bib
Q-WordNet: Extracting Polarity from WordNet Senses
Rodrigo Agerri | Ana García-Serrano
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents Q-WordNet, a lexical resource consisting of WordNet senses automatically annotated by positive and negative polarity. Polarity classification amounts to decide whether a text (sense, sentence, etc.) may be associated to positive or negative connotations. Polarity classification is becoming important within the fields of Opinion Mining and Sentiment Analysis for determining opinions about commercial products, on companies reputation management, brand monitoring, or to track attitudes by mining online forums, blogs, etc. Inspired by work on classification of word senses by polarity (e.g., SentiWordNet), and taking WordNet as a starting point, we build Q-WordNet. Instead of applying external tools such as supervised classifiers to annotated WordNet synsets by polarity, we try to effectively maximize the linguistic information contained in WordNet, thereby taking advantage of the human effort put by lexicographers and annotators. The resulting resource is a subset of WordNet senses classified as positive or negative. In this approach, neutral polarity is seen as the absence of positive or negative polarity. The evaluation of Q-WordNet shows an improvement with respect to previous approaches. We believe that Q-WordNet can be used as a starting point for data-driven approaches in sentiment analysis.

2008

pdf bib
Metaphor in Textual Entailment
Rodrigo Agerri
Coling 2008: Companion volume: Posters

pdf bib
Textual Entailment as an Evaluation Framework for Metaphor Resolution: A Proposal
Rodrigo Agerri | John Barnden | Mark Lee | Alan Wallington
Semantics in Text Processing. STEP 2008 Conference Proceedings

2007

pdf bib
On the formalization of Invariant Mappings for Metaphor Interpretation
Rodrigo Agerri | John Barnden | Mark Lee | Alan Wallington
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

pdf bib
Considerations on the nature of metaphorical meaning arising from a computational treatment of metaphor interpretation
A.M. Wallington | R. Agerri | J.A. Barnden | S.R. Glasbey | M.G. Lee
Proceedings of the Fifth International Workshop on Inference in Computational Semantics (ICoS-5)