Aitor Soroa - ACL Anthology

Aitor Soroa

Also published as: Aitor Soroa Etxabe, A. Soroa

2025

EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching
Maite Heredia | Jeremy Barnes | Aitor Soroa
Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching

Code-switching (CS) remains a significant challenge in Natural Language Processing (NLP), mainly due a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources to analyse this phenomenon and support the development and evaluation of models capable of understanding and generating code-switched language for this language pair are almost non-existent. We introduce a first approach to develop a naturally sourced corpus for Basque-Spanish code-switching. Our methodology consists of identifying CS texts from previously available corpora using language identification models, which are then manually validated to obtain a reliable subset of CS instances. We present the properties of our corpus and make it available under the name EuskañolDS.

Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
Oscar Sainz | Naiara Perez | Julen Etxaniz | Joseba Fernandez de Landa | Itziar Aldabe | Iker García-Ferrero | Aimar Zabala | Ekhi Azurmendi | German Rigau | Eneko Agirre | Mikel Artetxe | Aitor Soroa
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using as backbone an instruction-tuned model outperforms using a base non-instructed model. Scaling up to Llama 3.1 Instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.

Proceedings of the First Workshop on Multilingual Counterspeech Generation
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Ráez | Aitor Soroa | María Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri
Proceedings of the First Workshop on Multilingual Counterspeech Generation

The First Workshop on Multilingual Counterspeech Generation at COLING 2025: Overview of the Shared Task
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Raez | Aitor Soroa | María-Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri
Proceedings of the First Workshop on Multilingual Counterspeech Generation

This paper presents an overview of the Shared Task organized in the First Workshop on Multilingual Counterspeech Generation at COLING 2025. While interest in automatic approaches to Counterspeech generation has been steadily growing, the large majority of the published experimental work has been carried out for English. This is due to the scarcity of both non-English manually curated training data and to the crushing predominance of English in the generative Large Language Models (LLMs) ecosystem. The task’s goal is to promote and encourage research on Counterspeech generation in a multilingual setting (Basque, English, Italian, and Spanish) potentially leveraging background knowledge provided in the proposed dataset. The task attracted 11 participants, 9 of whom presented a paper describing their systems. Together with the task, we introduce a new multilingual counterspeech dataset with 2384 triplets of hate speech, counterspeech, and related background knowledge covering 4 languages. The dataset is available at: https://huggingface.co/datasets/LanD-FBK/ML_MTCONAN_KN.

2024

Latxa: An Open Language Model and Evaluation Suite for Basque
Julen Etxaniz | Oscar Sainz | Naiara Perez | Itziar Aldabe | German Rigau | Eneko Agirre | Aitor Ormazabal | Mikel Artetxe | Aitor Soroa
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,046 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa family of models, as well as our new pretraining corpora and evaluation datasets, are publicly available under open licenses. Our suite enables reproducible research on methods to build LLMs for low-resource languages.

A LLM-based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
Irune Zubiaga | Aitor Soroa | Rodrigo Agerri
Findings of the Association for Computational Linguistics: EMNLP 2024

This paper proposes a novel approach to evaluate Counter Narrative (CN) generation using a Large Language Model (LLM) as an evaluator. We show that traditional automatic metrics correlate poorly with human judgements and fail to capture the nuanced relationship between generated CNs and human perception. To alleviate this, we introduce a model ranking pipeline based on pairwise comparisons of generated CNs from different models, organized in a tournament-style format. The proposed evaluation method achieves a high correlation with human preference, with a ρ score of 0.88. As an additional contribution, we leverage LLMs as zero-shot CN generators and provide a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in zero-shot are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.

XNLIeu: a dataset for cross-lingual NLI in Basque
Maite Heredia | Julen Etxaniz | Muitze Zulaika | Xabier Saralegi | Jeremy Barnes | Aitor Soroa
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested in a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.

Do Multilingual Language Models Think Better in English?
Julen Etxaniz | Gorka Azkune | Aitor Soroa | Oier Lopez de Lacalle | Mikel Artetxe
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Translate-test is a popular technique to improve the performance of multilingual language models. This approach works by translating the input into English using an external machine translation system before running inference. However, these improvements can be attributed to the use of a separate translation system, which is typically trained on large amounts of parallel data not seen by the language model. In this work, we introduce a new approach called self-translate that leverages the few-shot translation capabilities of multilingual language models. This allows us to analyze the effect of translation in isolation. Experiments over 5 tasks show that self-translate consistently outperforms direct inference, demonstrating that language models are unable to leverage their full multilingual potential when prompted in non-English languages. Our code is available at https://github.com/juletx/self-translate.

Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLM). This quantitative approach has resulted in concerns for the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.

2023

Scaling Laws for BERT in Low-Resource Settings
Gorka Urbizu | Iñaki San Vicente | Xabier Saralegi | Rodrigo Agerri | Aitor Soroa
Findings of the Association for Computational Linguistics: ACL 2023

Large language models are very resource intensive, both financially and environmentally, and require an amount of training data which is simply unobtainable for the majority of NLP practitioners. Previous work has researched the scaling laws of such models, but optimal ratios of model parameters, dataset size, and computation costs focused on the large scale. In contrast, we analyze the effect those variables have on the performance of language models in constrained settings, by building three lightweight BERT models (16M/51M/124M parameters) trained over a set of small corpora (5M/25M/125M words).We experiment on four languages of different linguistic characteristics (Basque, Spanish, Swahili and Finnish), and evaluate the models on MLM and several NLU tasks. We conclude that the power laws for parameters, data and compute for low-resource settings differ from the optimal scaling laws previously inferred, and data requirements should be higher. Our insights are consistent across all the languages we study, as well as across the MLM and downstream tasks. Furthermore, we experimentally establish when the cost of using a Transformer-based approach is worth taking, instead of favouring other computationally lighter solutions.

2022

Principled Paraphrase Generation with Parallel Corpora
Aitor Ormazabal | Mikel Artetxe | Aitor Soroa | Gorka Labaka | Eneko Agirre
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metric that mitigates this issue by requiring the entire translation distribution to match, and implement a relaxation of it through the Information Bottleneck method. Our approach incorporates an adversarial term into MT training in order to learn representations that encode as much information about the reference translation as possible, while keeping as little information about the input as possible. Paraphrases can be generated by decoding back to the source from this representation, without having to generate pivot translations. In addition to being more principled and efficient than round-trip MT, our approach offers an adjustable parameter to control the fidelity-diversity trade-off, and obtains better results in our experiments.

Does Corpus Quality Really Matter for Low-Resource Languages?
Mikel Artetxe | Itziar Aldabe | Rodrigo Agerri | Olatz Perez-de-Viñaspre | Aitor Soroa
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream NLU tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is not primarily constrained by the quality of the data, and other factors like corpus size and domain coverage can play a more important role.

PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation
Aitor Ormazabal | Mikel Artetxe | Manex Agirrezabal | Aitor Soroa | Eneko Agirre
Findings of the Association for Computational Linguistics: EMNLP 2022

Formal verse poetry imposes strict constraints on the meter and rhyme scheme of poems. Most prior work on generating this type of poetry uses existing poems for supervision, which are difficult to obtain for most languages and poetic forms. In this work, we propose an unsupervised approach to generate poems that follow any given meter and rhyme scheme, without requiring any poetic text for training. Our method works by splitting a regular, non-poetic corpus into phrases, prepending control codes that describe the length and end rhyme of each phrase, and training a transformer language model in the augmented corpus. The transformer learns to link the structure descriptor with the control codes to the number of lines, their length and their end rhyme. During inference, we build control codes for the desired meter and rhyme scheme, and condition our language model on them to generate formal verse poetry. Experiments in Spanish and Basque show that our approach is able to generate valid poems, which are often comparable in quality to those written by humans.

BasqueGLUE: A Natural Language Understanding Benchmark for Basque
Gorka Urbizu | Iñaki San Vicente | Xabier Saralegi | Rodrigo Agerri | Aitor Soroa
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.

IrekiaLFes: a New Open Benchmark and Baseline Systems for Spanish Automatic Text Simplification
Itziar Gonzalez-Dios | Iker Gutiérrez-Fandiño | Oscar m. Cumbicus-Pineda | Aitor Soroa
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

Automatic Text simplification (ATS) seeks to reduce the complexity of a text for a general public or a target audience. In the last years, deep learning methods have become the most used systems in ATS research, but these systems need large and good quality datasets to be evaluated. Moreover, these data are available on a large scale only for English and in some cases with restrictive licenses. In this paper, we present IrekiaLF_es, an open-license benchmark for Spanish text simplification. It consists of a document-level corpus and a sentence-level test set that has been manually aligned. We also conduct a neurolinguistically-based evaluation of the corpus in order to reveal its suitability for text simplification. This evaluation follows the Lexicon-Unification-Linearity (LeULi) model of neurolinguistic complexity assessment. Finally, we present a set of experiments and baselines of ATS systems in a zero-shot scenario.

2021

Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring
Aitor Ormazabal | Mikel Artetxe | Aitor Soroa | Gorka Labaka | Eneko Agirre
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.

Ontology Population Reusing Resources for Dialogue Intent Detection: Generic and Multilingual Approach
Cristina Aceta | Izaskun Fernández | Aitor Soroa
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

This work presents a generic semi-automatic strategy to populate the domain ontology of an ontology-driven task-oriented dialogue system, with the aim of performing successful intent detection in the dialogue process, reusing already existing multilingual resources. This semi-automatic approach allows ontology engineers to exploit available resources so as to associate the potential situations in the use case to FrameNet frames and obtain the relevant lexical units associated to them in the target language, following lexical and semantic criteria, without linguistic expert knowledge. This strategy has been validated and evaluated in two use cases, from industrial scenarios, for interaction in Spanish with a guide robot and with a Computerized Maintenance Management System (CMMS). In both cases, this method has allowed the ontology engineer to instantiate the domain ontology with the intent-relevant information with quality data in a simple and low-resource-consuming manner.

A Syntax-Aware Edit-based System for Text Simplification
Oscar M. Cumbicus-Pineda | Itziar Gonzalez-Dios | Aitor Soroa
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Edit-based text simplification systems have attained much attention in recent years due to their ability to produce simplification solutions that are interpretable, as well as requiring less training examples compared to traditional seq2seq systems. Edit-based systems learn edit operations at a word level, but it is well known that many of the operations performed when simplifying text are of a syntactic nature. In this paper we propose to add syntactic information into a well known edit-based system. We extend the system with a graph convolutional network module that mimics the dependency structure of the sentence, thus giving the model an explicit representation of syntax. We perform a series of experiments in English, Spanish and Italian, and report improvements of the state of the art in four out of five datasets. Further analysis shows that syntactic information is always beneficial, and suggest that syntax is more helpful in complex sentences.

2020

DoQA - Accessing Domain-Specific FAQs via Conversational QA
Jon Ander Campos | Arantxa Otegi | Aitor Soroa | Jan Deriu | Mark Cieliebak | Eneko Agirre
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available in FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with less factoid questions and is multi-domain. In addition, we introduce a more realistic information retrieval (IR) scenario where the system needs to find the answer in any of the FAQ documents. The results of an existing, strong, system show that, thanks to transfer learning from a Wikipedia QA dataset and fine tuning on a single FAQ domain, it is possible to build high quality conversational QA systems for FAQs without in-domain training data. The good results carry over into the more challenging IR scenario. In both cases, there is still ample room for improvement, as indicated by the higher human upperbound.

Improving Conversational Question Answering Systems after Deployment using Feedback-Weighted Learning
Jon Ander Campos | Kyunghyun Cho | Arantxa Otegi | Aitor Soroa | Eneko Agirre | Gorka Azkune
Proceedings of the 28th International Conference on Computational Linguistics

The interaction of conversational systems with users poses an exciting opportunity for improving them after deployment, but little evidence has been provided of its feasibility. In most applications, users are not able to provide the correct answer to the system, but they are able to provide binary (correct, incorrect) feedback. In this paper we propose feedback-weighted learning based on importance sampling to improve upon an initial supervised system using binary user feedback. We perform simulated experiments on document classification (for development) and Conversational Question Answering datasets like QuAC and DoQA, where binary user feedback is derived from gold annotations. The results show that our method is able to improve over the initial supervised system, getting close to a fully-supervised system that has access to the same labeled examples in in-domain experiments (QuAC), and even matching in out-of-domain experiments (DoQA). Our work opens the prospect to exploit interactions with real users and improve conversational systems after deployment.

Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems
Jan Deriu | Don Tuggener | Pius von Däniken | Jon Ander Campos | Alvaro Rodrigo | Thiziri Belkacem | Aitor Soroa | Eneko Agirre | Mark Cieliebak
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The lack of time efficient and reliable evalu-ation methods is hampering the development of conversational dialogue systems (chat bots). Evaluations that require humans to converse with chat bots are time and cost intensive, put high cognitive demands on the human judges, and tend to yield low quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are humans participants in these conversations). These annotations then allow us to rank chat bots regarding their ability to mimic conversational behaviour of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chat bot is able to uphold human-like be-havior the longest, i.e.Survival Analysis. This metric has the ability to correlate a bot’s performance to certain of its characteristics (e.g.fluency or sensibleness), yielding interpretable results. The comparably low cost of our frame-work allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chat bots, and drawing comparisonsto related work. The framework is released asa ready-to-use tool.

Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque
Arantxa Otegi | Aitor Agirre | Jon Ander Campos | Aitor Soroa | Eneko Agirre
Proceedings of the Twelfth Language Resources and Evaluation Conference

Conversational Question Answering (CQA) systems meet user information needs by having conversations with them, where answers to the questions are retrieved from text. There exist a variety of datasets for English, with tens of thousands of training examples, and pre-trained language models have allowed to obtain impressive results. The goal of our research is to test the performance of CQA systems under low-resource conditions which are common for most non-English languages: small amounts of native annotations and other limitations linked to low resource languages, like lack of crowdworkers or smaller wikipedias. We focus on the Basque language, and present the first non-English CQA dataset and results. Our experiments show that it is possible to obtain good results with low amounts of native data thanks to cross-lingual transfer, with quality comparable to those obtained for English. We also discovered that dialogue history models are not directly transferable to another language, calling for further research. The dataset is publicly available.

Give your Text Representation Models some Love: the Case for Basque
Rodrigo Agerri | Iñaki San Vicente | Jon Ander Campos | Ander Barrena | Xabier Saralegi | Aitor Soroa | Eneko Agirre
Proceedings of the Twelfth Language Resources and Evaluation Conference

Word embeddings and pre-trained language models allow to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state-of-the-art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

Automatic Evaluation vs. User Preference in Neural Textual QuestionAnswering over COVID-19 Scientific Literature
Arantxa Otegi | Jon Ander Campos | Gorka Azkune | Aitor Soroa | Eneko Agirre
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

We present a Question Answering (QA) system that won one of the tasks of the Kaggle CORD-19 Challenge, according to the qualitative evaluation of experts. The system is a combination of an Information Retrieval module and a reading comprehension module that finds the answers in the retrieved passages. In this paper we present a quantitative and qualitative analysis of the system. The quantitative evaluation using manually annotated datasets contradicted some of our design choices, e.g. the fact that using QuAC for fine-tuning provided better answers over just using SQuAD. We analyzed this mismatch with an additional A/B test which showed that the system using QuAC was indeed preferred by users, confirming our intuition. Our analysis puts in question the suitability of automatic metrics and its correlation to user preferences. We also show that automatic metrics are highly dependent on the characteristics of the gold standard, such as the average length of the answers.

2019

Analyzing the Limitations of Cross-lingual Word Embedding Mappings
Aitor Ormazabal | Mikel Artetxe | Gorka Labaka | Aitor Soroa | Eneko Agirre
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. So as to answer this question, we experiment with parallel corpora, which allows us to compare offline mapping to an extension of skip-gram that jointly learns both embedding spaces. We observe that, under these ideal conditions, joint learning yields to more isomorphic embeddings, is less sensitive to hubness, and obtains stronger results in bilingual lexicon induction. We thus conclude that current mapping methods do have strong limitations, calling for further research to jointly learn cross-lingual embeddings with a weaker cross-lingual signal.

2018

Learning Text Representations for 500K Classification Tasks on Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the 22nd Conference on Computational Natural Language Learning

Named Entity Disambiguation algorithms typically learn a single model for all target entities. In this paper we present a word expert model and train separate deep learning models for each target entity string, yielding 500K classification tasks. This gives us the opportunity to benchmark popular text representation alternatives on this massive dataset. In order to face scarce training data we propose a simple data-augmentation technique and transfer-learning. We show that bag-of-word-embeddings are better than LSTMs for tasks with scarce training data, while the situation is reversed when having larger amounts. Transferring a LSTM which is learned on all datasets is the most effective context representation option for the word experts in all frequency bands. The experiments show that our system trained on out-of-domain Wikipedia data surpass comparable NED systems which have been trained on in-domain training data.

The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD
Eneko Agirre | Oier López de Lacalle | Aitor Soroa
Proceedings of Workshop for NLP Open Source Software (NLP-OSS)

UKB is an open source collection of programs for performing, among other tasks, Knowledge-Based Word Sense Disambiguation (WSD). Since it was released in 2009 it has been often used out-of-the-box in sub-optimal settings. We show that nine years later it is the state-of-the-art on knowledge-based WSD. This case shows the pitfalls of releasing open source NLP software without optimal default settings and precise instructions for reproducibility.

2016

Interoperability of Annotation Schemes: Using the Pepper Framework to Display AWA Documents in the ANNIS Interface
Talvany Carlotto | Zuhaitz Beloki | Xabier Artola | Aitor Soroa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Natural language processing applications are frequently integrated to solve complex linguistic problems, but the lack of interoperability between these tools tends to be one of the main issues found in that process. That is often caused by the different linguistic formats used across the applications, which leads to attempts to both establish standard formats to represent linguistic information and to create conversion tools to facilitate this integration. Pepper is an example of the latter, as a framework that helps the conversion between different linguistic annotation formats. In this paper, we describe the use of Pepper to convert a corpus linguistically annotated by the annotation scheme AWA into the relANNIS format, with the ultimate goal of interacting with AWA documents through the ANNIS interface. The experiment converted 40 megabytes of AWA documents, allowed their use on the ANNIS interface, and involved making architectural decisions during the mapping from AWA into relANNIS using Pepper. The main issues faced during this process were due to technical issues mainly caused by the integration of the different systems and projects, namely AWA, Pepper and ANNIS.

Two Architectures for Parallel Processing of Huge Amounts of Text
Mathijs Kattenberg | Zuhaitz Beloki | Aitor Soroa | Xabier Artola | Antske Fokkens | Paul Huygen | Kees Verstoep
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents two alternative NLP architectures to analyze massive amounts of documents, using parallel processing. The two architectures focus on different processing scenarios, namely batch-processing and streaming processing. The batch-processing scenario aims at optimizing the overall throughput of the system, i.e., minimizing the overall time spent on processing all documents. The streaming architecture aims to minimize the time to process real-time incoming documents and is therefore especially suitable for live feeds. The paper presents experiments with both architectures, and reports the overall gain when they are used for batch as well as for streaming processing. All the software described in the paper is publicly available under free licenses.

Alleviating Poor Context with Background Knowledge for Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Random Walks and Neural Network Language Models on Knowledge Bases
Josu Goikoetxea | Aitor Soroa | Eneko Agirre
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Improving distant supervision using inference learning
Roland Roller | Eneko Agirre | Aitor Soroa | Mark Stevenson
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Combining Mention Context and Hyperlinks from Wikipedia for Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

2014

“One Entity per Discourse” and “One Entity per Collocation” Improve Named-Entity Disambiguation
Ander Barrena | Eneko Agirre | Bernardo Cabaleiro | Anselmo Peñas | Aitor Soroa
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Random Walks for Knowledge-Based Word Sense Disambiguation
Eneko Agirre | Oier López de Lacalle | Aitor Soroa
Computational Linguistics, Volume 40, Issue 1 - March 2014

A stream computing approach towards scalable NLP
Xabier Artola | Zuhaitz Beloki | Aitor Soroa
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Computational power needs have grown dramatically in recent years. This is also the case in many language processing tasks, due to overwhelming quantities of textual information that must be processed in a reasonable time frame. This scenario has led to a paradigm shift in the computing architectures and large-scale data processing strategies used in the NLP field. In this paper we describe a series of experiments carried out in the context of the NewsReader project with the goal of analyzing the scaling capabilities of the language processing pipeline used in it. We explore the use of Storm in a new approach for scalable distributed language processing across multiple machines and evaluate its effectiveness and efficiency when processing documents on a medium and large scale. The experiments have shown that there is a big room for improvement regarding language processing performance when adopting parallel architectures, and that we might expect even better results with the use of large clusters with many processing nodes.

Exploring the use of word embeddings and random walks on Wikipedia for the CogAlex shared task
Josu Goikoetxea | Eneko Agirre | Aitor Soroa
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

2013

PATHS: A System for Accessing Cultural Heritage Collections
Eneko Agirre | Nikolaos Aletras | Paul Clough | Samuel Fernando | Paula Goodale | Mark Hall | Aitor Soroa | Mark Stevenson
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

Comparing Taxonomies for Organising Collections of Documents
Samuel Fernando | Mark Hall | Eneko Agirre | Aitor Soroa | Paul Clough | Mark Stevenson
Proceedings of COLING 2012

Matching Cultural Heritage items to Wikipedia
Eneko Agirre | Ander Barrena | Oier Lopez de Lacalle | Aitor Soroa | Samuel Fernando | Mark Stevenson
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Digitised Cultural Heritage (CH) items usually have short descriptions and lack rich contextual information. Wikipedia articles, on the contrary, include in-depth descriptions and links to related articles, which motivate the enrichment of CH items with information from Wikipedia. In this paper we explore the feasibility of finding matching articles in Wikipedia for a given Cultural Heritage item. We manually annotated a random sample of items from Europeana, and performed a qualitative and quantitative study of the issues and problems that arise, showing that each kind of CH item is different and needs a nuanced definition of what ``matching article'' means. In addition, we test a well-known wikification (aka entity linking) algorithm on the task. Our results indicate that a substantial number of items can be effectively linked to their corresponding Wikipedia article.

Enabling the Discovery of Digital Cultural Heritage Objects through Wikipedia
Mark Michael Hall | Oier Lopez de Lacalle | Aitor Soroa Etxabe | Paul Clough | Eneko Agirre
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2010

Exploring Knowledge Bases for Similarity
Eneko Agirre | Montse Cuadros | German Rigau | Aitor Soroa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Graph-based similarity over WordNet has been previously shown to perform very well on word similarity. This paper presents a study of the performance of such a graph-based algorithm when using different relations and versions of Wordnet. The graph algorithm is based on Personalized PageRank, a random-walk based algorithm which computes the probability of a random-walk initiated in the target word to reach any synset following the relations in WordNet (Haveliwala, 2002). Similarity is computed as the cosine of the probability distributions for each word over WordNet. The best combination of relations includes all relations in WordNet 3.0, included disambiguated glosses, and automatically disambiguated topic signatures called KnowNets. All relations are part of the official release of WordNet, except KnowNets, which have been derived automatically. The results over the WordSim 353 dataset show that using the adequate relations the performance improves over previously published WordNet-based results on the WordSim353 dataset (Finkelstein et al., 2002). The similarity software and some graphs used in this paper are publicly available at http://ixa2.si.ehu.es/ukb.

Kyoto: An Integrated System for Specific Domain WSD
Aitor Soroa | Eneko Agirre | Oier Lopez de Lacalle | Wauter Bosma | Piek Vossen | Monica Monachini | Jessie Lo | Shu-Kai Hsieh
Proceedings of the 5th International Workshop on Semantic Evaluation

KYOTO: an open platform for mining facts
Piek Vossen | German Rigau | Eneko Agirre | Aitor Soroa | Monica Monachini | Roberto Bartolini
Proceedings of the 6th Workshop on Ontologies and Lexical Resources

2009

Personalizing PageRank for Word Sense Disambiguation
Eneko Agirre | Aitor Soroa
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches
Eneko Agirre | Enrique Alfonseca | Keith Hall | Jana Kravalova | Marius Paşca | Aitor Soroa
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

WikiWalk: Random walks on Wikipedia for Semantic Relatedness
Eric Yeh | Daniel Ramage | Christopher D. Manning | Eneko Agirre | Aitor Soroa
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4)

2008

Using the Multilingual Central Repository for Graph-Based Word Sense Disambiguation
Eneko Agirre | Aitor Soroa
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the results of a graph-based method for performing knowledge-based Word Sense Disambiguation (WSD). The technique exploits the structural properties of the graph underlying the chosen knowledge base. The method is general, in the sense that it is not tied to any particular knowledge base, but in this work we have applied it to the Multilingual Central Repository (MCR). The evaluation has been performed on the Senseval-3 all-words task. The main contributions of the paper are twofold: (1) We have evaluated the separate and combined performance of each type of relation in the MCR, and thus indirectly validated the contents of the MCR and their potential for WSD. (2) We obtain state-of-the-art results, and in fact yield the best results that can be obtained using publicly available data.

Spelling Correction: from Two-Level Morphology to Open Source
Iñaki Alegria | Klara Ceberio | Nerea Ezeiza | Aitor Soroa | Gregorio Hernandez
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of languages and there are two-level based descriptions for very different languages. After doing the morphological description for a language, it is easy to develop a spelling checker/corrector for this language. However, what happens if we want to use the speller in the free world (OpenOffice, Mozilla, emacs, LaTeX, etc.)? Ispell and similar tools (aspell, hunspell, myspell) are the usual mechanisms for these purposes, but they do not fit the two-level model. In the absence of two-level morphology based mechanisms, an automatic conversion from two-level description to hunspell is described in this paper.

2007

SemEval-2007 Task 02: Evaluating Word Sense Induction and Discrimination Systems
Eneko Agirre | Aitor Soroa
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

UBC-AS: A Graph Based Unsupervised System for Induction and Classification
Eneko Agirre | Aitor Soroa
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

Structure, Annotation and Tools in the Basque ZT Corpus
N. Areta | A. Gurrutxaga | I. Leturia | Z. Polin | R. Saiz | I. Alegria | X. Artola | A. Diaz de Ilarraza | N. Ezeiza | A. Sologaistoa | A. Soroa | A. Valverde
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The ZT corpus (Basque Corpus of Science and Technology) is a tagged collection of specialized texts in Basque, which wants to be a main resource in research and development about written technical Basque: terminology, syntax and style. It will be the first written corpus in Basque which will be distributed by ELDA (at the end of 2006) and it wants to be a methodological and functional reference for new projects in the future (i.e. a national corpus for Basque). We also present the technology and the tools to build this Corpus. These tools, Corpusgile and Eulia, provide a flexible and extensible infrastructure for creating, visualizing and managing corpora and for consulting, visualizing and modifying annotations generated by linguistic tools.

Two graph-based algorithms for state-of-the-art WSD
Eneko Agirre | David Martínez | Oier López de Lacalle | Aitor Soroa
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Evaluating and optimizing the parameters of an unsupervised graph-based WSD algorithm
Eneko Agirre | David Martínez | Oier López de Lacalle | Aitor Soroa
Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing

2002

A Class Library for the Integration of NLP Tools: Definition and implementation of an Abstract Data Type Collection for the manipulation of SGML documents in a context of stand-off linguistic annotation
X. Artola | A. Díaz de Ilarraza | N. Ezeiza | K. Gojenola | G. Hernández | A. Soroa
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

A Proposal for the Integration of NLP Tools using SGML-Tagged Documents
X. Artola | A. Díaz de Ilarraza | N. Ezeiza | K. Gojenola | A. Maritxalar | A. Soroa
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Co-authors

Ander Barrena 6

Jon Ander Campos 6

Aitor Ormazabal 5

Julen Etxaniz 4

Arantxa Otegi 4

Xabier Saralegi 4

Mark Stevenson 4

Itziar Aldabe 3

Zuhaitz Beloki 3

Arantza Díaz de Ilarraza 3

Samuel Fernando 3

Iñaki San Vicente 3

Irune Zubiaga 3

Iñaki Alegría 2

Jeremy Barnes 2

Helena Bonaldi 2

Mark Cieliebak 2

Oscar M. Cumbicus-Pineda 2

Jan Milan Deriu 2

Josu Goikoetxea 2

Koldo Gojenola 2

Itziar Gonzalez-Dios 2

Marco Guerini 2

Maite Heredia 2

Gregorio Hernández 2

David Martinez Iraola 2

Monica Monachini 2

Arturo Montejo-Ráez 2

Naiara Pérez 2

María Estrella Vallecillo-Rodríguez 2

Cristina Aceta 1

Manex Agirrezabal 1

Alham Fikri Aji 1

Nikolaos Aletras 1

Enrique Alfonseca 1

Zaid Alyafeai 1

Ekhi Azurmendi 1

Roberto Bartolini 1

Thiziri Belkacem 1

Stella Biderman 1

Bernardo Cabaleiro 1

Talvany Carlotto 1

Klara Ceberio 1

Kyunghyun Cho 1

Montse Cuadros 1

Francesco De Toni 1

Gérard Dupont 1

Chris Chinenye Emezue 1

Joseba Fernandez de Landa 1

Izaskun Fernández 1

Antske Fokkens 1

Iker García-Ferrero 1

Paula Goodale 1

Antton Gurrutxaga 1

Iker Gutiérrez-Fandiño 1

Shu-Kai Hsieh 1

Yacine Jernite 1

Mathijs Kattenberg 1

Nurulaqilla Khamis 1

Jana Kravalová 1

Christopher D. Manning 1

Alberto Maritxalar 1

M. Teresa Martín-Valdivia 1

María-Teresa Martín-Valdivia 1

Maraim Masoud 1

Angelina McMillan-Major 1

Pedro Ortiz Suarez 1

Olatz Perez-de-Vinaspre 1

Anselmo Peñas 1

Ziortza Polin 1

Daniel Ramage 1

Álvaro Rodrigo 1

Roland Roller 1

Aitor Sologaistoa 1

Andoni Valverde 1

Daniel Van Strien 1

Kees Verstoep 1

Pius Von Däniken 1

Muitze Zulaika 1

Venues