Eneko Agirre

Also published as: E. Agirre


2021

pdf bib
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Ionut-Teodor Sorodoc | Madhumita Sushil | Ece Takmaz | Eneko Agirre
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf bib
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures
Eneko Agirre | Marianna Apidianaki | Ivan Vulić
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

pdf bib
Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring
Aitor Ormazabal | Mikel Artetxe | Aitor Soroa | Gorka Labaka | Eneko Agirre
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recent research on cross-lingual word embeddings has been dominated by unsupervised mapping approaches that align monolingual embeddings. Such methods critically rely on those embeddings having a similar structure, but it was recently shown that the separate training in different languages causes departures from this assumption. In this paper, we propose an alternative approach that does not have this limitation, while requiring a weak seed dictionary (e.g., a list of identical words) as the only form of supervision. Rather than aligning two fixed embedding spaces, our method works by fixing the target language embeddings, and learning a new set of embeddings for the source language that are aligned with them. To that end, we use an extension of skip-gram that leverages translated context words as anchor points, and incorporates self-learning and iterative restarts to reduce the dependency on the initial dictionary. Our approach outperforms conventional mapping methods on bilingual lexicon induction, and obtains competitive results in the downstream XNLI task.

2020

pdf bib
Improving Conversational Question Answering Systems after Deployment using Feedback-Weighted Learning
Jon Ander Campos | Kyunghyun Cho | Arantxa Otegi | Aitor Soroa | Eneko Agirre | Gorka Azkune
Proceedings of the 28th International Conference on Computational Linguistics

The interaction of conversational systems with users poses an exciting opportunity for improving them after deployment, but little evidence has been provided of its feasibility. In most applications, users are not able to provide the correct answer to the system, but they are able to provide binary (correct, incorrect) feedback. In this paper we propose feedback-weighted learning based on importance sampling to improve upon an initial supervised system using binary user feedback. We perform simulated experiments on document classification (for development) and Conversational Question Answering datasets like QuAC and DoQA, where binary user feedback is derived from gold annotations. The results show that our method is able to improve over the initial supervised system, getting close to a fully-supervised system that has access to the same labeled examples in in-domain experiments (QuAC), and even matching in out-of-domain experiments (DoQA). Our work opens the prospect to exploit interactions with real users and improve conversational systems after deployment.

pdf bib
Conversational Question Answering in Low Resource Scenarios: A Dataset and Case Study for Basque
Arantxa Otegi | Aitor Agirre | Jon Ander Campos | Aitor Soroa | Eneko Agirre
Proceedings of the 12th Language Resources and Evaluation Conference

Conversational Question Answering (CQA) systems meet user information needs by having conversations with them, where answers to the questions are retrieved from text. There exist a variety of datasets for English, with tens of thousands of training examples, and pre-trained language models have allowed to obtain impressive results. The goal of our research is to test the performance of CQA systems under low-resource conditions which are common for most non-English languages: small amounts of native annotations and other limitations linked to low resource languages, like lack of crowdworkers or smaller wikipedias. We focus on the Basque language, and present the first non-English CQA dataset and results. Our experiments show that it is possible to obtain good results with low amounts of native data thanks to cross-lingual transfer, with quality comparable to those obtained for English. We also discovered that dialogue history models are not directly transferable to another language, calling for further research. The dataset is publicly available.

pdf bib
Give your Text Representation Models some Love: the Case for Basque
Rodrigo Agerri | Iñaki San Vicente | Jon Ander Campos | Ander Barrena | Xabier Saralegi | Aitor Soroa | Eneko Agirre
Proceedings of the 12th Language Resources and Evaluation Conference

Word embeddings and pre-trained language models allow to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their own. This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora. In addition, monolingual pre-trained models for non-English languages are not always available. At best, models for those languages are included in multilingual versions, where each language shares the quota of substrings and parameters with the rest of the languages. This is particularly true for smaller languages such as Basque. In this paper we show that a number of monolingual models (FastText word embeddings, FLAIR and BERT language models) trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks, including topic classification, sentiment classification, PoS tagging and NER. This work sets a new state-of-the-art in those tasks for Basque. All benchmarks and models used in this work are publicly available.

pdf bib
A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation
Jan Deriu | Katsiaryna Mlynchyk | Philippe Schläpfer | Alvaro Rodrigo | Dirk von Grünigen | Nicolas Kaiser | Kurt Stockinger | Eneko Agirre | Mark Cieliebak
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we introduce a novel methodology to efficiently construct a corpus for question answering over structured data. For this, we introduce an intermediate representation that is based on the logical query plan in a database, called Operation Trees (OT). This representation allows us to invert the annotation process without loosing flexibility in the types of queries that we generate. Furthermore, it allows for fine-grained alignment of the tokens to the operations. Thus, we randomly generate OTs from a context free grammar and annotators just have to write the appropriate question and assign the tokens. We compare our corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus for evaluating natural language interfaces to databases, to Spider and LC-QuaD 2.0 and show that our methodology more than triples the annotation speed while maintaining the complexity of the queries. Finally, we train a state-of-the-art semantic parsing model on our data and show that our dataset is a challenging dataset and that the token alignment can be leveraged to significantly increase the performance.

pdf bib
DoQA - Accessing Domain-Specific FAQs via Conversational QA
Jon Ander Campos | Arantxa Otegi | Aitor Soroa | Jan Deriu | Mark Cieliebak | Eneko Agirre
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The goal of this work is to build conversational Question Answering (QA) interfaces for the large body of domain-specific information available in FAQ sites. We present DoQA, a dataset with 2,437 dialogues and 10,917 QA pairs. The dialogues are collected from three Stack Exchange sites using the Wizard of Oz method with crowdsourcing. Compared to previous work, DoQA comprises well-defined information needs, leading to more coherent and natural conversations with less factoid questions and is multi-domain. In addition, we introduce a more realistic information retrieval (IR) scenario where the system needs to find the answer in any of the FAQ documents. The results of an existing, strong, system show that, thanks to transfer learning from a Wikipedia QA dataset and fine tuning on a single FAQ domain, it is possible to build high quality conversational QA systems for FAQs without in-domain training data. The good results carry over into the more challenging IR scenario. In both cases, there is still ample room for improvement, as indicated by the higher human upperbound.

pdf bib
A Call for More Rigor in Unsupervised Cross-lingual Learning
Mikel Artetxe | Sebastian Ruder | Dani Yogatama | Gorka Labaka | Eneko Agirre
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world’s languages. However, we argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice. We also discuss different training signals that have been used in previous work, which depart from the pure unsupervised setting. We then describe common methodological issues in tuning and evaluation of unsupervised cross-lingual models and present best practices. Finally, we provide a unified outlook for different types of research in this area (i.e., cross-lingual word embeddings, deep multilingual pretraining, and unsupervised machine translation) and argue for comparable evaluation of these models.

pdf bib
Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining
Ivana Kvapilíková | Mikel Artetxe | Gorka Labaka | Eneko Agirre | Ondřej Bojar
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus is able to improve results for other language pairs.

pdf bib
Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems
Jan Deriu | Don Tuggener | Pius von Däniken | Jon Ander Campos | Alvaro Rodrigo | Thiziri Belkacem | Aitor Soroa | Eneko Agirre | Mark Cieliebak
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The lack of time efficient and reliable evalu-ation methods is hampering the development of conversational dialogue systems (chat bots). Evaluations that require humans to converse with chat bots are time and cost intensive, put high cognitive demands on the human judges, and tend to yield low quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are humans participants in these conversations). These annotations then allow us to rank chat bots regarding their ability to mimic conversational behaviour of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chat bot is able to uphold human-like be-havior the longest, i.e.Survival Analysis. This metric has the ability to correlate a bot’s performance to certain of its characteristics (e.g.fluency or sensibleness), yielding interpretable results. The comparably low cost of our frame-work allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chat bots, and drawing comparisonsto related work. The framework is released asa ready-to-use tool.

pdf bib
Translation Artifacts in Cross-lingual Transfer Learning
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Both human and machine translation play a central role in cross-lingual transfer learning: many multilingual datasets have been created through professional translation services, and using machine translation to translate either the test set or the training set is a widely used transfer technique. In this paper, we show that such translation process can introduce subtle artifacts that have a notable impact in existing cross-lingual models. For instance, in natural language inference, translating the premise and the hypothesis independently can reduce the lexical overlap between them, which current models are highly sensitive to. We show that some previous findings in cross-lingual transfer learning need to be reconsidered in the light of this phenomenon. Based on the gained insights, we also improve the state-of-the-art in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points, respectively.

pdf bib
Automatic Evaluation vs. User Preference in Neural Textual QuestionAnswering over COVID-19 Scientific Literature
Arantxa Otegi | Jon Ander Campos | Gorka Azkune | Aitor Soroa | Eneko Agirre
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

We present a Question Answering (QA) system that won one of the tasks of the Kaggle CORD-19 Challenge, according to the qualitative evaluation of experts. The system is a combination of an Information Retrieval module and a reading comprehension module that finds the answers in the retrieved passages. In this paper we present a quantitative and qualitative analysis of the system. The quantitative evaluation using manually annotated datasets contradicted some of our design choices, e.g. the fact that using QuAC for fine-tuning provided better answers over just using SQuAD. We analyzed this mismatch with an additional A/B test which showed that the system using QuAC was indeed preferred by users, confirming our intuition. Our analysis puts in question the suitability of automatic metrics and its correlation to user preferences. We also show that automatic metrics are highly dependent on the characteristics of the gold standard, such as the average length of the answers.

pdf bib
Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures
Eneko Agirre | Marianna Apidianaki | Ivan Vulić
Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

2019

pdf bib
An Effective Approach to Unsupervised Machine Translation
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

While machine translation has traditionally relied on large amounts of parallel corpora, a recent research line has managed to train both Neural Machine Translation (NMT) and Statistical Machine Translation (SMT) systems using monolingual corpora only. In this paper, we identify and address several deficiencies of existing unsupervised SMT approaches by exploiting subword information, developing a theoretically well founded unsupervised tuning method, and incorporating a joint refinement procedure. Moreover, we use our improved SMT system to initialize a dual NMT model, which is further fine-tuned through on-the-fly back-translation. Together, we obtain large improvements over the previous state-of-the-art in unsupervised machine translation. For instance, we get 22.5 BLEU points in English-to-German WMT 2014, 5.5 points more than the previous best unsupervised system, and 0.5 points more than the (supervised) shared task winner back in 2014.

pdf bib
Analyzing the Limitations of Cross-lingual Word Embedding Mappings
Aitor Ormazabal | Mikel Artetxe | Gorka Labaka | Aitor Soroa | Eneko Agirre
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. So as to answer this question, we experiment with parallel corpora, which allows us to compare offline mapping to an extension of skip-gram that jointly learns both embedding spaces. We observe that, under these ideal conditions, joint learning yields to more isomorphic embeddings, is less sensitive to hubness, and obtains stronger results in bilingual lexicon induction. We thus conclude that current mapping methods do have strong limitations, calling for further research to jointly learn cross-lingual embeddings with a weaker cross-lingual signal.

pdf bib
Bilingual Lexicon Induction through Unsupervised Machine Translation
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A recent research line has obtained strong results on bilingual lexicon induction by aligning independently trained word embeddings in two languages and using the resulting cross-lingual embeddings to induce word translation pairs through nearest neighbor or related retrieval methods. In this paper, we propose an alternative approach to this problem that builds on the recent work on unsupervised machine translation. This way, instead of directly inducing a bilingual lexicon from cross-lingual embeddings, we use them to build a phrase-table, combine it with a language model, and use the resulting machine translation system to generate a synthetic parallel corpus, from which we extract the bilingual lexicon using statistical word alignment techniques. As such, our method can work with any word embedding and cross-lingual mapping technique, and it does not require any additional resource besides the monolingual corpus used to train the embeddings. When evaluated on the exact same cross-lingual embeddings, our proposed method obtains an average improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS retrieval, establishing a new state-of-the-art in the standard MUSE dataset.

pdf bib
Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings
Yadollah Yaghoobzadeh | Katharina Kann | T. J. Hazen | Eneko Agirre | Hinrich Schütze
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Word embeddings typically represent different meanings of a word in a single conflated vector. Empirical analysis of embeddings of ambiguous words is currently limited by the small size of manually annotated resources and by the fact that word senses are treated as unrelated individual concepts. We present a large dataset based on manual Wikipedia annotations and word senses, where word senses from different words are related by semantic classes. This is the basis for novel diagnostic tests for an embedding’s content: we probe word embeddings for semantic classes and analyze the embedding space by classifying embeddings into semantic classes. Our main findings are: (i) Information about a sense is generally represented well in a single-vector embedding – if the sense is frequent. (ii) A classifier can accurately predict whether a word is single-sense or multi-sense, based only on its embedding. (iii) Although rare senses are not well represented in single-vector embeddings, this does not have negative impact on an NLP application whose performance depends on frequent senses.

2018

pdf bib
Learning Text Representations for 500K Classification Tasks on Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the 22nd Conference on Computational Natural Language Learning

Named Entity Disambiguation algorithms typically learn a single model for all target entities. In this paper we present a word expert model and train separate deep learning models for each target entity string, yielding 500K classification tasks. This gives us the opportunity to benchmark popular text representation alternatives on this massive dataset. In order to face scarce training data we propose a simple data-augmentation technique and transfer-learning. We show that bag-of-word-embeddings are better than LSTMs for tasks with scarce training data, while the situation is reversed when having larger amounts. Transferring a LSTM which is learned on all datasets is the most effective context representation option for the word experts in all frequency bands. The experiments show that our system trained on out-of-domain Wikipedia data surpass comparable NED systems which have been trained on in-domain training data.

pdf bib
Uncovering Divergent Linguistic Information in Word Embeddings with Lessons for Intrinsic and Extrinsic Evaluation
Mikel Artetxe | Gorka Labaka | Iñigo Lopez-Gazpio | Eneko Agirre
Proceedings of the 22nd Conference on Computational Natural Language Learning

Following the recent success of word embeddings, it has been argued that there is no such thing as an ideal representation for words, as different models tend to capture divergent and often mutually incompatible aspects like semantics/syntax and similarity/relatedness. In this paper, we show that each embedding model captures more information than directly apparent. A linear transformation that adjusts the similarity order of the model without any external resource can tailor it to achieve better results in those aspects, providing a new perspective on how embeddings encode divergent linguistic information. In addition, we explore the relation between intrinsic and extrinsic evaluation, as the effect of our transformations in downstream tasks is higher for unsupervised systems than for supervised ones.

pdf bib
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution. Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems. Our implementation is released as an open source project at https://github.com/artetxem/vecmap.

pdf bib
The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD
Eneko Agirre | Oier López de Lacalle | Aitor Soroa
Proceedings of Workshop for NLP Open Source Software (NLP-OSS)

UKB is an open source collection of programs for performing, among other tasks, Knowledge-Based Word Sense Disambiguation (WSD). Since it was released in 2009 it has been often used out-of-the-box in sub-optimal settings. We show that nine years later it is the state-of-the-art on knowledge-based WSD. This case shows the pitfalls of releasing open source NLP software without optimal default settings and precise instructions for reproducibility.

pdf bib
Unsupervised Statistical Machine Translation
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method profits from the modular architecture of SMT: we first induce a phrase table from monolingual corpora through cross-lingual embedding mappings, combine it with an n-gram language model, and fine-tune hyperparameters through an unsupervised MERT variant. In addition, iterative backtranslation improves results further, yielding, for instance, 14.08 and 26.22 BLEU points in WMT 2014 English-German and English-French, respectively, an improvement of more than 7-10 BLEU points over previous unsupervised systems, and closing the gap with supervised SMT (Moses trained on Europarl) down to 2-5 BLEU points. Our implementation is available at https://github.com/artetxem/monoses.

2017

pdf bib
SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation
Daniel Cer | Mona Diab | Eneko Agirre | Iñigo Lopez-Gazpio | Lucia Specia
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).

pdf bib
Learning bilingual word embeddings with (almost) no bilingual data
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most methods to learn bilingual word embeddings rely on large parallel corpora, which is difficult to obtain for most language pairs. This has motivated an active research line to relax this requirement, with methods that use document-aligned corpora or bilingual dictionaries of a few thousand words instead. In this work, we further reduce the need of bilingual resources using a very simple self-learning approach that can be combined with any dictionary-based mapping technique. Our method exploits the structural similarity of embedding spaces, and works with as little bilingual evidence as a 25 word dictionary or even an automatically generated list of numerals, obtaining results comparable to those of systems that use richer resources.

2016

pdf bib
Learning principled bilingual mappings of word embeddings while preserving monolingual invariance
Mikel Artetxe | Gorka Labaka | Eneko Agirre
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Alleviating Poor Context with Background Knowledge for Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Evaluating Translation Quality and CLIR Performance of Query Sessions
Xabier Saralegi | Eneko Agirre | Iñaki Alegria
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the evaluation of the translation quality and Cross-Lingual Information Retrieval (CLIR) performance when using session information as the context of queries. The hypothesis is that previous queries provide context that helps to solve ambiguous translations in the current query. We tested several strategies on the TREC 2010 Session track dataset, which includes query reformulations grouped by generalization, specification, and drifting types. We study the Basque to English direction, evaluating both the translation quality and CLIR performance, with positive results in both cases. The results show that the quality of translation improved, reducing error rate by 12% (HTER) when using session information, which improved CLIR results 5% (nDCG). We also provide an analysis of the improvements across the three kinds of sessions: generalization, specification, and drifting. Translation quality improved in all three types (generalization, specification, and drifting), and CLIR improved for generalization and specification sessions, preserving the performance in drifting sessions.

pdf bib
A comparison of Named-Entity Disambiguation and Word Sense Disambiguation
Angel Chang | Valentin I. Spitkovsky | Christopher D. Manning | Eneko Agirre
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Named Entity Disambiguation (NED) is the task of linking a named-entity mention to an instance in a knowledge-base, typically Wikipedia-derived resources like DBpedia. This task is closely related to word-sense disambiguation (WSD), where the mention of an open-class word is linked to a concept in a knowledge-base, typically WordNet. This paper analyzes the relation between two annotated datasets on NED and WSD, highlighting the commonalities and differences. We detail the methods to construct a NED system following the WSD word-expert approach, where we need a dictionary and one classifier is built for each target entity mention string. Constructing a dictionary for NED proved challenging, and although similarity and ambiguity are higher for NED, the results are also higher due to the larger number of training data, and the more crisp and skewed meaning differences.

pdf bib
Addressing the MFS Bias in WSD systems
Marten Postma | Ruben Izquierdo | Eneko Agirre | German Rigau | Piek Vossen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Word Sense Disambiguation (WSD) systems tend to have a strong bias towards assigning the Most Frequent Sense (MFS), which results in high performance on the MFS but in a very low performance on the less frequent senses. We addressed the MFS bias in WSD systems by combining the output from a WSD system with a set of mostly static features to create a MFS classifier to decide when to and not to choose the MFS. The output from this MFS classifier, which is based on the Random Forest algorithm, is then used to modify the output from the original WSD system. We applied our classifier to one of the state-of-the-art supervised WSD systems, i.e. IMS, and to of the best state-of-the-art unsupervised WSD systems, i.e. UKB. Our main finding is that we are able to improve the system output in terms of choosing between the MFS and the less frequent senses. When we apply the MFS classifier to fine-grained WSD, we observe an improvement on the less frequent sense cases, whereas we maintain the overall recall.

pdf bib
Word Sense-Aware Machine Translation: Including Senses as Contextual Features for Improved Translation Models
Steven Neale | Luís Gomes | Eneko Agirre | Oier Lopez de Lacalle | António Branco
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Although it is commonly assumed that word sense disambiguation (WSD) should help to improve lexical choice and improve the quality of machine translation systems, how to successfully integrate word senses into such systems remains an unanswered question. Some successful approaches have involved reformulating either WSD or the word senses it produces, but work on using traditional word senses to improve machine translation have met with limited success. In this paper, we build upon previous work that experimented on including word senses as contextual features in maxent-based translation models. Training on a large, open-domain corpus (Europarl), we demonstrate that this aproach yields significant improvements in machine translation from English to Portuguese.

pdf bib
QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages
Arantxa Otegi | Nora Aranberri | Antonio Branco | Jan Hajič | Martin Popel | Kiril Simov | Eneko Agirre | Petya Osenova | Rita Pereira | João Silva | Steven Neale
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This work presents parallel corpora automatically annotated with several NLP tools, including lemma and part-of-speech tagging, named-entity recognition and classification, named-entity disambiguation, word-sense disambiguation, and coreference. The corpora comprise both the well-known Europarl corpus and a domain-specific question-answer troubleshooting corpus on the IT domain. English is common in all parallel corpora, with translations in five languages, namely, Basque, Bulgarian, Czech, Portuguese and Spanish. We describe the annotated corpora and the tools used for annotation, as well as annotation statistics for each language. These new resources are freely available and will help research on semantic processing for machine translation and cross-lingual transfer.

pdf bib
Proceedings of the 2nd Workshop on Semantics-Driven Machine Translation (SedMT 2016)
Deyi Xiong | Kevin Duh | Eneko Agirre | Nora Aranberri | Houfeng Wang
Proceedings of the 2nd Workshop on Semantics-Driven Machine Translation (SedMT 2016)

pdf bib
SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task
Rosa Gaudio | Gorka Labaka | Eneko Agirre | Petya Osenova | Kiril Simov | Martin Popel | Dieke Oele | Gertjan van Noord | Luís Gomes | João António Rodrigues | Steven Neale | João Silva | Andreia Querido | Nuno Rendeiro | António Branco
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib
Adding syntactic structure to bilingual terminology for improved domain adaptation
Mikel Artetxe | Gorka Labaka | Chakaveh Saedi | João Rodrigues | João Silva | António Branco | Eneko Agirre
Proceedings of the 2nd Deep Machine Translation Workshop

pdf bib
Improving Translation Selection with Supersenses
Haiqing Tang | Deyi Xiong | Oier Lopez de Lacalle | Eneko Agirre
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Selecting appropriate translations for source words with multiple meanings still remains a challenge for statistical machine translation (SMT). One reason for this is that most SMT systems are not good at detecting the proper sense for a polysemic word when it appears in different contexts. In this paper, we adopt a supersense tagging method to annotate source words with coarse-grained ontological concepts. In order to enable the system to choose an appropriate translation for a word or phrase according to the annotated supersense of the word or phrase, we propose two translation models with supersense knowledge: a maximum entropy based model and a supersense embedding model. The effectiveness of our proposed models is validated on a large-scale English-to-Spanish translation task. Results indicate that our method can significantly improve translation quality via correctly conveying the meaning of the source language to the target language.

pdf bib
SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation
Eneko Agirre | Carmen Banea | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Rada Mihalcea | German Rigau | Janyce Wiebe
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
SemEval-2016 Task 2: Interpretable Semantic Textual Similarity
Eneko Agirre | Aitor Gonzalez-Agirre | Iñigo Lopez-Gazpio | Montse Maritxalar | German Rigau | Larraitz Uria
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib
iUBC at SemEval-2016 Task 2: RNNs and LSTMs for interpretable STS
Iñigo Lopez-Gazpio | Eneko Agirre | Montse Maritxalar
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Crowdsourced Word Sense Annotations and Difficult Words and Examples
Oier Lopez de Lacalle | Eneko Agirre
Proceedings of the 11th International Conference on Computational Semantics

pdf bib
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation
Dekai Wu | Marine Carpuat | Eneko Agirre | Nora Aranberri
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Analyzing English-Spanish Named-Entity enhanced Machine Translation
Mikel Artetxe | Eneko Agirre | Inaki Alegria | Gorka Labaka
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Predicting word sense annotation agreement
Héctor Martínez Alonso | Anders Johannsen | Oier Lopez de Lacalle | Eneko Agirre
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

pdf bib
Deep-syntax TectoMT for English-Spanish MT
Gorka Labaka | Oneka Jauregi | Arantza Díaz de Ilarraza | Michael Ustaszewski | Nora Aranberri | Eneko Agirre
Proceedings of the 1st Deep Machine Translation Workshop

pdf bib
A Methodology for Word Sense Disambiguation at 90% based on large-scale CrowdSourcing
Oier Lopez de Lacalle | Eneko Agirre
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib
Combining Mention Context and Hyperlinks from Wikipedia for Named Entity Disambiguation
Ander Barrena | Aitor Soroa | Eneko Agirre
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib
UBC: Cubes for English Semantic Textual Similarity and Supervised Approaches for Interpretable STS
Eneko Agirre | Aitor Gonzalez-Agirre | Iñigo Lopez-Gazpio | Montse Maritxalar | German Rigau | Larraitz Uria
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability
Eneko Agirre | Carmen Banea | Claire Cardie | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Weiwei Guo | Iñigo Lopez-Gazpio | Montse Maritxalar | Rada Mihalcea | German Rigau | Larraitz Uria | Janyce Wiebe
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
SemEval-2015 Task 4: TimeLine: Cross-Document Event Ordering
Anne-Lyse Minard | Manuela Speranza | Eneko Agirre | Itziar Aldabe | Marieke van Erp | Bernardo Magnini | German Rigau | Rubén Urizar
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
Improving distant supervision using inference learning
Roland Roller | Eneko Agirre | Aitor Soroa | Mark Stevenson
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Tutorial Abstracts
Eneko Agirre | Kevin Duh
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Tutorial Abstracts

pdf bib
Diamonds in the Rough: Event Extraction from Imperfect Microblog Data
Ander Intxaurrondo | Eneko Agirre | Oier Lopez de Lacalle | Mihai Surdeanu
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Random Walks and Neural Network Language Models on Knowledge Bases
Josu Goikoetxea | Aitor Soroa | Eneko Agirre
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf bib
Random Walks for Knowledge-Based Word Sense Disambiguation
Eneko Agirre | Oier López de Lacalle | Aitor Soroa
Computational Linguistics, Volume 40, Issue 1 - March 2014

pdf bib
Exploring the use of word embeddings and random walks on Wikipedia for the CogAlex shared task
Josu Goikoetxea | Eneko Agirre | Aitor Soroa
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

pdf bib
“One Entity per Discourse” and “One Entity per Collocation” Improve Named-Entity Disambiguation
Ander Barrena | Eneko Agirre | Bernardo Cabaleiro | Anselmo Peñas | Aitor Soroa
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
SemEval-2014 Task 10: Multilingual Semantic Textual Similarity
Eneko Agirre | Carmen Banea | Claire Cardie | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Weiwei Guo | Rada Mihalcea | German Rigau | Janyce Wiebe
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib
On WordNet Semantic Classes and Dependency Parsing
Kepa Bengoetxea | Eneko Agirre | Joakim Nivre | Yue Zhang | Koldo Gojenola
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

pdf bib
Generating Paths through Cultural Heritage Collections
Samuel Fernando | Paula Goodale | Paul Clough | Mark Stevenson | Mark Hall | Eneko Agirre
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
Text Understanding using Knowledge-Bases and Random Walks
Eneko Agirre
Proceedings of the Joint Symposium on Semantic Processing. Textual Inference and Structures in Corpora

pdf bib
PATHS: A System for Accessing Cultural Heritage Collections
Eneko Agirre | Nikolaos Aletras | Paul Clough | Samuel Fernando | Paula Goodale | Mark Hall | Aitor Soroa | Mark Stevenson
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Selectional Preferences for Semantic Role Classification
Beñat Zapirain | Eneko Agirre | Lluís Màrquez | Mihai Surdeanu
Computational Linguistics, Volume 39, Issue 3 - September 2013

pdf bib
*SEM 2013 shared task: Semantic Textual Similarity
Eneko Agirre | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre | Weiwei Guo
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

pdf bib
UBC_UOS-TYPED: Regression for typed-similarity
Eneko Agirre | Nikolaos Aletras | Aitor Gonzalez-Agirre | German Rigau | Mark Stevenson
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

pdf bib
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)
Eneko Agirre | Johan Bos | Mona Diab | Suresh Manandhar | Yuval Marton | Deniz Yuret
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity
Eneko Agirre | Daniel Cer | Mona Diab | Aitor Gonzalez-Agirre
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
SRIUBC: Simple Similarity Features for Semantic Textual Similarity
Eric Yeh | Eneko Agirre
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Enabling the Discovery of Digital Cultural Heritage Objects through Wikipedia
Mark Michael Hall | Oier Lopez de Lacalle | Aitor Soroa Etxabe | Paul Clough | Eneko Agirre
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
Matching Cultural Heritage items to Wikipedia
Eneko Agirre | Ander Barrena | Oier Lopez de Lacalle | Aitor Soroa | Samuel Fernando | Mark Stevenson
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Digitised Cultural Heritage (CH) items usually have short descriptions and lack rich contextual information. Wikipedia articles, on the contrary, include in-depth descriptions and links to related articles, which motivate the enrichment of CH items with information from Wikipedia. In this paper we explore the feasibility of finding matching articles in Wikipedia for a given Cultural Heritage item. We manually annotated a random sample of items from Europeana, and performed a qualitative and quantitative study of the issues and problems that arise, showing that each kind of CH item is different and needs a nuanced definition of what ``matching article'' means. In addition, we test a well-known wikification (aka entity linking) algorithm on the task. Our results indicate that a substantial number of items can be effectively linked to their corresponding Wikipedia article.

pdf bib
Contribution of Complex Lexical Information to Solve Syntactic Ambiguity in Basque
Aitziber Atutxa | Eneko Agirre | Kepa Sarasola
Proceedings of COLING 2012

pdf bib
Comparing Taxonomies for Organising Collections of Documents
Samuel Fernando | Mark Hall | Eneko Agirre | Aitor Soroa | Paul Clough | Mark Stevenson
Proceedings of COLING 2012

2011

pdf bib
Query Expansion for IR using Knowledge-Based Relatedness
Arantxa Otegi | Xabier Arregi | Eneko Agirre
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Improving Dependency Parsing with Semantic Classes
Eneko Agirre | Kepa Bengoetxea | Koldo Gojenola | Joakim Nivre
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
SemEval-2010 Task 17: All-Words Word Sense Disambiguation on a Specific Domain
Eneko Agirre | Oier Lopez de Lacalle | Christiane Fellbaum | Shu-Kai Hsieh | Maurizio Tesconi | Monica Monachini | Piek Vossen | Roxanne Segers
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Kyoto: An Integrated System for Specific Domain WSD
Aitor Soroa | Eneko Agirre | Oier Lopez de Lacalle | Wauter Bosma | Piek Vossen | Monica Monachini | Jessie Lo | Shu-Kai Hsieh
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Improving Semantic Role Classification with Selectional Preferences
Beñat Zapirain | Eneko Agirre | Lluís Màrquez | Mihai Surdeanu
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
KYOTO: an open platform for mining facts
Piek Vossen | German Rigau | Eneko Agirre | Aitor Soroa | Monica Monachini | Roberto Bartolini
Proceedings of the 6th Workshop on Ontologies and Lexical Resources

pdf bib
Exploring Knowledge Bases for Similarity
Eneko Agirre | Montse Cuadros | German Rigau | Aitor Soroa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Graph-based similarity over WordNet has been previously shown to perform very well on word similarity. This paper presents a study of the performance of such a graph-based algorithm when using different relations and versions of Wordnet. The graph algorithm is based on Personalized PageRank, a random-walk based algorithm which computes the probability of a random-walk initiated in the target word to reach any synset following the relations in WordNet (Haveliwala, 2002). Similarity is computed as the cosine of the probability distributions for each word over WordNet. The best combination of relations includes all relations in WordNet 3.0, included disambiguated glosses, and automatically disambiguated topic signatures called KnowNets. All relations are part of the official release of WordNet, except KnowNets, which have been derived automatically. The results over the WordSim 353 dataset show that using the adequate relations the performance improves over previously published WordNet-based results on the WordSim353 dataset (Finkelstein et al., 2002). The similarity software and some graphs used in this paper are publicly available at http://ixa2.si.ehu.es/ukb.

pdf bib
Plagiarism Detection across Distant Language Pairs
Alberto Barrón-Cedeño | Paolo Rosso | Eneko Agirre | Gorka Labaka
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Document Expansion Based on WordNet for Robust IR
Eneko Agirre | Xabier Arregi | Arantxa Otegi
Coling 2010: Posters

2009

pdf bib
Generalizing over Lexical Features: Selectional Preferences for Semantic Role Classification
Beñat Zapirain | Eneko Agirre | Lluís Màrquez
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

pdf bib
Personalizing PageRank for Word Sense Disambiguation
Eneko Agirre | Aitor Soroa
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
Supervised Domain Adaption for WSD
Eneko Agirre | Oier Lopez de Lacalle
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
Use of Rich Linguistic Information to Translate Prepositions and Grammar Cases to Basque
Eneko Agirre | Aitziber Atutxa | Gorka Labaka | Mikel Lersundi | Aingeru Mayor | Kepa Sarasola
Proceedings of the 13th Annual conference of the European Association for Machine Translation

pdf bib
A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches
Eneko Agirre | Enrique Alfonseca | Keith Hall | Jana Kravalova | Marius Paşca | Aitor Soroa
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)
Eneko Agirre | Lluís Màrquez | Richard Wicentowski
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

pdf bib
SemEval-2010 Task 17: All-words Word Sense Disambiguation on a Specific Domain
Eneko Agirre | Oier Lopez de Lacalle | Christiane Fellbaum | Andrea Marchetti | Antonio Toral | Piek Vossen
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

pdf bib
WikiWalk: Random walks on Wikipedia for Semantic Relatedness
Eric Yeh | Daniel Ramage | Christopher D. Manning | Eneko Agirre | Aitor Soroa
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4)

pdf bib
A Study on Linking Wikipedia Categories to Wordnet Synsets using Text Similarity
Antonio Toral | Óscar Ferrández | Eneko Agirre | Rafael Muñoz
Proceedings of the International Conference RANLP-2009

2008

pdf bib
On Robustness and Domain Adaptation using SVD for Word Sense Disambiguation
Eneko Agirre | Oier Lopez de Lacalle
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Using the Multilingual Central Repository for Graph-Based Word Sense Disambiguation
Eneko Agirre | Aitor Soroa
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the results of a graph-based method for performing knowledge-based Word Sense Disambiguation (WSD). The technique exploits the structural properties of the graph underlying the chosen knowledge base. The method is general, in the sense that it is not tied to any particular knowledge base, but in this work we have applied it to the Multilingual Central Repository (MCR). The evaluation has been performed on the Senseval-3 all-words task. The main contributions of the paper are twofold: (1) We have evaluated the separate and combined performance of each type of relation in the MCR, and thus indirectly validated the contents of the MCR and their potential for WSD. (2) We obtain state-of-the-art results, and in fact yield the best results that can be obtained using publicly available data.

pdf bib
KYOTO: a System for Mining, Structuring and Distributing Knowledge across Languages and Cultures
Piek Vossen | Eneko Agirre | Nicoletta Calzolari | Christiane Fellbaum | Shu-kai Hsieh | Chu-Ren Huang | Hitoshi Isahara | Kyoko Kanzaki | Andrea Marchetti | Monica Monachini | Federico Neri | Remo Raffaelli | German Rigau | Maurizio Tescon | Joop VanGent
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We outline work performed within the framework of a current EC project. The goal is to construct a language-independent information system for a specific domain (environment/ecology/biodiversity) anchored in a language-independent ontology that is linked to wordnets in seven languages. For each language, information extraction and identification of lexicalized concepts with ontological entries is carried out by text miners (“Kybots”). The mapping of language-specific lexemes to the ontology allows for crosslinguistic identification and translation of equivalent terms. The infrastructure developed within this project enables long-range knowledge sharing and transfer across many languages and cultures, addressing the need for global and uniform transition of knowledge beyond the specific domains addressed here.

pdf bib
WNTERM: Enriching the MCR with a Terminological Dictionary
Eli Pociello | Antton Gurrutxaga | Eneko Agirre | Izaskun Aldezabal | German Rigau
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe the methodology and the first steps for the creation of WNTERM (from WordNet and Terminology), a specialized lexicon produced from the merger of the EuroWordNet-based Multilingual Central Repository (MCR) and the Basic Encyclopaedic Dictionary of Science and Technology (BDST). As an example, the ecology domain has been used. The final result is a multilingual (Basque and English) light-weight domain ontology, including taxonomic and other semantic relations among its concepts, which is tightly connected to other wordnets.

pdf bib
Improving Parsing and PP Attachment Performance with Sense Information
Eneko Agirre | Timothy Baldwin | David Martinez
Proceedings of ACL-08: HLT

pdf bib
Robustness and Generalization of Role Sets: PropBank vs. VerbNet
Beñat Zapirain | Eneko Agirre | Lluís Màrquez
Proceedings of ACL-08: HLT

2007

pdf bib
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)
Eneko Agirre | Lluís Màrquez | Richard Wicentowski
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
SemEval-2007 Task 01: Evaluating WSD on Cross-Language Information Retrieval
Eneko Agirre | Bernardo Magnini | Oier Lopez de Lacalle | Arantxa Otegi | German Rigau | Piek Vossen
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
SemEval-2007 Task 02: Evaluating Word Sense Induction and Discrimination Systems
Eneko Agirre | Aitor Soroa
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
UBC-ALM: Combining k-NN with SVD for WSD
Eneko Agirre | Oier Lopez de Lacalle
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
UBC-AS: A Graph Based Unsupervised System for Induction and Classification
Eneko Agirre | Aitor Soroa
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
UBC-UMB: Combining unsupervised and supervised systems for all-words WSD
David Martinez | Timothy Baldwin | Eneko Agirre | Oier Lopez de Lacalle
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
UBC-UPC: Sequential SRL Using Selectional Preferences. An approach with Maximum Entropy Markov Models
Beñat Zapirain | Eneko Agirre | Lluís Màrquez
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf bib
Two graph-based algorithms for state-of-the-art WSD
Eneko Agirre | David Martínez | Oier López de Lacalle | Aitor Soroa
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

pdf bib
Evaluating and optimizing the parameters of an unsupervised graph-based WSD algorithm
Eneko Agirre | David Martínez | Oier López de Lacalle | Aitor Soroa
Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing

pdf bib
A Preliminary Study for Building the Basque PropBank
Eneko Agirre | Izaskun Aldezabal | Jone Etxeberria | Eli Pociello
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents a methodology for adding a layer of semantic annotation to a syntactically annotated corpus of Basque (EPEC), in terms of semantic roles. The proposal we make here is the combination of three resources: the model used in the PropBank project (Palmer et al., 2005), an in-house database with syntactic/semantic subcategorization frames for Basque verbs (Aldezabal, 2004) and the Basque dependency treebank (Aduriz et al., 2003). In order to validate the methodology and to confirm whether the PropBank model is suitable for Basque and our treebank design, we have built lexical entries and labelled all argument and adjuncts occurring in our treebank for 3 Basque verbs. The result of this study has been very positive, and has produced a methodology adapted to the characteristics of the language and the Basque dependency treebank. Another goal of this study was to study whether semi-automatic tagging was possible. The idea is to present the human taggers a pre-tagged version of the corpus. We have seen that many arguments could be automatically tagged with high precision, given only the verbal entries for the verbs and a handful of examples.

pdf bib
A methodology for the joint development of the Basque WordNet and Semcor
Eneko Agirre | Izaskun Aldezabal | Jone Etxeberria | Eli Izagirre | Karmele Mendizabal | Eli Pociello | Mikel Quintian
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the methodology adopted to jointly develop the Basque WordNet and a hand annotated corpora (the Basque Semcor). This joint development allows for better motivated sense distinctions, and a tighter coupling between both resources. The methodology involves edition, tagging and refereeing tasks. We are currently half way through the nominal part of the 300.000 word corpus (roughly equivalent to a 500.000 word corpus for English). We present a detailed description of the task, including the main criteria for difficult cases in the edition of the senses and the tagging of the corpus, with special mention to multiword entries. Finally we give a detailed picture of the current figures, as well as an analysis of the agreement rates.

pdf bib
Word Relatives in Context for Word Sense Disambiguation
David Martinez | Eneko Agirre | Xinglong Wang
Proceedings of the Australasian Language Technology Workshop 2006

2004

pdf bib
The Basque lexical-sample task
Eneko Agirre | Itziar Aldabe | Mikel Lersundi | David Martínez | Eli Pociello | Larraitz Uria
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

pdf bib
The Basque Country University system: English and Basque tasks
Eneko Agirre | David Martínez
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

pdf bib
The “Meaning” system on the English all-words task
Luís Villarejo | Lluis Màrquez | Eneko Agirre | David Martínez | Bernardo Magnini | Carlo Strapparava | Diana McCarthy | Andrés Montoyo | Armando Suárez
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

pdf bib
Unsupervised WSD based on Automatically Retrieved Examples: The Importance of Bias
Eneko Agirre | David Martinez
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

pdf bib
Cross-Language Acquisition of Semantic Models for Verbal Predicates
Jordi Atserias | Bernardo Magnini | Octavian Popescu | Eneko Agirre | Aitziber Atutxa | German Rigau | John Carroll | Rob Koeling
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
The Effect of Bias on an Automatically-built Word Sense Corpus
David Martínez | Eneko Agirre
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Exploring Portability of Syntactic Information from English to Basque
Eneko Agirre | Aitziber Atutxa | Koldo Gojenola | Kepa Sarasola
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Publicly Available Topic Signatures for all WordNet Nominal Senses
Eneko Agirre | Oier Lopez de Lacalle
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

pdf bib
A Multilingual Approach to Disambiguate Prepositions and Case Suffixes
Eneko Agirre | Mikel Lersundi | David Martinez
Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions

pdf bib
MEANING: a Roadmap to Knowledge Technologies
German Rigau | Bernardo Magnini | Eneko Agirre | Piek Vossen | John Carroll
COLING-02: A Roadmap for Computational Linguistics

pdf bib
Syntactic Features for High Precision Word Sense Disambiguation
David Martínez | Eneko Agirre | Lluís Màrquez
COLING 2002: The 19th International Conference on Computational Linguistics

2001

pdf bib
Learning class-to-class selectional preferences
Eneko Agirre | David Martinez
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)

pdf bib
The Basque Task: Did Systems Perform in the Upperbound?
Eneko Agirre | Elena Garcia | Mikel Lersundi | David Martinez | Eli Pociello
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

pdf bib
Decision Lists for English and Basque
David Martinez | Eneko Agirre
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

2000

pdf bib
A Word-level Morphosyntactic Analyzer for Basque
I. Aduriz | E. Agirre | I. Aldezabal | X. Arregi | J. M. Arriola | X. Artola | K. Gojenola | A. Maritxalar | K. Sarasola | M. Urkia
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf bib
A word-grammar based morphological analyzer for agglutinative languages
I. Aduriz | E. Agirre | I. Aldezabal | I. Alegria | X. Arregi | J. M. Arriola | X. Artola | K. Gojenola | A. Maritxalar | K. Sarasola | M. Urkia
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf bib
One Sense per Collocation and Genre/Topic Variations
David Martinez | Eneko Agirre
2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

pdf bib
Exploring Automatic Word Sense Disambiguation with Decision Lists and the Web
Eneko Agirre | David Martinez
Proceedings of the COLING-2000 Workshop on Semantic Annotation and Intelligent Content

1998

pdf bib
Towards a single proposal in spelling correction
Eneko Agirre | Koldo Gojenola | Kepa Sarasola | Atro Voutilainen
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

pdf bib
Building Accurate Semantic Taxonomies Monolingual MRDs
German Rigau | Horacio Rodriguez | Eneko Agirre
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

pdf bib
Towards a Single Proposal in Spelling Correction
Eneko Agirre | Koldo Gojenola | Kepa Sarasola | Atro Voutilainen
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf bib
Building Accurate Semantic Taxonomies from Monolingual MRDs
German Rigau | Horacio Rodriguez | Eneko Agirre
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

1997

pdf bib
Combining Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation
German Rigau | Jordi Atserias | Eneko Agirre
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics

1996

pdf bib
Word Sense Disambiguation using Conceptual Density
Eneko Agirre | German Rigau
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

1994

pdf bib
Lexical, Knowledge Representation in an Intelligent Dictionary Help System
E. Agirre | X. Arregi | X. Artola | A. Diaz de Ilarraza | K. Sarasola
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

1993

pdf bib
A Morphological Analysis Based Method for Spelling Correction
I. Aduriz | E. Agirre | I. Alegria | X. Arregi | J.M Arriola | X. Artola | A. Diaz de Ilarraza | N. Ezeiza | M. Maritxalar | K. Sarasola | M. Urkia
Sixth Conference of the European Chapter of the Association for Computational Linguistics

1992

pdf bib
XUXEN: A Spelling Checker/Corrector for Basque Based on Two-Level Morphology
E. Agirre | I Alegria | X Arregi | X Artola | A Diaz de Ilarraza | M Maritxalar | K Sarasola | M Urkia
Third Conference on Applied Natural Language Processing

Search
Co-authors