Rubén Izquierdo

Also published as: Ruben Izquierdo, Ruben Izquierdo Bevia

2016

More is not always better: balancing sense distributions for all-words Word Sense Disambiguation
Marten Postma | Ruben Izquierdo Bevia | Piek Vossen
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Current Word Sense Disambiguation systems show an extremely poor performance on low frequent senses, which is mainly caused by the difference in sense distributions between training and test data. The main focus in tackling this problem has been on acquiring more data or selecting a single predominant sense and not necessarily on the meta properties of the data itself. We demonstrate that these properties, such as the volume, provenance, and balancing, play an important role with respect to system performance. In this paper, we describe a set of experiments to analyze these meta properties in the framework of a state-of-the-art WSD system when evaluated on the SemEval-2013 English all-words dataset. We show that volume and provenance are indeed important, but that approximating the perfect balancing of the selected training data leads to an improvement of 21 points and exceeds state-of-the-art systems by 14 points while using only simple features. We therefore conclude that unsupervised acquisition of training data should be guided by strategies aimed at matching meta properties.

pdf bib abs

Addressing the MFS Bias in WSD systems
Marten Postma | Ruben Izquierdo | Eneko Agirre | German Rigau | Piek Vossen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Word Sense Disambiguation (WSD) systems tend to have a strong bias towards assigning the Most Frequent Sense (MFS), which results in high performance on the MFS but in a very low performance on the less frequent senses. We addressed the MFS bias in WSD systems by combining the output from a WSD system with a set of mostly static features to create a MFS classifier to decide when to and not to choose the MFS. The output from this MFS classifier, which is based on the Random Forest algorithm, is then used to modify the output from the original WSD system. We applied our classifier to one of the state-of-the-art supervised WSD systems, i.e. IMS, and to of the best state-of-the-art unsupervised WSD systems, i.e. UKB. Our main finding is that we are able to improve the system output in terms of choosing between the MFS and the less frequent senses. When we apply the MFS classifier to fine-grained WSD, we observe an improvement on the less frequent sense cases, whereas we maintain the overall recall.

2015

pdf bib

VUA-background : When to Use Background Information to Perform Word Sense Disambiguation
Marten Postma | Ruben Izquierdo | Piek Vossen
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf bib abs

Generating Polarity Lexicons with WordNet propagation in 5 languages
Isa Maks | Ruben Izquierdo | Francesca Frontini | Rodrigo Agerri | Piek Vossen | Andoni Azpeitia
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we focus on the creation of general-purpose (as opposed to domain-specific) polarity lexicons in five languages: French, Italian, Dutch, English and Spanish using WordNet propagation. WordNet propagation is a commonly used method to generate these lexicons as it gives high coverage of general purpose language and the semantically rich WordNets where concepts are organised in synonym , antonym and hyperonym/hyponym structures seem to be well suited to the identification of positive and negative words. However, WordNets of different languages may vary in many ways such as the way they are compiled, the number of synsets, number of synonyms and number of semantic relations they include. In this study we investigate whether this variability translates into differences of performance when these WordNets are used for polarity propagation. Although many variants of the propagation method are developed for English, little is known about how they perform with WordNets of other languages. We implemented a propagation algorithm and designed a method to obtain seed lists similar with respect to quality and size, for each of the five languages. We evaluated the results against gold standards also developed according to a common method in order to achieve as less variance as possible between the different languages.

2013

pdf bib

DutchSemCor: in quest of the ideal sense-tagged corpus
Piek Vossen | Rubén Izquierdo | Attila Görög
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

pdf bib abs

DutchSemCor: Targeting the ideal sense-tagged corpus
Piek Vossen | Attila Görög | Rubén Izquierdo | Antal van den Bosch
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Word Sense Disambiguation (WSD) systems require large sense-tagged corpora along with lexical databases to reach satisfactory results. The number of English language resources for developed WSD increased in the past years while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver a Dutch corpus that is sense-tagged with senses from the Cornetto lexical database. In this paper, we discuss the different conflicting requirements for a sense-tagged corpus and our strategies to fulfill them. We report on a first series of experiments to sup- port our semi-automatic approach to build the corpus.