2017
Experiments in taxonomy induction in Spanish and French
Irene Renau | Rogelio Nazar | Rafael Marín
Proceedings of Language, Ontology, Terminology and Knowledge Structures Workshop (LOTKS 2017)
2016
A Taxonomy of Spanish Nouns, a Statistical Algorithm to Generate it and its Implementation in Open Source Code
Rogelio Nazar | Irene Renau
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper we describe our work in progress on the automatic development of a taxonomy of Spanish nouns, offer the Perl implementation we have so far, and discuss the problems that still need to be addressed. We designed a statistically based taxonomy induction algorithm that combines several strategies, none of which involves explicit linguistic knowledge. Although all of them are quantitative, the strategies differ in nature. Some are based on the computation of distributional similarity coefficients, which identify pairs of sibling words or co-hyponyms, while others are based on asymmetric co-occurrence and identify pairs of parent-child words or hypernym-hyponym relations. A decision-making process is then applied to combine the results of the previous steps and finally connect lexical units to a basic structure containing the most general categories of the language. We evaluate the quality of the taxonomy both manually and using the Spanish WordNet as a gold standard. We estimate an average of 89.07% precision and 25.49% recall considering only the results that the algorithm presents with a high degree of certainty, or 77.86% precision and 33.72% recall considering all results.
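The idea of asymmetric co-occurrence mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's Perl implementation or its actual measure: the function name `hypernym_candidates`, the thresholds `min_ratio` and `min_count`, and the toy contexts are all assumptions made for illustration. The sketch proposes a (child, parent) pair when the parent appears in most contexts of the child but the reverse does not hold.

```python
from collections import Counter
from itertools import permutations


def hypernym_candidates(contexts, min_ratio=0.8, min_count=2):
    """Propose (child, parent) pairs from asymmetric co-occurrence.

    contexts: iterable of sets of nouns that appear together.
    min_ratio and min_count are hypothetical thresholds, not values
    taken from the paper.
    """
    freq = Counter()   # number of contexts containing each word
    pair = Counter()   # number of contexts containing both words (ordered)
    for ctx in contexts:
        words = set(ctx)
        freq.update(words)
        pair.update(permutations(sorted(words), 2))

    result = []
    for (child, parent), both in pair.items():
        if both < min_count:
            continue  # too little evidence
        p_parent_given_child = both / freq[child]
        p_child_given_parent = both / freq[parent]
        # Asymmetry: the parent nearly always accompanies the child,
        # while the child accompanies the parent less often.
        if (p_parent_given_child >= min_ratio
                and p_child_given_parent < p_parent_given_child):
            result.append((child, parent))
    return sorted(result)


# Toy contexts (invented for illustration only).
CONTEXTS = [
    {"perro", "animal"},
    {"gato", "animal"},
    {"perro", "animal", "casa"},
    {"gato", "animal"},
    {"animal", "planta"},
    {"casa", "ciudad"},
]

print(hypernym_candidates(CONTEXTS))
# → [('gato', 'animal'), ('perro', 'animal')]
```

In this toy data, "animal" occurs in every context of "perro" and "gato", while each of them occurs in only two of the five contexts of "animal", so both are proposed as children of "animal".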
2012
Spell Checking in Spanish: The Case of Diacritic Accents
Jordi Atserias | Maria Fuentes | Rogelio Nazar | Irene Renau
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This article presents the problem of diacritic restoration (or diacritization) in the context of spell-checking, with the focus on an orthographically rich language such as Spanish. We argue that despite the large volume of work published on the topic of diacritization, currently available spell-checking tools have still not found a proper solution to the problem in those cases where both forms of a word are listed in the checker's dictionary. This is the case, for instance, when a word form exists with and without diacritics, such as continuo 'continuous' and continuó 'he/she/it continued', or when different diacritics make other word distinctions, as in continúo 'I continue'. We propose a very simple solution based on a word bigram model derived from correctly typed Spanish texts and evaluate the ability of this model to restore diacritics in artificial as well as real errors. The case of diacritics is only meant to be an example of the possible applications for this idea, yet we believe that the same method could be applied to other kinds of orthographic or even grammatical errors. Moreover, given that no explicit linguistic knowledge is required, the proposed model can be used with other languages provided that a large normative corpus is available.
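A word bigram approach of the kind the abstract describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' system: the helper names `strip_accents`, `train`, and `restore`, the scoring rule (left plus right bigram counts, with unigram frequency as a tie-breaker), and the five-sentence toy corpus are all hypothetical. A real model would be estimated from a large normative corpus.

```python
from collections import Counter, defaultdict

# Map accented Spanish vowels to their plain forms; ñ is left alone
# because it is a separate letter, not a diacritic variant.
ACCENT_MAP = str.maketrans("áéíóúü", "aeiouu")


def strip_accents(word):
    return word.translate(ACCENT_MAP)


def train(sentences):
    """Count unigrams and bigrams and index words by accent-free form."""
    unigrams, bigrams = Counter(), Counter()
    variants = defaultdict(set)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
        for tok in tokens:
            variants[strip_accents(tok)].add(tok)
    return unigrams, bigrams, variants


def restore(sentence, model):
    """Pick, for each word, the variant best supported by its neighbours."""
    unigrams, bigrams, variants = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    out = []
    for i in range(1, len(tokens) - 1):
        candidates = variants.get(strip_accents(tokens[i]), {tokens[i]})
        prev = out[-1] if out else "<s>"
        nxt = tokens[i + 1]

        def score(cand):
            # Bigram evidence first, unigram frequency as tie-breaker.
            return (bigrams[(prev, cand)] + bigrams[(cand, nxt)],
                    unigrams[cand])

        out.append(max(sorted(candidates), key=score))
    return " ".join(out)


# Toy "correctly typed" corpus (invented for illustration only).
CORPUS = [
    "él continuó el trabajo",
    "ella continuó la marcha",
    "yo continúo el trabajo",
    "un proceso continuo de cambio",
    "el flujo continuo de datos",
]

model = train(CORPUS)
print(restore("ella continuo la marcha", model))  # → ella continuó la marcha
print(restore("yo continuo el trabajo", model))   # → yo continúo el trabajo
```

Because the choice depends only on counts from correctly typed text, the same sketch would apply to another language once `ACCENT_MAP` and the corpus are replaced.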
Google Books N-gram Corpus used as a Grammar Checker
Rogelio Nazar
|
Irene Renau
Proceedings of the Second Workshop on Computational Linguistics and Writing (CL&W 2012): Linguistic and Cognitive Aspects of Document Creation and Document Engineering