Iñaki Alegría

Also published as: Iñaki Alegria, Inaki Alegria, Ińaki Alegria, I Alegria, I. Alegria

2018

We describe the first experimental results in neural machine translation for Basque. As a synthetic language featuring agglutinative morphology, an extended case system, complex verbal morphology and relatively free word order, Basque presents a large number of challenging characteristics for machine translation in general, and for data-driven approaches such as attentionbased encoder-decoder models in particular. We present our results on a large range of experiments in Basque-Spanish translation, comparing several neural machine translation system variants with both rule-based and statistical machine translation systems. We demonstrate that significant gains can be obtained with a neural network approach for this challenging language pair, and describe optimal configurations in terms of word segmentation and decoding parameters, measured against test sets that feature multiple references to account for word order variability.

pdf bib abs
Massively multilingual accessible audioguides via cell phones
Itziar Cortes | Igor Leturia | Ińaki Alegria | Aitzol Astigarraga | Kepa Sarasola | Manex Garaio
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Bidaide1 is a web service that allows the visitors of a museum, route or building to read or listen to explanations relative to the visited place on their own mobile and in their own language. The visitor can access the explanations in various ways: by scanning some QR codes located in the place, by GPS positioning (in outdoor routes), or by automatic Bluetooth proximity activation. This makes it accessible for people with reduced or null vision. On the other hand, this platform also offers to the manager of the visited site the most advanced language resources to create the texts and audios of the explanations in many languages.

pdf bib abs
Measuring language distance among historical varieties using perplexity. Application to European Portuguese.
Jose Ramom Pichel Campos | Pablo Gamallo | Iñaki Alegria
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

The objective of this work is to quantify, with a simple and robust measure, the distance between historical varieties of a language. The measure will be inferred from text corpora corresponding to historical periods. Different approaches have been proposed for similar aims: Language Identification, Phylogenetics, Historical Linguistics or Dialectology. In our approach, we used a perplexity-based measure to calculate language distance between all the historical periods of a specific language: European Portuguese. Perplexity has also proven to be a robust metric to calculate distance between languages. However, this measure has not been tested yet to identify diachronic periods within the historical evolution of a specific language. For this purpose, a historical Portuguese corpus has been constructed from different open sources containing texts with close original spelling. The results of our experiments show that Portuguese keeps an important degree of homogeneity over time. We anticipate this metric to be a starting point to be applied to other languages.

pdf bib abs
Verbal Multiword Expressions in Basque Corpora
Uxoa Iñurrieta | Itziar Aduriz | Ainara Estarrona | Itziar Gonzalez-Dios | Antton Gurrutxaga | Ruben Urizar | Iñaki Alegria
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper presents a Basque corpus where Verbal Multiword Expressions (VMWEs) were annotated following universal guidelines. Information on the annotation is given, and some ideas for discussion upon the guidelines are also proposed. The corpus is useful not only for NLP-related research, but also to draw conclusions on Basque phraseology in comparison with other languages.

2017

pdf bib abs
A Comparison of Feature-Based and Neural Scansion of Poetry
Manex Agirrezabal | Iñaki Alegria | Mans Hulden
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Automatic analysis of poetic rhythm is a challenging task that involves linguistics, literature, and computer science. When the language to be analyzed is known, rule-based systems or data-driven methods can be used. In this paper, we analyze poetic rhythm in English and Spanish. We show that the representations of data learned from character-based neural models are more informative than the ones from hand-crafted features, and that a Bi-LSTM+CRF-model produces state-of-the art accuracy on scansion of poetry in two languages. Results also show that the information about whole word structure, and not just independent syllables, is highly informative for performing scansion.

pdf bib abs
A Perplexity-Based Method for Similar Languages Discrimination
Pablo Gamallo | Jose Ramom Pichel | Iñaki Alegria
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This article describes the system submitted by the Citius_Ixa_Imaxin team to the VarDial 2017 (DSL and GDI tasks). The strategy underlying our system is based on a language distance computed by means of model perplexity. The best model configuration we have tested is a voting system making use of several n-grams models of both words and characters, even if word unigrams turned out to be a very competitive model with reasonable results in the tasks we have participated. An error analysis has been performed in which we identified many test examples with no linguistic evidences to distinguish among the variants.

2016

pdf bib abs
Machine Learning for Metrical Analysis of English Poetry
Manex Agirrezabal | Iñaki Alegria | Mans Hulden
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this work we tackle the challenge of identifying rhythmic patterns in poetry written in English. Although poetry is a literary form that makes use standard meters usually repeated among different authors, we will see in this paper how performing such analyses is a difficult task in machine learning due to the unexpected deviations from such standard patterns. After breaking down some examples of classical poetry, we apply a number of NLP techniques for the scansion of poetry, training and testing our systems against a human-annotated corpus. With these experiments, our purpose is establish a baseline of automatic scansion of poetry using NLP tools in a straightforward manner and to raise awareness of the difficulties of this task.

pdf bib abs
Evaluating Translation Quality and CLIR Performance of Query Sessions
Xabier Saralegi | Eneko Agirre | Iñaki Alegria
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents the evaluation of the translation quality and Cross-Lingual Information Retrieval (CLIR) performance when using session information as the context of queries. The hypothesis is that previous queries provide context that helps to solve ambiguous translations in the current query. We tested several strategies on the TREC 2010 Session track dataset, which includes query reformulations grouped by generalization, specification, and drifting types. We study the Basque to English direction, evaluating both the translation quality and CLIR performance, with positive results in both cases. The results show that the quality of translation improved, reducing error rate by 12% (HTER) when using session information, which improved CLIR results 5% (nDCG). We also provide an analysis of the improvements across the three kinds of sessions: generalization, specification, and drifting. Translation quality improved in all three types (generalization, specification, and drifting), and CLIR improved for generalization and specification sessions, preserving the performance in drifting sessions.

pdf bib abs
Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene
Izaskun Etxeberria | Iñaki Alegria | Larraitz Uria | Mans Hulden
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a method for the normalization of historical texts using a combination of weighted finite-state transducers and language models. We have extended our previous work on the normalization of dialectal texts and tested the method against a 17th century literary work in Basque. This preprocessed corpus is made available in the LREC repository. The performance of this method for learning relations between historical and contemporary word forms is evaluated against resources in three languages. The method we present learns to map phonological changes using a noisy channel model. The model is based on techniques commonly used for phonological inference and producing Grapheme-to-Grapheme conversion systems encoded as weighted transducers and produces F-scores above 80% in the task for Basque. A wider evaluation shows that the approach performs equally well with all the languages in our evaluation suite: Basque, Spanish and Slovene. A comparison against other methods that address the same task is also provided.

pdf bib abs
Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation
Gorka Labaka | Iñaki Alegria | Kepa Sarasola
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents how an state-of-the-art SMT system is enriched by using an extra in-domain parallel corpora extracted from Wikipedia. We collect corpora from parallel titles and from parallel fragments in comparable articles from Wikipedia. We carried out an evaluation with a double objective: evaluating the quality of the extracted data and evaluating the improvement due to the domain-adaptation. We think this can be very useful for languages with limited amount of parallel corpora, where in-domain data is crucial to improve the performance of MT sytems. The experiments on the Spanish-English language pair improve a baseline trained with the Europarl corpus in more than 2 points of BLEU when translating in the Computer Science domain.

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have an official status in the Iberian Peninsula. The corpus has been created by combining automatic collection and crowdsourcing approaches, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper we describe the methodology followed to build the corpus, and present the results of the shared task in which it was tested.

pdf bib
EHU at the SIGMORPHON 2016 Shared Task. A Simple Proposal: Grapheme-to-Phoneme for Inflection
Iñaki Alegria | Izaskun Etxeberria
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf bib
Combining Phonology and Morphology for the Normalization of Historical Texts
Izaskun Etxeberria | Iñaki Alegria | Larraitz Uria | Mans Hulden
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib abs
Comparing Two Basic Methods for Discriminating Between Similar Languages and Varieties
Pablo Gamallo | Iñaki Alegria | José Ramom Pichel | Manex Agirrezabal
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

This article describes the systems submitted by the Citius_Ixa_Imaxin team to the Discriminating Similar Languages Shared Task 2016. The systems are based on two different strategies: classification with ranked dictionaries and Naive Bayes classifiers. The results of the evaluation show that ranking dictionaries are more sound and stable across different domains while basic bayesian models perform reasonably well on in-domain datasets, but their performance drops when they are applied on out-of-domain texts.

2015

pdf bib
Analyzing English-Spanish Named-Entity enhanced Machine Translation
Mikel Artetxe | Eneko Agirre | Inaki Alegria | Gorka Labaka
Proceedings of the Ninth Workshop on Syntax, Semantics and Structure in Statistical Translation

2014

In this paper we introduce TweetNorm_es, an annotated corpus of tweets in Spanish language, which we make publicly available under the terms of the CC-BY license. This corpus is intended for development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort from different research groups. In this paper we describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, as the first evaluation environment where the corpus was used.

2013

pdf bib
Combining Different Features of Idiomaticity for the Automatic Classification of Noun+Verb Expressions in Basque
Antton Gurrutxaga | Iñaki Alegria
Proceedings of the 9th Workshop on Multiword Expressions

2012

pdf bib abs
Measuring the compositionality of NV expressions in Basque by means of distributional similarity techniques
Antton Gurrutxaga | Iñaki Alegria
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present several experiments aiming at measuring the semantic compositionality of NV expressions in Basque. Our approach is based on the hypothesis that compositionality can be related to distributional similarity. The contexts of each NV expression are compared with the contexts of its corresponding components, by means of different techniques, as similarity measures usually used with the Vector Space Model (VSM), Latent Semantic Analysis (LSA) and some measures implemented in the Lemur Toolkit, as Indri index, tf-idf, Okapi index and Kullback-Leibler divergence. Using our previous work with cooccurrence techniques as a baseline, the results point to improvements using the Indri index or Kullback-Leibler divergence, and a slight further improvement when used in combination with cooccurrence measures such as $t$-score, via rank-aggregation. This work is part of a project for MWE extraction and characterization using different techniques aiming at measuring the properties related to idiomaticity, as institutionalization, non-compositionality and lexico-syntactic fixedness.

pdf bib
BAD: An Assistant tool for making verses in Basque
Manex Agirrezabal | Iñaki Alegria | Bertol Arrieta | Mans Hulden
Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing
Iñaki Alegria | Mans Hulden
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing

pdf bib
Finite-State Technology in a Verse-Making Tool
Manex Agirrezabal | Iñaki Alegria | Bertol Arrieta | Mans Hulden
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing

2011

pdf bib
Recognition and Classification of Numerical Entities in Basque
Ander Soraluze | Iñaki Alegria | Olatz Ansa | Olatz Arregi | Xabier Arregi
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

pdf bib
Automatic Extraction of NV Expressions in Basque: Basic Issues on Cooccurrence Techniques
Antton Gurrutxaga | Iñaki Alegria
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

pdf bib
Learning word-level dialectal variation as phonological replacement rules using a limited parallel corpus
Mans Hulden | Iñaki Alegria | Izaskun Etxeberria | Montse Maritxalar
Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties

2010

pdf bib abs
A Morphological Processor Based on Foma for Biscayan (a Basque dialect)
Iñaki Alegria | Garbiñe Aranbarri | Klara Ceberio | Gorka Labaka | Bittor Laskurain | Ruben Urizar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a new morphological processor for Biscayan, a dialect of Basque, developed on the description of the morphology of standard Basque. The database for the standard morphology has been extended for dialects and an open-source tool for morphological description named foma is used for building the processor. Biscayan is a dialect of the Basque language spoken mainly in Biscay, a province on the western of the Basque Country. The description of the lexicon and the morphotactics (or word grammar) for the standard Basque was carried out using a relational database and the database has been extended in order to include dialectal variants linked to the standard entries. XuxenB, a spelling checker/corrector for this dialect, is the first application of this work. Additionally to the basic analyzer used for spelling, a new transducer is included. It is an enhanced analyzer for linking standard form with the corresponding standard ones. It is used in correction for generation of proposals when in the input text appear standard forms which we want to replace with dialectal forms.

2008

We present our initial strategy for Spanish-to-Basque MultiEngine Machine Translation, a language pair with very different structure and word order and with no huge parallel corpus available. This hybrid proposal is based on the combination of three different MT paradigms: Example-Based MT, Statistical MT and Rule- Based MT. We have evaluated the system, reporting automatic evaluation metrics for a corpus in a test domain. The first results obtained are encouraging.

pdf bib abs
Spelling Correction: from Two-Level Morphology to Open Source
Iñaki Alegria | Klara Ceberio | Nerea Ezeiza | Aitor Soroa | Gregorio Hernandez
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of languages and there are two-level based descriptions for very different languages. After doing the morphological description for a language, it is easy to develop a spelling checker/corrector for this language. However, what happens if we want to use the speller in the free world (OpenOffice, Mozilla, emacs, LaTeX, etc.)? Ispell and similar tools (aspell, hunspell, myspell) are the usual mechanisms for these purposes, but they do not fit the two-level model. In the absence of two-level morphology based mechanisms, an automatic conversion from two-level description to hunspell is described in this paper.

2006

The ZT corpus (Basque Corpus of Science and Technology) is a tagged collection of specialized texts in Basque, which wants to be a main resource in research and development about written technical Basque: terminology, syntax and style. It will be the first written corpus in Basque which will be distributed by ELDA (at the end of 2006) and it wants to be a methodological and functional reference for new projects in the future (i.e. a national corpus for Basque). We also present the technology and the tools to build this Corpus. These tools, Corpusgile and Eulia, provide a flexible and extensible infrastructure for creating, visualizing and managing corpora and for consulting, visualizing and modifying annotations generated by linguistic tools.

pdf bib
Using Machine Learning Techniques to Build a Comma Checker for Basque
Iñaki Alegria | Bertol Arrieta | Arantza Diaz de Ilarraza | Eli Izagirre | Montse Maritxalar
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf bib
Named Entities Translation Based on Comparable Corpora
Iñaki Alegria | Nerea Ezeiza | Izaskun Fernandez
Proceedings of the Workshop on Multi-word-expressions in a multilingual context

pdf bib
A Multiclassifier based Document Categorization System: profiting from the Singular Value Decomposition Dimensionality Reduction Technique
Ana Zelaia | Iñaki Alegria | Olatz Arregi | Basilio Sierra
Proceedings of the Workshop on Learning Structured Information in Natural Language Applications

2005

We present the current status of development of an open architecture for the translation from Spanish into Basque. The machine translation architecture uses an open source analyser for Spanish and new modules mainly based on finite-state transducers. The project is integrated in the OpenTrad initiative, a larger government funded project shared among different universities and small companies, which will also include MT engines for translation among the main languages in Spain. The main objective is the construction of an open, reusable and interoperable framework. This paper describes the design of the engine, the formats it uses for the communication among the modules, the modules reused from other project named Matxin and the new modules we are building.

2004

This project combines linguistic and statistical information to develop a term extraction tool for Basque. Being Basque an agglutinative and highly inflected language, the treatment of morphosyntactic information is vital. In addition, due to late unification process of the language, texts present more elevated term dispersion than in a highly normalized language. The result is a semi-automatic terminology extraction tool based on XML, for its use in technical and scientific information managing.