Izaskun Etxeberria

2020

Dealing with dialectal variation in the construction of the Basque historical corpus
Ainara Estarrona | Izaskun Etxeberria | Ricardo Etxepare | Manuel Padilla-Moyano | Ander Soraluze
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper analyses the challenge of working with dialectal variation when semi-automatically normalising and analysing historical Basque texts. This work is part of a more general ongoing project for the construction of a morphosyntactically annotated historical corpus of Basque called Basque in the Making (BIM): A Historical Look at a European Language Isolate, whose main objective is the systematic and diachronic study of a number of grammatical features. This will be not only the first tagged corpus of historical Basque, but also a means to improve language processing tools by analysing historical Basque varieties more or less distant from present-day standard Basque.

2016

Combining Phonology and Morphology for the Normalization of Historical Texts
Izaskun Etxeberria | Iñaki Alegria | Larraitz Uria | Mans Hulden
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

EHU at the SIGMORPHON 2016 Shared Task. A Simple Proposal: Grapheme-to-Phoneme for Inflection
Iñaki Alegria | Izaskun Etxeberria
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene
Izaskun Etxeberria | Iñaki Alegria | Larraitz Uria | Mans Hulden
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a method for the normalization of historical texts using a combination of weighted finite-state transducers and language models. We have extended our previous work on the normalization of dialectal texts and tested the method against a 17th-century literary work in Basque. This preprocessed corpus is made available in the LREC repository. The performance of this method for learning relations between historical and contemporary word forms is evaluated against resources in three languages. The method we present learns to map phonological changes using a noisy channel model. The model is based on techniques commonly used for phonological inference and for producing Grapheme-to-Grapheme conversion systems encoded as weighted transducers, and it produces F-scores above 80% in the task for Basque. A wider evaluation shows that the approach performs equally well with all the languages in our evaluation suite: Basque, Spanish and Slovene. A comparison against other methods that address the same task is also provided.
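
The noisy channel idea named in the abstract above can be illustrated with a minimal sketch: for a historical form h, choose the modern form c that maximizes P(c) · P(h | c). This is not the authors' implementation, which the abstract describes as weighted finite-state transducers combined with language models. The word list, the probabilities, and the names `LANGUAGE_MODEL`, `CHANNEL`, `channel_log_prob` and `normalize` are all hypothetical toy values chosen for illustration, and the channel model is simplified to same-length character substitutions.

```python
import math

# Hypothetical toy language model: unigram probabilities of modern forms.
LANGUAGE_MODEL = {"zen": 0.6, "zan": 0.3, "sen": 0.1}

# Hypothetical channel probabilities P(historical char | modern char).
# Pairs that are not listed fall back to the defaults below.
CHANNEL = {("c", "z"): 0.3, ("z", "z"): 0.7}
P_SAME = 0.9    # default probability that a character is kept unchanged
P_OTHER = 0.01  # default probability of any unlisted substitution


def channel_log_prob(historical, modern):
    """Return log P(historical | modern) under a per-character model.

    Simplification: only equal-length pairs are scored, so insertions
    and deletions are not modelled.
    """
    if len(historical) != len(modern):
        return float("-inf")
    total = 0.0
    for h, m in zip(historical, modern):
        default = P_SAME if h == m else P_OTHER
        total += math.log(CHANNEL.get((h, m), default))
    return total


def normalize(historical):
    """Return the modern form maximizing P(modern) * P(historical | modern)."""
    def score(modern):
        prior = math.log(LANGUAGE_MODEL[modern])
        return prior + channel_log_prob(historical, modern)

    return max(LANGUAGE_MODEL, key=score)


print(normalize("cen"))
```

In this toy setting the language model prior and the channel probabilities are simply multiplied (added in log space), so a frequent modern form can win even when its spelling differs from the historical input in one character. A real system would learn both components from aligned historical and modern word pairs.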