Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage

Kevin Black; Eric Ringger; Paul Felt; Kevin Seppi; Kristian Heal; Deryle Lonsdale

Abstract

The task of corpus-dictionary linkage (CDL) is to annotate each word in a corpus with a link to an appropriate dictionary entry that documents the sense and usage of the word. Corpus-dictionary linked resources include concordances, dictionaries with word usage examples, and corpora annotated with lemmas or word senses. Such CDL resources are essential for language learning and for linguistic research, translation, and philology. Lemmatization is a common approximation to automated corpus-dictionary linkage, in which lemmas are treated as dictionary entry headwords. We intend to use data-driven lemmatization models to provide machine assistance to human annotators in the form of pre-annotations, and thereby reduce the cost of CDL annotation. In this work we adapt the discriminative string transducer DirecTL+ to perform lemmatization for classical Syriac, a low-resource language. We compare the accuracy of DirecTL+ with that of the Morfette discriminative lemmatizer. DirecTL+ achieves 96.92% overall accuracy, but it outperforms Morfette by a margin of only 0.86% and takes longer to train. Error analysis of the models provides guidance on how to apply them in a machine-assistance setting for corpus-dictionary linkage.

Anthology ID: L14-1142
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 3798–3805
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1203_Paper.pdf
Cite (ACL): Kevin Black, Eric Ringger, Paul Felt, Kevin Seppi, Kristian Heal, and Deryle Lonsdale. 2014. Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3798–3805, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage (Black et al., LREC 2014)
