2014
Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage
Kevin Black | Eric Ringger | Paul Felt | Kevin Seppi | Kristian Heal | Deryle Lonsdale
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The task of corpus-dictionary linkage (CDL) is to annotate each word in a corpus with a link to an appropriate dictionary entry that documents the sense and usage of the word. Corpus-dictionary linked resources include concordances, dictionaries with word usage examples, and corpora annotated with lemmas or word-senses. Such CDL resources are essential in learning a language and in linguistic research, translation, and philology. Lemmatization is a common approximation to automating corpus-dictionary linkage, where lemmas are treated as dictionary entry headwords. We intend to use data-driven lemmatization models to provide machine assistance to human annotators in the form of pre-annotations, and thereby reduce the costs of CDL annotation. In this work we adapt the discriminative string transducer DirecTL+ to perform lemmatization for classical Syriac, a low-resource language. We compare the accuracy of DirecTL+ with the Morfette discriminative lemmatizer. DirecTL+ achieves 96.92% overall accuracy, but it exceeds Morfette by a margin of only 0.86% and takes longer to train. Error analysis on the models provides guidance on how to apply these models in a machine assistance setting for corpus-dictionary linkage.
Using Transfer Learning to Assist Exploratory Corpus Annotation
Paul Felt | Eric Ringger | Kevin Seppi | Kristian Heal
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We describe an under-studied problem in language resource management: that of providing automatic assistance to annotators working in exploratory settings. When no satisfactory tagset already exists, such as in under-resourced or undocumented languages, it must be developed iteratively while annotating data. This process naturally gives rise to a sequence of datasets, each annotated differently. We argue that this problem is best regarded as a transfer learning problem with multiple source tasks. Using part-of-speech tagging data with simulated exploratory tagsets, we demonstrate that even simple transfer learning techniques can significantly improve the quality of pre-annotations in an exploratory annotation.
2012
First Results in a Study Evaluating Pre-annotation and Correction Propagation for Machine-Assisted Syriac Morphological Analysis
Paul Felt | Eric Ringger | Kevin Seppi | Kristian Heal | Robbie Haertel | Deryle Lonsdale
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Manual annotation of large textual corpora can be cost-prohibitive, especially for rare and under-resourced languages. One potential solution is pre-annotation: asking human annotators to correct sentences that have already been annotated, usually by a machine. Another potential solution is correction propagation: using annotator corrections of bad pre-annotations to dynamically improve the remaining pre-annotations within the current sentence. The research presented in this paper employs a controlled user study to discover under what conditions these two machine-assisted annotation techniques are effective in increasing annotator speed and accuracy, thereby reducing the cost of morphologically annotating texts written in classical Syriac. A preliminary analysis of the data indicates that pre-annotations improve annotator accuracy when they are at least 60% accurate, and annotator speed when they are at least 80% accurate. This research constitutes the first systematic evaluation of pre-annotation and correction propagation together in a controlled user study.
2010
A Probabilistic Morphological Analyzer for Syriac
Peter McClanahan | George Busby | Robbie Haertel | Kristian Heal | Deryle Lonsdale | Kevin Seppi | Eric Ringger
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing