Leveraging Machine Readable Dictionaries in Discriminative Sequence Models

Ben Wellner; Marc Vilain

Abstract

Many natural language processing tasks make use of a lexicon, typically the words collected from some annotated training data along with their associated properties. We demonstrate here the utility of corpus-independent lexicons derived from machine readable dictionaries. Lexical information is encoded in the form of features in a Conditional Random Field tagger, providing improved performance in cases where (i) limited training data is available, (ii) the data is caseless, and (iii) the genre or domain of the test data differs from that of the training data. We show substantial error reductions, especially on unknown words, for the tasks of part-of-speech tagging and shallow parsing, achieving up to a 20% error reduction on Penn TreeBank part-of-speech tagging and up to a 15.7% error reduction for shallow parsing using the CoNLL 2000 data. Our results point towards a simple but effective methodology for increasing the adaptability of text processing systems: training models with annotated data in one genre, augmented with general lexical information or lexical information pertinent to the target genre (or domain).

Anthology ID: L06-1236
Volume: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Month: May
Year: 2006
Address: Genoa, Italy
Editors: Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, Daniel Tapias
Venue: LREC
Publisher: European Language Resources Association (ELRA)
External URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/404_pdf.pdf
Cite (ACL): Ben Wellner and Marc Vilain. 2006. Leveraging Machine Readable Dictionaries in Discriminative Sequence Models. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
Cite (Informal): Leveraging Machine Readable Dictionaries in Discriminative Sequence Models (Wellner & Vilain, LREC 2006)