Emily Danchik
2014
Comprehensive Annotation of Multiword Expressions in a Social Web Corpus
Nathan Schneider | Spencer Onuffer | Nora Kazour | Emily Danchik | Michael T. Mordowanec | Henrietta Conrad | Noah A. Smith
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Multiword expressions (MWEs) are quite frequent in languages such as English, but their diversity, the scarcity of individual MWE types, and contextual ambiguity have presented obstacles to corpus-based studies and NLP systems addressing them as a class. Here we advocate for a comprehensive annotation approach: proceeding sentence by sentence, our annotators manually group tokens into MWEs according to guidelines that cover a broad range of multiword phenomena. Under this scheme, we have fully annotated an English web corpus for multiword expressions, including those containing gaps.
Discriminative Lexical Semantic Segmentation with Gaps: Running the MWE Gamut
Nathan Schneider | Emily Danchik | Chris Dyer | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 2
We present a novel representation, evaluation measure, and supervised models for the task of identifying the multiword expressions (MWEs) in a sentence, resulting in a lexical semantic segmentation. Our approach generalizes a standard chunking representation to encode MWEs containing gaps, thereby enabling efficient sequence tagging algorithms for feature-rich discriminative models. Experiments on a new dataset of English web text offer the first linguistically-driven evaluation of MWE identification with truly heterogeneous expression types. Our statistical sequence model greatly outperforms a lookup-based segmentation procedure, achieving nearly 60% F1 for MWE identification.
Co-authors
- Nathan Schneider 2
- Noah A. Smith 2
- Henrietta Conrad 1
- Chris Dyer 1
- Nora Kazour 1