Rachel Lathbury
2010
New Tools for Web-Scale N-grams
Dekang Lin | Kenneth Church | Heng Ji | Satoshi Sekine | David Yarowsky | Shane Bergsma | Kailash Patil | Emily Pitler | Rachel Lathbury | Vikram Rao | Kapil Dalwani | Sushant Narsale
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
While the web provides a fantastic linguistic resource, collecting and processing data at web scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text: it states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags and collectively lower the barrier for lexical learning and ambiguity resolution at web scale. They will allow novel sources of information to be applied to long-standing natural language challenges.
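To make the abstract's definition of an N-gram corpus concrete, the following is a minimal Python sketch of counting every word sequence up to length N in a small text collection. It is an illustration only, not the tooling described in the paper: the function name `count_ngrams`, the whitespace tokenization, and the toy sentences are all assumptions introduced here.

```python
from collections import Counter


def count_ngrams(sentences, max_n):
    """Count every word sequence of length 1..max_n in the given sentences.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption rather than the paper's method.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            # Slide a window of width n across the token list.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy data (invented for this example).
toy_corpus = ["the cat sat", "the cat ran"]
counts = count_ngrams(toy_corpus, max_n=2)

for ngram, freq in sorted(counts.items()):
    print(" ".join(ngram), freq)
```

The resulting table of sequences and frequencies is all that an N-gram corpus retains. The original sentences are not recoverable from it, which is why it can be far smaller than the text it summarizes.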