Christoph Schwarz
2012
Large Scale Lexical Analysis
Gregor Thurmair | Vera Aleksić | Christoph Schwarz
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper presents a lexical analysis component as implemented in the PANACEA project. The goal is to automatically extract lexicon entries from crawled corpora, using corpus-based methods for high-quality linguistic text processing and focusing on data quality without neglecting quantitative aspects. The task of lexical analysis is to assign linguistic information (such as part of speech, inflectional class, gender, subcategorisation frame, and semantic properties) to all parts of the input text. If tokens are ambiguous, lexical analysis must provide all possible sets of annotations for later (syntactic) disambiguation, whether by tagging or by full parsing. The paper presents an approach for assigning part-of-speech tags for German and English to large input corpora (> 50 million tokens), providing a workflow that takes crawled corpora as input and produces POS-tagged lemmata ready for lexicon integration. Tools include sentence splitting, lexicon lookup, decomposition, and POS defaulting. Evaluation shows that the overall error rate can be brought down to about 2% if language resources are properly designed. The complete workflow is implemented as a sequence of web services integrated into the PANACEA platform.
2006
LiSa – morphological analysis for information retrieval
Hans Hjelm | Christoph Schwarz
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)