2010
pdf
bib
How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing
Fabienne Fritzinger
|
Alexander Fraser
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR
pdf
bib
abs
Pattern-Based Extraction of Negative Polarity Items from Dependency-Parsed Text
Fabienne Fritzinger
|
Frank Richter
|
Marion Weller
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We describe a new method for extracting Negative Polarity Item candidates (NPI candidates) from dependency-parsed German text corpora. Semi-automatic extraction of NPIs is a challenging task since NPIs do not have uniform categorical or other syntactic properties that could be used for detecting them; they occur as single words or as multi-word expressions of almost any syntactic category. Their defining property is of a semantic nature, they may only occur in the scope of negation and related semantic operators. In contrast to an earlier approach to NPI extraction from corpora, we specifically target multi-word expressions. Besides applying statistical methods to measure the co-occurrence of our candidate expressions with negative contexts, we also apply linguistic criteria in an attempt to determine to which degree they are idiomatic. Our method is evaluated by comparing the set of NPIs we found with the most comprehensive electronic list of German NPIs, which currently contains 165 entries. Our method retrieved 142 NPIs, 114 of which are new.
pdf
bib
abs
Term and Collocation Extraction by Means of Complex Linguistic Web Services
Ulrich Heid
|
Fabienne Fritzinger
|
Erhard Hinrichs
|
Marie Hinrichs
|
Thomas Zastrow
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
We present a web service-based environment for the use of linguistic resources and tools to address issues of terminology and language varieties. We discuss the architecture, corpus representation formats, components and a chainer supporting the combination of tools into task-specific services. Integrated into this environment, single web services also become part of complex scenarios for web service use. Our web services take for example corpora of several million words as an input on which they perform preprocessing, such as tokenisation, tagging, lemmatisation and parsing, and corpus exploration, such as collocation extraction and corpus comparison. Here we present an example on extraction of single and multiword items typical of a specific domain or typical of a regional variety of German. We also give a critical review on needs and available functions from a user's point of view. The work presented here is part of ongoing experimentation in the D-SPIN project, the German national counterpart of CLARIN.
pdf
bib
abs
A Survey of Idiomatic Preposition-Noun-Verb Triples on Token Level
Fabienne Fritzinger
|
Marion Weller
|
Ulrich Heid
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Most of the research on the extraction of idiomatic multiword expressions (MWEs) focused on the acquisition of MWE types. In the present work we investigate whether a text instance of a potentially idiomatic MWE is actually used idiomatically in a given context or not. Inspired by the dataset provided by (Cook et al., 2008), we manually analysed 9,700 instances of potentially idiomatic prepositionnoun- verb triples (a frequent pattern among German MWEs) to identify, on token level, idiomatic vs. literal uses. In our dataset, all sentences are provided along with their morpho-syntactic properties. We describe our data extraction and annotation steps, and we discuss quantitative results from both EUROPARL and a German newspaper corpus. We discuss the relationship between idiomaticity and morpho-syntactic fixedness, and we address issues of ambiguity between literal and idiomatic use of MWEs. Our data show that EUROPARL is particularly well suited for MWE extraction, as most MWEs in this corpus are indeed used only idiomatically.