Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds.

Ekaterina Lapshinova-Koltunski, Ulrich Heid


Abstract
In this paper we discuss an approach to the semi-automatic extraction and classification of the compounds extracted from German corpora. Compound nominals are semi-automatically extracted from text corpora along with their sentential complements. In this study we concentrate on that­, wh­ or if subclauses although our methods can be applied to other complements as well. We elaborate an architecture using linguistic knowledge about the phenomena we extract, and aim at answering the following questions: how can data about subcategorisation properties of nominal compounds be extracted from text corpora, and how can compounds be classified according to their subcategorisation properties? Our classification is based on the relationships between the subcategorisation of nominal compounds, e.g. Grundfrage, Wettstreit and Beweismittel, and that of their constituent parts, such as Frage, Streit, Beweis, etc. We show that there are cases which do not match the commonly accepted assumption that the head of a compound is its valency bearer. Such cases should receive a specific treatment in NLP dictionary building. This calls for tools to identify and classify such cases by means of data extraction from corpora. We propose precision-oriented semi­automatic extraction which can operate on tokenized, tagged and lemmatized texts. In the future, we are going to extend the kinds of extracted complements beyond subclauses and analyze the nature of the non-head valency-bearer of compounds, as well as an extension of the kinds of extracted complements beyond subclauses.
Anthology ID:
L08-1449
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/88_paper.pdf
DOI:
Bibkey:
Cite (ACL):
Ekaterina Lapshinova-Koltunski and Ulrich Heid. 2008. Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds.. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal):
Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds. (Lapshinova-Koltunski & Heid, LREC 2008)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/88_paper.pdf