Katarzyna Głowińska
2010
The Design of Syntactic Annotation Levels in the National Corpus of Polish
Katarzyna Głowińska
|
Adam Przepiórkowski
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The paper presents the procedure of syntactic annotation of the National Corpus of Polish. The paper concentrates on the delimitation of syntactic words (analytical forms, reflexive verbs, discontinuous conjunctions, etc.) and syntactic groups, as well as on problems encountered during the annotation process: syntactic group boundaries, multiword entities, abbreviations, discontinuous phrases and syntactic words. It includes the complete tagset for syntactic words and the list of syntactic groups recognized in NKJP. The tagset defines grammatical classes and categories according to morphosyntactic and syntactic criteria only. Syntactic annotation in the National Corpus of Polish is limited to making constituents of combinations of words. Annotation depends on shallow parsing and manual post-editing of the results by annotators. Manual annotation is performed by two independents annotators, with a referee in cases of disagreement. The manually constructed grammar, both for syntactic words and for syntactic groups, is encoded in the shallow parsing system Spejd.
Domain-related Annotation of Polish Spoken Dialogue Corpus LUNA.PL
Agnieszka Mykowiecka
|
Katarzyna Głowińska
|
Joanna Rabiega-Wiśniewska
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The paper presents a corpus of Polish spoken dialogues annotated on several levels, from transcription of dialogues and their morphosyntactic analysis, to semantic annotation. The LUNA.PL corpus is the first semantically annotated corpus of Polish spontaneous speech. It contains 500 dialogues recorded at the Warsaw Transport Authority call centre. For each dialogue, the corpus contains recorded audio signal, its transcription and five XML files with annotations on subsequent levels. Speech transcription was done manually. Text annotation was constructed using a combination of rule based programmes and computer-aided manual work. For morphological annotation we used the already existing analyzer and manually disambiguated the results. Morphologically annotated texts of dialogues were automatically segmented into elementary syntactic chunks. Semantic annotation was done by a set of specially designed rules and then manually corrected. The paper describes details of the domain related semantic annotation which consists of two levels - concept level at which around 200 attributes and their values are annotated, and predicate level at which 47 frame types are recognized. We describe the domain model accepted, and the statistics over the entire annotated set of dialogues.