An Annotated Social Media Corpus for German

Eckhard Bick


Abstract
This paper presents the German Twitter section of a large (2 billion word) bilingual Social Media corpus for Hate Speech research, discussing the compilation, pseudonymization and grammatical annotation of the corpus, as well as special linguistic features and peculiarities encountered in the data. Among other things, compounding, accidental and intentional orthographic variation, gendering and the use of emoticons/emojis are addressed in a genre-specific fashion. We present the different layers of linguistic annotation (morphosyntax, dependencies and semantic types) and explain how a general parser (GerGram) can be made to work on Social Media data, pointing out necessary adaptations and extensions. In an evaluation run on a random cross-section of tweets, the modified parser achieved F-scores of 97% for morphology (fine-grained POS) and 92% for syntax (labeled attachment score). Predictably, performance was twice as good in tweets with standard orthography as in tweets with spelling/casing irregularities or a lack of sentence separation, the effect being more marked for morphology than for syntax.
Anthology ID:
2020.lrec-1.752
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6127–6135
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.752
Cite (ACL):
Eckhard Bick. 2020. An Annotated Social Media Corpus for German. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6127–6135, Marseille, France. European Language Resources Association.
Cite (Informal):
An Annotated Social Media Corpus for German (Bick, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.752.pdf