Andreas Blombach


2024

pdf bib
Automatic Identification of COVID-19-Related Conspiracy Narratives in German Telegram Channels and Chats
Philipp Heinrich | Andreas Blombach | Bao Minh Doan Dang | Leonardo Zilio | Linda Havenstein | Nathan Dykes | Stephanie Evert | Fabian Schäfer
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We are concerned with mapping the discursive landscape of conspiracy narratives surrounding the COVID-19 pandemic. In the present study, we analyse a corpus of more than 1,000 German Telegram posts tagged with 14 fine-grained conspiracy narrative labels by three independent annotators. Since emerging narratives on social media are short-lived and notoriously hard to track, we experiment with different state-of-the-art approaches to few-shot and zero-shot text classification. We report performance in terms of ROC-AUC and in terms of optimal F1, and compare fine-tuned methods with off-the-shelf approaches and human performance.

2020

pdf bib
EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus
Thomas Proisl | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Andreas Blombach | Stefan Evert
Proceedings of the Twelfth Language Resources and Evaluation Conference

The EmpiriST corpus (Beißwenger et al., 2016) is a manually tokenized and part-of-speech tagged corpus of approximately 23,000 tokens of German Web and CMC (computer-mediated communication) data. We extend the corpus with manually created annotation layers for word form normalization, lemmatization and lexical semantics. All annotations have been independently performed by multiple human annotators. We report inter-annotator agreements and results of baseline systems and state-of-the-art off-the-shelf tools.

pdf bib
A Corpus of German Reddit Exchanges (GeRedE)
Andreas Blombach | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Thomas Proisl
Proceedings of the Twelfth Language Resources and Evaluation Conference

GeRedE is a 270 million token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. Starting from a large, freely available data set, the paper describes our approach to filter out German data and further pre-processing steps, as well as which metadata and annotation layers have been included so far. We explore the Reddit sphere, what makes the German data linguistically peculiar, and how some of the communities within Reddit differ from one another. The CWB-indexed version of our final corpus is available via CQPweb, and all our processing scripts as well as all manual annotation and automatic language classification can be downloaded from GitHub.

2019

pdf bib
The_Illiterati: Part-of-Speech Tagging for Magahi and Bhojpuri without even knowing the alphabet
Thomas Proisl | Peter Uhrig | Andreas Blombach | Natalie Dykes | Philipp Heinrich | Besim Kabashi | Sefora Mammarella
Proceedings of the First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers