2022
MONAPipe: Modes of Narration and Attribution Pipeline for German Computational Literary Studies and Language Analysis in spaCy
Tillmann Dönicke | Florian Barth | Hanna Varachkina | Caroline Sporleder
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)
Levels of Non-Fictionality in Fictional Texts
Florian Barth | Hanna Varachkina | Tillmann Dönicke | Luisa Gödeke
Proceedings of the 18th Joint ACL - ISO Workshop on Interoperable Semantic Annotation within LREC2022
The annotation and automatic recognition of non-fictional discourse within a text is an important, yet unresolved task in literary research. While non-fictional passages can consist of several clauses or sentences, we argue that 1) an entity-level classification of fictionality and 2) the linking of Wikidata identifiers can be used to automatically identify (non-)fictional discourse. We query Wikidata and DBpedia for relevant information about a requested entity as well as the corresponding literary text to determine the entity’s fictionality status and assign a Wikidata identifier, if unequivocally possible. We evaluate our methods on an exemplary text from our diachronic literary corpus, where our methods classify 97% of persons and 62% of locations correctly as fictional or real. Furthermore, 75% of the resolved persons and 43% of the resolved locations are resolved correctly. In a quantitative experiment, we apply the entity-level fictionality tagger to our corpus and conclude that more non-fictional passages can be identified when information about real entities is available.
2021
Annotating Quantified Phenomena in Complex Sentence Structures Using the Example of Generalising Statements in Literary Texts
Tillmann Dönicke | Luisa Gödeke | Hanna Varachkina
Proceedings of the 17th Joint ACL - ISO Workshop on Interoperable Semantic Annotation
We present a tagset for the annotation of quantification which we currently use to annotate certain quantified statements in fictional works of literature. Literary texts feature a rich variety in expressing quantification, including a broad range of lexemes to express quantifiers and complex sentence structures to express the restrictor and the nuclear scope of a quantification. Our tagset consists of seven tags and covers all types of quantification that occur in natural language, including vague quantification and generic quantification. In the second part of the paper, we introduce our German corpus with annotations of generalising statements, which form a proper subset of quantified statements.
A Unified Approach to Discourse Relation Classification in nine Languages
Hanna Varachkina | Franziska Pannach
Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021)
This paper presents efforts to solve the shared task on discourse relation classification (DISRPT Task 3). The intricate prediction task aims to predict a large number of classes from the Rhetorical Structure Theory (RST) framework for nine target languages. Labels include discourse relations such as background, condition, contrast and elaboration. We present an approach using Euclidean distance between sentence embeddings that were extracted using multilingual sentence BERT (sBERT) and directionality as features. The data was combined into five classes which were used for initial prediction. The second classification step predicts the target classes. We observe a substantial difference in results depending on the number of occurrences of the target label in the training data. We achieve the best results on Chinese, where our system achieves 70% accuracy on 20 labels.
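The feature construction named in this abstract (Euclidean distance between two sentence embeddings, plus directionality) can be illustrated with a minimal sketch. It is not the authors' implementation. The toy vectors stand in for multilingual sBERT embeddings, and the function names and the `"1>2"` / `"1<2"` direction encoding are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_features(
    emb_unit1: Sequence[float],
    emb_unit2: Sequence[float],
    direction: str,
) -> List[float]:
    """Build a two-element feature vector for one pair of discourse units.

    direction is a hypothetical encoding of the relation's directionality:
    "1>2" maps to 1.0, anything else (e.g. "1<2") maps to 0.0.
    """
    distance = euclidean_distance(emb_unit1, emb_unit2)
    direction_flag = 1.0 if direction == "1>2" else 0.0
    return [distance, direction_flag]


# Toy 2-dimensional "embeddings"; real sentence embeddings would have
# several hundred dimensions and come from a sentence encoder.
features = pair_features([0.0, 3.0], [4.0, 0.0], "1>2")
print(features)
```

Feature vectors of this kind could then be passed to any standard classifier, first for the coarse classes and then for the target labels, as the abstract's two-step description suggests.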
2020
#GCDH at WNUT-2020 Task 2: BERT-Based Models for the Detection of Informativeness in English COVID-19 Related Tweets
Hanna Varachkina | Stefan Ziehe | Tillmann Dönicke | Franziska Pannach
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
In this system paper, we present a transformer-based approach to the detection of informativeness in English tweets on the topic of the current COVID-19 pandemic. Our models distinguish informative tweets, i.e. tweets containing statistics on recovery, suspected and confirmed cases and COVID-19 related deaths, from uninformative tweets. We present two transformer-based approaches as well as a Naive Bayes classifier and a support vector machine as baseline systems. The transformer models outperform the baselines by more than 0.1 in F1-score, with F1-scores of 0.9091 and 0.9036. Our models were submitted to the shared task "Identification of informative COVID-19 English tweets" (WNUT-2020 Task 2).