Tillmann Dönicke

2024

A First Look at the Ugaritic Poetic Text Corpus
Tillmann Dönicke | Clemens Steinberger | Max-Ferdinand Zeterberg | Noah Kröll
Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)

For the Ugaritic poetic texts there is currently no digital corpus including extensive philological and poetological annotations. Within the research project “Edition des ugaritischen poetischen Textkorpus” (EUPT), these texts are digitised and provided as an online-accessible corpus. This paper briefly introduces the project and outlines the principles of the data model. The focus is on the different annotation levels and their connection with each other.

2023

Exploring Automatic Text Simplification of German Narrative Documents
Thorben Schomacker | Tillmann Dönicke | Marina Tropmann-Frick
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)

2022

Rule-Based Clause-Level Morphology for Multiple Languages
Tillmann Dönicke
Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)

This paper describes an approach for the morphosyntactic analysis of clauses, including the analysis of composite verb forms and both overt and covert pronouns. The approach uses grammatical rules for verb inflection and clause-internal word agreement to compute a clause’s morphosyntactic features from the morphological features of the individual words. The approach is tested for eight languages in the 1st Shared Task on Multilingual Clause-Level Morphology, where it achieves F1 scores between 79% and 99% (94% on average).

Levels of Non-Fictionality in Fictional Texts
Florian Barth | Hanna Varachkina | Tillmann Dönicke | Luisa Gödeke
Proceedings of the 18th Joint ACL - ISO Workshop on Interoperable Semantic Annotation within LREC2022

The annotation and automatic recognition of non-fictional discourse within a text is an important, yet unresolved task in literary research. While non-fictional passages can consist of several clauses or sentences, we argue that 1) an entity-level classification of fictionality and 2) the linking of Wikidata identifiers can be used to automatically identify (non-)fictional discourse. We query Wikidata and DBpedia for relevant information about a requested entity as well as the corresponding literary text to determine the entity’s fictionality status and assign a Wikidata identifier, if unequivocally possible. We evaluate our methods on an exemplary text from our diachronic literary corpus, where our methods classify 97% of persons and 62% of locations correctly as fictional or real. Furthermore, 75% of the resolved persons and 43% of the resolved locations are resolved correctly. In a quantitative experiment, we apply the entity-level fictionality tagger to our corpus and conclude that more non-fictional passages can be identified when information about real entities is available.

MONAPipe: Modes of Narration and Attribution Pipeline for German Computational Literary Studies and Language Analysis in spaCy
Tillmann Dönicke | Florian Barth | Hanna Varachkina | Caroline Sporleder
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)

2021

Delexicalised Multilingual Discourse Segmentation for DISRPT 2021 and Tense, Mood, Voice and Modality Tagging for 11 Languages
Tillmann Dönicke
Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021)

This paper describes our participating system for the Shared Task on Discourse Segmentation and Connective Identification across Formalisms and Languages. Key features of the presented approach are the formulation as a clause-level classification task, a language-independent feature inventory based on Universal Dependencies grammar, and composite-verb-form analysis. The achieved F1 is 92% for German and English and lower for other languages. The paper also presents a clause-level tagger for grammatical tense, aspect, mood, voice and modality in 11 languages.

Annotating Quantified Phenomena in Complex Sentence Structures Using the Example of Generalising Statements in Literary Texts
Tillmann Dönicke | Luisa Gödeke | Hanna Varachkina
Proceedings of the 17th Joint ACL - ISO Workshop on Interoperable Semantic Annotation

We present a tagset for the annotation of quantification which we currently use to annotate certain quantified statements in fictional works of literature. Literary texts feature a rich variety in expressing quantification, including a broad range of lexemes to express quantifiers and complex sentence structures to express the restrictor and the nuclear scope of a quantification. Our tagset consists of seven tags and covers all types of quantification that occur in natural language, including vague quantification and generic quantification. In the second part of the paper, we introduce our German corpus with annotations of generalising statements, which form a proper subset of quantified statements.

2020

#GCDH at WNUT-2020 Task 2: BERT-Based Models for the Detection of Informativeness in English COVID-19 Related Tweets
Hanna Varachkina | Stefan Ziehe | Tillmann Dönicke | Franziska Pannach
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

In this system paper, we present a transformer-based approach to the detection of informativeness in English tweets on the topic of the current COVID-19 pandemic. Our models distinguish informative tweets, i.e. tweets containing statistics on recovery, suspected and confirmed cases and COVID-19 related deaths, from uninformative tweets. We present two transformer-based approaches as well as a Naive Bayes classifier and a support vector machine as baseline systems. The transformer models outperform the baselines by more than 0.1 in F1-score, with F1-scores of 0.9091 and 0.9036. Our models were submitted to the shared task Identification of informative COVID-19 English tweets WNUT-2020 Task 2.

Identifying and Handling Cross-Treebank Inconsistencies in UD: A Pilot Study
Tillmann Dönicke | Xiang Yu | Jonas Kuhn
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

The Universal Dependencies treebanks are a still-growing collection of treebanks for a wide range of languages, all annotated with a common inventory of dependency relations. Yet, the usages of the relations can be categorically different even for treebanks of the same language. We present a pilot study on identifying such inconsistencies in a language-independent way and conduct an experiment which illustrates that a proper handling of inconsistencies can improve parsing performance by several percentage points.

Real-Valued Logics for Typological Universals: Framework and Application
Tillmann Dönicke | Xiang Yu | Jonas Kuhn
Proceedings of the 28th International Conference on Computational Linguistics

This paper proposes a framework for the expression of typological statements which uses real-valued logics to capture the empirical truth value (truth degree) of a formula on a given data source, e.g. a collection of multilingual treebanks with comparable annotation. The formulae can be arbitrarily complex expressions of propositional logic. To illustrate the usefulness of such a framework, we present experiments on the Universal Dependencies treebanks for two use cases: (i) empirical (re-)evaluation of established formulae against the spectrum of available treebanks and (ii) evaluating new formulae (i.e. potential candidates for universals) generated by a search algorithm.

Clause-Level Tense, Mood, Voice and Modality Tagging for German
Tillmann Dönicke
Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories

2019

Multiclass Text Classification on Unbalanced, Sparse and Noisy Data
Tillmann Dönicke | Matthias Damaschk | Florian Lux
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

This paper discusses methods to improve the performance of text classification on data that is difficult to classify due to a large number of unbalanced classes with noisy examples. A variety of features are tested, in combination with three different neural-network-based methods with increasing complexity. The classifiers are applied to a songtext–artist dataset which is large, unbalanced and noisy. We come to the conclusion that substantial improvement can be obtained by removing unbalancedness and sparsity from the data. The classification task is still not solved satisfactorily; with contemporary methods, however, this is a practical step towards fairly satisfactory results.