Heike Zinsmeister

2025

Coreference in simplified German: Linguistic features and challenges of automatic annotation
Sarah Jablotschkin | Ekaterina Lapshinova-Koltunski | Heike Zinsmeister
Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference

In this paper, we analyse coreference annotation of the German language, focussing on the phenomenon of simplification, that is, the tendency to use words and constructions that are assumed to be easier to perceive, understand, or produce. Simplification is one of the tools used by language users in order to optimise communication. We are interested in how simplification is reflected in coreference in two different language products exposed to the phenomenon of simplification: simultaneous interpreting and Easy German. For this, we automatically annotate simplified texts with coreference. We then evaluate the outputs of automatic annotation. In addition, we also look into quantitative distributions of some coreference features. Our findings show that although the language products under analysis diverge in terms of simplification driving factors, they share some specific coreference features. We also show that this specificity may cause annotation errors in simplified language, e.g. in non-nominal or split antecedents.

Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)
Sarah Jablotschkin | Sandra Kübler | Heike Zinsmeister
Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)

Beyond semantics: the challenges of annotating pragmatic and discourse phenomena
Stefanie Dipper | Heike Zinsmeister | Bonnie Webber
Dialogue Discourse Volume 16

The goal of this special issue is to show the challenges faced in reliably annotating abstract semantic and pragmatic information at both the sentence and discourse levels, and how those challenges are being met. Such information is frequently not explicitly or unambiguously marked in natural language. It is usually dependent on contextual information, and annotators often have to reconstruct complex relations and situations from the context.

2024

DE-Lite - a New Corpus of Easy German: Compilation, Exploration, Analysis
Sarah Jablotschkin | Elke Teich | Heike Zinsmeister
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

In this paper, we report on a new corpus of simplified German. Public agencies in Germany have recently been asked to provide information in easy language on their outlets (e.g. websites) so as to facilitate participation in society for people with low literacy levels related to learning difficulties or low language proficiency (e.g. L2 speakers). While various rule sets and guidelines for Easy German (a specific variant of simplified German) have emerged over time, it is unclear (a) to what extent authors and other content creators, including generative AI tools, consistently apply them, and (b) how adequate texts in authentic Easy German really are for the intended audiences. As a first step in gaining insights into these issues and to further LT development for simplified German, we compiled DE-Lite, a corpus of easy-to-read texts including Easy German and comparable Standard German texts, by integrating existing collections and gathering new data from the web. We built n-gram models for an Easy German subcorpus of DE-Lite and comparable Standard German texts in order to identify typical features of Easy German. To this end, we use relative entropy (Kullback-Leibler Divergence), a standard technique for evaluating language models, which we apply here for corpus comparison. Our analysis reveals that some rules of Easy German are fairly dominant (e.g. punctuation) and that text genre has a strong effect on the distinctivity of the two language variants.

Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)
Daniel Dakota | Sarah Jablotschkin | Sandra Kübler | Heike Zinsmeister
Proceedings of the 22nd Workshop on Treebanks and Linguistic Theories (TLT 2024)

2023

Personal noun detection for German
Carla Sökefeld | Melanie Andresen | Johanna Binnewitt | Heike Zinsmeister
Proceedings of the 19th Joint ACL-ISO Workshop on Interoperable Semantics (ISA-19)

Personal nouns, i.e. common nouns denoting human beings, play an important role in manifesting gender and gender stereotypes in texts, especially for languages with grammatical gender like German. Automatically detecting and extracting personal nouns can thus be of interest to a myriad of different tasks such as minimizing gender bias in language models and researching gender stereotypes or gender-fair language, but is complicated by the morphological heterogeneity and homonymy of personal and non-personal nouns, which restrict lexicon-based approaches. In this paper, we introduce a classifier created by fine-tuning a transformer model that detects personal nouns in German. Although some phenomena like homonymy and metalinguistic uses are still problematic, the model is able to classify personal nouns with robust accuracy (f1-score: 0.94).

2021

The Impact of Word Embeddings on Neural Dependency Parsing
Benedikt Adelmann | Wolfgang Menzel | Heike Zinsmeister
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

2020

Modeling Ambiguity with Many Annotators and Self-Assessments of Annotator Certainty
Melanie Andresen | Michael Vauth | Heike Zinsmeister
Proceedings of the 14th Linguistic Annotation Workshop

Most annotation efforts assume that annotators will agree on labels, if the annotation categories are well-defined and documented in annotation guidelines. However, this is not always true. For instance, content-related questions such as ‘Is this sentence about topic X?’ are unlikely to elicit the same answer from all annotators. Additional specifications in the guidelines are helpful to some extent, but can soon get overspecified by rules that cannot be justified by a research question. In this study, we model the semantic category ‘illness’ and its use in a gradual way. For this purpose, we (i) ask many annotators (30 votes per item, 960 items) for their opinion in a crowdsourcing experiment, (ii) ask annotators to indicate their certainty with respect to their annotation, and (iii) compare this across two different text types. We show that results of multiple annotations and average annotator certainty correlate, but many ambiguities can only be captured if several people contribute. The annotated data allow us to filter for sentences with high or low agreement and analyze causes of disagreement, thus getting a better understanding of people’s perception of illness—as an example of a semantic category—as well as of the content of our annotated texts.

2018

Survey: Anaphora With Non-nominal Antecedents in Computational Linguistics: a Survey
Varada Kolhatkar | Adam Roussel | Stefanie Dipper | Heike Zinsmeister
Computational Linguistics, Volume 44, Issue 3 - September 2018

This article provides an extensive overview of the literature related to the phenomenon of non-nominal-antecedent anaphora (also known as abstract anaphora or discourse deixis), a type of anaphora in which an anaphor like “that” refers to an antecedent (marked in boldface) that is syntactically non-nominal, such as the first sentence in “It’s way too hot here. That’s why I’m moving to Alaska.” Annotating and automatically resolving these cases of anaphora is interesting in its own right because of the complexities involved in identifying non-nominal antecedents, which typically represent abstract objects such as events, facts, and propositions. There is also practical value in the resolution of non-nominal-antecedent anaphora, as this would help computational systems in machine translation, summarization, and question answering, as well as, conceivably, any other task dependent on some measure of text understanding. Most of the existing approaches to anaphora annotation and resolution focus on nominal-antecedent anaphora, classifying many of the cases where the antecedents are syntactically non-nominal as non-anaphoric. There has been some work done on this topic, but it remains scattered and difficult to collect and assess. With this article, we hope to bring together and synthesize work done in disparate contexts up to now in order to identify fundamental problems and draw conclusions from an overarching perspective. Having a good picture of the current state of the art in this field can help researchers direct their efforts to where they are most necessary. Because of the great variety of theoretical approaches that have been brought to bear on the problem, there is an equally diverse array of terminologies that are used to describe it, so we will provide an overview and discussion of these terminologies. 
We also describe the linguistic properties of non-nominal-antecedent anaphora, examine previous annotation efforts that have addressed this topic, and present the computational approaches that aim at resolving non-nominal-antecedent anaphora automatically. We close with a review of the remaining open questions in this area and some of our recommendations for future research.

The ARRAU corpus is an anaphorically annotated corpus of English providing rich linguistic information about anaphora resolution. The most distinctive feature of the corpus is the annotation of a wide range of anaphoric relations, including bridging references and discourse deixis in addition to identity (coreference). Other distinctive features include treating all NPs as markables, including non-referring NPs; and the annotation of a variety of morphosyntactic and semantic mention and entity attributes, including the genericity status of the entities referred to by markables. The corpus however has not been extensively used for anaphora resolution research so far. In this paper, we discuss three datasets extracted from the ARRAU corpus to support the three subtasks of the CRAC 2018 Shared Task–identity anaphora resolution over ARRAU-style markables, bridging references resolution, and discourse deixis; the evaluation scripts assessing system performance on those datasets; and preliminary results on these three tasks that may serve as baseline for subsequent research in these phenomena.

2017

The Benefit of Syntactic vs. Linear N-grams for Linguistic Description
Melanie Andresen | Heike Zinsmeister
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

Approximating Style by N-gram-based Annotation
Melanie Andresen | Heike Zinsmeister
Proceedings of the Workshop on Stylistic Variation

The concept of style is much debated in theoretical as well as empirical terms. From an empirical perspective, the key question is how to operationalize style and thus make it accessible for annotation and quantification. In authorship attribution, many different approaches have successfully resolved this issue at the cost of linguistic interpretability: The resulting algorithms may be able to distinguish one language variety from the other, but do not give us much information on their distinctive linguistic properties. We approach the issue of interpreting stylistic features by extracting linear and syntactic n-grams that are distinctive for a language variety. We present a study that exemplifies this process by a comparison of the German academic languages of linguistics and literary studies. Overall, our findings show that distinctive n-grams can be related to linguistic categories. The results suggest that the style of German literary studies is characterized by nominal structures and the style of linguistics by verbal ones.

The Stuttgart-Tübingen TagSet (STTS) is a de-facto standard for the part-of-speech tagging of German texts. Since its first publication in 1995, STTS has been used in a variety of annotation projects, some of which have adapted the tagset slightly for their specific needs. Recently, the focus of many projects has shifted from the analysis of newspaper text to that of non-standard varieties such as user-generated content, historical texts, and learner language. These text types contain linguistic phenomena that are missing from or are only suboptimally covered by STTS; in a community effort, German NLP researchers have therefore proposed additions to and modifications of the tagset that will handle these phenomena more appropriately. In addition, they have discussed alternative ways of tag assignment in terms of bipartite tags (stem, token) for historical texts and tripartite tags (lexicon, morphology, distribution) for learner texts. In this article, we report on this ongoing activity, addressing methodological issues and discussing selected phenomena and their treatment in the tagset adaptation process.

2013

Interpreting Anaphoric Shell Nouns using Antecedents of Cataphoric Shell Nouns as Training Data
Varada Kolhatkar | Heike Zinsmeister | Graeme Hirst
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Annotating Anaphoric Shell Nouns with their Antecedents
Varada Kolhatkar | Heike Zinsmeister | Graeme Hirst
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

The Use of Parallel and Comparable Data for Analysis of Abstract Anaphora in German and English
Stefanie Dipper | Melanie Seiss | Heike Zinsmeister
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Parallel corpora, original texts aligned with their translations, are a widely used resource in computational linguistics. Translation studies have shown that translated texts often differ systematically from comparable original texts. Translators tend to be faithful to structures of the original texts, resulting in a "shining through" of the original language preferences in the translated text. Translators also tend to make their translations most comprehensible with the effect that translated texts can be more explicit than their source texts. Motivated by the need to use a parallel resource for cross-linguistic feature induction in abstract anaphora resolution, this paper investigates properties of English and German texts in the Europarl corpus, taking into account both general features such as sentence length as well as task-dependent features such as the distribution of demonstrative noun phrases. The investigation is based on the entire Europarl corpus as well as on a small subset thereof, which has been manually annotated. The results indicate English translated texts are sufficiently "authentic" to be used as training data for anaphora resolution; results for German texts are less conclusive, though.