Stefanie Dipper


2021

Identifikation von Vorkommensformen der Lemmata in Quellenzitaten frühneuhochdeutscher Lexikoneinträge
Stefanie Dipper | Jan Christian Schaffert
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

2020

Proceedings of the 14th Linguistic Annotation Workshop
Stefanie Dipper | Amir Zeldes
Proceedings of the 14th Linguistic Annotation Workshop

Automatic Orality Identification in Historical Texts
Katrin Ortmann | Stefanie Dipper
Proceedings of the 12th Language Resources and Evaluation Conference

Independently of the medial representation (written/spoken), language can exhibit characteristics of conceptual orality or literacy, which mainly manifest themselves on the lexical or syntactic level. In this paper we aim at automatically identifying conceptually-oral historical texts, with the ultimate goal of gaining knowledge about spoken data of historical time stages. We apply a set of general linguistic features that have been proven to be effective for the classification of modern language data to historical German texts from various registers. Many of the features turn out to be equally useful in determining the conceptuality of historical data as they are for modern data, especially the frequency of different types of pronouns and the ratio of verbs to nouns. Other features like sentence length, particles or interjections point to peculiarities of the historical data and reveal problems with the adoption of a feature set that was developed on modern language data.

2019

Variation between Different Discourse Types: Literate vs. Oral
Katrin Ortmann | Stefanie Dipper
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper deals with the automatic identification of literate and oral discourse in German texts. A range of linguistic features is selected and their role in distinguishing between literate- and oral-oriented registers is investigated, using a decision-tree classifier. It turns out that all of the investigated features are related in some way to oral conceptuality. Especially simple measures of complexity (average sentence and word length) are prominent indicators of oral and literate discourse. In addition, features of reference and deixis (realized by different types of pronouns) also prove to be very useful in determining the degree of orality of different registers.

The making of the Litkey Corpus, a richly annotated longitudinal corpus of German texts written by primary school children
Ronja Laarmann-Quante | Stefanie Dipper | Eva Belke
Proceedings of the 13th Linguistic Annotation Workshop

To date, corpus and computational linguistic work on written language acquisition has mostly dealt with second language learners who have usually already mastered orthography acquisition in their first language. In this paper, we present the Litkey Corpus, a richly-annotated longitudinal corpus of written texts produced by primary school children in Germany from grades 2 to 4. The paper focuses on the (semi-)automatic annotation procedure at various linguistic levels, which include POS tags, features of the word-internal structure (phonemes, syllables, morphemes) and key orthographic features of the target words as well as a categorization of spelling errors. Comprehensive evaluations show that high accuracy was achieved on all levels, making the Litkey Corpus a useful resource for corpus-based research on literacy acquisition of German primary school children and for developing NLP tools for educational purposes. The corpus is freely available at https://www.linguistics.rub.de/litkeycorpus/.

2018

Anaphora With Non-nominal Antecedents in Computational Linguistics: a Survey
Varada Kolhatkar | Adam Roussel | Stefanie Dipper | Heike Zinsmeister
Computational Linguistics, Volume 44, Issue 3 - September 2018

This article provides an extensive overview of the literature related to the phenomenon of non-nominal-antecedent anaphora (also known as abstract anaphora or discourse deixis), a type of anaphora in which an anaphor like “that” refers to an antecedent (marked in boldface) that is syntactically non-nominal, such as the first sentence in “It’s way too hot here. That’s why I’m moving to Alaska.” Annotating and automatically resolving these cases of anaphora is interesting in its own right because of the complexities involved in identifying non-nominal antecedents, which typically represent abstract objects such as events, facts, and propositions. There is also practical value in the resolution of non-nominal-antecedent anaphora, as this would help computational systems in machine translation, summarization, and question answering, as well as, conceivably, any other task dependent on some measure of text understanding. Most of the existing approaches to anaphora annotation and resolution focus on nominal-antecedent anaphora, classifying many of the cases where the antecedents are syntactically non-nominal as non-anaphoric. There has been some work done on this topic, but it remains scattered and difficult to collect and assess. With this article, we hope to bring together and synthesize work done in disparate contexts up to now in order to identify fundamental problems and draw conclusions from an overarching perspective. Having a good picture of the current state of the art in this field can help researchers direct their efforts to where they are most necessary. Because of the great variety of theoretical approaches that have been brought to bear on the problem, there is an equally diverse array of terminologies that are used to describe it, so we will provide an overview and discussion of these terminologies. We also describe the linguistic properties of non-nominal-antecedent anaphora, examine previous annotation efforts that have addressed this topic, and present the computational approaches that aim at resolving non-nominal-antecedent anaphora automatically. We close with a review of the remaining open questions in this area and some of our recommendations for future research.

2017

Variance in Historical Data: How bad is it and how can we profit from it for historical linguistics?
Stefanie Dipper
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

Investigating Diatopic Variation in a Historical Corpus
Stefanie Dipper | Sandra Waldenberger
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper investigates diatopic variation in a historical corpus of German. Based on equivalent word forms from different language areas, replacement rules and mappings are derived which describe the relations between these word forms. These rules and mappings are then interpreted as reflections of morphological, phonological or graphemic variation. Based on sample rules and mappings, we show that our approach can replicate results from historical linguistics. While previous studies were restricted to predefined word lists, or confined to single authors or texts, our approach uses a much wider range of data available in historical corpora.

Annotating Orthographic Target Hypotheses in a German L1 Learner Corpus
Ronja Laarmann-Quante | Katrin Ortmann | Anna Ehlert | Maurice Vogel | Stefanie Dipper
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

NLP applications for learners often rely on annotated learner corpora. It is therefore important that the annotations are meaningful for the task as well as consistent and reliable. We present a new longitudinal L1 learner corpus for German (handwritten texts collected in grades 2–4), which is transcribed and annotated with a target hypothesis that strictly only corrects orthographic errors, and is thus tailored to research and tool development for orthographic issues in primary school. While for most corpora, transcription and target hypothesis are not evaluated, we conducted a detailed inter-annotator agreement study for both tasks. Although we achieved high agreement, our discussion of cases of disagreement shows that even with detailed guidelines, annotators occasionally differ for various reasons, which should also be considered when working with transcriptions and target hypotheses of other corpora, especially if no explicit guidelines for their construction are known.

2016

Annotating Spelling Errors in German Texts Produced by Primary School Children
Ronja Laarmann-Quante | Lukas Knichel | Stefanie Dipper | Carina Betken
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

Evaluating Inter-Annotator Agreement on Historical Spelling Normalization
Marcel Bollmann | Stefanie Dipper | Florian Petran
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

2014

CorA: A web-based annotation tool for historical and other non-standard language data
Marcel Bollmann | Florian Petran | Stefanie Dipper | Julia Krasselt
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2013

Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse
Antonio Pareja-Lora | Maria Liakata | Stefanie Dipper
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

The Use of Parallel and Comparable Data for Analysis of Abstract Anaphora in German and English
Stefanie Dipper | Melanie Seiss | Heike Zinsmeister
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Parallel corpora (original texts aligned with their translations) are a widely used resource in computational linguistics. Translation studies have shown that translated texts often differ systematically from comparable original texts. Translators tend to be faithful to structures of the original texts, resulting in a "shining through" of the original language preferences in the translated text. Translators also tend to make their translations most comprehensible with the effect that translated texts can be more explicit than their source texts. Motivated by the need to use a parallel resource for cross-linguistic feature induction in abstract anaphora resolution, this paper investigates properties of English and German texts in the Europarl corpus, taking into account both general features such as sentence length as well as task-dependent features such as the distribution of demonstrative noun phrases. The investigation is based on the entire Europarl corpus as well as on a small subset thereof, which has been manually annotated. The results indicate English translated texts are sufficiently "authentic" to be used as training data for anaphora resolution; results for German texts are less conclusive, though.

2011

Rule-Based Normalization of Historical Texts
Marcel Bollmann | Florian Petran | Stefanie Dipper
Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage

2010

OTTO: A Transcription and Management Tool for Historical Texts
Stefanie Dipper | Lara Kresse | Martin Schnurrenberger | Seong-Eun Cho
Proceedings of the Fourth Linguistic Annotation Workshop

2009

Annotating Discourse Anaphora
Stefanie Dipper | Heike Zinsmeister
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

Measures for Term and Sentence Relevances: an Evaluation for German
Heike Bieler | Stefanie Dipper
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Terms, term relevances, and sentence relevances are concepts that figure in many NLP applications, such as Text Summarization. These concepts are implemented in various ways, though. In this paper, we want to shed light on the impact that different implementations can have on the overall performance of the systems. In particular, we examine the interplay between term definitions and sentence-scoring functions. For this, we define a gold standard that ranks sentences according to their significance and evaluate a range of relevant parameters with respect to the gold standard.

Annotation of Information Structure: an Evaluation across different Types of Texts
Julia Ritz | Stefanie Dipper | Michael Götze
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We report on the evaluation of information structural annotation according to the Linguistic Information Structure Annotation Guidelines (LISA; Dipper et al., 2007). The annotation scheme differentiates between the categories of information status, topic, and focus. It aims at being language-independent and has been applied to highly heterogeneous data: written and spoken evidence from typologically diverse languages. For the evaluation presented here, we focused on German texts of different types, both written texts and transcriptions of spoken language, and analyzed the annotation quantitatively and qualitatively.

2007

Standoff Coordination for Multi-Tool Annotation in a Dialogue Corpus
Kepa Joseba Rodríguez | Stefanie Dipper | Michael Götze | Massimo Poesio | Giuseppe Riccardi | Christian Raymond | Joanna Rabiega-Wiśniewska
Proceedings of the Linguistic Annotation Workshop

Identifying Formal and Functional Zones in Film Reviews
Heike Bieler | Stefanie Dipper | Manfred Stede
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

2006

ANNIS: Complex Multilevel Annotations in a Linguistic Database
Michael Götze | Stefanie Dipper
Proceedings of the 5th Workshop on NLP and XML (NLPXML-2006): Multi-Dimensional Markup in Natural Language Processing

2004

Towards User-Adaptive Annotation Guidelines
Stefanie Dipper | Michael Götze | Stavros Skopeteas
Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora

Grammar Modularity and its Impact on Grammar Documentation
Stefanie Dipper
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2000

Grammar-Based Corpus Annotation
Stefanie Dipper
Proceedings of the COLING-2000 Workshop on Linguistically Interpreted Corpora