Anne-Kathrin Schumann

2025

Quantifying word complexity for Leichte Sprache: A computational metric and its psycholinguistic validation
Umesh Patil | Jesus Calvillo | Sol Lago | Anne-Kathrin Schumann
Proceedings of the 1st Workshop on Artificial Intelligence and Easy and Plain Language in Institutional Contexts (AI & EL/PL)

Leichte Sprache (Easy Language or Easy German) is a strongly simplified version of German geared toward a target group with limited language proficiency. In Germany, public bodies are required to provide information in Leichte Sprache. Unfortunately, Leichte Sprache rules are traditionally defined by non-linguists, they are not rooted in linguistic research and they do not provide precise decision criteria or devices for measuring the complexity of linguistic structures (Bock and Pappert,2023). For instance, one of the rules simply recommends the usage of simple rather than complex words. In this paper we, therefore, propose a model to determine word complexity. We train an XGBoost model for classifying word complexity by leveraging word-level linguistic and corpus-level distributional features, frequency information from an in-house Leichte Sprache corpus and human complexity annotations. We psycholinguistically validate our model by showing that it captures human word recognition times above and beyond traditional word-level predictors. Moreover, we discuss a number of practical applications of our classifier, such as the evaluation of AI-simplified text and detection of CEFR levels of words. To our knowledge, this is one of the first attempts to systematically quantify word complexity in the context of Leichte Sprache and to link it directly to real-time word processing.

pdf bib

Automatic Compound Segmentation for Leichte Sprache
Jesus Calvillo | Umesh Patil | Johann Seltmann | Anne-Kathrin Schumann
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Workshops

2020

pdf bib abs

Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems
Oliver Guhr | Anne-Kathrin Schumann | Frank Bahrmann | Hans Joachim Böhme
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes the training of a general-purpose German sentiment classification model. Sentiment classification is an important aspect of general text analytics. Furthermore, it plays a vital role in dialogue systems and voice interfaces that depend on the ability of the system to pick up and understand emotional signals from user utterances. The presented study outlines how we have collected a new German sentiment corpus and then combined this corpus with existing resources to train a broad-coverage German sentiment model. The resulting data set contains 5.4 million labelled samples. We have used the data to train both, a simple convolutional and a transformer-based classification model and compared the results achieved on various training configurations. The model and the data set will be published along with this paper.

2018

pdf bib abs

SemEval-2018 Task 7: Semantic Relation Extraction and Classification in Scientific Papers
Kata Gábor | Davide Buscaldi | Anne-Kathrin Schumann | Behrang QasemiZadeh | Haïfa Zargayouna | Thierry Charnois
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes the first task on semantic relation extraction and classification in scientific paper abstracts at SemEval 2018. The challenge focuses on domain-specific semantic relations and includes three different subtasks. The subtasks were designed so as to compare and quantify the effect of different pre-processing steps on the relation classification results. We expect the task to be relevant for a broad range of researchers working on extracting specialized knowledge from domain corpora, for example but not limited to scientific or bio-medical information extraction. The task attracted a total of 32 participants, with 158 submissions across different scenarios.

pdf bib

Automatic Annotation of Semantic Term Types in the Complete ACL Anthology Reference Corpus
Anne-Kathrin Schumann | Héctor Martínez Alonso
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib

Brave New World: Uncovering Topical Dynamics in the ACL Anthology Reference Corpus Using Term Life Cycle Information
Anne-Kathrin Schumann
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib abs

The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods
Behrang QasemiZadeh | Anne-Kathrin Schumann
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper introduces the ACL Reference Dataset for Terminology Extraction and Classification, version 2.0 (ACL RD-TEC 2.0). The ACL RD-TEC 2.0 has been developed with the aim of providing a benchmark for the evaluation of term and entity recognition tasks based on specialised text from the computational linguistics domain. This release of the corpus consists of 300 abstracts from articles in the ACL Anthology Reference Corpus, published between 1978–2006. In these abstracts, terms (i.e., single or multi-word lexical units with a specialised meaning) are manually annotated. In addition to their boundaries in running text, annotated terms are classified into one of the seven categories method, tool, language resource (LR), LR product, model, measures and measurements, and other. To assess the quality of the annotations and to determine the difficulty of this annotation task, more than 171 of the abstracts are annotated twice, independently, by each of the two annotators. In total, 6,818 terms are identified and annotated in more than 1300 sentences, resulting in a specialised vocabulary made of 3,318 lexical forms, mapped to 3,471 concepts. We explain the development of the annotation guidelines and discuss some of the challenges we encountered in this annotation task.

pdf bib abs

Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts
Anne-Kathrin Schumann | Stefan Fischer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The specialised lexicon belongs to the most prominent attributes of specialised writing: Terms function as semantically dense encodings of specialised concepts, which, in the absence of terms, would require lengthy explanations and descriptions. In this paper, we argue that terms are the result of diachronic processes on both the semantic and the morpho-syntactic level. Very little is known about these processes. We therefore present a corpus annotation project aiming at revealing how terms are coined and how they evolve to fit their function as semantically and morpho-syntactically dense encodings of specialised knowledge. The scope of this paper is two-fold: Firstly, we outline our methodology for annotating terminology in a diachronic corpus of scientific publications. Moreover, we provide a detailed analysis of our annotation results and suggest methods for improving the accuracy of annotations in a setting as difficult as ours. Secondly, we present results of a pilot study based on the annotated terms. The results suggest that terms in older texts are linguistically relatively simple units that are hard to distinguish from the lexicon of general language. We believe that this supports our hypothesis that terminology undergoes diachronic processes of densification and specialisation.

2014

pdf bib

Beyond Linguistic Equivalence. An Empirical Study of Translation Evaluation in a Translation Learner Corpus
Mihaela Vela | Anne-Kathrin Schumann | Andrea Wurm
Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation

pdf bib

Sensible: L2 Translation Assistance by Emulating the Manual Post-Editing Process
Liling Tan | Anne-Kathrin Schumann | Jose M.M. Martinez | Francis Bond
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

2013

pdf bib

Collection, Annotation and Analysis of Gold Standard Corpora for Knowledge-Rich Context Extraction in Russian and German
Anne-Kathrin Schumann
Proceedings of the Student Research Workshop associated with RANLP 2013

2012

pdf bib abs

Knowledge-Rich Context Extraction and Ranking with KnowPipe
Anne-Kathrin Schumann
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents ongoing Phd thesis work dealing with the extraction of knowledge-rich contexts from text corpora for terminographic purposes. Although notable progress in the field has been made over recent years, there is yet no methodology or integrated workflow that is able to deal with multiple, typologically different languages and different domains, and that can be handled by non-expert users. Moreover, while a lot of work has been carried out to research the KRC extraction step, the selection and further analysis of results still involves considerable manual work. In this view, the aim of this paper is two-fold. Firstly, the paper presents a ranking algorithm geared at supporting the selection of high-quality contexts once the extraction has been finished and describes ranking experiments with Russian context candidates. Secondly, it presents the KnowPipe framework for context extraction: KnowPipe aims at providing a processing environment that allows users to extract knowledge-rich contexts from text corpora in different languages using shallow and deep processing techniques. In its current state of development, KnowPipe provides facilities for preprocessing Russian and German text corpora, for pattern-based knowledge-rich context extraction from these corpora using shallow analysis as well as tools for ranking Russian context candidates.