Zarah Weiss

Also published as: Zarah Weiß

2025

Enriching children’s stories with LLMs: Delivering multilingual data enrichment for children’s books at scale and across markets
Zarah Weiss | Christof Meyer | Mikael Andersson
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

This paper presents a user-centered, empirically guided approach to multilingual metadata enrichment for children’s books. We combine LLMs with human-in-the-loop quality control in a scalable CI/CD pipeline to curate brand collections that enhance book discovery and engagement for young readers across multiple European markets. Our results demonstrate that this hybrid approach delivers high-quality, child-appropriate labels, improves user experience, and accelerates deployment in real-world production environments. This work offers practical insights for applying generative NLP in the media and publishing industry.

2022

Assessing sentence readability for German language learners with broad linguistic modeling or readability formulas: When do linguistic insights make a difference?
Zarah Weiss | Detmar Meurers
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

We present a new state-of-the-art sentence-wise readability assessment model for German L2 readers. We build a linguistically broadly informed machine learning model and compare its performance against four commonly used readability formulas. To understand when the linguistic insights used to inform our model make a difference for readability assessment and when simple readability formulas suffice, we compare their performance based on two common automatic readability assessment tasks: predictive regression and sentence pair ranking. We find that leveraging linguistic insights yields top performances across tasks, but that for the identification of simplified sentences also readability formulas – which are easier to compute and more accessible – can be sufficiently precise. Linguistically informed modeling, however, is the only viable option for high quality outcomes in fine-grained prediction tasks. We then explore the sentence-wise readability profile of leveled texts written for language learners at a beginning, intermediate, and advanced level of German to showcase the valuable insights that sentence-wise readability assessment can have for the adaptation of learning materials and better understand how sentences’ individual readability contributes to larger texts’ overall readability.

2021

Using Broad Linguistic Complexity Modeling for Cross-Lingual Readability Assessment
Zarah Weiss | Xiaobin Chen | Detmar Meurers
Proceedings of the 10th Workshop on NLP for Computer Assisted Language Learning

2020

CTAP for Italian: Integrating Components for the Analysis of Italian into a Multilingual Linguistic Complexity Analysis Tool
Nadezda Okinina | Jennifer-Carmen Frey | Zarah Weiss
Proceedings of the Twelfth Language Resources and Evaluation Conference

Linguistic complexity research being a very actively developing field, an increasing number of text analysis tools are created that use natural language processing techniques for the automatic extraction of quantifiable measures of linguistic complexity. While most tools are designed to analyse only one language, the CTAP open source linguistic complexity measurement tool is capable of processing multiple languages, making cross-lingual comparisons possible. Although it was originally developed for English, the architecture has been extended to support multi-lingual analyses. Here we present the Italian component of CTAP, describe its implementation and compare it to the existing linguistic complexity tools for Italian. Offering general text length statistics and features for lexical, syntactic, and morpho-syntactic complexity (including measures of lexical frequency, lexical diversity, lexical and syntactical variation, part-of-speech density), CTAP is currently the most comprehensive linguistic complexity measurement tool for Italian and the only one allowing the comparison of Italian texts to multiple other languages within one tool.

2019

Computationally Modeling the Impact of Task-Appropriate Language Complexity and Accuracy on Human Grading of German Essays
Zarah Weiss | Anja Riemenschneider | Pauline Schröter | Detmar Meurers
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

Computational linguistic research on the language complexity of student writing typically involves human ratings as a gold standard. However, educational science shows that teachers find it difficult to identify and cleanly separate accuracy, different aspects of complexity, contents, and structure. In this paper, we therefore explore the use of computational linguistic methods to investigate how task-appropriate complexity and accuracy relate to the grading of overall performance, content performance, and language performance as assigned by teachers. Based on texts written by students for the official school-leaving state examination (Abitur), we show that teachers successfully assign higher language performance grades to essays with higher task-appropriate language complexity and properly separate this from content scores. Yet, accuracy impacts teacher assessment for all grading rubrics, also the content score, overemphasizing the role of accuracy. Our analysis is based on broad computational linguistic modeling of German language complexity and an innovative theory- and data-driven feature aggregation method inferring task-appropriate language complexity.

Integrating large-scale web data and curated corpus data in a search engine supporting German literacy education
Sabrina Dittrich | Zarah Weiss | Hannes Schröter | Detmar Meurers
Proceedings of the 8th Workshop on NLP for Computer Assisted Language Learning

Analyzing Linguistic Complexity and Accuracy in Academic Language Development of German across Elementary and Secondary School
Zarah Weiss | Detmar Meurers
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We track the development of writing complexity and accuracy in German students’ early academic language development from first to eighth grade. Combining an empirically broad approach to linguistic complexity with the high-quality error annotation included in the Karlsruhe Children’s Text corpus (Lavalley et al. 2015), we construct models of German academic language development that successfully identify the student’s grade level. We show that classifiers for the early years rely more on accuracy development, whereas development in secondary school is better characterized by increasingly complex language in all domains: linguistic system, language use, and human sentence processing characteristics. We demonstrate the generalizability and robustness of models using such a broad complexity feature set across writing topics.

2018

COAST - Customizable Online Syllable Enhancement in Texts. A flexible framework for automatically enhancing reading materials
Heiko Holz | Zarah Weiss | Oliver Brehm | Detmar Meurers
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper presents COAST, a web-based application to easily and automatically enhance syllable structure, word stress, and spacing in texts, which was designed in close collaboration with learning therapists to ensure its practical relevance. Such syllable-enhanced texts are commonly used in learning therapy or private tuition to promote the recognition of syllables in order to improve reading and writing skills. In a state-of-the-art solution for automatic syllable enhancement, we put special emphasis on syllable stress and support specific marking of the primary syllable stress in words. Core features of our tool are i) a highly customizable text enhancement and template functionality, and ii) a novel crowd-sourcing mechanism that we employ to address the issue of data sparsity in language resources. We successfully tested COAST with real-life practitioners in a series of user tests validating the concept of our framework.

A Linguistically-Informed Search Engine to Identifiy Reading Material for Functional Illiteracy Classes
Zarah Weiss | Sabrina Dittrich | Detmar Meurers
Proceedings of the 7th workshop on NLP for Computer Assisted Language Learning

Modeling the Readability of German Targeting Adults and Children: An empirically broad analysis and its cross-corpus validation
Zarah Weiß | Detmar Meurers
Proceedings of the 27th International Conference on Computational Linguistics

We analyze two novel data sets of German educational media texts targeting adults and children. The analysis is based on 400 automatically extracted measures of linguistic complexity from a wide range of linguistic domains. We show that both data sets exhibit broad linguistic adaptation to the target audience, which generalizes across both data sets. Our most successful binary classification model for German readability robustly shows high accuracy between 89.4%–98.9% for both data sets. To our knowledge, this comprehensive German readability model is the first for which robust cross-corpus performance has been shown. The research also contributes resources for German readability assessment that are externally validated as successful for different target audiences: we compiled a new corpus of German news broadcast subtitles, the Tagesschau/Logo corpus, and crawled a GEO/GEOlino corpus substantially enlarging the data compiled by Hancke et al. 2012.

Co-authors

Jennifer-Carmen Frey 1

Heiko Holz 1

Christof Meyer 1

Nadezda Okinina 1

Anja Riemenschneider 1

Pauline Schröter 1

Hannes Schröter 1
