Julius Steuer

2026

Evaluating the Interplay of Information Status and Information Content in a Multilingual Parallel Corpus
Julius Steuer | Toshiki Nakai | Andrew Thomas Dyer | Luigi Talamo | Annemarie Verkerk
Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

The uniform information density (UID) hypothesis postulates that linguistic units are distributed in a text in such a way that the variance around an average information density is minimized. The relationship between information density and information status (IS) is so far underexplored. In this ongoing work, we project IS annotations on the English section of the CIEP+ corpus (Verkerk Talamo 2024) to parallel sections in other languages. We then use the projected annotations to evaluate the relationship between IS and information content in a typologically diverse sample of languages. Our preliminary findings indicate that there is an effect of information status on information density, with the directionality of the effect depending on language and part of speech.

2024

pdf bib abs

WinoPron: Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case
Vagrant Gautam | Julius Steuer | Eileen Bingert | Ray Johns | Anne Lauscher | Dietrich Klakow
Proceedings of the Seventh Workshop on Computational Models of Reference, Anaphora and Coreference

While measuring bias and robustness in coreference resolution are important goals, such measurements are only as good as the tools we use to measure them. Winogender Schemas (Rudinger et al., 2018) are an influential dataset proposed to evaluate gender bias in coreference resolution, but a closer look reveals issues with the data that compromise its use for reliable evaluation, including treating different pronominal forms as equivalent, violations of template constraints, and typographical errors. We identify these issues and fix them, contributing a new dataset: WinoPron. Using WinoPron, we evaluate two state-of-the-art supervised coreference resolution systems, SpanBERT, and five sizes of FLAN-T5, and demonstrate that accusative pronouns are harder to resolve for all models. We also propose a new method to evaluate pronominal bias in coreference resolution that goes beyond the binary. With this method, we also show that bias characteristics vary not just across pronoun sets (e.g., he vs. she), but also across surface forms of those sets (e.g., him vs. his).

pdf bib abs

An Interactive Toolkit for Approachable NLP
AriaRay Brown | Julius Steuer | Marius Mosbach | Dietrich Klakow
Proceedings of the Sixth Workshop on Teaching NLP

We present a novel tool designed for teaching and interfacing the information-theoretic modeling abilities of large language models. The Surprisal Toolkit allows students from diverse linguistic and programming backgrounds to learn about measures of information theory and natural language processing (NLP) through an online interactive tool. In addition, the interface provides a valuable research mechanism for obtaining measures of surprisal. We implement the toolkit as part of a classroom tutorial in three different learning scenarios and discuss the overall receptive student feedback. We suggest this toolkit and similar applications as resourceful supplements to instruction in NLP topics, especially for the purpose of balancing conceptual understanding with technical instruction, grounding abstract topics, and engaging students with varying coding abilities.

pdf bib abs

Who Did You Blame When Your Project Failed? Designing a Corpus for Presupposition Generation in Cross-Examination Dialogues
Maria Francis | Julius Steuer | Dietrich Klakow | Volha Petukhova
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper introduces the corpus for the novel task of presupposition generation - a natural language generation problem where a model produces a list of presuppositions carried by the given input sentence, in the context of the presented research - given the cross-examination question. Two datasets, PECaN (Presupposition, Entailment, Contradiction and Neutral) and PGen (Presuppostion Generation), are designed to fine-tune existing BERT (CITATION) and T5 (CITATION) models for classification and generation tasks. Various corpora construction methods are proposed ranging from manual annotations, prompting the GPT 3.0 model, to augmenting data from the existing corpora. The fine-tuned models achieved high accuracy on the novel Presupposition as Natural Language Inference (PNLI) task which extends the traditional Natural Language Inference (NLI) incorporating instances of presupposition into classification. T5 outperforms BERT by broad margin achieving an overall accuracy of 84.35% compared to 71.85% of BERT, and specifically when classifying presuppositions (93% vs 73% respectively). Regarding presupposition generation, we observed that despite the limited amount of data used for fine-tuning, the model displays an emerging proficiency in generation presuppositions reaching ROUGE scores of 43.47, adhering to systematic patterns that mirror valid strategies for presupposition generation, although failed to generate the complete lists.

pdf bib

Modeling Diachronic Change in English Scientific Writing over 300+ Years with Transformer-based Language Model Surprisal
Julius Steuer | Marie-Pauline Krielke | Stefan Fischer | Stefania Degaetano-Ortlieb | Marius Mosbach | Dietrich Klakow
Proceedings of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024

2023

pdf bib abs

Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists
Julius Steuer | Johann-Mattis List | Badr M. Abdullah | Dietrich Klakow
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

We present a cross-linguistic study of vowel harmony that aims to quantifies this phenomenon using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have heavily relied on inflected word-forms in the analysis on vowel harmony. On the contrary, we train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists offering a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.

pdf bib

Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures
Julius Steuer | Marius Mosbach | Dietrich Klakow
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning