Wolf-Tilo Balke

Also published as: Wolf Tilo Balke

2025

From Prototypical to Relational: How LLMs Navigate Complex Analogies
Mayukh Das | Wolf-Tilo Balke
Proceedings of the 18th International Natural Language Generation Conference

We introduce a comprehensive benchmark to assess the analogical reasoning capabilities of large language models (LLMs) on complex analogy tasks that go beyond conventional formats with single correct answers. Unlike standard benchmarks that assume a singular ground truth, our framework presents a four-way multiple-choice analogy task in which all target options are semantically plausible. Leveraging concept pairs from Wikidata and AnalogyKB, we construct analogy instances enriched with multiple overlapping relational structures, where the relations are mined with RAG and ranked in salience through a GPT-4-assisted Max-Diff survey. To enable systematic evaluation, we propose three complementary semantic measures, namely ranked relational overlap, context embedding similarity, and prototypicality, each grounded in established literature on analogical reasoning. Our experiments span a range of LLMs, evaluated under zero-shot, few-shot, and knowledge-enhanced prompting conditions. While models such as GPT-4 perform well on embedding-based and prototypicality-based measures, they consistently underperform when tasked with capturing fine-grained relational mappings. These results reveal that, despite their impressive surface-level semantic fluency, current LLMs exhibit notable limitations in structured relational reasoning.

2024

Toximatics: Towards Understanding Toxicity in Real-Life Social Situations
Mayukh Das | Wolf-Tilo Balke
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

The proliferation of social media has increased the visibility and effects of hate speech. To address this, NLP solutions have been developed to identify both explicit and implicit forms of hate speech. Typically, these approaches evaluate the toxicity of utterances in isolation, ignoring the context. Drawing on pragmatics, our study examines how contextual factors can influence the perceived toxicity of utterances, thereby anchoring assessments in a more nuanced semantic framework. We present Toximatics, a dataset of context-dependent utterances and their toxicity scores. We also introduce a novel synthetic data generation pipeline designed to create context-utterance pairs at scale with controlled polarity. This pipeline can enhance existing hate speech datasets by adding contextual information to utterances, either preserving or altering their polarity, and can also generate completely new pairs from seed statements. We utilised both features to create Toximatics. To address biases in state-of-the-art hate datasets, which often skew towards specific sensitive topics such as politics, race, and gender, we propose a method to generate neutral utterances typical of various social settings. These are then contextualized to show how neutrality can shift to toxicity or benignity depending on the surrounding context. The evaluation results clearly indicate that current models underperform on this dataset.

Is Machine Psychology here? On Requirements for Using Human Psychological Tests on Large Language Models
Lea Löhn | Niklas Kiehne | Alexander Ljapunov | Wolf-Tilo Balke
Proceedings of the 17th International Natural Language Generation Conference

In an effort to better understand the behavior of large language models (LLMs), researchers have recently turned to conducting psychological assessments on them. Several studies diagnose various psychological concepts in LLMs, such as psychopathological symptoms, personality traits, and intellectual functioning, aiming to unravel their black-box characteristics. But can we safely assess LLMs with tests that were originally designed for humans? The psychology domain looks back on decades of developing standards of appropriate testing procedures to ensure reliable and valid measures. We argue that analogous standardization processes are required for LLM assessments, given their differential functioning as compared to humans. In this paper, we propose seven requirements necessary for testing LLMs. Based on these, we critically reflect on a sample of 25 recent machine psychology studies. Our analysis reveals (1) the lack of appropriate methods to assess test reliability and construct validity, (2) the unknown strength of construct-irrelevant influences, such as the contamination of pre-training corpora with test material, and (3) the pervasive issue of non-reproducibility of many studies. The results underscore the lack of a general methodology for the implementation of psychological assessments of LLMs and the need to redefine psychological constructs specifically for large language models rather than adopting them from human psychology.

Analyzing Effects of Learning Downstream Tasks on Moral Bias in Large Language Models
Niklas Kiehne | Alexander Ljapunov | Marc Bätje | Wolf-Tilo Balke
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Pre-training and fine-tuning large language models (LMs) is currently the state-of-the-art methodology for enabling data-scarce downstream tasks. However, the derived models still tend to replicate and perpetuate social biases. To understand this process in more detail, this paper investigates the actual effects of learning downstream tasks on moral bias in LMs. We develop methods to assess the agreement of LMs with explicitly codified norms in both pre-training and fine-tuning stages. Even if a pre-trained foundation model exhibits consistent norms, we find that introducing downstream tasks may indeed lead to unexpected inconsistencies in norm representation. Specifically, we observe two phenomena during fine-tuning across both masked and causal LMs: (1) pre-existing moral bias may be mitigated or amplified even when the model is presented with opposing views and (2) prompt sensitivity may be negatively impacted. We provide empirical evidence of models deteriorating into conflicting states, where contradictory answers can easily be triggered by slight modifications in the input sequence. Our findings thus raise concerns about the general ability of LMs to mitigate moral biases effectively.

2022

Quantifying Bias from Decoding Techniques in Natural Language Generation
Mayukh Das | Wolf Tilo Balke
Proceedings of the 29th International Conference on Computational Linguistics

Natural language generation (NLG) models can propagate social bias towards particular demographics. Though several studies have investigated bias from data and models, NLG tasks distinctively use a stochastic decoder that can positively or negatively impact the bias-sensitive tokens initially predicted by the model. To address this gap in research, we present an extensive analysis of bias from decoding techniques for open-domain language generation considering the entire decoding space. We analyze to what extent bias metrics like toxicity and sentiment are impacted by the individual components of decoder algorithms. To this end, we also analyze the trade-off between bias scores and human-annotated generation quality throughout the decoder space. Together, these methods reveal the imperative of testing inference-time bias and provide evidence on the usefulness of inspecting the entire decoding spectrum.

Contextualizing Language Models for Norms Diverging from Social Majority
Niklas Kiehne | Hermann Kroll | Wolf-Tilo Balke
Findings of the Association for Computational Linguistics: EMNLP 2022

To comprehensibly contextualize decisions, artificial systems in social situations need a high degree of awareness of the rules of conduct of human behavior. Transformer-based language models in particular have recently been shown to exhibit some such awareness. But what if norms in some social setting do not adhere to or even blatantly deviate from the mainstream? In this paper, we introduce a novel mechanism based on deontic logic to allow for a flexible adaptation of individual norms by de-biasing training data sets and a task-reduction to textual entailment. Building on the popular ‘Moral Stories’ dataset, we highlight the intrinsic bias of current language models on the one hand and characterize the adaptability of pre-trained models to deviating norms in fine-tuning settings on the other.
