Matthias Aßenmacher

Also published as: Matthias Assenmacher

2025

pdf bib abs
Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation
Esteban Garces Arias | Meimingwei Li | Christian Heumann | Matthias Assenmacher
Proceedings of the 31st International Conference on Computational Linguistics

Decoding strategies for generative large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Guided by specific hyperparameters, these strategies aim to transform the raw probability distributions produced by language models into coherent, fluent text. In this study, we undertake a large-scale empirical assessment of a range of decoding methods, open-source LLMs, textual domains, and evaluation protocols to determine how hyperparameter choices shape the outputs. Our experiments include both factual (e.g., news) and creative (e.g., fiction) domains, and incorporate a broad suite of automatic evaluation metrics alongside human judgments. Through extensive sensitivity analyses, we distill practical recommendations for selecting and tuning hyperparameters, noting that optimal configurations vary across models and tasks. By synthesizing these insights, this study provides actionable guidance for refining decoding strategies, enabling researchers and practitioners to achieve higher-quality, more reliable, and context-appropriate text generation outcomes.

pdf bib abs
AI Conversational Interviewing: Transforming Surveys with LLMs as Adaptive Interviewers
Alexander Wuttke | Matthias Assenmacher | Christopher Klamm | Max Lang | Fraue Kreuter
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)

Traditional methods for eliciting people’s opinions face a trade-off between depth and scale: structured surveys enable large-scale data collection but limit respondents’ ability to voice their opinions in their own words, while conversational interviews provide deeper insights but are resource-intensive. This study explores the potential of replacing human interviewers with large language models (LLMs) to conduct scalable conversational interviews. Our goal is to assess the performance of AI Conversational Interviewing and to identify opportunities for improvement in a controlled environment. We conducted a small-scale, in-depth study with university students who were randomly assigned to a conversational interview by either AI or human interviewers, both employing identical questionnaires on political topics. Various quantitative and qualitative measures assessed interviewer adherence to guidelines, response quality, participant engagement, and overall interview efficacy. The findings indicate the viability of AI Conversational Interviewing in producing quality data comparable to traditional methods, with the added benefit of scalability. We publish our data and materials for re-use and present specific recommendations for effective implementation.

pdf bib abs
Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan
Matthias Schöffel | Marinus Wiedner | Esteban Garces Arias | Paula Ruppert | Christian Heumann | Matthias Aßenmacher
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora—hagiographical and medical texts—we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.

2024

pdf bib abs
Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation
Esteban Garces Arias | Julian Rodemann | Meimingwei Li | Christian Heumann | Matthias Aßenmacher
Findings of the Association for Computational Linguistics: EMNLP 2024

Despite the remarkable capabilities of large language models, generating high-quality text remains a challenging task. Numerous decoding strategies—such as beam search, sampling with temperature, top‐k sampling, nucleus (top‐p) sampling, typical decoding, contrastive decoding, and contrastive search—have been proposed to address these challenges by improving coherence, diversity, and resemblance to human-generated text. In this study, we introduce Adaptive Contrastive Search (ACS), a novel decoding strategy that extends contrastive search (CS) by incorporating an adaptive degeneration penalty informed by the model’s estimated uncertainty at each generation step. ACS aims to enhance creativity and diversity while maintaining coherence to produce high-quality outputs. Extensive experiments across various model architectures, languages, and datasets demonstrate that our approach improves both creativity and coherence, underscoring its effectiveness in text-generation tasks. We release our code, datasets, and models to facilitate further research.

pdf bib abs
Detecting Gender Discrimination on Actor Level Using Linguistic Discourse Analysis
Stefanie Urchs | Veronika Thurner | Matthias Aßenmacher | Christian Heumann | Stephanie Thiemichen
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

With the usage of tremendous amounts of text data for training powerful large language models such as ChatGPT, the issue of analysing and securing data quality has become more pressing than ever. Any biases, stereotypes and discriminatory patterns that exist in the training data can be reproduced, reinforced or broadly disseminated by the models in production. Therefore, it is crucial to carefully select and monitor the text data that is used as input to train the model. Due to the vast amount of training data, this process needs to be (at least partially) automated. In this work, we introduce a novel approach for automatically detecting gender discrimination in text data on the actor level based on linguistic discourse analysis. Specifically, we combine existing information extraction (IE) techniques to partly automate the qualitative research done in linguistic discourse analysis. We focus on two important steps: Identifying the respectiveperson-named-entity (an actor) and all forms it is referred to (Nomination), and detecting the characteristics it is ascribed (Predication). Asa proof of concept, we integrate these two steps into a pipeline for automated text analysis. The separate building blocks of the pipeline could be flexibly adapted, extended, and scaled for bigger datasets to accommodate a wide range of usage scenarios and specific ML tasks or help social scientists with analysis tasks. We showcase and evaluate our approach on several real and simulated exemplary texts.

pdf bib
Introducing wwm-german-18k - Can LLMs Crack the Million? (Or Win at Least 500 Euros?)
Matthias Aßenmacher | Luis Karrlein | Philipp Schiele | Christian Heumann
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)

pdf bib abs
Divergent Token Metrics: Measuring degradation to prune away LLM components – and optimize quantization
Björn Deiseroth | Max Meuer | Nikolas Gritsch | Constantin Eichenberg | Patrick Schramowski | Matthias Aßenmacher | Kristian Kersting
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components’ impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually—and that FDTM can identify those—while standard metrics result in deteriorated outcomes.

pdf bib
Can OpenSource beat ChatGPT? - A Comparative Study of Large Language Models for Text-to-Code Generation
Luis Mayer | Christian Heumann | Matthias Aßenmacher
Proceedings of the 9th edition of the Swiss Text Analytics Conference

pdf bib
Classifying multilingual party manifestos: Domain transfer across country, time, and genre
Matthias Aßenmacher | Nadja Sauter | Christian Heumann
Proceedings of the 9th edition of the Swiss Text Analytics Conference

In this work, we present a collaboratively and continuously developed open-source educational resource (OSER) for teaching natural language processing at two different universities. We shed light on the principles we followed for the initial design of the course and the rationale for ongoing developments, followed by a reflection on the inter-university collaboration for designing and maintaining teaching material. When reflecting on the latter, we explicitly emphasize the considerations that need to be made when facing heterogeneous groups and when having to accommodate multiple examination regulations within one single course framework. Relying on the fundamental principles of OSER developments as defined by Bothmann et al. (2023) proved to be an important guideline during this process. The final part pertains to open-sourcing our teaching material, coping with the increasing speed of developments in the field, and integrating the course digitally, also addressing conflicting priorities and challenges we are currently facing.

pdf bib abs
More Labels or Cases? Assessing Label Variation in Natural Language Inference
Cornelia Gruber | Katharina Hechinger | Matthias Assenmacher | Göran Kauermann | Barbara Plank
Proceedings of the Third Workshop on Understanding Implicit and Underspecified Language

In this work, we analyze the uncertainty that is inherently present in the labels used for supervised machine learning in natural language inference (NLI). In cases where multiple annotations per instance are available, neither the majority vote nor the frequency of individual class votes is a trustworthy representation of the labeling uncertainty. We propose modeling the votes via a Bayesian mixture model to recover the data-generating process, i.e., the “true” latent classes, and thus gain insight into the class variations. This will enable a better understanding of the confusion happening during the annotation process. We also assess the stability of the proposed estimation procedure by systematically varying the numbers of i) instances and ii) labels. Thereby, we observe that few instances with many labels can predict the latent class borders reasonably well, while the estimation fails for many instances with only a few labels. This leads us to conclude that multiple labels are a crucial building block for properly analyzing label uncertainty.

2023

The Bavarian Academy of Sciences and Humanities aims to digitize the Medieval Latin Dictionary. This dictionary entails record cards referring to lemmas in medieval Latin, a low-resource language. A crucial step of the digitization process is the handwritten text recognition (HTR) of the handwritten lemmas on the record cards. In our work, we introduce an end-to-end pipeline, tailored for the medieval Latin dictionary, for locating, extracting, and transcribing the lemmas. We employ two state-of-the-art image segmentation models to prepare the initial data set for the HTR task. Further, we experiment with different transformer-based models and conduct a set of experiments to explore the capabilities of different combinations of vision encoders with a GPT-2 decoder. Additionally, we also apply extensive data augmentation resulting in a highly competitive model. The best-performing setup achieved a character error rate of 0.015, which is even superior to the commercial Google Cloud Vision model, and shows more stable performance.

pdf bib abs
Automatic Transcription of Handwritten Old Occitan Language
Esteban Garces Arias | Vallari Pai | Matthias Schöffel | Christian Heumann | Matthias Aßenmacher
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

While existing neural network-based approaches have shown promising results in Handwritten Text Recognition (HTR) for high-resource languages and standardized/machine-written text, their application to low-resource languages often presents challenges, resulting in reduced effectiveness. In this paper, we propose an innovative HTR approach that leverages the Transformer architecture for recognizing handwritten Old Occitan language. Given the limited availability of data, which comprises only word pairs of graphical variants and lemmas, we develop and rely on elaborate data augmentation techniques for both text and image data. Our model combines a custom-trained Swin image encoder with a BERT text decoder, which we pre-train using a large-scale augmented synthetic data set and fine-tune on the small human-labeled data set. Experimental results reveal that our approach surpasses the performance of current state-of-the-art models for Old Occitan HTR, including open-source Transformer-based models such as a fine-tuned TrOCR and commercial applications like Google Cloud Vision. To nurture further research and development, we make our models, data sets, and code publicly available.

2022

pdf bib abs
CC-Top: Constrained Clustering for Dynamic Topic Discovery
Jann Goschenhofer | Pranav Ragupathy | Christian Heumann | Bernd Bischl | Matthias Aßenmacher
Proceedings of the First Workshop on Ever Evolving NLP (EvoNLP)

Research on multi-class text classification of short texts mainly focuses on supervised (transfer) learning approaches, requiring a finite set of pre-defined classes which is constant over time. This work explores deep constrained clustering (CC) as an alternative to supervised learning approaches in a setting with a dynamically changing number of classes, a task we introduce as dynamic topic discovery (DTD).We do so by using pairwise similarity constraints instead of instance-level class labels which allow for a flexible number of classes while exhibiting a competitive performance compared to supervised approaches. First, we substantiate this through a series of experiments and show that CC algorithms exhibit a predictive performance similar to state-of-the-art supervised learning algorithms while requiring less annotation effort. Second, we demonstrate the overclustering capabilities of deep CC for detecting topics in short text data sets in the absence of the ground truth class cardinality during model training. Third, we showcase that these capabilities can be leveraged for the DTD setting as a step towards dynamic learning over time and finally, we release our codebase to nurture further research in this area.

pdf bib abs
Pre-trained language models evaluating themselves - A comparative study
Philipp Koch | Matthias Aßenmacher | Christian Heumann
Proceedings of the Third Workshop on Insights from Negative Results in NLP

Evaluating generated text received new attention with the introduction of model-based metrics in recent years. These new metrics have a higher correlation with human judgments and seemingly overcome many issues of previous n-gram based metrics from the symbolic age. In this work, we examine the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate (Petersen). We investigate their sensitivity to different types of semantic deterioration (part of speech drop and negation), word order perturbations, word drop, and the common problem of repetition. No metric showed appropriate behaviour for negation, and further none of them was overall sensitive to the other issues mentioned above.

2021

pdf bib
Benchmarking down-scaled (not so large) pre-trained language models
Matthias Aßenmacher | Patrick Schulze | Christian Heumann
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

2020

pdf bib abs
Evaluating Unsupervised Representation Learning for Detecting Stances of Fake News
Maike Guderlei | Matthias Aßenmacher
Proceedings of the 28th International Conference on Computational Linguistics

Our goal is to evaluate the usefulness of unsupervised representation learning techniques for detecting stances of Fake News. Therefore we examine several pre-trained language models with respect to their performance on two Fake News related data sets, both consisting of instances with a headline, an associated news article and the stance of the article towards the respective headline. Specifically, the aim is to understand how much hyperparameter tuning is necessary when fine-tuning the pre-trained architectures, how well transfer learning works in this specific case of stance detection and how sensitive the models are to changes in hyperparameters like batch size, learning rate (schedule), sequence length as well as the freezing technique. The results indicate that the computationally more expensive autoregression approach of XLNet (Yanget al., 2019) is outperformed by BERT-based models, notably by RoBERTa (Liu et al., 2019).While the learning rate seems to be the most important hyperparameter, experiments with different freezing techniques indicate that all evaluated architectures had already learned powerful language representations that pose a good starting point for fine-tuning them.

Venues