Markus Schedl

2026

Cognitive Effects and Biases in Large Language Models
Markus Schedl | Ralph Hertwig | Antonela Tommasel | Shahed Masoudian
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts)

This tutorial bridges psychology and NLP to clarify cognitive effects and biases in large language models, compares methodological assumptions across disciplines, and discusses practical evaluation challenges and open research directions.

pdf bib abs

How Far Can Pretrained LLMs Go in Symbolic Music? Controlled Comparisons of Supervised and Preference-based Adaptation
Deepak Kumar | Emmanouil Karystinaios | Gerhard Widmer | Markus Schedl
Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026)

Music often shares notable parallels with language, motivating the use of pretrained large language models (LLMs) for symbolic music understanding and generation. Despite growing interest, the practical effectiveness of adapting instruction-tuned LLMs to symbolic music remains insufficiently characterized. We present a controlled comparative study of finetuning strategies for ABC-based generation and understanding, comparing an off-the-shelf instruction-tuned backbone to domain-adapted variants and a music-specialized LLM baseline. Across multiple symbolic music corpora and evaluation signals, we provide some insights into adaptation choices for symbolic music applications. We highlight the domain adaptation vs. preserving prior information tradeoff as well as the distinct behaviour of metrics used to measure the domain adaptation for symbolic music.

pdf bib abs

Read Between the Tracks: Exploring LLM-driven Intent-based Music Recommendations
Anna Hausberger | Petra Jósár | Markus Schedl
Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026)

This paper evaluates the effectiveness of large language models (LLMs) on the task of context-aware music recommendation, specifically focusing on the alignment of music tracks with a listening intent, in addition to user preferences. We present a preliminary investigation in which five LLMs (variants of LLama, Qwen, and Mistral) are tasked with ranking a candidate set of tracks containing both ground-truth items (associated with specific user-intent pairs) and distractor items (containing user-relevant, intent-relevant, or non-user and non-intent relevant items). Our results show that LLMs rank intent-user-relevant items higher than the distract items, with "Llama-3.1-8B-Instruct" having the best performance (NDCG of 0.32_0.20 vs. 0.20_0.15). We further investigate whether performance differs when mentioning the listening intent explicitly in the prompt vs. implicitly given solely music preferences.Surprisingly, the LLMs achieved the best performance through an implicit indication of intent, versus explicitly adding it to the prompt, with "Mistral-7B-Instruct-v0.3" performing the best (NDCG of 0.37_0.22 vs. 0.29_0.18).

2025

pdf bib abs

Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion
Markus Frohmann | Gabriel Meseguer-Brocal | Markus Schedl | Elena V. Epure
Findings of the Association for Computational Linguistics: ACL 2025

The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.

pdf bib abs

Characterizing Positional Bias in Large Language Models: A Multi-Model Evaluation of Prompt Order Effects
Patrick Schilcher | Dominik Karasin | Michael Schöpf | Haisam Saleh | Antonela Tommasel | Markus Schedl
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) are widely used for a variety of tasks such as text generation, ranking, and decision-making. However, their outputs can be influenced by various forms of biases. One such bias is positional bias, where models prioritize items based on their position within a given prompt rather than their content or quality, impacting on how LLMs interpret and weigh information, potentially compromising fairness, reliability, and robustness. To assess positional bias, we prompt a range of LLMs to generate descriptions for a list of topics, systematically permuting their order and analyzing variations in the responses. Our analysis shows that ranking position affects structural features and coherence, with some LLMs also reordering or omitting topics. Nonetheless, the impact of positional bias varies across different LLMs and topics, indicating an interplay with other related biases.

pdf bib abs

Mitigating Gender Bias in Job Ranking Systems Using Job Advertisement Neutrality
Deepak Kumar | Shahed Masoudian | Alessandro B. Melchiorre | Markus Schedl
Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI)

Transformer-based Job Ranking Systems (JRSs) are vulnerable to societal biases inherited in unbalanced datasets. These biases often manifest as unjust job rankings, particularly disadvantaging candidates of different genders. Most bias mitigation techniques leverage candidates’ gender and align gender distributions within the embeddings of JRSs to mitigate bias. While such methods effectively align distributional properties and make JRSs agnostic to gender, they frequently fall short in addressing empirical fairness metrics, such as the performance gap across genders. In this study, we shift our attention from candidate gender to mitigate bias based on gendered language in job advertisements. We propose a novel neutrality score based on automatically discovered biased words in job ads and use it to re-rank the model’s decisions. We evaluate our method by comparing it with different bias mitigation strategies and empirically demonstrate that our proposed method not only improves fairness but can also enhance the model’s performance.

2024

pdf bib abs

Effective Controllable Bias Mitigation for Classification and Retrieval using Gate Adapters
Shahed Masoudian | Cornelia Volaucnik | Markus Schedl | Navid Rekabsaz
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Bias mitigation of Language Models has been the topic of many studies with a recent focus on learning separate modules like adapters for on-demand debiasing. Besides optimizing for a modularized debiased model, it is often critical in practice to control the degree of bias reduction at inference time, e.g., in order to tune for a desired performance-fairness trade-off in search results or to control the strength of debiasing in classification tasks. In this paper, we introduce Controllable Gate Adapter (ConGater), a novel modular gating mechanism with adjustable sensitivity parameters, %In addition to better perseverance of task performance and enhanced information removal, which allows for a gradual transition from the biased state of the model to the fully debiased version at inference time. We demonstrate ConGater performance by (1) conducting adversarial debiasing experiments with three different models on three classification tasks with four protected attributes, and (2) reducing the bias of search results through fairness list-wise regularization to enable adjusting a trade-off between performance and fairness metrics. Our experiments on the classification tasks show that compared to baselines of the same caliber, ConGater can maintain higher task performance while containing less information regarding the attributes. Our results on the retrieval task show that the fully debiased ConGater can achieve the same fairness performance while maintaining more than twice as high task performance than recent strong baselines. Overall, besides strong performance ConGater enables the continuous transitioning between biased and debiased states of models, enhancing personalization of use and interpretability through controllability.

pdf bib abs

Unlabeled Debiasing in Downstream Tasks via Class-wise Low Variance Regularization
Shahed Masoudian | Markus Frohmann | Navid Rekabsaz | Markus Schedl
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Language models frequently inherit societal biases from their training data. Numerous techniques have been proposed to mitigate these biases during both the pre-training and fine-tuning stages. However, fine-tuning a pre-trained debiased language model on a downstream task can reintroduce biases into the model. Additionally, existing debiasing methods for downstream tasks either (i) require labels of protected attributes (e.g., age, race, or political views) that are often not available or (ii) rely on indicators of bias, which restricts their applicability to gender debiasing since they rely on gender-specific words. To address this, we introduce a novel debiasing regularization technique based on the class-wise variance of embeddings. Crucially, our method does not require attribute labels and targets any attribute, thus addressing the shortcomings of existing debiasing methods. Our experiments on encoder language models and three datasets demonstrate that our method outperforms existing strong debiasing baselines that rely on target attribute labels while maintaining performance on the target task.

pdf bib abs

Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Markus Frohmann | Igor Sterner | Ivan Vulić | Benjamin Minixhofer | Markus Schedl
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model — Segment any Text (SaT) — to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines — including strong LLMs — across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are readily available at https://github.com/segment-any-text/wtpsplit under the MIT license.

2023

pdf bib abs

Large pre-trained language models contain societal biases and carry along these biases to downstream tasks. Current in-processing bias mitigation approaches (like adversarial training) impose debiasing by updating a model’s parameters, effectively transferring the model to a new, irreversible debiased state. In this work, we propose a novel approach to develop stand-alone debiasing functionalities separate from the model, which can be integrated into the model on-demand, while keeping the core model untouched. Drawing from the concept of AdapterFusion in multi-task learning, we introduce DAM (Debiasing with Adapter Modules) – a debiasing approach to first encapsulate arbitrary bias mitigation functionalities into separate adapters, and then add them to the model on-demand in order to deliver fairness qualities. We conduct a large set of experiments on three classification tasks with gender, race, and age as protected attributes. Our results show that DAM improves or maintains the effectiveness of bias mitigation, avoids catastrophic forgetting in a multi-attribute scenario, and maintains on-par task performance, while granting parameter-efficiency and easy switching between the original and debiased models.

pdf bib abs

Modular and On-demand Bias Mitigation with Attribute-Removal Subnetworks
Lukas Hauzenberger | Shahed Masoudian | Deepak Kumar | Markus Schedl | Navid Rekabsaz
Findings of the Association for Computational Linguistics: ACL 2023

Societal biases are reflected in large pre-trained language models and their fine-tuned versions on downstream tasks. Common in-processing bias mitigation approaches, such as adversarial training and mutual information removal, introduce additional optimization criteria, and update the model to reach a new debiased state. However, in practice, end-users and practitioners might prefer to switch back to the original model, or apply debiasing only on a specific subset of protected attributes. To enable this, we propose a novel modular bias mitigation approach, consisting of stand-alone highly sparse debiasing subnetworks, where each debiasing module can be integrated into the core model on-demand at inference time. Our approach draws from the concept of diff pruning, and proposes a novel training regime adaptable to various representation disentanglement optimizations. We conduct experiments on three classification tasks with gender, race, and age as protected attributes. The results show that our modular approach, while maintaining task performance, improves (or at least remains on-par with) the effectiveness of bias mitigation in comparison with baseline finetuning. Particularly on a two-attribute dataset, our approach with separately learned debiasing subnetworks shows effective utilization of either or both the subnetworks for selective bias mitigation.

pdf bib

Exploring Intensities of Hate Speech on Social Media: A Case Study on Explaining Multilingual Models with XAI
Raisa Romanov Geleta | Klaus Eckelt | Emilia Parada-Cabaleiro | Markus Schedl
Proceedings of the 4th Conference on Language, Data and Knowledge

Markus Schedl

2026

2025

2024

2023

Co-authors

Venues