Vera Schmitt

2026

Persona Prompting as a Lens on LLM Social Reasoning
Jing Yang | Moritz Hechtbauer | Elisabeth Khalilov | Evelyn Luise Brinkmann | Vera Schmitt | Nils Feldhus
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

For socially sensitive tasks like hate speech detection, the quality of explanations from Large Language Models (LLMs) is crucial for factors like user trust and model alignment. While Persona prompting (PP) is increasingly used as a way to steer model towards user-specific generation, its effect on model rationales remains underexplored. We investigate how LLM-generated rationales vary when conditioned on different simulated demographic personas. Using datasets annotated with word-level rationales, we measure agreement with human annotations from different demographic groups, and assess the impact of PP on model bias and human alignment. Our evaluation across three LLMs results reveals three key findings: (1) PP improving classification on the most subjective task (hate speech) but degrading rationale quality. (2) Simulated personas fail to align with their real-world demographic counterparts, and high inter-persona agreement shows models are resistant to significant steering. (3) Models exhibit consistent demographic biases and a strong tendency to over-flag content as harmful, regardless of PP. Our findings reveal a critical trade-off: while PP can improve classification in socially-sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, urging caution in its application.

pdf bib abs

News Credibility Assessment by LLMs and Humans: Implications for Political Bias
Pia Wenzel Neves | Charlott Jakob | Vera Schmitt
The Proceedings for the 15th Workshop on Computational Approaches to Subjectivity, Sentiment Social Media Analysis (WASSA 2026)

In an era of rapid misinformation spread, LLMs have emerged as tools for assessing news credibility at scale. However, the assessments are influenced by social and cultural biases. Studies investigating political bias, compare model credibility ratings with expert credibility ratings. Comparing LLMs to the perceptions of political camps extends this approach to detecting similarities in their biases.We compare LLM-generated credibility and bias ratings of news outlets with expert assessments and stratified political opinions collected through surveys. We analyse three models (Llama 3.3 70B, Mixtral 8x7B, and GPT-OSS 120B) across 47 news outlets from two countries (U.S. and Germany).We found that models demonstrated consistently high alignment with expert ratings, while showing weaker and more variable alignment with public opinions. For US-American news outlets all models showed stronger alignment with center-left perceptions, while for German news outlets the alignment is more diverse.

pdf bib abs

Multimodal fact checking has become increasingly important due to the predominance of visual content on social media platforms, where images are frequently used to enhance the credibility and spread of misleading claims, while generated images become more prevalent and realistic as generative models advance. Incorporating visual information, however, substantially increases computational costs, raising critical efficiency concerns for practical deployment. In this study, we propose and evaluate the ADA-AGGR (ensemble retrievAl for multimoDAl evidence AGGRegation) pipeline, which achieved the second place on both the dev and test leaderboards of the FEVER 9/AVerImaTeC shared task. However, long runtimes per claim highlight challenges regarding efficiency concerns when designing multimodal claim verification pipelines. We therefore run extensive ablation studies and configuration analyses to identify possible performance–runtime improvements. Our experiments show that substantial efficiency gains are possible without significant loss in verification quality. For instance, we reduced the average runtime by up to 6.28× while maintaining comparable performance across evaluation metrics by aggressively downsampling input images processed by visual language models. Overall, our results highlight that careful design choices are crucial for building scalable and resource-efficient multimodal fact-checking systems suitable for real-world deployment.

pdf bib abs

This paper presents an efficiency-aware pipeline for automated fact-checking of real-world image–text claims that treats multimodality as a controllable design variable rather than a property that must be uniformly propagated through every stage of the system. The approach decomposes claims into verification questions, assigns each to text- or image-related types, and applies modality-aware retrieval strategies, while ultimately relying on text-only evidence for verdict prediction and justification generation. Evaluated on the AVerImaTeC dataset within the FEVER-9 shared task, the system achieves competitive question, evidence, verdict, and justification scores and ranks fourth overall, outperforming the official baseline on evidence recall, verdict accuracy, and justification quality despite not using visual evidence during retrieval. These results demonstrate that strong performance on multimodal fact-checking can be achieved by selectively controlling where visual information influences retrieval and reasoning, rather than performing full multimodal fusion at every stage of the pipeline.

2025

pdf bib abs

Cross-Refine: Improving Natural Language Explanation Generation by Learning in Tandem
Qianli Wang | Tatiana Anikina | Nils Feldhus | Simon Ostermann | Sebastian Möller | Vera Schmitt
Proceedings of the 31st International Conference on Computational Linguistics

Natural language explanations (NLEs) are vital for elucidating the reasoning behind large language model (LLM) decisions. Many techniques have been developed to generate NLEs using LLMs. However, like humans, LLMs might not always produce optimal NLEs on first attempt. Inspired by human learning processes, we introduce Cross-Refine, which employs role modeling by deploying two LLMs as generator and critic, respectively. The generator outputs a first NLE and then refines this initial explanation using feedback and suggestions provided by the critic. Cross-Refine does not require any supervised training data or additional training. We validate Cross-Refine across three NLP tasks using three state-of-the-art open-source LLMs through automatic and human evaluation. We select Self-Refine (Madaan et al., 2023) as the baseline, which only utilizes self-feedback to refine the explanations. Our findings from automatic evaluation and a user study indicate that Cross-Refine outperforms Self-Refine. Meanwhile, Cross-Refine can perform effectively with less powerful LLMs, whereas Self-Refine only yields strong results with ChatGPT. Additionally, we conduct an ablation study to assess the importance of feedback and suggestions. Both of them play an important role in refining explanations. We further evaluate Cross-Refine on a bilingual dataset in English and German.

pdf bib abs

FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation
Qianli Wang | Nils Feldhus | Simon Ostermann | Luis Felipe Villa-Arenas | Sebastian Möller | Vera Schmitt
Findings of the Association for Computational Linguistics: ACL 2025

Counterfactual examples are widely used in natural language processing (NLP) as valuable data to improve models, and in explainable artificial intelligence (XAI) to understand model behavior. The automated generation of counterfactual examples remains a challenging task even for large language models (LLMs), despite their impressive performance on many tasks. In this paper, we first introduce ZeroCF, a faithful approach for leveraging important words derived from feature attribution methods to generate counterfactual examples in a zero-shot setting. Second, we present a new framework, FitCF, which further verifies aforementioned counterfactuals by label flip verification and then inserts them as demonstrations for few-shot prompting, outperforming three state-of-the-art baselines. Through ablation studies, we identify the importance of each of FitCF’s core components in improving the quality of counterfactuals, as assessed through flip rate, perplexity, and similarity measures. Furthermore, we show the effectiveness of LIME and Integrated Gradients as backbone attribution methods for FitCF and find that the number of demonstrations has the largest effect on performance. Finally, we reveal a strong correlation between the faithfulness of feature attribution scores and the quality of generated counterfactuals, which we hope will serve as an importantfinding for future research in this direction.

pdf bib abs

Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals
Qianli Wang | Van Bach Nguyen | Nils Feldhus | Luis Felipe Villa-Arenas | Christin Seifert | Sebastian Möller | Vera Schmitt
Proceedings of the 18th International Natural Language Generation Conference

Counterfactual examples are widely employed to enhance the performance and robustness of large language models (LLMs) through counterfactual data augmentation (CDA). However, the selection of the judge model used to evaluate label flipping, the primary metric for assessing the validity of generated counterfactuals for CDA, yields inconsistent results. To decipher this, we define four types of relationships between the counterfactual generator and judge models: being the same model, belonging to the same model family, being independent models, and having an distillation relationship. Through extensive experiments involving two state-of-the-art LLM-based methods, three datasets, four generator models, and 15 judge models, complemented by a user study (n = 90), we demonstrate that judge models with an independent, non-fine-tuned relationship to the generator model provide the most reliable label flipping evaluations. Relationships between the generator and judge models, which are closely aligned with the user study for CDA, result in better model performance and robustness. Nevertheless, we find that the gap between the most effective judge models and the results obtained from the user study remains considerably large. This suggests that a fully automated pipeline for CDA may be inadequate and requires human intervention.

pdf bib abs

PolBiX: Detecting LLMs’ Political Bias in Fact-Checking through X-phemisms
Charlott Jakob | David Harbecke | Patrick Parschan | Pia Wenzel Neves | Vera Schmitt
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models are increasingly used in applications requiring objective assessment, which could be compromised by political bias. Many studies found preferences for left-leaning positions in LLMs, but downstream effects on tasks like fact-checking remain underexplored. In this study, we systematically investigate political bias through exchanging words with euphemisms or dysphemisms in German claims. We construct minimal pairs of factually equivalent claims that differ in political connotation, to assess the consistency of LLMs in classifying them as true or false. We evaluate six LLMs and find that, more than political leaning, the presence of judgmental words significantly influences truthfulness assessment. While a few models show tendencies of political bias, this is not mitigated by explicitly calling for objectivism in prompts. Warning: This paper contains content that may be offensive or upsetting.

pdf bib abs

Improving Sentiment Analysis for Ukrainian Social Media Code-Switching Data
Yurii Shynkarov | Veronika Solopova | Vera Schmitt
Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP 2025)

This paper addresses the challenges of sentiment analysis in Ukrainian social media, where users frequently engage in code-switching with Russian and other languages. We introduce COSMUS (COde-Switched MUltilingual Sentiment for Ukrainian Social media), a 12,224-texts corpus collected from Telegram channels, product‐review sites and open datasets, annotated into positive, negative, neutral and mixed sentiment classes as well as language labels (Ukrainian, Russian, code-switched). We benchmark three modeling paradigms: (i) few‐shot prompting of GPT‐4o and DeepSeek V2-chat, (ii) multilingual mBERT, and (iii) the Ukrainian‐centric UkrRoberta. We also analyze calibration and LIME scores of the latter two solutions to verify its performance on various language labels. To mitigate data sparsity we test two augmentation strategies: back‐translation consistently hurts performance, whereas a Large Language Model (LLM) word‐substitution scheme yields up to +2.2% accuracy. Our work delivers the first publicly available dataset and comprehensive benchmark for sentiment classification in Ukrainian code‐switching media. Results demonstrate that language‐specific pre‐training combined with targeted augmentation yields the most accurate and trustworthy predictions in this challenging low‐resource setting.

pdf bib abs

Given the limited computational and financial resources of news agencies, real-life usage of fact-checking systems requires fast response times. For this reason, our submission to the FEVER-8 claim verification shared task focuses on optimizing the efficiency of such pipelines built around subtasks such as evidence retrieval and veracity prediction. We propose the Semantic Filtering for Efficient Fact Checking (SFEFC) strategy, which is inspired by the FEVER-8 baseline and designed with the goal of reducing the number of LLM calls and other computationally expensive subroutines. Furthermore, we explore the reuse of cosine similarities initially calculated within a dense retrieval step to retrieve the top 10 most relevant evidence sentence sets. We use these sets for semantic filtering methods based on similarity scores and create filters for particularly hard classification labels “Not Enough Information” and “Conflicting Evidence/Cherrypicking” by identifying thresholds for potentially relevant information and the semantic variance within these sets. Compared to the parallelized FEVER-8 baseline, which takes 33.88 seconds on average to process a claim according to the FEVER-8 shared task leaderboard, our non-parallelized system remains competitive in regard to AVeriTeC retrieval scores while reducing the runtime to 7.01 seconds, achieving the fastest average runtime per claim.

pdf bib abs

Conversational explainable artificial intelligence (ConvXAI) systems based on large language models (LLMs) have garnered considerable attention for their ability to enhance user comprehension through dialogue-based explanations. Current ConvXAI systems often are based on intent recognition to accurately identify the user’s desired intention and map it to an explainability method. While such methods offer great precision and reliability in discerning users’ underlying intentions for English, a significant challenge in the scarcity of training data persists, which impedes multilingual generalization. Besides, the support for free-form custom inputs, which are user-defined data distinct from pre-configured dataset instances, remains largely limited. To bridge these gaps, we first introduce MultiCoXQL, a multilingual extension of the CoXQL dataset spanning five typologically diverse languages, including one low-resource language. Subsequently, we propose a new parsing approach aimed at enhancing multilingual parsing performance, and evaluate three LLMs on MultiCoXQL using various parsing strategies. Furthermore, we present Compass, a new multilingual dataset designed for custom input extraction in ConvXAI systems, encompassing 11 intents across the same five languages as MultiCoXQL. We conduct monolingual, cross-lingual, and multilingual evaluations on Compass, employing three LLMs of varying sizes alongside BERT-type models.

pdf bib abs

The overwhelming volume of content being published at any given moment poses a significant challenge for the design of automated fact-checking (AFC) systems on social media, requiring an emphasized consideration of efficiency aspects.As in other fields, systems built upon LLMs have achieved good results on different AFC benchmarks. However, the application of LLMs is accompanied by high resource requirements. The energy consumption of LLMs poses a significant challenge from an ecological perspective, while remaining a bottleneck in latency-sensitive scenarios like AFC within social media. Therefore, we propose a system built upon fine-tuned smaller BERT-based models. When evaluated on the ClimateCheck dataset against decoder-only LLMs, our best fine-tuned model outperforms Phi 4 and approaches Qwen3 14B in reasoning mode — while significantly reducing runtime per claim. Our findings demonstrate that small encoder-only models fine-tuned for specific tasks can still provide a substantive alternative to large decoder-only LLMs, especially in efficiency-concerned settings.

pdf bib abs

Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence
Ariana Sahitaj | Premtim Sahitaj | Veronika Solopova | Jiaao Li | Sebastian Möller | Vera Schmitt
Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI)

Propaganda detection on social media remains challenging due to task complexity and limited high-quality labeled data. This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability. We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques (CITATION) into three broader categories, conduct a human annotation study on the HQP dataset (CITATION) that reveals low inter-annotator agreement for fine-grained labels, and implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label. A secondary human verification study shows significant improvements in both agreement and time-efficiency. Building on this, we fine-tune smaller language models (SLMs) to perform structured annotation. Instead of fine-tuning on human annotations, we train on high-quality LLM-generated data, allowing a large model to produce these annotations and a smaller model to learn to generate them via knowledge distillation. Our work contributes towards the development of scalable and robust propaganda detection systems, supporting the idea of transparent and accountable media ecosystems in line with SDG 16. The code is publicly available at our GitHub repository.

2024

pdf bib abs

Augmented Political Leaning Detection: Leveraging Parliamentary Speeches for Classifying News Articles
Charlott Jakob | Pia Wenzel | Salar Mohtaj | Vera Schmitt
Proceedings of the 4th Workshop on Computational Linguistics for the Political and Social Sciences: Long and short papers

In an era where political discourse infiltrates online platforms and news media, identifying opinion is increasingly critical, especially in news articles, where objectivity is expected. Readers frequently encounter authors’ inherent political viewpoints, challenging them to discern facts from opinions. Classifying text on a spectrum from left to right is a key task for uncovering these viewpoints. Previous approaches rely on outdated datasets to classify current articles, neglecting that political opinions on certain subjects change over time. This paper explores a novel methodology for detecting political leaning in news articles by augmenting them with political speeches specific to the topic and publication time. We evaluated the impact of the augmentation using BERT and Mistral models. The results show that the BERT model’s F1 score improved from a baseline of 0.82 to 0.85, while the Mistral model’s F1 score increased from 0.30 to 0.31.

pdf bib abs

Implications of Regulations on Large Generative AI Models in the Super-Election Year and the Impact on Disinformation
Vera Schmitt | Jakob Tesch | Eva Lopez | Tim Polzehl | Aljoscha Burchardt | Konstanze Neumann | Salar Mohtaj | Sebastian Möller
Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024

With the rise of Large Generative AI Models (LGAIMs), disinformation online has become more concerning than ever before. Within the super-election year 2024, the influence of mis- and disinformation can severely influence public opinion. To combat the increasing amount of disinformation online, humans need to be supported by AI-based tools to increase the effectiveness of detecting false content. This paper examines the critical intersection of the AI Act with the deployment of LGAIMs for disinformation detection and the implications from research, deployer, and the user’s perspective. The utilization of LGAIMs for disinformation detection falls under the high-risk category defined in the AI Act, leading to several obligations that need to be followed after the enforcement of the AI Act. Among others, the obligations include risk management, transparency, and human oversight which pose the challenge of finding adequate technical interpretations. Furthermore, the paper articulates the necessity for clear guidelines and standards that enable the effective, ethical, and legally compliant use of AI. The paper contributes to the discourse on balancing technological advancement with ethical and legal imperatives, advocating for a collaborative approach to utilizing LGAIMs in safeguarding information integrity and fostering trust in digital ecosystems.

2023

pdf bib

Multi-task Learning for German Text Readability Assessment
Salar Mohtaj | Vera Schmitt | Razieh Khamsehashari | Sebastian Möller
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Vera Schmitt

2026

2025

2024

2023

Co-authors

Venues