pdf
bib
Proceedings of Interdisciplinary Workshop on Observations of Misunderstood, Misguided and Malicious Use of Language Models
Piotr Przybyła
|
Matthew Shardlow
|
Clara Colombatto
|
Nanna Inie
pdf
bib
abs
Bias in, Bias out: Annotation Bias in Multilingual Large Language Models
Xia Cui
|
Ziyi Huang
|
Naeemeh Adel
Annotation bias in NLP datasets remains a major challenge for developing multilingual Large Language Models (LLMs), particularly in culturally diverse settings. Bias from task framing, annotator subjectivity, and cultural mismatches can distort model outputs and exacerbate social harms. We propose a comprehensive framework for understanding annotation bias, distinguishing among instruction bias, annotator bias, and contextual and cultural bias. We review detection methods (including inter-annotator agreement, model disagreement, and metadata analysis) and highlight emerging techniques such as multilingual model divergence and cultural inference. We further outline proactive and reactive mitigation strategies, including diverse annotator recruitment, iterative guideline refinement, and post-hoc model adjustments. Our contributions include: (1) a typology of annotation bias; (2) a synthesis of detection metrics; (3) an ensemble-based bias mitigation approach adapted for multilingual settings; and (4) an ethical analysis of annotation processes. Together, these insights aim to inform more equitable and culturally grounded annotation pipelines for LLMs.
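To make the detection metrics above concrete, here is a minimal sketch of one of them, inter-annotator agreement, computed as Cohen's kappa over two hypothetical annotators; the labels and label set are invented for illustration, and the paper's detection suite goes well beyond this.

    # Hedged sketch: chance-corrected agreement between two annotators as a bias signal.
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(labels_a) | set(labels_b))
        return (observed - chance) / (1 - chance)

    # Invented labels from two annotators with different cultural backgrounds.
    annotator_1 = ["toxic", "ok", "ok", "toxic", "ok", "toxic"]
    annotator_2 = ["toxic", "ok", "toxic", "toxic", "ok", "ok"]
    print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")  # low kappa flags possible annotation bias

A low kappa on its own does not prove bias, but disagreement concentrated in particular annotator groups or languages is the kind of signal the typology above is meant to explain.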
pdf
bib
abs
Freeze and Reveal: Exposing Modality Bias in Vision-Language Models
Vivek Hruday Kavuri
|
Vysishtya Karanam
|
Venkamsetty Venkata Jahnavi
|
Kriti Madumadukala
|
Balaji Lakshmipathi Darur
|
Ponnurangam Kumaraguru
Vision-Language Models (VLMs) achieve impressive multimodal performance but often inherit gender biases from their training data, and this bias can originate in both the vision and text modalities. In this work, we dissect the contributions of the vision and text backbones to these biases by applying targeted debiasing with Counterfactual Data Augmentation (CDA) and Task Vector methods. Inspired by data-efficient approaches in hate speech classification, we introduce a novel metric, Degree of Stereotypicality (DoS), and a corresponding debiasing method, Data Augmentation Using DoS (DAUDoS), to reduce bias at minimal computational cost. We curate a gender-annotated dataset and evaluate all methods on the VisoGender benchmark to quantify improvements and identify the dominant source of bias. CDA reduces the gender gap by 6% and DAUDoS by 3% while using only one-third of the data. Both methods also improve the model’s ability to correctly identify gender in images by 3%, with DAUDoS achieving this improvement on roughly one-third of the training data. Our experiments further show that the bias in CLIP stems mainly from its vision encoder, whereas in PaliGemma2 it stems mainly from the text encoder. By identifying whether bias originates more in the vision or the text encoder, our work enables more targeted and effective bias mitigation strategies for future multimodal systems.
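The Task Vector method mentioned above follows the general weight-arithmetic recipe sketched here; the toy tensors and the scaling factor are assumptions for illustration, not the paper's configuration.

    # Hedged sketch of task-vector negation:
    # theta_debiased = theta_base - alpha * (theta_biased - theta_base)
    import torch

    def negate_task_vector(base_state, biased_state, alpha=0.5):
        return {name: base_state[name] - alpha * (biased_state[name] - base_state[name])
                for name in base_state}

    # Toy two-parameter "encoder" standing in for a real vision or text backbone.
    base = {"w": torch.tensor([1.0, 2.0]), "b": torch.tensor([0.5])}
    biased = {"w": torch.tensor([1.5, 1.0]), "b": torch.tensor([0.7])}  # after fine-tuning on stereotyped data
    print(negate_task_vector(base, biased))

Applying the subtraction only to the vision or only to the text backbone is what lets this style of analysis attribute the bias to one modality.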
pdf
bib
abs
AnthroSet: a Challenge Dataset for Anthropomorphic Language Detection
Dorielle Lonke
|
Jelke Bloem
|
Pia Sommerauer
This paper addresses the challenge of detecting anthropomorphic language in AI research. We introduce AnthroSet, a novel dataset of 600 manually annotated utterances covering various linguistic structures. Through the evaluation of two current approaches for anthropomorphism and atypical animacy detection, we highlight the limitations of a masked language model approach, which arise both from masking constraints and from AI-related terminology that is itself increasingly anthropomorphizing. Our findings underscore the need for more targeted methods and a robust definition of anthropomorphism.
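As a rough picture of the masked-language-model probing the paper evaluates, the sketch below asks a generic masked LM which verbs it prefers for an AI subject; the model name and prompt are assumptions, and this is not the system tested on AnthroSet.

    # Hedged sketch: probe a masked LM for anthropomorphic verb preferences.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    for pred in fill("The language model [MASK] that its answer might be wrong.", top_k=5):
        print(f"{pred['token_str']:>12}  {pred['score']:.3f}")  # verbs such as "knows" or "realizes" signal animate framing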
pdf
bib
abs
FLARE: An Error Analysis Framework for Diagnosing LLM Classification Failures
Keerthana Madhavan
|
Luiza Antonie
|
Stacey Scott
When Large Language Models return “Inconclusive” in classification tasks, practitioners are left without insight into what went wrong. This diagnostic gap can delay medical decisions, undermine content moderation, and mislead downstream systems. We present FLARE (Failure Location and Reasoning Evaluation), a framework that transforms opaque failures into seven actionable categories. Applied to 5,400 election-misinformation classifications, FLARE reveals a surprising result: Few-Shot prompting—widely considered a best practice—produced 38× more failures than Zero-Shot, with 70.8% due to simple parsing issues. By exposing hidden failure modes, FLARE addresses critical misunderstandings in LLM deployment with implications across domains.
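A toy version of the kind of failure triage FLARE systematises is sketched below; the category names and heuristics are invented for illustration and are not the paper's seven categories.

    import re

    # Hedged sketch: sort raw LLM outputs into coarse failure buckets before analysis.
    def triage(raw_output: str) -> str:
        text = raw_output.strip().lower()
        if not text:
            return "empty_response"
        if "cannot" in text or "unable" in text:
            return "explicit_refusal"
        if not re.search(r"\b(misinformation|legitimate)\b", text):
            return "parsing_failure"  # the model answered, but not with an expected label
        return "parsed_ok"

    print(triage("As an AI, I cannot determine whether this claim is accurate."))  # explicit_refusal
    print(triage("Honestly, the post just seems a bit off."))                      # parsing_failure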
pdf
bib
abs
BuST: A Siamese Transformer Model for AI Text Detection in Bulgarian
Andrii Maslo
|
Silvia Gargova
We introduce BuST (Bulgarian Siamese Transformer), a novel method for detecting machine-generated Bulgarian text using paraphrase-based semantic similarity. Inspired by the RAIDAR approach, BuST employs a Siamese Transformer architecture to compare input texts with their LLM-generated paraphrases, identifying subtle linguistic patterns that indicate synthetic origin. In pilot experiments, BuST achieved 88.79% accuracy and an F1-score of 88.0%, performing competitively with strong baselines. While BERT reached higher raw scores, BuST offers a model-agnostic and adaptable framework for low-resource settings, demonstrating the promise of paraphrase-driven detection strategies.
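The paraphrase-similarity idea can be approximated with an off-the-shelf sentence encoder, as sketched below; the encoder name and example strings are assumptions, the paraphrase-generation step is omitted, and this is not the released BuST model.

    # Hedged sketch: score how close a text sits to its own LLM paraphrase.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # multilingual, covers Bulgarian

    def paraphrase_similarity(original: str, llm_paraphrase: str) -> float:
        emb = encoder.encode([original, llm_paraphrase], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()

    # In a full pipeline the paraphrase would come from an LLM; here it is written by hand.
    print(round(paraphrase_similarity("Това е примерен текст на български.",
                                      "Това е примерен български текст."), 3))

A threshold or a small classifier on top of such similarity features would then produce the final human-vs-machine label.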
pdf
bib
abs
F*ck Around and Find Out: Quasi-Malicious Interactions with LLMs as a Site of Situated Learning
Sarah O'Neill
This work-in-progress paper proposes a cross-disciplinary perspective on “malicious” interactions with large language models (LLMs). Rather than treating such interactions only as a threat to be mitigated, we ask whether certain adversarial interactions can also serve as productive learning encounters that demystify the opaque workings of AI systems for novice users. We ground this inquiry in an anecdotal observation of a student who deliberately sabotaged a machine-learning robot’s training process in order to understand its underlying logic. We pair this observation with a conceptual framework for learning with, through, and from the material quirks of LLMs, grounded in Papert’s constructionism and Hasse’s ultra-social learning theory. Finally, we present the preliminary design of a research-through-workshop in which non-experts will jailbreak various LLM chatbots, investigating this encounter as a situated learning process. We share this early-stage research as an invitation for feedback on reimagining inappropriate and harmful interactions with LLMs not merely as problems, but as opportunities for engagement and education.
pdf
bib
abs
<think> So let’s replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs
Sergey Pletenev
|
Alexander Panchenko
|
Daniil Moskovskiy
Modern Large Language Models (LLMs) are excellent at generating synthetic data. However, their performance in sensitive domains such as text detoxification has not received proper attention from the scientific community. This paper explores the possibility of using LLM-generated synthetic toxic data as an alternative to human-generated data for training detoxification models. Using activation-patched Llama 3 and Qwen models, we generated synthetic toxic counterparts for neutral texts from the ParaDetox and SST-2 datasets. Our experiments show that models fine-tuned on synthetic data consistently perform worse than those trained on human data, with a drop in performance of up to 30% in joint metrics. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuances and variety of human toxicity. These findings highlight the limitations of current LLMs in this domain and emphasize the continued importance of diverse, human-annotated data for building robust detoxification systems.
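The lexical diversity gap described above can be checked with simple distinct-n statistics, as in the sketch below; the toy sentences are invented and do not come from ParaDetox or SST-2.

    # Hedged sketch: distinct-1 (unique unigrams / total unigrams) as a diversity check.
    def distinct_n(texts, n=1):
        ngrams = []
        for text in texts:
            tokens = text.lower().split()
            ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return len(set(ngrams)) / max(len(ngrams), 1)

    synthetic = ["you stupid idiot", "what a stupid idiot", "another stupid idiot comment"]
    human = ["what a clueless take", "this reads like lazy spam", "utterly pointless rant"]
    print("distinct-1, synthetic:", round(distinct_n(synthetic), 2))  # repetitive insult vocabulary
    print("distinct-1, human:    ", round(distinct_n(human), 2))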
pdf
bib
abs
Anthropomorphizing AI: A Multi-Label Analysis of Public Discourse on Social Media
Muhammad Owais Raza
|
Areej Fatemah Meghji
As the anthropomorphization of AI in public discourse usually reflects a complex interplay of metaphors, media framing, and societal perceptions, it is increasingly being used to shape and influence public perception on a variety of topics. To explore public perception and investigate how AI is personified, emotionalized, and interpreted in public discourse, we develop a custom multi-label dataset from the titles and descriptions of YouTube videos discussing artificial intelligence (AI) and large language models (LLMs). This was accomplished using a hybrid annotation pipeline that combined human-in-the-loop validation with AI-assisted pre-labeling. This research introduces a novel taxonomy of narrative and epistemic dimensions commonly found in social media content on AI and LLMs. Using two classes of classification models, traditional machine learning and fine-tuned transformers, the experimental results indicate that the transformer models, particularly AnthroRoBERTa and AnthroDistilBERT, generally outperform the traditional machine learning approaches in anthropomorphization-focused classification.
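As a rough illustration of the traditional-ML side of the comparison, the sketch below fits a one-vs-rest multi-label baseline on invented video titles; the label names are placeholders, not the paper's taxonomy.

    # Hedged sketch: TF-IDF features with one binary classifier per label.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    titles = [
        "AI thinks it knows you better than you do",
        "New benchmark results for open LLMs",
        "This chatbot feels lonely when you log off",
        "How transformer attention actually works",
    ]
    labels = [{"agency"}, set(), {"emotion"}, set()]  # invented narrative labels

    binarizer = MultiLabelBinarizer()
    y = binarizer.fit_transform(labels)
    model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression(max_iter=1000)))
    model.fit(titles, y)
    print(binarizer.inverse_transform(model.predict(["The model wants to please its users"])))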
pdf
bib
abs
Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs
Jonathan Hvithamar Rystrøm
|
Hannah Rose Kirk
|
Scott Hale
Large Language Models (LLMs) are becoming increasingly capable across global languages. However, the ability to communicate across languages does not necessarily translate into appropriate cultural representations. A key concern is US-centric bias, where LLMs reflect US rather than local cultural values. We propose a novel methodology that compares LLM-generated response distributions against population-level opinion data from the World Values Survey across four languages (Danish, Dutch, English, and Portuguese). Using a rigorous linear mixed-effects regression framework, we compare three families of models: Google’s Gemma models (2B-27B parameters), AI2’s OLMo models (7B-32B parameters), and successive iterations of OpenAI’s turbo series. Across these model families, we find no consistent relationship between language capabilities and cultural alignment. While the Gemma models show a positive correlation between language capability and cultural alignment across all languages, the OpenAI and OLMo models do not. Our results demonstrate that achieving meaningful cultural alignment requires dedicated effort beyond improving general language capabilities.
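A toy version of the mixed-effects setup described above is sketched below; the numbers, column names, and exact model specification are assumptions made for illustration.

    # Hedged sketch: cultural alignment ~ language capability, random intercept per model family.
    import pandas as pd
    import statsmodels.formula.api as smf

    data = pd.DataFrame({
        "alignment":  [0.62, 0.64, 0.70, 0.71, 0.55, 0.54, 0.58, 0.57, 0.60, 0.61, 0.59, 0.63],
        "capability": [0.50, 0.58, 0.72, 0.78, 0.52, 0.60, 0.74, 0.80, 0.55, 0.63, 0.70, 0.77],
        "family":     ["gemma"] * 4 + ["olmo"] * 4 + ["openai"] * 4,
    })
    fit = smf.mixedlm("alignment ~ capability", data, groups=data["family"]).fit()
    print(fit.summary())  # the capability coefficient asks whether better language skills buy cultural alignment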
pdf
bib
abs
Learn, Achieve, Predict, Propose, Forget, Suffer: Analysing and Classifying Anthropomorphisms of LLMs
Matthew Shardlow
|
Ashley Williams
|
Charlie Roadhouse
|
Filippos Karolos Ventirozos
|
Piotr Przybyła
Anthropomorphism is a literary device in which human-like characteristics are attributed to non-human entities. However, the use of anthropomorphism in the scientific description and public communication of large language models could lead to misunderstanding among scientists and lay people regarding the technical capabilities and limitations of these models. In this study, we present an analysis of anthropomorphised language commonly used to describe LLMs, showing that the presence of terms such as ‘learn’, ‘achieve’, ‘predict’ and ‘can’ typically correlates with human labels of anthropomorphism. We also perform experiments to develop a sentence-level classification system for anthropomorphic descriptions of LLMs in scientific writing. We find that whilst a supervised RoBERTa-based system identifies anthropomorphisms with an F1-score of 0.564, state-of-the-art LLM-based approaches regularly overfit to the task.
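A toy version of the cue-term analysis mentioned above is sketched below; the sentences, labels, and naive substring matching are illustrative assumptions, not the paper's corpus or method.

    # Hedged sketch: how often each cue term co-occurs with an anthropomorphism label.
    cue_terms = ["learn", "achieve", "predict", "can"]
    sentences = [
        ("The model learns to predict the next token.", 1),
        ("We report results on three benchmarks.", 0),
        ("GPT-4 can achieve near-human accuracy.", 1),
        ("Results can vary across random seeds.", 0),
    ]
    for term in cue_terms:
        hits = [label for text, label in sentences if term in text.lower()]  # naive substring match
        rate = sum(hits) / len(hits) if hits else float("nan")
        print(f"{term:>8}: {len(hits)} sentences, anthropomorphic rate {rate:.2f}")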
pdf
bib
abs
Leveraging the Scala type system for secure LLM-generated code
Alexander Sternfeld
|
Ljiljana Dolamic
|
Andrei Kucharavy
Large language models (LLMs) have shown remarkable proficiency in code generation tasks across various programming languages. However, their outputs often contain subtle but critical vulnerabilities, posing significant risks when deployed in security-sensitive or mission-critical systems. This paper introduces an agentic AI framework designed to enhance the security and robustness of LLM-generated code by leveraging strongly typed and verifiable languages, using Scala as a representative example. We evaluate the effectiveness of our approach in two settings: formal verification with the Stainless framework and general-purpose secure code generation. Our experiments with leading open-source LLMs reveal that while direct code generation, much like naive prompting for more secure code, often fails to enforce safety constraints, our type-focused agentic pipeline substantially mitigates input validation and injection vulnerabilities. The results demonstrate the potential of structured, type-guided LLM workflows to advance the state of the art in trustworthy automated code generation for high-assurance domains.
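The agentic idea can be pictured as a generate-check-repair loop like the sketch below; ask_llm is a hypothetical placeholder for an LLM client, and plain scalac stands in here for the stronger checks (such as Stainless proofs) used in the paper's pipeline.

    # Hedged sketch: regenerate until the candidate at least passes the Scala type checker.
    import pathlib
    import subprocess
    import tempfile

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in an LLM client here")  # hypothetical placeholder

    def generate_type_checked_scala(task: str, max_rounds: int = 3) -> str:
        prompt = f"Write Scala code with precise types for: {task}"
        for _ in range(max_rounds):
            code = ask_llm(prompt)
            src = pathlib.Path(tempfile.mkdtemp()) / "Candidate.scala"
            src.write_text(code)
            check = subprocess.run(["scalac", str(src)], capture_output=True, text=True)
            if check.returncode == 0:
                return code  # type-checks; a full pipeline would also run verification
            prompt = f"The compiler rejected the code:\n{check.stderr}\nFix it and return only Scala code."
        raise RuntimeError("no type-correct candidate within the round budget")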