Özge Alacam - ACL Anthology

Özge Alacam

Also published as: Ozge Alacam, Özge Alaçam

2025

Not Just Who or What: Modeling the Interaction of Linguistic and Annotator Variation in Hateful Word Interpretation
Sanne Hoeken | Özge Alacam | Dong Nguyen | Massimo Poesio | Sina Zarrieß
Proceedings of the 16th International Conference on Computational Semantics

Interpreting whether a word is hateful in context is inherently subjective. While growing research in NLP recognizes the importance of annotation variation and moves beyond treating it as noise, most work focuses primarily on annotator-related factors, often overlooking the role of linguistic context and its interaction with individual interpretation. In this paper, we investigate the factors driving variation in hateful word meaning interpretation by extending the HateWiC dataset with linguistic and annotator-level features. Our empirical analysis shows that variation in annotations is not solely a function of who is interpreting or what is being interpreted, but of the interaction between the two. We evaluate how well models replicate the patterns of human variation. We find that incorporating annotator information can improve alignment with human disagreement but still underestimates it. Our findings further demonstrate that capturing interpretation variation requires modeling the interplay between annotators and linguistic content and that neither surface-level agreement nor predictive accuracy alone is sufficient for truly reflecting human variation.

A Federated Approach to Few-Shot Hate Speech Detection for Marginalized Communities
Haotian Ye | Axel Wisiorek | Antonis Maronikolakis | Özge Alaçam | Hinrich Schütze
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)

Despite substantial efforts, detecting and preventing hate speech online remains an understudied task for marginalized communities, particularly in the Global South, which includes developing societies with increasing internet penetration. In this paper, we aim to provide marginalized communities in societies where the dominant language is low-resource with a privacy-preserving tool to protect themselves from online hate speech by filtering offensive content in their native languages. Our contributions are twofold: 1) we release REACT (REsponsive hate speech datasets Across ConTexts), a collection of high-quality, culture-specific hate speech detection datasets comprising multiple target groups and low-resource languages, curated by experienced data collectors; 2) we propose a few-shot hate speech detection approach based on federated learning (FL), a privacy-preserving method for collaboratively training a central model that exhibits robustness when tackling different target groups and languages. By keeping training local to user devices, we ensure data privacy while leveraging the collective learning benefits of FL. We experiment with both multilingual and monolingual pre-trained representation spaces as backbones to examine the interaction between FL and different model representations. Furthermore, we explore personalized client models tailored to specific target groups and evaluate their performance. Our findings indicate the overall effectiveness of FL across different target groups, and point to personalization as a promising direction.

Components of Creativity: Language Model-based Predictors for Clustering and Switching in Verbal Fluency
Sina Zarrieß | Simeon Junker | Judith Sieker | Özge Alacam
Proceedings of the 29th Conference on Computational Natural Language Learning

Verbal fluency is an experimental paradigm used to examine human knowledge retrieval, cognitive performance and creative abilities. This work investigates the psychometric capacities of LMs in this task. We focus on switching and clustering patterns and seek evidence to substantiate them as two distinct and separable components of lexical retrieval processes in LMs. We prompt different transformer-based LMs with verbal fluency items and ask whether metrics derived from the language models’ prediction probabilities or internal attention distributions offer reliable predictors of switching/clustering behaviors in verbal fluency. We find that token probabilities, but especially attention-based metrics, have strong statistical power when separating cases of switching from cases of clustering, in line with prior research on human cognition.

Using LLMs and Preference Optimization for Agreement-Aware HateWiC Classification
Sebastian Loftus | Adrian Mülthaler | Sanne Hoeken | Sina Zarrieß | Ozge Alacam
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)

Annotator disagreement poses a significant challenge in subjective tasks like hate speech detection. In this paper, we introduce a novel variant of the HateWiC task that explicitly models annotator agreement by estimating the proportion of annotators who classify the meaning of a term as hateful. To tackle this challenge, we explore the use of Llama 3 models fine-tuned through Direct Preference Optimization (DPO). Our experiments show that while LLMs perform well for majority-based hate classification, they struggle with the more complex agreement-aware task. DPO fine-tuning offers improvements, particularly when applied to instruction-tuned models. Yet, our results emphasize the need for improved modeling of subjectivity in hate classification, and this study can serve as a foundation for future advancements.

Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models
Raha Askari | Sina Zarrieß | Özge Alacam | Judith Sieker
Proceedings of the First BabyLM Workshop

Implicit meanings are integral to human communication, making it essential for language models to be capable of identifying and interpreting them. Grice (1975) proposed a set of conversational maxims that guide cooperative dialogue, noting that speakers may deliberately violate these principles to express meanings beyond literal words, and that listeners, in turn, recognize such violations to draw pragmatic inferences. Building on Surian et al. (1996)’s study of children’s sensitivity to violations of Gricean maxims, we introduce a novel benchmark to test whether language models pretrained on <10M and <100M tokens can distinguish maxim-adhering from maxim-violating utterances. We compare these BabyLMs across five maxims and situate their performance relative to children and a Large Language Model (LLM) pretrained on 3T tokens. We find that overall, models trained on <100M tokens outperform those trained on <10M, yet fall short of child-level and LLM competence. Our results suggest that modest data increases improve some aspects of pragmatic behavior, leading to finer-grained differentiation between pragmatic dimensions.

Disentangling Subjectivity and Uncertainty for Hate Speech Annotation and Modeling using Gaze
Özge Alacam | Sanne Hoeken | Andreas Säuberli | Hannes Gröner | Diego Frassinelli | Sina Zarrieß | Barbara Plank
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Variation is inherent in opinion-based annotation tasks like sentiment or hate speech analysis. It arises not only from errors, fatigue, or sentence ambiguity but also from genuine differences in opinion shaped by background, experience, and culture. In this paper, we first show how annotators’ confidence ratings can be of great use for disentangling subjective variation from uncertainty, without relying on specific features present in the data (text, gaze, etc.). Our goal is to establish distinctive dimensions of variation, which are often not clearly separated in existing work on modeling annotator variation. We illustrate our approach through a hate speech detection task, demonstrating that models are affected differently by instances of uncertainty and subjectivity. In addition, we show that human gaze patterns offer valuable indicators of subjective evaluation and uncertainty. Disclaimer: This paper contains sentences that may be offensive.

2024

WikiScenes with Descriptions: Aligning Paragraphs and Sentences with Images in Wikipedia Articles
Özge Alaçam | Ronja Utescher | Hannes Grönner | Judith Sieker | Sina Zarrieß
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Research in Language & Vision rarely uses naturally occurring multimodal documents such as Wikipedia articles, since they feature complex image-text relations and implicit image-text alignments. In this paper, we provide one of the first datasets with ground-truth annotations of image-text alignments in multi-paragraph, multi-image articles. The dataset can be used to study phenomena of visual language grounding in longer documents and to assess the retrieval capabilities of language models trained on, e.g., captioning data. Our analyses show that there are systematic linguistic differences between the image captions and descriptive sentences from the article’s text and that intra-document retrieval is a challenging task for state-of-the-art models in L&V (CLIP, VILT, MCSE).

Hateful Word in Context Classification
Sanne Hoeken | Sina Zarrieß | Özge Alacam
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Hate speech detection is a prevalent research field, yet it remains underexplored at the level of word meaning. This is significant, as terms used to convey hate often involve non-standard or novel usages which might be overlooked by commonly leveraged LMs trained on general language use. In this paper, we introduce the Hateful Word in Context Classification (HateWiC) task and present a dataset of ~4000 WiC-instances, each labeled by three annotators. Our analyses and computational exploration focus on the interplay between the subjective nature (context-dependent connotations) and the descriptive nature (as described in dictionary definitions) of hateful word senses. HateWiC annotations confirm that hatefulness of a word in context does not always derive from the sense definition alone. We explore the prediction of both majority and individual annotator labels, and we experiment with modeling context- and sense-based inputs. Our findings indicate that including definitions proves effective overall, yet not in cases where hateful connotations vary. Conversely, including annotator demographics becomes more important for mitigating performance drop in subjective hate prediction.

Eyes Don’t Lie: Subjective Hate Annotation and Detection with Gaze
Özge Alacam | Sanne Hoeken | Sina Zarrieß
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Hate speech is a complex and subjective phenomenon. In this paper, we present a dataset (GAZE4HATE) that provides gaze data collected in a hate speech annotation experiment. We study whether the gaze of an annotator provides predictors of their subjective hatefulness rating, and how gaze features can improve Hate Speech Detection (HSD). We conduct experiments on statistical modeling of subjective hate ratings and gaze and analyze to what extent rationales derived from hate speech models correspond to human gaze and explanations in our data. Finally, we introduce MEANION, a first gaze-integrated HSD model. Our experiments show that particular gaze features like dwell time or fixation counts systematically correlate with annotators’ subjective hate ratings and improve predictions of text-only hate speech models.

Text or Image? What is More Important in Cross-Domain Generalization Capabilities of Hate Meme Detection Models?
Piush Aggarwal | Jawar Mehrabanian | Weigang Huang | Özge Alacam | Torsten Zesch
Findings of the Association for Computational Linguistics: EACL 2024

This paper delves into the formidable challenge of cross-domain generalization in multimodal hate meme detection, presenting compelling findings. We provide evidence supporting the hypothesis that only the textual component of hateful memes enables the multimodal classifier to generalize across different domains, while the image component proves highly sensitive to a specific training dataset. The evidence includes demonstrations showing that hate-text classifiers perform similarly to hate-meme classifiers in a zero-shot setting. At the same time, introducing captions generated from the memes’ images to the hate-meme classifier worsens performance by an average F1 of 0.02. Through black-box explanations, we identify a substantial contribution of the text modality (average of 83%), which diminishes with the introduction of the memes’ image captions (52%). Additionally, our evaluation on a newly created confounder dataset reveals higher performance on text confounders as compared to image confounders, with an average ∆F1 of 0.18.

2023

Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing
Piush Aggarwal | Özge Alaçam | Carina Silberer | Sina Zarrieß | Torsten Zesch
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing

Methodological Insights in Detecting Subtle Semantic Shifts with Contextualized and Static Language Models
Sanne Hoeken | Özge Alacam | Antske Fokkens | Pia Sommerauer
Findings of the Association for Computational Linguistics: EMNLP 2023

In this paper, we investigate automatic detection of subtle semantic shifts between social communities of different political convictions in Dutch and English. We perform a methodological study comparing methods using static and contextualized language models. We investigate the impact of specializing contextualized models through fine-tuning on target corpora, word sense disambiguation and sentiment. We furthermore propose a new approach using masked token prediction that relies on behavioral information, specifically the most probable substitutions, instead of geometrical comparison of representations. Our results show that methods using static models and our masked token prediction method can detect differences in connotation of politically loaded terms, whereas methods that rely on measuring the distance between contextualized representations do not provide clear signals, even in synthetic scenarios of extreme shifts.

Identifying Slurs and Lexical Hate Speech via Light-Weight Dimension Projection in Embedding Space
Sanne Hoeken | Sina Zarrieß | Ozge Alacam
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

The prevalence of hate speech on online platforms has become a pressing concern for society, leading to increased attention towards detecting hate speech. Prior work in this area has primarily focused on identifying hate speech at the utterance level, which reflects the complex nature of hate speech. In this paper, we propose a targeted and efficient approach to identifying hate speech by detecting slurs at the lexical level using contextualized word embeddings. We hypothesize that slurs have a systematically different representation than their neutral counterparts, making them identifiable through existing methods for discovering semantic dimensions in word embeddings. The results demonstrate the effectiveness of our approach in predicting slurs, confirming linguistic theory that the meaning of slurs is stable across contexts. Our robust hate dimension approach for slur identification offers a promising solution to tackle a smaller yet crucial piece of the complex puzzle of hate speech detection.

Towards Detecting Lexical Change of Hate Speech in Historical Data
Sanne Hoeken | Sophie Spliethoff | Silke Schwandt | Sina Zarrieß | Özge Alacam
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

The investigation of lexical change has predominantly focused on generic language evolution, not suited for detecting shifts in a particular domain, such as hate speech. Our study introduces the task of identifying changes in lexical semantics related to hate speech within historical texts. We present an interdisciplinary approach that brings together NLP and History, yielding a pilot dataset comprising 16th-century Early Modern English religious writings during the Protestant Reformation. We provide annotations for both semantic shifts and hatefulness on this data and, thereby, combine the tasks of Lexical Semantic Change Detection and Hate Speech Detection. Our framework and resulting dataset facilitate the evaluation of our applied methods, advancing the analysis of hate speech evolution.

2022

Exploring Semantic Spaces for Detecting Clustering and Switching in Verbal Fluency
Özge Alacam | Simeon Schüz | Martin Wegrzyn | Johanna Kißler | Sina Zarrieß
Proceedings of the 29th International Conference on Computational Linguistics

In this work, we explore the fitness of various word/concept representations in analyzing an experimental verbal fluency dataset providing human responses to 10 different category enumeration tasks. Based on human annotations of so-called clusters and switches between sub-categories in the verbal fluency sequences, we analyze whether lexical semantic knowledge represented in word embedding spaces (GloVe, fastText, ConceptNet, BERT) is suitable for detecting these conceptual clusters and switches within and across different categories. Our results indicate that ConceptNet embeddings, a distributional semantics method enriched with taxonomical relations, outperform other semantic representations by a large margin. Moreover, category-specific analysis suggests that individual thresholds per category are better suited for the analysis of clustering and switching in a particular embedding sub-space than a one-size-fits-all cross-category solution. The results point to interesting directions for future work on probing word embedding models on the verbal fluency task.

Modeling Referential Gaze in Task-oriented Settings of Varying Referential Complexity
Özge Alaçam | Eugen Ruppert | Ganeshan Malhotra | Chris Biemann | Sina Zarrieß
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Referential gaze is a fundamental phenomenon for psycholinguistics and human-human communication. However, modeling referential gaze for real-world scenarios, e.g. for task-oriented communication, has not received the attention it deserves from the NLP community. In this paper, we address this challenging issue by proposing a novel multimodal NLP task, namely predicting when the gaze is referential. We further investigate how to model referential gaze and transfer gaze features to adapt to unseen situated settings that target different referential complexities than the training environment. We train (i) a sequential attention-based LSTM model and (ii) a multivariate transformer encoder architecture to predict whether the gaze is on a referent object. The models are evaluated on the three complexity datasets. The results indicate that the gaze features can be transferred not only among various similar tasks and scenes but also across various complexity levels. Taking the referential complexity of a scene into account is important for successful target prediction using gaze parameters, especially when there is not much data for fine-tuning.

The Why and The How: A Survey on Natural Language Interaction in Visualization
Henrik Voigt | Ozge Alacam | Monique Meuschke | Kai Lawonn | Sina Zarrieß
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Natural language as a modality of interaction is becoming increasingly popular in the field of visualization. In addition to the popular query interfaces, other language-based interactions such as annotations, recommendations, explanations, or documentation are attracting growing interest. In this survey, we provide an overview of natural language-based interaction in the research area of visualization. We discuss a renowned taxonomy of visualization tasks and classify 119 related works to illustrate the state of the art in how current natural language interfaces support their performance. We examine applied NLP methods and discuss human-machine dialogue structures with a focus on initiative, duration, and communicative functions in recent visualization-oriented dialogue interfaces. Based on this overview, we point out interesting areas for the future application of NLP methods in the field of visualization.

MOTIF: Contextualized Images for Complex Words to Improve Human Reading
Xintong Wang | Florian Schneider | Özge Alacam | Prateek Chaudhury | Chris Biemann
Proceedings of the Thirteenth Language Resources and Evaluation Conference

MOTIF (MultimOdal ConTextualized Images For Language Learners) is a multimodal dataset that consists of 1125 comprehension texts retrieved from Wikipedia Simple Corpus. Allowing multimodal processing or enriching the context with multimodal information has proven imperative for many learning tasks, specifically for second language (L2) learning. In this respect, several traditional NLP approaches can assist L2 readers in text comprehension processes, such as simplifying text or giving dictionary descriptions for complex words. As the well-known proverb states, sometimes “a picture is worth a thousand words”, and an image can successfully complement the verbal message by enriching the representation, as in Pictionary books. This multimodal support can also assist the on-the-fly text reading experience through a multimodal tool that chooses and displays the most relevant images for the difficult words, given the text context. This study mainly focuses on one of the key components of achieving this goal: collecting a multimodal dataset enriched with complex word annotation and validated image match.

2021

Situation-Specific Multimodal Feature Adaptation
Özge Alacam
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

In the next decade, we will see a considerable need for NLP models for situated settings where the diversity of situations and different modalities, including eye movements, should be taken into account in order to grasp the intention of the user. However, language comprehension cannot be handled in isolation in situated settings, where different multimodal cues are inherently present and are essential parts of the situations. In this research proposal, we aim to quantify the influence of each modality in interaction with various referential complexities. We propose to encode the referential complexity of the situated settings in the embeddings during pre-training to implicitly guide the model to the most plausible situation-specific deviations. We summarize the challenges of intention extraction and propose a methodological approach to investigate a situation-specific feature adaptation to improve crossmodal mapping and meaning recovery from noisy communication settings.

Towards Multi-Modal Text-Image Retrieval to improve Human Reading
Florian Schneider | Özge Alaçam | Xintong Wang | Chris Biemann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

In primary school, children’s books, as well as in modern language learning apps, multi-modal learning strategies like illustrations of terms and phrases are used to support reading comprehension. Also, several studies in educational psychology suggest that integrating cross-modal information will improve reading comprehension. We claim that state-of-the-art multi-modal transformers, which could be used in a language learner context to improve human reading, will perform poorly because of the short and relatively simple textual data those models are trained with. To prove our hypotheses, we collected a new multi-modal image-retrieval dataset based on data from Wikipedia. In an in-depth data analysis, we highlight the differences between our dataset and other popular datasets. Additionally, we evaluate several state-of-the-art multi-modal transformers on text-image retrieval on our dataset and analyze their meager results, which verify our claims.

2020

Eye4Ref: A Multimodal Eye Movement Dataset of Referentially Complex Situations
Özge Alacam | Eugen Ruppert | Amr Rekaby Salama | Tobias Staron | Wolfgang Menzel
Proceedings of the Twelfth Language Resources and Evaluation Conference

Eye4Ref is a rich multimodal dataset of eye-movement recordings collected from referentially complex situated settings where the linguistic utterances and their visual referential world were available to the listener. It consists of not only fixation parameters but also saccadic movement parameters that are time-locked to accompanying German utterances (with English translations). Additionally, it contains symbolic knowledge (contextual) representations of the images to map the referring expressions onto the objects in corresponding images. Overall, the data was collected from 62 participants in three different experimental setups (86 systematically controlled sentence–image pairs and 1844 eye-movement recordings). Referential complexity was controlled by visual manipulations (e.g., number of objects, visibility of the target items, etc.), and by linguistic manipulations (e.g., the position of the disambiguating word in a sentence). This multimodal dataset, in which the three different sources of information, namely eye-tracking, language, and visual environment, are aligned, offers a testbed for various research questions not only from a language perspective but also from a computer vision perspective.

2019

Enhancing Natural Language Understanding through Cross-Modal Interaction: Meaning Recovery from Acoustically Noisy Speech
Ozge Alacam
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Cross-modality between vision and language is a key component for effective and efficient communication, and the human language processing mechanism successfully integrates information from various modalities to extract the intended meaning. However, incomplete linguistic input, e.g. due to a noisy environment, is one of the challenges for successful communication. In that case, an incompleteness in one channel can be compensated by information from another one. In this paper, by conducting a visual-world paradigm experiment, we investigated the dynamics between syntactically possible gap fillers and the visual arrangements in incomplete German sentences and their effect on overall sentence interpretation.

2018

Text Completion using Context-Integrated Dependency Parsing
Amr Rekaby Salama | Özge Alaçam | Wolfgang Menzel
Proceedings of the Third Workshop on Representation Learning for NLP

Incomplete linguistic input, e.g. due to a noisy environment, is one of the challenges that a successful communication system has to deal with. In this paper, we study text completion with a data set composed of sentences with gaps where a successful completion cannot be achieved through a uni-modal (language-based) approach. We present a solution based on a context-integrating dependency parser incorporating an additional non-linguistic modality. An incompleteness in one channel is compensated by information from another one, and the parser learns the association between the two modalities from a multi-level knowledge representation. We examined several model variations by adjusting the degree of influence of different modalities in the decision making on possible filler words and their exact reference to a non-linguistic context element. Our model is able to fill the gap with 95.4% word and 95.2% exact reference accuracy; hence, successful prediction can be achieved not only on the word level (such as mug) but also with respect to the correct identification of its context reference (such as mug 2 among several mug instances).

Incorporating Contextual Information for Language-Independent, Dynamic Disambiguation Tasks
Tobias Staron | Özge Alaçam | Wolfgang Menzel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
