Miriam Schirmer

2025

Detecting Child Objectification on Social Media: Challenges in Language Modeling
Miriam Schirmer | Angelina Voggenreiter | Juergen Pfeffer | Agnes Horvat
Proceedings of the 9th Workshop on Online Abuse and Harms (WOAH)

Online objectification of children can harm their self-image and influence how others perceive them. Objectifying comments may start with a focus on appearance but also include language that treats children as passive, decorative, or lacking agency. On TikTok, algorithm-driven visibility amplifies this focus on looks. Drawing on objectification theory, we introduce a Child Objectification Language Typology to automatically classify objectifying comments. Our dataset consists of 562,508 comments from 9,090 videos across 482 TikTok accounts. We compare language models of different complexity, including an n-gram-based model, RoBERTa, GPT-4, LlaMA, and Mistral. On our training dataset of 6,000 manually labeled comments, we found that RoBERTa performed best overall in detecting appearance- and objectification-related language. 10.35% of comments contained appearance-related language, while 2.90% included objectifying language. Videos with school-aged girls received more appearance-related comments compared to boys in that age group, while videos with toddlers show a slight increase in objectification-related comments compared to other age groups. Neither gender alone nor engagement metrics showed significant effects. The findings raise concerns about children’s digital exposure, emphasizing the need for stricter policies to protect minors.

Email is a vital conduit for human communication across businesses, organizations, and broader societal contexts. In this study, we aim to model the intents, expectations, and responsiveness in email exchanges. To this end, we release SIZZLER, a new dataset containing 1800 emails annotated with nuanced types of intents and expectations. We benchmark models ranging from feature-based logistic regression to zero-shot prompting of large language models. Leveraging the predictive model for intent, expectations, and 14 other features, we analyze 11.3M emails from GMANE to study how linguistic and social factors influence the conversational dynamics in email exchanges. Through our causal analysis, we find that the email response rates are influenced by social status, argumentation, and in certain limited contexts, the strength of social connection.

2024

GENTRAC: A Tool for Tracing Trauma in Genocide and Mass Atrocity Court Transcripts
Miriam Schirmer | Christian Brechenmacher | Endrit Jashari | Juergen Pfeffer
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper introduces GENTRAC, an open-access web-based tool built to interactively detect and analyze potentially traumatic content in witness statements of genocide and mass atrocity trials. Harnessing recent developments in natural language processing (NLP) to detect trauma, GENTRAC processes and formats court transcripts for NLP analysis through a sophisticated parsing algorithm and detects the likelihood of traumatic content for each speaker segment. The tool visualizes the density of such content throughout a trial day and provides statistics on the overall amount of traumatic content and speaker distribution. Capable of processing transcripts from four prominent international criminal courts, including the International Criminal Court (ICC), GENTRAC’s reach is vast, tailored to handle millions of pages of documents from past and future trials. Detecting potentially re-traumatizing examination methods can enhance the development of trauma-informed legal procedures. GENTRAC also serves as a reliable resource for legal, human rights, and other professionals, aiding their comprehension of mass atrocities’ emotional toll on survivors.

The Language of Trauma: Modeling Traumatic Event Descriptions Across Domains with Explainable AI
Miriam Schirmer | Tobias Leemann | Gjergji Kasneci | Jürgen Pfeffer | David Jurgens
Findings of the Association for Computational Linguistics: EMNLP 2024

Psychological trauma can manifest following various distressing events and is captured in diverse online contexts. However, studies traditionally focus on a single aspect of trauma, often neglecting the transferability of findings across different scenarios. We address this gap by training various language models with progressing complexity on trauma-related datasets, including genocide-related court data, a Reddit dataset on post-traumatic stress disorder (PTSD), counseling conversations, and Incel forum posts. Our results show that the fine-tuned RoBERTa model excels in predicting traumatic events across domains, slightly outperforming large language models like GPT-4. Additionally, SLALOM-feature scores and conceptual explanations effectively differentiate and cluster trauma-related language, highlighting different trauma aspects and identifying sexual abuse and experiences related to death as common traumatic events across all datasets. This transferability is crucial as it allows for the development of tools to enhance trauma detection and intervention in diverse populations and settings.

2022

A New Dataset for Topic-Based Paragraph Classification in Genocide-Related Court Transcripts
Miriam Schirmer | Udo Kruschwitz | Gregor Donabauer
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Recent progress in natural language processing has been impressive in many different areas, with transformer-based approaches setting new benchmarks for a wide range of applications. This development has also lowered the barriers for people outside the NLP community to tap into the tools and resources applied to a variety of domain-specific applications. The bottleneck, however, remains the lack of annotated gold-standard collections as soon as one’s research or professional interest falls outside the scope of what is readily available. One such area is genocide-related research (also including the work of experts who have a professional interest in accessing, exploring and searching large-scale document collections on the topic, such as lawyers). We present GTC (Genocide Transcript Corpus), the first annotated corpus of genocide-related court transcripts, which serves three purposes: (1) to provide a first reference corpus for the community, (2) to establish benchmark performances (using state-of-the-art transformer-based approaches) for the new classification task of paragraph identification of violence-related witness statements, and (3) to explore first steps towards transfer learning within the domain. We consider our contribution to be addressing in particular this year’s hot topic on Language Technology for All.
