Sandra Mitrović

Also published as: Sandra Mitrovic

2025

Better Together: Towards Localizing Fact-Related Hallucinations using Open Small Language Models
David Kletz | Sandra Mitrovic | Ljiljana Dolamic | Fabio Rinaldi
Proceedings of the 1st Workshop on Confabulation, Hallucinations and Overgeneration in Multilingual and Practical Settings (CHOMPS 2025)

In this paper, we explore the potential of Open-source Small Language Models (OSLMs) for localizing hallucinations related to factual accuracy. We first present Lucifer, a dataset designed to enable proper and consistent evaluation of LMs, composed of an automatically constructed portion and a manually curated subset intended for qualitative analysis.We then assess the performance of five OSLMs using four carefully designed prompts. Results are evaluated either individually or merged through a voting-based merging approach. While our results demonstrate that the merging method yields promising performance even with smaller models, our manually curated dataset highlights the inherent difficulty of the task, underscoring the need for further research.

pdf bib abs

The proliferation of wearable devices and sports monitoring apps has made tracking physical activity more accessible than ever. For individuals with Type 1 diabetes, regular exercise is essential for managing the condition, making personalized feedback particularly valuable. By leveraging data from physical activity sessions, NLP-generated messages can offer tailored guidance to help users optimize their workouts and make informed decisions. In this study, we assess several open-source pre-trained NLP models for this purpose. Contrary to expectations, our findings reveal that models fine-tuned on medical data or excelling in medical benchmarks do not necessarily produce high-quality messages.

pdf bib abs

Swushroomsia at SemEval-2025 Task 3: Probing LLMs’ Collective Intelligence for Multilingual Hallucination Detection
Sandra Mitrović | Joseph Cornelius | David Kletz | Ljiljana Dolamic | Fabio Rinaldi
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper introduces a system designed for SemEval-2025 Task 3: Mu-SHROOM, which focuses on detecting hallucinations in multilingual outputs generated by large language models (LLMs). Our approach leverages the collective intelligence of multiple LLMs by prompting several models with three distinct prompts to annotate hallucinations. These individual annotations are then merged to create a comprehensive probabilistic annotation. The proposed system demonstrates strong performance, achieving high accuracy in span detection and strong correlation between predicted probabilities and ground truth annotations.

2024

pdf bib abs

Hospital discharge letters are a fundamental component of patient management, as they provide the crucial information needed for patient post-hospital care. However their creation is very demanding and resource intensive, as it requires consultation of several reports documenting the patient’s journey throughout their hospital stay. Given the increasing pressures on doctor’s time, tools that can draft a reasonable discharge summary, to be then reviewed and finalized by the experts, would be welcome. In this paper we present a comparative study exploring the possibility of automatic generation of discharge summaries within the context of an hospital in an Italian-speaking region and discuss quantitative and qualitative results. Despite some shortcomings, the obtained results show that a generic generative system such as ChatGPT is capable of producing discharge summaries which are relatively close to the human generated ones, even in Italian.

pdf bib abs

Comparing panic and anxiety on a dataset collected from social media
Sandra Mitrović | Oscar William Lithgow-Serrano | Carlo Schillaci
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

The recognition of mental health’s crucial significance has led to a growing interest in utilizing social media text data in current research trends. However, there remains a significant gap in the study of panic and anxiety on these platforms, despite their high prevalence and severe impact. In this paper, we address this gap by presenting a dataset consisting of 1,930 user posts from Quora and Reddit specifically focusing on panic and anxiety. Through a combination of lexical analysis, emotion detection, and writer attitude assessment, we explore the unique characteristics of each condition. To gain deeper insights, we employ a mental health-specific transformer model and a large language model for qualitative analysis. Our findings not only contribute to the understanding digital discourse on anxiety and panic but also provide valuable resources for the broader research community. We make our dataset, methodologies, and code available to advance understanding and facilitate future studies.

pdf bib

Detecting ChatGPT-Generated Text with GZIP-KNN: A No-Training, Low-Resource Approach
Matthias Berchtold | Sandra Mitrovic | Davide Andreoletti | Daniele Puccinelli | Omran Ayoub
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)

pdf bib abs

BUST: Benchmark for the evaluation of detectors of LLM-Generated Text
Joseph Cornelius | Oscar Lithgow-Serrano | Sandra Mitrovic | Ljiljana Dolamic | Fabio Rinaldi
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We introduce BUST, a comprehensive benchmark designed to evaluate detectors of texts generated by instruction-tuned large language models (LLMs). Unlike previous benchmarks, our focus lies on evaluating the performance of detector systems, acknowledging the inevitable influence of the underlying tasks and different LLM generators. Our benchmark dataset consists of 25K texts from humans and 7 LLMs responding to instructions across 10 tasks from 3 diverse sources. Using the benchmark, we evaluated 5 detectors and found substantial performance variance across tasks. A meta-analysis of the dataset characteristics was conducted to guide the examination of detector performance. The dataset was analyzed using diverse metrics assessing linguistic features like fluency and coherence, readability scores, and writer attitudes, such as emotions, convincingness, and persuasiveness. Features impacting detector performance were investigated with surrogate models, revealing emotional content in texts enhanced some detectors, yet the most effective detector demonstrated consistent performance, irrespective of writer’s attitudes and text styles. Our approach focused on investigating relationships between the detectors’ performance and two key factors: text characteristics and LLM generators. We believe BUST will provide valuable insights into selecting detectors tailored to specific text styles and tasks and facilitate a more practical and in-depth investigation of detection systems for LLM-generated text.

pdf bib

pdf bib

NLP in support of Pharmacovigilance
Fabio Rinaldi | Lorenzo Ruinelli | Roberta Noseda | Oscar William Lithgow Serrano | Sandra Mitrovic
Proceedings of the 9th edition of the Swiss Text Analytics Conference

pdf bib

Presenting BUST - A benchmark for the evaluation of system detectors of LLM-Generated Text
Joseph Cornelius | Oscar William Lithgow Serrano | Sandra Mitrović | Ljiljana Dolamic | Fabio Rinaldi
Proceedings of the 9th edition of the Swiss Text Analytics Conference

pdf bib

What can we discover about panic and anxiety from bloggers in Quora and Reddit?
Sandra Mitrović | Oscar William Lithgow Serrano
Proceedings of the 9th edition of the Swiss Text Analytics Conference

2020

pdf bib abs

SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces
Vani Kanjirangat | Sandra Mitrovic | Alessandro Antonucci | Fabio Rinaldi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Lexical semantic change detection (also known as semantic shift tracing) is a task of identifying words that have changed their meaning over time. Unsupervised semantic shift tracing, focal point of SemEval2020, is particularly challenging. Given the unsupervised setup, in this work, we propose to identify clusters among different occurrences of each target word, considering these as representatives of different word meanings. As such, disagreements in obtained clusters naturally allow to quantify the level of semantic shift per each target word in four target languages. To leverage this idea, clustering is performed on contextualized (BERT-based) embeddings of word occurrences. The obtained results show that our approach performs well both measured separately (per language) and overall, where we surpass all provided SemEval baselines.