Are you sure? Measuring models bias in content moderation through uncertainty

Alessandra Urbinati, Mirko Lai, Simona Frenda, Marco Stranisci


Abstract
Automatic content moderation is crucial to ensuring safety on social media. Language Model-based classifiers are increasingly adopted for this task, but they have been shown to perpetuate racial and social biases. Although several resources and benchmark corpora have been developed to address this issue, measuring the fairness of models in content moderation remains an open problem. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models (LMs and LLMs) against women and non-white annotators, and we observe to what extent it diverges from performance-based metrics such as the F1 score. The results show that some pre-trained models predict the labels provided by minority groups with high accuracy, even though their confidence in these predictions is low. By measuring the confidence of models, we can therefore see which groups of annotators are better represented in pre-trained models and guide the debiasing process of these models before they are deployed.
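
The uncertainty measure referenced in the abstract is obtained through conformal prediction. The following is a minimal, illustrative sketch of split conformal prediction for a softmax classifier, where the size of the prediction set serves as an uncertainty proxy; it is not taken from the paper's code, and all names, example values, and the miscoverage level alpha are assumptions made here for illustration.

# Illustrative sketch of split conformal prediction for a text classifier.
# Prediction-set size is used as an uncertainty proxy; all names and values
# are assumptions for illustration, not the paper's implementation.
import numpy as np

def calibrate(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float = 0.1) -> float:
    """Compute the conformal threshold from a held-out calibration set.

    cal_probs: (n, k) softmax probabilities; cal_labels: (n,) true label ids.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return float(np.quantile(scores, q_level, method="higher"))

def prediction_sets(test_probs: np.ndarray, qhat: float) -> list:
    """For each example, return the labels whose nonconformity score is within the threshold."""
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

# Hypothetical usage: larger prediction sets indicate higher uncertainty;
# comparing average set sizes across annotator groups is one way to probe bias.
cal_probs = np.array([[0.7, 0.3], [0.2, 0.8], [0.6, 0.4], [0.9, 0.1]])
cal_labels = np.array([0, 1, 0, 0])
qhat = calibrate(cal_probs, cal_labels, alpha=0.2)
sets = prediction_sets(np.array([[0.55, 0.45], [0.95, 0.05]]), qhat)
avg_set_size = np.mean([len(s) for s in sets])

In this sketch, a group whose messages yield larger average prediction sets is one the model is less certain about, even when its point predictions (and hence F1) remain high.
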
Anthology ID:
2025.findings-emnlp.980
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
18061–18076
URL:
https://aclanthology.org/2025.findings-emnlp.980/
Cite (ACL):
Alessandra Urbinati, Mirko Lai, Simona Frenda, and Marco Stranisci. 2025. Are you sure? Measuring models bias in content moderation through uncertainty. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 18061–18076, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Are you sure? Measuring models bias in content moderation through uncertainty (Urbinati et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.980.pdf
Checklist:
 2025.findings-emnlp.980.checklist.pdf