LLM Sensitivity Challenges in Abusive Language Detection: Instruction-Tuned vs. Human Feedback

Yaqi Zhang, Viktor Hangya, Alexander Fraser


Abstract
The capacity of large language models (LLMs) to understand and distinguish socially unacceptable texts enables them to play a promising role in abusive language detection. However, various factors can affect their sensitivity. In this work, we test whether LLMs exhibit unintended bias in abusive language detection, i.e., whether they predict a given abusive class more or less often than expected in zero-shot settings. Our results show that instruction-tuned LLMs tend to under-predict positive classes, since the datasets used for tuning are dominated by the negative class. In contrast, models fine-tuned with human feedback tend to be overly sensitive. In an exploratory approach to mitigating these issues, we show that stating the label frequency in the prompt helps reduce the significant over-prediction.
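To make the prompt-based mitigation concrete, the following is a minimal Python sketch of a zero-shot abuse-detection prompt that optionally adds a label frequency hint, in the spirit of the abstract. The prompt wording, the function name build_prompt, and the 10% positive rate are illustrative assumptions, not the authors' actual prompts or data statistics.

```python
def build_prompt(text: str, positive_rate: float | None = None) -> str:
    """Build a zero-shot classification prompt; optionally state the
    expected frequency of the positive (abusive) class."""
    prompt = (
        "Classify the following text as 'abusive' or 'not abusive'.\n"
        f"Text: {text}\n"
    )
    if positive_rate is not None:
        # Assumed wording: tell the model how often the positive class
        # is expected to occur, to counter over-sensitive predictions.
        prompt += (
            f"Note: roughly {positive_rate:.0%} of texts in this dataset "
            "are abusive.\n"
        )
    prompt += "Answer with a single label."
    return prompt


if __name__ == "__main__":
    example = "You are all idiots and should be banned."
    print(build_prompt(example))                      # plain zero-shot prompt
    print(build_prompt(example, positive_rate=0.10))  # with a frequency hint
```

The idea, under these assumptions, is that a model prone to over-predicting the abusive class can calibrate its answers against the stated class frequency rather than its own skewed prior.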
Anthology ID:
2025.coling-main.188
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
2765–2780
URL:
https://aclanthology.org/2025.coling-main.188/
Cite (ACL):
Yaqi Zhang, Viktor Hangya, and Alexander Fraser. 2025. LLM Sensitivity Challenges in Abusive Language Detection: Instruction-Tuned vs. Human Feedback. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2765–2780, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
LLM Sensitivity Challenges in Abusive Language Detection: Instruction-Tuned vs. Human Feedback (Zhang et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.188.pdf