Large Language Models with Reinforcement Learning from Human Feedback Approach for Enhancing Explainable Sexism Detection

Ali Riahi Samani, Tianhao Wang, Kangshuo Li, Feng Chen
Abstract
Recent advancements in natural language processing, driven by Large Language Models (LLMs), have significantly improved text comprehension, enabling these models to handle complex tasks with greater efficiency. A key feature of LLMs is their ability to engage in contextual learning, which allows them to understand and apply instructions given in natural language to new scenarios without requiring additional training. This capability is particularly valuable in social media, where LLMs can play a crucial role in addressing challenges in explainable sexism detection. We hypothesize that by leveraging their contextual learning capabilities, LLMs can provide clear, explainable insights into why certain content is flagged as problematic, thus enhancing transparency in the sexism detection process. To this end, we propose a Reinforcement Learning from Human Feedback (RLHF) based fine-tuning framework for sexism detection. We studied two well-known LLMs, Mistral-7B and LLaMA-3-8B, in zero-shot, supervised fine-tuning, and RLHF scenarios to demonstrate the superior ability of LLMs in sexism detection. The experimental results reported in this work, based on the three tasks of the Explainable Detection of Online Sexism (EDOS) dataset, highlight the importance of RLHF for building explainable systems for online discourse. Furthermore, we found that the LLaMA-3-8B model achieves the best results with the RLHF approach, scoring 0.8681 on the Task A (binary sexism detection), 0.6829 on the Task B (category classification of sexism), and 0.4722 on the Task C (fine-grained sexism vectors) test sets.
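The abstract does not include code; as a rough illustration of the RLHF scenario it describes, the following is a minimal sketch of a single PPO update using Hugging Face TRL's older PPOTrainer API (as in TRL ≤ 0.11). The checkpoint name, prompt wording, and hard-coded reward are illustrative placeholders, not the authors' actual setup; in a real pipeline the reward would come from a preference model trained on human judgments of the label and its explanation.

```python
# Minimal sketch of one RLHF (PPO) step for explainable sexism detection,
# assuming the TRL <= 0.11 PPOTrainer API. Checkpoint, prompt, and reward
# are illustrative placeholders, not the authors' configuration.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed stand-in for the paper's Mistral-7B
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)       # policy with value head
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

# EDOS Task A style query: label the post and justify the label.
query = ("Is the following post sexist? Answer 'sexist' or 'not sexist' "
         "and explain why: <post text>")
query_tensor = tokenizer.encode(query, return_tensors="pt").to(model.pretrained_model.device)

generation_kwargs = {"do_sample": True, "top_p": 0.9, "max_new_tokens": 64,
                     "pad_token_id": tokenizer.eos_token_id}
response_tensors = ppo_trainer.generate(list(query_tensor), return_prompt=False, **generation_kwargs)

# Placeholder scalar reward; in an actual RLHF run this would be produced by
# a reward model trained on human preference data over label + explanation pairs.
rewards = [torch.tensor(1.0, device=model.pretrained_model.device)]
stats = ppo_trainer.step(list(query_tensor), response_tensors, rewards)
```

In practice the PPO policy would typically be initialized from a supervised fine-tuned checkpoint (the abstract's SFT scenario) and this step would be looped over the EDOS training data for each task.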
Anthology ID:
2025.coling-main.416
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
6230–6243
URL:
https://aclanthology.org/2025.coling-main.416/
Cite (ACL):
Ali Riahi Samani, Tianhao Wang, Kangshuo Li, and Feng Chen. 2025. Large Language Models with Reinforcement Learning from Human Feedback Approach for Enhancing Explainable Sexism Detection. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6230–6243, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Large Language Models with Reinforcement Learning from Human Feedback Approach for Enhancing Explainable Sexism Detection (Riahi Samani et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.416.pdf