@inproceedings{vecchi-etal-2026-harm,
title = "{HARM}: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content",
author = "Vecchi, Lorenzo Puppi and
Jr., Alceu De Souza Britto and
Paraiso, Emerson Cabrera and
M. O. Cruz, Rafael",
editor = "Demberg, Vera and
Inui, Kentaro and
Marquez, Llu{\'i}s",
booktitle = "Findings of the {A}ssociation for {C}omputational {L}inguistics: {EACL} 2026",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.findings-eacl.230/",
pages = "4393--4431",
ISBN = "979-8-89176-386-9",
abstract = "Explaining why content is hateful using natural language is crucial for fostering transparency in automated content moderation systems. However, evaluating the quality of such explanations remains an open challenge. General-purpose reward models (RMs), commonly used for scoring natural language outputs, are typically optimized for broad notions of safety. We argue that this optimization penalizes situations where references to stereotypes or offensive content are essential for explanations with higher explanatory fidelity. To address this gap, we introduce SBIC-Explain, a human-validated dataset of 370,788 LLM generated NLEs for offensive content, spanning three levels of human-annotated contextual richness: Tier 1: text-only, Tier 2: + classification-aware, and Tier 3: + semantics-informed. We hypothesize that as human-annotated context increases, explanations should lead to higher perceived explanations with higher explanatory fidelity. Yet, we find that existing RMs systematically assign lower scores to more contextually rich (and often more offensive) explanations, revealing a misalignment between model preferences and explanatory fidelity for this context. We propose HARM (Hate-Aware Reward Model), a RM that integrates interpretable signals to better align reward scores with the needs of hate speech explanation. HARM outperforms general-purpose baselines, improving NLE pair-wise preference. Available at: https://github.com/Lorenzo815/HARM."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="vecchi-etal-2026-harm">
<titleInfo>
<title>HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content</title>
</titleInfo>
<name type="personal">
<namePart type="given">Lorenzo</namePart>
<namePart type="given">Puppi</namePart>
<namePart type="family">Vecchi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alceu</namePart>
<namePart type="given">De</namePart>
<namePart type="given">Souza</namePart>
<namePart type="given">Britto</namePart>
<namePart type="family">Jr.</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Emerson</namePart>
<namePart type="given">Cabrera</namePart>
<namePart type="family">Paraiso</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rafael</namePart>
<namePart type="family">M. O. Cruz</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2026-03</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: EACL 2026</title>
</titleInfo>
<name type="personal">
<namePart type="given">Vera</namePart>
<namePart type="family">Demberg</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kentaro</namePart>
<namePart type="family">Inui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lluís</namePart>
<namePart type="family">Marquez</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Rabat, Morocco</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-386-9</identifier>
</relatedItem>
<abstract>Explaining why content is hateful using natural language is crucial for fostering transparency in automated content moderation systems. However, evaluating the quality of such explanations remains an open challenge. General-purpose reward models (RMs), commonly used for scoring natural language outputs, are typically optimized for broad notions of safety. We argue that this optimization penalizes situations where references to stereotypes or offensive content are essential for explanations with higher explanatory fidelity. To address this gap, we introduce SBIC-Explain, a human-validated dataset of 370,788 LLM generated NLEs for offensive content, spanning three levels of human-annotated contextual richness: Tier 1: text-only, Tier 2: + classification-aware, and Tier 3: + semantics-informed. We hypothesize that as human-annotated context increases, explanations should lead to higher perceived explanations with higher explanatory fidelity. Yet, we find that existing RMs systematically assign lower scores to more contextually rich (and often more offensive) explanations, revealing a misalignment between model preferences and explanatory fidelity for this context. We propose HARM (Hate-Aware Reward Model), a RM that integrates interpretable signals to better align reward scores with the needs of hate speech explanation. HARM outperforms general-purpose baselines, improving NLE pair-wise preference. Available at: https://github.com/Lorenzo815/HARM.</abstract>
<identifier type="citekey">vecchi-etal-2026-harm</identifier>
<location>
<url>https://aclanthology.org/2026.findings-eacl.230/</url>
</location>
<part>
<date>2026-03</date>
<extent unit="page">
<start>4393</start>
<end>4431</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content
%A Vecchi, Lorenzo Puppi
%A Jr., Alceu De Souza Britto
%A Paraiso, Emerson Cabrera
%A M. O. Cruz, Rafael
%Y Demberg, Vera
%Y Inui, Kentaro
%Y Marquez, Lluís
%S Findings of the Association for Computational Linguistics: EACL 2026
%D 2026
%8 March
%I Association for Computational Linguistics
%C Rabat, Morocco
%@ 979-8-89176-386-9
%F vecchi-etal-2026-harm
%X Explaining why content is hateful using natural language is crucial for fostering transparency in automated content moderation systems. However, evaluating the quality of such explanations remains an open challenge. General-purpose reward models (RMs), commonly used for scoring natural language outputs, are typically optimized for broad notions of safety. We argue that this optimization penalizes situations where references to stereotypes or offensive content are essential for explanations with higher explanatory fidelity. To address this gap, we introduce SBIC-Explain, a human-validated dataset of 370,788 LLM generated NLEs for offensive content, spanning three levels of human-annotated contextual richness: Tier 1: text-only, Tier 2: + classification-aware, and Tier 3: + semantics-informed. We hypothesize that as human-annotated context increases, explanations should lead to higher perceived explanations with higher explanatory fidelity. Yet, we find that existing RMs systematically assign lower scores to more contextually rich (and often more offensive) explanations, revealing a misalignment between model preferences and explanatory fidelity for this context. We propose HARM (Hate-Aware Reward Model), a RM that integrates interpretable signals to better align reward scores with the needs of hate speech explanation. HARM outperforms general-purpose baselines, improving NLE pair-wise preference. Available at: https://github.com/Lorenzo815/HARM.
%U https://aclanthology.org/2026.findings-eacl.230/
%P 4393-4431
Markdown (Informal)
[HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content](https://aclanthology.org/2026.findings-eacl.230/) (Vecchi et al., Findings 2026)
ACL