A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

Stephanie Brandl, Oliver Eberle


Abstract
Instruction-tuned LLMs can provide *an* explanation of their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a *good* explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales depends strongly on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.
Anthology ID:
2026.trustnlp-main.44
Volume:
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month:
July
Year:
2026
Address:
San Diego, California
Editors:
Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galystan, Anoop Kumar, Rahul Gupta
Venues:
TrustNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
563–583
URL:
https://aclanthology.org/2026.trustnlp-main.44/
Cite (ACL):
Stephanie Brandl and Oliver Eberle. 2026. A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 563–583, San Diego, California. Association for Computational Linguistics.
Cite (Informal):
A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification (Brandl & Eberle, TrustNLP 2026)
PDF:
https://aclanthology.org/2026.trustnlp-main.44.pdf