A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

Stephanie Brandl; Oliver Eberle

Abstract

Instruction-tuned LLMs are able to provide *an* explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a *good* explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.

Anthology ID: 2026.trustnlp-main.44
Volume: Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Month: July
Year: 2026
Address: San Diego, California
Editors: Kai-Wei Chang, Ninareh Mehrabi, Satyapriya Krishna, Anubrata Das, Jwala Dhamala, Yang Trista Cao, Tharindu Kumarage, Anil Ramakrishna, Christos Christodoulopoulos, Yixin Wan, Aram Galystan, Anoop Kumar, Rahul Gupta
Venues: TrustNLP | WS
Publisher: Association for Computational Linguistics
Pages: 563–583
URL: https://aclanthology.org/2026.trustnlp-main.44/
Cite (ACL): Stephanie Brandl and Oliver Eberle. 2026. A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification. In Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026), pages 563–583, San Diego, California. Association for Computational Linguistics.
Cite (Informal): A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification (Brandl & Eberle, TrustNLP 2026)
PDF: https://aclanthology.org/2026.trustnlp-main.44.pdf
