ChatGPT Rates Natural Language Explanation Quality like Humans: But on Which Scales?

Fan Huang, Haewoon Kwak, Kunwoo Park, Jisun An


Abstract
As AI becomes more integral in our lives, the need for transparency and responsibility grows. While natural language explanations (NLEs) are vital for clarifying the reasoning behind AI decisions, evaluating them through human judgments is complex and resource-intensive due to subjectivity and the need for fine-grained ratings. This study explores the alignment between ChatGPT and human assessments across multiple rating scales (i.e., binary, ternary, and 7-point Likert). We sample 300 data instances from three NLE datasets and collect 900 human annotations of informativeness and clarity scores as text quality measures. We further conduct paired-comparison experiments under different ranges of subjectivity scores, where the baseline comes from 8,346 human annotations. Our results show that ChatGPT aligns better with humans on coarser-grained scales. Also, paired comparisons and dynamic prompting (i.e., providing semantically similar examples in the prompt) improve the alignment. This research advances our understanding of large language models' capability to assess text explanation quality under different configurations for responsible AI development.
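The dynamic prompting setup mentioned above (retrieving semantically similar, already-rated explanations and presenting them as demonstrations before asking ChatGPT to score a new one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `query_chatgpt` is a hypothetical wrapper around a chat-completion API, and TF-IDF cosine similarity stands in for whatever semantic retriever the paper actually uses.

```python
# Sketch of dynamic prompting for NLE quality rating.
# Assumptions: `query_chatgpt` is a hypothetical LLM call; TF-IDF cosine
# similarity is a stand-in for the paper's semantic-similarity retriever.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_similar(target_nle, rated_pool, k=3):
    """Return the k human-rated examples most similar to the target NLE."""
    texts = [ex["nle"] for ex in rated_pool] + [target_nle]
    tfidf = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(tfidf[-1], tfidf[:-1]).ravel()
    top = sims.argsort()[::-1][:k]
    return [rated_pool[i] for i in top]

def build_prompt(target_nle, examples, scale="7-point Likert"):
    """Compose a rating prompt that includes the retrieved demonstrations."""
    lines = [f"Rate the informativeness of each explanation on a {scale} scale."]
    for ex in examples:
        lines.append(f"Explanation: {ex['nle']}\nRating: {ex['rating']}")
    lines.append(f"Explanation: {target_nle}\nRating:")
    return "\n\n".join(lines)

# Hypothetical usage, assuming `rated_pool` holds human-annotated examples:
# demos = retrieve_similar(new_nle, rated_pool)
# score = query_chatgpt(build_prompt(new_nle, demos))  # hypothetical LLM call
```

Swapping the scale argument (e.g., "binary" or "ternary") reproduces the coarser-grained settings the paper compares against the 7-point Likert condition.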
Anthology ID:
2024.lrec-main.277
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Note:
Pages:
3111–3132
URL:
https://aclanthology.org/2024.lrec-main.277
Cite (ACL):
Fan Huang, Haewoon Kwak, Kunwoo Park, and Jisun An. 2024. ChatGPT Rates Natural Language Explanation Quality like Humans: But on Which Scales?. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3111–3132, Torino, Italia. ELRA and ICCL.
Cite (Informal):
ChatGPT Rates Natural Language Explanation Quality like Humans: But on Which Scales? (Huang et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.277.pdf