@inproceedings{yazdani-etal-2026-comprehensive,
    title = "A Comprehensive Evaluation of Chain-of-Thought Faithfulness in {Persian} Classification Tasks",
    author = "Yazdani, Shakib and
      Espa{\~n}a-Bonet, Cristina and
      Avramidis, Eleftherios and
      Hamidullah, Yasser and
      van Genabith, Josef",
    editor = "Hettiarachchi, Hansi and
      Ranasinghe, Tharindu and
      Plum, Alistair and
      Rayson, Paul and
      Mitkov, Ruslan and
      Gaber, Mohamed and
      Premasiri, Damith and
      Tan, Fiona Anting and
      Uyangodage, Lasitha",
    booktitle = "Proceedings of the Second Workshop on Language Models for Low-Resource Languages ({LoResLM} 2026)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.loreslm-1.27/",
    pages = "311--323",
    isbn = "979-8-89176-377-7",
    abstract = "Large language models (LLMs) have shown remarkable performance when prompted to reason step by step, commonly referred to as chain-of-thought (CoT) reasoning. While prior work has proposed mechanism-level approaches to evaluate CoT faithfulness, these studies have primarily focused on English, leaving low-resource languages such as Persian largely underexplored. In this paper, we present the first comprehensive study of CoT faithfulness in Persian. Our analysis spans 15 classification datasets and 6 language models across three classes (small, large, and reasoning models) evaluated under both English and Persian prompting conditions. We first assess model performance on each dataset while collecting the corresponding CoT traces and final predictions. We then evaluate the faithfulness of these CoT traces using an LLM-as-a-judge approach, followed by a human evaluation to measure agreement between the LLM-based judge and human annotator. Our results reveal substantial variation in CoT faithfulness across tasks, datasets, and model classes. In particular, faithfulness is strongly influenced by the dataset and the language model class, while the language used for prompting has a comparatively smaller effect. Notably, small language models exhibit lower or comparable faithfulness scores than large language models and reasoning models."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="yazdani-etal-2026-comprehensive">
<titleInfo>
<title>A Comprehensive Evaluation of Chain-of-Thought Faithfulness in Persian Classification Tasks</title>
</titleInfo>
<name type="personal">
<namePart type="given">Shakib</namePart>
<namePart type="family">Yazdani</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Cristina</namePart>
<namePart type="family">España-Bonet</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Eleftherios</namePart>
<namePart type="family">Avramidis</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yasser</namePart>
<namePart type="family">Hamidullah</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Josef</namePart>
<namePart type="family">van Genabith</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2026-03</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Hansi</namePart>
<namePart type="family">Hettiarachchi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tharindu</namePart>
<namePart type="family">Ranasinghe</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alistair</namePart>
<namePart type="family">Plum</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Paul</namePart>
<namePart type="family">Rayson</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ruslan</namePart>
<namePart type="family">Mitkov</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohamed</namePart>
<namePart type="family">Gaber</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Damith</namePart>
<namePart type="family">Premasiri</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Fiona</namePart>
<namePart type="given">Anting</namePart>
<namePart type="family">Tan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lasitha</namePart>
<namePart type="family">Uyangodage</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Rabat, Morocco</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-377-7</identifier>
</relatedItem>
<abstract>Large language models (LLMs) have shown remarkable performance when prompted to reason step by step, commonly referred to as chain-of-thought (CoT) reasoning. While prior work has proposed mechanism-level approaches to evaluate CoT faithfulness, these studies have primarily focused on English, leaving low-resource languages such as Persian largely underexplored. In this paper, we present the first comprehensive study of CoT faithfulness in Persian. Our analysis spans 15 classification datasets and 6 language models across three classes (small, large, and reasoning models) evaluated under both English and Persian prompting conditions. We first assess model performance on each dataset while collecting the corresponding CoT traces and final predictions. We then evaluate the faithfulness of these CoT traces using an LLM-as-a-judge approach, followed by a human evaluation to measure agreement between the LLM-based judge and human annotator. Our results reveal substantial variation in CoT faithfulness across tasks, datasets, and model classes. In particular, faithfulness is strongly influenced by the dataset and the language model class, while the language used for prompting has a comparatively smaller effect. Notably, small language models exhibit lower or comparable faithfulness scores than large language models and reasoning models.</abstract>
<identifier type="citekey">yazdani-etal-2026-comprehensive</identifier>
<location>
<url>https://aclanthology.org/2026.loreslm-1.27/</url>
</location>
<part>
<date>2026-03</date>
<extent unit="page">
<start>311</start>
<end>323</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T A Comprehensive Evaluation of Chain-of-Thought Faithfulness in Persian Classification Tasks
%A Yazdani, Shakib
%A España-Bonet, Cristina
%A Avramidis, Eleftherios
%A Hamidullah, Yasser
%A van Genabith, Josef
%Y Hettiarachchi, Hansi
%Y Ranasinghe, Tharindu
%Y Plum, Alistair
%Y Rayson, Paul
%Y Mitkov, Ruslan
%Y Gaber, Mohamed
%Y Premasiri, Damith
%Y Tan, Fiona Anting
%Y Uyangodage, Lasitha
%S Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
%D 2026
%8 March
%I Association for Computational Linguistics
%C Rabat, Morocco
%@ 979-8-89176-377-7
%F yazdani-etal-2026-comprehensive
%X Large language models (LLMs) have shown remarkable performance when prompted to reason step by step, commonly referred to as chain-of-thought (CoT) reasoning. While prior work has proposed mechanism-level approaches to evaluate CoT faithfulness, these studies have primarily focused on English, leaving low-resource languages such as Persian largely underexplored. In this paper, we present the first comprehensive study of CoT faithfulness in Persian. Our analysis spans 15 classification datasets and 6 language models across three classes (small, large, and reasoning models) evaluated under both English and Persian prompting conditions. We first assess model performance on each dataset while collecting the corresponding CoT traces and final predictions. We then evaluate the faithfulness of these CoT traces using an LLM-as-a-judge approach, followed by a human evaluation to measure agreement between the LLM-based judge and human annotator. Our results reveal substantial variation in CoT faithfulness across tasks, datasets, and model classes. In particular, faithfulness is strongly influenced by the dataset and the language model class, while the language used for prompting has a comparatively smaller effect. Notably, small language models exhibit lower or comparable faithfulness scores than large language models and reasoning models.
%U https://aclanthology.org/2026.loreslm-1.27/
%P 311-323
Markdown (Informal)
[A Comprehensive Evaluation of Chain-of-Thought Faithfulness in Persian Classification Tasks](https://aclanthology.org/2026.loreslm-1.27/) (Yazdani et al., LoResLM 2026)
ACL