Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

Fangru Lin; Shaoguang Mao; Emanuele La Malfa; Valentin Hofmann; Adrian de Wynter; Xun Wang; Si-Qing Chen; Michael J. Wooldridge; Janet Pierrehumbert; Furu Wei

doi:10.18653/v1/2025.acl-long.317

Abstract

Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook the nuances of within-language variation and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects across canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning. We introduce **ReDial** (**Re**asoning with **Dial**ect Queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including the GPT, Claude, Llama, Mistral, and Phi model families. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for future research.

Anthology ID: 2025.acl-long.317
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 6317–6342
URL: https://aclanthology.org/2025.acl-long.317/
DOI: 10.18653/v1/2025.acl-long.317
Cite (ACL): Fangru Lin, Shaoguang Mao, Emanuele La Malfa, Valentin Hofmann, Adrian de Wynter, Xun Wang, Si-Qing Chen, Michael J. Wooldridge, Janet B. Pierrehumbert, and Furu Wei. 2025. Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6317–6342, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks (Lin et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.317.pdf
