Evaluating Dialect Robustness of Language Models via Conversation Understanding

Dipankar Srirag, Nihar Ranjan Sahoo, Aditya Joshi


Abstract
With an ever-growing number of LLMs reporting superlative performance for English, their ability to perform equitably across different dialects of English (i.e., dialect robustness) needs to be ascertained. Specifically, we use English-language conversations (in US English or Indian English) between humans playing the word-guessing game of ‘taboo’. We formulate two evaluative tasks: target word prediction (TWP), i.e., predicting the masked target word in a conversation, and target word selection (TWS), i.e., selecting the most likely masked target word from among a set of candidate words. Extending MD3, an existing dialectal dataset of taboo-playing conversations, we introduce M-MD3, a target-word-masked version of MD3 with en-US and en-IN subsets. We create two further subsets: en-MV (where en-US is transformed to include dialectal information) and en-TR (where dialectal information is removed from en-IN). We evaluate three multilingual LLMs: one open-source (Llama3) and two closed-source (GPT-4/3.5). The LLMs perform significantly better on US English than on Indian English for both TWP and TWS, across all settings, exhibiting marginalisation of the Indian dialect of English. While the GPT-based models perform best, the comparatively smaller models work more equitably after fine-tuning. Our evaluation methodology offers a novel and reproducible way to examine attributes of language models using pre-existing dialogue datasets with language varieties. Since dialect is an artifact of one’s culture, this paper demonstrates the gap in the performance of multilingual LLMs for communities that do not use a mainstream dialect.
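To make the two task formulations concrete, below is a minimal sketch of how a taboo conversation can be masked and turned into TWP and TWS queries. The prompt wording, the query_llm stub, and the sample exchange are illustrative assumptions, not the authors’ exact implementation.

```python
# Illustrative sketch of the TWP and TWS tasks described in the abstract.
# Prompts, the query_llm() stub, and the sample conversation are hypothetical.

def mask_target_word(conversation: str, target: str, mask: str = "<MASK>") -> str:
    """Replace every occurrence of the target word with a mask token."""
    return conversation.replace(target, mask)

def twp_prompt(masked_conversation: str) -> str:
    """Target word prediction (TWP): the model must generate the masked word."""
    return (
        "The following taboo-game conversation has its target word replaced "
        "with <MASK>. Predict the masked word.\n\n" + masked_conversation
    )

def tws_prompt(masked_conversation: str, candidates: list[str]) -> str:
    """Target word selection (TWS): the model must choose from candidates."""
    return (
        "The following taboo-game conversation has its target word replaced "
        "with <MASK>. Select the most likely masked word from: "
        + ", ".join(candidates) + "\n\n" + masked_conversation
    )

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to GPT-4/3.5 or Llama3."""
    raise NotImplementedError("Plug in your model API here.")

# Invented example exchange, for illustration only:
conv = "A: It is a fruit. Very sour. B: You squeeze it on food. A: lemon!"
masked = mask_target_word(conv, "lemon")
print(twp_prompt(masked))
print(tws_prompt(masked, ["lemon", "mango", "tamarind"]))
```

Under this framing, dialect robustness can be assessed by comparing task accuracy on the en-US and en-IN subsets under otherwise identical prompts.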
Anthology ID:
2025.sumeval-2.3
Volume:
Proceedings of the Second Workshop on Scaling Up Multilingual & Multi-Cultural Evaluation
Month:
January
Year:
2025
Address:
Abu Dhabi
Venues:
SUMEval | WS
Publisher:
Association for Computational Linguistics
Pages:
24–38
URL:
https://aclanthology.org/2025.sumeval-2.3/
Cite (ACL):
Dipankar Srirag, Nihar Ranjan Sahoo, and Aditya Joshi. 2025. Evaluating Dialect Robustness of Language Models via Conversation Understanding. In Proceedings of the Second Workshop on Scaling Up Multilingual & Multi-Cultural Evaluation, pages 24–38, Abu Dhabi. Association for Computational Linguistics.
Cite (Informal):
Evaluating Dialect Robustness of Language Models via Conversation Understanding (Srirag et al., SUMEval 2025)
PDF:
https://aclanthology.org/2025.sumeval-2.3.pdf