Anjali Kantharuban

2026

Arabic, often considered a single language, actually covers a wide range of sometimes mutually unintelligible language varieties. While large language models (LLMs) have revolutionized natural language processing (NLP) with rapid advances, these models still best serve speakers of high-resource and standard language varieties. One particular deficiency is their handling of dialectal Arabic. We present the first-ever shared task for dialectal Arabic language modeling: Arabic Modeling In Your Accent, or AMIYA. The goal of the shared task was to develop LLMs that could (1) respond in the correct dialectal variety when explicitly or implicitly prompted to, (2) translate between dialectal Arabic and standard Arabic or English, (3) adhere to LLM instructions in dialectal Arabic, and (4) produce fluent Arabic outputs. We called for submissions in the dialectal varieties of five countries: Morocco, Egypt, Palestine, Syria, and Saudi Arabia. We received 45 submitted systems from six participating teams. We saw positive results from supervised fine-tuning on a translation objective and from reinforcement learning to improve dialectness. Manual evaluation also showed that some systems had learned to output dialectal words or phrases, but at the expense of actual fluency or coherence. Overall, the most effective system involved continual pre-training and supervised fine-tuning of 12 candidate LLMs, followed by selection of the best-performing models.

2025

Stereotype or Personalization? User Identity Biases Chatbot Recommendations
Anjali Kantharuban | Jeremiah Milbauer | Maarten Sap | Emma Strubell | Graham Neubig
Findings of the Association for Computational Linguistics: ACL 2025

While personalized recommendations are often desired by users, it can be difficult in practice to distinguish cases of bias from cases of personalization: we find that models generate racially stereotypical recommendations regardless of whether the user revealed their identity intentionally through explicit indications or unintentionally through implicit cues. We demonstrate that when people use large language models (LLMs) to generate recommendations, the LLMs produce responses that reflect both what the user wants and who the user is. We argue that chatbots ought to transparently indicate when recommendations are influenced by a user’s revealed identity characteristics, but observe that they currently fail to do so. Our experiments show that even though a user’s revealed identity significantly influences model recommendations (p < 0.001), model responses obfuscate this fact in response to user queries. This bias and lack of transparency occur consistently across multiple popular consumer LLMs and for four American racial groups.

2023

Quantifying the Dialect Gap and its Correlates Across Languages
Anjali Kantharuban | Ivan Vulić | Anna Korhonen
Findings of the Association for Computational Linguistics: EMNLP 2023

Historically, researchers and consumers have noticed a decrease in quality when applying NLP tools to minority variants of languages (e.g., Puerto Rican Spanish or Swiss German), but studies exploring this have been limited to a select few languages. Additionally, past studies have mainly been conducted in a monolingual context, so cross-linguistic trends have not been identified and tied to external factors. In this work, we conduct a comprehensive evaluation of the most influential, state-of-the-art large language models (LLMs) across two high-use applications, machine translation and automatic speech recognition, to assess their functionality on the regional dialects of several high- and low-resource languages. Additionally, we analyze how the regional dialect gap is correlated with economic, social, and linguistic factors. The impact of training data, including related factors like dataset size and its construction procedure, is shown to be significant but not consistent across models or languages, meaning a one-size-fits-all approach cannot be taken in solving the dialect gap. This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.

Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko’s (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results—through the lens of morphology—cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.
