AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark

Abhay Gupta, Ece Yurtseven, Philip Meng, Kevin Zhu


Abstract
Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models.
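The abstract describes replacing VALUE's deterministic transformations with LLM-based translation using few-shot prompting. A minimal sketch of how such a prompt might be assembled is shown below; the example sentence pairs and function name are illustrative assumptions, not taken from the paper or the AAVENUE benchmark:

```python
# Illustrative sketch only -- not the authors' implementation.
# Builds a few-shot prompt asking an LLM to translate Standard American
# English (SAE) into AAVE, the general approach the abstract describes.
# The example pairs here are hypothetical placeholders, not AAVENUE data.

FEW_SHOT_PAIRS = [
    ("She is going to the store.", "She finna go to the store."),
    ("They are not here yet.", "They ain't here yet."),
]

def build_translation_prompt(sae_sentence: str) -> str:
    """Assemble a few-shot SAE-to-AAVE translation prompt."""
    lines = [
        "Translate the following Standard American English (SAE) "
        "sentences into African American Vernacular English (AAVE)."
    ]
    # Prepend the demonstration pairs so the model can infer the task.
    for sae, aave in FEW_SHOT_PAIRS:
        lines.append(f"SAE: {sae}")
        lines.append(f"AAVE: {aave}")
    # End with the target sentence and an open AAVE slot for the model.
    lines.append(f"SAE: {sae_sentence}")
    lines.append("AAVE:")
    return "\n".join(lines)

prompt = build_translation_prompt("He is working on the project right now.")
```

The resulting string would be sent to each of the five evaluated LLMs; the translations could then be scored with the metrics the paper lists (fluency, BARTScore, quality, coherence, understandability) and validated by fluent AAVE speakers.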
Anthology ID:
2024.nlp4pi-1.28
Volume:
Proceedings of the Third Workshop on NLP for Positive Impact
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Daryna Dementieva, Oana Ignat, Zhijing Jin, Rada Mihalcea, Giorgio Piatti, Joel Tetreault, Steven Wilson, Jieyu Zhao
Venue:
NLP4PI
Publisher:
Association for Computational Linguistics
Pages:
327–333
URL:
https://aclanthology.org/2024.nlp4pi-1.28
Cite (ACL):
Abhay Gupta, Ece Yurtseven, Philip Meng, and Kevin Zhu. 2024. AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark. In Proceedings of the Third Workshop on NLP for Positive Impact, pages 327–333, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark (Gupta et al., NLP4PI 2024)
PDF:
https://aclanthology.org/2024.nlp4pi-1.28.pdf