Towards Equitable Natural Language Understanding Systems for Dialectal Cohorts: Debiasing Training Data

Khadige Abboud; Gokmen Oz

Abstract

Despite being widely spoken, dialectal variants of languages are frequently considered low-resource due to a lack of writing standards and to orthographic inconsistencies. As a result, training natural language understanding (NLU) systems relies primarily on standard language resources, leading to biased and inequitable NLU technology that underserves dialectal speakers. In this paper, we propose to address this problem through a framework composed of a dialect identification model that is used to obtain targeted training data augmentation for under-represented dialects, in an effort to debias the NLU model for dialectal cohorts. We conduct experiments on two dialect-rich non-English languages, Arabic and German, using large-scale commercial NLU datasets as well as open-source datasets. Results show that such a framework can provide insights into dialect disparity in real-world NLU systems and that targeted data augmentation can help narrow the model’s performance gap between standard language speakers and dialect speakers.

Anthology ID: 2024.lrec-main.1433
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 16487–16499
URL: https://aclanthology.org/2024.lrec-main.1433
Cite (ACL): Khadige Abboud and Gokmen Oz. 2024. Towards Equitable Natural Language Understanding Systems for Dialectal Cohorts: Debiasing Training Data. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16487–16499, Torino, Italia. ELRA and ICCL.
Cite (Informal): Towards Equitable Natural Language Understanding Systems for Dialectal Cohorts: Debiasing Training Data (Abboud & Oz, LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.1433.pdf
