NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities

Abdellah El Mekki; Houdaifa Atou; Omer Nacar; Shady Shehata; Muhammad Abdul-Mageed

doi:10.18653/v1/2025.emnlp-main.556

Abstract

Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with the source-language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B-parameter LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. This work addresses Arabic dialects in LLMs, focusing on cultural and values alignment through controlled synthetic data generation and retrieval-augmented pre-training for Moroccan Darija and Egyptian Arabic, including Arabizi variants, and so advances Arabic NLP for low-resource communities. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in cultural LLM development: https://github.com/UBC-NLP/nilechat.

Anthology ID: 2025.emnlp-main.556
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 10967–10991
URL: https://aclanthology.org/2025.emnlp-main.556/
DOI: 10.18653/v1/2025.emnlp-main.556
Cite (ACL): Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, and Muhammad Abdul-Mageed. 2025. NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 10967–10991, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities (El Mekki et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.556.pdf
Checklist: 2025.emnlp-main.556.checklist.pdf
