Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles

Azzedine Aftiss, Salima Lamsiyah, Christoph Schommer, Said Ouatik El Alaoui


Abstract
Moroccan Dialect (MD), or “Darija,” is a primary spoken variant of Arabic in Morocco, yet it remains underrepresented in Natural Language Processing (NLP) research, particularly in tasks such as summarization. Despite a growing volume of MD textual data online, robust resources and NLP models tailored to the linguistic challenges of MD are lacking. In response, we introduce .MA_v2, an expanded version of the GOUD.MA dataset, containing over 50k articles with their titles across 11 categories. This dataset provides a more comprehensive resource for developing summarization models. We evaluate large language models (LLMs) for MD summarization, fine-tuning encoder-decoder models and applying zero-shot prompting to causal LLMs. Our findings demonstrate that an expanded dataset improves summarization performance and highlight the capabilities of recent LLMs in handling MD text. We open-source our dataset, fine-tuned models, and all experimental code, establishing a foundation for future advancements in MD NLP. The code is available at https://github.com/AzzedineAftiss/Moroccan-Dialect-Summarization.
Anthology ID:
2025.wacl-1.9
Volume:
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Saad Ezzini, Hamza Alami, Ismail Berrada, Abdessamad Benlahbib, Abdelkader El Mahdaouy, Salima Lamsiyah, Hatim Derrouz, Amal Haddad Haddad, Mustafa Jarrar, Mo El-Haj, Ruslan Mitkov, Paul Rayson
Venues:
WACL | WS
Publisher:
Association for Computational Linguistics
Pages:
77–85
URL:
https://aclanthology.org/2025.wacl-1.9/
Cite (ACL):
Azzedine Aftiss, Salima Lamsiyah, Christoph Schommer, and Said Ouatik El Alaoui. 2025. Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles. In Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4), pages 77–85, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles (Aftiss et al., WACL 2025)
PDF:
https://aclanthology.org/2025.wacl-1.9.pdf