Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles

Azzedine Aftiss; Salima Lamsiyah; Christoph Schommer; Said Ouatik El Alaoui

Abstract

Moroccan Dialect (MD), or “Darija,” is a primary spoken variant of Arabic in Morocco, yet it remains underrepresented in Natural Language Processing (NLP) research, particularly in tasks like summarization. Despite a growing volume of MD textual data online, there is a lack of robust resources and NLP models tailored to the unique linguistic challenges posed by MD. In response, we introduce GOUD.MA_v2, an expanded version of the GOUD.MA dataset, containing over 50k articles with their titles across 11 categories. This dataset provides a more comprehensive resource for developing summarization models. We evaluate the application of large language models (LLMs) for MD summarization, using fine-tuning with encoder-decoder models and zero-shot prompting with causal LLMs. Our findings demonstrate that an expanded dataset improves summarization performance and highlight the capabilities of recent LLMs in handling MD text. We open-source our dataset, fine-tuned models, and all experimental code, establishing a foundation for future advancements in MD NLP. The code is available at https://github.com/AzzedineAftiss/Moroccan-Dialect-Summarization.

Anthology ID: 2025.wacl-1.9
Volume: Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Saad Ezzini, Hamza Alami, Ismail Berrada, Abdessamad Benlahbib, Abdelkader El Mahdaouy, Salima Lamsiyah, Hatim Derrouz, Amal Haddad Haddad, Mustafa Jarrar, Mo El-Haj, Ruslan Mitkov, Paul Rayson
Venues: WACL | WS
Publisher: Association for Computational Linguistics
Pages: 77–85
URL: https://aclanthology.org/2025.wacl-1.9/
Cite (ACL): Azzedine Aftiss, Salima Lamsiyah, Christoph Schommer, and Said Ouatik El Alaoui. 2025. Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles. In Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4), pages 77–85, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles (Aftiss et al., WACL 2025)
PDF: https://aclanthology.org/2025.wacl-1.9.pdf
