SADSLyC: A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics

Salwa Saad Alahmari


Abstract
This paper presents the Saudi Arabian Dialects Song Lyrics Corpus (SADSLyC), the first dataset of song lyrics covering the five major Saudi dialects: Najdi (Central Region), Hijazi (Western Region), Shamali (Northern Region), Janoubi (Southern Region), and Shargawi (Eastern Region). The dataset contains 31,358 sentences totaling 151,841 words, where each sentence is a self-contained verse of a song. We also report a baseline experiment that uses the SaudiBERT model to classify the fine-grained dialects in SADSLyC; the model achieved an overall accuracy of 73% on the test set.
Anthology ID: 2025.wacl-1.4
Volume: Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Saad Ezzini, Hamza Alami, Ismail Berrada, Abdessamad Benlahbib, Abdelkader El Mahdaouy, Salima Lamsiyah, Hatim Derrouz, Amal Haddad Haddad, Mustafa Jarrar, Mo El-Haj, Ruslan Mitkov, Paul Rayson
Venues: WACL | WS
Publisher: Association for Computational Linguistics
Pages: 38–43
URL: https://aclanthology.org/2025.wacl-1.4/
Cite (ACL): Salwa Saad Alahmari. 2025. SADSLyC: A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics. In Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4), pages 38–43, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): SADSLyC: A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics (Alahmari, WACL 2025)
PDF: https://aclanthology.org/2025.wacl-1.4.pdf