Web-Based Corpus Compilation of the Emirati Arabic Dialect

Yousra A. El-Ghawi


Abstract
This paper presents initial efforts to compile Arabic dialectal corpora as raw text, with the end goal of fine-tuning existing Arabic large language models (LLMs) to better understand and generate text in the Emirati dialect when instructed. The paper focuses on the process of compiling corpora from the web, including possible methods, tools, and techniques specific to web search, along with examples of genres and domains to explore. It also touches on the results of these efforts and on the importance of native speaker contributions to corpus compilation for low-resource languages.
Anthology ID:
2025.wacl-1.7
Volume:
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Saad Ezzini, Hamza Alami, Ismail Berrada, Abdessamad Benlahbib, Abdelkader El Mahdaouy, Salima Lamsiyah, Hatim Derrouz, Amal Haddad Haddad, Mustafa Jarrar, Mo El-Haj, Ruslan Mitkov, Paul Rayson
Venues:
WACL | WS
Publisher:
Association for Computational Linguistics
Pages:
63–67
URL:
https://aclanthology.org/2025.wacl-1.7/
Cite (ACL):
Yousra A. El-Ghawi. 2025. Web-Based Corpus Compilation of the Emirati Arabic Dialect. In Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4), pages 63–67, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Web-Based Corpus Compilation of the Emirati Arabic Dialect (El-Ghawi, WACL 2025)
PDF:
https://aclanthology.org/2025.wacl-1.7.pdf