Yousra A. El-Ghawi
2025
Web-Based Corpus Compilation of the Emirati Arabic Dialect
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
This paper presents initial efforts to compile Arabic dialectal corpora in the form of raw text, with the end purpose of fine-tuning existing Arabic large language models (LLMs) to better understand and generate text in the Emirati dialect as instructed. The paper focuses on the process of compiling corpora from the web, including the exploration of methods, tools, and techniques specific to web search, as well as examples of genres and domains to explore. It also touches on the results of these efforts and on the importance of native speaker contributions to corpus compilation for low-resource languages.