Building Dialectal Arabic Corpora

Hani Elgabou, Dimitar Kazakov


Abstract
The aim of this research is to identify local Arabic dialects in texts from social media (Twitter) and link them to specific geographic areas. Dialect identification is studied as a subset of the task of language identification. The proposed method is based on unsupervised learning that uses lexical and geographic distance simultaneously. While this study focusses on Libyan dialects, the approach is general and could produce resources to support human translators and interpreters when dealing with vernaculars rather than standard Arabic.
Anthology ID:
W17-7907
Volume:
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology
Month:
September
Year:
2017
Address:
Varna, Bulgaria
Editors:
Irina Temnikova, Constantin Orasan, Gloria Corpas Pastor, Stephan Vogel
Venue:
RANLP
Publisher:
Association for Computational Linguistics, Shoumen, Bulgaria
Pages:
52–57
URL:
https://doi.org/10.26615/978-954-452-042-7_007
DOI:
10.26615/978-954-452-042-7_007
Cite (ACL):
Hani Elgabou and Dimitar Kazakov. 2017. Building Dialectal Arabic Corpora. In Proceedings of the Workshop Human-Informed Translation and Interpreting Technology, pages 52–57, Varna, Bulgaria. Association for Computational Linguistics, Shoumen, Bulgaria.
Cite (Informal):
Building Dialectal Arabic Corpora (Elgabou & Kazakov, RANLP 2017)
PDF:
https://doi.org/10.26615/978-954-452-042-7_007