Hierarchical Aggregation of Dialectal Data for Arabic Dialect Identification

Nurpeiis Baimukan, Houda Bouamor, Nizar Habash


Abstract
Arabic is a collection of dialectal variants that are historically related but significantly different. These differences can be seen across regions, countries, and even cities within the same country. Previous work on Arabic dialect identification has focused mainly on specific dialect levels (region, country, province, or city) using level-specific resources, and different efforts have used different schemas and labels. In this paper, we present the first effort to define a standard unified three-level hierarchical schema (region-country-city) for dialectal Arabic classification. We map 29 different data sets to this unified schema, and use the common mapping to facilitate aggregating these data sets. We test the value of such aggregation by building language models and using them in dialect identification. We make our label mapping code and aggregated language models publicly available.
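As a rough illustration of what mapping data-set-specific labels onto a three-level (region-country-city) schema and then aggregating could look like, the sketch below may help. It is not the authors' released mapping code: the data set names, source labels, region names, and the helpers `DialectLabel`, `LABEL_MAP`, `to_unified`, and `aggregate` are all hypothetical placeholders chosen for the example.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class DialectLabel(NamedTuple):
    """A label in a three-level hierarchy; finer levels may be unknown."""
    region: str
    country: Optional[str] = None
    city: Optional[str] = None


# Hypothetical mapping from (data set, original label) to the unified schema.
# Real data sets and their label inventories are NOT reproduced here.
LABEL_MAP = {
    ("dataset_a", "EGY"): DialectLabel("nile_basin", "eg"),
    ("dataset_b", "Cairo"): DialectLabel("nile_basin", "eg", "cairo"),
    ("dataset_c", "Gulf"): DialectLabel("gulf"),
}


def to_unified(dataset: str, label: str) -> DialectLabel:
    """Translate a data-set-specific label into the unified schema."""
    return LABEL_MAP[(dataset, label)]


def aggregate(examples):
    """Pool sentences at every hierarchy level that their label specifies.

    A city-level sentence also counts toward its country and region, so
    coarser levels accumulate data from all finer-grained sources.
    """
    buckets = defaultdict(list)
    for dataset, label, text in examples:
        unified = to_unified(dataset, label)
        buckets[(unified.region,)].append(text)
        if unified.country is not None:
            buckets[(unified.region, unified.country)].append(text)
            if unified.city is not None:
                buckets[(unified.region, unified.country, unified.city)].append(text)
    return buckets
```

Under these assumptions, each bucket could then serve as the training text for one per-level language model, which is the kind of aggregation the abstract describes evaluating through dialect identification.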
Anthology ID:
2022.lrec-1.489
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4586–4596
URL:
https://aclanthology.org/2022.lrec-1.489
Cite (ACL):
Nurpeiis Baimukan, Houda Bouamor, and Nizar Habash. 2022. Hierarchical Aggregation of Dialectal Data for Arabic Dialect Identification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4586–4596, Marseille, France. European Language Resources Association.
Cite (Informal):
Hierarchical Aggregation of Dialectal Data for Arabic Dialect Identification (Baimukan et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.489.pdf