Text and Speech-based Tunisian Arabic Sub-Dialects Identification

Najla Ben Abdallah, Saméh Kchaou, Fethi Bougares


Abstract
Dialect IDentification (DID) is a challenging task, and it becomes harder still when the dialects to be identified belong to the same country. Such dialects are closely related and overlap significantly at the phonetic and lexical levels. In this paper, we present our first results on a dialect classification task covering four sub-dialects spoken in Tunisia. We use the term ‘sub-dialect’ to refer to dialects belonging to the same country. Our experiments aim to discriminate between the Tunisian sub-dialects of four cities: Tunis, Sfax, Sousse and Tataouine. A spoken corpus of 1673 utterances was collected, transcribed and freely distributed. We used this corpus to build several speech- and text-based DID systems. Our results confirm that, at this level of granularity, dialects are much easier to distinguish using the speech modality. Indeed, our best speech-based identification system reaches an F-1 score of 93.75%, while text-based DID is limited to an F-1 score of 54.16% on the same test set.
Anthology ID:
2020.lrec-1.787
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6405–6411
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.787
Cite (ACL):
Najla Ben Abdallah, Saméh Kchaou, and Fethi Bougares. 2020. Text and Speech-based Tunisian Arabic Sub-Dialects Identification. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6405–6411, Marseille, France. European Language Resources Association.
Cite (Informal):
Text and Speech-based Tunisian Arabic Sub-Dialects Identification (Ben Abdallah et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.787.pdf