Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis

Chayma Fourati; Hatem Haddad; Abir Messaoudi; Moez BenHajhmida; Aymen Ben Elhaj Mabrouk; Malek Naski

Abstract

On social media platforms, people tend to communicate informally, writing posts and comments in their local dialects. In Africa, more than 1,500 dialects and languages exist. Tunisians in particular often write informally using Latin letters and numerals rather than Arabic script. In this paper, we introduce a large Common Crawl-based Tunisian Arabizi dialectal dataset dedicated to sentiment analysis. The dataset consists of 100k comments (about movies, politics, sports, etc.) manually annotated by native Tunisian speakers as positive, negative, or neutral. We evaluate the dataset on the sentiment analysis task using the multilingual version of Bidirectional Encoder Representations from Transformers (mBERT), a contextual language model, first as an embedding technique and then combined with a Convolutional Neural Network (CNN) as the classifier. The dataset is publicly available.

Anthology ID: 2021.wanlp-1.25
Volume: Proceedings of the Sixth Arabic Natural Language Processing Workshop
Month: April
Year: 2021
Address: Kyiv, Ukraine (Virtual)
Editors: Nizar Habash, Houda Bouamor, Hazem Hajj, Walid Magdy, Wajdi Zaghouani, Fethi Bougares, Nadi Tomeh, Ibrahim Abu Farha, Samia Touileb
Venue: WANLP
SIG: SIGARAB
Publisher: Association for Computational Linguistics
Pages: 226–230
URL: https://aclanthology.org/2021.wanlp-1.25/
Cite (ACL): Chayma Fourati, Hatem Haddad, Abir Messaoudi, Moez BenHajhmida, Aymen Ben Elhaj Mabrouk, and Malek Naski. 2021. Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 226–230, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.
Cite (Informal): Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis (Fourati et al., WANLP 2021)
PDF: https://aclanthology.org/2021.wanlp-1.25.pdf
