Arabizi Language Models for Sentiment Analysis

Gaétan Baert, Souhir Gahbiche, Guillaume Gadek, Alexandre Pauchet


Abstract
Arabizi is a written form of spoken Arabic, relying on Latin characters and digits. It is informal and does not follow any conventional rules, raising many NLP challenges. In particular, Arabizi has recently emerged as the Arabic language of online social networks, becoming of great interest for opinion mining and sentiment analysis. Unfortunately, only a few Arabizi resources exist, and state-of-the-art language models such as BERT do not consider Arabizi. In this work, we construct and release two datasets: (i) LAD, a corpus of 7.7M tweets written in Arabizi and (ii) SALAD, a subset of LAD, manually annotated for sentiment analysis. Then, a BERT architecture is pre-trained on LAD, in order to create and distribute an Arabizi language model called BAERT. We show that a language model (BAERT) pre-trained on a large corpus (LAD) in the same language (Arabizi) as that of the fine-tuning dataset (SALAD) outperforms a state-of-the-art multilingual pre-trained model (multilingual BERT) on a sentiment analysis task.
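The two-stage pipeline the abstract describes (pre-training a BERT architecture from scratch on LAD, then fine-tuning it on SALAD for sentiment classification) can be sketched with the Hugging Face `transformers` library. This is an illustrative reconstruction, not the authors' released code: the model sizes, vocabulary size, and the three-way label set are all assumptions.

```python
# Illustrative sketch (NOT the authors' code): pre-train a BERT architecture
# on an Arabizi corpus, then reuse the same configuration for a sentiment
# classifier. All hyperparameters here are placeholder assumptions.
from transformers import BertConfig, BertForMaskedLM, BertForSequenceClassification

# A deliberately tiny BERT configuration; the paper's actual sizes may differ.
config = BertConfig(
    vocab_size=1000,        # in practice, learned from the LAD tweets
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)

# Stage 1 -- pre-training: BERT encoder with a masked-language-model head,
# trained on the raw Arabizi tweets of LAD (training loop omitted).
pretraining_model = BertForMaskedLM(config)

# Stage 2 -- fine-tuning: the same encoder topped with a classification head,
# trained on SALAD's sentiment labels (assumed here: negative/neutral/positive).
config.num_labels = 3
classifier = BertForSequenceClassification(config)
```

In practice the fine-tuned classifier would be initialized from the pre-trained encoder's weights rather than from scratch; the sketch only shows how the two stages share one architecture.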
Anthology ID: 2020.coling-main.51
Volume: Proceedings of the 28th International Conference on Computational Linguistics
Month: December
Year: 2020
Address: Barcelona, Spain (Online)
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 592–603
URL: https://aclanthology.org/2020.coling-main.51
DOI: 10.18653/v1/2020.coling-main.51
Cite (ACL): Gaétan Baert, Souhir Gahbiche, Guillaume Gadek, and Alexandre Pauchet. 2020. Arabizi Language Models for Sentiment Analysis. In Proceedings of the 28th International Conference on Computational Linguistics, pages 592–603, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal): Arabizi Language Models for Sentiment Analysis (Baert et al., COLING 2020)
PDF: https://aclanthology.org/2020.coling-main.51.pdf
Data: TUNIZI