We introduce DiaSet, a novel dataset of dialectal Arabic speech, manually transcribed and annotated for two downstream tasks: sentiment analysis and named entity recognition. The dataset covers the Palestinian dialect, predominantly spoken in Palestine, Israel, and Jordan. It includes authentic conversations between YouTube influencers and their guests. We further enriched the dataset with simulated conversations, for which we invited participants from various locales within these regions to engage in dialogue with our interviewer. Overall, DiaSet consists of 644.8K tokens and 23.2K annotated instances. Uniform writing standards were upheld during transcription. Additionally, we established baseline models using several existing Arabic BERT language models, demonstrating the potential applications of our dataset. We make DiaSet publicly available for further research.
Sentiment classification and sarcasm detection attract considerable attention from the NLP research community. However, solving these two problems in Arabic on the basis of social network data (i.e., Twitter) has received less attention. In this paper we present dedicated solutions for the sentiment classification and sarcasm detection tasks that were introduced as part of a shared task by Abu Farha et al. (2021). We adapt existing state-of-the-art pretrained transformer models to our needs. In addition, we use a variety of machine-learning techniques, such as down-sampling, augmentation, bagging, and meta-features, to improve the models' performance. We achieve an F1-score of 0.75 on the sentiment classification problem, where the F1-score is calculated over the positive and negative classes (the neutral class is not taken into account). We achieve an F1-score of 0.66 on the sarcasm detection problem, where the F1-score is calculated over the sarcastic class only. In both cases, the reported results are evaluated on ArSarcasm-v2, an extended version of the ArSarcasm dataset (Farha and Magdy, 2020) that was introduced as part of the shared task. These results improve on the state of the art in both tasks.
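The two evaluation measures described above can be illustrated with a minimal sketch. It assumes that the F1-score "over the positive and negative classes" is the unweighted mean of the two per-class F1-scores; the abstract does not state the averaging scheme. The function names and the labels `POS`, `NEG`, `NEU`, `SARC`, and `NOT` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1-score of one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_over_classes(y_true: Sequence[str], y_pred: Sequence[str],
                    labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over `labels` only (assumed averaging)."""
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


# Sentiment: the neutral class is left out of the average.
sent_true = ["POS", "NEG", "NEU", "POS", "NEG", "NEU"]
sent_pred = ["POS", "NEG", "POS", "NEG", "NEG", "NEU"]
sentiment_f1 = f1_over_classes(sent_true, sent_pred, labels=["POS", "NEG"])

# Sarcasm: only the sarcastic class is scored.
sarc_true = ["SARC", "NOT", "SARC", "NOT"]
sarc_pred = ["SARC", "SARC", "NOT", "NOT"]
sarcasm_f1 = f1_for_class(sarc_true, sarc_pred, label="SARC")
```

Note that under this reading, neutral examples are excluded from the average but still influence the score: a neutral tweet predicted as positive counts as a false positive for the positive class.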