2024
PoliTun: Tunisian Political Dataset for Detecting Public Opinions and Categories Orientation
Chayma Fourati | Roua Hammami | Chiraz Latiri | Hatem Haddad
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)
Proceedings of The Second Arabic Natural Language Processing Conference
Nizar Habash | Houda Bouamor | Ramy Eskander | Nadi Tomeh | Ibrahim Abu Farha | Ahmed Abdelali | Samia Touileb | Injy Hamed | Yaser Onaizan | Bashar Alhafni | Wissam Antoun | Salam Khalifa | Hatem Haddad | Imed Zitouni | Badr AlKhamissi | Rawan Almatham | Khalil Mrini
2023
Proceedings of the Seventh Widening NLP Workshop (WiNLP 2023)
Bonaventure F. P. Dossou | Isidora Tourni | Hatem Haddad | Shaily Bhatt | Fatemehsadat Mireshghallah | Sunipa Dev | Tanvi Anand | Weijia Xu | Atnafu Lambebo Tonja | Alfredo Gomez | Chanjun Park
Proceedings of ArabicNLP 2023
Hassan Sawaf | Samhaa El-Beltagy | Wajdi Zaghouani | Walid Magdy | Ahmed Abdelali | Nadi Tomeh | Ibrahim Abu Farha | Nizar Habash | Salam Khalifa | Amr Keleg | Hatem Haddad | Imed Zitouni | Khalil Mrini | Rawan Almatham
2022
Proceedings of the Sixth Widening NLP Workshop (WiNLP)
Shaily Bhatt | Sunipa Dev | Bonaventure Dossou | Tirthankar Ghosal | Hatem Haddad | Haley M. Lepp | Fatemehsadat Mireshghallah | Surangika Ranathunga | Xanda Schofield | Isidora Tourni | Weijia Xu
iCompass Working Notes for the Nuanced Arabic Dialect Identification Shared task
Abir Messaoudi | Chayma Fourati | Hatem Haddad | Moez BenHajhmida
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
We describe our system submitted to the Nuanced Arabic Dialect Identification (NADI) shared task. We tackled only the first subtask (Subtask 1). We used state-of-the-art deep learning models and pre-trained contextualized text representation models that we fine-tuned for the downstream task at hand. We first used the BERT Arabic variant MARBERT, in both of its versions (MARBERT v1 and MARBERT v2); we then combined MARBERT embeddings with a CNN classifier; finally, we tested the Quasi-Recurrent Neural Network (QRNN) model. The results show that version 2 of MARBERT outperforms all of the previously mentioned models on Subtask 1.
iCompass at WANLP 2022 Shared Task: ARBERT and MARBERT for Multilabel Propaganda Classification of Arabic Tweets
Bilel Taboubi | Bechir Brahem | Hatem Haddad
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Propaganda detection in Arabic was carried out using the pre-trained transformer models ARBERT and MARBERT. They were fine-tuned for the downstream task at hand, Subtask 1: multilabel classification of Arabic tweets. The submitted model was MARBERT, which achieved a micro F1 score of 0.597 and ranked fifth.
iCompass at Arabic Hate Speech 2022: Detect Hate Speech Using QRNN and Transformers
Mohamed Aziz Bennessir | Malek Rhouma | Hatem Haddad | Chayma Fourati
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
This paper provides a detailed overview of the system we submitted as part of the OSACT2022 Shared Tasks on Fine-Grained Hate Speech Detection on Arabic Twitter, its outcome, and its limitations. Our submission uses a hard-parameter-sharing multi-task model consisting of a shared layer containing state-of-the-art contextualized text representation models such as MARBERT, AraBERT, and ARBERT, and task-specific layers fine-tuned with Quasi-Recurrent Neural Networks (QRNN) for each downstream subtask. The results show that MARBERT fine-tuned with QRNN outperforms all of the previously mentioned models.
TuniSER: Toward a Tunisian Speech Emotion Recognition System
Abir Messaoudi | Hatem Haddad | Moez Benhaj Hmida | Mohamed Graiet
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)
2021
TEET! Tunisian Dataset for Toxic Speech Detection
Slim Gharbi | Hatem Haddad | Mayssa Kchaou | Heger Arfaoui
Proceedings of the Fifth Workshop on Widening Natural Language Processing
Complete freedom of expression on social media has its costs, especially the spread of harmful and abusive content that may induce people to act accordingly. Detecting such content automatically is therefore an urgent task that will help limit its toxic spread. Compared to other Arabic dialects, which are mostly based on MSA, the Tunisian dialect combines many languages, including MSA, Tamazight, Italian, and French. Because of this linguistic richness, NLP for Tunisian can be challenging, a difficulty compounded by the lack of large annotated datasets. In our context of detecting hate and abusive speech for the Tunisian dialect, the only existing annotated dataset is T-HSAB, comprising 6,039 comments annotated as hateful, abusive, or normal. In this paper we introduce a larger annotated dataset composed of approximately 10k comments. We provide an in-depth exploration of its vocabulary as well as the classification performance of machine learning classifiers such as NB and SVM and deep learning models such as ARBERT, MARBERT, and XLM-R.
iCompass at NLP4IF-2021–Fighting the COVID-19 Infodemic
Wassim Henia | Oumayma Rjab | Hatem Haddad | Chayma Fourati
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda
This paper provides a detailed overview of our system and its outcomes, produced as part of the NLP4IF Shared Task on Fighting the COVID-19 Infodemic at NAACL 2021. We used state-of-the-art contextualized text representation models that were fine-tuned for the downstream task at hand: ARBERT, MARBERT, AraBERT, Arabic ALBERT, and BERT-base-arabic. According to the results, BERT-base-arabic achieved the highest F1 score on the test set, 0.784.
Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment Analysis
Chayma Fourati | Hatem Haddad | Abir Messaoudi | Moez BenHajhmida | Aymen Ben Elhaj Mabrouk | Malek Naski
Proceedings of the Sixth Arabic Natural Language Processing Workshop
On various social media platforms, people tend to communicate and to write posts and comments informally, in their local dialects. In Africa, more than 1500 dialects and languages exist. In particular, Tunisians talk and write informally using Latin letters and numbers rather than Arabic ones. In this paper, we introduce a large common-crawl-based Tunisian Arabizi dialectal dataset dedicated to sentiment analysis. The dataset consists of a total of 100k comments (about movies, politics, sports, etc.) manually annotated by Tunisian native speakers as positive, negative, or neutral. We evaluate our dataset on the sentiment analysis task using the multilingual version of Bidirectional Encoder Representations from Transformers (mBERT) as a contextual language model and embedding technique, then combining mBERT with a Convolutional Neural Network (CNN) as classifier. The dataset is publicly available.
iCompass at Shared Task on Sarcasm and Sentiment Detection in Arabic
Malek Naski | Abir Messaoudi | Hatem Haddad | Moez BenHajhmida | Chayma Fourati | Aymen Ben Elhaj Mabrouk
Proceedings of the Sixth Arabic Natural Language Processing Workshop
We describe our system submitted to the 2021 Shared Task on Sarcasm and Sentiment Detection in Arabic (Abu Farha et al., 2021). We tackled both subtasks, namely Sarcasm Detection (Subtask 1) and Sentiment Analysis (Subtask 2). We used state-of-the-art pre-trained contextualized text representation models and fine-tuned them for the downstream task at hand. We first used Google's multilingual BERT, then other Arabic variants: AraBERT, ARBERT, and MARBERT. The results show that MARBERT outperforms all of the previously mentioned models overall, on both Subtask 1 and Subtask 2.
2020
iCompass at SemEval-2020 Task 12: From a Syntax-ignorant N-gram Embeddings Model to a Deep Bidirectional Language Model
Abir Messaoudi | Hatem Haddad | Moez Ben Haj Hmida
Proceedings of the Fourteenth Workshop on Semantic Evaluation
We describe our system submitted to SemEval-2020. We tackled Task 12, entitled “Multilingual Offensive Language Identification in Social Media”, specifically subtask 4A-Arabic. We propose three Arabic offensive language identification models: Tw-StAR, BERT, and BERT+BiLSTM. Two Arabic abusive/hate datasets were added to the training data: L-HSAB and T-HSAB. The final submission was chosen based on the best performance, which was achieved by the BERT+BiLSTM model.
2019
Tw-StAR at SemEval-2019 Task 5: N-gram embeddings for Hate Speech Detection in Multilingual Tweets
Hala Mulki | Chedi Bechikh Ali | Hatem Haddad | Ismail Babaoğlu
Proceedings of the 13th International Workshop on Semantic Evaluation
In this paper, we describe our contribution to SemEval-2019: subtask A of Task 5, “Multilingual detection of hate speech against immigrants and women in Twitter (HatEval)”. We developed two hate speech detection model variants through the Tw-StAR framework. While the first model adopted one-hot encoded n-grams to train an NB classifier, the second generated and learned n-gram embeddings within a feedforward neural network. For both models, specific terms, selected via MWT patterns, were tagged in the input data. With the two feature types employed, we could investigate the ability of n-gram embeddings to rival one-hot n-grams. Our results showed that in English, n-gram embeddings outperformed one-hot n-grams. However, representing Spanish tweets by one-hot n-grams yielded slightly better performance than n-gram embeddings. The official ranking placed Tw-StAR 9th for English and 20th for Spanish.
L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language
Hala Mulki | Hatem Haddad | Chedi Bechikh Ali | Halima Alshabani
Proceedings of the Third Workshop on Abusive Language Online
Hate speech and abusive language have become a common phenomenon on Arabic social media. Automatic hate speech and abusive language detection systems can facilitate the prohibition of toxic textual content. The complexity, informality, and ambiguity of the Arabic dialects have hindered the provision of the resources needed for Arabic abusive/hate speech detection research. In this paper, we introduce the first publicly available Levantine Hate Speech and Abusive (L-HSAB) Twitter dataset, intended as a benchmark for automatic detection of online Levantine toxic content. We further provide a detailed review of the data collection steps and of how we designed the annotation guidelines so that reliable dataset annotation is guaranteed. This is later confirmed through a comprehensive evaluation of the annotations, as the annotation agreement metrics Cohen’s Kappa (k) and Krippendorff’s alpha (α) indicated the consistency of the annotations.
Syntax-Ignorant N-gram Embeddings for Sentiment Analysis of Arabic Dialects
Hala Mulki | Hatem Haddad | Mourad Gridach | Ismail Babaoğlu
Proceedings of the Fourth Arabic Natural Language Processing Workshop
Arabic sentiment analysis models have employed compositional embedding features to represent Arabic dialectal content. These embeddings are usually composed via ordered, syntax-aware composition functions and learned within deep neural frameworks. With the free word order and varying syntax across the different Arabic dialects, a sentiment analysis system developed for one dialect might not be efficient for the others. Here we present syntax-ignorant n-gram embeddings to be used in sentiment analysis of several Arabic dialects. The proposed embeddings were composed and learned using an unordered composition function and a shallow neural model. Five datasets of different dialects were used to evaluate the produced embeddings on the sentiment analysis task. The results revealed that our syntax-ignorant embeddings outperform the word2vec model, both doc2vec variant models, and hand-crafted system baselines, while competitive performance was observed against baseline systems that adopted more complicated neural architectures.
2018
Impact du Prétraitement Linguistique sur l’Analyse de Sentiment du Dialecte Tunisien (Impact of Linguistic Preprocessing on Sentiment Analysis of the Tunisian Dialect) [in French]
Chedi Bechikh Ali | Hala Mulki | Hatem Haddad
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN
Tw-StAR at SemEval-2018 Task 1: Preprocessing Impact on Multi-label Emotion Classification
Hala Mulki | Chedi Bechikh Ali | Hatem Haddad | Ismail Babaoğlu
Proceedings of the 12th International Workshop on Semantic Evaluation
In this paper, we describe our contribution to the SemEval-2018 contest. We tackled Task 1, “Affect in Tweets”, subtask E-c, “Detecting Emotions (multi-label classification)”. A multi-label classification system, Tw-StAR, was developed to recognize the emotions embedded in Arabic, English, and Spanish tweets. To handle the multi-label classification problem via traditional classifiers, we employed the binary relevance transformation strategy, while a TF-IDF scheme was used to generate the tweets’ features. We investigated using single preprocessing tasks and combinations of several to further improve performance. The results showed that specific combinations of preprocessing tasks could significantly improve the evaluation measures. This was later confirmed by the official results, as our system ranked 3rd for both the Arabic and Spanish datasets and 14th for the English dataset.
2017
Modern Trends in Arabic Sentiment Analysis: A Survey
Hala Mulki | Hatem Haddad | Ismail Babaoğlu
Traitement Automatique des Langues, Volume 58, Numéro 3 : Traitement automatique de l'arabe et des langues apparentées [NLP for Arabic and Related Languages]
Tw-StAR at SemEval-2017 Task 4: Sentiment Classification of Arabic Tweets
Hala Mulki | Hatem Haddad | Mourad Gridach | Ismail Babaoglu
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
In this paper, we present our contribution to the SemEval-2017 international workshop. We tackled Task 4, entitled “Sentiment analysis in Twitter”, specifically subtask 4A-Arabic. We propose two Arabic sentiment classification models implemented using supervised and unsupervised learning strategies. In both models, Arabic tweets were preprocessed first, then various bag-of-n-grams schemes were extracted to be used as features. The final submission was selected based on the best performance, which was achieved by the supervised learning-based model. However, the results obtained by the unsupervised learning-based model are considered promising and evolvable if richer lexica are adopted in further work.
Churn Identification in Microblogs using Convolutional Neural Networks with Structured Logical Knowledge
Mourad Gridach | Hatem Haddad | Hala Mulki
Proceedings of the 3rd Workshop on Noisy User-generated Text
For brands, gaining a new customer is more expensive than keeping an existing one; the ability to retain customers is therefore becoming ever more important. Churn happens when a customer leaves a brand for a competitor. Most previous work considers the problem of churn prediction using Call Detail Records (CDRs). In this paper, we use micro-posts to classify customers as churny or non-churny. We explore the power of convolutional neural networks (CNNs), since they have achieved state-of-the-art results in various computer vision and NLP applications. However, the robustness of end-to-end models has limitations, such as the need for a large amount of labeled data and the uninterpretability of these models. We investigate the use of CNNs augmented with structured logic rules to overcome or reduce this issue. We developed a system called Churn_teacher using an iterative distillation method that transfers the knowledge, extracted using just a combination of three logic rules, directly into the weights of the DNNs. Furthermore, we used weight normalization to speed up the training of our convolutional neural networks. Experimental results showed that with just these three rules, we were able to achieve state-of-the-art results on a publicly available Twitter dataset about three Telecom brands.
2012
Indexation à base des syntagmes nominaux (Nominal-chunk based indexing) [in French]
Amine Amri | Maroua Mbarek | Chedi Bechikh | Chiraz Latiri | Hatem Haddad
JEP-TALN-RECITAL 2012, Workshop DEFT 2012: DÉfi Fouille de Textes (DEFT 2012 Workshop: Text Mining Challenge)