2019
Tw-StAR at SemEval-2019 Task 5: N-gram embeddings for Hate Speech Detection in Multilingual Tweets
Hala Mulki | Chedi Bechikh Ali | Hatem Haddad | Ismail Babaoğlu
Proceedings of the 13th International Workshop on Semantic Evaluation
In this paper, we describe our contribution to SemEval-2019, subtask A of task 5, “Multilingual detection of hate speech against immigrants and women in Twitter (HatEval)”. We developed two hate speech detection model variants within the Tw-StAR framework. The first model used one-hot encoded n-grams to train a Naive Bayes (NB) classifier, while the second generated and learned n-gram embeddings within a feedforward neural network. For both models, specific terms, selected via MWT patterns, were tagged in the input data. Employing two feature types allowed us to investigate whether n-gram embeddings can rival one-hot n-grams. Our results showed that, for English, n-gram embeddings outperformed one-hot n-grams. For Spanish, however, representing tweets by one-hot n-grams yielded slightly better performance than n-gram embeddings. In the official ranking, Tw-StAR placed 9th for English and 20th for Spanish.
L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language
Hala Mulki | Hatem Haddad | Chedi Bechikh Ali | Halima Alshabani
Proceedings of the Third Workshop on Abusive Language Online
Hate speech and abusive language have become a common phenomenon on Arabic social media. Automatic detection systems for hate speech and abusive language can help curb toxic textual content. The complexity, informality and ambiguity of the Arabic dialects have hindered the provision of the resources needed for research on Arabic abusive/hate speech detection. In this paper, we introduce the first publicly available Levantine Hate Speech and Abusive (L-HSAB) Twitter dataset, intended as a benchmark dataset for automatic detection of toxic online Levantine content. We further provide a detailed review of the data collection steps and of how we designed the annotation guidelines so as to ensure reliable dataset annotation. This was confirmed by a comprehensive evaluation of the annotations, in which the agreement metrics Cohen’s kappa (κ) and Krippendorff’s alpha (α) indicated that the annotations are consistent.
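As an illustration of the agreement metric named in this abstract, the following is a minimal sketch of Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the toy labels are hypothetical and are not taken from the L-HSAB paper or dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each category, the product of the two
    # annotators' marginal proportions, summed over all categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy annotations (hypothetical, for illustration only).
annotator_1 = ["hate", "abusive", "normal", "normal", "hate", "normal"]
annotator_2 = ["hate", "normal", "normal", "normal", "hate", "abusive"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Values close to 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance. Krippendorff's alpha, which generalizes to more than two annotators, is not covered by this sketch.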
2018
Impact du Prétraitement Linguistique sur l’Analyse de Sentiment du Dialecte Tunisien (Impact of Linguistic Preprocessing on Sentiment Analysis of the Tunisian Dialect) [in French]
Chedi Bechikh Ali | Hala Mulki | Hatem Haddad
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN
Tw-StAR at SemEval-2018 Task 1: Preprocessing Impact on Multi-label Emotion Classification
Hala Mulki | Chedi Bechikh Ali | Hatem Haddad | Ismail Babaoğlu
Proceedings of the 12th International Workshop on Semantic Evaluation
In this paper, we describe our contribution to the SemEval-2018 contest. We tackled task 1 “Affect in Tweets”, subtask E-c “Detecting Emotions (multi-label classification)”. A multi-label classification system, Tw-StAR, was developed to recognize the emotions embedded in Arabic, English and Spanish tweets. To handle the multi-label classification problem with traditional classifiers, we employed the binary relevance transformation strategy, while a TF-IDF scheme was used to generate the tweets’ features. We investigated individual preprocessing tasks and combinations of them to further improve performance. The results showed that specific combinations of preprocessing tasks could significantly improve the evaluation measures. This was reflected in the official results, where our system ranked 3rd for both the Arabic and Spanish datasets and 14th for the English dataset.
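The binary relevance strategy mentioned in this abstract turns one multi-label problem into one independent binary problem per label, so that any traditional binary classifier can be trained on each. The sketch below shows only that label transformation and the recombination of per-label predictions; the function names and the toy emotion labels are hypothetical, and the TF-IDF features and classifiers used in the paper are not reproduced here.

```python
def binary_relevance_targets(label_sets, all_labels):
    """Split multi-label annotations into one 0/1 target vector per label."""
    return {
        label: [1 if label in labels else 0 for labels in label_sets]
        for label in all_labels
    }


def merge_predictions(per_label_predictions):
    """Recombine per-label 0/1 predictions into one label set per item."""
    n_items = len(next(iter(per_label_predictions.values())))
    return [
        {label for label, preds in per_label_predictions.items() if preds[i] == 1}
        for i in range(n_items)
    ]


# Toy annotations for three tweets (hypothetical, for illustration only).
tweet_labels = [{"joy", "optimism"}, {"anger"}, set()]
emotions = ["anger", "joy", "optimism"]

# One binary target vector per emotion; each would be paired with the same
# feature matrix to train a separate binary classifier.
targets = binary_relevance_targets(tweet_labels, emotions)

# Feeding the gold targets back through the merge step recovers the
# original label sets, which is a quick sanity check of the round trip.
recovered = merge_predictions(targets)
```

Because each label is modelled independently, this strategy ignores correlations between labels, which is the usual trade-off for its simplicity.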
2012
Indexation à base des syntagmes nominaux (Nominal-chunk based indexing) [in French]
Amine Amri | Maroua Mbarek | Chedi Bechikh | Chiraz Latiri | Hatem Haddad
JEP-TALN-RECITAL 2012, Workshop DEFT 2012: DÉfi Fouille de Textes (DEFT 2012 Workshop: Text Mining Challenge)