2022
pdf
bib
abs
AraBEM at WANLP 2022 Shared Task: Propaganda Detection in Arabic Tweets
Eshrag Ali Refaee
|
Basem Ahmed
|
Motaz Saad
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
Propaganda is information or ideas that an organized group or government spreads to influence peopleś opinions, especially by not giving all the facts or secretly emphasizing only one way of looking at the points. The ability to automatically detect propaganda-related linguistic signs is a challenging task that researchers in the NLP community have recently started to address. This paper presents the participation of our team AraBEM in the propaganda detection shared task on Arabic tweets. Our system utilized a pre-trained BERT model to perform multi-class binary classification. It attained the best score at 0.602 micro-f1, ranking third on subtask-1, which identifies the propaganda techniques as a multilabel classification problem with a baseline of 0.079.
pdf
bib
abs
QQATeam at Qur’an QA 2022: Fine-Tunning Arabic QA Models for Qur’an QA Task
Basem Ahmed
|
Motaz Saad
|
Eshrag A. Refaee
Proceedinsg of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
The problem of auto-extraction of reliable answers from a reference text like a constitution or holy book is a real challenge for the natural languages research community. Qurán is the holy book of Islam and the primary source of legislation for millions of Muslims around the world, which can trigger the curiosity of non-Muslims to find answers about various topics from the Qurán. Previous work on Question Answering (Q&A) from Qurán is scarce and lacks the benchmark of previously developed systems on a testbed to allow meaningful comparison and identify developments and challenges. This work presents an empirical investigation of our participation in the Qurán QA shared task (2022) that utilizes a benchmark dataset of 1,093 tuples of question-Qurán passage pairs. The dataset comprises Qurán verses, questions and several ranked possible answers. This paper describes the approach we follow with our participation in the shared task and summarises our main findings. Our system attained the best score at 0.63 pRR and 0.59 F1 on the development set and 0.56 pRR and 0.51 F1 on the test set. The best results of the Exact Match (EM) score at 0.34 indicate the difficulty of the task and the need for more future work to tackle this challenging task.
2020
pdf
bib
abs
An Arabic Tweets Sentiment Analysis Dataset (ATSAD) using Distant Supervision and Self Training
Kathrein Abu Kwaik
|
Stergios Chatzikyriakidis
|
Simon Dobnik
|
Motaz Saad
|
Richard Johansson
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection
As the number of social media users increases, they express their thoughts, needs, socialise and publish their opinions reviews. For good social media sentiment analysis, good quality resources are needed, and the lack of these resources is particularly evident for languages other than English, in particular Arabic. The available Arabic resources lack of from either the size of the corpus or the quality of the annotation. In this paper, we present an Arabic Sentiment Analysis Corpus collected from Twitter, which contains 36K tweets labelled into positive and negative. We employed distant supervision and self-training approaches into the corpus to annotate it. Besides, we release an 8K tweets manually annotated as a gold standard. We evaluated the corpus intrinsically by comparing it to human classification and pre-trained sentiment analysis models, Moreover, we apply extrinsic evaluation methods exploiting sentiment analysis task and achieve an accuracy of 86%.
2019
pdf
bib
abs
ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification
Kathrein Abu Kwaik
|
Motaz Saad
Proceedings of the Fourth Arabic Natural Language Processing Workshop
In this paper, we present a Dialect Identification system (ArbDialectID) that competed at Task 1 of the MADAR shared task, MADARTravel Domain Dialect Identification. We build a course and a fine-grained identification model to predict the label (corresponding to a dialect of Arabic) of a given text. We build two language models by extracting features at two levels (words and characters). We firstly build a coarse identification model to classify each sentence into one out of six dialects, then use this label as a feature for the fine-grained model that classifies the sentence among 26 dialects from different Arab cities, after that we apply ensemble voting classifier on both sub-systems. Our system ranked 1st that achieving an f-score of 67.32%. Both the models and our feature engineering tools are made available to the research community.
2018
pdf
bib
Shami: A Corpus of Levantine Arabic Dialects
Kathrein Abu Kwaik
|
Motaz Saad
|
Stergios Chatzikyriakidis
|
Simon Dobnik
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2014
pdf
bib
abs
Building and Modelling Multilingual Subjective Corpora
Motaz Saad
|
David Langlois
|
Kamel Smaïli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Building multilingual opinionated models requires multilingual corpora annotated with opinion labels. Unfortunately, such kind of corpora are rare. We consider opinions in this work as subjective or objective. In this paper, we introduce an annotation method that can be reliably transferred across topic domains and across languages. The method starts by building a classifier that annotates sentences into subjective/objective label using a training data from “movie reviews” domain which is in English language. The annotation can be transferred to another language by classifying English sentences in parallel corpora and transferring the same annotation to the same sentences of the other language. We also shed the light on the link between opinion mining and statistical language modelling, and how such corpora are useful for domain specific language modelling. We show the distinction between subjective and objective sentences which tends to be stable across domains and languages. Our experiments show that language models trained on objective (respectively subjective) corpus lead to better perplexities on objective (respectively subjective) test.
2013
pdf
bib
Comparing Multilingual Comparable Articles Based On Opinions
Motaz Saad
|
David Langlois
|
Kamel Smaïli
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora