Overview of the WANLP 2022 Shared Task on Propaganda Detection in Arabic

Propaganda is defined as an expression of opinion or action by individuals or groups, deliberately designed to influence the opinions or actions of other individuals or groups with reference to predetermined ends, achieved by means of well-defined rhetorical and psychological devices. Propaganda (or persuasion) techniques are now commonly used on social media to manipulate or mislead users. Automatic detection of propaganda techniques from textual, visual, or multimodal content has been studied recently; however, most such efforts have focused on English-language content. In this paper, we propose a shared task on detecting propaganda techniques in Arabic textual content. We have done a pilot annotation of 200 Arabic tweets, which we plan to extend to 2,000 tweets covering diverse topics. We hope that the shared task will help to build a community for Arabic propaganda detection. The dataset will be made publicly available to support future studies.


Introduction
Social media platforms have become an important communication channel, where we can share and access information from a variety of sources. Unfortunately, the rise of this democratic information ecosystem was accompanied by and dangerously polluted with misinformation, disinformation, and malinformation in the form of propaganda, conspiracies, rumors, hoaxes, fake news, hyper-partisan content, falsehoods, hate speech, cyberbullying, etc. (Oshikawa et al., 2020; Alam et al., 2021; Pramanick et al., 2021; Rosenthal et al., 2021; Alam et al., 2022; Barnabò et al., 2022; Guo et al., 2022; Hardalov et al., 2022; Nguyen et al., 2022; Sharma et al., 2022). Propaganda is conveyed through the use of diverse propaganda techniques (Miller, 1939), which range from leveraging the emotions of the audience (e.g., using loaded language or appealing to fear) to using logical fallacies such as straw man (misrepresenting someone's opinion), whataboutism, and red herring (presenting irrelevant data). In recent decades, propaganda has been widely used on social media to influence and/or mislead the audience, which has become a major concern for different stakeholders, social media platforms, and policymakers. To address this problem, the research area of computational propaganda has emerged; here, we are particularly interested in automatically identifying the use of propaganda techniques in text, images, and multimodal content. Prior work in this direction includes identifying propagandistic content in an article based on writing style and readability level (Rashkin et al., 2017; Barrón-Cedeno et al., 2019), at the sentence and the fragment level in news articles with fine-grained techniques (Da San Martino et al., 2019b), and in memes (Dimitrov et al., 2021a). These efforts focused on English, and there was no prior work on Arabic. Our shared task aims to bridge this gap by focusing on detecting propaganda in Arabic social media text, i.e., tweets.

Related Work
In the current information ecosystem, propaganda has evolved into computational propaganda (Woolley and Howard, 2018; Da San Martino et al., 2020b), where information is distributed on social media platforms, which makes it possible for malicious users to reach well-targeted communities at high velocity. Thus, research on propaganda detection has focused on analyzing not only news articles but also social media content (Rashkin et al., 2017; Barrón-Cedeno et al., 2019; Da San Martino et al., 2019b, 2020b; Nakov et al., 2021a,b; Hristakieva et al., 2022). Rashkin et al. (2017) focused on article-level propaganda analysis. They developed the TSHP-17 corpus, which used distant supervision for annotation with four classes: trusted, satire, hoax, and propaganda. The assumption of their distant supervision approach was that all articles from a given news source should share the same label. They collected their articles from the English Gigaword corpus and from seven other unreliable news sources, including two propagandistic ones. Later, Barrón-Cedeno et al. (2019) developed a new corpus, QProp, with two labels: propaganda vs. non-propaganda, and experimented on both the TSHP-17 and QProp corpora. For the TSHP-17 corpus, they binarized the labels: propaganda vs. any of the other three categories as non-propaganda. They investigated the writing style and the readability level of the target document, and trained models using logistic regression and SVMs. Their findings confirmed that using distant supervision, in conjunction with rich representations, might encourage the model to predict the source of the article, rather than to discriminate propaganda from non-propaganda. Similarly, Habernal et al. (2017, 2018) developed a corpus with 1.3k arguments annotated with five fallacies, including ad hominem, red herring, and irrelevant authority, which directly relate to propaganda techniques.
Recently, Da San Martino et al. (2019b) curated a set of persuasive techniques, ranging from leveraging the emotions of the audience, such as using loaded language and appeal to fear, to logical fallacies such as straw man (misrepresenting someone's opinion) and red herring (presenting irrelevant data). They focused on textual content, i.e., newspaper articles. In particular, they developed a corpus of news articles annotated with eighteen propaganda techniques. The annotation was at the fragment level, and could be used for two tasks: (i) binary classification — given a sentence in an article, predict whether any of the 18 techniques has been used in it; and (ii) multi-label classification and span detection — given a raw text, identify both the specific text fragments where a propaganda technique is used and the specific technique. They further proposed a multi-granular deep neural network that captures signals from the sentence-level task and helps to improve the fragment-level classifier. Da San Martino et al. (2020a) also organized a shared task on Detection of Propaganda Techniques in News Articles. Subsequently, Dimitrov et al. (2021b) organized the SemEval-2021 task 6 on Detection of Propaganda Techniques in Memes. It had a multimodal setup, combining text and images, and asked participants to build systems to identify the propaganda techniques used in a given meme. Yu et al. (2021) looked into interpretable propaganda detection.
The present shared task is inspired by prior work on propaganda detection. In particular, we adapted the annotation instructions and the propaganda techniques discussed in (Da San Martino et al., 2019b; Dimitrov et al., 2021b).

Tasks and Dataset
Below, we first formulate the two subtasks of our shared task, and then we discuss our datasets, including how we collected the data and what annotation guidelines we used.

Tasks
In the shared task, we offered the following two subtasks:
• Subtask 1: Given the text of a tweet, identify the propaganda techniques used in it.
• Subtask 2: Given the text of a tweet, identify the propaganda techniques used in it together with the span(s) of text in which each propaganda technique appears.
Note that Subtask 1 is formulated as a multilabel classification problem, while Subtask 2 is a sequence labeling task.
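To make the difference between the two formulations concrete, span annotations for Subtask 2 are commonly converted into token-level BIO tags for sequence labeling. The sketch below illustrates this with hypothetical tokens and technique names (the real data is Arabic and span boundaries are character offsets; this is only an illustration):

```python
def spans_to_bio(tokens, spans):
    """Convert token-level span annotations to BIO tags.

    spans: list of (start_token, end_token_exclusive, technique) triples.
    """
    tags = ["O"] * len(tokens)
    for start, end, tech in spans:
        tags[start] = f"B-{tech}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{tech}"          # continuation tokens
    return tags

tokens = ["the", "corrupt", "elites", "must", "go"]
spans = [(1, 3, "Name_Calling")]
print(spans_to_bio(tokens, spans))
# ['O', 'B-Name_Calling', 'I-Name_Calling', 'O', 'O']
```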

Dataset
We used Social Bakers to obtain the top-2 news sources from each Arab country, e.g., Al Arabiya and Sky News Arabia from UAE, Al Jazeera and Al Sharq from Qatar, etc. We further added five international sources that broadcast Arabic news: Al-Hurra News, BBC Arabic, CNN Arabic, France 24, and Russia Today. We then extracted from Twitter their latest 3,200 tweets. To have a balanced dataset that covers a wide range of topics, we chose 100 random tweets from each source, and then we sampled 930 tweets for annotation.
We target emotional appeals (e.g., loaded language, appeal to fear, flag-waving, exaggeration) and logical fallacies (e.g., whataboutism, causal oversimplification, red herring, bandwagon). We adopted the same techniques studied in (Da San Martino et al., 2019b; Dimitrov et al., 2021b). Below, we briefly summarize them:
1. Appeal to authority: Stating that a claim is true simply because a valid authority or expert on the issue said it was true. We also include here the special case where the reference is not an authority or an expert, which is referred to as Testimonial in the literature.
2. Appeal to fear / prejudices: Seeking to build support for an idea by instilling anxiety and/or panic in the population towards an alternative.
In some cases, the support is built based on preconceived judgements.
3. Bandwagon: Attempting to persuade the target audience to join in and take the course of action because "everyone else is taking the same action."
4. Black-and-white fallacy or dictatorship: Presenting two alternative options as the only possibilities, when in fact more possibilities exist. In the extreme case, telling the audience exactly what actions to take, eliminating any other possible choice (dictatorship).
5. Causal oversimplification: Assuming a single cause or reason when there are actually multiple causes for an issue.This includes transferring blame to one person or group of people without investigating the complexities of the issue.
6. Doubt: Questioning the credibility of someone or something.
7. Exaggeration / minimisation: Either representing something in an excessive manner: making things larger, better, worse (e.g., the best of the best, quality guaranteed) or making something seem less important or smaller than it really is (e.g., saying that an insult was actually just a joke).
8. Flag-waving: Playing on strong national feeling (or feeling toward any group, e.g., race, gender, political preference) to justify or promote an action or an idea.
9. Glittering generalities (virtue): Words or symbols in the value system of the target audience that produce a positive image when attached to a person or issue. Peace, hope, happiness, security, wise leadership, freedom, "The Truth", etc. are virtue words.
Virtue can be also expressed in images, where a person or an object is depicted positively.
10. Loaded language: Using specific words and phrases with strong emotional implications (either positive or negative) to influence an audience.
11. Misrepresentation of someone's position (straw man): Substituting an opponent's proposition with a similar one, which is then refuted in place of the original proposition.
12. Name calling or labeling: Labeling the object of the propaganda campaign as something that the target audience fears, hates, finds undesirable or loves, praises.
13. Obfuscation, intentional vagueness, confusion: Using words that are deliberately not clear, so that the audience may have their own interpretations. For example, an unclear phrase with multiple possible meanings is used within an argument and, therefore, does not support the conclusion.

14. Presenting irrelevant data (red herring): Introducing irrelevant material to the issue being discussed, so that everyone's attention is diverted away from the points made.
15. Reductio ad hitlerum: Persuading an audience to disapprove of an action or an idea by suggesting that the idea is popular with groups held in contempt by the target audience. It can refer to any person or concept with a negative connotation.
16. Repetition: Repeating the same message over and over again, so that the audience will eventually accept it.
17. Slogans: A brief and striking phrase that may include labeling and stereotyping. Slogans tend to act as emotional appeals.
18. Smears: An effort to damage or call into question someone's reputation by propounding negative propaganda. It can be applied to individuals or groups.
19. Thought-terminating cliché: Words or phrases that discourage critical thought and meaningful discussion about a given topic.They are typically short, generic sentences that offer seemingly simple answers to complex questions or that distract the attention away from other lines of thought.

20. Whataboutism: A technique that attempts to discredit an opponent's position by charging them with hypocrisy without directly disproving their argument.

The annotation is done in stages: (i) three annotators independently annotate the same tweet, and (ii) they meet with a consolidator to discuss each instance and to agree on the gold annotations. Since the annotations are at the fragment level, an annotation may be spotted by only one annotator; the two phases ensure that each annotation is eventually discussed by all annotators. To train the annotators, we provide clear annotation instructions with examples and ask them to annotate a sample of tweets. We then revise their annotations and provide feedback. Figures 1 and 2 show example tweets with annotated propaganda techniques.
Table 1 shows the distribution of the propaganda techniques in our dataset for the different data splits. Our annotation guidelines include twenty techniques, but in the annotated dataset, there were no instances of bandwagon, straw man, or reductio ad hitlerum. Overall, the distribution of the propaganda techniques in our dataset is very skewed, which makes the task challenging.

Evaluation Measures
To measure the performance of the systems, for both subtasks, we use micro-F1 and macro-F1, as these are multi-class multi-label problems, where the labels are imbalanced.The official evaluation measure for subtask 1 is micro-F1, but the scorer also reports macro-F1.
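For exposition, the two averaging schemes for multi-label predictions can be sketched as follows (a plain reimplementation, not the official scorer; the label names in the example are illustrative):

```python
def f1_scores(gold, pred, labels):
    """Micro- and macro-averaged F1 for multi-label predictions.

    gold, pred: lists of label sets, one entry per tweet.
    """
    counts = {}
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if lab in g and lab in p)
        fp = sum(1 for g, p in zip(gold, pred) if lab not in g and lab in p)
        fn = sum(1 for g, p in zip(gold, pred) if lab in g and lab not in p)
        counts[lab] = (tp, fp, fn)

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    # micro: pool the counts over all labels, then compute F1 once
    micro = f1(*[sum(c[i] for c in counts.values()) for i in range(3)])
    # macro: compute F1 per label, then average over labels
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro
```

With a skewed label distribution, macro-F1 penalizes systems that ignore rare techniques, while micro-F1 is dominated by the frequent ones.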
Subtask 2 is a multi-label sequence tagging problem. We modified the standard micro-averaged F1 to account for partial matching between spans. More details about this modified micro-averaged F1 can be found in (Da San Martino et al., 2019b; Dimitrov et al., 2021b).
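A simplified version of such partial-match scoring, following the general idea in Da San Martino et al. (2019b), might look like the sketch below; each matching pair of spans contributes its overlap normalized by the predicted span length (for precision) or the gold span length (for recall). The character-offset triples are illustrative, and the official scorer may differ in details:

```python
def partial_f1(pred, gold):
    """F1 with partial credit for overlapping labeled spans.

    pred, gold: lists of (start, end, technique) with character offsets;
    an overlap contributes only when the technique labels match.
    """
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    def cumulative(norm):
        # each matching pair contributes overlap / length of the
        # normalizing span chosen by `norm`
        return sum(overlap(s, t) / (norm(s, t)[1] - norm(s, t)[0])
                   for s in pred for t in gold if s[2] == t[2])

    p = cumulative(lambda s, t: s) / len(pred) if pred else 0.0
    r = cumulative(lambda s, t: t) / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, predicting only the first half of a gold span with the correct label yields precision 1.0 but recall 0.5.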

Task Organization
We ran the shared task in two phases:
Development Phase In the first phase, we provided the participants with three subsets of the dataset: train, dev, and dev_test. The purpose of the dev set was to fine-tune the trained models, while the dev_test set served to evaluate model performance on unseen data.
Test Phase In the second phase, we released the actual test set, and the participants were given just a few days to submit their final predictions via the submission system on Codalab. In this phase, the participants could again submit multiple runs, but they would not get any feedback on their performance. Only the latest submission of each team was considered official and was used for the final team ranking. The final leaderboard on the test set was made publicly available after the system submission deadline.

Participants and Results
In this section, we provide a general description of the systems that participated in each subtask and of their results. Table 2 shows the results for all teams for both subtasks, as well as a random baseline. We can see that subtask 1 was more popular, attracting submissions from 14 teams, while there were only three submissions for subtask 2.

Subtask 1
Table 3 gives an overview of the systems that took part in subtask 1. We can see that transformers were quite popular, most notably AraBERT, followed by BERT and MARBERT. Some participants also used ensemble methods, data augmentation, and standard preprocessing.
The best-performing team, NGU_CNLP (Samir et al., 2022), first explored various baseline models, such as bag-of-words representations with SVM, Naïve Bayes, Stochastic Gradient Descent, Logistic Regression, Random Forests, and K-nearest Neighbors classifiers, before adopting a data augmentation strategy combined with a transformer-based model.

Table 2: Results for subtask 1 on multilabel propaganda detection and subtask 2 on identifying propaganda techniques and their span(s) in the text. The results are ordered by the official score: micro-F1. * indicates that no system description paper was submitted.
The third-best system, CNLP-NITS-PP (Laskar et al., 2022), used the AraBERT Twitter-base model along with data augmentation. Note that all systems outperformed the random baseline.

Subtask 2
In Table 3, we also present an overview of the systems that took part in subtask 2. Once again, this subtask was dominated by transformer models. We can see in the table that transformers were quite popular, and among them, the most commonly used one was AraBERT, followed by BERT and MARBERT. The participants in this subtask also used data augmentation and standard pre-processing.
Table 2 shows the evaluation results: we report our random baseline, which is based on the random selection of spans with random lengths and a random assignment of labels. The best system for this subtask was Pythoneers (Attieh and Hassan, 2022). They used AraBERT with a Conditional Random Field (CRF) layer, trained on data encoded using the BIO scheme.
The second-best system, IITD (Mittal and Nakov, 2022), used a Multi-Granularity Network (Da San Martino et al., 2019b) with an mBERT encoder.
The third system was NGU_CNLP (Samir et al., 2022).They converted the data to BIO format and fine-tuned a token classifier based on Marefa-NER 3 (pretrained using XLM-RoBERTa).

Participants' Systems
NGU_CNLP (Samir et al., 2022)[subtask 1:1, subtask 2:3]. This team participated in both subtasks. For subtask 1, they used a combination of a data augmentation strategy and a transformer-based model. This model ranked first among the 14 systems that participated in this subtask. Their preliminary experiments for subtask 1 used a bag-of-words model with different classical algorithms, such as Support Vector Machines, Naïve Bayes, Stochastic Gradient Descent, Logistic Regression, Random Forests, and K-nearest Neighbors. For subtask 2, they fine-tuned the Marefa-NER model, which is based on XLM-RoBERTa. The system ranked third among the three systems that participated in this subtask.
3 https://huggingface.co/marefa-nlp/marefa-ner

Pythoneers (Attieh and Hassan, 2022)[subtask 1:3, subtask 2:1]. This team also participated in both subtasks. For subtask 1, they trained a multi-task learning model that performs binary classification per propaganda technique. For subtask 2, they first converted the data into BIO format and then fine-tuned an AraBERT model with a Conditional Random Field (CRF) layer. Their subtask 1 system ranked third with a micro-averaged F1 score of 0.602, and their subtask 2 system ranked first with a micro-averaged F1 score of 0.396.
IITD (Mittal and Nakov, 2022). This team also participated in both subtasks, using multilingual pretrained language models for both. For subtask 1, they used a pretrained XLM-R model to estimate a Multinoulli distribution after projecting the CLS embedding to a 20-dimensional embedding (one dimension per propaganda technique). For subtask 2, they used a multi-granularity network (Da San Martino et al., 2019b) with an mBERT encoder. Even though both systems were trained only on the dataset released in this shared task, the authors also discussed several methods (zero-shot transfer, continued training, and translation of PTC (Da San Martino et al., 2019b) to Arabic) to study cross-lingual propaganda detection. This suggests interesting research challenges for future exploration, such as how to effectively use data from different domains and how to learn language-agnostic embeddings for propaganda detection systems.
CNLP-NITS-PP (Laskar et al., 2022)[subtask 1:3]. This team participated in subtask 1 and used the AraBERT Twitter-base model for multilabel propaganda classification. They further used data augmentation: in particular, they generated synthetic training examples from the original training samples using root and stem substitution. They one-hot encoded the input labels to indicate multiple labels and modified the macro-F1 scorer to score multiple labels. To make predictions with the model, they used a sentiment analysis pipeline from HuggingFace Transformers and selected all labels with a score greater than or equal to 0.32. They observed that, for predictions on the validation set, most correct labels had a score above 0.30, and there was a large gap in the scores below that value.
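Their thresholding step can be sketched as follows; the technique names and scores in the example are illustrative, and only the 0.32 threshold comes from the description above (the fallback label name is an assumption):

```python
def select_labels(scores, threshold=0.32):
    """Keep every technique whose classifier score clears the threshold;
    if none does, fall back to an empty-class label (name assumed)."""
    chosen = [label for label, score in scores.items() if score >= threshold]
    return chosen or ["no technique"]

print(select_labels({"Loaded_Language": 0.81, "Smears": 0.35, "Doubt": 0.12}))
# ['Loaded_Language', 'Smears']
```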
AraBEM (Eshrag Ali et al., 2022)[subtask 1:3]. This team participated in subtask 1 and fine-tuned BERT to perform multi-class binary classification. They used standard pre-processing, including normalization (mapping letters with various forms, i.e., alef, hamza, and yaa, to their representative characters) and removing special characters, diacritics, and repeated characters.
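The normalization steps described above can be sketched as follows; the exact character mappings used by the team are assumptions:

```python
import re

ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")   # آ أ إ -> ا (assumed mapping)
ALEF_MAQSURA = "\u0649"                              # ى -> ي
DIACRITICS = re.compile("[\u064B-\u0652]")           # tanween, short vowels, shadda, sukun

def normalize(text):
    text = ALEF_VARIANTS.sub("\u0627", text)         # unify alef/hamza-on-alef forms
    text = text.replace(ALEF_MAQSURA, "\u064A")      # unify yaa forms
    text = DIACRITICS.sub("", text)                  # strip diacritics
    text = re.sub(r"(.)\1{2,}", r"\1", text)         # collapse repeated characters
    return text

print(normalize("\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"))  # محمّد -> محمد
```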
AraProp (Singh, 2022)[subtask 1:4]. This team participated in subtask 1. First, they tokenized the input and produced contextualized word embeddings for all input tokens. To get a fixed-size output representation, they averaged all contextualized word embeddings, taking the attention mask into account for correct averaging. Then, they added a dropout layer with a dropout rate of 0.3, followed by a linear layer with a sigmoid activation function for the output. They experimented with multiple transformer-based language models: two multilingual models and six monolingual (Arabic) models. Their findings suggest that the MARBERTv2-based fine-tuned model outperforms the other models in terms of micro-F1 score.
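The mask-aware averaging they describe can be sketched in plain Python; real systems operate on framework tensors, so this scalar version is only for exposition:

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, counting only non-padding positions.

    token_embeddings: list of per-token vectors (lists of floats)
    attention_mask:   list of 0/1 flags, 1 for real (non-padding) tokens
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    n_real = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:                       # skip padding positions entirely
            n_real += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [v / n_real for v in total]

print(masked_mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))
# [2.0, 3.0]
```

Averaging without the mask would let padding vectors dilute the sentence representation, which is why the mask matters here.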
iCompass (Taboubi et al., 2022)[subtask 1:5]. This team participated in subtask 1. Their system used standard pre-processing, such as normalization and removing stopwords, emojis, special characters, and links. Then, they used pre-trained language models such as MARBERT and ARBERT, adding global average and max pooling layers on top of the models. Finally, they used cross-validation to improve the model performance.
SI2M & AIOX Labs (Gaanoun and Benelallam, 2022)[subtask 1:6]. This team participated in subtask 1. They used data augmentation, named entity recognition (NER), and manual rules. For data augmentation, they combined the training and the dev sets and randomly mixed the sequences to create new synthetic sequences, which they concatenated with the train and the dev sets. Their final system uses a mixed dataset of 2,000 examples. Next, they fine-tuned ARBERT on the augmented dataset and made predictions based on a defined threshold on the classifier's confidence. If no technique got a prediction probability greater than the threshold, the instance was assigned the label no technique. Moreover, to detect the Name Calling/Labelling technique, they used a NER model based on AraBERT. Finally, to detect Repetition, they used manual rules after removing the stopwords.
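The sequence-mixing augmentation can be sketched as follows; the team does not specify exactly how sequences were mixed, so the pairwise concatenation and label-union choices here are assumptions:

```python
import random

def mix_augment(examples, n_new, seed=0):
    """Create synthetic multi-label examples by mixing existing ones.

    examples: list of (text, labels) pairs from train+dev.
    Each synthetic example concatenates two randomly chosen tweets and
    takes the union of their technique labels (an assumed strategy).
    """
    rng = random.Random(seed)
    out = []
    for _ in range(n_new):
        (text1, labels1), (text2, labels2) = rng.sample(examples, 2)
        out.append((text1 + " " + text2, sorted(set(labels1) | set(labels2))))
    return out
```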
ChavanKane (Chavan and Kane, 2022)[subtask 1:9]. This team participated in subtask 1 and experimented with AraBERT v1, v02, and v2, MARBERT, ARBERT, XLM-RoBERTa, and AraELECTRA. They used a specific variant of DeHateBERT, which is initialized from multilingual BERT and fine-tuned only on Arabic datasets. They also tried an ensemble of five models: DeHateBERT, AraBERTv2, AraBERTv02, AraBERTv01, and MARBERT. For the final prediction from the ensemble, they used hard voting.
TUB (Mohtaj and Möller, 2022)[subtask 1:11]. This team participated in subtask 1 and used a semantic similarity detection approach based on contextual word embeddings. They converted all sentences in the train, dev, and test sets into vectors using a BERT model. For each sentence in the test set, they retrieved the five most similar instances from the train and dev sets with a cosine similarity above 0.4. Then, they assigned the three most frequent labels among these five instances as the labels of the target sentence.
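Their nearest-neighbor labeling scheme can be sketched as follows; the vectors and labels in the example are toy values, while the 0.4 similarity threshold, the five neighbors, and the three labels come from the description above:

```python
from collections import Counter
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_labels(query_vec, train, k=5, min_sim=0.4, n_labels=3):
    """train: list of (sentence_vector, labels) pairs from train+dev.
    Returns the most frequent labels among the k most similar
    neighbours whose cosine similarity exceeds min_sim."""
    ranked = sorted(train, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    neighbours = [labels for vec, labels in ranked[:k]
                  if cosine(query_vec, vec) > min_sim]
    counts = Counter(lab for labels in neighbours for lab in labels)
    return [lab for lab, _ in counts.most_common(n_labels)]
```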

Conclusion and Future Work
We presented the WANLP 2022 shared task on Propaganda Detection in Arabic, as part of which we developed the first dataset for Arabic propaganda detection with a focus on social media content. The task was successful: a total of 63 teams registered to participate, and 14 and 3 teams eventually made official submissions on the test set for subtasks 1 and 2, respectively. Finally, 11 teams submitted a task description paper. Subtask 1 asked participants to identify the propaganda techniques used in a tweet, and subtask 2 further asked them to identify the span(s) of text in which each propaganda technique appears. For both subtasks, the majority of the systems fine-tuned pre-trained Arabic language models and used standard pre-processing. Some systems used data augmentation and ensemble methods.
In future work, we plan to increase the data size and to add hierarchically structured propaganda techniques.

Figure 1: An example of tweet annotation with the propaganda techniques loaded language and name calling.

Figure 2: An example of tweet annotation with the propaganda techniques loaded language and slogan.

Table 1: Statistics about the corpus. In parentheses, we show the number of tweets. Total represents the number of techniques in each set.

Table 3: Overview of the approaches used for subtasks 1 and 2, for the teams that submitted a description paper. The systems are ordered by the official score: micro-F1.