DamascusTeam at NLP4IF2021: Fighting the Arabic COVID-19 Infodemic on Twitter Using AraBERT

This work introduces an effective approach based on the AraBERT language model for fighting the COVID-19 infodemic on Twitter. It is arranged as a two-step pipeline: the first step applies a series of pre-processing procedures to transform Twitter jargon, including emojis and emoticons, into plain text, and the second step fine-tunes a version of AraBERT, pre-trained on plain text, to classify the tweets with respect to their labels. The use of language models pre-trained on plain text rather than on tweets is motivated by two points from the scientific literature: (1) pre-trained language models are widely available in many languages, which avoids the time-consuming and resource-intensive training of a model on tweets from scratch and allows the effort to focus on fine-tuning; (2) available plain-text corpora are larger than tweet-only ones, allowing for better performance.


Introduction
In the past few years, social media platforms such as Twitter, Facebook, and Instagram have become very popular because they make information easy to acquire and quick to share (Vicario et al., 2016; Kumar et al., 2018). The work presented in this paper focuses on Twitter, a micro-blogging web service with over 330 million monthly active users that has gained popularity as a major news source and information dissemination channel in recent years. Twitter provides on-the-ground information and helps in reaching out to people in need, and thus plays an important role in aiding crisis management teams (Ntalla et al., 2015). At the same time, social media platforms have become a hot-spot for sharing misinformation, and the availability of unauthentic data on them has gained massive attention among researchers (Gorrell et al., 2019; Vosoughi et al., 2017). Because of its tremendous negative impact, infodemic misinformation has drawn increasing attention from researchers, journalists, politicians and the general public (Gorrell et al., 2019; Vosoughi et al., 2017; Zhou et al., 2018). Misinformation is written or published with the intent to mislead people and to damage the image of an agency, entity, or person, for either financial or political benefit (Zhou et al., 2018; Ghosh et al., 2018; Ruchansky et al., 2017; Shu et al., 2020). This paper is organized as follows: Section 2 describes related work in this domain; Section 3 presents our methodology in detail; Section 4 discusses the evaluation of our proposed solution; and the last section concludes and describes future work.

Related Works
Various techniques have been used to address infodemic misinformation on online social media, especially for English content. This section briefly summarizes the work in this field. Allcott et al. (2017) presented a quantitative report on the impact of misinformation on social media during the 2016 U.S. presidential election and its effect on U.S. voters.

Ahmad Hussein 1 , Nada Ghneim 2 , and Ammar Joukhadar 1
The authors investigated the authentic and unauthentic URLs related to misinformation in the BuzzFeed dataset. Shu et al. (2019) investigated a way to automate the process through hashtag recurrence. The authors also presented a comprehensive review of misinformation detection on social media, covering false-news characterizations grounded in psychology and social theories as well as existing algorithms from a data mining perspective. Ghosh et al. (2018) investigated the impact of web-based social networking on political decisions. A substantial body of research (Zhou et al., 2018; Allcott et al., 2017; Zubiaga et al., 2018) has addressed the detection of political news articles. The authors investigated the effect of various political parties on the discussion of misinformation as an agenda. They also explored the Twitter data of six Venezuelan government officials in order to investigate bot collaboration. Their findings suggest that political bots in Venezuela tend to imitate members of political parties or ordinary citizens. In one study, Zhou et al. (2018) investigated the ability of social media to aggregate the judgments of a large community of users. In further work, they described machine learning approaches aimed at better rumor detection. They examined the difficulties posed by the spread of rumors, rumor classification, and deception for the development of such systems. They also investigated how these strategies can be used to build frameworks that help individuals make decisions when evaluating the credibility of information gathered from various social media platforms.
Jwa et al. (2019) explored an approach to automatic misinformation detection. They used the Bidirectional Encoder Representations from Transformers (BERT) model to detect misinformation by analyzing the relationship between the headline and the body text of a news story. Their results improve the F-score by 0.14 over existing state-of-the-art models. Williams et al. (2020) used BERT and RoBERTa models to identify claims in social media text that a professional fact-checker should review. For English, they fine-tuned a RoBERTa model and added an extra mean pooling layer and a dropout layer to improve generalization to unseen text. For Arabic, they fine-tuned Arabic-language BERT models and demonstrated the use of back-translation to augment the minority class and balance the dataset. Hussein et al. (2020) presented an approach to analyze the worthiness of Arabic information on Twitter. To train the classification model, they annotated for worthiness a dataset of 5000 Arabic tweets, corresponding to 4 high-impact news events of 2020 around the world, in addition to a dataset of 1500 tweets provided by CLEF 2020. They proposed two models to classify the worthiness of Arabic tweets: a BI-LSTM model and a CNN-LSTM model. Results show that the BI-LSTM model extracts the worthiness of tweets better.

Methodology
In this section, we present our methodology by explaining the steps of building the models, all of which share the same architecture: Data Set, Data Preprocessing, AraBERT System Architecture, and Model Training.

Data Set
We used a dataset of 2556 tweets about COVID-19 provided by NLP4IF 2021 (Shaar et al., 2021). Besides the tweet text and the tweet ID, each tweet is annotated with seven binary properties: whether it contains a verifiable claim (Q1), whether it appears to contain false information (Q2), whether it may be of interest to the general public (Q3), whether it is harmful (Q4), whether it needs verification (Q5), whether it is harmful to society (Q6) and whether it requires attention of government entities (Q7). Each question has a Yes/No (binary) annotation. However, the answers to Q2, Q3, Q4 and Q5 are all "nan" if the answer to Q1 is No. Table 1 shows the statistics of the class labels for each property in the dataset.
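The dependency between Q1 and Q2 to Q5 can be made explicit when loading the annotations. The following is a minimal sketch, assuming each tweet arrives as a dictionary with hypothetical column names `Q1` to `Q7` holding the strings "yes", "no" or "nan"; the actual file layout of the shared-task data may differ.

```python
from typing import Dict, Optional

# Hypothetical column names; the real dataset header may use different ones.
QUESTIONS = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6", "Q7"]
DEPENDENT_ON_Q1 = {"Q2", "Q3", "Q4", "Q5"}


def parse_label(raw: str) -> Optional[int]:
    """Map 'yes'/'no' to 1/0; anything else (e.g. 'nan') becomes None."""
    value = raw.strip().lower()
    if value == "yes":
        return 1
    if value == "no":
        return 0
    return None


def parse_row(row: Dict[str, str]) -> Dict[str, Optional[int]]:
    """Parse the seven annotations of one tweet."""
    labels = {q: parse_label(row.get(q, "nan")) for q in QUESTIONS}
    # When Q1 is No, Q2-Q5 carry no label and are excluded from training.
    if labels["Q1"] == 0:
        for q in DEPENDENT_ON_Q1:
            labels[q] = None
    return labels


example = {"Q1": "no", "Q2": "nan", "Q3": "nan", "Q4": "nan",
           "Q5": "nan", "Q6": "no", "Q7": "yes"}
parsed = parse_row(example)
```

Representing a missing answer as `None` keeps it distinct from a "No" answer, so each per-question classifier can simply skip tweets without a label.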

Data Preprocessing
Tweets have certain special features, i.e., emojis, emoticons, hashtags and user mentions, coupled with typical web constructs, such as email addresses and URLs, and other noisy sources, such as phone numbers, percentages, money amounts, time, date, and generic numbers. In this work, we adopt a set of pre-processing procedures tailored to translate tweets into more conventional sentences. Most of the noisy entities are normalized because their particular instances generally do not contribute to the identification of the class within a sentence. For dates, email addresses, money amounts, numbers, percentages, phone numbers and times, this process is performed with the ekphrasis tool 1 (Baziotis et al., 2017), which uses regular expressions to identify such entities and replace them with normalized forms.
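The idea of regular-expression normalization can be illustrated with a small standard-library sketch. This is not the ekphrasis implementation: the patterns are deliberately simplified, only four entity types are covered, and the placeholder tags (`<url>`, `<email>`, `<percent>`, `<number>`) are assumptions chosen for illustration.

```python
import re

# Simplified, illustrative patterns (not those used by ekphrasis).
# Order matters: specific entities are replaced before generic numbers.
PATTERNS = [
    ("<url>", re.compile(r"https?://\S+|www\.\S+")),
    ("<email>", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("<percent>", re.compile(r"\b\d+(?:[.,]\d+)?\s?%")),
    ("<number>", re.compile(r"\b\d+(?:[.,]\d+)?\b")),
]


def normalize_tweet(text: str) -> str:
    """Replace each matched entity with its placeholder tag."""
    for tag, pattern in PATTERNS:
        text = pattern.sub(tag, text)
    # Collapse any repeated whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()


sample = "Contact info@example.org: 25% of 1200 cases, see https://example.org/x"
normalized = normalize_tweet(sample)
print(normalized)
# prints: Contact <email>: <percent> of <number> cases, see <url>
```

Replacing every instance with the same tag keeps the signal that an entity was present while discarding its particular value, which is the motivation given above for normalizing noisy entities.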

AraBERT System Architecture
Among modern language modeling architectures, AraBERT (Antoun et al., 2020) is one of the most popular for the Arabic language. Its generalization capability is such that it can be adapted to different down-stream tasks according to different needs, be it NER or relation extraction, question answering or sentiment analysis. The core of the architecture is trained on particularly large text corpora and, consequently, the parameters of the most internal layers of the architecture are frozen. The outermost layers are instead those that adapt to the task and on which the so-called fine-tuning is performed. An overview is shown in Figure 1.
1 https://github.com/cbaziotis/ekphrasis
Going into details, one can distinguish two main architectures of AraBERT, the base and the large. The architectures differ mainly in four fundamental aspects: the number of hidden layers in the transformer encoder, also known as transformer blocks (12 vs. 24), the number of attention heads, also known as self-attention (Vaswani et al., 2017) (12 vs. 16), the hidden size of the feed-forward networks (768 vs. 1024) and finally the maximum sequence length parameter (512 vs. 1024), i.e., the maximum accepted input vector size. In this work, the base architecture is used, and the corresponding hyper-parameters are reported in Table 2.
In addition, the AraBERT architecture employs two special tokens: [SEP] for segment separation and [CLS] for classification. [CLS] is used as the first input token for any classifier; it represents the whole sequence, and an output vector of the same size as the hidden size H is derived from it. Hence, the output of the transformer, i.e., the final hidden state of this first token, can be denoted as a vector $C \in \mathbb{R}^{H}$. The vector C is used as input of the final fully-connected classification layer, whose parameter matrix is $W \in \mathbb{R}^{K \times H}$, where K is the number of classes.
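How a classification layer of this kind typically turns the [CLS] vector into class probabilities can be sketched in a few lines. The example below assumes the standard BERT-style head, a linear map followed by a softmax; the matrix name `W`, its K x H shape, the toy values, and the omission of a bias term are all assumptions made for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector: List[float], weights: List[List[float]]) -> List[float]:
    """Compute softmax(W C): one logit per class, then normalize."""
    logits = [sum(w * c for w, c in zip(row, cls_vector)) for row in weights]
    return softmax(logits)


# Toy sizes for readability: H = 4 hidden units, K = 2 classes.
C = [0.5, -1.0, 0.25, 2.0]
W = [
    [0.1, 0.2, -0.3, 0.4],   # weights for class 0
    [-0.2, 0.1, 0.5, -0.1],  # weights for class 1
]
probs = classify(C, W)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

In the base architecture H would be 768, and for each binary question K would be 2.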

Model Training
The whole classification model is trained in two steps: first the pre-training of the AraBERT language model, and then the fine-tuning of the outermost classification layer. AraBERTv0.2-base (Antoun et al., 2020) is pre-trained on five corpora: the unshuffled and filtered OSCAR corpus, an Arabic Wikipedia dump, the 1.5B words Arabic Corpus, the OSIAN corpus, and Assafir news articles, with a final corpus size of about 77 GB. The cased version was chosen, being more suitable for the proposed pre-processing method.
The model was fine-tuned on the labeled tweets of the training set provided for the shared task. In particular, the fully connected classification layer was learned on this data. During training, the loss function was categorical cross-entropy. The hyper-parameters used in this study are shown in Table 2. The maximum sequence length was reduced to 128, due to the short length of the tweets.
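For reference, the categorical cross-entropy of a single example is the negative log-probability that the model assigns to the true class, averaged over a batch. The sketch below works on toy probabilities; the function names and values are illustrative and do not come from the training code used in this work.

```python
import math
from typing import List, Sequence


def categorical_cross_entropy(probs: Sequence[float], true_class: int) -> float:
    """Negative log-probability of the true class (clipped to avoid log(0))."""
    eps = 1e-12
    return -math.log(max(probs[true_class], eps))


def mean_loss(batch_probs: List[Sequence[float]], batch_labels: List[int]) -> float:
    """Average the per-example losses over a batch."""
    losses = [categorical_cross_entropy(p, y)
              for p, y in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)


# Toy batch of three tweets with two classes (No = 0, Yes = 1).
batch_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
batch_labels = [0, 1, 1]
loss = mean_loss(batch_probs, batch_labels)
```

A confident correct prediction (0.9 on the true class) contributes little to the loss, while an uncertain one (0.5) contributes log 2, which is what drives the classification layer towards sharper correct outputs.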

Evaluation and Results
To validate the results, we used the NLP4IF tweets dataset. The training and testing sets contain 90% and 10% of the total samples, respectively. We further split the training set into 90% for training and 10% for validation.
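The nested 90/10 split can be sketched as follows. Shuffling uniformly at random, the seed value, and the rounding rule are assumptions of this sketch; the 2556 figure is the dataset size given in the Data Set section, and the resulting subset sizes follow from this sketch's rounding rather than from reported numbers.

```python
import random
from typing import List, Sequence, Tuple


def split(items: Sequence[int], holdout_fraction: float,
          seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle a copy of items and return (kept, held_out)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


tweet_ids = list(range(2556))  # stand-ins for the real tweet IDs

# First split: 90% training pool, 10% test.
train_pool, test = split(tweet_ids, 0.10, seed=13)
# Second split: 90% of the pool for training, 10% for validation.
train, val = split(train_pool, 0.10, seed=13)

print(len(train), len(val), len(test))
```

The validation set is drawn from the training pool only, so the test tweets never influence model selection.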
In this section, we present the evaluation experiments of our implemented model on the test data. Table 3 reports the accuracy, precision, recall, and F1-score of each experiment on the test dataset.
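The four reported metrics can be computed from the confusion counts of a binary question. The sketch below treats "Yes" (1) as the positive class, which is an assumption: the averaging scheme behind the reported scores is not specified here, and the labels shown are toy values rather than model outputs.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 with class 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy gold labels and predictions for one question.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

On skewed label distributions accuracy alone can look high even when the minority class is rarely detected, which is why precision, recall and F1 are reported alongside it.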
Results show that our model detects whether a tweet is "harmful to society" or "requires attention of government entities" with high accuracy (90% and 92%, respectively), whether it "may be of interest to the general public" or "contains false information" with very good accuracy (84% and 86%, respectively), and whether it is "harmful", "needs verification", or "verifiable" with fairly good accuracy (76%, 75%, and 74%, respectively).
Table 4 presents the evaluation results of our models, computed by the organizers from our submitted predictions for the blind test set.

Conclusions
This work introduced an effective approach based on the AraBERT language model for fighting the COVID-19 infodemic on Twitter. It was arranged as a two-step pipeline: the first step applied a series of pre-processing procedures to transform Twitter jargon, including emojis and emoticons, into plain text, and the second step fine-tuned a version of AraBERT, pre-trained on plain text, to classify the tweets with respect to their labels. Future work will investigate the specific contribution of each pre-processing procedure, as well as other settings associated with the tuning, so as to further characterize the language model for the COVID-19 infodemic task. Finally, the proposed approach will also be tested and assessed on other datasets, languages and social media sources, such as Facebook posts, in order to further estimate its applicability and generalizability.