R00 at NLP4IF-2021: Fighting COVID-19 Infodemic with Transformers and More Transformers

This paper describes the winning model in the Arabic NLP4IF shared task on fighting the COVID-19 infodemic. The goal of the shared task is to detect disinformation about COVID-19 in Arabic tweets. Our proposed model was ranked 1st with an F1-score of 0.780 and an accuracy of 0.762. A variety of transformer-based pre-trained language models were experimented with in this study. The best-scoring model is an ensemble of the AraBERT-Base, Asafaya-BERT, and ARBERT models. One of the study's key findings is the effect that pre-processing can have on each model's score. In addition to describing the winning model, this study presents an error analysis.


Introduction
Social media platforms are widely used for expressing and delivering ideas. Most people on these platforms tend to spread and share posts without fact-checking the story or the source. Consequently, propaganda is posted to promote a particular ideology and create further confusion in understanding an event. Of course, this does not apply to all posts. However, the line between propaganda and factual news is blurred for people engaged in these platforms (Abedalla et al., 2019). Thus, social media can act as a distortion of critical and severe events. The COVID-19 pandemic is one such event.
Several previous works used language models and machine learning techniques for detecting misinformation. Haouari et al. (2020b) presented a Twitter data set for COVID-19 misinformation detection called "ArCOV19-Rumors". It extends "ArCOV-19" (Haouari et al., 2020a), a data set of Twitter posts with "propagation networks", i.e., a post's retweets and conversational threads. Shahi et al. (2021) performed an exploratory study of COVID-19 misinformation on Twitter, collecting data and identifying misinformation, rumors, and misinformation propagation. Müller et al. (2020) presented CT-BERT, a transformer-based model pre-trained on English Twitter data. Other works used deep learning models to detect propaganda in news articles (Al-Omari et al., 2019; Altiti et al., 2020).
The NLP4IF shared task (Shaar et al., 2021) offers an annotated data set of tweets for checking disinformation about COVID-19 in each tweet. The task asked participants to propose models that can predict the disinformation in these tweets. This paper describes the winning model in the shared task, an ensemble of the AraBERT-Base, Asafaya-BERT, and ARBERT pre-trained language models. Team R00's model outperformed the other teams and the baseline models with an F1-score of 0.780 and an accuracy of 0.762. The data set and the shared task are described in Section 2, the data pre-processing step in Section 3, the experiments with the pre-trained language models in Section 4, and the proposed winning model and methodology in Section 5.

Dataset
The data provided by the organizers (Shaar et al., 2021) comprises tweets, i.e., posts from the Twitter social media platform "twitter.com". The posts are related to the COVID-19 pandemic and have been annotated in a "Yes or No" question style. The annotator was asked to read the post/tweet and visit an affiliated weblink (if the tweet contains one). For each tweet, the seven main questions asked are:
1. Verifiable Factual Claim: Does the tweet contain a verifiable factual claim?
2. False Information: To what extent does the tweet appear to contain false information?
3. Interest to General Public: Will the tweet affect or be of interest to the general public?
4. Harmfulness: To what extent is the tweet harmful to the society/person(s)/company(s)/product(s)?
5. Need of Verification: Do you think that a professional fact-checker should verify the claim in the tweet?
6. Harmful to Society: Is the tweet harmful to society and why?
7. Require Attention: Do you think that this tweet should get the attention of government entities?
For each question, the answer can be "Yes" or "No". However, questions two through five depend on the first question: if the first question (Verifiable Factual Claim) is answered "No", questions two through five are labeled "NaN", which is interpreted as there being no need to ask the question. For example, the tweet "maybe if i develop feelings for covid-19 it will leave" is not a verifiable factual claim, so asking whether it is False Information or in Need of Verification is unnecessary. For training our model, we mapped all labels annotated as "NaN" to "No".
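The label mapping described above can be sketched as follows; the function name and the Q1-to-Q7 list ordering are illustrative, not the exact code used in the study.

```python
# Sketch of the label normalization step: every "NaN" answer (questions 2-5,
# which depend on question 1 being "Yes") is mapped to "No" before training.
# The ordering of answers (Q1..Q7) is an assumption for illustration.
def normalize_labels(answers):
    """Replace dependent 'NaN' answers with 'No' so every label is binary."""
    return ["No" if a == "NaN" else a for a in answers]

# A tweet that is not a verifiable factual claim (Q1 = "No"):
print(normalize_labels(["No", "NaN", "NaN", "NaN", "NaN", "No", "Yes"]))
# → ['No', 'No', 'No', 'No', 'No', 'No', 'Yes']
```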
Task
Our team participated in the Arabic text shared task. The Arabic data set consists of 2,536 tweets for training, 520 tweets for development (validation), and 1,000 tweets for testing. It has been observed that the label distribution in the training data is unbalanced, as shown in Figure 1.

Data Pre-Processing
Social media posts can contain noisy features, particularly special characters (#, @, emojis, weblinks, etc.). Many elements within Arabic text can act as distortions for the model. We tokenize the Arabic text (using the NLTK library), and for each sequence of tokens we remove stop-words, numbers, and punctuation. We also remove any non-Arabic terms in the text. Stemming and segmentation are two common pre-processing operations in Arabic natural language processing; however, we do not apply them here, except in the case of AraBERT, where segmentation was applied.
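A minimal sketch of this cleaning step is given below. The stop-word set is a tiny placeholder (the study used NLTK's resources), and filtering by the Arabic Unicode block is one simple way to drop non-Arabic terms, numbers, punctuation, and weblinks in a single pass.

```python
import re

# Placeholder stop-word subset for illustration; the actual experiments
# rely on NLTK's Arabic stop-word list.
ARABIC_STOPWORDS = {"في", "من", "على"}

def clean_tweet(text):
    """Keep only Arabic-script tokens and drop stop-words.
    Numbers, punctuation, hashtags, mentions, emojis, and weblinks all fall
    outside the Arabic Unicode block, so the regex removes them."""
    tokens = re.findall(r"[\u0600-\u06FF]+", text)
    return " ".join(t for t in tokens if t not in ARABIC_STOPWORDS)
```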

Fine-tuning Pre-Trained Language Models
We approach the problem as a multi-label classification problem: for each label in a text sample, the label's value can be one (yes) or zero (no). In the training phase, we load the pre-trained language model (along with its corresponding tokenizer) and stack a linear classifier on top of the model. This section describes the pre-trained Arabic language models used in the study, the fine-tuning of hyperparameters, and the experiments' results.
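Concretely, the linear head outputs one logit per label, and a sigmoid plus a threshold turns each logit into an independent yes/no decision. A dependency-free sketch (the logit values are made up for illustration):

```python
import math

def sigmoid(z):
    """Squash a logit into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logits_to_labels(logits, threshold=0.5):
    """Multi-label decision rule: each logit is squashed and thresholded
    separately, so every label is predicted independently."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]

# Seven logits, one per annotation question (values are illustrative):
print(logits_to_labels([2.1, -0.8, 0.4, 1.3, -1.6, 0.9, -2.2]))
# → [1, 0, 1, 1, 0, 1, 0]
```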

Pre-trained Arabic Language Models
This section goes over the pre-trained language models experimented with through the study: AraBERT, Asafaya-BERT, ARBERT, and MARBERT.
• AraBERT (Antoun et al.) follows the original BERT pre-training (Devlin et al., 2018), employing the masked language modelling task. It was pre-trained on roughly 70 million sentences, amounting to 24GB of text data. There are four variations of the model: the v2 variants were trained on pre-segmented text where prefixes and suffixes were split, whereas the v0.2 variants were not. The models we used are the v0.2 variants. The authors recommend using the AraBERT preprocessor, powered by the farasapy Python package, for the v2 versions. Although the v0.2 models do not require it, we have found that the AraBERT preprocessor improves performance significantly in some experiments, so we have used it with the AraBERT models only.
• Asafaya-BERT (Safaya et al., 2020) is a model also based on the BERT architecture. It was pre-trained on 8.2B words, with a vocabulary of 32,000 word-pieces. The pre-training corpus was not restricted to Modern Standard Arabic, as it contains some dialectal Arabic (Safaya et al., 2020).

Fine-Tuning
Each model has been trained for 20 epochs. We found that after the 10th epoch, most of the model scores start to plateau. This is, of course, highly dependent on the learning rate used for each model. We have not tuned the models' learning rates; rather, we chose the learning rate we found best after multiple experiments with each model. We use a training batch size of 32 and a validation batch size of 16 for all models, and for each model's tokenizer we choose a max sequence length of 100. Each model has been trained on two versions of the data set: one that has not been pre-processed (which we refer to as "Raw") and one that has been pre-processed (which we refer to as "Cleaned"). A model trained on cleaned data will also receive cleaned text at validation and testing time. We apply a post-processing step: if a model predicts "No" for Question-1, then the values of Questions 2 through 5 are set to "NaN" unconditionally. This, of course, assumes that the model performs well on the first question. We report the results in Table 1.
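The post-processing rule can be sketched as follows, assuming the predictions are ordered Question-1 through Question-7:

```python
def enforce_dependency(preds):
    """If Question-1 (verifiable factual claim) is predicted 0 ('No'),
    overwrite Questions 2-5 with 'NaN' unconditionally, mirroring the
    annotation scheme's label dependency."""
    preds = list(preds)
    if preds[0] == 0:
        for i in range(1, 5):
            preds[i] = "NaN"
    return preds

print(enforce_dependency([0, 1, 0, 1, 1, 1, 0]))
# → [0, 'NaN', 'NaN', 'NaN', 'NaN', 1, 0]
```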
Note: Initially, we save the model after the first epoch along with its score as the "best score". After each subsequent epoch, we compare the model's score on that epoch with the best score; if the current score is higher, the model is saved and the best score is overwritten. As such, saying we train a model for 20 epochs is not a fully accurate description of the training. The score used as the saving criterion was the weighted F1-score.
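This checkpointing loop amounts to keeping the epoch with the highest weighted F1. A sketch, where the `save` callback stands in for writing the model to disk (both names are hypothetical):

```python
def train_with_checkpointing(epoch_scores, save):
    """Save the model whenever its weighted F1 beats the best score so far.
    `epoch_scores` stands in for the per-epoch validation scores obtained
    during training."""
    best_score = None
    for epoch, score in enumerate(epoch_scores, start=1):
        if best_score is None or score > best_score:
            best_score = score
            save(epoch)  # overwrite the checkpoint on disk
    return best_score

saved = []
print(train_with_checkpointing([0.50, 0.70, 0.64, 0.72], saved.append))
# → 0.72  (checkpoints were written at epochs 1, 2, and 4)
```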

Results
We see in Table 1 that, generally, training on cleaned data either gave slightly better scores or no significant improvement, with ARBERT (Section 4.1) being the exception. This is because ARBERT was specifically trained on Arabic text that follows Modern Standard Arabic; cleaning normalized the text for the model and removed features that may otherwise act as noise. Furthermore, we observe that Asafaya-BERT (Section 4.1) performs better when trained on Raw data, suggesting that a model pre-trained on Twitter data would perform better. Lastly, we observe that using a larger model (deeper network) provides a slight improvement over using the Base version.

Ensemble Pre-trained language Models
To maximize the scores, we resort to ensembling some of the models we fine-tuned on the data set. Ensemble models are known to improve accuracy under the right conditions: if two models can detect different data patterns, then ensembling them may (in theory) give a better prediction. Of course, finding a good ensemble is an empirical process of trial-and-error, combining different models and choosing the best one. However, as Table 1 shows, many combinations are possible, so trying them all would be impractical. We mention in Section 2 that the label distribution in the data set is unbalanced; hence, for labels like Question-2 (False Information), a model can give poor predictions for that label. However, suppose we were to acquire a model (through experimentation) that tends to perform well in predicting that label. In that case, we could ensemble it with one that performs well in general to get a better overall score.
Strategy Through experimentation, for each label, train a model that performs well on that label and save it for an ensemble. Then, train a model that performs well on all labels in general (relative to the models at hand) and save it as well. After collecting several models, ensemble them through various combinations; for each ensemble, record the combination and its score (performance on validation data), and choose the best-performing ensemble.
Weighted-Average Our approach to ensembling is to take the weighted average of each model's predictions for each sample. Each model produces a vector of probabilities (whose length equals the number of labels) for each tweet. We take the weighted average point-wise and then apply a 0.5 threshold to decide whether a label is one (yes) or zero (no). We use a weighted average rather than a simple average with equal weights to give higher confidence to the model that performs well in general, as opposed to the less generally performing ones. The intuition is that the better-performing model should be the deciding factor in the prediction, while the models with lesser weights are there to increase the ensemble's confidence on some labels. The optimal weights for an ensemble are obtainable through experimentation and can be tuned as hyperparameters.
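The weighted-average ensemble can be sketched as follows; the two-model, two-label example at the end uses made-up probabilities purely for illustration.

```python
def ensemble_predict(prob_vectors, weights, threshold=0.5):
    """Point-wise weighted average of per-model probability vectors,
    followed by a 0.5 threshold to decide one (yes) or zero (no)."""
    total = sum(weights)
    n_labels = len(prob_vectors[0])
    averaged = [
        sum(w * probs[i] for w, probs in zip(weights, prob_vectors)) / total
        for i in range(n_labels)
    ]
    return [1 if p >= threshold else 0 for p in averaged]

# Two models on two labels (probabilities are illustrative):
print(ensemble_predict([[0.9, 0.2], [0.1, 0.9]], weights=[3, 1]))
# → [1, 0]   label 1: (3*0.9 + 0.1)/4 = 0.70; label 2: (0.6 + 0.9)/4 = 0.375
```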
Proposed Model We ensemble five models, as shown in Figure 2. All of them were trained on cleaned data, and so the models were also tested on cleaned data. The models are:
1. Model (2): AraBERT-Base, with a weight of 3.
2. Model (4): Asafaya-BERT-Large, with a weight of 3.
3. Model (10): Asafaya-BERT-Base, with a weight of 1.
4. Model (12): AraBERT-Large, with a weight of 1.
5. Model (8): ARBERT, with a weight of 3.

Our model achieved an F1-weighted score of 0.749, an F1-micro score of 0.763, and an accuracy of 0.405 on the validation data. It also earned an F1-weighted score of 0.781 and an accuracy of 0.763 on the test data. These results made it the top-performing model in the shared task, ranked first. Figure 3 presents the confusion matrix for the ensemble model's predictions on the labels.

Conclusion
This paper described the winning model in the NLP4IF 2021 shared task, which aimed to check disinformation about COVID-19 in Arabic tweets. We ensembled five pre-trained language models to obtain the highest F1-score of 0.780 and an accuracy of 0.762. We have shown the performance of every pre-trained language model on the data set, as well as some of the models' performances on each label, and presented the confusion matrix for the ensemble model. We have also illustrated that a model pre-trained on Twitter-style data (Asafaya-BERT in Section 4.1) performs better relative to a model that is not.