iCompass at NLP4IF-2021–Fighting the COVID-19 Infodemic

This paper provides a detailed overview of our system and its results, which were produced as part of the NLP4IF Shared Task on Fighting the COVID-19 Infodemic at NAACL 2021. We approached the task with state-of-the-art contextualized text representation models that were fine-tuned for the downstream task at hand: ARBERT, MARBERT, AraBERT, Arabic ALBERT and BERT-base-arabic. According to the results, BERT-base-arabic achieved the highest F1 score on the test set, 0.784.


Introduction
In recent years, there has been a massive increase in the number of people using social media (such as Facebook and Twitter) to share and post information and to voice their thoughts. The growing number of users has resulted in an enormous number of posts on Twitter. Although social media networks have enhanced information exchange, they have also created a space for antisocial and illegal activities such as spreading false information, rumors, and abuse. These antisocial behaviors intensify massively during crises, with a toxic impact on society, whether intended or accidental. The COVID-19 pandemic is one such situation: it has impacted people's lives by confining them to their homes and causing them to turn to social media. Since the beginning of the pandemic, false information concerning COVID-19 has circulated in a variety of languages, but its spread in Arabic is especially harmful due to a lack of quality reporting. For example, one Arabic tweet is translated as follows: "Good evening, good news, 40 seconds, the owner of the initiative to gather scientists to find a treatment against Corona announces on the air that an entire team, including a French doctor named 'Raoult', discovered that the malaria treatment is the one that treats the new Corona, and it has been tried on 40 patients". This tweet contains false information that is harmful to society, and people who believe it could face real danger. In short, we are fighting not only the coronavirus but also an infodemic, which makes it crucial to identify this type of false information.
The NLP4IF Task 2 addresses the COVID-19 infodemic by predicting several binary properties of a tweet about COVID-19: whether it is harmful, whether it contains a verifiable claim, whether it may be of interest to the general public, whether it appears to contain false information, and whether it needs verification and/or requires attention. We therefore performed multilabel classification using Arabic pretrained models, including Arabic ALBERT (Lan et al., 2019), BERT-base-arabic (Devlin et al., 2018), AraBERT (Antoun et al., 2020), ARBERT (Abdul-Mageed et al., 2020), and MARBERT (Abdul-Mageed et al., 2020), with different hyperparameters. The paper is structured as follows: Section 2 provides a concise description of the dataset. Section 3 describes the systems and the experimental setup used to build models for Fighting the COVID-19 Infodemic. Section 4 presents the results obtained on the development dataset. Section 5 presents the official submission results. Finally, Section 6 concludes and points to possible directions for future work.

Dataset description
The training dataset provided for the competition, Fighting the COVID-19 Infodemic (Arabic), consists of 2536 tweets, and the development dataset consists of 520 tweets (Shaar et al., 2021). The data was labelled with yes/no answers to seven questions:
1. Verifiable Factual Claim: Does the tweet contain a verifiable factual claim?
2. False Information: To what extent does the tweet appear to contain false information?
3. Interest to General Public: Will the tweet have an effect on or be of interest to the general public?
4. Harmfulness: To what extent is the tweet harmful to the society/person(s)/company(s)/product(s)?
5. Need of Verification: Do you think that a professional fact-checker should verify the claim in the tweet?
6. Harmful to Society: Is the tweet harmful for society and why?
7. Require attention: Do you think that this tweet should get the attention of government entities?
Questions 2, 3, 4 and 5 are labelled as nan if the answer to the first question is no. The tweets are in Modern Standard Arabic (MSA), and no other Arabic dialect was observed. The data was preprocessed by removing emojis, URLs, punctuation, duplicated characters within a word, diacritics, and any non-Arabic words. We present an example sentence before and after preprocessing:
• Before preprocessing: [...]

[...] (Abdul-Mageed et al., 2020) and Arabic BERT (Safaya et al., 2020). In addition, we used the xlarge version of Arabic ALBERT.
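The preprocessing steps listed in the dataset description (removing URLs, emojis, punctuation, diacritics, duplicated characters and non-Arabic words) can be sketched as follows. This is a minimal illustration, not the authors' actual code: the function name `clean_tweet`, the Unicode ranges, and the rule that only runs of three or more identical letters are collapsed (so that legitimately doubled letters survive) are all assumptions. As a simplification, digits are dropped together with other non-Arabic characters.

```python
import re

# URLs starting with http(s):// or www.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Tashkeel marks (U+064B..U+0652), superscript alef (U+0670) and tatweel (U+0640).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Anything that is neither a basic Arabic letter (U+0621..U+064A) nor whitespace.
# This covers emojis, punctuation, Latin-script words and (by assumption) digits.
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")
# An Arabic letter repeated three or more times in a row (assumed threshold).
REPEATED_RE = re.compile(r"([\u0621-\u064A])\1{2,}")


def clean_tweet(text: str) -> str:
    """Return a cleaned copy of an Arabic tweet (hypothetical helper)."""
    text = URL_RE.sub(" ", text)          # drop URLs first, before punctuation removal
    text = DIACRITICS_RE.sub("", text)    # strip diacritics and elongation marks
    text = NON_ARABIC_RE.sub(" ", text)   # drop emojis, punctuation, non-Arabic words
    text = REPEATED_RE.sub(r"\1", text)   # collapse elongated letters to one
    return " ".join(text.split())         # normalize whitespace
```

Removing URLs before punctuation matters here: stripping punctuation first would break a URL into fragments that the URL pattern no longer matches.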

AraBERT
AraBERT (Antoun et al., 2020) was trained on 70 million sentences, equivalent to 24 GB of text, covering Arabic news from different media sources. It achieved state-of-the-art performance on three Arabic tasks, including sentiment analysis. However, the pre-training dataset was mostly in MSA, so the model handles dialectal Arabic less well than formal Arabic.

ARBERT
ARBERT (Abdul-Mageed et al., 2020) is a large-scale pretrained language model that uses the BERT-base architecture and focuses on MSA. It was trained on 61 GB of text gathered from books, news articles, crawled data and the Arabic Wikipedia. Its vocabulary consists of 100k WordPieces, which is larger than that of AraBERT (60k for Arabic out of 64k) and mBERT (5k for Arabic out of 110k).

MARBERT
MARBERT, also by Abdul-Mageed et al. (2020), is a large-scale pretrained language model that uses the BERT-base architecture and focuses on the various Arabic dialects. It was trained on 128 GB of Arabic tweets. The authors kept only tweets that contain at least three Arabic words, so tweets that mix three or more Arabic words with non-Arabic words are retained. This is because dialects are often mixed with foreign languages. As with ARBERT, the vocabulary consists of 100k WordPieces. MARBERT improves language coverage, as it focuses on representing previously underrepresented dialects and Arabic variants.

Arabic ALBERT
The Arabic ALBERT models by KUIS-AI-Lab were pretrained on 4.4 billion words: the Arabic version of OSCAR (the unshuffled version of the corpus), filtered from Common Crawl, and a recent dump of Arabic Wikipedia. The corpus and vocabulary set are not restricted to MSA but contain some dialectal Arabic too.

Arabic BERT
Arabic BERT (Safaya et al., 2020) is a set of four BERT language models of different sizes, trained using masked language modeling with whole word masking (Devlin et al., 2018). A vocabulary set of 32,000 WordPieces was constructed from a corpus consisting of the unshuffled version of the OSCAR data (Ortiz Suárez et al., 2020) and a recent Wikipedia dump, which together amount to 8.2B words. The final version of the corpus contains some non-Arabic words inline. The corpus and the vocabulary set are not restricted to MSA: they also contain some dialectal (spoken) Arabic, which boosted the models' performance on data from social media platforms.

Fine-tuning
We use these pretrained language models and build upon them to obtain our final models. Beyond outperforming previous techniques, these general-purpose models have been trained on huge amounts of unlabelled text. Fine-tuning them on much smaller annotated datasets achieves good results thanks to the knowledge gained during the pretraining phase, which is expensive, especially in terms of computational power. Hence, given our relatively small dataset, we chose to fine-tune these pretrained models. Fine-tuning consists of adding an untrained layer of neurons on top of the pretrained model and only adjusting the weights of the last layers to fit the new labelled dataset. We trained our models on a Google Cloud GPU using Google Colaboratory. The average training time of one model is around 10 minutes. We experimented with Arabic ALBERT, Arabic BERT, AraBERT, ARBERT and MARBERT with different hyperparameters. The final model that we used for the submission is based on BERT-base-arabic, trained for 10 epochs with a learning rate of 5e-5, a batch size of 32 and a max sequence length of 128.
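To make the reported settings concrete, the sketch below records them in a small configuration object and derives the number of optimizer steps they imply on the 2536-tweet training set. The class and field names are illustrative, not taken from the authors' code. The step counts assume that the last, smaller batch is kept and that no gradient accumulation is used. The second configuration is the best Arabic ALBERT run reported in the results section.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters for one fine-tuning run (names are illustrative)."""
    model_name: str
    learning_rate: float
    batch_size: int
    epochs: int
    max_seq_length: int = 128

    def steps_per_epoch(self, num_examples: int) -> int:
        # Assumption: the final partial batch is kept rather than dropped.
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples: int) -> int:
        # Assumption: one optimizer step per batch (no gradient accumulation).
        return self.steps_per_epoch(num_examples) * self.epochs


TRAIN_SIZE = 2536  # training tweets, as stated in the dataset description

# Configuration of the submitted model, as reported in the text.
SUBMITTED = FineTuneConfig("BERT-base-arabic", 5e-5, 32, 10)
# Best Arabic ALBERT configuration, as reported in the results section.
BEST_ALBERT = FineTuneConfig("Arabic ALBERT", 2e-5, 16, 8)
```

Under these assumptions, the submitted configuration corresponds to 80 steps per epoch and 800 steps in total, which fits the short training times reported above.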

Development dataset results
We validated our models on the development dataset described in the dataset section. The results of all models were close, but BERT-base-arabic achieved the best result, with an F1 score of 78.27%. For reference, and to compare with other models, we also show the results obtained with ARBERT, AraBERT, and Arabic ALBERT in Table 1.
• The best Arabic ALBERT model was obtained with a learning rate of 2e-5, a batch size of 16, 8 epochs and a max sequence length of 128.
Although the results are very close, BERT-base-arabic outperformed all other models. This may be due to its pretraining data: the final version of the corpus contains some non-Arabic words inline, and the corpus and vocabulary set are not restricted to MSA but also contain some dialectal Arabic, which can boost model performance on data from social media. Table 2 reviews the official results of the iCompass system against the top three ranked systems.
Table 3: Official Results for each classifier as reported by the task organisers (Shaar et al., 2021).
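For readers unfamiliar with the metric, the following sketch shows one way an F1 score over the seven questions could be computed. It is an assumption-laden illustration, not the official scorer: it computes F1 for the "yes" class of each question, skips tweets whose gold label is nan (represented here as `None`), and takes the unweighted mean over questions. The organisers' averaging scheme may differ, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence

Label = Optional[str]  # "yes", "no", or None for a nan gold label


def f1_binary(y_true: Sequence[Label], y_pred: Sequence[Label],
              positive: str = "yes") -> float:
    """F1 of the positive class for one question."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # also avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(gold: List[List[Label]], pred: List[List[Label]]) -> float:
    """Unweighted mean of per-question F1; rows are tweets, columns questions."""
    num_questions = len(gold[0])
    scores = []
    for q in range(num_questions):
        # Assumption: tweets with a nan gold label are excluded for that question.
        pairs = [(g[q], p[q]) for g, p in zip(gold, pred) if g[q] is not None]
        scores.append(f1_binary([g for g, _ in pairs], [p for _, p in pairs]))
    return sum(scores) / num_questions
```

Excluding nan gold labels keeps questions 2 to 5 from being penalised on tweets where, by the labelling convention, they have no answer.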

Official submission results
The results show that BERT-base-arabic outperforms all of the previously listed models in terms of overall performance, and it was therefore chosen for the final submission. Future work will include developing larger contextualized pretrained models and improving the current COVID-19 infodemic detection.