Transformers to Fight the COVID-19 Infodemic

The massive spread of false information on social media has become a global risk especially in a global pandemic situation like COVID-19. False information detection has thus become a surging research topic in recent months. NLP4IF-2021 shared task on fighting the COVID-19 infodemic has been organised to strengthen the research in false information detection where the participants are asked to predict seven different binary labels regarding false information in a tweet. The shared task has been organised in three languages; Arabic, Bulgarian and English. In this paper, we present our approach to tackle the task objective using transformers. Overall, our approach achieves a 0.707 mean F1 score in Arabic, 0.578 mean F1 score in Bulgarian and 0.864 mean F1 score in English ranking 4^{th} place in all the languages.


Introduction
By April 2021, coronavirus(COVID-19) pandemic has affected 219 nations around the world with 136 million total cases and 2.94 million deaths. With this pandemic situation, a rapid increase in social media usage was noticed. In measures, during 2020, 490 million new users joined indicating a more than 13% year-on-year growth (Kemp, 2021). This growth is mainly resulted due to the impacts on day-to-day activities and information sharing and gathering requirements related to the pandemic.
As a drawback of these exponential growths, the dark side of social media is further revealed during this COVID-19 infodemic (Mourad et al., 2020). The spreading of false and harmful information resulted in panic and confusions which make the pandemic situation worse. Also, the inclusion of false information reduced the usability of a huge volume of data which is generated via social media platforms with the capability of fast propagation. To handle these issues and utilise social media data effectively, accurate identification of false information is crucial. Considering the high data generation in social media, manual approaches to filter false information require significant human efforts. Therefore an automated technique to tackle this problem will be invaluable to the community.
Targeting the infodemic that occurred with COVID-19, NLP4IF-2021 shared task was designed to predict several properties of a tweet including harmfulness, falseness, verifiability, interest to the general public and required attention. The participants of this task were required to predict the binary aspect of the given properties for the test sets in three languages: Arabic, Bulgarian and English provided by the organisers. Our team used recently released transformer models with the text classification architecture to make the predictions and achieved the 4 th place in all the languages while maintaining the simplicity and universality of the method. In this paper, we mainly present our approach, with more details about the architecture including an experimental study. We also provide our code to the community which will be freely available to everyone interested in working in this area using the same methodology 1 .

Related Work
Identifying false information in social media has been a major research topic in recent years. False information detection methods can be mainly categorised into two main areas; Content-based methods and Social Context-based methods (Guo et al., 2020).
Content-based methods are mainly based on the different features in the content of the tweet. For example, Castillo et al. (2011) find that highly credible tweets have more URLs, and the textual content length is usually longer than that of lower credibility tweets. Many studies utilize the lexical and syntactic features to detect false information. For instance, Qazvinian et al. (2011) find that the part of speech (POS) is a distinguishable feature for false information detection. Kwon et al. (2013) find that some types of sentiments are apparent features of machine learning classifiers, including positive sentiments words (e.g., love, nice, sweet), negating words (e.g., no, not, never), cognitive action words (e.g., cause, know), and inferring action words (e.g., maybe, perhaps). Then they propose a periodic time-series model to identify key linguistic differences between true tweets and fake tweets. With the word embeddings and deep learning getting popular in natural language processing, most of the fake information detection methods were based on embeddings of the content fed into a deep learning network to perform the classification (Ma et al., 2016).
Traditional content-based methods analyse the credibility of the single microblog or claim in isolation, ignoring the high correlation between different tweets and events. However, Social Contextbased methods take different tweets in a user profile or an event to identify false information. Many studies detect false information by analyzing users' credibility  or stances (Mohammad et al., 2017). Since this shared task is mainly focused on the content of the tweet to detect false information, we can identify our method as a contentbased false information identification approach.

Data
The task is about predicting several binary properties of a tweet on COVID-19: whether it is harmful, whether it contains a verifiable claim, whether it may be of interest to the general public, whether it appears to contain false information, etc. (Shaar et al., 2021). The data has been released for three languages; English, Arabic and Bulgarian 2 . Following are the binary properties that the participants should predict for a tweet.  (Devlin et al., 2019) provide pretrained multilingual language models that support more than 100 languages which will solve the multilingual issues of these tasks Zampieri, 2021b, 2020). Transformer models take an input of a sequence and outputs the representations of the sequence. There can be one or two segments in a sequence which are separated by a special token [SEP] (Devlin et al., 2019). In this approach we considered a tweet as a sequence and no [SEP] token is used. Another special token [CLS] is used as the first token of the sequence which contains a special classification embedding. For text classification tasks, transformer models take the final hidden state h of the [CLS] token as the representation of the whole sequence (Sun et al., 2019). A simple softmax classifier is added to the top of the transformer model to predict the probability of a class c as shown in Equation 1 where W is the task-specific parameter matrix. In the classification task all the parameters from transformer as well as W are fine tuned jointly by maximising the log-probability of the correct label. The architecture of transformer-based sequence classifier is shown in Figure 1.

Experimental Setup
We considered the whole task as seven different classification problems. We trained a transformer model for each label mentioned in Section 3. This gave us the flexibility to fine-tune the classification model in to the specific label rather than the whole task. Given the very unbalanced nature of the dataset, the transformer models tend to overfit and predict only the majority class. Therefore, for each label we took the number of instances in the training set for the minority class and undersampled the majority class to have the same number of instances as the minority class. We then divided this undersampled dataset into a training set and a validation set using 0.8:0.2 split. We mainly fine tuned the learning rate and number of epochs of the classification model manually to obtain the best results for the development set provided by organisers in each language. We obtained 1e −5 as the best value for learning rate and 3 as the best value for number of epochs for all the languages in all the labels. The other configurations of the transformer model were set to a constant value over all the languages in order to ensure consistency between the languages. We used a batch-size of eight, Adam optimiser (Kingma and Ba, 2014) and a linear learning rate warm-up over 10% of the training data. The models were trained using only training data. We performed early stopping if the evaluation loss did not improve over ten evaluation rounds. A summary of hyperparameters and their values used to obtain the reported results are mentioned in Appendix - Table 3. The optimized hyperparameters are marked with ‡ and their optimal values are reported. The rest of the hyperparameter values are kept as constants. We did not use any language specific preprocessing techniques in order to have a flexible solution between the languages. We used a Nvidia Tesla K80 GPU to train the models. All the experiments were run for five different random seeds and as the final result, we took the majority class predicted by these different random seeds as mention in Hettiarachchi and Ranasinghe (2020b). We used the following pretrained transformer models for the experiments.
bert-base-cased -Introduced in Devlin et al. (2019), the model has been trained on a Wikipedia dump of English using Masked Language Modelling (MLM) objective. There are two variants in English BERT, base model and the large model. Considering the fact that we built seven different models for each label, we decided to use the base model considering the resources and time.
roberta-base -Introduced in Liu et al. (2019), RoBERTa builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger minibatches and learning rates. RoBERTa has outperformed BERT in many NLP tasks and it motivated us to use RoBERTa in this research too. Again we only considered the base model.
bert-nultilingual-cased -Introduced in Devlin et al. (2019), the model has been trained on a Wikipedia dump of 104 languages using MLM objective. This model has shown good performance in variety of languages and tasks. Therefore, we used this model in Arabic and Bulgarian.
AraBERT Recently language-specific BERT based models have proven to be very efficient at language understanding. AraBERT (Antoun et al., 2020) is such a model built for Arabic with BERT using scraped Arabic news websites and two publicly available Arabic corpora; 1.5 billion words Arabic Corpus (El-khair, 2016) and OSIAN: the Open Source International Arabic News Corpus (Zeroual et al., 2019). Since AraBERT has outperformed multilingual bert in many NLP tasks in Arabic (Antoun et al., 2020) we used this model for Arabic in this task. There are two version in AraBERT; AraBERTv0.1 and AraBERTv1, with the difference being that AraBERTv1 uses presegmented text where prefixes and suffixes were  Table 2: Macro F1 between the InfoMiner submission and human annotations for test set in all the languages. Best System is the results of the best model submitted for each language as reported by the task organisers (Shaar et al., 2021). splitted using the Farasa Segmenter (Abdelali et al., 2016).

Results
When it comes to selecting the best model for each language, highest F1 score out of the evaluated models was chosen. Due to the fact that our approach uses a single model for each label, our main goal was to achieve good F1 scores using light weight models. The limitation of available resources to train several models for all seven labels itself was a very challenging task to the team but we managed to evaluate several.
As depicted in Table 1, for English, bert-basecased model performed better than roberta-base model. For Arabic, arabert-v2-tokenized performed better than the other two models we considered. For Bulgarian, with the limited time, we could only train bert-multilingual model, therefore, we submitted the predictions from that for Bulgarian.
As shown in Table 2, our submission is very competitive with the best system submitted in each language and well above the random baseline. Our team was ranked 4 th in all the languages.

Conclusion
We have presented the system by InfoMiner team for NLP4IF-2021-Fighting the COVID-19 Infodemic. We have shown that multiple transformer models trained on different labels can be successfully applied to this task. Furthermore, we have shown that undersampling can be used to prevent the overfitting of the transformer models to the majority class in an unbalanced dataset like this. Overall, our approach is simple but can be considered as effective since it achieved 4 th place in the leader-board for all three languages.
One limitation in our approach is that it requires maintaining seven transformer models for the seven binary properties of this task which can be costly in a practical scenario which also restricted us from experimenting with different transformer types due to the limited time and resources. Therefore, in future work, we are interested in remodeling the task as a multilabel classification problem, where a single transformer model can be used to predict all seven labels.