Fighting the COVID-19 Infodemic with a Holistic BERT Ensemble

This paper describes TOKOFOU, an ensemble model for misinformation detection built on six different transformer-based pre-trained encoders, implemented in the context of the COVID-19 Infodemic Shared Task for English. We fine-tune each model on each of the task's questions and aggregate the prediction scores using majority voting. TOKOFOU obtains an overall F1 score of 89.7%, ranking first.


Introduction
Social media platforms such as Twitter, Instagram, Facebook, and TikTok play a major role in facilitating communication among individuals and the sharing of information. Social media, and in particular Twitter, are also actively used by governments and health organizations to quickly and effectively communicate key information to the public in case of disasters, political unrest, and outbreaks (Househ, 2016; Stefanidis et al., 2017; LaLone et al., 2017; Daughton and Paul, 2019; Rogers et al., 2019).
However, there are dark sides to the use of social media. The removal of forms of gate-keeping and the democratization of information production have impacted the quality of the content that becomes available. Misinformation, i.e., the spread of false, inaccurate, or misleading information such as rumors, hoaxes, and false statements, is a particularly dangerous type of low-quality content that affects social media platforms. The dangers of misinformation are best illustrated by the combination of three strictly interconnected factors: (i) the diminishing ability of users to discriminate trustworthy sources and information from hoaxes and malevolent agents (Hargittai et al., 2010); (ii) a faster, deeper, and broader spread than true information, especially for topics such as disasters and science (Vosoughi et al., 2018); (iii) the elicitation of fears and suspicions in the population, threatening the fabric of societies.
The COVID-19 pandemic is the perfect target for misinformation: it is the first pandemic of the Information Age, where social media platforms play a primary role in the information sphere; it is a natural disaster, where science plays a key role in understanding and curing the disease; and knowledge about the SARS-CoV-2 virus is limited, with the scientific understanding continually developing. To monitor and limit the threats of COVID-19 misinformation, different initiatives have been activated (e.g., the #CoronaVirusFacts Alliance, EUvsDisinfo), while social media platforms have been enforcing more stringent policies. Nevertheless, the amount of produced misinformation is such that manual intervention and curation are not feasible, calling for the development of automatic solutions grounded in Natural Language Processing.
The proposed shared task on COVID-19 misinformation presents innovative aspects mirroring the complexity and variation of the phenomena that accompany the spread of misinformation about COVID-19, including fake news, rumors, conspiracy theories, racism, xenophobia, and mistrust of science, among others. To embrace this variation, the task organizers have developed a rich annotation scheme based on seven questions (Shaar et al., 2021). Participants are asked to design a system capable of automatically labeling a set of messages from Twitter with a binary value (i.e., yes/no) for each of the seven questions. Training and test data are available in three languages, namely English, Arabic, and Bulgarian. Our team, TOKOFOU, submitted predictions only for the English data, developing an ensemble model based on a combination of different transformer-based pre-trained language encoders. Each pre-trained model has been selected to match the language variety of the data (i.e., tweets) and the phenomena entailed by each of the questions. With an overall F1 score of 89.7%, our system ranked first.

Data
The English task provides both training and development data. The data have been annotated using an in-house crowdsourcing platform following the annotation scheme presented in Alam et al. (2020).
The scheme covers in a very extensive way the complexity of the phenomena surrounding COVID-19 misinformation by means of seven key questions. The annotation follows a specific pattern after the first question (Q1), which checks whether a message contains a verifiable factual claim. In case of a positive answer, the annotator is presented with an additional set of four questions (Q2-Q5) addressing aspects such as presence of false information, interest for the public, presence of harmful content, and check-worthiness. After this block, the annotator answers two further questions. Q6 can be seen as a refinement of the presence of harmful content (i.e., content intended to harm society or weaponized to mislead society), while Q7 asks the annotator whether the message should receive the attention of a government authority. In case of a negative answer to Q1, the annotator jumps directly to Q6 and Q7. Quite interestingly, Q6 lists a number of categories to better identify the nature of the harm (e.g., satire, joke, rumor, conspiracy, xenophobic, racist, prejudices, hate speech, among others).
The labels of the original annotation scheme present fine-grained categories for each question, including a not sure value. For the task, the set of labels has been simplified to three: yes, no, and nan, with the latter corresponding in some cases to the not sure value. Indeed, due to the dependence of Q2-Q5 on a positive answer to Q1, some nan values for this set of questions can also correspond to not applicable rather than not sure, making the task more challenging than one would expect.
For English, the organisers released 869 annotated messages for training, 53 for development, and 418 for testing. The distribution of the labels for each question in the training data is reported in Figure 1. As the figure shows, the dataset is imbalanced for all questions. While the majority of messages present potential factual claims (Q1), only a tiny minority has been labelled as containing false information (Q2), with a very high portion receiving a nan label, suggesting that discriminating whether a claim is false or not is a difficult task for human annotators. Similar observations hold for Q3-Q5. Q6 is a refinement of Q4 about the nature of the harm. The low amount of nan values indicates a better reliability of the annotators in deciding the specific type of harm. Q7 also appears to elicit more clear-cut judgements. Finally, apart from a weak pairwise correlation among questions Q4-Q7, no noteworthy correlation between questions emerges (see Figure 2). (Source code for our system is available at https://git.io/JOtpH.)

System Overview
Our system is a majority voting ensemble model based on a combination of six different transformer-based pre-trained encoders, each selected to target a relevant aspect of the annotated data, such as domain, topic, and specific sub-tasks.

BERT Models
Preliminary data analysis and manual inspection of the input texts strongly hint at the notable difficulty of the problem. The questions our model is called to answer are high-level semantic tasks that sometimes go beyond sentential understanding, seemingly also relying on external world knowledge. The limited size of the dataset also rules out a task-specific architecture, even more so considering the effective loss of data from nan labels and the small proportion of development samples, factors that increase the risk of overfitting. Knowledge grounding with a static external source becomes impractical in view of the rapid pace of events throughout the COVID-19 pandemic: a claim would need to be contrasted against a distinct version of the knowledge base depending on when it was expressed, introducing significant overhead and necessitating an additional timestamp input feature. In light of the above, we turn our attention to pre-trained BERT-like models (Devlin et al., 2019). BERT-like models are the workhorses of NLP, boasting a high capacity for semantic understanding while acting as implicit rudimentary knowledge bases, owing to their utilization of massive amounts of unlabeled data (Petroni et al., 2019; Rogers et al., 2020). Among the many candidate models, those confined within the Twitter domain make for the most natural choices. Language use in Twitter messages differs from the norm in terms of style, length, and content. A Twitter-specific model should then already be accustomed to the particularities of the domain, relieving us from either having to account for domain adaptation or relying on external data. We obtain our final set of models by filtering our selection in accordance with a refinement of the tasks, as expressed by the questions of the annotation scheme, and the domain.
In particular, we select models according to the following criteria: (i) models pre-trained on the language domain (i.e., Twitter); (ii) models pre-trained on data related to the COVID-19 pandemic; and (iii) models pre-trained or fine-tuned for high-level tasks (e.g., irony and hate speech detection) expressed by any of the target questions. In this way, we identified and used six variations of three pre-trained models, detailed in the following paragraphs.

BERTWEET
(Nguyen et al., 2020) is a RoBERTa base model (Liu et al., 2019) trained from scratch on 850M tweets. It is a strong baseline that, fine-tuned, achieves state-of-the-art results on the SemEval 2017 sentiment analysis and SemEval 2018 irony detection shared tasks (Rosenthal et al., 2017; Van Hee et al., 2018). Here, we use a variant of the model, additionally trained on 23M tweets related to the COVID-19 pandemic, collected prior to September 2020.
CT-BERT (Müller et al., 2020) is a pre-trained BERT large model, adapted for use in the Twitter setting and specifically the COVID-19 theme by continued unsupervised training on 160M tweets related to the COVID-19 pandemic and collected between January and April 2020. Fine-tuned and evaluated on a small range of tasks, it has been shown to slightly outperform the original model.
TWEETEVAL (Barbieri et al., 2020) is a pre-trained RoBERTa base model, further trained on 60M randomly collected tweets, resulting in a Twitter-domain-adapted version. We use a selection of four TWEETEVAL models, each fine-tuned for a Twitter-specific downstream task: hate speech, emotion, and irony detection, and offensive language identification.
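For reference, the six encoders above could be referenced by Hugging Face hub identifiers along the following lines. The exact IDs are our assumption (the paper does not list them) and should be verified against the respective model cards.

```python
# Hypothetical Hugging Face hub IDs for the six encoders; these
# identifiers are an assumption, not taken from the paper itself.
ENSEMBLE_MODELS = [
    "vinai/bertweet-covid19-base-uncased",           # BERTweet, COVID-19 variant
    "digitalepidemiologylab/covid-twitter-bert-v2",  # CT-BERT
    "cardiffnlp/twitter-roberta-base-hate",          # TweetEval: hate speech
    "cardiffnlp/twitter-roberta-base-emotion",       # TweetEval: emotion
    "cardiffnlp/twitter-roberta-base-irony",         # TweetEval: irony
    "cardiffnlp/twitter-roberta-base-offensive",     # TweetEval: offensive language
]
```

Each entry would typically be passed to a generic loader (e.g., an AutoModel-style factory) to obtain the encoder and its matching tokenizer.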

Fine-tuning
The affinity between the above models and the task at hand allows us to use them for sentence vectorization as-is, requiring only an inexpensive fine-tuning pass. We attach a linear projection on top of each model, mapping its [CLS] token representation to |Q| = 7 outputs, one per question. The sigmoid-activated outputs act as independent binary classifiers, one per question, and the entire network is trained by summing their binary cross-entropy losses. We train for 15 epochs on batches of 16 tweets, using the AdamW (Loshchilov and Hutter, 2017) optimizer with a learning rate of 3·10⁻⁵ and weight decay of 0.01, without penalizing predictions corresponding to nan gold labels. We add dropout layers with rate 0.5 in each model's classification head. We perform model selection on the basis of mean F1 score on the development set and report results in Table 1. As the figures show, no single model outperforms the rest. Indeed, performance varies widely both across models and questions, with best scores scattered over the table. Similar results occur when repeating the experiments with different random seeds.
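The masking of nan gold labels in the summed per-question loss can be sketched in a few lines. This is a minimal, pure-Python illustration rather than the actual implementation; the function name and the nan sentinel are our own.

```python
import math

NAN = None  # sentinel for a "nan" gold label (not penalized during training)


def summed_masked_bce(probs, gold):
    """Sum binary cross-entropy over the |Q| = 7 sigmoid outputs,
    skipping any question whose gold label is nan."""
    loss = 0.0
    for p, y in zip(probs, gold):
        if y is NAN:
            continue  # nan gold labels contribute nothing to the loss
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss
```

A tweet with a negative answer to Q1, for example, carries nan gold labels for Q2-Q5, so those four outputs contribute nothing to the loss or the gradient.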

Aggregation
The proposed ensemble aggregates prediction scores along the model axis by first rounding them (into positive or negative labels) and then selecting the final outcome by majority rule. The ensemble performs at least as well as every individual model in 3 out of 7 questions on the development set, and its metrics lie above the average for 6 of them. Keeping in mind the small size of the development set, we refrain from altering the voting scheme, expecting the majority-based model to be the most robust. During training, we do not apply any preprocessing of the data and rely on the respective tokenizer of each model, but we homogenize test data by removing URLs.
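The aggregation rule can be sketched as follows. Note that ties are possible with an even number of models; the paper does not specify a tie-breaking rule, so the fall-back to the negative label here is our own assumption.

```python
def majority_vote(model_scores, threshold=0.5):
    """Aggregate per-model prediction scores into final binary labels.

    model_scores: one list per model, each holding one score in [0, 1]
    per question. Scores are first rounded into binary votes, then each
    question takes the label chosen by a strict majority of models.
    """
    votes = [[1 if s >= threshold else 0 for s in scores] for scores in model_scores]
    n_models = len(votes)
    # Strict majority of positive votes wins; ties fall back to the
    # negative label (an assumption -- the paper leaves this unspecified).
    return [1 if sum(q_votes) * 2 > n_models else 0 for q_votes in zip(*votes)]
```

With the six fine-tuned encoders, a question is labelled positive only when at least four models vote for it.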

Results and Discussion
Results on the test data are shown in Table 2. Two of the three organizers' baselines, namely the majority voting and the ngram baseline, already provide competitive scores; the strength of the majority baseline indicates that label imbalance affects the test data as well. Our ensemble model outperforms all of them, with a delta of 0.6 F1 points over the second-best performing system and a Recall higher by 3 points. At the same time, the performance of the ngram baseline suggests that lexical variability is limited. This is unexpected given the large variety of misinformation topics that seems to affect the discussion around the COVID-19 pandemic. These results justify both our choice of models for the ensemble and majority voting as a robust aggregation method.

Conclusion
We participated in the COVID-19 misinformation shared task with an ensemble of pre-trained BERT-based encoders, fine-tuning each model to predict all questions and aggregating the predictions into a final answer through majority voting. Our system is a strong baseline for this task, showing the effectiveness of available pre-trained language models for Twitter data, mixed with variants fine-tuned for a specific topic (COVID-19) and multiple downstream tasks (emotion detection, hate speech detection, etc.). Results indicate that this holistic approach to transfer learning allows for a data-efficient and compute-conscious methodology, omitting the often prohibitive computational requirements of retraining a model from scratch for a specific task, in favour of an ensemble architecture based on task/domain-similar solutions from a large ecosystem of publicly available models.
With appropriate scaling of the associated dataset, the system proposed in this paper can be suitably integrated into a human-in-the-loop scenario, serving as an effective assistant in the (semi-)automated annotation of Twitter data for misinformation.