WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets

In this paper, we provide an overview of the WNUT-2020 shared task on the identification of informative COVID-19 English Tweets. We describe how we construct a corpus of 10K Tweets and organize the development and evaluation phases for this task. In addition, we also present a brief summary of results obtained from the final system evaluation submissions of 55 teams, finding that (i) many systems obtain very high performance, up to 0.91 F1 score, (ii) the majority of the submissions achieve substantially higher results than the baseline fastText (Joulin et al., 2017), and (iii) fine-tuning pre-trained language models on relevant language data followed by supervised training performs well in this task.


Introduction
As of late September 2020, the COVID-19 Coronavirus pandemic has led to about 1M deaths and 33M infected patients from 213 countries and territories, creating fear and panic for people all around the world. Recently, much attention has been paid to building monitoring systems (e.g. The Johns Hopkins Coronavirus Dashboard) to track the development of the pandemic and to provide users with information related to the virus, e.g. any new suspicious/confirmed cases near/in the users' regions.
It is worth noting that most of the "official" sources used in the tracking tools are not frequently kept up to date with the current pandemic situation, e.g. WHO updates the pandemic information only once a day. Those monitoring systems thus use social network data, e.g. from Twitter, as a real-time alternative source for updating the pandemic information, generally by crowdsourcing or searching for related information manually. However, the pandemic has been spreading rapidly, and we observe a massive amount of data on social networks, e.g. about 3.5M COVID-19 English Tweets posted daily on the Twitter platform (Lamsal, 2020), the majority of which are uninformative. Thus, it is important to be able to select the informative Tweets (e.g. Tweets related to new cases or suspicious cases) for downstream applications. However, manual approaches to identifying the informative Tweets require significant human effort, do not scale with rapid developments, and are costly.
To help handle the problem, we propose a shared task which is to automatically identify whether a COVID-19 English Tweet is informative or not. Our task is defined as a binary classification problem: Given an English Tweet related to COVID-19, decide whether it should be classified as INFORMATIVE or UNINFORMATIVE. Here, informative Tweets provide information about suspected, confirmed, recovered and death cases as well as the location or travel history of the cases. The following example presents an informative Tweet:

INFORMATIVE
Update: Uganda Health Minister Jane Ruth Aceng has confirmed the first #coronavirus case in Uganda. The patient is a 36-year-old Ugandan male who arrived from Dubai today aboard Ethiopian Airlines. Patient travelled to Dubai 4 days ago. #CoronavirusPandemic

The goals of our shared task are: (i) To develop a language processing task that potentially impacts research and downstream applications, and (ii) To provide the research community with a new dataset for identifying informative COVID-19 English Tweets. To achieve the goals, we manually construct a dataset of 10K COVID-19 English Tweets with INFORMATIVE and UNINFORMATIVE labels. We believe that the dataset and systems developed for our task will be beneficial for the development of COVID-19 monitoring systems. All practical information, data download links and the final evaluation results can be found at the CodaLab website of our shared task: https://competitions.codalab.org/competitions/25845.

The WNUT-2020 Task 2 dataset

Annotation guideline
We define the guideline to annotate a COVID-19 related Tweet with the "INFORMATIVE" label if the Tweet mentions suspected cases, confirmed cases, recovered cases, deaths, number of tests performed as well as location or travel history associated with the confirmed/suspected cases.
In addition, we set further requirements that an "INFORMATIVE" Tweet has to satisfy. In particular, the "INFORMATIVE" Tweet should not present a rumor or prediction. Furthermore, quantities mentioned in the Tweet have to be specific (e.g. "two new cases" or "about 125 tested positives") or directly inferable (e.g. "120 coronavirus tests done so far, 40% tested positive"), and not given purely as percentages or rates (e.g. "20%", "1000 per million", or "a third").
The COVID-19 related Tweets not satisfying the "INFORMATIVE" annotation guideline are annotated with the "UNINFORMATIVE" label. An uninformative Tweet example is as follows:

UNINFORMATIVE
Indonesia frees 18,000 inmates, as it records highest #coronavirus death toll in Asia behind China HTTPURL

COVID-19 related Tweet collection
To construct the dataset used in our shared task, we first have to crawl COVID-19 related Tweets. We collect a general Tweet corpus related to the COVID-19 pandemic based on a predefined list of 10 keywords: "coronavirus", "covid-19", "covid 19", "covid 2019", "covid19", "covid2019", "covid-2019", "CoronaVirusUpdate", "Coronavid19" and "SARS-CoV-2". We utilize the Twitter streaming API to download real-time English Tweets containing at least one keyword from the predefined list. We stream the Tweet data for four months using the API, from 1st March 2020 to 30th June 2020. We then filter out Tweets containing fewer than 10 words (including hashtags and user mentions) as well as Tweets from users with fewer than five hundred followers. This helps reduce the rate of Tweets with fake news (our manual annotation process does not involve verifying fake news), under the rather strong assumption that reliable information is more likely to be propagated by users with a large number of followers. To handle the duplication problem: (i) we remove Retweets starting with the "RT" token, and (ii) in cases where two Tweets are the same after lowercasing as well as removing hashtags and user mentions, the earlier Tweet is kept and the subsequent Tweet is filtered out, as it tends to be a Retweet. Applying these filtering steps results in a final corpus of about 23M COVID-19 English Tweets.
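The filtering and de-duplication steps above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the dictionary fields `text` and `followers`, the function names, whitespace-based word counting and the regular expression for hashtags and user mentions are all assumptions made for the example.

```python
import re

MIN_WORDS = 10        # Tweets with fewer words are dropped
MIN_FOLLOWERS = 500   # Tweets from users with fewer followers are dropped


def normalize(text):
    """Lowercase and remove hashtags and user mentions (assumed pattern)."""
    text = re.sub(r"[#@]\w+", " ", text.lower())
    return " ".join(text.split())


def filter_tweets(tweets):
    """Filter an iterable of tweets given in chronological order.

    Each tweet is a dict with hypothetical keys 'text' and 'followers'.
    """
    seen = set()
    kept = []
    for tweet in tweets:
        words = tweet["text"].split()
        if words[:1] == ["RT"]:            # Retweet starting with the "RT" token
            continue
        if len(words) < MIN_WORDS:         # count includes hashtags and mentions
            continue
        if tweet["followers"] < MIN_FOLLOWERS:
            continue
        key = normalize(tweet["text"])
        if key in seen:                    # same as an earlier Tweet: drop the later one
            continue
        seen.add(key)
        kept.append(tweet)
    return kept
```

Because the input is assumed to be in chronological order, keeping the first occurrence of each normalized text corresponds to keeping the earlier Tweet.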

Annotation process
From the corpus of 23M Tweets, we select Tweets which are potentially informative, containing predefined strings relevant to the annotation guideline such as "confirm", "positive", "suspected", "death", "discharge", "test" and "travel history". We then remove similar Tweets with a token-based cosine similarity score (Wang et al., 2011) equal to or greater than 0.7, resulting in a dataset of "INFORMATIVE" candidates. We then randomly sample 2K Tweets from this dataset for the first phase of annotation.
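As an illustration of the similarity-based filtering, the sketch below computes a token-based cosine similarity over term-frequency vectors and greedily keeps a Tweet only if it is below the 0.7 threshold against every Tweet kept so far. Whitespace tokenization, raw term frequencies, the greedy keep-first strategy and the function names are assumptions; the exact formulation used for the corpus may differ, and the quadratic loop would need to be replaced by something more efficient at corpus scale.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists using term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def remove_similar(texts, threshold=0.7):
    """Keep a text only if its similarity to every kept text is below threshold."""
    kept, kept_tokens = [], []
    for text in texts:
        tokens = text.lower().split()
        if all(cosine_similarity(tokens, other) < threshold for other in kept_tokens):
            kept.append(text)
            kept_tokens.append(tokens)
    return kept
```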
Three annotators are employed to independently annotate each of the 2K Tweets with one of the two labels "INFORMATIVE" and "UNINFORMATIVE". We use the "doccano" toolkit for handling the annotations (Nakayama et al., 2018). We measure the inter-annotator agreement to assess the quality of annotations and to see whether the guideline allows the task to be carried out consistently. In particular, we use the Fleiss' Kappa coefficient to assess the annotator agreement (Fleiss, 1971). For this first phase, the Kappa score is 0.797, which can be interpreted as substantial (Landis and Koch, 1977). We further run a discussion for Tweets where there is a disagreement in the assigned labels among the annotators. The discussion is to determine the final labels of the Tweets as well as to improve the quality of the annotation guideline.

Table 1: Basic statistics of our corpus. #INFOR and #UNINF denote the numbers of "INFORMATIVE" and "UNINFORMATIVE" Tweets, respectively.
Item     Training  Validation  Test    Total
#INFOR   3,303     472         944     4,719
#UNINF   3,697     528         1,056   5,281
Total    7,000     1,000       2,000   10,000
For the second phase, we employ the 2K annotated Tweets from the first phase to train a binary fastText classifier (Joulin et al., 2017) to classify a COVID-19 related Tweet into either "INFORMATIVE" or "UNINFORMATIVE". We utilize the trained classifier to predict the probability of "INFORMATIVE" for each of the remaining Tweets in the dataset of "INFORMATIVE" candidates from the first phase. Then we randomly sample 8K Tweets from the candidate dataset, including 3K, 2K and 3K Tweets associated with the probability ∈ [0.0, 0.3), [0.3, 0.7) and [0.7, 1.0], respectively (here, we do not sample from the existing 2K annotated Tweets). The goal here is to select Tweets with varying degrees of detection difficulty (with respect to the baseline) in both labels.
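The probability-based sampling can be sketched as below, assuming the classifier scores are already available as (Tweet id, probability) pairs. The function and variable names, the fixed random seed and the input format are illustrative assumptions; only the three intervals and their sample sizes come from the description above.

```python
import random

# (low, high) probability interval and number of Tweets to draw from it.
# The last interval is closed on the right: [0.7, 1.0].
BINS = [((0.0, 0.3), 3000), ((0.3, 0.7), 2000), ((0.7, 1.0), 3000)]


def sample_by_probability(scored, bins=BINS, seed=0):
    """Sample Tweets per probability interval.

    scored: list of (tweet_id, probability_of_informative) pairs that
    excludes the Tweets already annotated in the first phase.
    """
    rng = random.Random(seed)
    sampled = []
    for index, ((low, high), size) in enumerate(bins):
        is_last = index == len(bins) - 1
        bucket = [
            item for item in scored
            if low <= item[1] < high or (is_last and item[1] == high)
        ]
        # random.sample raises ValueError if the bucket holds fewer than `size` items.
        sampled.extend(rng.sample(bucket, size))
    return sampled
```

Drawing a fixed number of Tweets from each interval, rather than sampling uniformly from all candidates, is what yields a mix of Tweets that the baseline finds easy and hard in both labels.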
The three annotators then independently assign the "INFORMATIVE" or "UNINFORMATIVE" label to each of the 8K Tweets. The resulting Kappa score is 0.818, which can be interpreted as almost perfect (Landis and Koch, 1977). Similar to the first phase, for each Tweet with a disagreement among the annotators, we run a further discussion to decide its final label.
We merge the two datasets from the first and second phases to form the final gold standard corpus of 10K annotated Tweets, consisting of 4,719 "INFORMATIVE" Tweets and 5,281 "UNINFORMATIVE" Tweets.
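For reference, the Fleiss' Kappa coefficient reported for both annotation phases can be computed as in the following sketch, which follows the standard definition (Fleiss, 1971). The input format, a list of per-Tweet label counts, and the function name are assumptions made for illustration; every item is assumed to be rated by the same number of annotators.

```python
def fleiss_kappa(ratings):
    """Compute Fleiss' Kappa.

    ratings: one row per item; row[j] is the number of annotators who
    assigned category j to that item. With three annotators and two labels,
    [3, 0] means all three chose the first label.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Observed agreement: mean over items of the proportion of agreeing rater pairs.
    per_item = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(per_item) / n_items

    # Expected agreement from the overall proportion of each category.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    # Undefined (division by zero) if all ratings fall into a single category.
    return (observed - expected) / (1 - expected)
```

Under the Landis and Koch (1977) scale, values from 0.61 to 0.80 are read as substantial agreement and values from 0.81 to 1.00 as almost perfect agreement.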

Data partitions
To split the gold standard corpus into training, validation and test sets, we first divide its Tweets into two categories, "easy" and "not-easy", in which the "not-easy" category contains Tweets with a label disagreement among annotators before the annotation discussions. We then randomly select 7K Tweets for training, 1K Tweets for validation and 2K Tweets for test, with a constraint that ensures the number of "not-easy" Tweets in the training set is equal to that in the validation and test sets. Table 1 describes the basic statistics of our corpus.

Task organization
Development phase: Both the training and validation sets with gold labels are released publicly to all participants for system development. Although we provide a default training and validation split of the released data, participants are free to use this data in any way they find useful when training and tuning their systems, e.g. using a different split or performing cross-validation.
Evaluation phase: The raw test set is released when the final phase of system evaluation starts. To ensure fairness among participants, the raw test set is a relatively large set of 12K Tweets, and the actual 2K test Tweets by which the participants' system outputs are evaluated are hidden in this large test set. We allow each participant to upload at most 2 submissions during this final evaluation phase, in which the submission obtaining the higher F1 score is ranked higher on the leaderboard.
Metrics: Systems are evaluated using standard evaluation metrics, including Accuracy, Precision, Recall and F1 score. Note that the latter three metrics of Precision, Recall and F1 are calculated for the "INFORMATIVE" label only. The system evaluation submissions are ranked by the F1 score.
Baseline: fastText (Joulin et al., 2017) is used as our baseline, employing the default data split.
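The evaluation metrics described above can be sketched as follows, with Precision, Recall and F1 computed for the "INFORMATIVE" label only. This is an illustrative re-implementation using standard definitions, not the official scoring script; the function name and the returned dictionary keys are assumptions.

```python
def evaluate(gold, predicted, positive="INFORMATIVE"):
    """Return Accuracy plus Precision, Recall and F1 for the positive label."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == positive and p == positive for g, p in pairs)
    false_pos = sum(g != positive and p == positive for g, p in pairs)
    false_neg = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

In the setting described above, `gold` and `predicted` would cover only the 2K hidden test Tweets, so predictions for the remaining Tweets of the 12K raw test set would be discarded before scoring.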

Results
In total, 121 teams from 20 different countries registered to participate in our WNUT-2020 Task 2 during the system development phase. Of those 121 teams, 55 teams uploaded their submissions for the final evaluation phase. We report results obtained for each team in Table 2. The baseline fastText achieves an F1 score of 0.7503, and 48 teams outperform it in terms of F1. There are 39 teams with an F1 greater than 0.80, of which 10 teams have an F1 greater than 0.90. Both NutCracker (Kumar and Singh, 2020) and NLP North (Møller et al., 2020) obtain the highest F1 score of 0.9096, while NutCracker obtains the highest Accuracy of 91.50%, which is 0.1% absolute higher than NLP North's.
Of the 55 teams, 36 teams submitted a system paper, and the papers of 34 teams are included in the Proceedings. All of the 36 teams with paper submissions employ pre-trained language models to extract latent features for learning classifiers. The most commonly employed pre-trained language models include BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), BERTweet (Nguyen et al., 2020) and especially CT-BERT (Müller et al., 2020).
Not surprisingly, CT-BERT, which is obtained by continuing pre-training from the pre-trained BERT-large model on a corpus of 22.5M COVID-19 related Tweets, is utilized in a large number of the highly-ranked systems. In particular, all of the top 6 teams, including NutCracker, NLP North, UIT-HSE (Tran et al., 2020), #GCDH (Varachkina et al., 2020), Loner and Phonemer (Wadhawan, 2020), utilize CT-BERT. This explains why their F1 scores differ only slightly. In addition, ensemble techniques are used by a large proportion (61%) of the participating teams. Specifically, to obtain the best performance, the top 10 teams, except NLP North, #GCDH and Loner, all employ ensemble techniques.

Conclusion
In this paper, we have presented an overview of the WNUT-2020 Task 2 "Identification of Informative COVID-19 English Tweets": we (i) provide details of the task, the data preparation process and the task organization, and (ii) report the results obtained by participating teams and outline their commonly adopted approaches.
We received registrations from 121 teams and final system evaluation submissions from 55 teams, of which 34 teams contribute detailed system descriptions. The evaluation results show that many systems obtain very high performance of up to 0.91 F1 score on the task, using pre-trained language models which are fine-tuned on unlabelled COVID-19 related Tweets (CT-BERT) and are subsequently trained on this task.