TATL at W-NUT 2020 Task 2: A Transformer-based Baseline System for Identification of Informative COVID-19 English Tweets

As the COVID-19 outbreak continues to spread throughout the world, more and more information about the pandemic has been shared publicly on social media. For example, a huge number of COVID-19 English Tweets are posted on Twitter every day. However, the majority of those Tweets are uninformative, so it is important to be able to automatically select only the informative ones for downstream applications. In this short paper, we present our participation in the W-NUT 2020 Shared Task 2: Identification of Informative COVID-19 English Tweets. Inspired by recent advances in pretrained Transformer language models, we propose a simple yet effective baseline for the task. Despite its simplicity, our approach achieves very competitive results on the leaderboard: we ranked 8th out of 56 participating teams.


Introduction
The COVID-19 pandemic has been spreading rapidly across the globe and has infected more than 20 million people. As a result, more and more people have been sharing a wide variety of information related to COVID-19 publicly on social media. For example, a huge number of COVID-19 English Tweets are posted on Twitter every day. However, the majority of those Tweets are uninformative. Systems that can automatically filter out uninformative Tweets are therefore needed by the community. Tweets generally differ from traditional written text such as Wikipedia or news articles because of their short length and their informal vocabulary and grammar (e.g., abbreviations, hashtags, and markers). These special characteristics may pose a challenge for many NLP techniques that focus solely on formally written text.
In this paper, we present our participation in the W-NUT 2020 Shared Task 2: Identification of Informative COVID-19 English Tweets (Nguyen et al., 2020b). Inspired by the recent success of Transformer-based pre-trained language models in many NLP tasks (Devlin et al., 2019; Nguyen and Nguyen, 2020; Lai et al., 2020), we propose a simple yet effective baseline for the task. Despite its simplicity, our proposed approach shows very competitive results. In the following sections, we first describe the task definitions in Section 2 and the proposed methods in Section 3. We then describe the experiments and their results in Section 4. Finally, in Section 5, we conclude this work and discuss potential future research directions.

Task Definitions
The goal of Shared Task 2 is to identify whether a COVID-19 English Tweet is informative or not. An informative Tweet provides information about recovered, suspected, confirmed, and death cases, as well as the location or history of each case. The dataset introduced in this shared task consists of 10K COVID-19 English Tweets. Dataset statistics can be found in Table 1.

Method

Baseline Model
The task is formulated as a binary classification of Tweets into informative or uninformative classes. Figure 1 gives a high-level overview of our proposed approach. Given a Tweet consisting of n tokens $x = \{x_1, x_2, \ldots, x_n\}$, we first form a contextualized representation for each token using a Transformer-based encoder such as BERT (Devlin et al., 2019). Following common conventions, we add special tokens to the beginning and end of the input Tweet before feeding it to the Transformer model. For example, if we use BERT, $x_1$ will be the special [CLS] token and $x_n$ will be the special [SEP] token. Let $H = \{h_1, h_2, \ldots, h_n\}$ denote the contextualized representations produced by the Transformer model. We then use $h_1$ as an aggregate representation of the original input and feed it to a linear layer to calculate the final output:

$$y = \sigma(W h_1 + b)$$

where the transformation matrix $W$ and the bias term $b$ are model parameters, and $\sigma$ denotes the sigmoid function, which squashes the score to a probability between 0 and 1. $y$ is the predicted probability of the input Tweet being informative.
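To make the classification head concrete, the following is a minimal sketch of the output computation, assuming the encoder has already produced the first-token representation. The function names, the toy vector values, and the 0.5 decision threshold are illustrative assumptions, not details taken from the system itself.

```python
import math


def sigmoid(z: float) -> float:
    """Squash a real-valued score to a probability between 0 and 1.

    Written in a numerically stable form so that very negative scores
    do not overflow math.exp.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def informative_probability(h1, W, b):
    """Return sigma(W . h1 + b) for a single Tweet.

    h1: representation of the first ([CLS]) token, as a list of floats
    W:  weights of the linear layer (a single row, same length as h1)
    b:  scalar bias term
    """
    if len(h1) != len(W):
        raise ValueError("h1 and W must have the same length")
    score = sum(w * h for w, h in zip(W, h1))
    return sigmoid(score + b)


# Toy values standing in for the encoder output and learned parameters
# (hypothetical; a real encoder produces hundreds of dimensions).
h1 = [0.5, -1.0, 2.0]
W = [1.0, 0.5, 0.25]
b = -0.5

p = informative_probability(h1, W, b)
# Assumed decision rule: predict INFORMATIVE when the probability is >= 0.5.
label = "INFORMATIVE" if p >= 0.5 else "UNINFORMATIVE"
```

In practice the linear layer is trained jointly with the encoder; the sketch only shows how a single probability is obtained from $h_1$ once the parameters are fixed.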

RoBERTa
RoBERTa (Liu et al., 2019) improves over BERT (Devlin et al., 2019) by modifying the pretraining procedure, namely removing the next sentence prediction objective and using dynamic masking for masked language modelling, which leads to more robust optimization. Liu et al. (2019) also show that training the language model longer and with more data substantially benefits performance on downstream tasks.

XLM-RoBERTa
Inspired by the success of multilingual language models (Devlin et al., 2019; Lample and Conneau, 2019), XLM-RoBERTa (Conneau et al., 2020) significantly scaled up the amount of multilingual training data used in unsupervised MLM pretraining compared to previous work (Lample and Conneau, 2019) and achieved state-of-the-art performance on both monolingual and cross-lingual benchmarks.

BERTweet
BERTweet (Nguyen et al., 2020a) is a domain-specific language model pre-trained on a large corpus of English Tweets. Mirroring the success of BioBERT in the BioNLP domain and of SciBERT (Beltagy et al., 2019) in the scientific NLP domain, BERTweet achieved state-of-the-art performance across many Tweet NLP tasks, outperforming its counterparts RoBERTa (Liu et al., 2019) and XLM-RoBERTa (Conneau et al., 2020).

ELECTRA
ELECTRA (Clark et al., 2020) proposed a new pretraining objective that differs from Masked Language Modelling (Devlin et al., 2019; Liu et al., 2019). Instead of only masking input tokens, ELECTRA corrupts the input by replacing some tokens with alternatives sampled from the token distribution produced by a small generator network, while a discriminator is trained to predict which tokens were actually replaced by the generator. ELECTRA achieved state-of-the-art results across many tasks in the GLUE benchmark while using far fewer compute resources than other pre-training methods (Devlin et al., 2019; Liu et al., 2019).
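The discriminator's training signal can be illustrated with a small sketch. It assumes the generator has already produced a corrupted sequence; the helper name and the example sentence are chosen purely for illustration. A position is labelled 1 when its token differs from the original and 0 otherwise, so a generator sample that happens to match the original token counts as not replaced.

```python
def replaced_token_labels(original, corrupted):
    """Return per-token discriminator targets: 1 = replaced, 0 = original."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


# Illustrative example: the generator replaced "cooked" with "ate".
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = replaced_token_labels(original, corrupted)
```

Because every position receives a label, the discriminator learns from all input tokens rather than only from the masked subset, which is one reason this objective is compute-efficient.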

Ensemble Learning
To further boost the performance of our baseline models, we leverage ensemble learning. We ensemble all of the Transformer models described in the previous section and employ two different ensemble schemes, namely Unweighted Averaging and Majority Voting.

Unweighted Averaging
In this approach, the final prediction is estimated from the unweighted average of the posterior probabilities from all of our models. Thus, the final prediction is given by:

$$\hat{y} = \arg\max_{c \in \{1, \ldots, C\}} \frac{1}{M} \sum_{i=1}^{M} p_i^{(c)}$$

where $C$ is the number of classes, $M$ is the number of models, and $p_i$ is the probability vector computed using the softmax function of model $i$, with $p_i^{(c)}$ denoting its entry for class $c$.
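A minimal sketch of this averaging scheme is given below, assuming each model outputs a probability vector over the same ordered set of classes. The function name and the three toy probability vectors are made up for illustration and are not outputs of the actual models.

```python
def unweighted_average(prob_vectors):
    """Average per-model probability vectors and return (class index, average).

    prob_vectors: list of M lists, each of length C, one per model.
    """
    if not prob_vectors:
        raise ValueError("at least one model is required")
    num_models = len(prob_vectors)
    num_classes = len(prob_vectors[0])
    if any(len(p) != num_classes for p in prob_vectors):
        raise ValueError("all probability vectors must have the same length")
    avg = [sum(p[c] for p in prob_vectors) / num_models for c in range(num_classes)]
    # argmax over classes of the averaged probability
    pred = max(range(num_classes), key=lambda c: avg[c])
    return pred, avg


# Toy outputs of three models over (uninformative, informative); hypothetical.
probs = [
    [0.40, 0.60],
    [0.55, 0.45],
    [0.30, 0.70],
]
pred, avg = unweighted_average(probs)
```

Here the second model on its own would predict class 0, but the averaged probability still favours class 1, because the other two models are more confident.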

Majority Voting
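Majority Voting counts the votes of all the models and selects the class with the most votes as the prediction. Formally, the final prediction is given by:

$$v_c = \sum_{i=1}^{M} \mathbb{1}[F_i = c], \qquad \hat{y} = \arg\max_{c \in \{0, 1\}} v_c$$

where $v_c$ denotes the number of votes for class $c$ across all models, $M$ is the number of models, and $F_i$ is the binary decision of model $i$, which is either 0 or 1.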
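The voting rule can be sketched as follows. The tie-breaking behaviour is an assumption made for this example (ties go to the smaller class index), since a tie can only occur with an even number of models and the text does not specify how it is resolved. The function name and the toy decisions are likewise illustrative.

```python
from collections import Counter


def majority_vote(decisions):
    """Return (predicted class, votes per class) from per-model decisions.

    decisions: list of binary decisions F_i, each 0 or 1.
    Ties are broken in favour of the smaller class index (an assumption).
    """
    if not decisions:
        raise ValueError("at least one decision is required")
    votes = Counter(decisions)
    # max() returns the first maximal element, so iterating over the
    # sorted classes makes the smaller class win a tie.
    pred = max(sorted(votes), key=lambda c: votes[c])
    return pred, dict(votes)


# Toy decisions from five models (hypothetical).
decisions = [1, 0, 1, 1, 0]
pred, votes = majority_vote(decisions)
```

Unlike averaging, voting ignores how confident each model is, so the two schemes can disagree when a minority of models is very confident.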

Finetuning
To fine-tune our baseline models, we employ the transformers library (Wolf et al., 2019). We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a fixed batch size of 32 and learning rates in the set {1e-5, 2e-5, 5e-5}. We fine-tune the models for 30 epochs and select the best checkpoint based on the performance of the model on the validation set.

Table 2 shows the overall results on the validation set. The large version of RoBERTa achieves the highest F1 score among the individual models. To our surprise, BERTweet does not outperform the base version of RoBERTa on the validation set, even though BERTweet was trained on English Tweets using the same training procedure as RoBERTa. Finally, XLM-RoBERTa achieves a lower F1 score than both RoBERTa and ELECTRA, suggesting that a multilingual pretrained language model may not improve performance, since the shared task is mainly about English Tweets.

We also evaluate the performance of our ensemble models. The results show that ensemble learning improves the F1 score compared to each individual model, and that Unweighted Averaging performs better than Majority Voting on the validation set. We also submitted the predictions of both ensemble schemes to the competition; the final results on the leaderboard are shown in Table 3. We notice that Majority Voting performs slightly better than Unweighted Averaging on the hidden test set.
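The checkpoint selection step used during fine-tuning can be sketched as below. This is an illustration under stated assumptions: F1 is assumed to be computed for the informative (positive) class, the function names are hypothetical, and the validation labels and per-checkpoint predictions are toy values invented for the example rather than results from our experiments.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class, computed from label lists of equal length."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_best_checkpoint(y_true, predictions_by_checkpoint):
    """Return the (learning_rate, epoch) key with the highest validation F1."""
    scores = {
        key: f1_score(y_true, preds)
        for key, preds in predictions_by_checkpoint.items()
    }
    best = max(scores, key=lambda key: scores[key])
    return best, scores[best]


# Toy validation labels and predictions, keyed by (learning rate, epoch).
y_val = [1, 0, 1, 1, 0, 0]
candidates = {
    (1e-5, 1): [1, 1, 0, 1, 0, 1],
    (2e-5, 1): [1, 0, 1, 1, 0, 1],
    (5e-5, 1): [0, 0, 1, 0, 0, 0],
}
best, best_f1 = select_best_checkpoint(y_val, candidates)
```

In a real run, the dictionary would hold one entry per learning rate and epoch, and only the weights of the best-scoring checkpoint would be kept for the ensemble.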

Conclusion
In this paper, we introduce a simple but effective approach for identifying informative COVID-19 English Tweets. Despite its simplicity, our approach achieves very competitive results on the leaderboard: we ranked 8th out of 56 participating teams. In future work, we will conduct a thorough error analysis and apply visualization techniques to gain a better understanding of our models (Murugesan et al., 2019). Furthermore, we will extend our approach to other languages. Finally, we will investigate the use of advanced techniques such as transfer learning, few-shot learning, and self-training to further improve the performance of our system (Pan et al., 2017; Huang et al., 2018; Lai et al., 2018; Xie et al., 2020).