IDSOU at WNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets

We introduce the IDSOU submission for the WNUT-2020 task 2: identification of informative COVID-19 English Tweets. Our system is an ensemble of pre-trained language models such as BERT. We ranked 16th in the F1 score.


Introduction
The spread of COVID-19 is causing fear and panic among people around the world. To monitor COVID-19 outbreaks in real time, analysis of social networking services (SNS) such as Twitter is attracting much attention. Although 4 million COVID-19 English Tweets are posted daily on Twitter (Lamsal, 2020), most of them are uninformative. Against this background, WNUT-2020 held Shared Task 2 (Nguyen et al., 2020) to automatically identify whether a COVID-19 English Tweet is informative or not.
Our system employs an ensemble approach based on pre-trained language models. Such pre-trained language models (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Lan et al., 2020; Conneau et al., 2020; Lewis et al., 2020) have achieved high performance in various text classification tasks (Wang et al., 2019). In addition, we employ domain-specific pre-trained language models (Alsentzer et al., 2019; Müller et al., 2020) to build models suitable for the COVID-19 and Twitter domains. Each model is optimized with three types of loss functions, cross-entropy, negative supervision (Ohashi et al., 2020), and the Dice similarity coefficient (Li et al., 2020), which are useful for various text classification tasks. Finally, we ensemble 48 classifiers based on 16 pre-trained language models and 3 loss functions with a random forest classifier (Breiman, 2001).

WNUT-2020 Shared Task 2
In the shared task (Nguyen et al., 2020), systems are required to classify whether a COVID-19 English Tweet is informative or not. Such informative Tweets provide information about recovered, suspected, confirmed, and death cases, as well as the location or travel history of the cases. The 10,000 COVID-19 English Tweets shown in Table 1 have been released for the shared task.
The baseline system is based on fastText (Bojanowski et al., 2017). Systems are evaluated by accuracy, precision, recall and F1 score, and are ranked by F1 score, which is the main metric. Note that the latter three metrics are calculated for the informative class only.
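As a rough illustration of how these metrics can be computed (a minimal sketch with scikit-learn; the label encoding and function names are assumptions, not the official evaluation script):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# gold and pred are lists of labels, with 1 = INFORMATIVE and 0 = UNINFORMATIVE
# (this encoding is assumed for illustration).
def evaluate(gold, pred):
    acc = accuracy_score(gold, pred)
    # precision, recall, and F1 are computed for the informative class only
    p, r, f1, _ = precision_recall_fscore_support(
        gold, pred, pos_label=1, average="binary"
    )
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f1}
```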

IDSOU System
We first introduce each base model in Section 3.1 and each loss function in Section 3.2. We then introduce the ensemble model in Section 3.3. Finally, Section 3.4 describes the implementation details.

Base Models
Recently, the fine-tuning approach for pre-trained language models (Devlin et al., 2019) has achieved the highest performance on many text classification tasks (Wang et al., 2019). We employ the following pre-trained language models, covering six types of architectures, for the shared task.
BERT (Devlin et al., 2019) The transformer encoder pre-trained by multitask learning of masked language modeling and next sentence prediction. We employ three types of pre-trained models: BERT-base, BERT-large, and BERT-large-wwm. BERT-base consists of 12 transformer layers, 12 self-attention heads per layer, and a hidden size of 768. BERT-large and BERT-large-wwm consist of 24 transformer layers, 16 self-attention heads per layer, and a hidden size of 1,024.
XLNet (Yang et al., 2019) The transformer encoder pre-trained by permutation language modeling. We employ two types of pre-trained models: XLNet-base and XLNet-large. The parameters of XLNet-base and XLNet-large are the same as those of BERT-base and BERT-large, respectively.
RoBERTa (Liu et al., 2019) The transformer encoder pre-trained by masked language modeling. RoBERTa has the same architecture as BERT, but is pre-trained for more steps on larger data with larger batch sizes. We employ two types of pre-trained models: RoBERTa-base and RoBERTa-large.

XLM-RoBERTa (Conneau et al., 2020) The multilingual transformer encoder pre-trained by masked language modeling. We employ a pre-trained model of XLM-RoBERTa-base, which consists of 12 transformer layers, 8 self-attention heads per layer, and a hidden size of 3,072.
ALBERT (Lan et al., 2020) The transformer encoder pre-trained by multitask learning of masked language modeling and sentence order prediction. ALBERT has significantly fewer parameters than the traditional BERT architecture due to two parameter reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. We employ two types of pre-trained models: ALBERT-base and ALBERT-large. ALBERT-base and ALBERT-large have the same number of layers, attention heads, and hidden size as BERT-base and BERT-large, respectively, but their embedding size is 128.
BART (Lewis et al., 2020) The denoising autoencoder based on a bidirectional transformer encoder and a left-to-right transformer decoder. We employ two types of pre-trained models: BART-base and BART-large. BART-base consists of 12 transformer layers, 16 self-attention heads per layer, and a hidden size of 768. BART-large consists of 24 transformer layers, 16 self-attention heads per layer, and a hidden size of 1,024.
The language models mentioned above are pre-trained on corpora in the general domain, such as the BookCorpus (Zhu et al., 2015) and English Wikipedia. Recent studies (Lee and Hsiang, 2019; Beltagy et al., 2019) have revealed that language models pre-trained on a domain-specific corpus achieve better performance in that domain. We therefore employ three types of BERT models pre-trained on large-scale corpora of the medical domain and the Twitter domain (Alsentzer et al., 2019; Müller et al., 2020) to build classifiers suitable for COVID-19 English Tweets.

Loss Functions
We train classifiers based on pre-trained language models with the following three loss functions.

XE: Cross Entropy
We employ the cross-entropy loss commonly used in text classification tasks:

\mathcal{L}_{\mathrm{XE}} = -\sum_{i=1}^{N} \log P_i

where P_i := P(y_i | X_i), y_i is the gold label, X_i is the input text, and N is the number of training examples.
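In implementation terms, this corresponds to the standard cross-entropy over the classifier logits; a minimal PyTorch sketch (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

# logits: (batch_size, 2) scores for the uninformative/informative classes,
# labels: (batch_size,) gold class indices.
def xe_loss(logits, labels):
    # F.cross_entropy applies log-softmax and averages -log P(y_i | X_i) over the batch
    return F.cross_entropy(logits, labels)
```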
NS: Negative Supervision (Ohashi et al., 2020)
This loss function separates the representations of Tweets with different labels, where v_i is the representation of the i-th text and v_n is that of a text with a different label.

DS: Dice Similarity Coefficient (Li et al., 2020)
The loss function based on the Dice coefficient. The gap between maximizing the F1 score and minimizing the DS loss is smaller than the gap when minimizing the XE loss.
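As an illustration of the DS loss, the sketch below follows one common binary formulation of the Dice loss from Li et al. (2020); the exact variant and smoothing constant used in our system may differ:

```python
import torch

def dice_loss(probs, labels, gamma=1.0):
    """Dice loss for binary classification.

    probs:  (batch_size,) predicted probability of the informative class.
    labels: (batch_size,) gold labels in {0, 1}.
    gamma:  smoothing constant (an assumed default).
    """
    labels = labels.float()
    dsc = (2 * probs * labels + gamma) / (probs ** 2 + labels ** 2 + gamma)
    return (1 - dsc).mean()
```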

Ensemble Model
We ensemble the 48 classifiers described above (16 pre-trained language models × 3 loss functions) to make the predictions stable. A random forest classifier (Breiman, 2001) is trained using k-fold cross-validation on the development set, with the probabilities of the informative class estimated by each base model as the features.
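A minimal sketch of this stacking step with scikit-learn (the feature construction, hyperparameters, and function names are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# dev_probs:  (num_dev_examples, 48) matrix whose columns are the informative-class
#             probabilities predicted by the 48 base classifiers on the development set.
# dev_labels: (num_dev_examples,) gold labels of the development set.
# test_probs: the same 48-dimensional features computed on the test set.
def fit_ensemble(dev_probs, dev_labels, test_probs):
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    # k-fold cross-validation on the development set to check the meta-classifier
    scores = cross_val_score(rf, dev_probs, dev_labels, cv=5, scoring="f1")
    print("dev F1 (5-fold):", np.mean(scores))
    rf.fit(dev_probs, dev_labels)
    return rf.predict(test_probs)
```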

Implementation Details
We implemented all models based on Hugging Face's Transformers (Wolf et al., 2019) with the Adam optimizer (Kingma and Ba, 2015). Hyperparameters of each base model were determined by searching over candidate combinations based on the F1 score on the development set. We implemented the ensemble model based on scikit-learn (Pedregosa et al., 2011). Hyperparameters of the random forest classifier were determined from candidate combinations through 5-fold cross-validation on the development set. We followed the default data split provided by the task organizers. No external data has been used.
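For reference, a minimal sketch of how one base classifier can be fine-tuned with Transformers (the checkpoint name, learning rate, and training-loop details are assumptions for illustration, not our exact configuration):

```python
import torch
from torch.optim import Adam
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = Adam(model.parameters(), lr=2e-5)

def train_step(texts, labels):
    # texts: list of Tweet strings, labels: list of 0/1 gold labels
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))
    outputs.loss.backward()   # cross-entropy loss over the two classes by default
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```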

Results
Table 2 shows the F1 scores of each base model on the development set. COVID-Twitter-BERT, pre-trained on an in-domain corpus, achieved the highest performance, as expected. Since non-expert posts make up the majority of SNS content, the models pre-trained on the biomedical and clinical domains did not outperform those of the general domain.

Regarding the loss functions, XE loss showed stable performance. NS loss is effective for 6 out of 16 models and seems to be compatible with BERT. DS loss achieved the best performance in combination with COVID-Twitter-BERT, although its overall performance is not high. Table 3 shows the effect of our ensemble method. These results reveal the effectiveness of ensembling both different pre-trained language models and different loss functions. Table 4 shows the official results. We ranked 16th out of 55 teams in the F1 score.

Conclusions
We described the IDSOU submission for WNUT-2020 Task 2. Our system is an ensemble model based on 16 pre-trained language models and 3 loss functions with a random forest classifier. In the official results, we ranked 16th out of 55 teams.