NLPRL at WNUT-2020 Task 2: ELMo-based System for Identification of COVID-19 Tweets

The Coronavirus pandemic has dominated the news on social media for many months. Efforts are being made to reduce its spread, the number of casualties, and the number of new infections. For this purpose, information about infected people and their symptoms, as available on social media such as Twitter, can help in prevention and in taking precautions. This is an example of using noisy text processing for disaster management. This paper discusses the NLPRL results in Shared Task 2 of the WNUT-2020 workshop. We treat the problem as binary classification and use a pre-trained ELMo embedding with GRU units. This approach classifies the tweets with an accuracy of 80.85% and an F1-score of 78.54% on the provided test dataset. The experimental code is available online.


Introduction
The Coronavirus disease (officially named COVID-19 by the World Health Organization, or WHO, on February 11, 2020) is still spreading worldwide, although the numbers in some countries have decreased. In mid-June 2020, there was fear and panic all around the world related to the COVID-19 outbreak. Twitter has been used for gathering information about crisis communications (Cho et al., 2013). There has been a massive number of tweets about the pandemic. One estimate puts the number at about four million COVID-19 English tweets daily (Lamsal, 2020).
For our task, the tweets can be classified into two categories: informative and uninformative. Informative tweets contain useful information about the pandemic and the affected people, and they can help in managing the spread of the pandemic. The rest can be treated as uninformative tweets. The informative tweets provide information about recovered, suspected, confirmed, and deceased cases, and possibly also about a person's location or travel history. The majority of tweets are uninformative. This kind of classification is an example of text classification, which is a core task in many areas of Natural Language Processing (NLP).

Related work
The continuous availability of a large number of Twitter posts has been useful for developing more accurate and reliable classification methods for noisy text. Conventional NLP techniques and machine learning-based classification methods do not seem to perform well with Twitter data. Jiang et al. (2018) described work on identifying health-related Personal Experience Tweets (PETs) by combining word embeddings and an LSTM neural network, which demonstrated significant improvement (with p < 0.01) in accuracy, precision, recall, F1-score, and ROC/AUC over the conventional methods in identifying PETs.
In the past few months, many researchers have tried out several mathematical and statistical models to predict novel Coronavirus transmission (Zhao et al., 2020; Shim et al., 2020; Benvenuto et al., 2020). Since the COVID-19 dataset is of a time series nature, it is natural to use sequential networks to extract the patterns from it. Some studies have used LSTM networks to forecast the spread of infectious diseases such as the current COVID-19 epidemic (Chimmula and Zhang, 2020; Bandyopadhyay and Dutta, 2020; Huang et al., 2020; Tomar and Gupta, 2020; Pal et al., 2020).
Dubey (2020) analyzed the country-wise sentiment and emotions of tweets from people in 12 countries (11th March 2020 to 31st March 2020) and found that countries like Australia, Belgium, and India were tweeting about COVID-19 with a positive sentiment, whereas people in China had negative sentiments about it. Arora et al. (2020) proposed deep learning models to predict the number of COVID-19 positive cases in 32 states of India and its Union Territories. Using RNN-based LSTM cells and their variants, such as deep LSTM, convolutional LSTM, and bi-directional LSTM, as predictive models, they concluded that bi-directional LSTM gives the best results and convolutional LSTM gives the worst results in terms of prediction error.
Contextual embeddings are known to provide better results on basic NLP problems such as sentiment analysis (Müller et al., 2020; Kruspe et al., 2020; Akbik et al., 2018). In this paper, we made an effort to automatically identify whether a COVID-19 English tweet is informative or not, using a model based on a contextual embedding called ELMo (Peters et al., 2018). The dataset of 10K COVID-19 English tweets was provided by the organizers of the shared task.

Problem Statement
The goal of WNUT-2020 Task 2, named as the "Identification of informative COVID-19 English tweets", is to classify COVID-19 tweets into informative and uninformative categories. This is a binary classification task to learn $F : X \rightarrow Y$, where $X = \{X_1, X_2, X_3, \ldots, X_m\}$ is a tweet of length $m$ and $Y \in \{\text{informative}, \text{uninformative}\}$.

System Description
Word embeddings, or distributed representations of words, use dense, real-valued vectors to represent vocabulary words. Word embeddings have a much smaller dimension than the size of the vocabulary and, unlike one-hot vectors, carry syntactic and semantic information about the words. In this paper, we use the ELMo (Embeddings from Language Models) method proposed by Peters et al. (2018) for word embeddings. The dimensionality of each word vector is 2048.
The Bi-GRU model is a variant of the RNN (Hermans and Schrauwen, 2013) that models each word representation together with its preceding and following context. The output of the word embedding layer is fed to the GRU (Cho et al., 2014) unit, and the output of the GRU unit is then passed to a linear layer that predicts the class.
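For reference, the recurrence computed by a GRU can be written out as below. This is a sketch in one common convention (some formulations swap the roles of $z_t$ and $1 - z_t$), and it is not necessarily the exact parameterization used by the library. The symbols $x_t$ (the embedding of the $t$-th word), $h_t$ (the hidden state), and the weights $W$, $U$, $b$ are our notation. The way the two directions are combined and the softmax over the linear layer are assumptions made for illustration.

```latex
\begin{align}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new state)} \\
h &= [\overrightarrow{h}_m ; \overleftarrow{h}_1] && \text{(assumed: concatenated final states)} \\
\hat{y} &= \operatorname{softmax}(W_o h + b_o) && \text{(linear layer over the two classes)}
\end{align}
```

Here $\overrightarrow{h}_m$ is the last state of the forward pass over a tweet of length $m$ and $\overleftarrow{h}_1$ is the last state of the backward pass, so $h$ summarizes both the preceding and the following context.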

System Training
The provided dataset (Nguyen et al., 2020) contains 10000 tweets, out of which 4719 tweets are informative, whereas the rest are labeled as uninformative. The dataset is divided into three sub-parts, used to train, tune, and test the model, respectively. The statistics for these sub-parts of the dataset, according to the class, are given in Table 1.
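The class balance implied by these counts can be checked with a few lines of arithmetic. The sketch below uses only the two totals stated above; the function name is ours, and the majority-class figure refers to the full 10000-tweet dataset, not to any individual sub-part.

```python
def class_balance(total: int, informative: int) -> dict:
    """Derive the class counts and shares from the two totals given in the text."""
    uninformative = total - informative
    return {
        "informative": informative,
        "uninformative": uninformative,
        "informative_share": informative / total,
        # Accuracy of always predicting the larger class, over the full dataset.
        "majority_baseline_accuracy": max(informative, uninformative) / total,
    }


balance = class_balance(total=10000, informative=4719)
print(balance)
```

Under these totals the uninformative class is slightly larger, so a classifier that always predicts "uninformative" would already be right a little more than half of the time on the full dataset.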

Table 1. Columns: Dataset, Informative, Uninformative.

We used a pre-trained ELMo embedding of size 2048, 512 GRU hidden units, and a dropout of 0.5 in our experiments. The model was trained using the flair library (Akbik et al., 2019) with a batch size of 32 for 30 epochs and a learning rate of 0.1. To prevent overfitting, we used early stopping with a patience value of 3.
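The stopping rule can be illustrated with a generic patience-based loop. This is a minimal sketch of the general technique, not the flair library's internal mechanism, which may handle patience differently. The function name and the idea of passing in a precomputed list of development scores are assumptions made so that the example runs on its own; in a real run each score would come from training for one epoch and evaluating on the development set.

```python
from typing import Iterable, Optional, Tuple


def run_with_early_stopping(
    dev_scores: Iterable[float],
    patience: int = 3,
    max_epochs: int = 30,
) -> Tuple[int, Optional[float]]:
    """Return (epochs_run, best_score) under patience-based early stopping.

    dev_scores yields one development-set score per epoch (higher is better).
    Training stops after `patience` consecutive epochs without a new best
    score, or after `max_epochs` epochs, whichever comes first.
    """
    best_score: Optional[float] = None
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, score in enumerate(dev_scores, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if best_score is None or score > best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run, best_score


# Hypothetical development scores, for illustration only.
print(run_with_early_stopping([0.70, 0.75, 0.74, 0.75, 0.73, 0.90]))
```

With the hypothetical scores above, the loop stops before the final value is seen, which shows the trade-off of a small patience: it limits overfitting and training time but can miss a late improvement.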

Results
ELMo and contextual string embeddings (Akbik et al., 2018) (released in the Flair framework) are prominent embedding techniques that consider contextual information while generating the word embedding. We tried both an ELMo-based model and a Flair-based model. The embedding used for the Flair-based model was the Flair embedding (not the Pooled Flair embedding). The pre-trained models of these techniques are trained on the news domain for English. In our experiments on the development data, ELMo gave better results. Hence, for the system submitted for the test dataset, we used the ELMo-based model. Using the provided dataset, we obtained a weighted F1-score of 78.54% on the test dataset; on the development dataset, the ELMo-based model obtained 84.79% and the Flair-based model 83.60%. The evaluation of the test and development datasets with the usual metrics is outlined in Table 2. The informative-class F1-scores obtained on the development dataset (83.69% with the ELMo-based model and 83.96% with the Flair-based model) are given in Table 3.
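The two kinds of F1-score reported here, the per-class score and the weighted average, can be computed as in the sketch below. It is a plain-Python illustration of the standard definitions, not the evaluation script used for the shared task. The function names and the toy labels are ours and do not reproduce any of the reported numbers.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f1(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """F1-score of every class that occurs in the gold labels."""
    scores = {}
    for label in set(gold):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average of the per-class F1-scores, weighted by gold-label support."""
    support = Counter(gold)
    f1 = per_class_f1(gold, pred)
    return sum(f1[label] * support[label] for label in f1) / len(gold)


# Toy labels, for illustration only.
gold = ["INFORMATIVE", "INFORMATIVE", "INFORMATIVE", "UNINFORMATIVE", "UNINFORMATIVE"]
pred = ["INFORMATIVE", "INFORMATIVE", "UNINFORMATIVE", "UNINFORMATIVE", "INFORMATIVE"]
print(per_class_f1(gold, pred), weighted_f1(gold, pred))
```

Because the weighted score mixes both classes in proportion to their support, it can rank two models differently from the informative-class score alone.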

Conclusion
This system description paper reports a simple method that leverages ELMo embedding features to train a COVID-19 informative tweet identification system. We obtained an F1-score of 78.54% on the test dataset with the ELMo-based approach. For future work, we will use this model with adverse regularization in order to make it more robust. Further, a self-learning algorithm can be applied to an available large monolingual corpus gathered from Twitter.