NIT_COVID-19 at WNUT-2020 Task 2: Deep Learning Model RoBERTa for Identifying Informative COVID-19 English Tweets

This paper presents the model submitted by the NIT COVID-19 team for identifying informative COVID-19 English tweets at WNUT-2020 Task 2. This shared task addresses the problem of automatically identifying whether an English tweet related to COVID-19 (novel coronavirus) is informative or not. These informative tweets provide information about recovered, confirmed, suspected, and death cases as well as the location or travel history of the cases. The proposed approach includes pre-processing techniques and pre-trained RoBERTa with suitable hyperparameters for English coronavirus tweet classification. The performance achieved by the proposed model for the shared task WNUT-2020 Task 2 is 89.14% in the F1-score metric.


Introduction
At present, the world is suffering from a novel coronavirus (Linton et al., 2020), which has been spreading since December 2019. It has spread to all continents and almost all countries in the world. Nowadays, many online tools and resources (Xu et al., 2020) provide coronavirus information to people around the world, but these resources are not continuously up to date; they are updated at particular time intervals. The world's major trusted resources, such as the WHO (World Health Organization) (Chakraborty and Maity, 2020), also update this COVID-19 related information only once a day. The COVID-19 pandemic has been spreading rapidly (Shao et al., 2020). In this situation, the world is looking for an automatic monitoring system to identify useful information about COVID-19.
One of the most popular and trusted online social network platforms (Karampelas, 2013) is Twitter, an alternative source for keeping up with the present pandemic. Twitter receives nearly 4 million COVID-19 tweets daily (Lopez et al., 2020), but only a few of them are informative. The Twitter API provides tweet information openly to the research community, which is very useful for automatically identifying informative COVID-19 tweets.
For this WNUT-2020 shared Task 2, we propose a model for automatically identifying COVID-19 informative tweets, which describe information about confirmed, recovered, suspected, and death cases as well as the location or travel history of COVID-19 affected patients. We use the pre-trained deep learning model RoBERTa (Liu et al., 2019) with word-level embeddings along with some pre-processing techniques. We also experimented with a CNN (Moriya and Shibata, 2018) and with pre-trained BERT (Devlin et al., 2018)-encoded sentences as input. The proposed deep learning model gives a higher F1-score compared to these approaches. Section 2 discusses related work, Section 3 describes the methodology and the data in detail, Section 4 presents the results, Section 5 analyzes the results and errors, and Section 6 concludes the paper.

Related Work
Informative tweet identification has been of interest to researchers in recent years. Early work in related fields includes the detection of online media (ALRashdi and O'Keefe, 2019), racism (Kabir and Madria, 2019), and disasters (Sreenivasulu and Sridevi, 2020). Papers published in recent years include (Madichetty and Sridevi, 2019), which introduces the tweet classification detection dataset and experiments with different machine learning models, such as naïve Bayes, logistic regression, random forests, and linear SVMs, to investigate hate speech and disrespectful language, and experiments further on the same dataset using SVMs with n-gram and skip-gram features, as well as (Gambäck and Sikdar, 2017) and (Bohra et al., 2018), both exploring the performance of neural networks and comparing them with other machine learning approaches. A couple of surveys have also been published covering various work addressing the identification of abusive, toxic, and offensive language, hate speech, etc., and their methodology, including (Schmidt and Wiegand, 2017) and (Fortuna and Nunes, 2018). Additionally, there have been several workshops and shared tasks on offensive language identification and related problems, including TA-COS, Abusive Language Online, TRAC, and GermEval (Wiegand et al., 2018), which shows the significance of the problem.

Methodology
The methodology used for WNUT-2020 Task 2 consists of a pre-processing phase and a deep learning model implementation phase.

Pre-processing
This phase consists of the following steps:

1. Tokenization: In this step, the entire sentence (Pitsilis et al., 2018) is split into words (tokens). The Python "nltk" package helps to split the tweet into tokens.

2. Convert tokens to lower case: The tokens from tokenization may be in lower or upper case. In this step, all tokens are converted into the same case (here, lower case).

3. Filter out punctuation: Remove all punctuation marks (Gupta and Joshi, 2017).

4. Stemming: Reduce each word to its stem ( , 1996). For example, wait is the stem word for waiting and waited.
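The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' exact code: for self-containment it uses a plain whitespace split and a naive suffix-stripping stemmer, whereas the paper uses the nltk tokenizer (and, in practice, a proper stemmer such as nltk's PorterStemmer would be used).

```python
import string

def simple_stem(token):
    # Naive suffix stripping purely for illustration ("waiting"/"waited" -> "wait");
    # a real pipeline would use a proper stemmer such as nltk's PorterStemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(tweet):
    # 1. Tokenization (whitespace split here; the paper uses the nltk tokenizer).
    tokens = tweet.split()
    # 2. Convert all tokens to lower case.
    tokens = [t.lower() for t in tokens]
    # 3. Filter out punctuation from the token edges and drop empty tokens.
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t]
    # 4. Stemming: reduce each word to its stem.
    return [simple_stem(t) for t in tokens]

print(preprocess("Waiting for COVID-19 updates!"))  # -> ['wait', 'for', 'covid-19', 'update']
```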

Deep Learning Model
The goal of WNUT-2020 Task 2 is to identify whether a given COVID-19 tweet is INFORMATIVE or UNINFORMATIVE. For this task, we used the pre-trained deep learning model RoBERTa (Robustly Optimized BERT Approach), which is an optimized version of BERT (Jawahar et al., 2019). RoBERTa has the following features:
1. Trained on up to 160 GB of data.
2. Number of training iterations increased up to 500k.
3. Trained with a batch size of 8k.
4. Larger byte-level BPE vocabulary with 50k subword units.
5. Dynamically changing the masking pattern applied to the training data.
We trained the RoBERTa model with different combinations of hyperparameters on the dataset provided by WNUT-2020. Finally, we obtained the best metrics with the following hyperparameter values:
• Batch size is 16.
• Maximum sequence length of a tweet in the dataset is 143.
• To avoid over-fitting, we set the hidden dropout to 0.05.
• Hidden size for 'roberta-base' is 768.
• 'adam' is used as the optimizer.
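Collected together, the fine-tuning setup above can be expressed as a single configuration. The key names are our illustrative choice (loosely following common Hugging Face `transformers` conventions), not the authors' exact code; the values are those reported above.

```python
# Hyperparameters used to fine-tune 'roberta-base' on the WNUT-2020 Task 2 data.
# Values are from the paper; the key names are illustrative.
config = {
    "model_name": "roberta-base",
    "num_labels": 2,              # INFORMATIVE vs. UNINFORMATIVE
    "batch_size": 16,
    "max_seq_length": 143,        # longest tweet in the dataset
    "hidden_dropout_prob": 0.05,  # lowered to reduce over-fitting
    "hidden_size": 768,           # fixed by the roberta-base architecture
    "optimizer": "adam",
}
print(config["batch_size"], config["max_seq_length"])  # -> 16 143
```

In a `transformers`-based script, `max_seq_length` would be passed to the tokenizer (via `max_length`, with padding and truncation) and the dropout and label count to the sequence-classification model's config.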

Baseline Methods
We used three baseline methods: a Random Forest (Bhagat and Patil, 2015) with a maximum depth of 26 and 500 estimators; a CNN (Zhang et al., 2018) with an embedding layer output size of 128 followed by a fully connected layer, using 'adam' as the optimizer and trained for 25 epochs with stochastic gradient descent; and pre-trained BERT (Devlin et al., 2018). The baseline methods were implemented using Scikit-learn (Pedregosa et al., 2011).
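A minimal sketch of the Random Forest baseline is shown below. The depth and estimator count are from the paper; the TF-IDF feature representation and the toy tweets are our assumptions for a self-contained example, not the authors' setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the WNUT-2020 training tweets and labels (illustrative only).
tweets = [
    "5 new confirmed cases reported in the city today",
    "I really miss going to concerts",
    "Death toll rises to 120 in the region",
    "What a beautiful morning!",
]
labels = ["INFORMATIVE", "UNINFORMATIVE", "INFORMATIVE", "UNINFORMATIVE"]

# Random Forest baseline with the hyperparameters reported in the paper.
baseline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(max_depth=26, n_estimators=500, random_state=0),
)
baseline.fit(tweets, labels)
pred = baseline.predict(["3 suspected cases found at the airport"])
print(pred[0])
```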

Results and Error Analysis
Finally, we experimented with the baseline models described in Section 3.3 and the model described in Section 3.2 using the COVID-19 English tweet datasets. The task organizers did not provide any baseline scores for this task. The proposed model improves performance on the validation data by a margin of more than 0.2 over the best baseline performance (BERT) (Karisani and Karisani, 2020). The confusion matrix for the validation set is shown in Figure 1.
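For reference, the F1-score reported for the task can be recomputed from confusion-matrix counts like those in Figure 1. The counts below are illustrative placeholders, not the actual values from the figure.

```python
# Illustrative confusion-matrix counts for the INFORMATIVE class
# (placeholders -- NOT the actual values from Figure 1).
tp, fp, fn = 420, 55, 48

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```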

Figure 1: Confusion Matrix for the validation set

Table 2: Results obtained on the Validation Set.

The RoBERTa model was trained on the training data set and tested on the validation data set, and the obtained results are shown in Table 2. Deep learning models need a large amount of data for training.

Table 3: Results obtained on the Test Data Set.