Linguist Geeks on WNUT-2020 Task 2: COVID-19 Informative Tweet Identification using Progressive Trained Language Models and Data Augmentation
Vasudev Awatramani 1, Anupam Kumar 1
1 Maharaja Agrasen Institute of Technology
vasudev.w13@gmail.com, anupamkumar@mait.ac.in

Since the outbreak of COVID-19, there has been a surge of digital content on social media. The content ranges from news articles, academic reports and tweets to videos and even memes. Amid such an overabundance of data, it is crucial to distinguish information that is actually informative from content that is merely sensational, redundant or false. This work focuses on developing a language system that can differentiate between Informative and Uninformative tweets associated with COVID-19 for WNUT-2020 Shared Task 2. For this purpose, we employ deep transfer learning models such as BERT along with other techniques such as Noisy Data Augmentation and Progressive Training. The approach achieves a competitive F1-score of 0.8715 on the final testing dataset.


Introduction
The aim of WNUT 2020 Task 2 (Nguyen et al., 2020) is to produce methods that automatically classify whether an English tweet associated with the novel coronavirus, COVID-19, is informative or not. An informative tweet may report information regarding recovered, suspected, confirmed and death cases, or may include the location or travel history of such cases. To build such a system, we are provided with a dataset of 10,000 tweets, consisting of 7,000 tweets for training, 1,000 tweets for validation and 2,000 tweets for the evaluation phase.
Our solution employed an ensemble of pretrained models such as BERT (Devlin et al., 2019), fine-tuned on the task dataset (see Section 2). We also investigated text augmentation techniques such as replacing tokens with synonyms and random removal and swapping of tokens (see Section 3). The work also explored Progressive Training of a given model (see Section 4). Certain aspects of these methodologies seemed to perform well and are discussed in detail.

Related Work
Sequence Labelling or Text Classification is one of the primary tasks in Computational Linguistics. In general, the task spans different levels of scope, such as Document Level, Paragraph Level, Sentence Level and Sub-sentence Level (words or groups of words). A typical pipeline for Sequence Labelling consists of feature extraction, followed by dimensionality reduction and a classification technique. Initial approaches to feature extraction involved techniques such as Term Frequency-Inverse Document Frequency (TF-IDF) (Salton et al., 1998), Word2Vec (Goldberg et al., 2014) or Global Vectors for Word Representation (GloVe) (Pennington et al., 2014). Dimensionality reduction can help in reducing time and memory complexity if datasets contain a large vocabulary of unique words. This step is sometimes left out; nonetheless, prevalent methods for dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and t-distributed stochastic neighbor embedding (t-SNE). Early classification models such as Naive Bayes and Support Vector Machines have now been superseded by deep learning models. Such models incorporate sequential processing ability through architectures such as Recurrent Neural Networks, which can also handle text sequences of varying length.
Lately, transfer learning has taken over these pipelines, producing high-quality textual representations by exploiting the enormous corpora used in pre-training. Deep Learning methods have produced dominant performance in many domains, delivering state-of-the-art results. Transfer Learning is one such technique that has dominated this trend, with models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) producing high performance on NLP benchmarks such as GLUE.

Transfer Learning Ensemble
In our approach, we apply models like BERT that have already been pre-trained on data related to COVID-19. For this purpose, we used huggingface's transformers package (Wolf et al., 2019). The majority of the models that were tested are variations of BERT. They vary in terms of the data they have been fine-tuned on or the number of parameters, such as BERT-base or BERT-large (model architecture variants of BERT). For instance, deepset/covid_bert_base is fine-tuned on the CORD-19 dataset, whereas CT-BERT (Müller et al., 2020) is trained on the Crowdbreaks dataset (Müller et al., 2019).
In our system, we fine-tuned these models on the data of Shared Task 2 of WNUT 2020. Table 1 describes their performance in terms of accuracy and F1-score. There are slight variations in the training of these networks, such as the optimiser used or the number of epochs; however, for the major part of the system we employed the following:
1. Weighted Adam, or AdamW (Loshchilov et al., 2017), as the optimizer, as opposed to Adam.
2. The models seemed to converge with a learning rate of 3×10^-5 and an epsilon value of 1×10^-8. We tried LAMB (You et al., 2019) as well. Both of these optimisers had a similar effect on performance.
3. The number of epochs varied between 3 and 5, depending on the model.
4. Some of the models were trained on TPUs and the rest on GPUs due to computational constraints. A batch size of 128 on TPU proved more beneficial in terms of performance than smaller batch sizes for the same model.
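To make the difference between Adam and AdamW concrete, the following is a minimal sketch of a single AdamW update for one scalar parameter. The learning rate and epsilon are the values reported above; beta1, beta2 and weight_decay are common defaults assumed here for illustration and are not taken from our experiments. In practice the optimizer of a deep learning framework would be used rather than hand-written code.

```python
import math


def adamw_step(param, grad, m, v, t, lr=3e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    lr and eps follow the values reported in the text; beta1, beta2 and
    weight_decay are assumed defaults used only for illustration.
    t is the 1-based step counter; m and v are the running moments.
    """
    # Decoupled weight decay: shrink the parameter directly instead of
    # adding an L2 term to the gradient (the difference from plain Adam).
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised moments.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adaptive gradient step.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

Because the decay term is applied to the parameter itself, its strength does not depend on the adaptive scaling of the gradient, which is the motivation usually given for preferring AdamW over Adam with L2 regularisation.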
Ensemble inference is a very popular trick in machine learning competitions. We employed an averaging ensemble of the models, which gave an improved F1-score of 0.897 and an accuracy of 90.6% on the validation set. Although the improvement was marginal, we employed the ensemble strategy for our final submission as well.
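An averaging ensemble of this kind can be sketched as follows. The sketch assumes that each model outputs a probability for the INFORMATIVE class and that the averaged probability is thresholded at 0.5; whether probabilities or logits are averaged, and the threshold value, are assumptions made for illustration. The function name average_ensemble is hypothetical.

```python
def average_ensemble(prob_lists, threshold=0.5):
    """Average per-model probabilities of the INFORMATIVE class.

    prob_lists: one list per model, each holding that model's predicted
    probability of INFORMATIVE for every tweet, in the same tweet order.
    Returns (averaged probabilities, predicted labels).
    """
    if not prob_lists:
        raise ValueError("need at least one model")
    n_tweets = len(prob_lists[0])
    if any(len(probs) != n_tweets for probs in prob_lists):
        raise ValueError("all models must score the same tweets")
    # zip(*prob_lists) groups the scores of all models for each tweet.
    averaged = [sum(scores) / len(prob_lists) for scores in zip(*prob_lists)]
    labels = ["INFORMATIVE" if p >= threshold else "UNINFORMATIVE"
              for p in averaged]
    return averaged, labels
```

For example, three models scoring two tweets as [0.9, 0.2], [0.7, 0.4] and [0.8, 0.6] would yield averaged scores of roughly 0.8 and 0.4, so only the first tweet is labelled INFORMATIVE.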

Data Augmentation
Another typical trick to enhance the performance of neural networks is to employ more training data. To this end, we applied Data Augmentation techniques that are loosely inspired by those used in computer vision. Moreover, we wanted to add some noise to the input text in order to produce more robust models, particularly because tweets are ordinarily noisy due to the use of emoticons, social media lingo, short forms, symbols such as # and @, and URLs. For a given sentence in the training set, we performed one of the following procedures, selected according to a random probability:

Replacing tokens with synonyms:
Randomly choose n words from the sentence and replace each of these words with one of its WordNet synonyms.

[Table: performance of CT-BERT (Müller et al., 2020) trained using augmented training data, evaluated over the original validation set.]
Other augmentation techniques that we tried but did not use were Antonym Replacement and Sentence Augmentation with models such as GPT-2 (Radford et al., 2019). For instance, replacing words with antonyms seemed to distort the meaning of the text unless the random selection of words happened to result in some form of double negation.
Original Tweet: Frey says Minneapolis has 131 known positive cases of Covid19. So far the city has declined to give regular updates on this beyond Friday meetings. Would be nice to get some ward/neighborhood breakdowns, even if known positives don't paint the whole picture.
Augmented Text with Antonym Replacement: Frey says Minneapolis has 131 ignored positive cases of Covid19. So far the city has accepted to give irregular updates on this beyond Friday meetings. Would be nice to get some ward/neighborhood breakdowns, even if known positives don't paint the whole picture.
Similarly, the augmentations produced with GPT-2 were distorted in the cases we observed, for example through unwarranted repetition of the input sentence in the augmented text.
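The token-level augmentations considered in this work (synonym replacement, random removal and random swapping of tokens) can be sketched as below. This is a minimal illustration under stated assumptions: the small TOY_SYNONYMS table is a hypothetical stand-in for WordNet, tokens are obtained by whitespace splitting, one operation is chosen uniformly at random per sentence, and the parameters n and p are illustrative values rather than the settings used in our experiments.

```python
import random

# Hypothetical toy synonym table; the text draws synonyms from WordNet.
TOY_SYNONYMS = {"cases": ["instances"], "city": ["town"], "known": ["recognized"]}


def replace_synonyms(tokens, n, rng, synonyms=TOY_SYNONYMS):
    """Replace up to n randomly chosen tokens that have a synonym."""
    out = list(tokens)
    candidates = [i for i, tok in enumerate(out) if tok.lower() in synonyms]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(synonyms[out[i].lower()])
    return out


def random_delete(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n, rng):
    """Swap two randomly chosen positions, n times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def augment(sentence, rng, n=1, p=0.1):
    """Apply one randomly selected augmentation to a sentence."""
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Uniform choice between the three operations is an assumption.
    op = rng.choice(["synonym", "delete", "swap"])
    if op == "synonym":
        tokens = replace_synonyms(tokens, n, rng)
    elif op == "delete":
        tokens = random_delete(tokens, p, rng)
    else:
        tokens = random_swap(tokens, n, rng)
    return " ".join(tokens)
```

Passing an explicit random.Random instance as rng keeps the augmentation reproducible, which is convenient when comparing models trained on the same augmented data.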

Progressive Training of Model
Progressive Training is highly recommended among deep learning practitioners, especially for its practical utility. It may involve training the model first on a smaller segment of the problem, such as training on a smaller sample of data, or first training on a smaller number of classes as opposed to all classes. A popular variation in Image Classification, popularised by Jeremy Howard's fast.ai lectures and library (Howard et al., 2020), involves first training an image classification network on smaller images and then gradually increasing the dimensions. A similar intuition was extended to the training of Generative Adversarial Networks in the ProGAN study (Karras et al., 2017). Inspired by the concept, we developed a similar method by performing the following steps:
1. Assume the model consists of two parts: a.
In our work, we followed the above-described method, gaining a marginal improvement as shown in Table 3.
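The general idea of training first on a smaller sample of data and then on progressively larger samples can be sketched as below. This illustrates only that generic variant and is not the specific procedure used in our work. The stage fractions and the callback train_one_stage are hypothetical names and values introduced for illustration.

```python
def progressive_stages(examples, fractions=(0.25, 0.5, 1.0)):
    """Yield growing prefixes of the training data, one per stage."""
    for frac in fractions:
        if not 0.0 < frac <= 1.0:
            raise ValueError("fractions must lie in (0, 1]")
        size = max(1, int(len(examples) * frac))
        yield examples[:size]


def train_progressively(examples, train_one_stage, fractions=(0.25, 0.5, 1.0)):
    """Run a user-supplied training function on each stage in turn.

    train_one_stage(state, subset) -> state is a hypothetical callback that
    continues training from the state left by the previous stage
    (state is None for the first stage).
    """
    state = None
    for subset in progressive_stages(examples, fractions):
        state = train_one_stage(state, subset)
    return state
```

Each stage resumes from the weights of the previous one, so the early, cheaper stages act as a warm start for training on the full data.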

Results
We made 2 submissions on the test set:
a. Ensemble of various BERT models trained over the original inputs.
b. Digital Epidemiology Lab's CT-BERT model, progressively trained over augmented inputs.
From the results on the leaderboard, submission b scored better, with an F1-score of 0.8715.
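For reference, the F1-score reported here can be computed as in the sketch below, assuming that INFORMATIVE is treated as the positive class. The function name f1_score and the label strings are chosen for illustration.

```python
def f1_score(gold, predicted, positive="INFORMATIVE"):
    """F1 of the positive class from parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Since F1 is the harmonic mean of precision and recall, a system cannot score well by favouring one of the two at the expense of the other.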

Conclusion
We have outlined the motivation, design, and results of our system for WNUT-2020 Shared Task 2 on the detection of informative tweets pertaining to the COVID-19 pandemic. We employed techniques such as Transfer Learning, Ensemble Learning, Data Augmentation and Progressive Training. We hope to study the task further and come up with new findings in the future. One major aspect we would like to analyze is the timestamp of the tweets. COVID-19 has had a very dynamic impact on social media, with varying trends concerning the number of cases, information regarding vaccine trials and new developments in social distancing guidelines. Therefore, in further work, we would like to include a temporal factor associated with these tweets for more reliable prediction of their informativeness. Furthermore, cross-lingual research and models such as XLM-RoBERTa (Conneau et al., 2020) have been on the rise in NLP studies. Data augmentation from multilingual or code-mixed tweets and microblog sources about COVID-19 could therefore contribute to the robustness of systems such as ours. Lastly, to further increase the robustness of the model, we would like to explore techniques such as Knowledge Distillation.