Dartmouth CS at WNUT-2020 Task 2: Fine-Tuning BERT for Tweet Classification

We describe the systems developed for the WNUT-2020 shared task 2, identification of informative COVID-19 English Tweets. BERT is a highly performant model for Natural Language Processing tasks. We increased BERT’s performance in this classification task by fine-tuning BERT and concatenating its embeddings with Tweet-specific features and training a Support Vector Machine (SVM) for classification (henceforth called BERT+). We compared its performance to a suite of machine learning models. We used a Twitter specific data cleaning pipeline and word-level TF-IDF to extract features for the non-BERT models. BERT+ was the top performing model with an F1-score of 0.8713.


Introduction
In an effort to aid the development of automated COVID-19-related monitoring systems, the WNUT-2020 shared task 2: Identification of informative COVID-19 English Tweets tasked participants with developing systems to automatically classify Tweets as INFORMATIVE or UNINFORMATIVE. The task organizers constructed and provided a data set of 10,000 COVID-19-related Tweets for this task (Nguyen et al., 2020b).
For this language processing task, we used Google's Bidirectional Encoder Representations from Transformers (BERT) to achieve performant results. BERT uses the now-ubiquitous Transformer neural network architecture, explained in depth in "Attention is All You Need" (Vaswani et al., 2017), and garnered acclaim for obtaining new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (Devlin et al., 2019). To optimize BERT for this task, we fine-tuned the BERT-large-uncased pretrained language model on the WNUT-2020 shared task 2 data set. We then further improved performance by concatenating the fine-tuned BERT embedding vectors with Tweet-specific features and using a Support Vector Machine (SVM) for classification (BERT+).
To benchmark the performance of the BERT+ model, we compared it to five traditional classifiers. We also developed a preprocessing pipeline for data cleaning and used Term Frequency-Inverse Document Frequency (TF-IDF) to extract features for the traditional classifiers.

Pretrained BERT model
We used the BERT-large-uncased pretrained language model. This BERT model contains an encoder with 24 Transformer blocks and 16 self-attention heads, with a hidden size of 1024.
BERT generates its pretrained word and sentence level embeddings by using two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
During pretraining, BERT utilizes MLM by first selecting 15% of the input tokens for potential masking. Of this 15%, 80% are replaced with the [MASK] token, 10% are replaced by a randomly selected word, and the remaining 10% are not manipulated. The MLM objective is a cross-entropy loss on predicting the masked tokens.
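The 80/10/10 masking scheme described above can be sketched in a few lines. This is a toy illustration rather than BERT's actual implementation; the function name, the small `VOCAB`, and the selection heuristic are our own.

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "run", "see"]  # toy vocabulary for random replacement

def mlm_mask(tokens, rng):
    """Apply BERT-style MLM masking: select 15% of tokens; of those,
    replace 80% with [MASK], 10% with a random word, keep 10% unchanged."""
    out, targets = list(tokens), []
    n_select = max(1, round(0.15 * len(tokens)))
    for i in rng.sample(range(len(tokens)), n_select):
        targets.append((i, tokens[i]))   # the loss is computed only on these
        r = rng.random()
        if r < 0.8:
            out[i] = MASK                # 80%: replace with [MASK]
        elif r < 0.9:
            out[i] = rng.choice(VOCAB)   # 10%: replace with a random word
        # else: 10% keep the original token
    return out, targets
```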
BERT also uses NSP in its pretraining. The NSP objective is a binary classification loss for predicting whether two sequences follow each other. NSP uses an equal proportion of consecutive sentences from the text corpus as positive examples and randomly paired sentences as negative examples (Liu et al., 2019). The "BERT for sequence classification" model utilizes the special [CLS] classifier token as the first token in every sequence. This token contains the classification embedding of the sequence.
BERT uses the final hidden state of the [CLS] token as the aggregated sequence representation for classification tasks (Devlin et al., 2019).

Data
The data set provided consists of 10,000 English Tweets related to COVID-19 (Nguyen et al., 2020b).

Preprocessing
To clean the raw Tweets, we created a data processing pipeline to: 1) remove non-alphanumeric characters, 2) remove stop-words, 3) convert words to their lemmas, and 4) convert words to lower case. We used the Natural Language Toolkit (NLTK) Python package (Bird and Loper, 2004) for these methods.
The data set provided replaced URLs and in-Tweet mentions of other users with the HTTPURL and @USER tokens respectively.
Our data-cleaning pipeline removes these tokens when cleaning the data to avoid the models over-fitting to tokens that do not reflect the original usage within the Tweet. Similarly, we removed the # character from Tweets.
We converted emojis into word tokens so that the models would interpret the emojis as words. When using BERT, this allowed the model to generate word embeddings for these emojis. Tweets occasionally contain repeated characters used to emphasize a word, for example, "yesssss" instead of "yes". To consistently capture this kind of emphasis as a feature separate from the original word, we compressed any run of a single repeated character into two repetitions of that character.
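A minimal sketch of the Twitter-specific cleaning steps above, assuming the HTTPURL and @USER placeholder tokens from the shared-task data set. The NLTK stop-word removal, lemmatization, and emoji-to-word conversion steps are omitted here, and the function name is ours.

```python
import re

def clean_tweet(text):
    """Sketch of the Twitter-specific cleaning: drop placeholder tokens,
    drop the hashtag character, compress repeated characters, then
    remove non-alphanumeric characters and lower-case."""
    text = re.sub(r"HTTPURL|@USER", "", text)    # drop data-set placeholder tokens
    text = text.replace("#", "")                 # drop the hashtag character
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # "yesssss" -> "yess"
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)   # remove non-alphanumeric chars
    return re.sub(r"\s+", " ", text).strip().lower()
```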

Traditional ML models
Using the Sklearn Python package, we generated two separate feature vectors from the preprocessed Tweets. The first method generated feature vectors based on the raw counts of word-level unigrams in each Tweet. The second method used TF-IDF to extract word-level unigram, bigram, and trigram features. TF-IDF is a numerical statistic that weighs the frequency of a term against the frequency of the documents it appears in; it reduces the weight of common words and increases the weight of less frequent words (Ramos et al., 2003). We used Sklearn to implement five traditional machine learning models on the features described above: Logistic Regression, Multinomial Naïve-Bayes, Decision Tree, Random Forest, and K-Neighbors (Pedregosa et al., 2011), using the default hyper-parameters.
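The TF-IDF statistic described above can be illustrated with a toy computation. Note that Sklearn's TfidfVectorizer differs in the details (it smooths the IDF term and L2-normalizes each vector), but the intuition is the same; the documents and function below are ours.

```python
import math

def tf_idf(term, doc, corpus):
    """Classic TF-IDF: term frequency in the document times the log of
    inverse document frequency across the corpus."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

docs = [["covid", "cases", "rise"],
        ["covid", "vaccine", "news"],
        ["cases", "confirmed", "today"]]
# "covid" appears in 2 of 3 documents, so its weight is dampened
# relative to "rise", which appears in only 1 document.
```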

Fine-tuned BERT model
We first partitioned each Tweet into an array of word tokens. The BERT model requires that every input sequence have the same length, so we padded each array with 0s so that each entry was 128 tokens long. We selected a length of 128 tokens because the maximum number of words a Tweet can contain within the 255-character limit is 128. We added the '[SEP]' token to the end of each array to denote the end of a sequence and, because we used BERTForSequenceClassification, added the special '[CLS]' classifier token to the beginning of the array (Devlin et al., 2019). For our fine-tuning optimizer, we utilized the Adam algorithm with weight decay (AdamW) as introduced in "Decoupled Weight Decay Regularization" (Loshchilov and Hutter, 2018), with the default parameters β1 = 0.9, β2 = 0.999, and ε = 1e-8. We chose a learning rate of 2e-5 as it offered the lowest training and validation loss among learning rates between 1e-5 and 1e-4. As fine-tuning BERT required extensive computational resources, we used a Google Colab Research notebook for implementation, as it allowed for high-RAM GPU processing. This fine-tuning approach follows the original BERT paper (Devlin et al., 2019).
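The input preparation above can be sketched as follows. This is a simplified illustration: in practice a BERT tokenizer produces integer token ids and handles padding itself, whereas here we use string tokens and a '[PAD]' placeholder for clarity; the function name is ours.

```python
MAX_LEN = 128  # max words in a Tweet under the 255-character limit

def prepare_sequence(tokens):
    """Add the [CLS] and [SEP] special tokens, pad to MAX_LEN, and build
    an attention mask (1 = real token, 0 = padding)."""
    seq = ["[CLS]"] + tokens[: MAX_LEN - 2] + ["[SEP]"]
    attention_mask = [1] * len(seq) + [0] * (MAX_LEN - len(seq))
    seq = seq + ["[PAD]"] * (MAX_LEN - len(seq))
    return seq, attention_mask
```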

Fine-tuned BERT+ model
In an attempt to capture additional differences in the language of INFORMATIVE and UNINFORMATIVE Tweets that would not be observed by BERT, we extracted 1024-dimensional embeddings from the last, non-softmax layer of our fine-tuned BERT and concatenated those with seven Twitter-specific features. We then trained an SVM classifier on these concatenated feature vectors using Sklearn's SVM implementation with default hyper-parameters.
The Twitter-specific features for each Tweet were: 1) counts of the HTTPURL token, the # character, the @USER token, and emojis in the Tweet; 2) word count; 3) syllable count; and 4) a Boolean specifying whether the Tweet contains profanity. To generate some of these features, we used PyPI's profanity-check, syllables, and emojis packages (https://pypi.org/).
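The feature extraction and concatenation above can be sketched as follows. The syllable and profanity features in the paper come from PyPI packages, so the simple `estimate_syllables` helper, the tiny `PROFANE` and `EMOJI` sets, and the function name here are stand-ins of ours, not the actual implementation.

```python
import numpy as np

PROFANE = {"damn"}      # hypothetical tiny profanity list (stand-in)
EMOJI = {"😷", "🦠"}    # hypothetical tiny emoji set (stand-in)

def estimate_syllables(word):
    """Crude vowel-count stand-in for the syllables package."""
    return max(1, sum(ch in "aeiou" for ch in word.lower()))

def tweet_features(text):
    """The seven Twitter-specific features concatenated with the BERT embedding."""
    words = text.split()
    return np.array([
        text.count("HTTPURL"),                            # URL placeholder count
        text.count("#"),                                  # hashtag count
        text.count("@USER"),                              # mention placeholder count
        sum(ch in EMOJI for ch in text),                  # emoji count
        len(words),                                       # word count
        sum(estimate_syllables(w) for w in words),        # syllable count
        float(any(w.lower() in PROFANE for w in words)),  # profanity flag
    ])

# BERT+ then stacks these onto the 1024-d fine-tuned BERT embedding, e.g.:
# combined = np.hstack([bert_embedding, tweet_features(text)])
```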

Results
The results reported in subsections 5.1 and 5.2 were generated using 8-fold cross validation on the combined train and validation data sets.

Preprocessing experiments
Our best performing combination of data-cleaning methods, called the Optimal Preprocessor (OP), utilized the Twitter-lexicon-specific methods we created: 1) @USER token removal, 2) HTTPURL token removal, 3) "#" character removal, 4) repeated character compression, and 5) word representation of emojis. The performance of the OP combined with other preprocessing methods can be seen in Table 1.
To compare the performances of the TfidfVectorizer and CountVectorizer for feature extraction, we used the OP for data cleaning and the Logistic Regression model for predictions. The TfidfVectorizer consistently outperformed the CountVectorizer, with average F1-scores of 0.8422 and 0.8279, respectively.

Traditional ML models
Using data processed through our OP with features extracted with TF-IDF, we achieved the F1-scores seen in Table 2.

BERT models
To generate the results for our BERT models, we fit each BERT model on the train data set and evaluated on the validation data set. The pretrained BERT model yielded an F1-score of 0.8312, our fine-tuned BERT model yielded an F1-score of 0.8701, and the BERT+ model yielded a further improved F1-score of 0.8713 (see Table 4).

BERT Model          F1-Score
Fine-tuned BERT+    0.8713
Fine-tuned BERT     0.8701
Pre-trained BERT    0.8312

Table 4: BERT models trained on the train data set and evaluated on the validation data set.

Preprocessing and feature extraction
The method to compress repeated characters in our final data preprocessor might have improved the performance of our model by generating a common word-level feature from tokens that would otherwise have been interpreted differently. For example, if one user Tweeted "good" and another user Tweeted "gooood", these different words would now represent the same feature. While stop-word removal and word-lemma conversion are commonly used in the field of NLP, the presence of stop words and the complexity of words before lemma conversion appeared to help our machine learning models detect stylometric features and improved their performance.
For feature extraction, TF-IDF outperformed the word count vectorizer. As TF-IDF decreases the weight of terms that occur frequently across all documents and increases the weight of less common terms (Ramos et al., 2003), it is unsurprising that it outperforms simple word-count feature extraction.

Traditional ML models
With optimized data cleaning and feature extraction, our highest performing traditional models outperformed the base pretrained BERT model. This demonstrates that tuning the preprocessing steps allows these simpler models to achieve competitive performance.
Table 3 lists the features that most strongly impact the classification of Tweets in the Logistic Regression model. Given that the WNUT-2020 shared task 2 describes INFORMATIVE Tweets as Tweets that "provide information about recovered, suspected, confirmed and death cases as well as location or travel history of the cases" (Nguyen et al., 2020b), it is unsurprising that word-level features concerning cases, test results, and deaths have large weights.

BERT
Somewhat surprisingly, the pre-trained BERT model was outperformed by our Logistic Regression model on this task. This shows that even for large-scale pre-trained language models such as BERT, task-specific fine-tuning is of utmost importance.
Though the fine-tuned BERT and BERT+ models both outperformed the Logistic Regression model, the difference in performance was not large (around a 3% boost for BERT+ over Logistic Regression). We believe this is because BERT, having been trained on well-formed English sentences, is not ideal for classifying noisy Twitter data. This is why several Twitter-specific models have been proposed to deal with noisy Twitter data (Vosoughi et al., 2016; Nguyen et al., 2020a).

Conclusion & Future Work
In this paper, we have described multiple techniques for automatically identifying and classifying informative COVID-19 Tweets. We have demonstrated the applicability of Logistic Regression with an optimized data-cleaning pipeline and TF-IDF feature extraction for the task of Tweet classification. We have also demonstrated the higher performance of the BERT+ model. Automated classification of real-time data feeds will be important as the COVID-19 pandemic continues to impact the world around us.
For future work, we would like to pre-train a BERT model on a large corpus of Tweets, as Twitter's lexicon and grammatical styling differ from standard usage of the English language. We also want to compare the performance of our fine-tuned BERT model to other state-of-the-art pretrained NLP models, such as fast.ai's ULMFiT (Howard and Ruder, 2018) and OpenAI's GPT-2 (Radford et al.), on this task. Moreover, as Kim (2014) has demonstrated, a Convolutional Neural Network (CNN) trained on top of pre-trained word vectors can achieve state-of-the-art performance for sequence classification; we would like to develop a similar CNN-based method for the task of Tweet classification.

Table 2: Averaged F1-scores across k-fold cross-validation (k=8) with TF-IDF features for our suite of machine learning models.

Table 3: Top 10 features with weights of the greatest magnitude from the highest performing model (Logistic Regression).