Fine-tuning Transformers for Identifying Self-Reporting Potential Cases and Symptoms of COVID-19 in Tweets

We describe our straightforward approach for Tasks 5 and 6 of the 2021 Social Media Mining for Health Applications (SMM4H) shared tasks. Our system is based on fine-tuning DistilBERT on each task, as well as first fine-tuning the model on the other task. In this paper, we additionally explore how much fine-tuning is necessary for accurately classifying tweets as containing self-reported COVID-19 symptoms (Task 5) or whether a tweet related to COVID-19 is a self-report, a non-personal report, or a literature/news mention of the virus (Task 6).


Introduction
Fine-tuning off-the-shelf Transformer-based contextualized language models is a common baseline for contemporary Natural Language Processing (Ruder, 2021). When developing our system for Task 6 of the 2021 Social Media Mining for Health Applications (SMM4H) shared tasks, we quickly discovered that fine-tuning DistilBERT (Sanh et al., 2019), a smaller, distilled version of BERT (Devlin et al., 2019), outperformed training traditional, non-neural machine learning models. Fine-tuning DistilBERT on the released training set resulted in a micro-F1 of 97.60 on the Task 6 development set. While this approach was not as successful for Task 5 (binary-F1 of 51.49), in this paper we explore how much fine-tuning is necessary for these tasks and whether there are benefits to first training the model on the other task, since both are related to COVID-19.
In Task 5, tweets were extracted via manually crafted regular expressions for potential self-reported mentions of COVID-19 and then annotated by two people. 1,148 tweets were labeled as self-reporting a potential case; the remaining 6,033 tweets, which may discuss COVID-19 but do not specifically report a potential case for the user or their household, were labeled as "Other." Systems were ranked by F1-score for the "potential case" class.
In Task 6, systems must determine whether a tweet related to COVID-19 is a self-report, a non-personal report, or a literature/news mention of the virus. 1,421 released examples are labeled as self-reports, 3,567 as non-personal reports, and 4,464 as literature/news mentions. Systems were evaluated by micro-F1 score. Table 1 includes example tweets from the development sets.

Method
We fine-tuned DistilBERT using the implementation developed and released in HuggingFace's transformers library (Wolf et al., 2020). We trained the model for 3 epochs, using a batch size of 64 examples, 500 warm-up steps for the learning rate scheduler, and a weight decay of 0.01. Following the recommendation of Peters et al. (2019) to add minimal task-specific hyper-parameters when fine-tuning pre-trained models, we used the remaining default hyper-parameters from the library's Trainer class. All models were trained across two NVIDIA RTX 3090s.
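For concreteness, the following is a minimal sketch of this setup using the transformers Trainer API; the checkpoint name, dataset wrapper, and variables such as train_texts and train_labels are illustrative assumptions rather than our exact code.

# Sketch of fine-tuning DistilBERT with the hyper-parameters reported above
# (3 epochs, batch size 64, 500 warm-up steps, weight decay 0.01).
# `train_texts` / `train_labels` are assumed to hold the tweets and labels.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class TweetDataset(torch.utils.data.Dataset):
    """Wraps tokenized tweets and integer labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # num_labels=3 for Task 6

train_dataset = TweetDataset(
    tokenizer(train_texts, truncation=True, padding=True), train_labels)

# Only the hyper-parameters named above are set; everything else is left
# at the Trainer defaults, following Peters et al. (2019).
args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()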

Cross-validation
We used 5-fold evaluation to determine the utility of this simple approach.

For each task, we combined the training and development sets and removed duplicate tweets, resulting in 7,174 and 9,452 annotated examples for Task 5 and Task 6, respectively. We divided the datasets into 5 folds of roughly 1,435 and 1,890 labeled examples for Task 5 and Task 6, respectively, fine-tuned models on 4 of the folds, and tested on the held-out fold. For each fold, we fine-tuned the model on an increasing number of training examples: 10, 50, 100, 175, 250, 500, 750, 1K, 1.5K, 2K, 3K, 4K, 5K, 6K, 7K, and 8K. Additionally, for both tasks, we experimented with using a model pre-trained on the other task. We hypothesized this might be beneficial as these tasks seem to be related.

Tweet: "Covid week 13 update. Week 11 kidney pain on the wane, presenting as high BP (affecting brain speed, vision, tightness in veins)." Label: Non-personal (Task 6)
Tweet: "My dad tested positive for COVID-19 earlier this week, started having difficulty breathing this morning, and is now in the ED." Label: Self report
Table 1: Examples of tweets and labels for each task, abridged for space.
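The fold-splitting and learning-curve loop described above can be sketched as follows; texts, labels, and the fine_tune_and_evaluate helper are hypothetical stand-ins for the combined dataset and the DistilBERT training and scoring procedure, not code from our system.

# Sketch of the 5-fold learning-curve experiment: for every fold we fine-tune
# on progressively larger subsets of the training folds and score on the
# held-out fold. `texts`, `labels`, and `fine_tune_and_evaluate` are assumed.
import numpy as np
from sklearn.model_selection import KFold

TRAIN_SIZES = [10, 50, 100, 175, 250, 500, 750, 1_000, 1_500,
               2_000, 3_000, 4_000, 5_000, 6_000, 7_000, 8_000]

results = {}  # (fold, n_examples) -> F1 on the held-out fold
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(texts)):
    for n in TRAIN_SIZES:
        if n > len(train_idx):
            break  # not every fold has the full 8K training examples available
        subset = np.random.choice(train_idx, size=n, replace=False)
        results[(fold, n)] = fine_tune_and_evaluate(
            train_texts=[texts[i] for i in subset],
            train_labels=[labels[i] for i in subset],
            test_texts=[texts[i] for i in test_idx],
            test_labels=[labels[i] for i in test_idx],
        )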

Results
Figure 1 shows the results of fine-tuning DistilBERT on each task. For Task 5 (left graph), when fine-tuning on 50 examples or fewer, initially training on Task 6 (dotted lines) is detrimental. When fine-tuning on between 50 and 100 training examples, first training the model on Task 6 leads to a noticeable improvement, and this continues until we fine-tune the model on 500 examples. Once we fine-tune the model on 1,000 to 3,000 examples, there is no difference between the two settings, as the models only predict the majority class "Other". As the number of training examples increases beyond this point, we begin to see large improvements and larger variance between the models trained on different folds. First training on Task 6 appears to be most beneficial when fine-tuning in this intermediate range of training examples.

Figure 1: 5-fold results. The left and right graphs respectively show binary-F1 results for Task 5 and micro-F1 results for Task 6. The y-axes indicate F1 and the x-axes indicate the number of training examples used. Dotted and solid lines respectively indicate that the model was or was not first pre-trained on the other task. Blue and orange respectively correspond to the training and development folds. The lines indicate the average across the 5 folds and the shaded areas indicate the range of results.

Table 2: Results on the official test sets available on CodaLab. Numbers indicate binary-F1 for Task 5 and micro-F1 for Task 6. Results are reported both for models fine-tuned only on the given task and for models first fine-tuned on the other task. The first line reports results for models trained on the combination of the corresponding training and development sets (7,174 examples for Task 5 and 9,452 for Task 6). The remaining lines are based on an ensemble of the 5 models trained on the corresponding number of examples, using a majority vote.
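The majority-vote ensembling mentioned in the Table 2 caption can be sketched as follows; fold_predictions is an assumed list of per-fold label predictions for the test tweets, not a structure from our released code.

# Sketch of majority-vote ensembling over the 5 fold models' predictions.
# `fold_predictions` is assumed to be a list of 5 equal-length sequences of
# predicted labels (one entry per test tweet).
from collections import Counter

def majority_vote(fold_predictions):
    """Return the most common label for each test example across folds."""
    ensembled = []
    for labels in zip(*fold_predictions):
        ensembled.append(Counter(labels).most_common(1)[0][0])
    return ensembled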

Conclusion
We discussed our straightforward approach of fine-tuning a DistilBERT model on Tasks 5 and 6 of the 2021 Social Media Mining for Health Applications shared tasks. While not attaining state-of-the-art performance, these results are competitive and demonstrate the benefit of leveraging large-scale pre-trained contextualized language models. We additionally explored the benefits of first training the model on the other task and determined when this can be beneficial. Future work might consider jointly fine-tuning a BERT-based model on both tasks using a multi-task approach, as opposed to the transfer learning approach employed here.