Stance Detection in COVID-19 Tweets

The prevalence of the COVID-19 pandemic in day-to-day life has yielded large amounts of stance detection data on social media sites, as users turn to social media to share their views regarding various issues related to the pandemic, e.g., stay at home mandates and wearing face masks when out in public. We set out to make use of this data by collecting the stance expressed by Twitter users, with respect to topics revolving around the pandemic. We annotate a new stance detection dataset, called COVID-19-Stance. Using this newly annotated dataset, we train several established stance detection models to ascertain a baseline performance for this specific task. To further improve the performance, we employ self-training and domain adaptation approaches to take advantage of large amounts of unlabeled data and existing stance detection datasets. The dataset, code, and other resources are available on GitHub at https://github.com/kglandt/stance-detection-in-covid-19-tweets.


Introduction
We live in unprecedented times caused by the global COVID-19 pandemic, which has forced major changes in our daily lives. Given the developments concerning COVID-19, communities and governments need to take appropriate action to mitigate the effects of the novel coronavirus, which is at the root of the pandemic. For example, states in the United States that imposed strict social distancing mandates were able to slow the growth of the virus within their communities (Courtemanche et al., 2020). For such measures to work, however, it is important that the public fully adhere to these guidelines and mandates. "Pandemic fatigue," which sets in when people become tired of pandemic mandates and begin to ease their adherence, can lead to resurgences of the novel coronavirus (Feuer and Rattner, 2020). To reduce the spread of COVID-19, it is essential to understand the public's opinion on the various initiatives, such as stay at home orders, wearing a face mask in public, school closures, etc. Understanding how the public feels about these mandates could help health officials better estimate the expected efficacy of their mandates, as well as detect pandemic fatigue before it leads to a serious resurgence of the virus.
In the era of Web 2.0, and especially during a pandemic in which people often resort to online communications, social media platforms provide an astounding amount of data relating to the stance and views held by various populations with respect to a variety of current and important topics. However, the total amount of data that is being generated each second makes it impossible for humans alone to fully make use of it. Fortunately, recent developments in deep learning have yielded state-of-the-art performance in text classification. This makes deep learning an ideal solution for extracting and making sense of the large amounts of data currently in circulation on social media sites.
In particular, given the current events, it is evident that automated approaches for detecting the stance of the population towards targets, such as health mandates related to COVID-19, using Twitter posts, or tweets, can help gauge the level of cooperation with the mandates. Stance detection is a natural language processing (NLP) task in which the goal is for a machine to learn how to automatically determine from text alone an author's stance, or perspective/view, towards a controversial topic, or target. Research in the area of stance detection has yielded accurate results, especially for United States politics (Mohammad et al., 2017; Ghosh et al., 2019; Xu et al., 2020). However, research on stance detection for targets relevant to COVID-19 health mandates lags behind, due to the recency of the pandemic and a lack of benchmark datasets. We set out to address this problem by constructing a COVID-19 stance detection dataset (called COVID-19-Stance), which includes tweets that express views towards four targets, specifically "Anthony S. Fauci, M.D.", "Keeping Schools Closed", "Stay at Home Orders", and "Wearing a Face Mask." This is a challenging task, which is related to but different from sentiment analysis: a tweet may express support for a target while using negative language and expressing a negative sentiment overall. Furthermore, the opinion expressed in a tweet may not be explicitly directed at the target of interest, while the stance can still be implicitly inferred. Some examples of tweet/target pairs labeled with respect to stance, target of opinion, and sentiment are shown in Table 1 to illustrate the above-mentioned challenges.

Table 1: Examples of tweet/target pairs from the COVID-19-Stance dataset, manually annotated with respect to the user's stance towards the target, the way the stance opinion was expressed, and the overall sentiment of the tweet.
To address the stance detection task, carefully designed approaches are needed to extract language patterns informative with respect to stance. We provide a comprehensive set of baseline results for the newly constructed COVID-19-Stance dataset, including results with established supervised baselines for stance detection tasks, and also baselines that employ approaches for handling small amounts of labeled data, including self-training and domain adaptation approaches. In summary, the contributions of this work are as follows:

• We construct a COVID-19-Stance dataset that consists of 6,133 tweets covering the user's stance towards four targets relevant to COVID-19 health mandates. The tweets are manually annotated for stance according to three categories: in-favor, against, and neither.
• We establish baseline results using state-of-the-art supervised stance detection models, including transformer-based models.
• We also establish baselines for self-training and domain adaptation approaches that use unlabeled data from the current task, or labeled data from a related task, to compensate for the limited labeled data for the current task.

Related Work
We discuss related work in terms of existing datasets and approaches for stance detection.

Stance Detection Datasets
Recent work on stance detection in social media data has been facilitated by Mohammad et al. (2016, 2017), who constructed a manually annotated stance detection dataset, shared publicly as SemEval2016 Task 6. The dataset was based on tweets about United States politics, collected during the lead up to the 2016 United States presidential election. Given a set of politics-relevant targets (e.g., politicians, feminism, climate change), the initial selection of tweets to be included in the dataset was done using "query hashtags", i.e., Twitter hashtags from a manually curated shortlist that had been observed to correlate with stances towards the targets on Twitter. Subsequently, tweet/target pairs were annotated by CrowdFlower workers, who were provided with a generic but detailed questionnaire regarding the stance of a tweet's author toward a target, as well as the sentiment of the tweet (Mohammad et al., 2016, 2017). Several other datasets for stance detection have become available in the last few years, including a large dataset (containing approximately 50,000 tweets) focused on the stance towards financial transactions that involve mergers and acquisitions (Conforti et al., 2020), a dataset for identifying the stance in Twitter replies and quotes (Villa-Cox et al., 2020), datasets in languages other than English (Hercig et al., 2017; Vychegzhanin and Kotelnikov, 2019; Evrard et al., 2020), and multilingual datasets (Zotova et al., 2020; Vamvas and Sennrich, 2020; Lai et al., 2020).
Furthermore, the global prevalence and impact of the COVID-19 pandemic has led to the quick development, concurrently with our work, of several COVID-19 stance-related Twitter datasets (Mutlu et al., 2020; Miao et al., 2020; Hossain et al., 2020). Mutlu et al. (2020) published a dataset of approximately 14,000 tweets (called COVID-CQ), which were manually annotated with respect to the author's stance regarding the use of hydroxychloroquine in the treatment of COVID-19 patients. Miao et al. (2020) constructed a dataset focused on the author's stance towards lockdown regulations in New York City. The authors used keywords related to "lockdown" and "New York City" and extracted approximately 31,000 relevant tweets from a large COVID-19 tweet dataset published by Chen et al. (2020). They manually annotated 1,629 tweets with respect to stance, while the remaining tweets were left unlabeled.
Our dataset construction procedure is similar to the one followed by Miao et al. (2020), but we label data for four targets using global English tweets, as opposed to Miao et al. (2020) who label data for just one target ("lockdown") in one location ("New York City").

Stance Detection Approaches
In terms of approaches used for stance detection, strong baseline results based on support vector machines (SVM) with manually engineered features were provided for the SemEval2016 Task 6 by Mohammad et al. (2016, 2017). Deep learning approaches used in SemEval2016 Task 6 included recurrent neural networks (RNNs) (Zarrella and Marsh, 2016) and convolutional neural networks (CNNs) (Vijayaraghavan et al., 2016; Wei et al., 2016). Such approaches used the tweets as input, but did not use any target-specific information, and did not outperform the SVM baselines. Later approaches were provided with both target and tweet representations as input, and employed RNNs and/or CNNs, together with the attention mechanism (Augenstein et al., 2016; Du et al., 2017; Zhou and Cristea, 2017; Sun et al., 2018; Siddiqua et al., 2019), to improve upon the performance of the SVM baselines.
Given the dominance of transformers (Vaswani et al., 2017), especially bidirectional encoder representations from transformers (BERT) (Devlin et al., 2019), in NLP tasks, some recent works (Slovikovskaya and Attardi, 2020;Li and Caragea, 2021;Ghosh et al., 2019) have focused on investigating the use of BERT models for stance detection. For example, Ghosh et al. (2019) explored the reproducibility of approaches for stance detection and compared them to BERT. They found BERT to be the best model overall for stance detection on the SemEval2016 Task 6. Li and Caragea (2021) also explored BERT based models with data augmentation and found BERT to be a powerful model for stance detection. Thus, we have selected BERT as a strong baseline for our paper.
Several works have shown that auxiliary information, such as sentiment and emotion information, or the subjective/objective nature of a text (provided as additional inputs or presented in the form of auxiliary tasks in a multi-task framework), can help improve the performance obtained from the tweet/target information alone (Mohammad et al., 2017; Sun et al., 2019; Li and Caragea, 2019; Hosseinia et al., 2020; Xu et al., 2020). Other approaches to improve the performance, especially when the amount of labeled data for the task of interest is small, include weak supervision (Wei et al., 2019) and knowledge distillation (Miao et al., 2020); transfer learning through distant supervision (Zarrella and Marsh, 2016) or pre-trained models (Ebner et al., 2019; Hosseinia et al., 2020); and domain adaptation from a source task to the target task (Xu et al., 2018, 2020). In particular, the Dual-view Adaptation Network (DAN) (Xu et al., 2020) learns to predict the stance of a tweet by combining the subjective and objective views/representations of the tweet, while also learning to adapt them across domains. We use an adaptation of the DAN model as a strong baseline in this work. Most relevant to our work on COVID-19-Stance, Miao et al. (2020) compared a supervised in-domain BERT model trained and tested on "lockdown" tweets with cross-domain models and knowledge distillation variants. The results showed significantly improved performance for the knowledge distillation variants, and emphasized the importance of having a small amount of data for the task of interest (as a better alternative to zero-shot learning). Similar to Miao et al. (2020), we also use BERT together with knowledge distillation/self-training as a strong baseline.

COVID-19-Stance Dataset
The recency of the COVID-19 pandemic meant that there was no established stance detection dataset for this broader topic when we began our research. Therefore, we set out to construct our own dataset, called COVID-19-Stance, by following the methodology introduced by Mohammad et al. (2016, 2017), which is generic and applicable to any controversial topic discussed on Twitter.
Data collection. We began crawling Twitter, using the Twitter Streaming API, on February 27th, 2020. We collected tweets that contained general keywords pertaining to the novel coronavirus (e.g., "coronavirus", "covid-19", "corona virus", "#covid19", etc.). As new hashtags emerged, we iteratively added additional, more specific keywords to the search (e.g., "#lockdown", "stay at home", "#socialdistancing", "#washhands", etc.). We continued crawling until August 20th, 2020. The full list of keywords used over this time period is provided in Appendix A. We only stored original tweets (not retweets or quoted tweets) that contained no hyperlinks, and ended up collecting a grand total of 30,331,993 tweets.

Target selection. After analyzing the initial tweets, and following the developments of the COVID-19 events, we began to identify controversial topics that arose as the virus continued its spread in the United States (US). Four topics that we found to be among the most prevalent in our collection of tweets, and that are understood by a large number of people in the US, were "Stay at Home Orders", "Wearing a Face Mask", "Keeping Schools Closed", and "Anthony S. Fauci, M.D.".
Data selection. Similar to Mohammad et al. (2016), we used query hashtags for the four main targets/topics selected, and began to collect and organize the tweets according to topic and likely labels. For example, if "#FireFauci" is contained within a tweet, it is likely that the author of that tweet is posting information indicating that they do not support the current director of the National Institute of Allergy and Infectious Diseases (NIAID), Anthony S. Fauci, M.D. For each of the four selected targets, we identified two types of query hashtags, specifically "in-favor" hashtags and "against" hashtags (stance-neutral hashtags were very rare). The exact query hashtags identified for each target are shown in Table 2. Using the "in-favor" and "against" query hashtags, we selected a "noisy stance set" of tweets for each target, as shown in Table 3. Out of the total number of tweets corresponding to a target, we further selected a relatively balanced (in terms of in-favor and against noisy labels) dataset to be manually labeled, and another relatively balanced dataset of tweets to be used as unlabeled data in the self-training approach. The exact numbers of tweets to be labeled and to be used as unlabeled data are shown in Table 4.
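The hashtag-based noisy labeling described above can be sketched as follows. Only "#FireFauci" is taken from the text; the in-favor hashtag and the helper name are placeholders for illustration (the real per-target lists are those in Table 2).

```python
# Illustrative sketch of assigning noisy stance labels from query hashtags.
# "#firefauci" appears in the text; the in-favor entry is a placeholder.
IN_FAVOR_TAGS = {"#thankyoufauci"}   # placeholder example
AGAINST_TAGS = {"#firefauci"}

def noisy_stance(tweet_text):
    """Return 'in-favor'/'against' from query hashtags, or None if ambiguous."""
    tags = {tok.lower().strip(".,!?") for tok in tweet_text.split()
            if tok.startswith("#")}
    if tags & IN_FAVOR_TAGS and not tags & AGAINST_TAGS:
        return "in-favor"
    if tags & AGAINST_TAGS and not tags & IN_FAVOR_TAGS:
        return "against"
    return None  # no query hashtag, or conflicting hashtags: not selected
```

Tweets matching hashtags from both lists, or neither, fall through to `None` and are simply not selected into the noisy stance set.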
Data annotation. Although query hashtags are useful for selecting likely relevant tweets, they are noisy and not reliable enough to accurately identify the stance of a tweet towards a target (see Table 5 for some examples illustrating this point). Therefore, we used Amazon Mechanical Turk (AMT) to enlist the help of gig workers to analyze and label our collection of 7,122 tweets selected to be labeled (the exact number of tweets for each target is shown in Table 4). We removed the hashtags that appeared at the end of a tweet to exclude obvious cues, without making the tweet syntactically ambiguous. This increases the chance that our collection contains tweets that do not explicitly mention the target, and potentially some tweets with a neutral stance towards the target. Each tweet was labeled by three annotators. Each annotator was shown a page with one tweet and one target at a time, and asked to answer a questionnaire designed and detailed by Mohammad et al. (2017). The questionnaire, shown in Appendix B, contains detailed questions and multiple-choice answers that allow us to annotate each tweet with respect to three criteria: (1) the stance of the tweet's author/user towards the given target (in favor, against, or neither); (2) the way the opinion is expressed, which captures whether the text of the tweet reveals the stance explicitly, implicitly, or neither; and (3) the sentiment of the tweet, which essentially captures the language used in the tweet (positive, negative, both, sarcasm, or neither).
Our final COVID-19-Stance dataset contains only tweets for which at least two out of the three annotators agreed on the stance category. The Cohen's Kappa scores that we obtained for inter-annotator agreement for the final dataset were 0.82 for stance, 0.83 for target of opinion, and 0.60 for sentiment. According to Cohen (1960), the scores for stance and target of opinion represent almost perfect agreement, while the score for sentiment shows substantial agreement. Table 1 shows several examples of annotated tweets in our dataset.
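For concreteness, the agreement filtering and Cohen's kappa can be sketched in a few lines of pure Python. Note that Cohen's kappa is defined for two raters; with three annotators, averaging the pairwise scores is one common choice (an assumption here, not stated above).

```python
from collections import Counter

def majority_label(labels):
    """Keep a tweet only when at least two of the three annotators agree."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None

def cohen_kappa(a, b):
    """Cohen's kappa for two raters over parallel label lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    cats = set(a) | set(b)
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in cats)  # chance
    return (p_o - p_e) / (1 - p_e)
```

For example, `majority_label(["favor", "against", "neither"])` returns `None`, so such a tweet would be excluded from the final dataset.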
Dataset statistics. The number of tweets for each target and the stance distribution for each target are shown in Table 6. The number of tweets for each target over the months when data was crawled is graphically displayed in Figure 1, which shows that a large number of the tweets in our dataset were posted in July 2020. The distribution of the type of opinion is shown in Tables 7 and 8, for each target and each stance, respectively. Similarly, the distribution of the sentiment (or tweet language) is shown in Tables 9 and 10, for each target and each stance, respectively. As can be seen from these tables, our dataset contains a good mix of in-favor, against, and neither categories, and also a good mix of tweets with implicit and explicit opinion towards the target. However, the sentiment is generally negative or in the other category (which includes both positive and negative, sarcastic language, and neither). Together, these characteristics make our task both realistic and challenging. While we only use the stance label in this work, the other labels will be explored in future work, as auxiliary information potentially useful for stance detection.
Benchmark subsets. To enable progress on COVID-19 stance detection, and to facilitate comparisons between models developed for this task, we randomly split our COVID-19-Stance dataset (using stratified sampling) into training (Train), development (Val), and test (Test) subsets. We used the training subset to train our models, the development subset to select hyperparameters, and the test subset to evaluate the final performance of the models. Statistics for the dataset, in terms of the number of tweets in the Train, Val, and Test subsets, are shown in Table 11.
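A minimal sketch of such a stratified split, shuffling within each stance class so the label distribution is preserved across subsets. The 70/15/15 ratios and the function name are our own illustration; the actual subset sizes are those reported in Table 11.

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, ratios=(0.7, 0.15, 0.15), seed=0):
    """Split examples into train/val/test, preserving the label distribution."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, y in zip(examples, labels):
        by_label[y].append(ex)
    train, val, test = [], [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_train = round(ratios[0] * len(items))
        n_val = round(ratios[1] * len(items))
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]
    return train, val, test
```

Splitting per class and then concatenating guarantees that each stance category appears in all three subsets in roughly the chosen proportions.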

Baseline Models
Having described our COVID-19-Stance dataset, we now briefly review several models that we use to establish baseline results on this dataset.

Supervised Baseline Models
To get a baseline understanding of how established stance detection networks perform on our dataset, we used the following models:

• BiLSTM: Bi-Directional Long Short-Term Memory networks (Schuster and Paliwal, 1997) take tweets as input, and are trained to predict the stance towards a target without explicitly using the target information.
• Kim-CNN: Convolutional Neural Networks for text, proposed by Kim (2014), are also provided with tweets as input, and trained to predict the stance towards a target, without explicitly using the target information.
• TAN: The Target-specific Attention Network (Du et al., 2017) is a recurrent model that uses the target information explicitly, and identifies specific target features using an attention mechanism.
• GCAE: The Gated Convolutional Network with Aspect Embedding (Xue and Li, 2018) is based on a CNN model. In addition to tweets, it also has information about the target, and uses a gating mechanism to block target-unrelated information.
• BERT: Bidirectional Encoder Representations from Transformers (Devlin et al., 2019) are language models that are pre-trained on a large unlabeled corpus to encode sentences and their tokens into dense vector representations. We used the pre-trained COVID-Twitter-BERT model (Müller et al., 2020).

Self-training Baseline
Given that a large amount of unlabeled data is available for each target included in our COVID-19-Stance dataset, we explored the use of a self-training approach that can make use of unlabeled data, as described below:

• BERT-NS: Self-training with Noisy Student (Xie et al., 2020) is a semi-supervised learning approach that employs self-training and knowledge distillation (Hinton et al., 2015) to improve the performance of a teacher model using unlabeled data. More specifically, a teacher is originally trained from the available labeled data, and is used to predict pseudo-labels for the unlabeled data. Subsequently, a noisy student model is trained using the labeled and pseudo-labeled data. By replacing the teacher with the student, the process can be iterated several times. In our work, we performed just one iteration. Both the teacher and the student models were COVID-Twitter-BERT, with a softmax layer at the top.
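A single Noisy Student round, as used here (one iteration), can be sketched framework-agnostically. `teacher` and `train_fn` stand in for the COVID-Twitter-BERT teacher and student training described above, and the confidence threshold is an optional knob, not a detail taken from the text.

```python
def noisy_student_round(teacher, train_fn, labeled, unlabeled, threshold=0.0):
    """One round of Noisy Student self-training.

    `teacher` maps a tweet to (label, confidence); `train_fn` trains and
    returns a new (noised) student from (tweet, label) pairs. Both are
    placeholders for the BERT models described above.
    """
    pseudo = []
    for tweet in unlabeled:
        label, conf = teacher(tweet)
        if conf >= threshold:          # optionally keep only confident labels
            pseudo.append((tweet, label))
    # The student sees both gold labels and the teacher's pseudo-labels.
    return train_fn(labeled + pseudo)
```

Iterating the procedure simply means calling `noisy_student_round` again with the returned student playing the role of the teacher.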

Domain Adaptation Baseline
To understand the benefits of using a prior stance detection dataset in addition to the dataset we constructed, we experimented with a domain adaptation model, as described below:

• BERT-DAN: Dual-view Attention Networks (Xu et al., 2020) explicitly capture the subjective and objective information contained in tweets, and also enable the use of labeled data from a prior, related task to train a model for a current task of interest. The original DAN model proposed by Xu et al. (2020) makes use of BiLSTM networks and domain adversarial networks to learn the subjective and objective representations and make them domain invariant. At the same time, DAN learns to predict the stance using labeled data from the prior task (under the assumption that no labeled data is available for the task of interest). Compared to the original DAN model, we replaced the BiLSTM networks with pre-trained COVID-Twitter-BERT models, and trained the network to predict the stance using labeled data from both the prior task and the current task. The prior data was the whole SemEval2016 Task 6 dataset.

Implementation Details
Data pre-processing. Before the tweets in our dataset were used for training, they were pre-processed and transformed into embedded tensors. For every tweet in the dataset, we removed any emojis, URLs, and reserved words. We then used the pre-trained COVID-Twitter-BERT to tokenize and embed each tweet, truncating the sequence length to 128 as needed.
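The cleaning step can be sketched as below; the regular expressions are our own approximation of "emojis, URLs, and reserved words", not the authors' code, and the emoji ranges cover only the common blocks.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
RESERVED_RE = re.compile(r"\b(RT|FAV)\b")  # Twitter reserved words
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def clean_tweet(text):
    """Strip URLs, Twitter reserved words, and emojis, then squeeze whitespace."""
    text = URL_RE.sub("", text)
    text = RESERVED_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return " ".join(text.split())
```

The cleaned text would then be passed to the BERT tokenizer, which handles the truncation to 128 tokens.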
Hyperparameters. The validation set was used to determine generally good hyperparameters for the models. For each non-BERT supervised model, the Adam optimizer was used with a learning rate of 1e-5, weight decay of 4e-5, and gradient clipping with a max norm of 4.0. Each model was trained for 120 epochs, with a mini-batch size of 16 in each iteration. A dropout of 0.5 was used for each network. Other network-specific hyperparameters are listed below:

• RNN Networks: BiLSTM, ATGRU, and TAN each had a hidden LSTM dimension of 512 with a dropout of 0.2.
• CNN Networks: GCAE and Kim-CNN both used filters of width 2, 3, 4, and 5. For each filter width, there were 25 feature maps. Following the convolutional layers was a linear classifier with a hidden dimension of 128.
• BERT: This model was initialized with the pre-trained COVID-Twitter-BERT model. It was optimized with AdamW with a learning rate of 1e-5 over the course of 10 epochs, with 15 warmup steps.
• BERT-NS: The implementation of the student model is exactly the same as that of the supervised BERT. The teacher and the student models are set up in the same manner, except that the teacher has no dropout.
• BERT-DAN: The base encoders are the same as those of the supervised BERT model, except that there is no softmax layer on top. The discriminators and classifiers were all two-layer neural networks with a hidden dimension of 1024. A dropout of 0.15 was used throughout the network. Optimization was performed by AdamW with a learning rate of 3e-6 for the first 7 epochs, and 3e-7 for the final 3 epochs. The following weights were assigned to this network's loss functions: 0.1 for the domain discriminators, 0.05 for the objective and subjective classifiers, and 0.4 for the source stance classifier. A mini-batch size of 4 was used due to GPU memory limitations.
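Under the assumption that the per-component weights above enter a simple weighted sum (the exact combination formula is not spelled out here, and leaving the target stance loss unweighted is our assumption), the total BERT-DAN training loss can be sketched as:

```python
def total_dan_loss(target_stance, source_stance, subj, obj, disc,
                   w_src=0.4, w_view=0.05, w_disc=0.1):
    """Weighted sum of BERT-DAN loss terms, using the weights reported above:
    0.4 for the source stance classifier, 0.05 for the subjective/objective
    classifiers, and 0.1 for the domain discriminators."""
    return (target_stance + w_src * source_stance
            + w_view * (subj + obj) + w_disc * disc)
```

The small weights on the auxiliary terms keep the target-task stance loss dominant while still propagating the adversarial and dual-view signals.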

Evaluation Metrics
To evaluate the performance of the baseline models on our dataset, we used the following standard metrics: accuracy, (macro-averaged) precision, recall, and F1 score (precision, recall, and F1 scores for each stance category are also reported in Appendix C). We report the performance on the test set at the epoch in which the model recorded the highest F1 score on the validation data. We performed 3 independent runs for each model to account for variability, and report average results over the three runs.
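As a reference point, the macro-averaged F1 used above can be computed in pure Python; following the per-class breakdown in the appendix tables, the average here is taken over all three stance classes.

```python
def macro_f1(y_true, y_pred, labels=("favor", "against", "neither")):
    """Macro-averaged F1: per-class F1 scores, averaged with equal weight."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because each class contributes equally regardless of its frequency, macro averaging penalizes models that ignore the rarer stance categories.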

Results and Discussion
The results of the experiments are shown in Table 12 for the four targets in the COVID-19-Stance dataset. Between the two supervised baselines that do not explicitly use the target information, BiLSTM and Kim-CNN, the BiLSTM gives better results overall, in all metrics, except for the "Wearing a Face Mask" target. When comparing Kim-CNN with GCAE (a CNN-based model that explicitly uses the target), Kim-CNN gives better accuracy and F1 scores for two targets ("Anthony S. Fauci, M.D." and "Stay At Home Orders"), while the GCAE model gives better results for the other two targets ("Keeping Schools Closed" and "Wearing a Face Mask"). Similarly, when comparing the two recurrent models with attention, TAN and ATGRU, TAN performs better on two targets, "Keeping Schools Closed" and "Stay At Home Orders", while ATGRU performs better on "Anthony S. Fauci, M.D." and "Wearing a Face Mask". Surprisingly, these two models, which explicitly use the target information, perform worse than the BiLSTM model overall. Finally, we can see that among the supervised baselines, the BERT model performs significantly better than all the other models, a result that is in agreement with prior works (Ghosh et al., 2019; Miao et al., 2020). When comparing BERT with BERT-NS and BERT-DAN (models that use unlabeled data and SemEval2016 Task 6 data, respectively), we see that BERT performs better than the models that use additional information on the "Stay At Home Orders" target and comparably to BERT-NS on the "Keeping Schools Closed" target, i.e., the targets with smaller labeled datasets. On the other hand, BERT-DAN performs best on the "Anthony S. Fauci, M.D." target, and comparably to BERT-NS on the "Wearing a Face Mask" target, i.e., the targets with larger labeled datasets. This result suggests that a larger amount of labeled data is useful for the domain adaptation approach. However, when only a small amount of labeled data is available, BERT is better than the noisy student, which may not start with a very good teacher.
Error Analysis. To better understand how two of our best models would perform in the wild, we include some of their predictions on examples from the "Wearing a Face Mask" test set, along with the gold-standard labels, in Table 13. As we can see, both models perform well on examples where the stance is presented explicitly, such as in tweets 1 and 2. However, the models generally struggle with sarcasm and humor, as seen in tweets 3, 5, and 6. Both models also demonstrate a strong bias towards certain phrases, such as "form of government control", which is a common phrase in AGAINST tweets for "Wearing a Face Mask". Interestingly, the noisy student model seems more likely than the DAN model to incorrectly predict a FAVOR stance when the sentiment of the tweet is positive, as seen in tweets 7 and 8.

Conclusions and Future Work
In this work, we have constructed a COVID-19-Stance dataset that can be used to further research on stance detection, especially in the context of the COVID-19 pandemic. In addition to the dataset, we have established baselines using several supervised models from prior work on stance detection, as well as two models that can make use of unlabeled data and of data from a prior stance detection task, respectively. Our results show that the pre-trained COVID-Twitter-BERT model constitutes a strong baseline. When a larger amount of labeled data is available for a target, BERT-NS and BERT-DAN can help further improve the performance. As part of future work, we plan to study the benefits of the opinion and sentiment labels that we annotated for stance detection. We also plan to study the usefulness of multi-task learning, where we train models for all our targets concurrently. Other transfer learning approaches that can leverage existing datasets will also be explored.

Ethics and Impact Statement
Our dataset does not provide any personally identifiable information, as only the tweet IDs and human-annotated stance labels will be shared. Thus, our dataset complies with Twitter's information privacy policy. The research enabled by this dataset has the potential to help officials and health organizations understand the public's opinion on various initiatives, estimate the efficacy of their mandates, and prevent a serious resurgence of the novel coronavirus.

Table 15: Performance of the baseline models for stance detection on the target "Keeping Schools Closed". Average performance over three classes, as well as performance per class, is reported in terms of accuracy (Acc), precision (Pr), recall (Re), and F1 score (F1). Each baseline was trained and evaluated three times; the reported results are averaged over the three runs.

Table 17: Performance of the baseline models for stance detection on the target "Wearing a Face Mask". Average performance over three classes, as well as performance per class, is reported in terms of accuracy (Acc), precision (Pr), recall (Re), and F1 score (F1). Each baseline was trained and evaluated three times; the reported results are averaged over the three runs.