Improving Stance Detection with Multi-Dataset Learning and Knowledge Distillation

Stance detection determines whether the author of a text is in favor of, against, or neutral toward a specific target and provides valuable insights into important events such as the legalization of abortion. Despite significant progress on this task, one of the remaining challenges is the scarcity of annotations. Moreover, most previous works focused on hard-label training, in which meaningful similarities among categories are discarded. To address these challenges, first, we evaluate multi-target and multi-dataset training settings by training one model on all targets of each dataset and one model on datasets of different domains, respectively. We show that models can learn more universal representations with respect to targets in these settings. Second, we investigate knowledge distillation for stance detection and observe that transferring knowledge from a teacher model to a student model is beneficial in our proposed training settings. Moreover, we propose an Adaptive Knowledge Distillation (AKD) method that applies instance-specific temperature scaling to the teacher and student predictions. Results show that the multi-dataset model performs best on all datasets and that it can be further improved by the proposed AKD, outperforming the state-of-the-art by a large margin. We publicly release our code.


Introduction
People often express their stances toward specific targets (e.g., political figures or abortion) on social media. These opinions can provide valuable insights into important events, e.g., the legalization of abortion. The goal of stance detection is to determine whether the author of a text is in favor of, against, or neutral toward a specific target (Mohammad et al., 2016b; Küçük and Can, 2020; AlDayel and Magdy, 2020). Table 1 shows an example.

Tweet: We all have a duty to protect the sanctity of life...from the first cell division, to the last. #ProtectLife #pjnet #ctot #ccot #SemST
Target: Legalization of Abortion
Stance: Against

Table 1: An example of stance detection.

For the tweet in Table 1, we can infer that the author is against the legalization of abortion, implied by the presence of the words "protect the sanctity of life". Even though stance detection has received a lot of attention, one of the biggest challenges for the task is the scarcity of annotated data. Even worse, previous studies (Mohammad et al., 2017; Du et al., 2017; Wei et al., 2018; Li and Caragea, 2019; Siddiqua et al., 2019) focused on a per-target training strategy, which trains one model for each target and evaluates it on the test data corresponding to that target (which we call ad-hoc training). In this case, the model is more likely to make predictions based on specific words without fully considering the target information and, hence, to overfit in the ad-hoc training setting. Motivated by these observations, we aim to investigate the following Research Question (RQ): RQ1: Can we improve the performance of a stance detection model by training one model on all targets of each dataset, and can we improve the performance further by training one model on all datasets?
Toward this question, we evaluate two training settings: multi-target training and multi-dataset training, by training one model on each dataset and one model on five datasets of different domains, respectively. We expect the model to learn more universal representations on the combined dataset and alleviate overfitting. Moreover, compared to having many single-target models, a multi-target model or a multi-dataset model is simpler to deploy and more scientifically meaningful from the perspective of building general natural language processing systems.
Besides the limited training data, models might also overfit to the ground-truth (one-hot) labels, in which meaningful rankings are destroyed, i.e., models fail to consider the similarity among different categories during training. Knowledge distillation (Hinton et al., 2015) transfers knowledge from a teacher model to a student model by training the student model to imitate the teacher's prediction logits (which we call soft labels). It is commonly believed that the soft labels of the teacher model can benefit the student model by providing more training signals than one-hot labels. However, less attention has been paid to applying knowledge distillation to stance detection. This naturally gives rise to another research question: RQ2: Can knowledge distillation benefit the stance detection task in different training settings?
Regarding RQ2, we apply various knowledge distillation methods in both multi-target and multi-dataset training settings. For multi-target learning and multi-dataset learning, we train a teacher model and a student model on each dataset and on all datasets, respectively. Moreover, we propose an Adaptive Knowledge Distillation (AKD) method that applies instance-specific temperature scaling to the predictions of teacher and student models. Experimental results show that knowledge distillation contributes to the performance improvement of stance detection models.
Even though we show that knowledge distillation can be beneficial to the stance detection task, how to most effectively transfer knowledge to the student remains an open question. Therefore, our third research question investigates: RQ3: Which knowledge distillation setting benefits the stance detection task the most?
In this paper, we perform empirical comparisons of three knowledge distillation settings: Single→Single, Multiple→Multiple and Multiple→Single. More specifically, Single→Single indicates training only one teacher model and one student model on all datasets; Multiple→Multiple indicates distilling single-dataset teacher models into single-dataset student models, i.e., both teacher and student models are trained on one dataset; Multiple→Single indicates distilling multiple teacher models, each trained on one dataset, into one student model trained on all datasets.
In order to answer these questions, we fine-tune a pre-trained BERTweet (Nguyen et al., 2020) model for stance detection and perform self-distillation (Furlanello et al., 2018) in both multi-target and multi-dataset training settings, i.e., both teacher and student models have the same model architecture. Our contributions include the following: 1) We evaluate three training settings (ad-hoc, multi-target and multi-dataset) for stance detection and observe that models trained in the multi-target and multi-dataset settings show substantially better performance than models trained in the ad-hoc setting. The model trained on all datasets performs best, outperforming the state-of-the-art by a large margin. 2) We explore the effectiveness of knowledge distillation on stance detection, and experimental results show that knowledge distillation can help improve the performance of stance detection models. We further propose an instance-specific temperature scaling method, which achieves superior performance on five stance detection datasets. 3) We show that Single→Single consistently outperforms the other distillation settings, indicating that transferring knowledge from a well-trained teacher that learns more universal representations is more beneficial to stance detection.
Related Work

Despite significant progress on stance detection, large-scale annotated datasets remain limited and the number of training samples varies drastically between datasets. To make matters worse, previous studies (Mohammad et al., 2017; Du et al., 2017; Sun et al., 2018; Wei et al., 2018; Li and Caragea, 2019; Siddiqua et al., 2019; Sobhani et al., 2019; Li and Caragea, 2021b) adopted an ad-hoc training strategy, which means that the number of models that need to be trained is proportional to the number of targets. To address these issues, Schiller et al. (2021) explored multi-task learning for various stance detection tasks by fine-tuning the pre-trained BERT (Devlin et al., 2019) on multiple datasets. Different from Schiller et al. (2021), in this paper, we evaluate three different training settings on datasets of diverse domains, showing the improvement brought by joint training step by step. Moreover, we investigate whether knowledge distillation can help further improve the performance of stance detection models.
Knowledge distillation (Ba and Caruana, 2014; Hinton et al., 2015) aims to distill the knowledge of a teacher model into a student model and has been widely adopted and modified in computer vision (Romero et al., 2015; Gupta et al., 2016; Zagoruyko and Komodakis, 2017; Wang et al., 2019; Mirzadeh et al., 2020) and natural language processing (Kim and Rush, 2016; Sun et al., 2019; Liu et al., 2019a; Aguilar et al., 2020; Sun et al., 2020; Tong et al., 2020; Currey et al., 2020; Jiao et al., 2020). Furlanello et al. (2018) proposed self-distillation, in which the teacher and student models have identical architectures. Clark et al. (2019) further extended self-distillation to the multi-task setting, achieving superior performance over standard multi-task training. Zhang and Sabuncu (2020) attributed the success of self-distillation to the increased uncertainty and diversity in teacher predictions.
Despite recent progress in knowledge distillation, less attention has been paid to combining knowledge distillation with stance detection. Miao et al. (2020) distilled knowledge in a semi-supervised manner for COVID-19 stance detection. However, their experiments were only conducted on a small dataset whose test set contains a single target. Motivated by recent works, we comprehensively investigate self-distillation for stance detection under multi-target and multi-dataset training settings and evaluate model performance on five datasets of different domains. Moreover, we propose an instance-specific temperature scaling method to further improve self-distillation and explore how to effectively distill knowledge into the student model in a holistic way.

Model
BERTweet (Nguyen et al., 2020) is used as our base model, which is a pre-trained language model following the training procedure of RoBERTa (Liu et al., 2019c). We fine-tune the pre-trained BERTweet to predict the stance by appending a linear classification layer to the hidden representation of the [CLS] token. The input is formulated as the concatenation of the text and its corresponding target, separated by special tokens.
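To make the setup concrete, the following is a minimal sketch of this fine-tuning architecture with the Hugging Face transformers library; the checkpoint name, sequence length, and the exact pairing order of text and target are illustrative assumptions rather than the paper's released code.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class StanceClassifier(nn.Module):
    """BERTweet encoder with a linear classification layer on the
    hidden representation of the first ([CLS]-style <s>) token."""

    def __init__(self, num_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("vinai/bertweet-base")
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden[:, 0])  # logits over favor/against/neutral

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base")
# Text and target are encoded as one paired input (the order is an assumption).
enc = tokenizer("We all have a duty to protect the sanctity of life...",
                "Legalization of Abortion",
                truncation=True, max_length=128, return_tensors="pt")
logits = StanceClassifier()(enc["input_ids"], enc["attention_mask"])
```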

Joint Training
Most previous work focused on an ad-hoc training setting (Figure 1(a)) that aims to train one model for each target, which fails to exploit the potential of all the training data and is unable to learn universal representations of targets. Therefore, in order to explore the benefits of incorporating more training data, we compare the ad-hoc setting with two training settings: multi-target training and multi-dataset training, by training models on all targets of each dataset and on all targets of all datasets, respectively. More specifically, as shown in Figure 1(b), the multi-target model is trained and validated on the data of all targets of each dataset, and tested on each target separately so that its results can be compared with those of the ad-hoc models. Different from multi-target training, as shown in Figure 1(c), the multi-dataset model is trained and validated on the combination of all datasets, whose training data come from different domains. Multi-target and multi-dataset training can be considered a form of multi-task learning that helps pre-trained language models learn more generalized text representations by sharing domain-specific information across related targets.
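As a small illustration, the three settings differ only in which data are pooled into one training set; here is a sketch assuming a hypothetical load_stance_dataset helper that returns a torch Dataset of (text, target, label) triples:

```python
from torch.utils.data import ConcatDataset, DataLoader

# load_stance_dataset is an assumed helper, one Dataset per stance corpus.
semeval, mt, am, wtwt, covid = (load_stance_dataset(name) for name in
                                ["semeval", "mt", "am", "wtwt", "covid19"])

# Ad-hoc: one model per target; multi-target: all targets of one dataset;
# multi-dataset: the union of all five datasets, trained by one model.
multi_dataset_train = ConcatDataset([semeval, mt, am, wtwt, covid])
loader = DataLoader(multi_dataset_train, batch_size=32, shuffle=True)
```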

Knowledge Distillation
In this subsection, we first introduce a vanilla knowledge distillation method and then present our proposed Adaptive Knowledge Distillation (AKD) in detail.

Vanilla Knowledge Distillation
We assume that the training dataset $\mathcal{D}_{tr}$ is composed of $m$ different datasets in multi-dataset training: $\mathcal{D}_{tr} = \bigcup_{j=1}^{m} \mathcal{D}_{j}$, where each sample $(x_i, t_i, y_i)$ consists of a sequence of words $x_i$, the corresponding target $t_i$ and the hard label $y_i$. The goal is to train a fixed-capacity model that performs well on the targets of all $m$ datasets. Standard supervised learning aims to minimize the cross-entropy loss $\mathcal{L}_{CE}(p, y)$ on the training data, where $p$ denotes the softmax outputs. In knowledge distillation, a teacher-student learning mechanism is used to improve the performance of the student model. Let $p^{t}_{\tau} = \mathrm{softmax}(z^{t}/\tau)$ denote the softmax outputs of the teacher model with temperature scaling, where $\tau$ is the temperature used to scale the model predictions and $z^{t}$ is the output logits of the teacher model. The idea behind knowledge distillation (Hinton et al., 2015) is to transfer knowledge from the teacher model to the student model by minimizing the weighted sum of the cross-entropy loss between the student predictions and the hard labels and the distance loss between the student and teacher predictions:

$$\mathcal{L} = (1-\alpha)\, \mathcal{L}_{CE}(p^{s}, y) + \alpha\, \mathcal{L}_{KL}(p^{s}_{\tau}, p^{t}_{\tau}),$$

where $\mathcal{L}_{KL}$ is the Kullback-Leibler (KL) divergence loss, $p^{s}$ and $p^{s}_{\tau}$ are the student's softmax outputs without and with temperature scaling, and $\alpha$ is the hyper-parameter that balances the importance of the cross-entropy loss and the KL divergence loss.
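The distillation objective above can be written in a few lines of PyTorch. This is a generic sketch rather than the paper's exact code; the default values of tau and alpha, and the conventional tau-squared factor that keeps gradient magnitudes comparable across temperatures (Hinton et al., 2015), are assumptions.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=2.0, alpha=0.5):
    """Vanilla knowledge distillation: cross-entropy on hard labels plus
    KL divergence between temperature-scaled student/teacher outputs."""
    ce = F.cross_entropy(student_logits, labels)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    # batchmean matches the mathematical definition of KL divergence.
    kl = F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2
    return (1 - alpha) * ce + alpha * kl
```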
Adaptive Knowledge Distillation
Previous works usually apply the same amount of temperature scaling to all teacher and student predictions. However, given a training set, we would expect some samples to be more representative of their label class than others, and we hope to classify the typical examples with much greater confidence than the ambiguous ones. Accordingly, samples on which the teacher predictions show larger confidence should receive a smaller amount of temperature scaling, and vice versa. Therefore, we propose an Adaptive Knowledge Distillation (AKD) approach that applies instance-specific temperature scaling to the predictions of the teacher and student models. Formally, given the teacher output logits $z^{t}_{i}$ of sample $i$, the temperature can be written as:

$$\tau_i = \begin{cases} T_1, & \text{if } \max(\mathrm{softmax}(z^{t}_{i})) \le a_1 \\ T_2, & \text{if } a_1 < \max(\mathrm{softmax}(z^{t}_{i})) \le a_2 \\ 1, & \text{otherwise,} \end{cases}$$

where $\max(\mathrm{softmax}(z^{t}_{i}))$ is the maximum probability of the softmax output distribution, $a_1$ and $a_2$ are hyper-parameters that control the range of scaling, and $T_1$ and $T_2$ are random variables that follow uniform distributions taking values in $(b_1, b_2)$ and $(1, b_1)$, respectively, where $b_1$ and $b_2$ are hyper-parameters that control the amount of scaling. By doing so, the amount of temperature scaling applied to a sample is inversely proportional to the confidence the teacher model shows in that sample's prediction. Examples that are more challenging to classify receive more temperature scaling on their soft labels. More specifically, we use a higher temperature to soften the teacher prediction of a sample if the teacher shows lower confidence in that sample's prediction, and vice versa.
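A sketch of the instance-specific temperature rule follows, under the piecewise form reconstructed above; the thresholds a1 and a2 are illustrative assumptions, while b1 = 2 and b2 = 3 follow the values given in the Baseline Methods section.

```python
import torch

def adaptive_temperatures(teacher_logits, a1=0.5, a2=0.8, b1=2.0, b2=3.0):
    """Assign each sample a temperature based on teacher confidence:
    low-confidence samples receive stronger scaling (higher tau)."""
    conf = torch.softmax(teacher_logits, dim=-1).max(dim=-1).values
    n = conf.size(0)
    t1 = torch.empty(n).uniform_(b1, b2)   # T1 ~ U(b1, b2): strong scaling
    t2 = torch.empty(n).uniform_(1.0, b1)  # T2 ~ U(1, b1): mild scaling
    tau = torch.ones(n)                    # confident samples: no scaling
    tau = torch.where(conf <= a2, t2, tau)
    tau = torch.where(conf <= a1, t1, tau)
    return tau  # shape (batch,); divide logits by tau.unsqueeze(-1)
```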
We perform self-distillation (Furlanello et al., 2018; Zhang and Sabuncu, 2020) in both multi-target and multi-dataset training settings, i.e., both teacher and student models have the same network architecture. Self-distillation can be repeated iteratively to further improve performance: the trained student model can be treated as a new teacher model whose knowledge is further distilled into another student model. However, to better demonstrate the benefits of applying our proposed AKD approach to stance detection, we train teacher and student models only once for all distillation methods.

Experimental Settings
In this section, we first describe stance detection datasets used for evaluation and introduce the evaluation metrics. Then, we describe several baseline methods of knowledge distillation.

Datasets
Stance detection datasets of diverse domains are used to evaluate the performance of the proposed models. We train and validate the multi-dataset model on the combined dataset of SemEval, MT, AM, WT-WT and COVID-19. We then test the generalization abilities of stance detection models on the unseen datasets WT-WT-E and Election-2020. Summary statistics of these datasets are shown in Tables 2-7 and examples are shown in Table 8. The datasets used for training the multi-dataset model are described as follows.
SemEval SemEval-2016 (Mohammad et al., 2016a) is a benchmark stance dataset and contains five different targets: "Atheism", "Climate Change", "Feminist Movement", "Hillary Clinton" and "Legalization of Abortion". The dataset is annotated for detecting whether the author is against, neutral toward, or in favor of a given target. We split the train set at a 5:1 ratio into train and validation sets and removed the target "Climate Change" due to its limited and highly skewed data. The test set of each target is the same as provided by the authors.
MT The Multi-Target stance dataset (Sobhani et al., 2017) is a political dataset covering presidential candidates of the 2016 US election. It contains three sets of tweets corresponding to the target pairs "Donald Trump and Hillary Clinton", "Donald Trump and Ted Cruz", and "Hillary Clinton and Bernie Sanders". The task aims at detecting the stances (against, none or favor) toward the two targets for each tweet. The train, validation and test sets are as provided by the authors.
AM AM (Stab et al., 2018) is an argument mining dataset containing eight different topics: "Abortion", "Cloning", "Death Penalty", "Gun Control", "Marijuana Legalization", "Minimum Wage", "Nuclear Energy" and "School Uniforms". The dataset is annotated for detecting whether an argument is in support of, neutral or opposed to a given topic. Train, validation and test sets are as provided by the authors.
WT-WT WT-WT (Conforti et al., 2020b) is a financial dataset, and the task aims at detecting the stance toward merger and acquisition operations between companies. The dataset consists of four targets in the healthcare domain and one target in the entertainment domain; we train the model on the four healthcare targets. Each tweet of WT-WT is annotated with one of four labels (refute, comment, support and unrelated). We split the dataset at a 10:2:3 ratio into train, validation and test sets and removed the samples labeled "unrelated" to be consistent with the other datasets.
COVID-19 COVID-19 (Miao et al., 2020) is a stance detection dataset collected during the COVID-19 pandemic, which contains one target, "Lockdown in New York State". The dataset is annotated for detecting whether the author is in support of, neutral toward, or against the lockdown policy in New York State. We split the train set at a 5:1 ratio into train and validation sets and used the test set as provided by the authors.
Two additional datasets are used to test the generalization abilities of stance detection models (no sample from these datasets is used for training).
WT-WT-E Target "Fox and Disney" of WT-WT (Conforti et al., 2020b) in the entertainment domain is used to test the generalization ability of stance detection models.
Election-2020 Election-2020 (Grimminger and Klinger, 2021) is a political dataset covering two presidential candidates of the 2020 US election: "Donald Trump" and "Joe Biden". The task aims at detecting the stance (favor, against, neutral, neither or mixed) toward a given target. We test the generalization ability of the model on this dataset and removed the samples labeled "neither" and "mixed" to be consistent with the other datasets.

Evaluation Metrics
Following Mohammad et al. (2016a) and Sobhani et al. (2017), $F_{avg}$ and the macro-average of F1-scores ($F_{macro}$) are adopted to evaluate the performance of our models. $F_{avg}$ is calculated by averaging the F1-scores of the labels "Favor" and "Against":

$$F_{avg} = \frac{F_{favor} + F_{against}}{2}.$$

We calculate $F_{avg}$ for each target, and $F_{macro}$ is obtained by averaging $F_{avg}$ across all targets of a dataset. Further, $avgF_{m}$ is obtained by averaging $F_{macro}$ across all datasets.
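For reference, a minimal sketch of these metrics with scikit-learn; the label indices for "Favor" and "Against" are assumptions about the label encoding.

```python
from sklearn.metrics import f1_score

def f_avg(y_true, y_pred, favor=0, against=1):
    """F_avg: mean of the F1-scores of the 'Favor' and 'Against' classes
    (the neutral/none class is excluded from the average)."""
    scores = f1_score(y_true, y_pred, labels=[favor, against], average=None)
    return scores.mean()

def f_macro(per_target_f_avg):
    """F_macro: average of F_avg over the targets of one dataset.
    avgF_m is computed the same way over the per-dataset F_macro values."""
    return sum(per_target_f_avg) / len(per_target_f_avg)
```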

Baseline Methods
We run experiments with the following baseline methods: Base: The pre-trained BERTweet (Nguyen et al., 2020) is fine-tuned with the PyTorch framework for 5 epochs. The maximum sequence length is set to 128 and the batch size to 32. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 2e-5. Each experiment is conducted on a single NVIDIA V100 GPU.
KD: A vanilla knowledge distillation method with temperature scaling; the student has the same model architecture as the teacher. AKD: Our proposed adaptive knowledge distillation method with instance-specific temperature scaling; $b_1$ and $b_2$ are set to 2 and 3, respectively. AKD-plus: A variation of AKD with oversampling. First, we find the target with the largest number of training samples; let $T_{max}$ denote this number. Second, for each of the remaining targets, we oversample its sentences until we obtain $T_{max}$ training samples for that target.
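A minimal sketch of the AKD-plus oversampling step described above; duplicating randomly chosen sentences is an assumption, as the selection strategy is not specified.

```python
import random

def oversample_to_max(samples_by_target):
    """Duplicate sentences of each smaller target until every target
    has T_max training samples (the size of the largest target)."""
    t_max = max(len(s) for s in samples_by_target.values())
    balanced = {}
    for target, samples in samples_by_target.items():
        extra = [random.choice(samples) for _ in range(t_max - len(samples))]
        balanced[target] = samples + extra
    return balanced
```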

Results
In this section, we thoroughly discuss the experimental results to answer our research questions presented in §1. First, we study the performance of models in three training settings in §5.1. Then we explore the effectiveness of different distillation models on stance detection datasets and test the generalization ability of knowledge distillation models in §5.2. We finally compare the distillation models in different settings in §5.3.

Multi-Target Training and Multi-Dataset Training (RQ1)
We use $Base_{Multiple}$ and $Base_{Single}$ to indicate the base models trained in the multi-target and multi-dataset settings, respectively. Table 9 shows performance comparisons of the ad-hoc, multi-target and multi-dataset settings. Each result is the average of six runs with different initializations. First, we observe that $Base_{Multiple}$ and $Base_{Single}$ significantly outperform the model trained in the ad-hoc setting. The performance of $Base_{Multiple}$ is the same as that of $Base_{Ad\text{-}hoc}$ on the COVID-19 dataset because this dataset contains only one target.
Second, $Base_{Single}$ shows promising improvements over $Base_{Multiple}$, which demonstrates that $Base_{Single}$ learns more universal representations with respect to targets by leveraging data from datasets of diverse domains. Note that $Base_{Single}$ achieves a substantial improvement over $Base_{Multiple}$ on the SemEval and COVID-19 datasets. One possible reason is that $Base_{Multiple}$ still overfits the training data heavily and training on all datasets alleviates this overfitting. Last, we also observe that $Base_{Single}$ outperforms the current state-of-the-art models on the MT and SemEval stance datasets, demonstrating its effectiveness.

Stance Detection with Knowledge Distillation (RQ2)
Table 10 shows performance comparisons of different distillation models on five stance detection datasets. We observe that all distillation models show improvements over their base models in $avgF_{m}$, which demonstrates that knowledge distillation can benefit stance detection. Moreover, our proposed AKD model, which performs instance-specific temperature scaling, outperforms knowledge distillation with a fixed temperature per instance in both settings. Specifically, $AKD_{Multiple\rightarrow Multiple}$ and $AKD_{Single\rightarrow Single}$ outperform the vanilla knowledge distillation models by 0.68% and 1.23% in $avgF_{m}$ in the multi-target and multi-dataset training settings, respectively, which indicates that training with instance-specific temperature scaling leads to better performance. Note that the distillation models show smaller improvements in multi-dataset learning. One explanation is that knowledge distillation can be viewed as an instance-specific regularization on the softmax outputs of neural networks, and the effect of knowledge distillation diminishes as the size of the training set increases (Zhang and Sabuncu, 2020). We test the generalization ability of knowledge distillation models on the unseen WT-WT-E and Election-2020 datasets. Even though the target "Donald Trump" of the Election-2020 dataset has been seen in the training data, the task remains challenging since the target-related topics of the 2016 election are quite different from those of 2020. Table 11 shows performance comparisons of various distillation models in the multi-dataset training setting. We observe that our proposed model achieves the best performance on both datasets, showing better generalization abilities.

Different Distillation Settings (RQ3)
We further compare Single→Single distillation with several variants (Multiple→Single and Single→Single with oversampled data). Experimental results of training models in different distillation settings are shown in Table 12. First, $AKD_{Single\rightarrow Single}$ leads to significant performance gains over $AKD_{Multiple\rightarrow Multiple}$ on the SemEval and COVID-19 datasets, reinforcing the claim that multi-dataset training helps models learn more generalized text representations. Moreover, $AKD_{Single\rightarrow Single}$ consistently outperforms $AKD_{Multiple\rightarrow Single}$, indicating that transferring the knowledge from a well-trained teacher model is more beneficial to the stance detection task. Second, we observe that $AKD_{Single\rightarrow Single}$ shows improvements over $AKD\text{-}plus_{Single\rightarrow Single}$. This might be due to the difference in size between the target with the largest train set (CVS and Aetna, 5,040 sentences) and the target with the smallest train set (Atheism, 439 sentences).

Single-Dataset Fine-Tuning
Multi-task models such as MT-DNN (Liu et al., 2019b) achieve further improvements by continuing to train the model on individual tasks after multi-task training. However, we do not fine-tune the model on each dataset after multi-dataset training because our goal is to train one model for all datasets instead of one model for each dataset. Moreover, one multi-dataset model is much easier to deploy and thus has more practical value.
We evaluate the effectiveness of single-dataset fine-tuning on the base model and the distillation model in Table 13. We first train a multi-dataset model and then fine-tune it on each dataset. Unsurprisingly, single-dataset fine-tuning further improves the performance of both models.

Conclusion
In this paper, we formulated three research questions for which evidence-based answers were unknown. We conducted extensive experiments on stance detection datasets and answered the questions as follows: 1) The performance of a stance detection model can be significantly improved by training on all targets of each dataset and on multiple datasets. Moreover, the model trained on datasets of diverse domains shows superior performance over the model trained on each dataset, indicating that the multi-dataset model benefits from learning with more training data and that the multi-target model might still overfit the training data. 2) Self-distillation can further improve stance detection in both training settings. Our proposed AKD benefits stance detection the most and shows better generalization abilities than other knowledge distillation methods. 3) We explore different distillation settings and observe that Single→Single achieves the best performance overall, which indicates that distilling knowledge from a well-trained teacher is more beneficial to stance detection.
Future work includes further strengthening the multi-dataset model by incorporating more stance detection datasets. It would also be interesting to extend knowledge distillation to related tasks such as rumour detection and multilingual stance detection.

A.1 RQ1
We use $Base_{Multiple}$ and $Base_{Single}$ to indicate the base models trained in the multi-target setting and the multi-dataset setting, respectively.