A Multi-Task Learning Framework for Multi-Target Stance Detection

Multi-target stance detection aims to identify the stances taken toward a pair of different targets in the same text, and typically, there are multiple target pairs per dataset. Existing works generally train one model for each target pair. However, they fail to learn target-specific representations and are prone to overfitting. In this paper, we propose a new training strategy under the multi-task learning setting: training one model on all target pairs, which helps the model learn more universal representations and alleviates overfitting. Moreover, in order to extract more accurate target-specific representations, we propose a multi-task learning network which can jointly train our model with a stance (dis)agreement detection task that is designed to identify agreement and disagreement between stances in paired texts. Experimental results demonstrate that our proposed model outperforms the best-performing baseline by 12.39% in macro-averaged F1-score. Our resources are publicly available on GitHub.1


Introduction
Nowadays, people often take to social media to express their stances toward specific targets (e.g., various political figures). In the aggregate, these stances can provide valuable information for obtaining insight into important events such as presidential elections. The common stance detection task is to determine from a piece of text whether its author is in favor of, neutral toward, or against a specific target, which can be categorized as single-target stance detection (STSD) (Küçük and Can, 2020; ALDayel and Magdy, 2021). More recently, since people often comment on multiple target entities in the same text, a more challenging task, i.e., multi-target stance detection (MTSD), was designed to test whether a model can accurately predict the stance toward multiple targets in the same text (Sobhani et al., 2017). For example, for the tweet in Figure 1, the author aims at expressing stances toward two targets, Donald Trump and Hillary Clinton, implied by the presence of the words "beat" and "career politicians".

Figure 1: Example tweet: "#Trump2016 can beat #HillaryClinton as he is easily beating #JebBush ;) People R sick and tired of career politicians."

Problem statement. Given a sentence x = [w_1, w_2, t_1, w_3, ..., w_{l-1}, t_2, w_l], where t_1 and t_2 are targets, and w_i, i = 1, ..., l, denotes a non-target word, the goal of MTSD is to classify the stance toward each of these targets into one of three classes: {FAVOR, AGAINST, NONE}.

1 https://github.com/chuchun8/MTSD
Previous work focused on a per-target-pair training strategy, which trains one model for each target pair and evaluates it on the test data corresponding to that target pair (which we call "Ad-hoc" training). This framework is illustrated in Figure 2(a). However, under "Ad-hoc" training, a model is more likely to make predictions based on specific words without fully considering the target information, and hence to overfit. To address this, as shown in Figure 2(b), we propose a "Merged" training strategy that trains one model on data from all target pairs, which helps the model learn more universal representations over the whole dataset and alleviates overfitting. Furthermore, in order to extract more accurate target-specific representations, we propose a multi-task learning network which is able to jointly train our model with a stance (dis)agreement detection task that is designed to identify agreement and disagreement between stances expressed in paired-target sentences. Results show that the proposed "Merged" training setting, together with identifying whether the author expresses the same stance toward the two targets, is beneficial to MTSD.
Our contributions include: 1) We propose a "Merged" training strategy for MTSD and show that models fine-tuned on the pre-trained BERTweet (Nguyen et al., 2020) perform substantially better than strong baselines. Meanwhile, a decrease in performance is widely observed in the baseline results when using the "Merged" training strategy, making it a more challenging evaluation for MTSD; 2) We propose a multi-task learning network which considers the stance (dis)agreement detection task as an auxiliary task to further improve the performance of our proposed model; 3) Our proposed model outperforms the best-performing baseline by 12.39% in macro-averaged F1-score.

Sobhani et al. (2017) introduced the MTSD task and presented the first dataset. They also proposed an attention-based encoder-decoder (Seq2Seq) model that predicts stance labels by focusing on important parts of a tweet. Wei et al. (2018a) proposed a dynamic memory network for detecting stance. First, target-specific attention is extracted for each target. Then, a shared external memory module that maintains useful information for targets is dynamically updated. This model achieves state-of-the-art performance on the multi-target stance dataset of Sobhani et al. (2017). We use these two works as strong baselines for our evaluation. Sobhani et al. (2017) and Wei et al. (2018a) deal with MTSD by training one model for each target pair, where the model predicts the stance toward the two targets simultaneously. However, we can also solve this task by treating it as a special case of single-target stance detection (STSD). Instead of training a model that receives two targets and a sentence as input, we train, on each target pair, two STSD models that each receive one target and a sentence as input. For the example in Figure 1, we train one STSD model for the target "Donald Trump" and another for "Hillary Clinton" in an STSD manner.

Related Work
Previous studies on STSD often employ feature engineering, Convolutional Neural Networks (CNNs) (Vijayaraghavan et al., 2016; Wei et al., 2016) and Recurrent Neural Networks (RNNs) (Zarrella and Marsh, 2016) to predict the stance for a given target. One of the major limitations is that they do not consider the target information. To address this, Augenstein et al. (2016) proposed a conditional BiLSTM encoder that learns tweet representations conditioned on the respective target. More recently, inspired by the attention mechanism (Bahdanau et al., 2015), various target-specific attention-based approaches (Du et al., 2017; Sun et al., 2018; Wei et al., 2018b; Caragea, 2019, 2021) have been proposed to connect the target with the sentence representation, similar to aspect-based sentiment analysis (Hazarika et al., 2018; Majumder et al., 2018; Lin et al., 2019; Song et al., 2019). We compare the baseline models of STSD and MTSD with our proposed model in §4.4 using both "Ad-hoc" and "Merged" settings.

Approach
Previous work focused on an "Ad-hoc" training strategy, which fails to explore the full potential of the training data and is unable to learn universal representations of targets. Moreover, we observe that STSD models that do not consider target information can still perform well on the multi-target dataset, which makes MTSD an easier evaluation than intended. Therefore, in order to learn more universal representations and better evaluate the performance of models on MTSD, we propose a "Merged" training strategy that trains one model on all target pairs. More specifically, the model is trained on the training data combined from all target pairs, and tested on each target pair separately so that it can be compared with the results of the "Ad-hoc" strategy. Our proposed training strategy can be considered as a multi-task learning approach that helps pre-trained language models learn more generalized text representations by sharing domain-specific information across the related target pairs.

BERTweet (Nguyen et al., 2020) is a large-scale language model pre-trained on 850M tweets. BERTweet follows the training procedure of RoBERTa (Liu et al., 2019) and uses the same model configuration as BERT-base (Devlin et al., 2019). We fine-tune the pre-trained BERTweet on the multi-target dataset. The model architecture is shown in Figure 3. Given an input text x and a target t (t is either target 1 or target 2 in Figure 1), we pair x with t as the input sequence, feed it to BERTweet, and use the hidden representation of the [CLS] token to compute the stance prediction p̂_1 toward target t.
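The difference between the two strategies comes down to how the training data is assembled before fine-tuning. The idea can be sketched in plain Python (the Example fields and function names are illustrative, not from our released code):

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    target: str
    stance: str  # one of "FAVOR", "AGAINST", "NONE"

def merged_training_set(per_pair_data):
    """Pool the per-target-pair training sets into one set ("Merged" strategy).

    per_pair_data maps a target pair, e.g. ("Donald Trump", "Hillary Clinton"),
    to its list of training examples. "Ad-hoc" training would instead fit one
    model per key; "Merged" training concatenates all values and fits a single
    model, which is still evaluated on each target pair separately.
    """
    merged = []
    for examples in per_pair_data.values():
        merged.extend(examples)
    return merged

pairs = {
    ("Donald Trump", "Hillary Clinton"): [
        Example("#Trump2016 can beat #HillaryClinton ...", "Donald Trump", "FAVOR"),
    ],
    ("Hillary Clinton", "Bernie Sanders"): [
        Example("...", "Bernie Sanders", "NONE"),
    ],
}
print(len(merged_training_set(pairs)))  # 2
```

The single merged model thus sees every target during training, while evaluation remains per target pair so the two strategies stay directly comparable.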
In order to learn better target-specific representations, we propose a multi-task learning network that can jointly train our model with a stance (dis)agreement detection task, which is a binary classification task where the label is 1 when the author expresses the same stance toward the two targets (e.g., "FAVOR" and "FAVOR") and 0 otherwise (e.g., "FAVOR" and "AGAINST"). More specifically, given an input text x and two targets t_1 and t_2, we formulate two input sequences, one pairing x with t_1 and the other pairing x with t_2. Then we leverage the representations h_1 and h_2 of the [CLS] tokens of the two sequences to detect whether the author of the text expresses the same stance toward the two targets. The (dis)agreement class probability p̂_2 can be computed as follows:

p̂_2 = softmax(W f([h_1; h_2]) + b),

where W ∈ R^{2×2d_h}, d_h is the size of the hidden dimension, f is an activation function, and the semicolon denotes vector concatenation. Note that the main task is to identify the stance toward target t_1; the target t_2 is only used in the auxiliary task. Similarly, we predict the stance toward target t_2 in the main task, where t_1 is only used in the auxiliary task. Let D be a labeled training dataset and D_j be a mini-batch for MTSD, and let y_1 and y_2 denote the true labels for the stance detection task and the (dis)agreement task, respectively. The cross-entropy loss is used to train the model. Let L_1 and L_2 be the losses of the stance detection task and the (dis)agreement task, respectively. Then the final loss is:

L = Σ_{i∈D_j} L_1^(i) + α Σ_{i∈D_j} L_2^(i),

where i is the index of a data sample and α is a hyper-parameter that accounts for the importance of the auxiliary task. α is set to 0.5 in our experiments.
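The auxiliary label and the per-example combined loss can be sketched in plain Python (the probability vectors stand in for the softmax outputs p̂_1 and p̂_2; the names are illustrative):

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class under the predicted distribution."""
    return -math.log(probs[label])

def agreement_label(stance_t1, stance_t2):
    """Auxiliary-task label: 1 if the author takes the same stance toward both targets."""
    return 1 if stance_t1 == stance_t2 else 0

def multitask_loss(p1, y1, p2, y2, alpha=0.5):
    """Per-example final loss L_1 + alpha * L_2: the stance-detection loss plus
    the (dis)agreement loss down-weighted by the hyper-parameter alpha."""
    return cross_entropy(p1, y1) + alpha * cross_entropy(p2, y2)

# Stance distribution over {FAVOR: 0, AGAINST: 1, NONE: 2} and
# (dis)agreement distribution over {different: 0, same: 1}.
p_stance = [0.7, 0.2, 0.1]
p_agree = [0.3, 0.7]
y_agree = agreement_label("FAVOR", "FAVOR")  # -> 1
loss = multitask_loss(p_stance, 0, p_agree, y_agree)
```

Summing this quantity over a mini-batch D_j gives the training objective above; with alpha = 0 the model reduces to plain single-task stance detection.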

Dataset
To test the performance of our proposed model, we use the multi-target stance dataset (Sobhani et al., 2017) of tweets annotated with stance labels with respect to two targets. This dataset contains three different target pairs: Donald Trump and Hillary Clinton, Donald Trump and Ted Cruz, Hillary Clinton and Bernie Sanders. Table 1 provides dataset statistics. Each tweet has two stance labels concerning two targets and each label has one of the values: "FAVOR", "AGAINST" or "NONE".

Evaluation Metrics
F_p_avg and the macro-average of F1-score (F_macro) are adopted to evaluate the performance of the models. First, the F1-scores of the labels "Favor" and "Against" are calculated as follows:

F_favor = 2 P_favor R_favor / (P_favor + R_favor),
F_against = 2 P_against R_against / (P_against + R_against),

where P and R are precision and recall, respectively. After that, F_avg is calculated as:

F_avg = (F_favor + F_against) / 2.

For each target pair, we compute F_avg for each target and use F_p_avg, calculated as the average of F_avg over the two targets, as our evaluation metric. Moreover, we obtain F_macro by averaging F_p_avg over all target pairs.
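The metric hierarchy (per label, per target, per pair, overall) can be sketched in plain Python; the function names are illustrative:

```python
def f1(precision, recall):
    """Per-label F1 = 2PR / (P + R); defined as zero when P + R = 0."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def f_avg(f_favor, f_against):
    """Per-target score: mean of the FAVOR and AGAINST F1 (NONE is excluded)."""
    return (f_favor + f_against) / 2

def f_p_avg(f_avg_target1, f_avg_target2):
    """Per-pair score: mean of F_avg over the two targets of a pair."""
    return (f_avg_target1 + f_avg_target2) / 2

def f_macro(per_pair_scores):
    """Overall score: mean of F_p_avg over all target pairs."""
    return sum(per_pair_scores) / len(per_pair_scores)

# One target pair: target 1 reaches F_avg = 0.70, target 2 reaches 0.60.
score = f_p_avg(f_avg(0.8, 0.6), f_avg(0.7, 0.5))  # -> 0.65
```

Note that the NONE class contributes only through precision and recall of the other two labels; its F1 is never averaged in.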

Baseline Methods
First, we compare the proposed model with the following baselines from STSD.
BiLSTM (Schuster and Paliwal, 1997): A BiLSTM model that takes sentences as inputs without considering the target information.
CNN (Kim, 2014): The vanilla CNN, which has the same input format as BiLSTM. Similarly, target information is not considered in this model.

TAN (Du et al., 2017): An attention-based LSTM that extracts target-specific features.
BiCE (Augenstein et al., 2016): A BiLSTM model that uses conditional encoding for stance detection. The target information is first encoded by using a BiLSTM and the tweet is then encoded by another BiLSTM, whose state is initialised with the hidden representation of the target.
GCAE (Xue and Li, 2018): A model that is based on CNNs and gating mechanism, which is designed to block target-unrelated information.
PGCNN (Huang and Carley, 2018): Similar to GCAE, PGCNN is based on gated convolutional networks and encodes target information by generating target-sensitive filters.
The second group contains baselines from MTSD.
Seq2Seq (Sobhani et al., 2017): An attention-based encoder-decoder model that generates stance labels according to different parts of a tweet.
DMAN (Wei et al., 2018a): A model that uses attention and memory modules to extract important information for detecting stance.
We compare the baselines of STSD and MTSD with our proposed models.

Table 2: Comparison with the baselines on the multi-target stance dataset (%). *: the result is from the original paper. †: the proposed models improve over the best baseline at p < 0.05 with a two-tailed t-test. ‡: BERTweet-A improves over BERTweet at p < 0.05 with a two-tailed t-test. F_macro is the average over all target pairs. Bold scores are best overall.
BERTweet We fine-tune the BERTweet model using "Merged" and "Ad-hoc" training strategies. The pre-trained BERTweet model is fine-tuned under the PyTorch framework. When fine-tuning, the batch size is 32 and maximum sequence length is 128. We use AdamW optimizer (Loshchilov and Hutter, 2019) and the learning rate is 2e-5.
BERTweet-A BERTweet is further improved by joint training with another stance detection task that identifies agreement and disagreement between stances in "Merged" and "Ad-hoc" training settings.

Results and Analysis
Main Results Table 2 compares our proposed models with the baselines mentioned above under both the proposed "Merged" training strategy and the "Ad-hoc" training strategy. We make the following observations.
First, the performance of baseline models that perform well in the "Ad-hoc" training setting drops heavily in our proposed "Merged" setting, especially for BiLSTM and CNN. Specifically, the F_macro of BiLSTM and CNN drops by 10.28% and 12.64%, respectively. The results indicate that the baseline models overfit the training data quite heavily, and that our proposed "Merged" training strategy can serve as a better evaluation method to test whether a model learns target-specific features.

Table 3: Performance comparison of models on the target "Donald Trump" of the SemEval 2016 stance dataset (%). †: the proposed BERTweet-merged improves over BERTweet-adhoc at p < 0.05 with a two-tailed t-test. Bold scores are best overall.
Second, unlike the other baselines, which suffer significant performance drops, BERTweet performs better in the "Merged" setting. Training on all target pairs improves the F_macro of BERTweet from 59.51% to 67.77%, which demonstrates that BERTweet learns more universal representations with respect to targets by leveraging the data of multiple target pairs. Moreover, joint training with the stance (dis)agreement detection task further improves the F_macro of BERTweet from 67.77% to 69.65% in the "Merged" setting. Similarly, in the "Ad-hoc" setting, the F_macro of BERTweet is improved from 59.51% to 60.56%, indicating that this auxiliary task is beneficial to MTSD in both settings and helps the model pay more attention to the target-related words.
Third, BERTweet-A of the "Merged" setting significantly outperforms the best-performing baseline by 12.39% in F macro , showing the effectiveness of the proposed model.

Generalization Analysis
To test the generalization ability of BERTweet in the "Merged" setting (which we call BERTweet-merged), we train and validate BERTweet-merged, without the auxiliary agreement task, on the whole multi-target dataset and test it on the target "Donald Trump" of the SemEval 2016 dataset, where an overall shift in the distribution of words and topics can be observed. Moreover, we train and validate BERTweet-adhoc (BERTweet in the "Ad-hoc" setting) on the target "Donald Trump" of the multi-target dataset and test it on the same set of the SemEval 2016 dataset for comparison with BERTweet-merged. The results are shown in Table 3. From the table, we observe that BERTweet-merged significantly outperforms BERTweet-adhoc on the SemEval 2016 dataset, which indicates that the model trained in the "Merged" setting shows better generalization ability than the model trained in the "Ad-hoc" setting.

Conclusion
In this paper, we presented a comprehensive investigation into multi-target stance detection (MTSD) and proposed a more challenging task that trains a single model on data from all target pairs instead of training a model per target pair. The new training strategy can alleviate overfitting and help the model learn more universal representations by using the data of all target pairs. Moreover, we proposed to integrate a stance (dis)agreement detection module into the proposed model as an auxiliary task to gain more accurate representations of targets. Experimental results show that the proposed model outperforms the best-performing baseline by a large margin and demonstrates its effectiveness even in the face of a more challenging evaluation. Future work includes extending the proposed training strategy and (dis)agreement task to more stance detection tasks and datasets.