Enhancing Zero-shot and Few-shot Stance Detection with Commonsense Knowledge Graph

In this paper, we consider a realistic scenario for stance detection with greater application potential, i.e., zero-shot and few-shot stance detection, which identifies stances for a wide range of topics with no or very few training examples. Conventional data-driven approaches are not applicable to these zero-shot and few-shot scenarios. For human beings, commonsense knowledge is a crucial element of understanding and reasoning. Given the absence of annotated data and the cryptic expression of users' stances, we believe that introducing commonsense relational knowledge as support for reasoning can further improve the generalization and reasoning ability of the model in the zero-shot and few-shot scenarios. Specifically, we introduce a commonsense knowledge enhanced model to exploit both the structural-level and semantic-level information of the relational knowledge. Extensive experiments demonstrate that our model outperforms the state-of-the-art methods on the zero-shot and few-shot stance detection task.


Introduction
Stance detection aims to identify the text author's attitude or position towards a specific topic as a category label from the set {Pro, Con, Neutral} (Mohammad et al., 2016b, 2017). Conventionally, this task is designed to learn a target-specific classifier for prediction on the same topic. Afterward, cross-target stance detection emerged as a subclass of the initial generic stance detection, where the classifier is adapted from different but closely related topics (e.g., training a classifier on "Hillary Clinton" and predicting on "Donald Trump") (Augenstein et al., 2016a). However, both target-specific and cross-target stance detection models (Du et al., 2017; Augenstein et al., 2016a; Zhang et al., 2020) require a large number of training examples with manual annotation, and annotating data for thousands of new topics is time-consuming and expensive.

[Figure 1: An example where the topic is not contained in the text. Topic: Stability; Stance: Pro; Text: "Tenure does not mean a teacher cannot lose their job. It requires due process before termination. Before tenure is achieved, a teacher can be fired without due process. In the Atlanta School District, administrators, fearing that low test scores would cost them their jobs, instructed teachers to change student test responses. Without tenure and due process, teachers risked being fired if they didn't follow instructions." Entity mentions in the text and the topic are highlighted. We omit the reverse edges in the relation graph for clarity.]

* Zheng Lin is the corresponding author.
In this paper, we focus on zero-shot and few-shot stance detection (Allaway and McKeown, 2020), a task to classify stances for a large number of topics with no or very few training examples. A key challenge for zero-shot and few-shot stance detection is the generalization ability of the models. However, most previous approaches to stance detection (Xu et al., 2018; Augenstein et al., 2016b) rely only on the training data, which fails to achieve satisfactory results in zero-shot and few-shot scenarios. Another prominent challenge is the implicit expression of the users' stance: the topic does not always appear in the document, making it difficult to directly establish a connection between the topic and the document. Take Figure 1 as an example: the topic "Stability" is not mentioned in the document, and relational knowledge such as (stability, Antonym, perturbation) and (perturbation, RelatedTo, change) can supplement the lack of explicit inferential evidence. Although Zhang et al. (2020) attempt to introduce external word-level semantic and emotion knowledge (Cambria et al., 2018) for each word of the document, they neglect the global relationship between the topic and the document.
To further tackle the above challenges, we propose to bring in commonsense knowledge from the external structural knowledge base ConceptNet (Speer et al., 2017). We believe that a relational knowledge graph extracted from ConceptNet can promote the transmission of relational information between the document and the topic, as well as the inference of the corresponding stances, which can further reduce the dependency on annotated data. Specifically, we introduce a commonsense knowledge enhanced module based on Graph Convolutional Networks (Kipf and Welling, 2017; Velickovic et al., 2018; Vashishth et al., 2020) to exploit both the structural-level and semantic-level information of the relation subgraph, which further strengthens the generalization and reasoning capacities of the model. Extensive experiments show that our method outperforms the state-of-the-art models on the benchmark dataset for zero-shot and few-shot stance detection.

Problem Formulation
$\mathcal{D} = \{(x_i, t_i, y_i)\}_{i=1}^{N}$ denotes the zero-shot stance detection dataset, which contains $N$ examples, where $x_i$ is a document, $t_i$ is the corresponding topic, and $y_i$ is the stance label. The goal of the task is to predict a stance label $y$ given $x_i$ and $t_i$. To bridge the document and the topic, we introduce a commonsense knowledge subgraph $G = (V, E)$ extracted from the external KG, where $V$ is a subset of the concepts and $E$ denotes the relations between concepts.

BERT Encoding
We employ the pre-trained language model BERT (Devlin et al., 2019) to encode the document $x$ and the topic $t$. Specifically, we concatenate $x$ and $t$ into one input sequence of the form $[\mathrm{CLS}]\ x\ [\mathrm{SEP}]\ t\ [\mathrm{SEP}]$. Then, the input sequence is fed into BERT to obtain the contextual representations $X = \{x_1, \cdots, x_m\}$ for the document and $T = \{t_1, \cdots, t_n\}$ for the topic, where $m$ and $n$ are the lengths of the document and the topic, respectively. Finally, we take the average representations $\bar{x}$ and $\bar{t}$ of the document and the topic, respectively.
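As a minimal sketch of the averaging step (with toy vectors standing in for real BERT outputs; the names `mean_pool`, `X`, and `T` are ours, purely illustrative):

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    """Average a (tokens x dim) matrix of contextual token
    representations into a single vector."""
    return hidden_states.mean(axis=0)

# Toy stand-in for BERT output: m = 2 document tokens and
# n = 1 topic token, each a 4-dimensional contextual vector.
X = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])   # document token representations
T = np.array([[0.0, 0.0, 2.0, 0.0]])   # topic token representations

x_bar = mean_pool(X)   # average document representation
t_bar = mean_pool(T)   # average topic representation
```

In the real model these vectors would come from the last hidden layer of BERT for the concatenated input sequence.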

Knowledge Graph Encoding with CompGCN
Before introducing our graph encoder, we first describe the process of constructing the relational subgraph from the external knowledge graph. We adopt ConceptNet as our knowledge base $\mathcal{G}$.
ConceptNet consists of millions of relation triples and contains 34 relation types in total. Each triple is represented as $R = (u, r, v)$, where $u$ is the head concept, $r$ is the relation, and $v$ is the tail concept. We match phrases in documents and topics to sets of mentioned concepts ($C_d$ and $C_t$, respectively) from ConceptNet. To extract the relational subgraph $G = (V, E)$ from $\mathcal{G}$, we find the two-hop directed paths from concepts in $C_d$ to concepts in $C_t$. All concepts on these paths form the concept set $V$, and $E$ is composed of all edges between concepts within $V$. Moreover, we add a reverse edge for every relation edge to improve the information flow.
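A sketch of this construction on toy triples (the function name and the example triples are ours; for simplicity the sketch keeps only edges lying on the matched paths, whereas the full construction keeps all edges between concepts in $V$):

```python
from collections import defaultdict

def extract_subgraph(triples, doc_concepts, topic_concepts):
    """Collect nodes and edges on directed paths of length <= 2 from
    document concepts to topic concepts, then add reverse edges."""
    out = defaultdict(list)          # adjacency: head -> [(relation, tail)]
    for u, r, v in triples:
        out[u].append((r, v))

    nodes, edges = set(), set()
    for c in doc_concepts:
        for r1, mid in out.get(c, []):
            if mid in topic_concepts:            # one-hop path
                nodes |= {c, mid}
                edges.add((c, r1, mid))
            for r2, end in out.get(mid, []):     # two-hop paths
                if end in topic_concepts:
                    nodes |= {c, mid, end}
                    edges |= {(c, r1, mid), (mid, r2, end)}
    # add a reverse edge for every relation edge
    edges |= {(v, r + "_inv", u) for (u, r, v) in edges}
    return nodes, edges

# Toy ConceptNet-style triples linking a document concept to a topic concept.
triples = [("tenure", "RelatedTo", "job"),
           ("job", "RelatedTo", "stability"),
           ("teacher", "IsA", "person")]
nodes, edges = extract_subgraph(triples, {"tenure"}, {"stability"})
```

Here only the path tenure → job → stability survives; the unrelated triple about "teacher" is discarded.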
Most existing research on GCNs focuses on non-relational graphs. Thus, to incorporate commonsense relational knowledge, we utilize CompGCN (Vashishth et al., 2020), a variant of Graph Convolutional Networks (GCNs) that jointly embeds both the nodes and relations of the subgraph $G$. The graph encoder consists of $L$ stacked CompGCN layers. The features of nodes and relations are initialized with TransE (Bordes et al., 2013) embeddings. We update node representations by aggregating information from their neighbors and the corresponding relational edges. Formally, the update equation for nodes is defined as:

$$h_v^{l+1} = f\Big(\sum_{(u,r) \in \mathcal{N}(v)} W^{l}\, \phi(h_u^{l}, h_r^{l})\Big) \qquad (1)$$

where $f$ is an activation function, $\mathcal{N}(v)$ is the set of neighbors of node $v$, and $h_u$, $h_r$, and $h_v$ are the representations of node $u$, relation $r$, and node $v$.
Here, $\phi$ is an entity-relation composition operation based on the translational theory (Bordes et al., 2013), in the form of subtraction: $\phi(h_u, h_r) = h_u - h_r$. The relation embeddings are transformed as follows: $h_r^{l+1} = W_r^{l} h_r^{l}$. After that, we obtain the node representations $H_d$ and $H_t$ of $C_d$ and $C_t$, respectively. To aggregate reasonable relational information, we compute the average relational representation $\hat{d}$ for $C_d$ by performing scaled dot-product attention (Vaswani et al., 2017), with $\bar{t}$ as the query and $H_d$ as the key and value. Similarly, we get the average relational representation $\hat{g}$ for $C_t$.
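A simplified single-layer sketch of these two steps (assumptions ours: a single shared projection matrix `W` in place of CompGCN's direction-specific weights, `tanh` as the activation, and toy embeddings in place of TransE initialization):

```python
import numpy as np

def compgcn_layer(nodes_h, rels_h, edges, W, W_rel):
    """One simplified CompGCN layer with subtraction composition:
    aggregates W @ (h_u - h_r) over each node's incoming edges,
    then transforms relation embeddings with W_rel."""
    agg = {v: np.zeros(W.shape[0]) for v in nodes_h}
    for u, r, v in edges:
        agg[v] += W @ (nodes_h[u] - rels_h[r])   # phi(h_u, h_r) = h_u - h_r
    new_nodes = {v: np.tanh(a) for v, a in agg.items()}  # f = tanh
    new_rels = {r: W_rel @ h for r, h in rels_h.items()}
    return new_nodes, new_rels

def attention_pool(query, H):
    """Scaled dot-product attention with a single query vector,
    returning a weighted average of the rows of H."""
    scores = H @ query / np.sqrt(len(query))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ H

# Toy subgraph: one edge a --r--> b in a 2-dimensional embedding space.
nodes_h = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}
rels_h = {"r": np.array([0.0, 0.0])}
new_nodes, new_rels = compgcn_layer(nodes_h, rels_h, [("a", "r", "b")],
                                    W=np.eye(2), W_rel=2 * np.eye(2))
# Pool two node representations with the topic vector as query.
pooled = attention_pool(np.array([1.0, 0.0]),
                        np.array([[1.0, 0.0], [0.0, 1.0]]))
```

With $\bar{t}$ as the single query and $H_d$ as keys and values, `attention_pool` yields one vector, which plays the role of $\hat{d}$ above.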

Stance Classification
We concatenate the representations of the plain texts (i.e., $\bar{x}$ and $\bar{t}$) with the relational representations (i.e., $\hat{d}$ and $\hat{g}$) to make full use of both the textual information and the graph structural information. Afterward, the concatenated representation is fed into a two-layer multi-layer perceptron (MLP) with a softmax function to predict the stance label:

$$\hat{y} = \mathrm{softmax}\big(\mathrm{MLP}([\bar{x}; \bar{t}; \hat{d}; \hat{g}])\big)$$

where $[;]$ is the vector concatenation operation. Finally, the parameters of the network are trained using the multi-class cross-entropy loss.

Note that a document belongs to only one partition, which means that documents in the training set do not appear in the validation set or the test set, and vice versa. In addition, the zero-shot topics in the test set never appear in the training set, and the few-shot topics have only a few training examples. Following previous work (Allaway and McKeown, 2020), the macro-averaged F1-score is adopted as the evaluation metric.
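A sketch of this classification head with random toy weights (the dimensions, initialization, and function names are ours, purely illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_stance(x_bar, t_bar, d_hat, g_hat, W1, b1, W2, b2):
    """Concatenate [x; t; d; g], apply a two-layer MLP, and return
    a probability distribution over {Pro, Con, Neutral}."""
    h = np.concatenate([x_bar, t_bar, d_hat, g_hat])
    h = np.tanh(W1 @ h + b1)
    return softmax(W2 @ h + b2)

rng = np.random.default_rng(0)
dim, hidden = 4, 8
reps = [rng.standard_normal(dim) for _ in range(4)]  # x_bar, t_bar, d_hat, g_hat
W1, b1 = rng.standard_normal((hidden, 4 * dim)), np.zeros(hidden)
W2, b2 = rng.standard_normal((3, hidden)), np.zeros(3)
probs = predict_stance(*reps, W1, b1, W2, b2)
```

In training, `probs` would be compared against the gold stance label with the multi-class cross-entropy loss.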

Experimental Settings
We employ the base version of BERT as the backbone. The graph encoder has two CompGCN layers. We train our model on one GPU (Nvidia RTX TITAN, 24G) using the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 4e-5 and a batch size of 64. All documents are truncated to the first 200 words and all topics to the first 5 words. The best checkpoints are selected according to the evaluation metrics on the validation set. We run our model three times with different random seeds and report the averaged results. Our code will be released on GitHub.
We compare our model with several state-of-the-art baselines: BiCond (Augenstein et al., 2016b), CrossNet (Xu et al., 2018), SEKT (Zhang et al., 2020), BERT-joint (Allaway and McKeown, 2020) and TGA-Net (Allaway and McKeown, 2020). The first three are BiLSTM-based models for cross-target stance detection. When training the latter two BERT-based models, Allaway and McKeown (2020) fixed the parameters of the BERT module. Hence, we extend them into two further models, BERT-joint-ft and TGA-Net-ft, in which BERT is fine-tuned during the training process. Besides, we compare our model with BERT-GCN, which applies the conventional GCN (Kipf and Welling, 2017) and considers only node information aggregation.

[Table 2: F1 scores on the Zero-Shot, Few-Shot, and All subsets, broken down by class (pro, con, neu, all), for BiCond and the other baselines and our model.]

Results and Discussions
Results of Different Scenarios The overall results of our model and the baselines are shown in Table 2. To evaluate the effectiveness of our method in different scenarios, we categorize the results into three subsets: Zero-Shot, Few-Shot, and All.
Our model outperforms all baselines by a large margin, which demonstrates the importance of incorporating rich commonsense knowledge in the form of relational graphs. Additionally, we observe that all BERT-based baselines perform worse on pro examples than on con examples for zero-shot topics. A possible explanation is that there are more negative words in the con examples, which are easier to identify in terms of semantics. Conversely, our model brings a significant improvement on average for both zero-shot and few-shot topics, which indicates that the relational information from the external knowledge base can boost the generalization and reasoning ability. Compared to BERT-GCN, which models only node aggregation, our model takes full advantage of the relational information, which contributes much to the overall performance. Furthermore, all BERT-based models perform better than the other baseline methods. This suggests that pre-trained models possess stronger generalization capability because they learn from a large-scale unsupervised corpus. Besides, SEKT does not achieve an effective improvement on VAST, probably because it only introduces external semantic knowledge at the token level without explicitly considering the overall relationship between the topic and the document. Moreover, the token-level approach is difficult to transplant to BERT.

Results of Different Phenomena
To further analyse the effectiveness of our model, we test it under five challenging phenomena in VAST, following Allaway and McKeown (2020): (1) Imp: examples with non-neutral labels where the topic does not appear in the document; (2) mlT: documents having multiple examples with different topics; (3) mlS: documents having multiple examples with different and non-neutral labels; (4) Qte: documents with quotations; (5) Sarc: documents with sarcasm (Habernal et al., 2018). As shown in Table 3, our model achieves the best performance on all of these difficult phenomena. In particular, the improvement on Imp demonstrates that introducing external relational knowledge can help the model better understand the relationship between the topic and the document. Besides, the external semantic-level information from the relational subgraph makes our model perform better on the special rhetorical phenomena (Qte and Sarc).

Conclusion
In this paper, we interpret the necessity of introducing commonsense knowledge for zero-shot and few-shot stance detection. We present a commonsense knowledge enhanced method, which facilitates the integration of relational knowledge to further strengthen the generalization and reasoning capacities of the stance detection model. Extensive experiments show that our proposed model achieves state-of-the-art results.