Unified Dual-view Cognitive Model for Interpretable Claim Verification

Recent studies that construct direct interactions between a claim and each single user response (a comment or a relevant article) to capture evidence have shown remarkable success in interpretable claim verification. Since different responses convey the cognition of different individual users (i.e., audiences), the captured evidence reflects the perspective of individual cognition. However, individuals' cognition of social affairs does not always reflect objective reality: their opinions on a claim may contain one-sided or biased semantics. The captured evidence correspondingly contains unobjective and biased evidence fragments, which deteriorates task performance. In this paper, we propose a Dual-view model based on the views of Collective and Individual Cognition (CICD) for interpretable claim verification. From the view of collective cognition, we not only capture word-level semantics based on individual users but also focus on sentence-level semantics (i.e., the overall responses) among all users, and adjust the proportion between them to generate global evidence. From the view of individual cognition, we select the top-k articles with a high degree of difference and interact them with the claim to explore local key evidence fragments. To weaken the bias of individual cognition-view evidence, we devise an inconsistency loss that suppresses the divergence between global and local evidence and strengthens the consistent evidence shared by both. Experiments on three benchmark datasets confirm that CICD achieves state-of-the-art performance.


Introduction
The problem of claim credibility has seriously affected the media ecosystem. Research (Allen et al., 2020) illustrates that the prevalence of 'fake news' has decreased trust in public institutions and undermined democracy. Meanwhile, the 'massive infodemic' during COVID-19 has taken a great toll on health-care systems and lives (Fleming, 2020). Therefore, how to verify claims spread on social networks has become a crucial issue.
Current approaches to claim verification can be divided into two categories.

[Figure 1: An example claim (about dengue fever transmission in a hot and rainy season) and responses from different users. R1: "This year is really hard, let's get over this summer soon!" R2: "I think it's true. My husband had dengue fever before, and I also got infected soon. Maybe he infected me." R3: "No, not all types of mosquitoes transmit dengue fever." R4: "False, please don't continue to spread. It has been refuted that dengue fever will spread in the air."]

1) The first category relies on traditional machine learning and deep learning methods to capture semantics, sentiments (Ajao et al., 2019), writing styles (Przybyla, 2020), and stances (Kumar and Carley, 2019) from claim content, as well as meta-data features such as user profiles (Wu et al., 2020b), for verification. Such approaches can improve verification performance, but they can hardly provide reasonable explanations for the verified results, i.e., where false claims go wrong. 2) To tackle this issue, many researchers have further focused on interpretable claim verification (the second category) by establishing interactive models between claims and each individual relevant article (or comment) to explore coherent (Ma et al., 2019; Wu et al., 2021), similar (Nie et al., 2019; Wu et al., 2020a), or conflicting (Zhou et al., 2020) semantics as evidence for verifying the false parts of claims.
In interpretable claim verification, the majority of models construct interactions between claims and each single user response (i.e., a comment or a relevant article) to capture evidence, which can effectively learn some of the errant aspects of false claims. Since different responses reflect the cognition of different individual users, the evidence captured by these models is usually confined to individual cognition. However, individuals' cognition of social affairs does not always reflect objective reality (Greenwald et al., 1998; Boogert et al., 2018). Because individuals are affected by factors such as emotional tendency (Ji et al., 2019), traditional beliefs (Willard and Norenzayan, 2017), and the selective capture of information (Hoffman, 2018), there are considerable differences in the cognition of different individuals, and they are prone to cognitive biases such as the primacy effect (Troyer, 2011) and the halo effect (Goldstein and Naglieri, 2011); hence, there may be one-sided or biased semantics in their expressed opinions. The captured evidence thus also contains unobjective and biased evidence fragments, deteriorating task performance. For instance, as shown in Figure 1, facing a claim to be verified, different individual users (here, ordinary users on social media, not journalists or professionals) have different reactions. R2 (i.e., response 2 or relevant article 2) and R3, released by users, contain unreliable and biased information perceived by those individuals, which may lead existing interactive models to capture misleading information as evidence. Therefore, how to explore users' collective cognition of claims is a major challenge for interpretable claim verification.
To address these deficiencies, we propose a unified Dual-view model based on Collective and Individual Cognition (CICD) for interpretable claim verification, which discovers global evidence and local key evidence, respectively, and then strengthens the consistent evidence shared by the two views. Specifically, to explore users' collective cognition and capture global evidence, we design a Collective cognition view-based Encoder-Decoder module (CED). CED develops a claim-guided encoder that not only learns word-level semantics based on individual users but also captures sentence-level semantics (i.e., the overall opinions) among all users. Here, a relevant article (a response) released by an individual user is usually a sentence sequence, so all sentence-level semantics together convey the overall opinions of all users. CED then develops a hierarchical attention decoder that generates global evidence by adjusting the weights of word-level and sentence-level semantics. To further acquire local key evidence based on individual cognition, we develop an Individual cognition view-based Selected Interaction module (ISI) that screens the representative top-k articles with high difference and interacts them with the claim to gain local key evidence fragments. To weaken the bias of the individual cognition view and strengthen the consistent evidence shared by global and local evidence, we devise an inconsistency loss to suppress their divergence. Experimental results not only reveal the effectiveness of CICD but also demonstrate its interpretability. Our contributions are summarized as follows: • A novel framework integrating interdisciplinary knowledge for interpretable claim verification is explored, which discovers global and local evidence from the perspectives of collective and individual cognition to interpret verified results.
• The proposed CED captures word-level (individual) and sentence-level (holistic) opinions and reasonably adjusts the proportion between them, generating global evidence from the view of all users.
• Experiments on three competitive datasets demonstrate that CICD achieves better performance than other strong baselines.

Related Work
Automatic verification approaches rely on neural networks to extract content-based features, such as semantics (Popat et al., 2018), sentiments (Nguyen et al., 2020), and writing styles (Przybyla, 2020), as well as metadata-based features, such as user-profile-based (Kumar and Carley, 2019) and comment-based (Bovet and Makse, 2019) features, for verification. These methods can improve the accuracy of claim verification, but they lack interpretability for the verified results. To tackle this, interpretable claim verification has received great attention. Its basic principle is to obtain queried, corrected, and rumor-refuting semantics from the articles (or comments) related to claims in order to interpret the credibility of the claims. At present, methods for this task generally focus on direct interactions between claims and relevant articles to identify their matching degree (Nie et al., 2019), consistency (Ma et al., 2019), implication, or conflict (Wu et al., 2020c), so as to learn practical evidence. For instance, HAN (Ma et al., 2019) and EHIAN (Wu et al., 2020c) learned implication relationships between claims and relevant articles to capture semantic conflicts as evidence, which reflected a certain interpretability. However, since all relevant articles are involved, the captured conflicts may be affected by low-quality articles with noisy semantics, easily invalidating the evidence. In our model, we design the ISI module to screen all relevant articles and capture the valuable representative articles with differential semantics, so as to learn local key evidence fragments. In addition, some methods, such as GEAR and KGAT (Liu et al., 2020), rely on graph-based networks to conduct semantic aggregation and reasoning over relevant articles so as to capture global evidence. Nevertheless, these models treat an entire article (at the sentence level) as a node and ignore the importance of word-level semantics in each article.
To overcome these defects, our model constructs a hierarchical attention decoder that fuses sentence-level and word-level semantics to generate global evidence at a fine granularity.

The Proposed Approach
In this section, we introduce the details of CICD as illustrated in Figure 2.
Inputs and Outputs. For cognitive input representations, the inputs of CED are a claim sequence and the concatenation of all its relevant articles, of which there are N, while the inputs of ISI are a claim sequence and each relevant article. Given any sequence of l words X = {x_1, x_2, ..., x_l}, each word x_i ∈ R^d is a d-dimensional vector obtained from the pre-trained BERT model (Devlin et al., 2019). In particular, the length of each relevant-article sequence is l and that of the claim sequence is p. Thus, we obtain the representations of the i-th relevant article and the claim as X^r_i ∈ R^{l×d} and X^c ∈ R^{p×d}, respectively. For the outputs, CED produces the generated global evidence sequence of o words G = {g_1, g_2, ..., g_o}, where g_t is the representation of the t-th generated word. ISI produces the integrated vector of the top-k local key evidence fragments I = [I_1; I_2; ...; I_k], where ; denotes the concatenation operation.
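The input and output shapes above can be sketched as follows (a toy NumPy illustration with hypothetical sizes, not the authors' code):

```python
import numpy as np

# Hypothetical sizes: N relevant articles of l words each, a claim of p
# words, and d-dimensional BERT embeddings (d = 768 in the paper's settings).
N, l, p, d = 5, 100, 20, 768

rng = np.random.default_rng(0)
X_r = rng.normal(size=(N, l, d))   # i-th article: X_r[i] in R^{l x d}
X_c = rng.normal(size=(p, d))      # claim: X_c in R^{p x d}

# CED consumes the claim plus the concatenation of all N articles ...
ced_articles = X_r.reshape(N * l, d)         # shape (N*l, d) = (500, 768)
# ... while ISI consumes the claim paired with each article separately.
isi_inputs = [(X_c, X_r[i]) for i in range(N)]

print(ced_articles.shape)
```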

Collective Cognition View-based Encoder-Decoder (CED)

To explore users' collective cognition of claims, we first rely on a claim-guided encoder to capture word-level and sentence-level semantics from all relevant articles, and then adjust the proportion between the two through a hierarchical attention decoder to generate global evidence.

Claim-guided Encoder
The claim-guided encoder module involves a sequence encoding layer and a matching layer.

Sequence Encoding Layer. We rely on BiLSTMs to encode all relevant articles and the claim into contextual representations. We use the produced hidden states H^r = {h^r_1, h^r_2, ..., h^r_{l_all}} (where l_all is the total length of all articles) and H^c = {h^c_1, h^c_2, ..., h^c_p} to denote the contextual representations of the relevant articles and the claim, respectively, where each h_i (i.e., h^r_i or h^c_i) is defined as:

h_i = [→h_i ; ←h_i]

where →h_i and ←h_i are the hidden states of the forward and backward LSTMs for the word x_i, respectively, and ; is the concatenation operation.

Attention-based Matching Layer. This layer aggregates the relevant information from the claim for each word within the context of the relevant articles. The aggregation operation a_i = attn(h^r_i, H^c) is:

α_{i,j} = exp(f(h^r_i, h^c_j)) / Σ_{j'} exp(f(h^r_i, h^c_{j'})),    a_i = Σ_j α_{i,j} h^c_j

where a_i is the aggregated vector for the i-th word of the articles, α_{i,j} is the normalized attention score between h^r_i and h^c_j, and f is the attention scoring function. The purpose of adopting the claim to guide the encoding of relevant articles is twofold: 1) strengthening the focus on consistent semantics associated with the claim in the relevant articles, i.e., exploring how the relevant articles evaluate the claim; and 2) making the encoded semantics purer. We observe that relevant articles contain some advertisements and useless information; this design effectively filters out noise irrelevant to the claim and consolidates the generation of relevant semantics in the decoder module.
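The matching layer's aggregation can be sketched as follows (a minimal NumPy illustration; the dot-product form of the scoring function is an assumption, since the original score form is not specified here):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def claim_guided_matching(H_r, H_c):
    """For each article word h^r_i, aggregate claim information:
    alpha_{i,j} = softmax_j(score(h^r_i, h^c_j)); a_i = sum_j alpha_{i,j} h^c_j.
    Dot-product scoring is assumed for illustration."""
    scores = H_r @ H_c.T              # (l_all, p) unnormalized alignments
    alpha = softmax(scores, axis=-1)  # normalize over claim words
    return alpha @ H_c                # (l_all, dim) aggregated vectors a_i

rng = np.random.default_rng(0)
H_r = rng.normal(size=(12, 8))  # toy contextual states of article words
H_c = rng.normal(size=(4, 8))   # toy contextual states of claim words
A = claim_guided_matching(H_r, H_c)
print(A.shape)  # (12, 8)
```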
Furthermore, we take the hidden state of the last word encoded for each relevant article as its sentence-level representation, where h^s_i denotes the sentence-level representation of the i-th relevant article. In particular, we use the word-level representations H^r = {h^r_1, h^r_2, ..., h^r_{l_all}} (which can also be written per article, i.e., H^r = {h^r_{1,1}, h^r_{1,2}, ..., h^r_{N,l}}, where l_all = N×l) and the sentence-level representations H^{rs} = {h^s_1, h^s_2, ..., h^s_N} as the memory bank for decoder generation.

Hierarchical Attention Decoder
To capture collective cognition-view evidence from the relevant articles, we devise a hierarchical attention decoder that considers consistent semantics at different granularities of the relevant articles to generate global evidence. Specifically, we employ a unidirectional LSTM as the decoder, and at each decoding time-step we compute in parallel the sentence-level attention weights β and the word-level attention weights α:

β_i = softmax((h^d_t)^T W_2 h^s_i),    α_{i,j} = softmax((h^d_t)^T W_3 h^r_{i,j})

where h^d_t is the hidden state of the decoder at the t-th time-step, and W_2 and W_3 are trainable parameters. The word-level attention ascertains how to distribute attention over the words in each sentence (each article), which learns salient evidence segments within each article, while the sentence-level attention determines how much each article should contribute to the generation at the current time-step, which captures potential global semantics across all articles.
The context vector c_t is then derived as a combination of all word-level representations re-weighted by the combined attention γ:

γ_{i,j} = β_i α_{i,j} / Σ_{i',j'} β_{i'} α_{i',j'},    c_t = Σ_{i,j} γ_{i,j} h^r_{i,j}

and the attentional vector is calculated as:

h̃_t = tanh(W_c [c_t ; h^d_t])

where W_c is a trainable parameter. Finally, the predicted probability distribution over the vocabulary V at the current step is:

P(g_t | g_{<t}) = softmax(W_v h̃_t + b_v)

where W_v and b_v are trainable parameters. We adopt G = {g_1, g_2, ..., g_o} to denote the generated sequence rich in global evidence.
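The two-level attention combination at a single decoding step can be sketched as follows (a minimal NumPy illustration, assuming dot-product scoring and a multiplicative, renormalized combination of the sentence-level and word-level attentions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_context(h_d, H_words, H_sents):
    """One decoding step of the hierarchical attention decoder (sketch).
    H_words: (N, l, d) word-level memory; H_sents: (N, d) sentence-level
    memory; h_d: (d,) decoder state. Returns the context vector c_t and
    the combined attention gamma over all N*l words."""
    beta = softmax(H_sents @ h_d)             # (N,) article weights
    alpha = softmax(H_words @ h_d, axis=-1)   # (N, l) per-article word weights
    gamma = beta[:, None] * alpha             # combine the two levels
    gamma = gamma / gamma.sum()               # renormalize over all words
    c_t = (gamma[..., None] * H_words).sum(axis=(0, 1))  # context vector
    return c_t, gamma

rng = np.random.default_rng(1)
c_t, gamma = hierarchical_context(rng.normal(size=6),
                                  rng.normal(size=(3, 5, 6)),
                                  rng.normal(size=(3, 6)))
print(c_t.shape)  # (6,)
```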

Individual Cognition View-based Selected Interaction (ISI)
To capture evidence fragments from the individual cognition view, we design the ISI module with the following layers: 1) a sentence-level representation layer for capturing high-level representations of relevant articles; 2) a selected mechanism for screening the representative top-k relevant articles by degree of difference; and 3) a co-interaction layer that makes the claim and the selected articles interact with each other to explore local key evidence fragments.

Sentence-level Representation
We exploit a BiLSTM to encode each relevant article and take the output of the last hidden state as the sentence-level representation; the encoding process is similar to the sequence encoding layer in Section 3.2.1, and the sentence-level representation of the i-th article is h^{rs}_i.

Selected Mechanism
To capture the representative top-k articles, we develop a selected mechanism that automatically calculates the difference between each article and the other articles. To do this, the selected mechanism learns and optimizes an inter-sentential attention matrix A ∈ R^{N×N}. The entry (m, n) of A holds the difference between article m and article n (1 ≤ m, n ≤ N and m ≠ n) and is computed as:

u_m = ϕ(W_m h^{rs}_m + b_m),    u_n = ϕ(W_n h^{rs}_n + b_n),    A[m, n] = u_m · u_n    (10)

where ϕ is an activation function, W_m and W_n are weight matrices, b_m and b_n are biases, and · denotes the dot-product operator. The larger the entry A[m, n], the more similar articles m and n are; a smaller A[m, n] thus indicates that articles m and n contain more differential semantics. Finally, we screen the top-k relevant articles with the highest difference for downstream interaction.
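The selected mechanism can be sketched as follows (a minimal NumPy illustration; the tanh activation and the row-sum aggregation of pairwise similarities into a per-article difference score are assumptions made for the sketch):

```python
import numpy as np

def select_top_k(H_rs, W_m, b_m, W_n, b_n, k):
    """Build the inter-sentential matrix A with A[m, n] = u_m . u_n
    (pairwise similarity); articles whose rows sum to the smallest total
    similarity are treated as the most 'different' and are kept."""
    u_m = np.tanh(H_rs @ W_m.T + b_m)   # phi = tanh (assumed)
    u_n = np.tanh(H_rs @ W_n.T + b_n)
    A = u_m @ u_n.T                     # (N, N) pairwise similarities
    np.fill_diagonal(A, 0.0)            # ignore m == n entries
    difference = -A.sum(axis=1)         # low similarity -> high difference
    return np.argsort(-difference)[:k]  # indices of top-k distinct articles

rng = np.random.default_rng(2)
N, d, k = 6, 8, 3
H_rs = rng.normal(size=(N, d))          # toy sentence-level representations
W = rng.normal(size=(d, d)) * 0.1       # toy shared weights for the sketch
idx = select_top_k(H_rs, W, 0.0, W, 0.0, k)
print(len(idx))  # 3
```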

Co-Interaction Layer
This co-interaction layer aims to explore local key evidence fragments. Specifically, the layer enables the claim to focus on the i-th article to discover specific evidence fragments, while the i-th article pays close attention to the claim to explore the possibly false parts of the claim. Finally, we combine the two interactions to constitute the individual local key evidence fragments.
where H^{rin}_i is the evidence fragment of the i-th article, H^{cin} is the false part of the claim, and H^{cs} is the output of the last time step of H^c.
For all top-k articles, we integrate all local evidence fragments by concatenation operation.
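The bidirectional interaction between a claim and one selected article can be sketched as follows (a minimal NumPy illustration; mean-pooling the two attended views before concatenation is an assumption of the sketch, not the authors' stated pooling):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def co_interaction(H_c, H_r):
    """Co-interaction sketch: the claim attends to article words (evidence
    fragment) and the article attends to claim words (possibly false part);
    both views are pooled and concatenated into one local-evidence vector."""
    S = H_c @ H_r.T                            # (p, l) affinity matrix
    claim_to_art = softmax(S, axis=1) @ H_r    # claim-focused article semantics
    art_to_claim = softmax(S.T, axis=1) @ H_c  # article-focused claim semantics
    return np.concatenate([claim_to_art.mean(axis=0),
                           art_to_claim.mean(axis=0)])

rng = np.random.default_rng(3)
H_c = rng.normal(size=(4, 6))  # toy claim states (p=4, dim=6)
H_r = rng.normal(size=(9, 6))  # toy article states (l=9, dim=6)
I_i = co_interaction(H_c, H_r)
print(I_i.shape)  # (12,)
```

The vectors I_1, ..., I_k produced for the top-k articles are then concatenated, matching the integration step above.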

Dual-View Classification
To alleviate the bias of the individual cognition-view evidence fragments and strengthen the consistent evidence shared by the global and local evidence, we introduce an inconsistency loss to penalize the disagreement between the two. We define the inconsistency loss as the Kullback-Leibler (KL) divergence between G and I:
Loss_in = Σ_k G_k log(G_k / I_k)

where G_k is the k-th element of the concatenation of the words in G, and I_k is the k-th element of I. Furthermore, we fuse the two types of penalized evidence and adopt a softmax function to emit the probability distribution for training, where a cross-entropy loss is minimized for each training sample with ground-truth label y:

p = softmax(W_p [G; I] + b_p),    Loss = −Σ y log(p)

where W_p and b_p are learnable parameters.
To ensure effective synergy between the two cognition views, we combine all of the above losses for joint training: L = Loss + αLoss_in (19), where α is a hyper-parameter.
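The joint objective can be sketched as follows (a minimal NumPy illustration; softmax-normalizing the global and local evidence vectors before the KL term is an assumption of the sketch):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q, eps=1e-12):
    """Inconsistency loss: KL(p || q) between the normalized global and
    local evidence distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def joint_loss(G_vec, I_vec, logits, y, alpha=0.2):
    """L = cross-entropy + alpha * KL(G || I), with alpha = 0.2 as in the
    paper's settings."""
    loss_in = kl_div(softmax(G_vec), softmax(I_vec))
    ce = -float(np.log(softmax(logits)[y]))  # cross-entropy for label y
    return ce + alpha * loss_in

rng = np.random.default_rng(4)
L = joint_loss(rng.normal(size=8), rng.normal(size=8),
               np.array([1.0, 0.2, -0.5]), y=0)
print(L > 0)  # True
```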

Datasets and Evaluation Metrics
For evaluation, we utilize three publicly available datasets: Snopes and PolitiFact (both released by Popat et al. (2018)), and FEVER (Thorne et al., 2018). The first two datasets contain 4,341 and 3,568 news claims, associated with 29,242 and 29,556 relevant articles (these articles can be regarded as responses of different individual users to the claims) collected from various web sources, respectively. FEVER consists of 185,445 claims accompanied by manually annotated Wikipedia articles. For labels, each claim in Snopes is labeled as true or false, while PolitiFact divides claims into six credibility labels: true, mostly true, half true, mostly false, false, and pants on fire. To distinguish veracity more practically, like Ma et al. (2019), we merge mostly true, half true, and mostly false into mixed, and treat false and pants on fire as false. The labels of PolitiFact are thus classified as true, mixed, and false. On FEVER, each claim is labeled as supported, refuted, or NEI (not enough information). For evaluation metrics, on Snopes and PolitiFact we use micro-/macro-averaged F1 (micF1/macF1) and class-specific precision (Prec.), recall (Rec.), and F1 score (F1). We hold out 10% of the claims for tuning the hyperparameters and conduct 5-fold cross-validation on the rest. On FEVER, we use accuracy (Acc.) and F1 score (F1), and follow Thorne et al. (2018) in partitioning the annotated claims into training, development (Dev.), and testing (Test.) sets.
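The PolitiFact label merging described above amounts to a simple mapping:

```python
# Merge PolitiFact's six credibility labels into three, following the
# scheme described above (as in Ma et al., 2019).
MERGE = {
    "true": "true",
    "mostly true": "mixed",
    "half true": "mixed",
    "mostly false": "mixed",
    "false": "false",
    "pants on fire": "false",
}

labels = ["pants on fire", "half true", "true", "mostly false"]
merged = [MERGE[x] for x in labels]
print(merged)  # ['false', 'mixed', 'true', 'mixed']
```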

Settings
We adjust the parameter configurations according to performance on the development sets. We set the word embedding size d to 768. The dimensionality of the LSTM hidden states d_h is 120. The length l of each relevant article is 100 and the claim length p is 20. Since no parameters depend on the number of articles N, we let N vary with each claim instead of truncating to a fixed number. The initial learning rate is set to 2e-3. The loss weight coefficient α is 0.2. The dropout rate is 0.4, and we set the mini-batch sizes for the three datasets to 32, 32, and 64, respectively. Additionally, an Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9 and β2 = 0.999 is used to optimize all trainable parameters.
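For reference, the reported hyperparameters can be collected into a single configuration (the dict structure is hypothetical; the values are those stated above):

```python
# Hyperparameters from the Settings section, gathered into one config.
CONFIG = {
    "word_embedding_dim": 768,     # d, from pre-trained BERT
    "lstm_hidden_dim": 120,        # d_h
    "article_len": 100,            # l
    "claim_len": 20,               # p
    "num_articles": None,          # N varies with each claim
    "learning_rate": 2e-3,
    "loss_weight_alpha": 0.2,      # alpha in L = Loss + alpha * Loss_in
    "dropout": 0.4,
    "batch_sizes": {"Snopes": 32, "PolitiFact": 32, "FEVER": 64},
    "adam_betas": (0.9, 0.999),
}
print(CONFIG["lstm_hidden_dim"])  # 120
```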

Performance Comparison
We compare CICD with several competitive baselines. Among them, EHIAN (Wu et al., 2020c) is an evidence-aware hierarchical interactive attention network that focuses on the direct interaction between the claim and relevant articles to explore key evidence fragments. As shown in Table 1, we observe that: • BERT achieves at least a 6.5% improvement in micF1 over DeClarE, which illustrates that a pre-trained model can learn rich semantic context features to improve performance; this is also why we adopt BERT to train word embeddings. HAN consistently outperforms BERT, which indicates that capturing the coherence between relevant articles helps improve task performance.
• Among the interpretable methods, CICD outperforms DeClarE because our model not only focuses on word-level semantics like DeClarE but also grasps holistic sentence-level features. Moreover, because HAN and HAN-ba drive all relevant articles to participate in the interaction, they gain a small boost in precision on Snopes, but this may introduce noise from insignificant articles. CICD effectively avoids this problem by selecting vital articles for interaction and obtains significant improvements on the other metrics compared with HAN and HAN-ba. Furthermore, CICD consistently outperforms EHIAN on Snopes and PolitiFact. The reason for this superiority is clear: CICD not only values the individual cognition view to capture key evidence fragments but also generates collective cognition-view evidence for claim verification.

Ablation Study
To evaluate the impact of each component of CICD, we ablate CICD into the following simplified models: 1) -matching: the attention-based matching layer of CED is removed; 2) -CED: CED is deleted from our model; 3) -selected: the selected mechanism is removed from ISI; 4) -interaction: the co-interaction unit of ISI is replaced by a concatenation operation; 5) -ISI: ISI is removed; and 6) -inconsistency loss: the inconsistency loss is removed. As shown in Table 2, we observe that: • Removing either module (-CED or -ISI) weakens the performance of CICD, causing a 4.2% to 5.5% degradation in micF1, and stripping individual layers from each module (such as -selected and -interaction) also reduces performance, by at least 2.4% in micF1, which demonstrates the effectiveness of each component and the organic integrity of CICD.
• -CED shows the lowest performance among all simplified models, decreasing micF1 by 5.5% and 4.6% on the two datasets, respectively, which demonstrates the effectiveness of capturing collective cognition-view global evidence. Meanwhile, -ISI underperforms CICD, with 4.3% and 4.2% degradation in micF1 on the two datasets, respectively, which conveys the necessity of exploring local key evidence fragments from the individual cognition view.
• Compared with -inconsistency loss, CICD significantly improves performance on the two datasets with the help of the inconsistency loss, which verifies the effectiveness of relying on the inconsistency loss to discover shared valuable semantics between global and local evidence.

Evaluation of Co-Interaction Networks
To understand the superiority of our co-interaction networks (CoI) in more detail, we compare CoI with the following prevalent interaction networks: 1) MLP (Multilayer Perceptron), which acts as an interaction strategy to automatically abstract an integrated representation of claims and articles; 2) Self-Att (Self-attention Networks) (Vaswani et al., 2017), which adopts the claim as the query and the relevant articles as values and keys for interaction; 3) Biaf-Att (Biaffine Attention) (Ma et al., 2019), which measures the degree of semantic matching for interaction; and 4) Symm-Intr (Symmetric interaction attention) (Tao et al., 2019), which models the interaction between claims and articles. Specifically, we investigate the performance and time cost of these methods on Snopes and PolitiFact on Linux CentOS with an NVIDIA TITAN Xp GPU, as shown in Figure 3. We observe that: in overall performance, our method is optimal, outperforming the other methods by more than 5.1% and 5.6% in micF1, respectively. In terms of time cost, our method saves a great deal of time: compared with Self-Att and Symm-Intr, it saves 500 to 1,000 seconds on the two datasets, respectively, because the multiple mappings of self-attention networks and the repeated stacks of symmetric attention reduce efficiency. Although the time cost of our method is higher than that of MLP and Biaf-Att, the performance of both of those methods is unsatisfactory, at least 2.6% and 3.7% lower than our method in micF1 on the two datasets. On the whole, these results adequately demonstrate the superiority of our method.

Evaluation of Hierarchical Attention Decoder
To verify the effectiveness of the internal structure of the hierarchical attention decoder (HAD) in CED, we ablate HAD into the following models: -word, -sentence, and -merge respectively denote HAD without the word-level attention α, the sentence-level attention β, and the merged attention γ; decoder represents the vanilla decoder. Experimental results are shown in Table 3, from which we observe: first, removing any module of HAD weakens the performance of the model, which confirms the effectiveness of each module. Second, beyond the basic decoder, our model achieves the most prominent boost from sentence-level attention, which proves the effectiveness of HAD fusing sentence-level semantics to capture global semantics.
To further investigate the contribution of sentence-level semantics to the global evidence, we take Figure 1 as an example and visualize the global evidence generated by our model with and without sentence-level attention. As shown in Figure 4, the model with sentence-level attention focuses more on the sentence with maximum weight, i.e., R4, with words such as 'do not spread' and 'refuted it spreads in the air', while the model without sentence-level attention cannot identify which relevant articles are more valuable and concentrates more on R2 and R3, e.g., 'get infected husband' and 'not all types of mosquitoes'. This fully demonstrates the effectiveness of sentence-level semantics for generating global evidence.

Experiments on FEVER
To examine the extensibility of our model, we also compare CICD with the following state-of-the-art baselines on the FEVER dataset: 1) NSMN: the pipeline-based Neural Semantic Matching Network (Nie et al., 2019), which conducts document retrieval, sentence selection, and claim verification jointly for fact extraction and verification; 2) HAN: introduced in Section 4.3.1; 3) GEAR: a graph-based evidence aggregating and reasoning model, which enables information to transfer over a fully-connected evidence graph and then utilizes different aggregators to collect multi-evidence information; 4) KGAT: the kernel graph attention network (Liu et al., 2020), which conducts more fine-grained fact verification with kernel-based attentions, using a BERT (Base) encoder with ESIM-retrieved sentences.

[Figure 4: Global evidence generated (a) by our model with sentence-level attention, e.g., "Do not spread this news, we prevent the transmission of dengue fever through mosquito. It is refuted that it spreads in the air.", and (b) by our model without sentence-level attention, e.g., "I get infected after my husband, it maybe true that dengue fever could be transmitted through mosquitoes and air, but not all types of mosquitoes."]

[Figure 5: Case study example. Relevant fragments include "The screenshot of the video is one-sided, it is only a segment of the video. However, he corrected himself immediately to say the number was actually 120,000." and the generated evidence "Claimed 120 million Americans died of coronavirus is a serious error, Biden's mistake, the screenshot is one-sided."]

As shown in Table 4, we observe that CICD outperforms the two pipeline systems (NSMN and HAN) by 4.3% to 11.0% in accuracy, respectively, because these two baselines lack an integration and reasoning process across relevant articles when capturing evidence. CICD also boosts performance compared with GEAR and KGAT, showing at least 1.8% and 1.5% improvement in accuracy on the development and testing sets, respectively. The reason may be that although the two graph-based models aggregate and reason over information from relevant articles to collect multi-evidence, they treat every relevant article equally, so individual-cognitive relevant articles with biased semantics interfere with their reasoning process. It is more feasible for our model to discover global evidence and local key evidence fragments comprehensively from the perspectives of collective and individual cognition.

Case Study: Cognition-view Explanation Analysis
To interpret the results of our model more transparently and intuitively, we visualize the outputs of each module of CICD, as shown in Figure 5, where Figure 5(a) is the sequence generated by the CED module, and the highlighted words in Figures 5(b) and 5(c) are, respectively, the words captured by CICD to interpret the results and the words obtained by the ISI module as evidence fragments. We learn that: • ISI ignores some articles with pale and feeble semantics (R2 and R4), selects the articles with more valuable semantics (R1, R3, and R5), and captures multiple local evidence fragments, such as 'this video screenshot shows' (E1), 'serious error' (E2), and 'screenshot of the video is one-sided' (E3). Notably, fragment E1 is misleading, which reflects the deviation of individual cognition.
• The sequence generated by CED effectively gains available evidence, '120 million Americans a serious error' and 'the screenshot is one-sided', by balancing the possible evidence semantics in the relevant articles from a global perspective.
• By constraining the global and local evidence against each other, CICD disciplines the misleading evidence fragment E1 captured by ISI and finally highlights the shared salient evidence between the two as the final interpretation of the verification results.

Conclusion
In this paper, we proposed a unified dual-view model based on the perspectives of collective and individual cognition for interpretable claim verification, which constructs a collective cognition view-based encoder-decoder module to generate global evidence and an individual cognition view-based selected interaction module to explore local key evidence segments. Besides, we introduced an inconsistency loss to penalize the disagreement between global and local evidence, promoting the capture of consistent shared evidence. Experiments on three widely used datasets demonstrated the effectiveness and interpretability of our model. In the future, we plan to expand this work by: 1) developing a questioning mechanism to filter suspicious evidence; and 2) integrating social cognition, psychology, and other interdisciplinary knowledge to improve the interpretability of claim verification.