Topic-Aware Evidence Reasoning and Stance-Aware Aggregation for Fact Verification

Fact verification is a challenging task that requires simultaneously reasoning and aggregating over multiple retrieved pieces of evidence to evaluate the truthfulness of a claim. Existing approaches typically (i) explore the semantic interaction between the claim and evidence at different granularity levels but fail to capture their topical consistency during the reasoning process, which we believe is crucial for verification; (ii) aggregate multiple pieces of evidence equally without considering their implicit stances toward the claim, thereby introducing spurious information. To alleviate the above issues, we propose a novel topic-aware evidence reasoning and stance-aware aggregation model for more accurate fact verification, with the following four key properties: 1) checking topical consistency between the claim and evidence; 2) maintaining topical coherence among multiple pieces of evidence; 3) ensuring semantic similarity between the global topic information and the semantic representation of evidence; 4) aggregating evidence based on its implicit stance toward the claim. Extensive experiments conducted on two benchmark datasets demonstrate the superiority of the proposed model over several state-of-the-art approaches for fact verification. The source code can be obtained from https://github.com/jasenchn/TARSA.


Introduction
The Internet breaks the physical distance barrier among individuals to allow them to share data and information online. However, it can also be used by people with malicious purposes to disseminate misinformation or fake news. Such misinformation may cause ethnic conflicts, financial losses, and political unrest, and has become one of the greatest threats to the public (Zafarani et al., 2019). Moreover, as shown in Vosoughi et al. (2018), compared with the truth, misinformation diffuses significantly farther, faster, and deeper in all genres. Therefore, there is an urgent need for quickly identifying misinformation spread on the web. To solve this problem, we focus on the fact verification task (Thorne et al., 2018), which aims to automatically evaluate the veracity of a given claim based on textual evidence retrieved from external sources.
Recent approaches for fact verification are dominated by natural language inference models (Angeli and Manning, 2014) or textual entailment recognition models (Ma et al., 2019), where the truthfulness of a claim is verified via reasoning and aggregating over multiple pieces of retrieved evidence. In general, existing models follow an architecture with two main sub-modules: the semantic interaction module and the entailment-based aggregation module (Hanselowski et al., 2018a; Nie et al., 2019a; Soleimani et al., 2020; Liu et al., 2020). The semantic interaction module attempts to grasp the rich semantic-level interactions among multiple pieces of evidence at the sentence level (Ma et al., 2019; Zhou et al., 2019a; Subramanian and Lee, 2020) or the semantic-role level (Zhong et al., 2020). The entailment-based aggregation module aims to filter out irrelevant information and capture the salient information related to the claim by aggregating the semantic information coherently.
However, the aforementioned approaches typically learn the representation of each evidence-claim pair from the semantic perspective, e.g., by obtaining the semantic representation of each pair through pre-trained language models (Devlin et al., 2019) or graph-based models (Velickovic et al., 2018), which largely overlooks the topical consistency between claim and evidence.

Figure 1: An example claim with four retrieved evidence sentences.
Claim: A high school student named Cole Withrow was charged for leaving an unloaded shotgun in his vehicle while parking at school.
E1 (gold): Family friend Kim Boykin said Withrow, an Eagle Scout and honors student, accidentally left his gun in the car after skeet shooting over the weekend.
E2 (gold): Others in the Princeton High community agree that Withrow's punishment is too harsh, especially after charges weren't filed when a loaded gun was found in an assistant principal's car two years ago.
E3 (non-gold): "Please know that with student and personnel issues, we carefully balance all factors to arrive at a fair and just outcome," she said in a statement.
E4 (non-gold): He locks his vehicle, goes inside and tries to do the right thing.

For example, in Figure 1, given the claim "A high school student named Cole Withrow was charged for leaving an unloaded shotgun in his vehicle while parking at school" and the retrieved evidence sentences (i.e., E1-E4), we would expect a fact checking model to automatically filter out evidence which is topically unrelated to the claim (such as E3 and E4) and rely only on evidence which is topically consistent with the claim (such as E1 and E2) for the veracity assessment of the claim. In addition, we also expect topical coherence among multiple pieces of supporting evidence such as E1 and E2. Furthermore, in previous approaches, the learned representations of multiple pieces of evidence are aggregated via element-wise max pooling or simple dot-product attention, which inevitably fails to capture the implicit stances of evidence toward the claim (e.g., E1 and E2 support the claim implicitly, while E3 and E4 are unrelated to it) and leads to mixing irrelevant information with relevant information.
To address these problems, in this paper, we propose a novel neural structure reasoning model for fact verification, named TARSA (Topic-Aware Evidence Reasoning and Stance-Aware Aggregation Model). A coherence-based topic attention is developed to model the topical consistency between a claim and each piece of evidence, as well as the topical coherence among evidence, built on the sentence-level topical representations. In addition, a semantic-topic co-attention is created to measure the coherence between the global topical information and the semantic representation of the claim and evidence. Moreover, a capsule network is incorporated to model the implicit stances of evidence toward the claim via the dynamic routing mechanism.
The main contributions are listed as follows: • We propose a novel topic-aware evidence reasoning and stance-aware aggregation approach, which is, to our best knowledge, the first attempt of jointly exploiting semantic interaction and topical consistency to learn latent evidence representation for fact verification.
• We incorporate the capsule network structure into our proposed model to capture the implicit stance relations between the claim and the evidence.
• We conduct extensive experiments on the two benchmark datasets to demonstrate the effectiveness of TARSA for fact verification.

Related Work
In general, fact verification is a task to assess the authenticity of a claim backed by a validated corpus of documents, which can be divided into two stages: fact extraction and claim verification (Zhou and Zafarani, 2020). Fact extraction can be further split into the document retrieval phase and the evidence selection phase, which shrink the search space of evidence (Thorne et al., 2018). In the document retrieval phase, researchers typically reuse the top-performing approaches in the FEVER1.0 challenge to extract the documents with high relevance for a given claim (Hanselowski et al., 2018b; Yoneda et al., 2018; Nie et al., 2019a). In the evidence selection phase, to select relevant sentences, researchers generally train classification or ranking models based on the similarity between the claim and each sentence from the retrieved documents (Chen et al., 2017; Stammbach and Neumann, 2019; Soleimani et al., 2020; Wadden et al., 2020; Zhong et al., 2020; Zhou et al., 2019a). Many fact verification approaches focus on the claim verification stage, which can be addressed by natural language inference methods (Parikh et al., 2016; Ghaeini et al., 2018; Luken et al., 2018). Typically, these approaches contain a representation learning process and an evidence aggregation process. Hanselowski et al. (2018b) and Nie et al. (2019a) concatenate all pieces of evidence as input and use max pooling to aggregate the information for claim verification via the enhanced sequential inference model (ESIM) (Chen et al., 2017). In a similar vein, Yin and Roth (2018) incorporate the identification of evidence to further improve claim verification using ESIM at different granularity levels. Ma et al. (2019) leverage the co-attention mechanism between claim and evidence to generate claim-specific evidence representations which are used to infer the claim.
Benefiting from the development of pre-trained language models, Zhou et al. (2019a) are the first to learn evidence representations with BERT (Devlin et al., 2019), which are subsequently used in a constructed evidence graph for claim inference by aggregating all claim-evidence pairs. Zhong et al. (2020) further establish a semantic-based graph for representation and aggregation with XLNet (Yang et al., 2019). Liu et al. (2020) incorporate two sets of kernels into a sentence-level graph to learn more fine-grained evidence representations. Subramanian and Lee (2020) further incorporate evidence set retrieval and a hierarchical attention sum block to improve the performance of claim verification.
Different from all previous approaches, our work for the first time handles the fact verification task by considering the topical consistency and the semantic interactions between claim and evidence. Moreover, we employ the capsule network to model the implicit stance relations of evidence toward the claim.

Method
In this section, we present an overview of the architecture of the proposed framework TARSA for fact verification. As shown in Figure 2, our approach consists of three main layers: 1) the representation layer, which embeds claim and evidence into three types of representations via a semantic encoder and a topic encoder; 2) the coherence layer, which incorporates the topic information into our model via two attention components; 3) the aggregation layer, which models the implicit stances of evidence toward the claim using a capsule network.

Representation Layer
This section describes how TARSA extracts semantic representations, sentence-level topic representations, and global topic information through a semantic encoder and a topic encoder separately.
Semantic Encoder The semantic encoder in TARSA is a vanilla transformer (Vaswani et al., 2017) with the eXtra hop attention (Zhao et al., 2020). For each claim c paired with N pieces of retrieved evidence sentences E = {e_1, e_2, ..., e_N}, TARSA treats each evidence-claim pair as a node and builds a fully-connected evidence graph G. We also add a self-loop to every node so that each node can propagate messages to itself.
Specifically, we first apply the vanilla transformer on each node to generate the claim-dependent evidence representation, using the concatenation of the claim c and the evidence e_i as input, where i denotes the i-th node in G. We treat the first token representation h_{i,0} as the local context of node i. Then the eXtra hop attention takes the [CLS] token in each node as a "hub token", which attends to the hub tokens of all other connected nodes to learn the global context. One layer of eXtra hop attention can be viewed as a single-hop message propagation among all the nodes along the edges:

\hat{h}_{i,0} = \sum_{j: e_{i,j}=1} \mathrm{softmax}\Big(\frac{q_{i,0} k_{j,0}^\top}{\sqrt{d_k}}\Big) \nu_{j,0},

where e_{i,j} = 1 denotes that there is an edge between node i and node j, q_{i,0} denotes the query vector of the [CLS] token of node i, k_{j,0} and ν_{j,0} denote the key vector and the value vector of the [CLS] token of node j, respectively, and √d_k denotes the scaling factor.
The local context and the global context are concatenated to learn the semantic representation of all the nodes:

h_i = [h_{i,0}; \hat{h}_{i,0}].   (3)

Figure 2: The overview of the architecture of our Topic-Aware Evidence Reasoning and Stance-Aware Aggregation model (TARSA).

By stacking L layers of the transformer with the eXtra hop attention, each taking the semantic representation of the previous layer as input, we learn the semantic representation of evidence H.
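The hub-token message passing described above can be sketched in a few lines of numpy; the single attention head, shapes, and random weights below are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d, d_k = 4, 32, 32            # nodes (evidence-claim pairs), hidden size, key size
h0 = rng.standard_normal((N, d)) # h_{i,0}: [CLS] ("hub") representation of each node
Wq, Wk, Wv = (rng.standard_normal((d, d_k)) * 0.1 for _ in range(3))

q, k, v = h0 @ Wq, h0 @ Wk, h0 @ Wv
E = np.ones((N, N))              # e_{i,j} = 1: fully connected graph with self-loops
scores = (q @ k.T) / np.sqrt(d_k)
scores[E == 0] = -1e9            # mask non-edges (none in a fully connected graph)
h_hat = softmax(scores) @ v      # global context of each hub token (one eXtra hop)

# Local and global contexts are concatenated before the next layer
h_next = np.concatenate([h0, h_hat], axis=-1)
print(h_hat.shape, h_next.shape)
```

Stacking L such layers, with the output fed back as input, corresponds to L hops of message propagation over the evidence graph.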

Topic Encoder We extract topics in the following two forms via latent Dirichlet allocation (LDA) (Blei et al., 2003):

Sentence-level topic representation: Given a claim c and N pieces of retrieved evidence E, we extract a latent topic distribution t ∈ R^K for each sentence as its sentence-level topic representation, where K is the number of topics. More concretely, we denote t_c ∈ R^K for the claim c and t_{e_i} ∈ R^K for evidence e_i. Each scalar value t_k denotes the contribution of topic k in representing the claim or evidence.

Global topic information: We extract the global topic information P = [p_1, p_2, ..., p_K] ∈ R^{K×V} from the topic-word distribution by treating each sentence (i.e., claim or evidence) in corpus D as a document, where V denotes the vocabulary size.
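The two topic representations can be obtained with any standard LDA implementation; the sketch below uses scikit-learn's LatentDirichletAllocation as a stand-in (an assumption, since the paper does not specify an implementation), with a toy corpus and topic number.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: each sentence (claim or evidence) is treated as one "document"
corpus = [
    "student charged for leaving a shotgun in his vehicle at school",  # claim c
    "he accidentally left his gun in the car after skeet shooting",    # e_1
    "charges were not filed when a gun was found in a car",            # e_2
]
K = 5  # number of topics (the paper sweeps K from 25 to 100)

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(bow)

# Sentence-level topic representations: t_c, t_{e_i} in R^K (rows sum to 1)
t = lda.transform(bow)
# Global topic information: normalized topic-word distribution P in R^{K x V}
P = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(t.shape, P.shape)
```

Row i of t is the topic distribution of sentence i, and row k of P is the word distribution p_k of topic k.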

Coherence Layer
This section describes how to incorporate the topic information into our model with two attention components.
Coherence-based Topic Attention Based on the observation illustrated in Figure 1, we assume that, given a claim, the sentences used as evidence should be topically coherent with each other, and the claim should be topically consistent with the relevant evidence. Therefore, two kinds of topical relationship are considered: 1) topical coherence among multiple pieces of evidence (TC_ee); 2) topical consistency between the claim and each piece of evidence (TC_ce).
Specifically, to incorporate the topical coherence among multiple pieces of evidence into our model, we disregard the order of evidence and treat each piece of evidence independently. Then we utilize multi-head attention (Vaswani et al., 2017) without position embeddings to generate the new topic representation of evidence t̂_e based on the sentence-level topic representation t_e ∈ R^{N×K} of the retrieved evidence for a given claim.
Moreover, we utilize the co-attention mechanism (Chen and Li, 2020) to weigh each piece of evidence based on the topical consistency between the claim and the evidence. Given the sentence-level topic representation t_c for the claim and t_e for the corresponding evidence, the co-attention attends to the claim and the evidence simultaneously. We first compute the proximity matrix F ∈ R^N:

F = tanh(t_c W_l t_e^⊤),   (5)

where W_l ∈ R^{K×K} is a learnable weight matrix. The proximity matrix can be viewed as a transformation from the claim attention space to the evidence attention space. Then we predict the interaction attention by treating F as a feature:

H_e = tanh(W_e t_e^⊤ + (W_c t_c^⊤) F),   (6)

where W_e, W_c ∈ R^{l×K} are learnable weight matrices. Finally, we generate a topic similarity score between the claim and each piece of evidence using the softmax function:

α_e = softmax(w H_e),   (7)

where w ∈ R^{1×l} is a learnable weight and α_e ∈ R^N is the attention score of each piece of evidence for the claim. Eventually, the topic representation A ∈ R^{N×K} can be computed as

A = α_e ⊙ t̂_e,   (8)

where ⊙ is the dot product operation.
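A minimal numpy sketch of this co-attention follows, with random stand-ins for the learnable weights; for brevity it weighs t_e directly rather than the attended representation t̂_e.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, K, l = 4, 25, 100            # evidence count, topic number, hidden size
t_c = rng.random((1, K))        # claim topic representation
t_e = rng.random((N, K))        # evidence topic representations
W_l = rng.standard_normal((K, K)) * 0.1   # learnable in the real model
W_e = rng.standard_normal((l, K)) * 0.1
W_c = rng.standard_normal((l, K)) * 0.1
w = rng.standard_normal((1, l)) * 0.1

F = np.tanh(t_c @ W_l @ t_e.T)                  # proximity matrix, shape (1, N)
H_e = np.tanh(W_e @ t_e.T + (W_c @ t_c.T) @ F)  # interaction attention, (l, N)
alpha_e = softmax((w @ H_e).ravel())            # topic similarity scores, (N,)
A = alpha_e[:, None] * t_e                      # weighted topic representation, (N, K)

print(alpha_e.sum(), A.shape)
```

Each row of A is an evidence topic vector scaled by how topically consistent that evidence is with the claim.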
Semantic-Topic Co-attention We weigh each piece of evidence e_i to indicate its importance and infer the claim based on the coherence between the semantic representation and the global topic information via the co-attention mechanism, which is similar to the coherence-based topic attention in Section 3.2. More concretely, taking H and P as input, we compute the proximity matrix F ∈ R^{K×N} to transform the topic attention space to the semantic attention space by Eq. (5). As a result, the attention weights β_e ∈ R^N of evidence can be obtained by Eq. (6) and (7). Eventually, the semantic representation S ∈ R^{N×d} can be updated via S = β_e H.

Aggregation Layer
To model the implicit stances of evidence toward the claim, we incorporate the capsule network (Sabour et al., 2017), where the learned evidence representations serve as the low-level evidence capsules {u_i}_{i=1}^N ∈ R^{d_e}, and {o_j}_{j=1}^M ∈ R^{d_o} denote the high-level class capsules, where M denotes the number of classes. The capsule network models the relationship between the evidence capsules and the class capsules by the dynamic routing mechanism (Yang et al., 2018), which can be viewed as the implicit stance of each piece of evidence toward the three classes.
Formally, let û_{j|i} = W_{j,i} u_i be the predicted vector from the evidence capsule u_i to the class capsule o_j, where W_{j,i} ∈ R^{d_o×d_e} denotes the transformation matrix from u_i to o_j. Each class capsule aggregates all of the evidence capsules by a weighted summation over all corresponding predicted vectors:

o_j = g(Σ_i γ_{ji} û_{j|i}),

where g is a non-linear squashing function which limits the length of o_j to [0, 1], and γ_{ji} is the coupling coefficient that determines the probability that the evidence capsule u_i should be coupled with the class capsule o_j. The coupling coefficients are calculated by the unsupervised and iterative dynamic routing algorithm on the original logits b_{ji}, which is summarized in Algorithm 1. We classify the claim by choosing the class capsule with the largest ρ_j via the capsule loss (Sabour et al., 2017). Moreover, a cross-entropy loss is applied on the evidence capsules to identify whether each piece of evidence is ground-truth evidence.
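The dynamic routing over evidence capsules can be sketched as follows. The number of classes M = 3 and class capsule dimension d_o = 10 follow the paper's setting; the evidence capsule size, random weights, and routing details (e.g., softmax over classes) are illustrative assumptions.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-9):
    # Non-linear squashing g: preserves direction, limits length to [0, 1)
    n2 = (v ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def route(u_hat, iters=3):
    # u_hat: predicted vectors u_{j|i}, shape (M, N, d_o)
    M, N, _ = u_hat.shape
    b = np.zeros((M, N))                                           # logits b_ji
    for _ in range(iters):
        gamma = np.exp(b) / np.exp(b).sum(axis=0, keepdims=True)   # coupling coefficients
        o = squash((gamma[..., None] * u_hat).sum(axis=1))         # class capsules (M, d_o)
        b = b + (u_hat * o[:, None, :]).sum(axis=-1)               # agreement update
    return o, gamma

rng = np.random.default_rng(0)
N, M, d_e, d_o = 5, 3, 16, 10                     # evidence capsules, classes, dims
u = rng.standard_normal((N, d_e))                 # evidence capsules u_i
W = rng.standard_normal((M, N, d_o, d_e)) * 0.1   # transformation matrices W_{j,i}
u_hat = np.einsum('mnij,nj->mni', W, u)           # predicted vectors u_{j|i}

o, gamma = route(u_hat)
pred = int(np.argmax(np.linalg.norm(o, axis=-1)))  # class with the longest capsule
print(o.shape, pred)
```

The capsule lengths play the role of ρ_j: the longer a class capsule, the more evidence capsules were routed toward (i.e., implicitly took a stance for) that class.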

Experimental Setting
This section describes the datasets, evaluation metrics, baselines, and implementation details of our experiments. For the UKP Snopes dataset, the original verdicts are mapped to "REFUTES", "SUPPORTS", and "NEI", and we omit all other labels (i.e., legend, outdated, and miscaptioned) as these instances are difficult to distinguish. Table 1 presents the statistics of the two datasets.

Evaluation Metrics
The official evaluation metrics 1 for the FEVER dataset are Label Accuracy (LA) and FEVER score. LA measures the accuracy of the predicted label ŷ_i matching the ground-truth label y_i without considering the retrieved evidence. The FEVER score labels a prediction as correct only if the predicted label ŷ_i is correct and the retrieved evidence matches at least one gold-standard evidence set, which makes it a better indicator of the inference capability of the model. On UKP Snopes, we use precision, recall, and macro F1 to evaluate performance.
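The two metrics can be sketched as follows; the data structures (label strings, (page, sentence-id) evidence tuples) are illustrative simplifications of the official scorer's input format, which handles additional edge cases.

```python
def label_accuracy(preds, golds):
    # LA: fraction of claims whose predicted label matches the gold label
    return sum(p["label"] == g["label"] for p, g in zip(preds, golds)) / len(golds)

def fever_score(preds, golds):
    # FEVER score: correct iff the label matches AND, for verifiable claims,
    # the retrieved evidence covers at least one complete gold evidence set.
    correct = 0
    for p, g in zip(preds, golds):
        if p["label"] != g["label"]:
            continue
        if g["label"] == "NOT ENOUGH INFO":
            correct += 1
        elif any(set(es) <= set(p["evidence"]) for es in g["evidence_sets"]):
            correct += 1
    return correct / len(golds)

golds = [
    {"label": "SUPPORTS", "evidence_sets": [[("PageA", 0)]]},
    {"label": "REFUTES", "evidence_sets": [[("PageB", 2)]]},
]
preds = [
    {"label": "SUPPORTS", "evidence": [("PageA", 0), ("PageC", 1)]},
    {"label": "REFUTES", "evidence": [("PageD", 5)]},  # right label, wrong evidence
]
print(label_accuracy(preds, golds), fever_score(preds, golds))  # 1.0 0.5
```

The second prediction illustrates why the FEVER score is the stricter metric: the label is right, but without the gold evidence the prediction does not count.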

Implementation Details
We describe our implementation details in this section. Document retrieval takes a claim along with a collection of documents as input and returns the N most relevant documents. For the FEVER dataset, following Hanselowski et al. (2018a), we adopt the entity linking method, since the title of a Wikipedia page can be viewed as an entity and can be linked easily with the entities extracted from the claim. For the UKP Snopes dataset, following Hanselowski et al. (2019), we adopt the tf-idf method, where the tf-idf similarity between the claim and the concatenation of all sentences of each Snopes page is computed, and the 5 highest-ranked documents are taken as the retrieved documents.

1 https://github.com/sheffieldnlp/fever-scorer
Evidence selection retrieves the related sentences from the retrieved documents in a ranking setting. For the FEVER dataset, we follow the method of Zhao et al. (2020): taking the concatenation of the claim and each sentence as input, the [CLS] token representation is learned through BERT and then fed into a linear layer to produce a ranking score. A hinge loss is used to optimize the BERT model. For the UKP Snopes dataset, we adopt the tf-idf method from Hanselowski et al. (2019), which achieves the best precision.
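The pairwise hinge-loss objective for the sentence ranker can be sketched as follows, with plain scores standing in for the BERT [CLS] + linear outputs; the margin and score values are illustrative.

```python
import numpy as np

def hinge_loss(pos_scores, neg_scores, margin=1.0):
    # The score of a gold evidence sentence should exceed that of a
    # sampled negative sentence by at least the margin.
    return np.maximum(0.0, margin - (pos_scores - neg_scores)).mean()

pos = np.array([2.5, 1.5])   # scores of gold evidence sentences
neg = np.array([0.5, 1.0])   # scores of sampled non-evidence sentences
# First pair violates nothing (gap 2.0 > 1.0); second pair's gap is only 0.5,
# so it contributes max(0, 1.0 - 0.5) = 0.5, giving a mean loss of 0.25.
print(hinge_loss(pos, neg))  # 0.25
```

Minimizing this loss pushes gold evidence above negatives in the ranking, so the top-scored sentences can be passed on to claim verification.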
Claim verification. During the training phase, each claim is paired with 5 pieces of evidence. We set the batch size to 1 with an accumulation step of 8; the number of layers L is 3, the number of attention heads is 5, l is 100, the number of class capsules M is 3, the dimension of the class capsules d_o is 10, and the topic number K ranges from 25 to 100. In our implementation, the maximum length of each claim-evidence pair is 130 tokens for both datasets.

Experimental Results
In this section, we evaluate our TARSA model from different aspects. First, we compare the overall performance of our model with the baselines. Then we conduct an ablation study to explore the effectiveness of the topic information and the capsule network structure. Finally, we explore the advantages of our model in single-hop and multi-hop reasoning scenarios.

Table 2 and Table 3 report the overall performance of our model against the baselines for the FEVER dataset and the UKP Snopes dataset 2 . As shown in Table 2, our model significantly outperforms the BERT-based models on both the development and test sets. Among the graph-based models, TARSA outperforms the previous systems GEAR and KGAT, and is surpassed only by DREAM on LA on the test set. One possible reason is that DREAM constructs an evidence graph based on the semantic roles of claim and evidence, which leverages an explicit graph-level semantic structure built from semantic roles extracted by Semantic Role Labeling (Shi and Lin, 2019) in a fine-grained setting. Nevertheless, TARSA shows superior performance to DREAM on the FEVER score, which is a more desirable indicator of the reasoning capability of the model. As shown in Table 3, TARSA performs the best among all previous approaches on the UKP Snopes dataset.

We also examine the effect of the number of topics on the development set of FEVER and UKP Snopes. It can be observed that the optimal topic number is 25 for FEVER and 50 for UKP Snopes. One possible reason is that UKP Snopes is retrieved from multiple domains and thus includes more diverse categories than FEVER.

Ablation Study
To further illustrate the effectiveness of the topic information and the capsule-level aggregation modeling, we perform an ablation study on the development set of FEVER.
Effect of Topic Information: We first explore how the model performance is impacted by the removal of various topic components. The first six rows in Table 5 present the label accuracy (LA) and the FEVER score on the development set of FEVER after removing various components, where STI denotes the semantic-topic information in Section 3.2, T C ee denotes the topical coherence among multiple pieces of evidence, T C ce denotes the topical consistency between the claim and each piece of evidence. As expected, LA and the FEVER score decrease consistently with a gradual removal of various components, which demonstrates the effectiveness of incorporating topic information in three aspects. We find that after all modules are removed, the performance of TARSA is still nearly 2% higher than our base model, Transformer-XH, due to the use of the capsule network in TARSA.
Effect of Capsule-level Aggregation: We explore the effectiveness of the capsule-level aggregation by comparing it with four different aggregation methods. The last four rows in Table 5 show the results of the aggregation analysis on the development set of FEVER. The max pooling, sum, and mean aggregation methods treat the learned representations of evidence as a single matrix and then apply a linear layer to classify the input claim as SUPPORTS, REFUTES, or NEI. The attention-based aggregation method is used in Zhou et al. (2019a), where the dot-product attention is computed between the claim and each piece of evidence to weigh them differently. Finally, our TARSA model aggregates the information of all pieces of evidence using the capsule network, which connects the evidence capsules to the class capsules in a clustered way. From the results, our model outperforms all other aggregation methods.

Table 6 presents the performance of our model in single-hop and multi-hop reasoning scenarios on the FEVER dataset compared with several baselines. The single-hop scenario mainly focuses on the denoising ability of the model, i.e., selecting the salient evidence for inference from the retrieved evidence. The multi-hop scenario mainly emphasizes the relatedness of different pieces of evidence for joint reasoning, which is a more complex task. We build the training and testing sets for both scenarios based on the number of gold-standard evidence sentences of a claim: if more than one piece of gold-standard evidence is required, the claim requires multi-hop reasoning. Instances with the NEI label are removed because there is no gold-standard evidence matching this label. The single-hop reasoning set contains 78,838 and 9,682 instances for training and testing, respectively, while the multi-hop reasoning set contains 30,972 and 3,650 instances for training and testing, respectively.
As Table 6 shows, TARSA outperforms all other baselines on LA by at least 0.31% in the single-hop scenario and 1.09% in the multi-hop scenario, showing a consistent improvement in both scenarios. In addition, TARSA is more effective in the multi-hop scenario, as the capsule-level aggregation helps better aggregate the information of all pieces of evidence.

Table 7 illustrates an example from the UKP Snopes dataset which is correctly detected as REFUTES, where the topic words extracted by LDA are marked in blue. From the table we can observe: 1) the top two pieces of evidence (i.e., e1 and e2) have higher topical overlap with the claim and with each other; 2) the lower two pieces of evidence (i.e., e4 and e5) appear less important because they are less topically relevant to the claim; 3) for e3, it is difficult to judge its relevance from either the topical or the semantic perspective, which makes it ambiguous for identifying the truthfulness of the claim.

Error Analysis
We randomly select 100 incorrectly predicted instances from the FEVER and UKP Snopes datasets and categorize the main errors. The first type of error is caused by the quality of the topics extracted by LDA: the average length of sentences in both datasets is much shorter after removing low- and high-frequency tokens, which makes it challenging for LDA to extract high-quality topics for matching the topical consistency between a claim and each piece of evidence. The second type of error is due to the failure to detect multiple entity mentions referring to the same entity. For example, a claim states that "Go Ask Alice was the real life diary of a teenage girl", while the evidence states that "This book is a work of fiction"; the model fails to understand the relationship between diary and fiction.

Conclusion
We have presented a novel topic-aware evidence reasoning and stance-aware aggregation model for fact verification. Our model jointly exploits the topical consistency and the semantic interaction to learn evidence representations at the sentence level. Moreover, we have proposed the use of the capsule network to model the implicit stances of evidence toward a claim for a better aggregation of information encoded in evidence. The results on two public datasets demonstrate the effectiveness of our model. In the future, we plan to explore an iterative reasoning mechanism for more efficient evidence aggregation for fact checking.