A Multi-Level Attention Model for Evidence-Based Fact Checking

Evidence-based fact checking aims to verify the truthfulness of a claim against evidence extracted from textual sources. Learning a representation that effectively captures relations between a claim and evidence can be challenging. Recent state-of-the-art approaches have developed increasingly sophisticated models based on graph structures. We present a simple model that can be trained on sequence structures. Our model enables inter-sentence attentions at different levels and can benefit from joint training. Results on a large-scale dataset for Fact Extraction and VERification (FEVER) show that our model outperforms the graph-based approaches and yields 1.09% and 1.42% improvements in label accuracy and FEVER score, respectively, over the best published model.


Introduction
False or misleading claims spread through online media faster and wider than the truth (Vosoughi et al., 2018). False claims can occur in many different forms, e.g., fake news, rumors, hoaxes, propaganda, etc. Identifying false claims that are likely to cause harm in the real world is important. Generally, claims can be categorized into two types: verifiable and unverifiable. Verifiable claims can be confirmed to be true or false as guided by evidence from credible sources, while unverifiable claims cannot be confirmed due to insufficient information.
Verifying the truthfulness of a claim with respect to evidence can be regarded as a special case of recognizing textual entailment (RTE) (Dagan et al., 2006) or natural language inference (NLI) (Bowman et al., 2015), where the premise (evidence) is not given. Thus, the task of claim verification is to

Figure 1: Examples from the FEVER dev set, where true evidence sentences are present in the selected sentences, and veracity relation labels are correctly predicted by our proposed model. Wikipedia article titles are in [italics]. Superscripts indicate the positions of the sentences in each article.
first retrieve documents relevant to a given claim from textual sources, then select sentences likely to contain evidence, and finally assign a veracity relation label to support or refute the claim. For example, the false claim "Rabies is a foodborne illness." can be refuted by the evidence "Rabies is spread when an infected animal scratches or bites another animal or human." extracted from the Wikipedia article "Rabies". Figure 1 shows other examples that require multiple evidence sentences to support or refute claims. All of these claims are taken from a benchmark dataset for Fact Extraction and VERification (FEVER) (Thorne et al., 2018).
A key challenge is to obtain a representation for claim and evidence sentences that can effectively capture relations among them.
Recent state-of-the-art approaches have attempted to meet this challenge by applying graph-based neural networks (Kipf and Welling, 2017; Velickovic et al., 2018). For example, Zhou et al. (2019) regard an evidence sentence as a graph node, while Liu et al. (2020) use a more fine-grained node representation based on token-level attention. Zhong et al. (2020) use semantic role labeling (SRL) to build a graph structure, where a node can be a word or a phrase depending on the SRL's outputs.
In this paper, we argue that such sophisticated graph-based approaches may be unnecessary for the claim verification task. We propose a simple model that can be trained on a sequence structure. We also observe mismatches between training and testing. At test time, the model predicts the veracity of a claim based on retrieved documents and selected sentences, which contain prediction errors, while at training time, only ground-truth documents and true evidence sentences are available. We empirically show that our model, trained with a method that helps reduce training-test discrepancies, outperforms the graph-based approaches.
In addition, we observe that most previous work neglects sentence-selection labels when training veracity prediction models. Thus, we propose leveraging those labels to further improve veracity relation prediction through joint training. Unlike previous work that jointly trains two models (Yin and Roth, 2018; Hidey et al., 2020; Nie et al., 2020), our approach remains a pipeline process in which a subset of candidate sentences produced by any sentence selector can be used for joint training. This makes it possible to explore different sentence-selection models trained with different methods.
Our contributions are as follows. We develop a method for mitigating training-test discrepancies by using a mixture of predicted and true examples for training. We propose a multi-level attention (MLA) model that enables token- and sentence-level self-attentions and that benefits from joint training. Experiments on the FEVER dataset show that MLA outperforms all the published models, despite its simplicity.
Background and related work

Problem formulation

The input of our task is a claim and a collection of Wikipedia articles D. The goal is to extract a set of evidence sentences from D and assign a veracity relation label y ∈ Y = {S, R, N} to the claim with respect to the evidence set, where S = SUPPORTED, R = REFUTED, and N = NOTENOUGHINFO. The definition of our labels is identical to that of the FEVER Challenge (Thorne et al., 2018).

Figure 2: The process of evidence-based fact checking: retrieving documents relevant to a given claim from Wikipedia, selecting sentences likely to contain evidence, and predicting a veracity relation label based on the selected sentences.

Overview of evidence-based fact checking
The process of evidence-based fact checking, shown in Figure 2, commonly involves the following three subtasks.

Document retrieval
Given a claim, the task is to retrieve the top K relevant documents from D. Thorne et al. (2018) suggest using the document retriever from DrQA (Chen et al., 2017a), which ranks documents on the basis of the term frequency-inverse document frequency (TF-IDF) model with unigram-bigram hashing. Hanselowski et al. (2018) use a hybrid approach that combines search results from the MediaWiki API with the results of exact matching on all Wikipedia article titles. In this paper, our main focus is to improve evidence sentence selection and veracity relation prediction, so we directly use the document retrieval results from Hanselowski et al. (2018). This allows us to fairly compare our model with a series of previous methods (Soleimani et al., 2019; Zhou et al., 2019; Ye et al., 2020) that also rely on Hanselowski et al. (2018)'s results.
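As a toy illustration of the TF-IDF ranking idea behind DrQA's retriever (this sketch uses plain unigram TF-IDF with cosine similarity, omitting the bigram hashing; all function names are ours):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build unigram TF-IDF vectors for a small document collection."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs, idf

def rank_documents(claim, docs):
    """Rank documents by cosine similarity between the claim's and each
    document's TF-IDF vector; returns document indices, best first."""
    vecs, idf = tfidf_vectors(docs)
    claim_tf = Counter(claim.lower().split())
    claim_vec = {t: claim_tf[t] * idf.get(t, 0.0) for t in claim_tf}

    def cosine(u, v):
        dot = sum(u[t] * v.get(t, 0.0) for t in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    scores = [cosine(claim_vec, v) for v in vecs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

For the claim "Rabies is a foodborne illness.", a document mentioning rabies should rank above unrelated articles, since the rare term "rabies" carries a high IDF weight.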

Evidence sentence selection
The task is to select the top M sentences from the retrieved documents. Thorne et al. (2018) rank candidate sentences by their TF-IDF similarity to the claim, analogous to their document retrieval step.

Veracity relation prediction
Given a claim and a set of M selected sentences, the task is to predict their veracity relation label y. Previous work on the FEVER Challenge modified existing RTE/NLI models to deal with multiple sentences (Nie et al., 2019a; Yoneda et al., 2018; Hanselowski et al., 2018; Thorne et al., 2018), used heuristic rules to combine predictions from individual claim-sentence pairs (Malon, 2018), or concatenated all sentences (Stammbach and Neumann, 2019). A line of recent work has applied graph-based neural networks (Zhou et al., 2019; Zhong et al., 2020). Our model is simply trained on linear sequences, using self- and cross-attention to learn inter-sentence interactions.

Pre-trained language models
A key to the success of state-of-the-art approaches is the use of pre-trained language models (Devlin et al., 2019; Lan et al., 2020). Here, we use BERT (Devlin et al., 2019), a bidirectional transformer (Vaswani et al., 2017), to obtain the vector representation of a token sequence. Each BERT layer transforms an input token sequence (one or two sentences) by using self-attention. The first hidden state vector of the final layer represents a special classification token (CLS), which can be used in downstream tasks. We denote the above process by $\mathrm{BERT}_{\mathrm{CLS}}(\cdot) \in \mathbb{R}^{d_h}$, where $d_h$ is the dimensionality of BERT hidden state vectors.

Proposed method
In this section, we describe our contributions, including (1) our method for training the sentence-selection model and (2) our veracity prediction model, which can be extended with inter-sentence attentions and joint training.

Learning to select sentences from mixed ground-truth and retrieved documents
Our goal is to select a subset of evidence sentences from all candidate sentences in the retrieved documents. We consider this task to be a binary classification problem that takes as input a pair of a claim c and a candidate sentence $e_j$ and maps it to the output z ∈ Z = {−1, +1}, where +1 indicates an evidence sentence and −1 otherwise. We train our sentence-selection model by minimizing the standard cross-entropy loss for each example:

$$\mathcal{L}(\phi) = -\sum_{z \in \mathcal{Z}} \mathbb{1}\{z = z^{*}\} \log p_{\phi}(z \mid c, e_j), \quad (1)$$

where 1{·} is the indicator function, $z^{*}$ is the gold label, and $p_{\phi}$ is the probability distribution over the two classes generated by our model. We compute $p_{\phi}$ by applying a multilayer perceptron (MLP) to the vector representation of $e_j$ followed by a softmax function:

$$p_{\phi}(z \mid c, e_j) = \mathrm{softmax}(\mathrm{MLP}(\mathbf{e}_j)). \quad (2)$$

The MLP contains two affine transformations that map $\mathbf{e}_j$ to the output space. Feeding the pair of c and $e_j$ to BERT allows us to obtain hidden state vectors that capture interactions between c and $e_j$ at the token level, due to the self-attention mechanism inside the BERT layers. We expect the final hidden state vector of the CLS token (i.e., $\mathbf{e}_j$) to encode useful information from $e_j$ with respect to c. The parameters φ include those in the MLP and BERT.

Training our model seems straightforward. However, two technical issues exist. First, each document typically contains one or two (or no) evidence sentences. Training with a few positive examples (i.e., evidence sentences) against all negative examples (i.e., non-evidence sentences) may be neither efficient nor effective. Soleimani et al. (2019) use hard negative mining (HNM) to repeatedly select a subset of difficult negative examples for training their sentence selector. Second, at test time, the model must examine all candidate sentences in the relevant documents returned by the document retriever. However, at training time, the model has no chance to learn the characteristics of non-evidence sentences in the irrelevant but highly ranked documents if only the ground-truth documents are used.
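As a minimal numeric sketch of the selection objective (the fixed two-element logit lists below stand in for the MLP outputs over the classes −1 and +1; the helper names are ours):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def selection_loss(logits, label):
    """Cross-entropy loss for one (claim, candidate-sentence) pair.
    `logits` holds the two MLP outputs for classes (-1, +1);
    `label` is +1 for an evidence sentence and -1 otherwise."""
    probs = softmax(logits)
    idx = 1 if label == +1 else 0
    return -math.log(probs[idx])
```

A confidently correct prediction yields a small loss, while a confidently wrong one is penalized heavily, which is what drives the selector toward separating evidence from non-evidence sentences.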
We propose to mitigate the aforementioned issues by using both the ground-truth and retrieved documents to create negative examples for a claim. First, we randomly choose r non-evidence sentences from each ground-truth document, where r is twice the number of true evidence sentences. Then, we sample two other non-evidence sentences from each retrieved document. For positive examples, we use the true evidence sentences in the ground-truth documents. Our scheme is more efficient than the HNM of Soleimani et al. (2019). At test time, we select the top M sentences according to the probabilities assigned to the positive class.
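The sampling scheme above can be sketched as follows (a simplification under our own naming: each ground-truth document is a pair of its evidence and non-evidence sentences, and a real implementation would also filter any true evidence sentences out of the retrieved documents before sampling):

```python
import random

def build_selection_examples(gold_docs, retrieved_docs, seed=0):
    """Create training examples for the sentence selector.

    gold_docs: list of (evidence_sentences, other_sentences) per
               ground-truth document
    retrieved_docs: list of candidate-sentence lists, one per
                    retrieved document
    Returns (positives, negatives).
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    for evidence, others in gold_docs:
        positives.extend(evidence)
        # r = twice the number of true evidence sentences
        r = 2 * len(evidence)
        negatives.extend(rng.sample(others, min(r, len(others))))
    for sentences in retrieved_docs:
        # two extra non-evidence sentences per retrieved document
        negatives.extend(rng.sample(sentences, min(2, len(sentences))))
    return positives, negatives
```

Unlike hard negative mining, the negatives are sampled once up front, which keeps training a single pass over a fixed example set.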

Multi-level attention and joint training for veracity relation prediction
Training-test discrepancies also occur in veracity relation prediction. At test time, the model predicts the veracity of a claim on the basis of the predicted evidence sentences. At training time, only true evidence sentences are available for SUPPORTED and REFUTED, but not for NOTENOUGHINFO. In other words, we have no training examples of sentences that relate to a claim to some degree but are insufficient to support or refute it. Thorne et al. (2018) simulate training examples for NOTENOUGHINFO by sampling a sentence from the highest-ranked page returned by the document retriever. We propose to reduce this discrepancy by using a mixture of true and predicted evidence sentences for training. First, we pair each claim with a list of the top M predicted sentences obtained through a sentence selector. At training time, we then prepend the true evidence sentences (if available) to the list and keep the number of all the sentences at most M. At test time, we use the top M predicted sentences without requiring a predefined threshold to filter them. This is in contrast to previous work (Zhou et al., 2019; Nie et al., 2019b; Wadden et al., 2020) and helps reduce engineering effort. Our example sentences for NOTENOUGHINFO come from the sentence selector, not from the document retriever as in Thorne et al. (2018). We expect our training examples to be similar to what our model may encounter at test time.
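The mixing step can be sketched as a small list operation (function and argument names are ours; deduplication handles the case where the selector has already found a gold sentence):

```python
def build_evidence_list(predicted, gold, m=5, training=True):
    """Pair a claim with at most m evidence sentences.

    predicted: ranked top-m sentences from the sentence selector
    gold: true evidence sentences (empty for NOTENOUGHINFO or at
          test time)
    At training time, gold sentences are prepended and duplicates
    removed; at test time, the predicted list is used as-is.
    """
    if not training:
        return predicted[:m]
    merged = list(gold)
    for sentence in predicted:
        if sentence not in merged:  # selector may already include gold
            merged.append(sentence)
    return merged[:m]
```

Because the tail of the list is filled with the selector's own (possibly noisy) predictions, the training inputs resemble what the model sees at test time.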
On the basis of the above scheme, each example is a pair of a claim c and a set of evidence sentences $\{e_j\}_{j=1}^{M}$. (True evidence sentences may already exist in the list because the sentence selector can correctly identify them.) Our goal is to predict the veracity relation label y ∈ Y = {S, R, N}. We train our veracity prediction model by minimizing the class-weighted cross-entropy loss for each example:

$$\mathcal{L}(\theta) = -\sum_{y \in \mathcal{Y}} \beta_y \, \mathbb{1}\{y = y^{*}\} \log p_{\theta}(y \mid c, E), \quad (3)$$

where $\beta_y$ is the class weight for dealing with the class imbalance problem (detailed in Section 4.2). Similar to Eq. (2), we compute the probability distribution $p_{\theta}$ over veracity relation labels as:

$$p_{\theta}(y \mid c, E) = \mathrm{softmax}(\mathrm{MLP}(\mathbf{a})). \quad (4)$$

Here, $\mathbf{a}$ is the vector representation of the aggregated evidence about a claim, obtained through the multi-head attention (MHA) function:

$$\mathbf{a} = \mathrm{MHA}(Q = \mathbf{c}, K = E, V = E), \quad (5)$$

where $\mathbf{c}$ is the claim vector, E is the set of evidence vectors $\{\mathbf{e}_j\}_{j=1}^{M}$, and Q, K, V denote the query, keys, and values, respectively. All the claim and evidence vectors are derived from BERT:

$$\mathbf{c} = \mathrm{BERT}_{\mathrm{CLS}}(c), \qquad \mathbf{e}_j = \mathrm{BERT}_{\mathrm{CLS}}(c, e_j). \quad (6)$$

The parameters θ are those in the MLP, MHA, and BERT. Now let us explain the MHA function, because we use and/or modify it in other components. The MHA function is based on the scaled dot-product attention (Vaswani et al., 2017):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\gamma}\right) V, \quad (7)$$

where $\gamma = \sqrt{d_h / n}$ is the scaling factor. The above function is the weighted sum of the values (i.e., the evidence vectors), where the weight assigned to each value is the result of applying a softmax function to the scaled dot products between the query (i.e., the claim vector) and the keys (i.e., the evidence vectors).
The MHA function contains a number of parallel heads (i.e., attention layers). We expect each head to capture different aspects of the input. We achieve this by linearly projecting Q, K, and V to new representations and feeding them to the scaled dot-product attention. Specifically, the MHA function is given by:

$$\mathrm{MHA}(Q, K, V) = [\mathrm{head}_1; \ldots; \mathrm{head}_n] \, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}), \quad (8)$$

where n is the number of parallel heads, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are the weight matrices of the linear projections.
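The core of Eq. (7) can be made concrete with a single-head, single-query sketch in plain Python (no projection matrices; vectors are lists of floats, and the function names are ours):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values, gamma):
    """Scaled dot-product attention for one query vector.

    Returns (output, weights): the output is the softmax-weighted
    sum of the value vectors, as in Eq. (7) with a single query row.
    """
    scores = [sum(q * k for q, k in zip(query, key)) / gamma
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values))
              for d in range(dim)]
    return output, weights
```

With the claim vector as the query and the evidence vectors as both keys and values, the output plays the role of the aggregated evidence vector a in Eq. (5): evidence most aligned with the claim receives the largest weight.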

Inter-sentence attentions
Although Eq. (5) helps aggregate the evidence from multiple selected sentences, our model still has no mechanism to learn interactions among these sentences. Unlike previous work that uses graph-based attention (Zhou et al., 2019; Zhong et al., 2020), our main tool is just the MHA function described above.
Let $H_j = [\mathbf{h}_{j,1}, \ldots, \mathbf{h}_{j,L}]$ be the sequence of hidden state vectors of $e_j$ generated by BERT, where L is the maximum sequence length. Let H be the concatenation of all the sequences $\{H_j\}_{j=1}^{M}$. We obtain a new representation G of the concatenated sequence by applying a residual connection between H and token-level self-attention:

$$G = \mathrm{LayerNorm}(H + \mathrm{MHA}(H)), \quad (9)$$

where MHA(·) is a simplified MHA function with one argument, because Q, K, and V all come from the same H.
In practice, we also add static (sinusoid) positional encodings (PE) to the input of MHA. We adopt this procedure from the original Transformer's sub-layer (Vaswani et al., 2017). The computation cost of Eq. (9) is not high. Concretely, let L = 128 and M = 5. The length of the concatenated sequence is thus 640 (L × M), which is only slightly longer than the maximum length of BERT's input sequence (i.e., 512 tokens).
Next, we perform sentence-level self-attention using a similar procedure. First, we split G back into individual sequences $\{G_j\}_{j=1}^{M}$. Then, we pick the first hidden state vector from each $G_j$, which corresponds to that of the CLS token. Let F be the concatenation of all the first hidden state vectors $\{\mathbf{g}_{j,1}\}_{j=1}^{M}$. We obtain the final representation E of the evidence sentences:

$$E = \mathrm{LayerNorm}(F + \mathrm{MHA}(F)). \quad (10)$$

We can use E as the keys and values in Eq. (5). Note that we do not share the parameters among the different MHA layers.

Figure 3: Architecture of our multi-level attention (MLA) model. The model takes as input a claim together with five evidence sentences. These sentences can be derived from any sentence selector. BERT encodes each sentence into a sequence of hidden state vectors, each of which is denoted by a squared box. The first hidden state vector (corresponding to the CLS token) is used for classification. MLA applies token- and sentence-level self-attentions and combines all the hidden state vectors as well as the sentence-selection scores at the final attention layer.
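The two-level procedure can be sketched end to end for tiny inputs (single-head attention without projections, and the LayerNorm of Eqs. (9)-(10) is omitted for brevity; all names are ours):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(seq, gamma=1.0):
    """Single-head self-attention: each vector in `seq` attends to all."""
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / gamma for k in seq]
        w = softmax(scores)
        out.append([sum(wi * v[d] for wi, v in zip(w, seq))
                    for d in range(len(q))])
    return out

def multi_level(hidden_seqs):
    """hidden_seqs: M sentences, each a length-L list of hidden vectors.

    Token level: self-attention over the concatenation, with a
    residual connection. Sentence level: self-attention over the
    first (CLS-position) vector of each sentence.
    """
    H = [h for seq in hidden_seqs for h in seq]          # length L * M
    G = [[a + b for a, b in zip(h, g)]
         for h, g in zip(H, self_attention(H))]          # residual
    L = len(hidden_seqs[0])
    F = [G[j * L] for j in range(len(hidden_seqs))]      # CLS positions
    E = [[a + b for a, b in zip(f, e)]
         for f, e in zip(F, self_attention(F))]          # residual
    return E
```

Token-level attention lets every token see tokens in the other sentences; sentence-level attention then mixes the per-sentence CLS summaries, which is the representation E fed back into Eq. (5).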

Joint training
Since the sentence-selection label assigned to each evidence sentence is available at training time, we can use it to guide our veracity prediction model. We apply the idea of multi-task learning (MTL) (Caruana, 1993; Ruder, 2017), in which we consider veracity relation prediction to be our main task and evidence sentence selection to be our auxiliary task. Our goal is to leverage training signals from the auxiliary task to improve the performance of the main task. Note that the sentence-selection component here is independent of the stand-alone model (i.e., our model in Section 3.1 or an alternative model in Section 4.3).
Let $\mathbf{s} = [s_1, \ldots, s_M]$ be the vector of sentence-selection scores, where $s_m$ denotes the probability of the positive class returned by our sentence-selection component. We propose using $\mathbf{s}$ as a gate vector that determines how much of the values should be maintained, before applying a residual connection followed by a linear projection. Thus, we modify Eq. (8) with:

$$\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^{Q},\; K W_i^{K},\; (V + \mathbf{s} \odot V) \, W_i^{V}\big), \quad (12)$$

where ⊙ denotes element-wise multiplication (with $\mathbf{s}$ broadcast over the rows of V). Our modification is close to Shaw et al. (2018)'s method, in which extra vectors are added to the keys and the values after applying the linear projections. During development, we found that their method does not work well in our task. We compare different strategies in Section 4.4, including applying the gate vector to the keys or to both the keys and the values.
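A stripped-down sketch of value-side gating (single head, no projection matrices or residual connection; names are ours): each evidence value vector is rescaled by its selection score before the attention-weighted sum, so low-scoring sentences contribute little to the aggregated output.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gated_attention(query, keys, values, gate, gamma=1.0):
    """Cross-attention in which each value vector is multiplied by its
    sentence-selection score (value-side gating) before aggregation."""
    gated_values = [[s * x for x in v] for s, v in zip(gate, values)]
    scores = [sum(q * k for q, k in zip(query, key)) / gamma
              for key in keys]
    w = softmax(scores)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, gated_values))
            for d in range(dim)]
```

Note that the gate leaves the attention weights themselves untouched; only the content that each sentence contributes is scaled, which matches the value-only strategy.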
Finally, we combine Eqs. (1) and (3) to get our composite loss function:

$$\mathcal{L}(\theta, \phi) = \mathcal{L}(\theta) + \lambda \, \mathcal{L}(\phi), \quad (13)$$

where λ is the weighting factor of the sentence-selection component.
To summarize, our model, shown in Figure 3, contains token-level attention over a claim-evidence pair through BERT, token- and sentence-level self-attentions across an evidence set, and claim-evidence cross-attention incorporating the sentence-selection scores through joint training. Hence, we call it the multi-level attention (MLA) model.

Experiments

Dataset and evaluation metrics

Table 1 shows the statistics of the FEVER dataset. We used the corpus of the June 2017 Wikipedia dump, which contains 5,416,537 articles preprocessed by Thorne et al. (2018). We used the document retrieval results given by Hanselowski et al. (2018), containing the predicted Wikipedia article titles (i.e., document IDs) for all the claims in the training/dev/test sets. Following Stammbach and Neumann (2019), Soleimani et al. (2019), and Liu et al. (2020), we prefixed the Wikipedia article titles to the candidate sentences to alleviate the co-reference resolution problem.
We evaluated performance by using the label accuracy (LA) and FEVER score. LA measures the 3-way classification accuracy of veracity relation prediction. The FEVER score reflects the performance of both evidence sentence selection and veracity relation prediction: a claim counts as correct only when a complete set of true evidence sentences is present in the selected sentences and the claim is correctly labeled. We used the official FEVER scorer during development. We limited the number of selected sentences to five (M = 5) according to the FEVER scorer. The performance on the blind test set was evaluated through the FEVER Challenge site.
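The two metrics can be sketched as follows (a simplification of the official scorer under our own naming; the real scorer also handles alternative evidence groups per annotator and the five-sentence cap):

```python
def fever_scores(examples):
    """Compute label accuracy (LA) and FEVER score.

    Each example is a tuple:
      (gold_label, pred_label, gold_evidence_sets, pred_evidence)
    gold_evidence_sets: alternative sets of true evidence IDs; for
    NOTENOUGHINFO it is empty, so evidence is not checked.
    """
    la_hits = fever_hits = 0
    for gold_label, pred_label, gold_sets, pred_evidence in examples:
        label_ok = gold_label == pred_label
        la_hits += label_ok
        # FEVER score: label correct AND some complete gold evidence
        # set is covered by the predicted sentences
        evidence_ok = not gold_sets or any(
            set(s).issubset(set(pred_evidence)) for s in gold_sets)
        fever_hits += label_ok and evidence_ok
    n = len(examples)
    return la_hits / n, fever_hits / n
```

Since the FEVER score requires a *complete* evidence set, high recall in sentence selection directly bounds the achievable score.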

Training details
We implemented our model on top of HuggingFace's Transformers (Wolf et al., 2020). The dimension of hidden state vectors $d_h$ and the number of heads n varied according to those of the pre-trained models. We used BERT-base ($d_h$ = 768; n = 12) for our stand-alone sentence-selection model and tried various BERT-style models for MLA. We trained all models using Adafactor (Shazeer and Stern, 2018) with a batch size of 256, a linear learning rate decay, a warmup ratio of 0.06, and gradient clipping of 1.0. Following the default configuration of HuggingFace's Transformers, we initialized all parameters by sampling from N(0, 0.02) and setting the biases to 0, except for the pre-trained models. We set λ in Eq. (13) to 1. We trained each model for 2 epochs with a learning rate of 5e-5, unless otherwise specified.
For regularization, we applied dropout (Hinton et al., 2012) with a probability of 0.1 to the MHA layers, MLP layers, and gated values in Eq. (12). We computed the class weight $\beta_y$ in Eq. (3) by:

$$\tilde{\beta}_y = \frac{N}{|\mathcal{Y}| \, N_y}, \qquad \beta_y = \frac{\tilde{\beta}_y}{\sum_{y' \in \mathcal{Y}} \tilde{\beta}_{y'}}, \quad (14)$$

where $\tilde{\beta}_y$ is the balanced heuristic used in scikit-learn (Pedregosa et al., 2011), and $\beta_y$ is normalized to sum to 1. In our case, N = 145,469 is the total number of training examples, |Y| = 3 is the number of classes, and $N_y$ is the number of training examples in class y (i.e., the first row in Table 1). We interpreted $\tilde{\beta}_y$ as the ratio of the balanced class distribution (N/|Y|) to the observed one ($N_y$). Here, we wanted to penalize errors on the less-observed classes, like REFUTED and NOTENOUGHINFO, more heavily.
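The weight computation can be sketched directly (the per-class counts in the example below are hypothetical round numbers, not the FEVER statistics from Table 1):

```python
def class_weights(counts):
    """Class weights from per-class training counts:
    the balanced heuristic N / (|Y| * N_y), normalized to sum to 1."""
    n = sum(counts.values())          # total number of examples (N)
    k = len(counts)                   # number of classes (|Y|)
    raw = {y: n / (k * n_y) for y, n_y in counts.items()}
    z = sum(raw.values())
    return {y: w / z for y, w in raw.items()}
```

Rarer classes receive larger weights, so their misclassification contributes more to the loss in Eq. (3).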

Baselines
The use of different pre-trained and pipeline models in previous work makes a fair comparison difficult. For this reason, we chose baseline models that use BERT-base for pre-training and Hanselowski et al. (2018)'s document retrieval results. We designed two sets of experiments. First, we required that all the models use the same sentence-selection model, which is Hanselowski et al. (2018)'s ESIM. For veracity relation prediction, Hanselowski et al. (2018) incorporate ESIM with attention and pooling operations to get a representation of a claim and the top five selected sentences. Soleimani et al. (2019) make five independent predictions for each claim-evidence pair and use a heuristic (Malon, 2018) to get a final prediction. GEAR (Zhou et al., 2019) is a graph-based model for evidence aggregation and reasoning. KGAT (Liu et al., 2020) is a kernel graph attention model. Second, we allowed different sentence-selection models. Soleimani et al. (2019) use HNM to select negative examples with the highest loss values, while our negative examples are sampled once from both the ground-truth and retrieved documents, as described in Section 3.1.

Table 2 shows the results of the two settings on the dev set. MLA outperforms the other baselines in both settings. Table 3 shows the sentence-selection results returned by the FEVER scorer. The precision, recall@5, and F1 are consistent across the three models. Hanselowski et al. (2018) use ESIM with a pairwise hinge loss, while Soleimani et al. (2019) use a pointwise loss with HNM. Our model is also a pointwise approach but simpler to train. Without sampling non-evidence sentences from the retrieved documents, all the scores drop by around 2%, indicating that our technique is useful. In the following sections, we use our BERT-base sentence-selection results with MLA.

Effect of pre-trained models
The next set of experiments examined the benefits of using different pre-trained models. ALBERT (Lan et al., 2020) is a lite BERT training approach that uses cross-layer parameter sharing and replaces next sentence prediction with sentence ordering. RoBERTa (Liu et al., 2019) is a robustly optimized BERT approach that introduces better training schemes, including dynamic masking, a larger batch size, and other techniques. We chose these two BERT-style models because they can be easily plugged into our implementation without much modification.

Table 4 shows the results of the different pre-trained models on the dev set. For all the large pre-trained models, we decreased the learning rate to 2e-5 and trained them for 3 epochs. Additional results, including training times, can be found in Appendix A. As shown in the table, BERT and ALBERT perform similarly, while ALBERT has fewer parameters. RoBERTa offers consistent improvements over the other two models and achieves the best performance with its large model. Therefore, we applied MLA with RoBERTa-large to the blind test set.
Comparison with state-of-the-art methods

Table 5 shows the results on the blind test set; they can also be found on the FEVER leaderboard: https://competitions.codalab.org/competitions/18814#results. The results are divided into two groups. The first group represents the top scores of the FEVER shared task, including those of Hanselowski et al. (2018), Yoneda et al. (2018), and Nie et al. (2019a). The second group contains results published after the shared task. GEAR (Zhou et al., 2019), KGAT (Liu et al., 2020), and DREAM (Zhong et al., 2020) are graph-based models. SR-MRS (Nie et al., 2019b) uses a semantic retrieval module for selecting evidence sentences. HESM (Subramanian and Lee, 2020) uses a multi-hop evidence retriever and a hierarchical evidence aggregation model. CorefRoBERTa (Ye et al., 2020) trains KGAT by using a pre-trained model that captures co-referential relations in context.

Ablation study
We conducted two sets of ablation studies on the dev set using MLA with BERT-base. First, we examined the effect of our proposed components. Table 6 shows that all the components contribute to the final results. Without class weighting, Eq. (3) falls back to the standard cross-entropy loss. Without joint training, MLA is a stand-alone veracity prediction model. These results suggest that token-level self-attention and class weighting are the two most important components of our model.
Second, we explored a number of strategies for exploiting the sentence-selection scores s. MLA basically uses s as a gate vector and only applies it to the values, as described in Eq. (12). We can apply the same calculation to the keys or both the keys and the values. In addition, we can use s as a bias vector and add it to the scaled dot-product term, as done by Yang et al. (2018). Table 7 shows the results of the aforementioned strategies. These results indicate that applying s to the values produces the best results.

Error analysis
To better understand the limitations of our method, we manually inspected 100 prediction errors on the dev set, where the true evidence sentences are present in the predicted sentences but MLA failed to predict the veracity relation labels. Here, we required that both BERT-base and RoBERTa-large MLA models produce the same errors. Table 8(a) shows a prediction error requiring complex reasoning that our models are unable to deal with. The claim "Philomena is a film nominated for seven awards." is supported by the evidence "It was also nominated for four BAFTA Awards and three Golden Globe Awards.". In this case, the models must understand that four plus three equals seven.
Table 8(b) shows a possible annotation error. The claim "Mick Thomson was born in Ohio." is annotated as SUPPORTED, while the evidence "Born in Des Moines, Iowa, he is best known as ..." refutes the claim. Our models also predict REFUTED.
Table 8(c) shows the half-true claim "Heavy Metal music was developed in the United Kingdom.", which is annotated as REFUTED. However, the evidence "Heavy metal (or simply metal) is ... developed ... in the United Kingdom and the United States." would indicate that the claim is partly true. The half-true label is defined in some previous smaller datasets (Vlachos and Riedel, 2014;Wang, 2017), but not in the FEVER dataset.
Table 8(d) shows the questionable claim "Harvard University is the first University in the U.S.", which is annotated as SUPPORTED by the evidence "... Harvard is the United States' oldest institution of higher learning ...". However, this evidence does not directly support the claim. Our models predict NOTENOUGHINFO. Our analysis suggests that probing disagreements between an ensemble of models and annotators may help improve annotation consistency. Additional results on error analysis are given in Appendix C.

Conclusion
We have presented a multi-level attention model that operates on linear sequences. We find that, when trained properly, the model outperforms its graph-based counterparts. Our results suggest that a sequence model is sufficient for this task and can serve as a strong baseline. Using better upstream components (i.e., a better document retriever or sentence selector) or larger pre-trained models would further improve the performance of our model. Training models that are robust to adversarial examples while maintaining high performance on normal ones is an important direction for our future work.

A Results of different pre-trained models

Table 9 shows the results of the different pre-trained models in detail. All the pre-trained models used in our experiments also come from HuggingFace. We conducted each experiment on a single NVIDIA Tesla A100 GPU with 40 GB RAM. We used a batch size of 256 with gradient accumulation to control memory.

B Results of evidence sentence selection

Table 10 shows the results of various sentence-selection models on the test set. Not all published models report precision and recall. Our precision, recall@5, and F1 scores are slightly better than those of Liu et al. (2020). Our sentence-selection model took 1 hour and 10 minutes to train. We find that high recall in evidence sentence selection is necessary to achieve good results in veracity relation prediction.

C Additional error analysis
Here, we provide additional examples of errors, including complex reasoning errors (Table 11), possible annotation errors (Table 12), half-true claims (Table 13), and questionable claims (Table 14).