FactKG: Fact Verification via Reasoning on Knowledge Graphs

In real-world applications, knowledge graphs (KGs) are widely used in various domains (e.g., medical applications and dialogue agents). However, for fact verification, KGs have not been adequately utilized as a knowledge source. KGs can be a valuable knowledge source in fact verification due to their reliability and broad applicability. A KG consists of nodes and edges that make explicit how concepts are linked together, allowing machines to reason over chains of topics. However, there are many challenges in understanding how these machine-readable concepts map to information in text. To enable the community to better use KGs, we introduce a new dataset, FactKG: Fact Verification via Reasoning on Knowledge Graphs. It consists of 108k natural language claims with five types of reasoning: One-hop, Conjunction, Existence, Multi-hop, and Negation. Furthermore, FactKG contains various linguistic patterns, including colloquial-style claims as well as written-style claims, to increase practicality. Lastly, we develop a baseline approach and analyze FactKG over these reasoning types. We believe FactKG can advance both the reliability and practicality of KG-based fact verification.


Introduction
The widespread risk of misinformation has increased the demand for fact-checking, that is, judging whether a claim is true or false based on evidence. Accordingly, recent works on fact verification have been developed with various sources of evidence, such as text (Thorne et al., 2018; Augenstein et al., 2019; Jiang et al., 2020; Schuster et al., 2021) and tables (Aly et al., 2021). Unfortunately, knowledge graphs (KGs), one of the large-scale data forms, have not yet been fully utilized as a source of evidence. A KG is a valuable knowledge source due to two advantages.

* This work is not associated with Amazon.

Figure 1: An example from FACTKG. To verify whether the claim is SUPPORTED or REFUTED, we use triples extracted from DBpedia as evidence.
Firstly, KG-based fact verification can provide more reliable reasoning: since the efficacy of real-world fact-checking hinges on this reliability, recent studies have focused on justifying the decisions of a fact verification system (Kotonya and Toni, 2020a). In most existing works, the justification is based on an extractive summary of text evidence, so the inferential links between the evidence and the verdict are not clear (Kotonya and Toni, 2020b; Atanasova et al., 2020a,b). Compared to text and tables, a KG can represent the reasoning process simply, with logic rules over nodes and edges (Liang et al., 2022). This allows us to categorize common types of reasoning by their graphical structure, as shown in Table 1.
Secondly, KG-based fact verification techniques have broad applicability beyond the domain of fact-checking. For example, modern dialogue systems (e.g. Amazon Alexa (Amazon Staff, 2018), Google Assistant (Kale and Hewavitharana, 2018)) maintain and communicate with internal knowledge graphs, and it is crucial to make sure that their content is consistent with what the user says and otherwise to update the knowledge graphs accordingly. If we model the user's utterance as a claim and the dialogue system's internal knowledge graph as a knowledge source, checking their consistency can be seen as a form of KG-based fact verification. More generally, KG-based fact verification techniques can be applied to any case that requires checking the consistency between graphs and text. Reflecting these advantages, we introduce a new dataset, FACTKG: Fact Verification via Reasoning on Knowledge Graphs, consisting of 108k textual claims that can be verified against DBpedia (Lehmann et al., 2015) and labeled as SUPPORTED or REFUTED. We generated the claims based on graph-text pairs from WebNLG (Gardent et al., 2017) to incorporate various reasoning types. The claims in FACTKG are categorized into five reasoning types: One-hop, Conjunction, Existence, Multi-hop, and Negation. Furthermore, FACTKG contains claims in various styles, including colloquial ones, making it potentially suitable for a wider range of applications such as dialogue systems.
We conducted experiments on FACTKG to validate whether graph evidence has a positive effect on fact verification. Our experiments indicate that using graphical evidence in our model results in superior performance compared to baselines that do not incorporate such evidence.

Fact Verification and Structured Data
There are various types of knowledge used in fact verification, such as text, tables, and knowledge graphs. Research on fact verification has mainly focused on text data as evidence (Thorne et al., 2018; Augenstein et al., 2019; Jiang et al., 2020; Schuster et al., 2021). FEVER (Thorne et al., 2018), one of the representative fact verification datasets, is a large-scale manually annotated dataset derived from Wikipedia. Other recent works leverage ambiguous QA pairs, factual changes (Schuster et al., 2021), multiple documents (Jiang et al., 2020), or claims sourced from fact-checking websites (Augenstein et al., 2019). Fact verification on table data has also been studied (Aly et al., 2021). Table-based datasets such as SEM-TAB-FACTS or TabFact require reasoning abilities over tables, and FEVEROUS (Aly et al., 2021) validates claims using table and text sources. We refer the reader to Guo et al. (2022) for a comprehensive survey.
There have been several tasks that utilize knowledge graphs (Dettmers et al., 2018). For example, FB15K (Bordes et al., 2013), FB15K-237 (Toutanova and Chen, 2015), and WN18 (Bordes et al., 2013) are built upon subsets of the large-scale knowledge graphs Freebase (Bollacker et al., 2008) and WordNet (Miller, 1995), respectively. These datasets use only a single triple as a claim, so the claims require only One-hop reasoning. In contrast, FACTKG is the first KG-based fact verification dataset with natural language claims that require complex reasoning. In terms of evidence KG size, FACTKG uses the entire DBpedia (0.1B triples), which is significantly larger than previous datasets (FB15K: 592K, FB15K-237: 310K, WN18: 150K).

WebNLG
As constructing a KG-based fact verification dataset requires a paired text-graph corpus, we utilized WebNLG as a basis for FACTKG. WebNLG is a dataset for evaluating triple-based natural language generation, which consists of 25,298 pairs of high-quality text and RDF triples from DBpedia. WebNLG contains diverse forms of graphs, and the texts are created by linguistic experts, which gives it great variety and sophistication. In the 2020 challenge 1 , the dataset was expanded to 45,040 text-triple pairs. We used this 2020 version of WebNLG when constructing our dataset.

Figure 2: Two substitution methods utilized in FACTKG. In Entity substitution, we select a new entity located outside 4 hops from all entities in the original claim. If the results of bidirectional NLI are both contradiction, we finish this process. In Relation substitution, we randomly extract a relation that takes the same entity types for the head and tail as the original relation. Then, substitution is performed based on a template specific to the selected relation.

Data Construction
Our goal is to diversify the graph reasoning patterns and linguistic styles of the claims. To achieve this, we categorize claims into five reasoning types: One-hop, Conjunction, Existence, Multi-hop, and Negation. Our claims are generated by transforming the sentences in S_w, a subset of WebNLG's text-graph pairs (Section 3.1). 2 Next, we further diversified the claims with colloquial style transfer and presupposition (Section 3.2).

One-hop
The most basic type of claim is one-hop, which covers only one knowledge triple. One-hop claims can be verified by checking the existence of a single corresponding triple. In the second row of Table 1, the claim is SUPPORTED when the triple (AIDAstella, ShipBuilder, Meyer Werft) exists.

1 https://webnlg-challenge.loria.fr/challenge_2020/
2 We found that 99.7% of claims in FEVER and FEVEROUS consist of a single sentence. To reflect this, we extract a subset S_w containing only single sentences from WebNLG.
We take the sentences that consist of a single triple in S_w as SUPPORTED claims. REFUTED claims are created by modifying SUPPORTED claims in two ways: Entity substitution and Relation substitution. In Entity substitution, we replace an entity e in a SUPPORTED claim C with another entity ẽ of the same entity type. To ensure that the label of the substituted sentence C̃ is REFUTED, the entity ẽ should satisfy two conditions: i) to select an ẽ that is irrelevant to C, ẽ must lie outside 4 hops from all entities in C on DBpedia; ii) the results of NLI(C, C̃) and NLI(C̃, C) must both be CONTRADICTION. 3 In Relation substitution, we replace a relation in the SUPPORTED claim with another relation that takes the same entity types for the head and tail as the original relation (e.g. currentTeam ↔ formerTeam). The four groups of compatible relations are listed in Table 6. The overall process of the substitution methods is illustrated in Figure 2.
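The hop-distance condition (i) above can be sketched as a breadth-first search over the KG treated as an undirected graph. This is an illustrative sketch, not the authors' code: the function names and the toy triple list are our own, and the bidirectional NLI check (condition ii) is omitted.

```python
from collections import deque

def hop_distance(triples, src, dst, max_hops=4):
    """BFS over the KG viewed as an undirected graph; returns the number
    of hops from src to dst, or None if dst is unreachable within max_hops."""
    adj = {}
    for h, _, t in triples:
        adj.setdefault(h, set()).add(t)
        adj.setdefault(t, set()).add(h)
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        if d == max_hops:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

def is_valid_substitute(triples, claim_entities, candidate, max_hops=4):
    """Condition (i): the substitute entity must lie outside max_hops hops
    of every entity mentioned in the original claim."""
    return all(hop_distance(triples, e, candidate, max_hops) is None
               for e in claim_entities)
```

For example, with the toy triples (AIDAstella, shipBuilder, Meyer_Werft) and (Meyer_Werft, location, Papenburg), 'Papenburg' is 2 hops from 'AIDAstella' and is therefore rejected as a substitute.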

Conjunction
A claim in the real world can include a mixture of different facts. To incorporate this, we construct conjunction claims composed of multiple triples. Conjunction claims are verified by the existence of all corresponding triples. In the third row of Table 1, the claim can be divided into two parts: "AIDA Cruise line operated the AIDAstella." and "AIDAstella was built by Meyer Werft.". The claim is SUPPORTED when both triples (AIDAstella, ShipOperator, AIDA Cruises) and (AIDAstella, ShipBuilder, Meyer Werft) exist. To implement this idea, we extracted sentences consisting of more than one triple from S_w and used them as SUPPORTED claims. To create REFUTED claims, we apply the Entity substitution method to these SUPPORTED claims.
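The verification rule for conjunction claims reduces to a membership check over the evidence triples. A minimal sketch (function name and toy KG are ours):

```python
def verify_conjunction(kg_triples, claim_triples):
    """A conjunction claim is SUPPORTED iff every constituent triple
    exists in the KG; otherwise it is REFUTED."""
    kg = set(kg_triples)
    return "SUPPORTED" if all(t in kg for t in claim_triples) else "REFUTED"
```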

Existence
People may make claims that assert the existence of something (e.g. "She has two kids."). From the view of a triple, this corresponds to a missing head or tail. To reflect this scenario, we formulate a claim by extracting only {head, relation} or {tail, relation} from a triple. Existence claims are generated using templates and are divided into two categories: head-relation (e.g. template: "{head} had a(an) {relation}.") and tail-relation (e.g. template: "{tail} was a {relation}."). SUPPORTED claims are constructed by randomly extracting {head, relation} or {tail, relation} from triples in S_w. REFUTED claims are constructed using the same type of entities as in the claim, but with different relations. However, unrealistic claims may be generated in this manner. For example, "Meyer Werft had a location." or "Papenburg was a location." can be created from the triple (Meyer Werft, location, Papenburg). Hence, we selected the 22 relations that lead to realistic claims. Templates used for both categories and examples of generated claims are in Table 7.
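Using the two templates above, existence-claim construction and verification can be sketched as follows. The helper names are hypothetical, and the real pipeline additionally restricts generation to the 22 selected relations:

```python
HEAD_TEMPLATE = "{head} had a(an) {relation}."
TAIL_TEMPLATE = "{tail} was a {relation}."

def existence_claim(triple, side="head"):
    """Build an Existence claim from a triple by dropping the tail (head-relation)
    or the head (tail-relation)."""
    head, relation, tail = triple
    if side == "head":
        return HEAD_TEMPLATE.format(head=head, relation=relation)
    return TAIL_TEMPLATE.format(tail=tail, relation=relation)

def verify_existence(kg_triples, entity, relation, side="head"):
    """A head-relation claim is SUPPORTED iff some triple has that head and
    relation; a tail-relation claim iff some triple has that tail and relation."""
    if side == "head":
        ok = any(h == entity and r == relation for (h, r, t) in kg_triples)
    else:
        ok = any(t == entity and r == relation for (h, r, t) in kg_triples)
    return "SUPPORTED" if ok else "REFUTED"
```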

Multi-hop
We also consider multi-hop claims that require the validation of multiple facts where some entities are underspecified. Entities in such a claim are connected by a sequence of relations. For example, the multi-hop claim in Table 1 is SUPPORTED if the triples (AIDAstella, ShipBuilder, x) and (x, location, Papenburg) are present in the graph. The goal is to verify the existence of a path on the graph that starts from AIDAstella and reaches Papenburg through the relations ShipBuilder and location. Figure 3 shows how a SUPPORTED multi-hop claim C_M can be generated by replacing an entity e of a conjunction claim C with its type name. First, an entity e is selected from the green nodes. Then, the type name t of the entity e is extracted from DBpedia. However, each entity e in DBpedia has several types T = {t_1, t_2, ..., t_N}, and it is not annotated which type is relevant when e is used in a claim, so it is necessary to select one of them. For each t_n ∈ T, we insert it next to the entity e in the claim C and measure the perplexity of the modified claim using GPT2-large (Radford et al., 2019). We then replace e in the claim with the type name that yields the lowest score. The REFUTED claim is generated by applying Entity substitution to the SUPPORTED claim.
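The path-existence check that underlies multi-hop verification can be sketched as a frontier expansion along the relation sequence. This is our illustrative sketch; the perplexity-based type selection with GPT2-large is not shown.

```python
def verify_multihop(kg_triples, start, relations, end):
    """SUPPORTED iff some path follows `relations` in order from `start`
    and terminates at `end` (intermediate entities may be anything)."""
    frontier = {start}
    for rel in relations:
        # Expand the frontier one hop along the current relation.
        frontier = {t for (h, r, t) in kg_triples
                    if h in frontier and r == rel}
    return "SUPPORTED" if end in frontier else "REFUTED"
```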

Negation
For each of the four methods for generating claims, we develop claims that incorporate negations.
One-hop We use the Negative Claim Generation Model (Lee et al., 2021), which was fine-tuned on the opposite-claim set of the WikiFactCheck-English dataset (Sathe et al., 2020). 4 To ensure the quality of the generated sentences, we generate 100 opposing claims for each original claim, then use only those that preserve all entities and contain negations (e.g. 'not' or 'never'). Also, similar to the Entity substitution method, we use only sentences whose NLI relation with the original sentence is CONTRADICTION bidirectionally. When a negation is added, the label of the generated claim is reversed from the original claim.
Conjunction The use of negations (i.e., 'not') in various positions within conjunction claims allows the generation of a wide range of negative claim structures. We employ the pretrained language model GPT-J 6B (Wang and Komatsuzaki, 2021) to attach negations to the claim. We construct 16 in-context examples, each with negations attached to the text corresponding to the first and/or second relation. When a negation is added to a SUPPORTED claim, the claim becomes REFUTED. However, when it is added to a REFUTED claim, the label depends on the position of the negation: when negations are added to all parts with substituted entities, the claim becomes SUPPORTED; other cases preserve the label REFUTED, since the negation is added to a place that is not related to the entity substitution. A detailed labeling strategy is described in Appendix D.1.
Existence The claim is formulated by adding a negation within the templates presented in Section 3.1.3 (e.g. {tail} was not a {relation}.).

Multi-hop A claim is formulated using GPT-J with in-context examples, similar to Conjunction. The truth of the claim depends on the presence of a distinctive path that matches the claim's intent. For example, the negative claim "AIDAstella was built by a company, not in Papenburg." is SUPPORTED if an x exists such that the triples (AIDAstella, ShipBuilder, x) and (x, location, y) are in DBpedia and y is not Papenburg. A more detailed labeling strategy is in Appendix D.2.
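The labeling rule for this negated multi-hop example can be sketched as follows. This is an illustrative stand-in for the procedure in Appendix D.2; the function name and toy triples are ours.

```python
def verify_negated_multihop(kg_triples, start, r1, r2, negated_tail):
    """'<start> was built by a company, not in <negated_tail>' is SUPPORTED
    iff some x satisfies (start, r1, x) and (x, r2, y) with y != negated_tail."""
    mids = {t for (h, r, t) in kg_triples if h == start and r == r1}
    ys = {t for (h, r, t) in kg_triples if h in mids and r == r2}
    return "SUPPORTED" if any(y != negated_tail for y in ys) else "REFUTED"
```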

Colloquial Style Transfer
We transform the claims into a colloquial style via style transfer using both a fine-tuned language model and presupposition templates.

Model based
Using a method similar to one proposed in prior work, we transform the claims obtained in Section 3.1 into a colloquial style. For example, the claim "Obama was president." is converted to "Have you heard about Obama? He was president!".
We train FLAN T5-large (Chung et al., 2022) to generate a colloquial-style sentence given a corresponding written-style sentence from Wizard of Wikipedia (Dinan et al., 2019). However, sentences generated by the model can have several potential issues: i) the original and generated sentences are lexically identical, ii) some entities are missing in the generated sentence, iii) the generated sentence deviates semantically from the original, iv) the generated sentence lacks colloquialism, as noted in prior work. To overcome these, we oversample candidate sentences and apply an additional filtering process.
First, we oversample diverse candidate sentences for each claim. When none of the 500 generated sentences passes the filtering process, we include the original claim in the final dataset as a written-style claim. Following the filtering process, the AFLITE method (Sakaguchi et al., 2019), which utilizes adversarial filtering, is applied to select the most colloquial sentence among the surviving ones. We include the selected claim in the final dataset as a colloquial-style claim.

Presupposition
A presupposition is something the speaker assumes to be the case prior to making an utterance (Yule and Widdowson, 1996). People often communicate under the presupposition that their beliefs are universally accepted. We construct claims using this form of utterance. The claims in FACTKG are focused on three types of presupposition: factive, non-factive, and structural presuppositions.

Factive Presupposition
People frequently use verbs like "realize" or "remember" to express the truth of their assumptions. The utterance "I remembered that {Statement}." assumes that {Statement} is true. Reflecting this, a new claim is created by appending expressions that contain a presupposition (e.g. "I realized that" or "I wasn't aware that") to an existing claim. We used eight templates to make factive presupposition claims; the details are given in Table 8.

Non Factive Presupposition
Verbs such as "wish" are commonly used in utterances that describe events that have not occurred. For example, people say "I wish that {Statement}." when {Statement} did not happen. Claims created by the non-factive presupposition method are labeled as the opposite of the original. We used three templates to make these claims; the templates are given in Table 8.
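The factive and non-factive template mechanics can be sketched as below; only a subset of the Table 8 templates is shown, and the function name is ours. The key point is that factive templates preserve the original label while non-factive ones flip it.

```python
# Subset of templates from Table 8 (the full sets contain
# eight factive and three non-factive templates).
FACTIVE = ["I realized that {claim}", "I wasn't aware that {claim}"]
NON_FACTIVE = ["I wish that {claim}"]

def presuppose(claim, label, factive=True, idx=0):
    """Wrap a claim in a presupposition template. Factive templates keep
    the label; non-factive templates flip it."""
    templates = FACTIVE if factive else NON_FACTIVE
    flipped = {"SUPPORTED": "REFUTED", "REFUTED": "SUPPORTED"}
    new_label = label if factive else flipped[label]
    return templates[idx].format(claim=claim), new_label
```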

Structural Presupposition
This type takes the form of a question that presumes certain facts; we treat the question itself as a claim. For example, "When was Messi in Barcelona?" assumes that Messi was in Barcelona. To create natural sentence forms, only claims corresponding to One-hop and Existence are constructed. For one-hop claims, a different template was created for each relation, reflecting its meaning (e.g. "When did {head} die from {tail}?" for the relation deathCause and "When was {head} directed by {tail}?" for the relation director). Existence claims are also generated from templates (e.g. "When was {tail} {relation}?") using head-relation or tail-relation pairs, similar to Section 3.1.3. The templates used are described in Table 9.

Quality Control
To evaluate the quality of our dataset, the labeling strategy and the output of the colloquial style transfer model are assessed.

Labeling Strategy
When SUPPORTED claims are made in the manner described in Section 3.1, the labeling is straightforward, as all have precise evidence graphs. However, REFUTED claims are generated by random substitution, so there is a small chance that they remain SUPPORTED (e.g. "The White House is in Washington, D.C." to "The White House is in America."). To evaluate the substitution method, 1,000 randomly sampled substituted claims were reviewed by two graduate students. As a result, 99.4% of the generated claims were identified as REFUTED by both participants.

Colloquial Style Transfer Model
We also evaluate the quality of the colloquial-style claims generated by the model. A survey was conducted on all claims in the test set by three graduate students. Only 9.8% of the claims were marked as Loss of important information by at least two reviewers. In addition, to ensure the quality of the test set, only claims marked All facts are preserved by two or more reviewers are included in the test set. The survey details are in Appendix E.

Table 2 shows the statistics of FACTKG. We split the claims into train, dev, and test sets with a proportion of 8:1:1, ensuring that the set of triples in each split is disjoint from those in the other splits.

Experimental Setting
We publish FACTKG with sets of claims, graph evidence, and labels. The graph evidence includes entities and a set of relation sequences connected to them. For instance, for the claim "AIDAstella was built by a company in Papenburg.", the entity 'AIDAstella' corresponds to the relation sequence [shipBuilder, location] and 'Papenburg' corresponds to [∼location, ∼shipBuilder], where ∼ denotes a relation traversed in reverse. 6 In the test set, we only provide entities as graph evidence.
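Under the assumption that ∼ marks a relation traversed in reverse, the published evidence format can be interpreted by a simple traversal (a sketch; function name is ours, and ∼ is written as ASCII ~ in code):

```python
def follow(kg_triples, entity, relation_sequence):
    """Follow a relation sequence from `entity`. A '~' prefix (written
    '∼' in the paper) means the relation is traversed tail-to-head."""
    frontier = {entity}
    for rel in relation_sequence:
        if rel.startswith("~"):
            r = rel[1:]
            frontier = {h for (h, rr, t) in kg_triples if t in frontier and rr == r}
        else:
            frontier = {t for (h, rr, t) in kg_triples if h in frontier and rr == rel}
    return frontier
```

On the toy KG below, following [shipBuilder, location] from 'AIDAstella' and [~location, ~shipBuilder] from 'Papenburg' reach each other, mirroring the paired evidence sequences in the example.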

Baseline
We conduct experiments on FACTKG to see how the graphical evidence affects the fact verification task. To this end, we divided our baselines into two distinct categories based on the input type, Claim Only and With Graphical Evidence.

Claim Only
Figure 4: Overall process of our baseline. In the subgraph retrieval step, each classifier respectively predicts the relations and hops related to the given entity and the claim. Subsequently, we check all the n-hop relation sequences obtained from each classifier to find all evidence paths. In the fact verification step, the claim is verified by leveraging all outputs obtained from the subgraph retrieval step. In this figure, we denote the Transformer encoder as TRM.

In the Claim Only setting, the baseline models receive only the claim as input and predict the label. We use three transformer-based text classifiers: BERT, BlueBERT, and Flan-T5. BERT (Devlin et al., 2018) is trained on Wikipedia, from which DBpedia is extracted, so we expect the model to use evidence memorized in its pretrained weights (Petroni et al., 2019) or to exploit structural patterns in the generated claims (Schuster et al., 2019; Thorne and Vlachos, 2021). BlueBERT (Peng et al., 2019) is trained on a biomedical corpus, such as PubMed abstracts. We use BlueBERT as a comparator for BERT since it has never seen Wikipedia during its pre-training. Flan-T5 (Chung et al., 2022) is an enhanced version of the T5 (Raffel et al., 2022) encoder-decoder that has been fine-tuned on a mixture of 1.8K tasks. In all experiments, we fine-tune BERT and BlueBERT on our training set. Unlike BERT and BlueBERT, we use Flan-T5 in the zero-shot setting. For this setting, we use "Is this claim True or False? Claim: " as the prefix. Then, we measure the probability that the tokens True and False will appear in the output and choose the one with the higher probability.
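The zero-shot decision rule can be sketched independently of the model itself: build the prompt from the prefix and compare the probabilities of the two answer tokens. Here `token_prob` is a hypothetical stand-in for querying the language model; any callable with that shape works.

```python
PREFIX = "Is this claim True or False? Claim: "

def zero_shot_verdict(claim, token_prob):
    """Label a claim by comparing the model's probability of emitting
    'True' vs. 'False' as the first output token."""
    prompt = PREFIX + claim
    p_true = token_prob(prompt, "True")
    p_false = token_prob(prompt, "False")
    return "SUPPORTED" if p_true >= p_false else "REFUTED"
```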

With Graphical Evidence
In the With Graphical Evidence setting, the model receives the claim and graph evidence as input and predicts the label. The baseline we use is the framework proposed by GEAR, which enables reasoning over multiple pieces of evidence. Since GEAR was originally designed to reason over text passages, we change its components to suit a KG. The modified GEAR consists of a subgraph retrieval module and a claim verification module; its pipeline is illustrated in Figure 4.

Subgraph retrieval We replace the document retrieval and sentence selection in GEAR with subgraph retrieval. To retrieve graphical evidence, we train two independent BERT models: a relation classifier and a hop classifier. The relation classifier predicts the set of relations R from the claim c and the entity e. The hop classifier is designed to predict the maximum number of hops n to be traversed from e. We take the set P of relation sequences of length at most n composed only of relations in R, allowing for duplicates and considering the order. By traversing the knowledge graph starting from e along the relation sequences in P, we choose the paths that reach another entity that appears in the claim. If none of the paths reaches another entity, we randomly choose one of the paths. This strategy enables the model to retrieve both supporting and counterfactual evidence for the given claim. To illustrate: the example claim in Section 4.2 contains two entities, 'AIDAstella' and 'Papenburg'. Here the hop classifier must predict 2, since those entities are connected by a sequence of two relations, namely shipBuilder and location, and the relation classifier must correctly predict those two relations.
After that, we find all 2-hop paths starting from 'AIDAstella' along the predicted relations in the knowledge graph. If there is a path that reaches 'Papenburg', we use it as supporting evidence; if not, we randomly select a path.
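The retrieval-and-selection procedure described above can be sketched as follows. This is our illustrative reimplementation, not the released code: a bounded depth-first search stands in for traversing the predicted relation sequences, and the fallback randomly picks a path when no path reaches another claim entity.

```python
import random

def retrieve_paths(kg_triples, start, relations, max_hops):
    """Enumerate all paths from `start` of length <= max_hops whose edges
    use only the predicted relations (duplicates allowed, order matters)."""
    paths = []
    def dfs(node, path):
        if path:
            paths.append(list(path))
        if len(path) == max_hops:
            return
        for (h, r, t) in kg_triples:
            if h == node and r in relations:
                path.append((h, r, t))
                dfs(t, path)
                path.pop()
    dfs(start, [])
    return paths

def select_evidence(kg_triples, start, relations, max_hops, claim_entities, seed=0):
    """Keep paths whose terminal node is another claim entity; if none
    exists, fall back to one randomly chosen path."""
    paths = retrieve_paths(kg_triples, start, relations, max_hops)
    hits = [p for p in paths if p[-1][2] in claim_entities]
    if hits:
        return hits
    return [random.Random(seed).choice(paths)] if paths else []
```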
Fact verification We directly employ the claim verification module of GEAR with some changes to suit the KG setting. Since our evidence is a set of graph paths, we convert each path to text by concatenating its triples with the special token <SEP>. We also found that the ERNet in GEAR is identical to a Transformer encoder, so we replaced it with a randomly initialized Transformer encoder. To make this paper self-contained, we provide further details about the claim verification of GEAR in Appendix F.
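The path-to-text conversion with <SEP> can be sketched as (function name ours):

```python
def linearize(paths):
    """Turn each evidence path (a list of (head, relation, tail) triples)
    into a text snippet, joining triples with the special token <SEP>."""
    return [" <SEP> ".join(f"{h} {r} {t}" for (h, r, t) in p) for p in paths]
```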

Fact Verification Results
We evaluated the performance of the models in predicting labels and report accuracy by reasoning type in Table 3.
As expected, GEAR outperforms the other baseline models on most reasoning types because it uses graph evidence. In particular, on Existence and Negation, GEAR substantially outperforms the Claim Only baselines. Since Existence claims contain significantly less information than other types, having to search for evidence seems to increase fact verification performance. In addition, Negation claims require additional inference steps compared to other types, so logical reasoning based on graph evidence helps the model make correct predictions.
In the multi-hop setting, however, the accuracy of GEAR is lower than that of BERT, which may be due to the increased complexity of graph retrieval. When entities are far apart and many intermediate nodes are underspecified, the probability of retrieving an incorrect graph increases. In GEAR, the text and evidence paths are concatenated and used as input, so many incorrectly retrieved graphs can lead to incorrect predictions. Also, BERT's accuracy is highest in the multi-hop setting, which suggests that masked language modeling helps the model robustly handle unspecified entities in multi-hop claims.
In the Claim Only setting, all baselines outperform the Majority Class (51.35%), and BERT shows the highest performance. Although BlueBERT was pre-trained in the same manner, BERT performs better due to its pre-trained knowledge from Wikipedia.

Table 4: W refers to written-style claims and C refers to colloquial-style claims. W → C means that the model is trained on the written-style claim set and tested on the colloquial-style claim set. Flan-T5 is not used in this experiment because we use it only in the zero-shot setting.

Cross-Style Evaluation
We split the dataset into two disjoint sets, written style and colloquial style, and perform a cross-style fact verification task using them; the results are reported in Table 4. Initially, we anticipated that using different styles for the train and test sets would significantly decrease verification performance. However, contrary to our expectation, in the C→W setting BERT and BlueBERT improve over C→C, and even for GEAR the score drops only slightly. The results demonstrate that the colloquial style is constructed in various forms, which can be beneficial for generalization.

Conclusion
In this paper, we present FACTKG, a new dataset for fact verification using knowledge graphs. To reveal the relationship between fact verification and knowledge graph reasoning, we generated claims corresponding to specific graph patterns. Additionally, FACTKG includes colloquial-style claims that are applicable to dialogue systems. Our analysis showed that the claims in our dataset are difficult to solve without reasoning over the knowledge graph.
We expect the dataset to offer various research directions. One possible use of our dataset is as a benchmark for justification prediction. Most research on this task generates a text passage as justification, yet this approach lacks a gold reference. In contrast, the interpretability of the knowledge graph allows us to employ it as an explanation for the verdict, for example in medical question answering, where explainability is important. Furthermore, using the KG structure for claim generation allows us to build datasets with more complex multi-hop reasoning by design, without relying on annotator creativity.

Limitations
Since WebNLG is derived from the 2015-10 version of DBpedia, FACTKG does not reflect the latest knowledge. Another limitation is that the claims of FACTKG are constructed from single sentences, like other crowdsourced fact verification datasets. If claims were generated from more than one sentence, the dataset would be more challenging; we leave this as future work.

A Qualitative analysis
We report claims and the retrieved graphical evidence in Table A, along with the correctness of GEAR's prediction in the first column, Result. We used subgraph retrieval to retrieve graph paths and visualize one of them per claim. By checking the retrieved evidence, we can recognize why the model judged the claims as REFUTED or SUPPORTED. This shows that our graph evidence is fully interpretable.

B Relation Substitution
The four groups of compatible relations are listed in Table 6.

C.1 Existence
The templates to generate existence claims are described in Table 7.

C.2 Factive and Non Factive Presupposition
Factive and Non Factive presupposition templates are in Table 8.

C.3 Structural Presupposition
Structural presupposition templates are in Table 9.

D Negation Labeling D.1 Conjunction
When the negation is added to REFUTED claims, the label depends on the position of the negation. If negations are added to all parts with substituted entities, it becomes a SUPPORTED claim. Conversely, other cases preserve the label REFUTED since the negation is added to a place that is not related to entity substitution. Detailed examples are described in Table 10 and Table 11.

D.2 Multi-hop
The truth of this claim depends on the presence of a distinctive path that matches the claim's intent. For example, when verifying the claim in the fourth row of Table 12, we check the existence of an entity that is connected to 'AIDAstella' with the relation builder and not connected to 'New York' with the relation location.

E Colloquial Style Claim Survey
A total of 9 graduate students participated in the survey to evaluate how much information was lost in the colloquial-style claims compared to the original claims. Since each person has different criteria for 'important information', the labels are divided into five rather than two: i) All facts are preserved, ii) Minor loss of information or minor grammatical errors, iii) Ambiguous whether the lost information is important, iv) It is ambiguous, but the lost information may be important, v) Loss of important information. As a result, only 9.8% of the claims were marked v) Loss of important information by at least two reviewers.

F Details of GEAR
To make this paper self-contained, we recall some details of the claim verification in GEAR. The authors of GEAR used a sentence encoder to obtain representations for the claim and the evidence. Then they built a fully-connected evidence graph and used an evidence reasoning network (ERNet) to propagate information between pieces of evidence and reason over the graph. Finally, they used an evidence aggregator to infer the final result.

Sentence Encoder
Given an input sentence, the authors employed BERT (Devlin et al., 2018) as the sentence encoder, extracting the final hidden state of the [CLS] token as the representation.
Specifically, given a claim c and N pieces of retrieved evidence {e_1, e_2, ..., e_N}, they fed each evidence-claim pair (e_i, c) into BERT to obtain the evidence representation e_i, and fed the claim alone into BERT to obtain the claim representation c. That is,

e_i = BERT(e_i, c),  c = BERT(c). (1)

Evidence Reasoning Network
Let h^t = {h^t_1, h^t_2, ..., h^t_N} denote the hidden states of the nodes in layer t, where h^t_i ∈ R^{F×1} and F is the number of features in each node. The initial hidden state of each evidence node is the evidence representation: h^0_i = e_i. The authors proposed an Evidence Reasoning Network (ERNet) to propagate information among the evidence nodes. They first used an MLP to calculate the attention coefficients between a node i and its neighbor j (j ∈ N_i),

p_ij = W^{t-1}_1 (ReLU(W^{t-1}_0 [h^{t-1}_i ; h^{t-1}_j])), (2)

where N_i denotes the set of neighbors of node i, W^{t-1}_0 ∈ R^{H×2F} and W^{t-1}_1 ∈ R^{1×H} are weight matrices, and [· ; ·] denotes concatenation.

Then, they normalized the coefficients using the softmax function,

α_ij = exp(p_ij) / Σ_{k ∈ N_i} exp(p_ik). (3)

Finally, the normalized attention coefficients were used to compute a linear combination of the neighbor features, yielding the features for node i at layer t,

h^t_i = Σ_{j ∈ N_i} α_ij h^{t-1}_j. (4)

The authors fed the final hidden states of the evidence nodes {h^T_1, h^T_2, ..., h^T_N} into their evidence aggregator to make the final inference.

Table A (excerpt): claims, retrieved paths, and the correctness of GEAR's prediction.
- Correct | "Yeah! Alfredo Zitarrosa died in a city in Uruguay" | (Uruguay, country, Montevideo, deathPlace, Alfredo_Zitarrosa)
- "I have heard that Mobyland had a successor." | (Mobyland, successor, "Aero 2")
- Wrong | "I realized that a book was written by J. V. Jones and has the OCLC number 51969173" | (J._V._Jones, author, A_Cavern_of_Black_Ice, 'oclc', "39456030")

Evidence Aggregator
The authors employed an evidence aggregator to gather information from the different evidence nodes and obtain the final hidden state o ∈ R^{F×1}. We used the mean aggregator from GEAR, which performs an element-wise mean over the hidden states.
Once the final state o is obtained, the authors employed a one-layer MLP to get the final prediction l: l = softmax(ReLU(Wo + b)), where W ∈ R^{C×F} and b ∈ R^{C×1} are parameters and C is the number of prediction labels.
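The mean aggregator and the final one-layer MLP can be written out directly in NumPy. This is a sketch under the shapes stated above, not the GEAR implementation.

```python
import numpy as np

def aggregate_and_predict(H, W, b):
    """H: (N, F) final evidence-node states; W: (C, F); b: (C,).
    Mean-aggregate the nodes, then apply the one-layer MLP
    l = softmax(ReLU(W o + b)) to get the label distribution."""
    o = H.mean(axis=0)              # mean aggregator, shape (F,)
    z = np.maximum(W @ o + b, 0.0)  # ReLU(Wo + b), shape (C,)
    e = np.exp(z - z.max())         # numerically stable softmax
    return e / e.sum()              # label distribution over C classes
```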