Jointly Identifying and Fixing Inconsistent Readings from Information Extraction Systems

Moral values as commonsense norms shape our everyday individual and community behavior. The possibility to extract moral attitude rapidly from natural language is an appealing perspective that would enable a deeper understanding of social interaction dynamics and the individual cognitive and behavioral dimension. In this work we focus on detecting moral content from natural language and we test our methods on a corpus of tweets previously labeled as containing moral values or violations, according to Moral Foundation Theory. We develop and compare two different approaches: (i) a frame-based symbolic value detector based on knowledge graphs and (ii) a zero-shot machine learning model fine-tuned on a task of Natural Language Inference (NLI) and a task of emotion detection. The final outcome from our work consists in two approaches meant to perform without the need for prior training process on a moral value detection task.


Introduction
Information Extraction (IE) systems read text to extract entities and relations and create beliefs represented in a knowledge graph. Current systems, though, are far from perfect: e.g., in the 2017 Text Analysis Conference (TAC) Knowledge Base Population task, participants created knowledge graphs with relations like cause of death and city of headquarters from news corpora (Dang, 2017). When manually evaluated, no system achieved an F1 score above 0.3 (Rajput, 2017).
One reason for such low scores is inconsistency between the text and the extracted beliefs. We consider a belief to be consistent if the text from which it was extracted linguistically supports it (regardless of any logical or real-world factual truth). We show the difference between consistent and inconsistent readings, along with a potential correction, in Fig. 1. In Fig. 1a, the system considered Harry Reid was charged with an assault, which is not supported by the provenance text.

Belief learned by IE system: per:charges(Harry Reid, assault)
Provenance identified by IE system: Nevada's Harry Reid switches longtime stance to support assault weapon ban
Analysis output: Is reading consistent: Inconsistent; Suggested relation: no repair
(a) An inconsistent reading with no correction.

Belief learned by IE system: per:cause_of_death(Edward Hardman, Typhoid fever)
Provenance identified by IE system: The Western Australian government agreed to offer the Government Geologist post to Hardman shortly before news of his death reached them . Early in April , he contracted typhoid fever , and died a few days later in a Dublin hospital on 6 April
Analysis output: Is reading consistent: Consistent; Suggested relation: per:cause_of_death
(b) A consistent reading not requiring a correction. Notice the relation is unchanged.

We study two problems: (i) whether an extracted belief is consistent with its text (called consistency), and (ii) correcting it if not (called repair). We believe we are the first to study these problems jointly. We model these problems jointly, arguing that addressing both of these is important and can benefit one another. Our use of consistency here refers to a language-based sense that text supports the belief even if it contradicts world knowledge.
We are concerned with methods that can be standalone, that is, reliant on neither a precise schema (Ojha and Talukdar, 2017) nor an ensemble of IE systems, e.g., Yu et al. (2014); Viswanathan et al. (2015). Previous work on determining the consistency of an IE extraction was not standalone. We want a standalone approach because the results from non-standalone approaches cannot be applied when only the beliefs and associated provenance text are available, without the IE ensemble systems and schema. (For this study we consider English beliefs and provenance sentences.) Parallel to the broad IE domain, schema-free and standalone systems have been developed to verify the credibility of news claims (Popat et al., 2018; Riedel et al., 2017a; Rashkin et al., 2017), but we are not aware of a study of their performance on IE system tasks. We incorporate these credibility systems into our study in order to determine their applicability for our tasks.
We make the following contributions.
A study of real IE inconsistencies. We catalog and examine the understudied aspect of language-based consistency (§3).
A novel framework. To our knowledge we are the first to study and propose a framework for joint consistency and repair (§4).
Analysis of techniques. We show the effectiveness of straightforward techniques compared to more complicated approaches (§5).
Study of different provenance settings. We consider and contrast cases where provenance sentences are retrieved by an IE system (as in TAC) vs. where they are curated by humans (as in Zhang et al. (2017, TACRED)).
Task Setup

Consistency and Repair
We say the belief was consistently read if the text lexically supports the belief. While this can be viewed as a lexical entailment, it is not a logical, causal, or broader inferential/knowledge entailment. For example, the belief ⟨Barack Obama, per:president_of, Kenya⟩ is consistent with a provenance sentence "Barack Obama, president of Kenya, visited the U.S. for talks" even though the sentence falsely claims that Obama is president of Kenya. The belief is considered repaired if the relation extracted by the IE system was not supported by the text and has been replaced by another relation that is supported by the text.

Datasets
We use three datasets: TAC 2015, TAC 2017, and a novel dataset we call TACRED-KG. All datasets use actual output from real IE systems. Each dataset is split into train/dev/test splits: in Table 2 (in the appendix) we show the size of each split, in terms of the number of provenance-backed beliefs.
TAC 2015 and 2017. These include the output of 70+ IE systems, from the TAC 2015 and TAC 2017 shared tasks, with belief triples supported by up to four provenance sentences. Each belief was evaluated by an LDC expert (Ellis, 2015a). We used these LDC judgments as the consistency labels for our experiments. For TAC 2015, 27% of the 34k beliefs are judged consistent; for TAC 2017, 36% of the 57k beliefs are judged consistent.
These TAC datasets do not, however, contain information on possible corrections when the belief is inconsistent. To overcome this limitation, we used negative sampling on the consistent beliefs with their provenance to create an inconsistent pair. We first selected an entity and then identified a set of relations that apply to the entity. We randomly chose one of the relations with uniform probability and swapped it with another relation, keeping the provenance the same. For example, given two consistent beliefs ⟨Barack_Obama, president_of, US⟩ and ⟨Barack_Obama, school_attended, Harvard⟩, we swap president_of with school_attended, keeping the provenance unchanged. This yields inconsistent beliefs associated with corresponding provenance and the correct labels.
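The negative sampling step above can be sketched as follows. This is a minimal illustration under our own assumptions rather than the procedure used to build the datasets: the function name `make_inconsistent`, the tuple layout, and the two toy provenance sentences are hypothetical, and we assume one negative example per entity.

```python
import random
from collections import defaultdict


def make_inconsistent(consistent_beliefs, rng):
    """Create inconsistent (belief, provenance) pairs by swapping relations.

    consistent_beliefs: list of (subject, relation, object, provenance) tuples
    that were judged consistent. The tuple layout is an assumption.
    """
    by_subject = defaultdict(list)
    for belief in consistent_beliefs:
        by_subject[belief[0]].append(belief)

    negatives = []
    for subject, group in by_subject.items():
        relations = sorted({b[1] for b in group})
        if len(relations) < 2:
            continue  # nothing to swap with for this entity
        # Pick one belief uniformly at random, then swap in another relation.
        _, gold_relation, obj, provenance = rng.choice(group)
        wrong_relation = rng.choice([r for r in relations if r != gold_relation])
        negatives.append({
            "belief": (subject, wrong_relation, obj),
            "provenance": provenance,   # provenance is kept unchanged
            "consistent": False,
            "repair": gold_relation,    # the original relation is the repair label
        })
    return negatives


# Toy input; the provenance sentences are invented for illustration.
consistent = [
    ("Barack_Obama", "president_of", "US", "Barack Obama is the president of the US."),
    ("Barack_Obama", "school_attended", "Harvard", "Obama attended Harvard."),
]
negatives = make_inconsistent(consistent, random.Random(0))
```

Because the original relation is retained as the repair label, each sampled negative carries supervision for both the consistency and the repair task.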
TACRED-KG. The TACRED-KG dataset is a novel adaptation of the existing TACRED dataset (Zhang et al., 2017). We ran the relation extraction system of Zhang et al. (2017) on the TACRED data to produce 5-tuples (subject, object, provenance sentence, correct relation, predicted relation). From these we created a provenance-backed KG dataset, TACRED-KG, as (subject, predicted relation, object, provenance sentence). In TACRED-KG, we treat the gold standard relation as the repair label. We consider beliefs consistent when the predicted and gold standard relations are the same.

What Errors Do IE Systems Make?
We begin with an analysis of errors in the beliefs from actual IE systems. This analysis is enlightening, as each system used different approaches and types of resources to extract potential facts. We sampled 600 beliefs and their provenance text each from the training portions of three different knowledge graph datasets: TAC 2015, TAC 2017, and TACRED-KG. As described in §2.2, they all contain provenance-backed beliefs that were extracted from actual IE systems (but ones which are generally not available for subsequent downstream examination). All of the beliefs are represented as a relation between two arguments. The authors manually assessed these according to available and published guidelines (Ellis, 2015a,b; Dang, 2017) to understand the kinds of errors made by the IE systems. We identified four types of errors: the subject (first argument) not present in the provenance text; the object (second argument) not present in the provenance; an insufficiently supported relation between two present arguments; and relations that run afoul of formatting requirements, e.g., malformed dates. We show examples of these in Table 1.
Our analysis, summarized in Fig. 2, found that the most frequent error type is an incorrect relation, followed by missing subject, missing object and (at a trace level) formatting errors. Though it varied based on dataset, approximately two-thirds of the sampled belief-provenance pairs had errors. The prevalence of incorrect relations motivates the importance of the relation repair task. It should be noted that while TAC 2015 and 2017 have a number of instances of missing subjects and objects, this is not the case for TACRED-KG. This illustrates a fundamental difference in selecting provenance information manually vs. automatically, and one that we observe to be experimentally important (§5.3), between TAC 2015/2017 and TACRED-KG.

Approach
Our approach computes both the consistency of a belief $b_i$ and a "repaired" belief with respect to a given set of provenance sentences. We represent $b_i$ as a triple $\langle \text{subject}_i, \text{predicate}_i, \text{object}_i \rangle$ and the set of provenance sentences as $S_{i,1}, S_{i,2}, \ldots, S_{i,n}$. The system outputs two discrete predictions: (1) a binary one indicating whether the belief is consistent with the sentences, and (2) a categorical one suggesting a repair. Fig. 3 illustrates our approach for representing and combining the beliefs and provenance sentences to jointly learn the two tasks.

Figure 3: Given a belief and a set of n provenance sentences, our framework determines its consistency and suggests a repair when it is deemed inconsistent. Our approach has three main modules: representation (4.1), combination (4.2), and feature learning and classification (4.3).

Our approach has three main steps: embedding a belief and its provenance sentences in a vector space (§4.1), combining/aggregating these representations (§4.2), and using the result for additional feature learning and classification (§4.3). We describe our loss objective in §4.4. As we show, our framework can be thought of as generalizing high-performing credibility models, such as DeClarE (Popat et al., 2018) or LSTM-text (Rashkin et al., 2017).

Belief & Provenance Representation
We process and tokenize a belief's arguments and relation. For example, the belief ⟨Barack_Obama, per:president_of, United_States⟩ yields a subject span ("Barack Obama"), a relation span ("president of"), and an object span ("United States"). We input the processed text through an embedding function $f_{\text{belief}}$ to get a single embedding $b$ for the belief. Here, $f_{\text{belief}}$ could be the average of pretrained word embeddings, the final hidden state obtained from a sequence model (LSTM or Bi-LSTM), or the embedding from a transformer model (e.g., BERT (Devlin et al., 2019)). As we discuss in §5.2, we experiment with all of these.
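As an illustration of the simplest choice for $f_{\text{belief}}$, the sketch below tokenizes a belief triple and averages word vectors. It is a toy under stated assumptions: the helper names, the lowercasing, the handling of the `per:` prefix, and the two-dimensional embedding table are all hypothetical stand-ins for a real pretrained embedding lookup.

```python
def tokenize_belief(subject, relation, obj):
    """Split a belief triple into lowercase word tokens."""
    # Drop a namespace prefix such as "per:" and split on underscores.
    relation_name = relation.split(":", 1)[-1]
    return [
        token.lower()
        for part in (subject, relation_name, obj)
        for token in part.split("_")
    ]


def average_embedding(tokens, table, dim):
    """Bag-of-words embedding: mean of the vectors of known tokens."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return [0.0] * dim  # no known token: fall back to the zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy 2-dimensional table standing in for pretrained word embeddings.
toy_table = {"barack": [1.0, 0.0], "obama": [0.0, 1.0]}

tokens = tokenize_belief("Barack_Obama", "per:president_of", "United_States")
b = average_embedding(tokens, toy_table, dim=2)
```

Tokens missing from the table are skipped here; how out-of-vocabulary words are handled in practice is a design choice that this sketch does not settle.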
We represent the provenance sentences at two granularities. The first is by representing each sentence separately. We get a representation $s_i$ for each provenance sentence via an embedding function $f_{\text{evidence}}$ that embeds and combines them into a single vector. We define $f_{\text{evidence}}$ similarly to $f_{\text{belief}}$.
The second level considers all sentences at the same time. We refer to this as blob-level processing (rather than paragraph- or document-level) since the provenance sentences may come from different documents and we cannot assume any syntactic continuity between sentences. We obtain a representation of the blob from $f_{\text{blob}}$. In principle any method of distilling potentially disjoint text could be used here: we found TF-IDF to be effective, especially as multiple sentences of provenance selectively extracted from different sources could result in lengthy, but non-narratively coherent text (which can be problematic for transformer models).
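A blob-level TF-IDF vector can be sketched with the standard library alone. The weighting variant below (raw term counts with a smoothed, log-scaled inverse document frequency) and the whitespace tokenizer are our assumptions, since the exact TF-IDF formulation is not specified here; the provenance sets are toy data.

```python
import math
from collections import Counter


def fit_idf(blobs):
    """Smoothed inverse document frequency over a collection of blobs."""
    n = len(blobs)
    doc_freq = Counter(tok for blob in blobs for tok in set(blob.lower().split()))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf_vector(blob, idf, vocab):
    """Sparse-style TF-IDF vector for one blob, aligned with vocab order."""
    counts = Counter(blob.lower().split())
    return [counts[tok] * idf[tok] for tok in vocab]


# Each belief has a set of provenance sentences; a blob is their concatenation.
provenance_sets = [
    ["he contracted typhoid fever", "he died in a dublin hospital"],
    ["he visited a hospital"],
]
blobs = [" ".join(sentences) for sentences in provenance_sets]

idf = fit_idf(blobs)
vocab = sorted(idf)
x_blob = tfidf_vector(blobs[0], idf, vocab)
```

Because the blob is treated as an unordered bag of terms, the lack of narrative continuity between sentences from different documents does not matter for this representation.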

Belief and Provenance Combination
Given the belief and provenance representations, we compute their similarity $\alpha_i$ as the cosine of the angle between their embedded representations:

$$\alpha_i = \cos(b, s_i) = \frac{b^\top s_i}{\lVert b \rVert \, \lVert s_i \rVert}.$$

The intuition is that sentences that are more consistent with the belief will score higher than those which are less. Scoring is important, as each IE system may give multiple provenance sentences (e.g., TAC allowed four). The sentences can be correct and support the belief, or be poorly selected and unsupportive. Higher scores suggest the provenance is related to the belief and help differentiate supportive from unsupportive provenance. We use the computed similarity scores to combine the provenance representations and take a weighted average as our final input, capturing the semantics of the belief and provenance, as

$$x = \frac{1}{n} \sum_{i} \alpha_i \cdot s_i.$$

We pass the created representation $x$ as the input to the feature learning module.
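The similarity-based combination can be written in a few lines. The sketch below follows the description above (cosine scores, then a score-weighted average over the n sentences); the function names and the toy vectors are hypothetical, and returning a zero score for an all-zero vector is our own convention.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combine(belief, sentences):
    """Score each sentence against the belief and average the weighted vectors."""
    alphas = [cosine(belief, s) for s in sentences]
    n = len(sentences)
    x = [
        sum(alpha * s[k] for alpha, s in zip(alphas, sentences)) / n
        for k in range(len(belief))
    ]
    return alphas, x


# Toy example: the first sentence is aligned with the belief, the second is orthogonal.
belief_vec = [1.0, 0.0]
sentence_vecs = [[2.0, 0.0], [0.0, 3.0]]
alphas, x = combine(belief_vec, sentence_vecs)
```

In this toy case the orthogonal sentence receives a score of zero and contributes nothing to x, which is the intended effect for poorly selected provenance.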
Though our computation of $\alpha_i$ and $x$ operates at the sentence level, our approach can also be applied to individual word representations. For this word-level attention, we replace each sentence representation $s_i$ with a word representation $w_{ij}$ in our computation of $\alpha_i$ and $x$. While we experimented with this word-level attention, we found the model had trouble learning, frequently classifying nearly all beliefs as consistent, or as inconsistent with "no repair." We note that a similarly effective word-level attention was provided in DeClarE.
We selected a similarity-based, rather than position-based, attention. Applying position-based attention, as Zhang et al. (2017) did on the TACRED dataset, assumes that provenance sentences contain an explicit mention of the subject and object. In our setting that is explicitly not the case (recall the prevalence of missing arguments in our datasets, cf. Fig. 2). There is also an assumption that there is exactly one provenance sentence, as opposed to TAC, where an IE system can select up to four provenance sentences without explicitly mentioning either the subject or object.

Feature Learning and Classification
Prior to classification we may learn a more targeted representation $z$ by, e.g., passing the combined representation $x$ into a multi-layer perceptron (MLP). If we do not, then the consistency and repair classifiers operate directly on $z = x$.
We noticed through development set experiments that while adding additional layers initially helped, using more than three layers marginally decreased performance. For a $k$-layer MLP we obtained the projections $h^{(j)}$, for $1 \le j \le k$, as

$$h^{(j)} = g\left(W^{(j)} h^{(j-1)} + b^{(j)}\right).$$

Here $h^{(0)} = x$ indicates the input, $W^{(j)}$ and $b^{(j)}$ are each layer's learned weights and biases (respectively), and $g$ is the activation function. Through dev set experimentation we set $g$ to be ReLU (Glorot et al., 2011). We found the MLP gave better performance (§5) and that it was parametrically and computationally efficient. We note that the effectiveness of an MLP was also noted by the two top systems from the Fake News Challenge (Hanselowski et al., 2018; Riedel et al., 2017b) for the verification task. On dev, we evaluated from one to five hidden layers and found the performance to be consistent after three layers, with the mean close to the scores in Tables 3 and 4 and a maximum standard deviation across all the datasets and evaluation metrics of less than one F1 point.
In addition to the learned features $h^{(k)}$, we experiment with a lexically-based skip connection, where the input from the previous layer skips a few layers and is connected to a deeper one. We found this to be effective when making use of "blob" level features, computed via $f_{\text{blob}}$. We further found computing $f_{\text{blob}}$ as the TF-IDF vector of all provenance text to be especially effective (§5.5). When using this connection, we compute $z$ as the concatenation of the learned features and the blob vector,

$$z = \left[h^{(k)};\, f_{\text{blob}}(S_{i,1}, \ldots, S_{i,n})\right].$$

Classification. We use the final representation $z$ as input to the consistency ($\hat{y}_c = \text{sigmoid}(W_c z + b_c)$) and repair ($\hat{y}_r = \text{softmax}(W_r z + b_r)$) classifiers. The parameters $W_c$ and $W_r$ have sizes $1 \times (d_{\text{tf-idf}} + d_{\text{hidden}})$ and $d_{\text{relations}} \times (d_{\text{tf-idf}} + d_{\text{hidden}})$, respectively. Here $d_{\text{tf-idf}}$, $d_{\text{hidden}}$, and $d_{\text{relations}}$ are the dimension of the TF-IDF vector, the dimension of the hidden vector, and the number of relations considered by the IE systems.
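The feature learning and classification steps can be sketched as a forward pass over plain lists. This is an illustrative toy, not the trained model: the parameter values are made up, the function names are hypothetical, and we assume the blob vector is appended after the hidden features when forming z.

```python
import math


def dense(W, b, h):
    """Affine map W h + b with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def mlp(x, layers):
    """Apply h(j) = ReLU(W(j) h(j-1) + b(j)) for each (W, b) pair in layers."""
    h = x
    for W, b in layers:
        h = relu(dense(W, b, h))
    return h


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def classify(x, blob_vec, layers, W_c, b_c, W_r, b_r):
    """Skip connection: the blob vector bypasses the MLP and joins its output."""
    z = mlp(x, layers) + blob_vec          # list concatenation
    y_c = sigmoid(dense(W_c, b_c, z)[0])   # probability the belief is consistent
    y_r = softmax(dense(W_r, b_r, z))      # distribution over candidate relations
    return y_c, y_r


# Toy parameters: one hidden layer (identity weights), a 1-dim blob vector, 2 relations.
layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
y_c, y_r = classify(
    x=[1.0, -1.0], blob_vec=[2.0], layers=layers,
    W_c=[[0.0, 0.0, 0.0]], b_c=[0.0],
    W_r=[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]], b_r=[0.0, 0.0],
)
```

Both classifiers read the same vector z, which is what lets the two tasks share the belief and provenance representations.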

Joint Optimization
We train the parameters using back propagation of both losses, $\mathcal{L}_{\text{consistency}}$ and $\mathcal{L}_{\text{repair}}$, jointly:

$$\mathcal{L} = \mathcal{L}_{\text{consistency}} + \mathcal{L}_{\text{repair}}. \quad (1)$$

Each subloss is a cross-entropy loss between the true ($y_c$, $y_r$) and predicted ($\hat{y}_c$, $\hat{y}_r$) responses, weighted inversely proportional to the prevalence of the correct label. The tasks are not independent. In our formulation they share the same provenance and belief representations, so learning both tasks jointly helps in learning these shared parameters.
While in this paper we present a joint loss objective, we note that we separately experimented with alternative, non-joint approaches to Eq. (1). However, in development we found they performed worse than the joint approach. First, we evaluated pipelined approaches, e.g., where the repair classifier also considered the output of the credibility model, but found its performance to be inferior to the joint approach. Second, we also tried using the repair output as input to the credibility classifier, and found that it resulted in high recall with poor precision, with inconsistent instances being classified as consistent. The shared abstract representation of belief and provenance used in our formulation presented above allows fine-tuning for both subtasks. We also experimented on dev with other types of weighting, such as a uniform weighting. However, the inversely proportional weighting scheme we describe in the main paper is what performed best on dev experiments.
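For a single training example, the joint objective described in this section can be sketched as below. Several details are our assumptions and are not confirmed by the text: the two sublosses are summed without extra scaling, each class weight is the total count divided by the class count, and the label names and probabilities are toy values.

```python
import math
from collections import Counter


def inverse_prevalence_weights(labels):
    """Weight each label inversely to how often it occurs in the training data."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / count for label, count in counts.items()}


def weighted_cross_entropy(prob_of_gold, weight):
    return -weight * math.log(prob_of_gold)


def joint_loss(y_c, p_consistent, y_r, p_relations, w_c, w_r):
    """Sum of the weighted consistency and repair cross-entropy losses."""
    p_gold_c = p_consistent if y_c == 1 else 1.0 - p_consistent
    loss_c = weighted_cross_entropy(p_gold_c, w_c[y_c])
    loss_r = weighted_cross_entropy(p_relations[y_r], w_r[y_r])
    return loss_c + loss_r


# Toy training labels (hypothetical) used only to derive the class weights.
train_consistency = [1, 0, 0, 0]
train_repair = ["no_repair", "no_repair", "per:title", "per:title"]
w_c = inverse_prevalence_weights(train_consistency)
w_r = inverse_prevalence_weights(train_repair)

loss = joint_loss(
    y_c=1, p_consistent=0.5,
    y_r="per:title", p_relations={"no_repair": 0.5, "per:title": 0.5},
    w_c=w_c, w_r=w_r,
)
```

In a gradient-based implementation, both terms would be backpropagated through the shared belief and provenance encoders, which is where the benefit of joint training is expected to come from.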
A Generalizing Framework. We note that we can represent DeClarE by defining the belief encoder $f_{\text{belief}}$ as averaging word embeddings, a provenance encoder $f_{\text{evidence}}$ to be a Bi-LSTM, combining these representations with word-level attention, and passing them to a two-layer MLP without lexical skip connections. To achieve this specialization, we can optimize either $\mathcal{L}_{\text{consistency}}$ or $\mathcal{L}_{\text{repair}}$. Representing LSTM-text is similar. This shows that our framework encompasses prior work.

Experiments
We centered our study around four questions, answered throughout §5.3. (1) As our approach subsumes credibility models, can those credibility models also be used for the consistency and/or repair tasks (§5.3.1)? (2) What features and representations are important for the consistency and repair tasks (§5.3.2)? (3) How important is it to model the realized (sequential) order of words within the provenance sentences for our tasks (§5.3.3)? (4) What are the differences between relation repair and extraction (§5.3.4)?

Components
We evaluated the effect of each of the four major components mentioned below. We used GloVe (Pennington et al., 2014) as pre-trained word embeddings, except for BERT models, where we used the uncased base model (Devlin et al., 2019).
Representations (Rep.): We evaluated three ways to represent beliefs and provenance text (compute $f_{\text{belief}}$ and $f_{\text{evidence}}$): a Bag-of-Words (BoW) embedding, which is the average of GloVe embeddings; the final output from the LSTM and Bi-LSTM models; and the BERT representation output. While an average of embeddings may seem simple, this approach has empirically performed well on other tasks compared to more complicated models (Iyyer et al., 2015).
Feature Learning (Feat.): In our primary experiments we used a three-layer multi-layer perceptron ("MLP") to do further feature learning. We indicate no further feature learning with a value of "None."
"Blob" Sparse Connection ("Sparse"): If used, we set $f_{\text{blob}}$ to compute either a TF-IDF or binary lexical vector based on the blob (concatenation of all sentences for a belief). This computed representation skips the feature learning component and is provided directly to the classifier.

Results
The overall test results across our three datasets are shown in Table 3 for the consistency task and Table 4 for the repair task. Each of the selected models was, prior to evaluation on the test set, chosen due to its performance on development data. The results are averaged across three runs.

Can Credibility Models be Used?
We first examine and compare our proposed framework against two different strong-performing credibility models. These external methods are our baselines and we indicate them in Tables 3 and 4 by "♣" (Popat et al., 2018) and "♠" (Rashkin et al., 2017). We find they both perform poorly compared to other models, indicating that while both tasks learn similar functions, the credibility models cannot be used "as-is" for consistency. This highlights the fact that the consistency task is sufficiently different from the existing credibility task.
Moreover, in examining whether credibility models transfer to the repair task, word-level attention with a Bi-LSTM sentence encoder, as in DeClarE (Popat et al., 2018, ♣), performs poorly in the repair task too (with one exception on TACRED-KG). These results highlight differences in the credibility vs. consistency tasks, and the applicability of existing credibility models to both consistency and repair, suggesting that a dedicated framework and study such as ours is needed.

Table 5: Consistency and repair performance ablation study, averaged over three runs. "Comb." is belief and provenance combination, and "Skip" is the use of skip connection. All use an MLP for feature learning. For space, we only consider TAC 2017 in these experiments.

What Representations are Effective?
Consistency: Both sentence attention and a TF-IDF sparse connection improve the overall F1 of our framework's embedding-based models. We noticed that precision and recall vary across the datasets due to their different characteristics. This can be seen with the two methods that rely only on the lexically-based sparse connections (the first two rows of Table 3): while performance was strong on TACRED-KG consistency, it was quite poor on TAC 2015 and 2017. These latter two datasets have more provenance sentences per belief, and make fewer assumptions about what must be contained in the provenance. Together, this results in greater lexical variety, which suggests that while non-neural lexical-based consistency approaches can be effective in settings with limited provenance, stronger approaches are needed for greater and more diverse provenance. Learning refined embeddings (rows 5 and 6) suggests that these pre-trained models are helpful in the task. BERT benefits from the less noisy provenance in TACRED-KG. However, similar or slightly better performance is achieved when simple word embeddings are used, especially for TAC 2015/2017, highlighting the difficulty of the consistency task with noisier provenance.
Repair: Perhaps surprisingly, an embedding model with a TF-IDF sparse connection yielded good performance. The sparse-based lexical features are most influential, as evident from when just TF-IDF or binary lexical features are used. Looking across the three datasets, we notice that a TF-IDF only model provides a surprisingly strong baseline, outperforming the existing credibility models in almost all cases. Using BoW embedding with sentence attention, MLP feature learning, and a TF-IDF sparse connection, we can surpass a sparse-only TF-IDF approach. The BERT-based representation, fine-tuned or not, performed nearly equally to a BoW embedding on the repair task, indicating both the effectiveness of its pre-trained model and highlighting the difficulty of this repair task.
Belief: Marty Walsh; org:city_of_headquarters; Neighborhood House Charter School
Summary: (✓, fixed) Human(C): No; Predicted(C): No; Human(R): org:founded_by; Predicted(R): org:founded_by
Provenance: Walsh was a founding board member of Dorchester's Neighborhood House Charter School, and makes clear that he would support lifting the cap on charters in the city, something that hardly wins him the favor of the Boston Teachers Union.

Belief: Alan M. Dershowitz; per:title; professor
Summary: (✗, incorrect_fixed) Human(C): Yes; Predicted(C): No; Human(R): per:title; Predicted(R): per:religion
Provenance: Harvard Law professor Alan Dershowitz said Sunday that the Obama administration was naive and had possibly made a "cataclysmic error of gigantic proportions" in its deal to ease sanctions on Iran in exchange for an opening up of the Islamic Republic s nuclear program.

How Helpful Is Sequential Modeling?
As indicated by Zhang et al. (2017), the sentences in TACRED and TAC are long. Consistency and repair models must be able to handle that. Note that BoW representation methods do not consider word order, while LSTM, Bi-LSTM and BERT embeddings do. From Tables 3 and 4, we see that TF-IDF sparse features and a sentence-level combination of the belief and provenance give the best performance on both tasks when using a BoW representation, as compared to an LSTM, Bi-LSTM with word attention, and BERT. This indicates that for consistency and repair, unordered lexical features can be sufficient to get better performance.
We further examine this in Table 5, where due to space we focus on TAC 2017. Notice that while sequence-based encodings can improve some aspects (e.g., precision and F1 for consistency), there are not across-the-board improvements. We experimented with replacing the BoW embedding with a sentence-level Bi-LSTM representation. A Bi-LSTM representation with just attention and TF-IDF sparse features gives better consistency precision and F1 compared to BoW embedding approaches. However, the Bi-LSTM results in overall lower performance for repair. While the differences are not very large, they indicate that simple methods can outperform, or perform competitively with, sequential and autoencoding methods.

Relation Repair vs. Re-Extraction
While the repair task can be viewed as relation re-extraction, we examine the implications of this. Tables 3 and 4 show a large performance drop for TACRED-KG vs. TAC 2015/2017. First, TACRED was created from a TAC dataset and modified and augmented by crowd-sourced workers. When the belief was found with abstract or generalized provenance, workers were shown a set of sentences containing the subject-object pairs and asked to pick the representative sentence which was most specific. Second, each sentence is guaranteed to include the subject and object mentions, which is not always true for TAC 2015 and 2017, where a significant number of TAC provenance sentences were missing one or both of the subject and object mentions. This highlights some of the differences in the core assumptions made in the construction of a relation extraction dataset.

Prediction Error Analysis
Fig. 4 demonstrates our framework's performance on some examples from TAC 2015. The first example describes the case where the belief was consistent with the provenance information and there was no recommendation of an alternate relation. Depending on the provenance the fix may not be appropriate, as in the second example of per:title vs. per:religion, where we believe an indicative word like "Islamic" influenced the repair prediction.

Ablation Study
Our results show the strength of attention with lexical features. We further examine the impact of lexical features, using the first four rows of Table 5.
Lexical Impact on Consistency. From the first row of Table 5, we see BoW embedding for both the belief and provenance results in low precision and recall. While adding attention does not help, using TF-IDF sparse features drastically improves performance. Meanwhile, removing sentence-based attention only has a small impact on performance. Altogether this indicates the provenance found by the IE system is more lexically systematic.
Lexical Impact on Repair. A similar trend is seen for the repair task: our combined representation with TF-IDF is better than relying only on embeddings. Combining belief and provenance sentences gets slightly better micro overall compared to macro. This affects the MRR score too. However, the best performance is achieved when all components are combined.
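For readers unfamiliar with the MRR score mentioned above, the sketch below computes mean reciprocal rank from per-example relation scores. It uses the standard definition of the metric; the function name and the toy scores are hypothetical, and how ties and evaluation subsets are handled in the reported numbers is not specified here.

```python
def mean_reciprocal_rank(score_lists, gold_indices):
    """Average of 1 / rank of the gold relation, ranking by descending score."""
    total = 0.0
    for scores, gold in zip(score_lists, gold_indices):
        ranking = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
        total += 1.0 / (ranking.index(gold) + 1)
    return total / len(gold_indices)


# Toy repair scores over three candidate relations for two beliefs.
scores = [
    [0.7, 0.2, 0.1],  # gold relation 0 is ranked first  -> reciprocal rank 1.0
    [0.1, 0.3, 0.6],  # gold relation 1 is ranked second -> reciprocal rank 0.5
]
mrr = mean_reciprocal_rank(scores, gold_indices=[0, 1])
```

Unlike accuracy, this metric gives partial credit when the correct repair is ranked near, but not at, the top of the predicted distribution.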

Related Studies
There has been research on determining the consistency of beliefs using either schemas or ensembles, but none that are language-based, do not require access to IE system details, or attempt to repair inconsistent facts. Our work addresses all these.
Schema and Ensemble Based approaches: Previous work by Ojha and Talukdar (2017) and Pujara et al. (2013) determined the consistency of the extracted belief using a schema as the side information and coupling constraints to satisfy the schema's axioms. Rather than applying schemas, Yu et al. (2014) proposed an unsupervised method applying linguistic features to filter credible vs. non-credible beliefs. However, it required access to multiple IE systems with different configuration settings that extracted information from the same text corpus. Viswanathan et al. (2015) used a supervised approach to build a classifier from the confidence scores produced by multiple IE systems for the same belief. These are not standalone systems, as they assume the availability of multiple IE systems.
Language based approaches: The FEVER (Thorne et al., 2018) fact-checking study proposes a framework for the credibility task and performs provenance-based classification without attempting to repair errors. This task has inspired a number of efforts (Yin and Roth, 2018, i.a.), including Ma et al. (2019), who tackle a problem similar to our consistency. Guo et al. (2022) outline additional language-based approaches for consistency prediction (they term it "verdict prediction"). However, a crucial difference is that we aim to operate on KG tuple outputs as the belief (not sentences).
Overall, our study differs from previous ones in two important ways. (1) We address the problem of determining consistency and potential corrections without access to an underlying semantic schema. (2) Our standalone approach treats the underlying IE systems as black boxes and requires no access to the original IE systems or detailed system output containing confidence scores.

Conclusions
We propose a task of refining the beliefs produced by a blackbox IE system that provides no access to or knowledge of its internal workings. First we analyze the types of errors made. Then we propose two subtasks: determining the consistency of an extracted belief and its provenance text, and suggesting a repair to fix the belief. We present a modular framework that can use a variety of representation and learning techniques, and subsumes prior work. This framework provides effective techniques for the consistency and repair tasks.

Figure 1: Examples of beliefs extracted from real IE systems on the TAC 2015 English news corpus, demonstrating the consistency and repair tasks. Multiple sentences can contribute to a belief (1b).

Figure 2: Error categorization of 600 beliefs extracted by IE systems on three datasets. Multiple categories can apply as beliefs can have incorrect relations and incomplete provenance.

Figure 4: Examples of our model's predictions on the TAC 2015 dataset. Human: gold standard label, Predicted: our model's label, C: Consistency, R: Repair, Human(C): Human Consistency label, and Predicted(C): Predicted consistency label. Similarly for repair. Summary indicates overall prediction analysis of example. (✓, fixed) means consistency correctly predicted and incorrect belief was fixed.

Table 1: Examples for each of the four identified error categories from the TAC 2015 dataset.
Example provenance: Various news outlets have reported that federal agents have probable cause to charge Reginald Wayne Miller with forced labor, a felony that can carry up to a twenty-year prison sentence per charge.

Table 2: Dataset statistics, in the number of provenance-backed beliefs, for the train/dev/test splits per dataset.

Table 4: Repair performance (averaged over 3 runs) of models with abbreviations as in Table 3.