WiCE: Real-World Entailment for Claims in Wikipedia

Textual entailment models are increasingly applied in settings like fact-checking, presupposition verification in question answering, or summary evaluation. However, these represent a significant domain shift from existing entailment datasets, and models underperform as a result. We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from Wikipedia. In addition to standard claim-level entailment, WiCE provides entailment judgments over sub-sentence units of the claim, and a minimal subset of evidence sentences that support each subclaim. To support this, we propose an automatic claim decomposition strategy using GPT-3.5 which we show is also effective at improving entailment models' performance on multiple datasets at test time. Finally, we show that real claims in our dataset involve challenging verification and retrieval problems that existing models fail to address.


Introduction
Textual entailment (Dagan et al., 2005) and natural language inference (MacCartney and Manning, 2009; Bowman et al., 2015; Williams et al., 2018) are longstanding problems in NLP that take many forms. The SNLI dataset has a stated purpose to use NLI "as a tool for the evaluation of domain-general approaches to semantic representation" (Bowman et al., 2015). However, this is far from how NLI is used today, e.g., to validate QA system outputs (Chen et al., 2021), evaluate generated summaries (Falke et al., 2019; Laban et al., 2022), or understand knowledge-grounded dialog (Honovich et al., 2021; Gupta et al., 2022; Dziri et al., 2022).
There are some major gaps when applying modern entailment systems to these tasks. First is the fact that many NLI datasets target short premises, often single sentences, such as VitaminC (Schuster et al., 2021) and WANLI (Liu et al., 2022). As a result, existing frameworks for document-level entailment are built upon aggregating local entailment scores (Zhou et al., 2019; Laban et al., 2022) or using a retrieval stage (Nie et al., 2019; Schuster et al., 2022). There are a few exceptions such as DocNLI (Yin et al., 2021), but it features a large amount of synthetic negative data. This highlights the second weakness: the lack of ecologically valid negative examples. The process by which contradictory cases are authored leads to spurious correlations, including single-word correlations (Gururangan et al., 2018; Gardner et al., 2021), syntactic heuristics (McCoy et al., 2019), or a lack of reliance on the input (Poliak et al., 2018). Third, these datasets lack fine-grained annotations of what parts of a claim are supported or not.

Our data is available at: https://github.com/ryokamoi/wice

Figure 1: WICE annotation for a claim in Wikipedia and its cited evidence. Claims are automatically broken into subclaims. WICE provides entailment labels and indices of evidence sentences that support a subclaim. Real-world claims are often partially supported (subclaim 2); unsupported tokens are annotated here.
Claim: …The Société de transport de Montréal (STM) 747 Shuttle Bus replaced the "Aerobus" that was privately operated by Groupe La Québécoise. [7]…
Subclaim 1: The Société de transport de Montréal (STM) 747 Shuttle Bus replaced the "Aerobus." (SUPPORTED; supporting sentence indices: 9, 11)
Subclaim 2: The "Aerobus" was privately operated by Groupe La Québécoise. (PARTIALLY-SUPPORTED; supporting sentence indices: 11; not-supported tokens: privately, Groupe)
Evidence (cited article [7]): …The route is the 747 Express bus, which finally provides a direct, non-stop link between downtown and Dorval Pierre Elliott Trudeau International Airport…It also replaces La Québécoise's Aérobus shuttle service between the bus station and the airport that used to run every half hour and cost $16…
Our work addresses these shortcomings by collecting WICE (Wikipedia Citation Entailment), a dataset for verification of real-world claims in Wikipedia. Given a sentence in Wikipedia and the corresponding article(s) it cites, we annotate the entailment label, a list of sentences in the cited article(s) that support the claim sentence, and tokens in the claim that are unsupported by the article(s) (see Figure 1). We show that the claims in WICE involve challenging verification and retrieval problems beyond the scope of current NLI datasets.
To aid the construction of WICE and provide fine-grained annotations, we introduce Claim-Split, a method of decomposing hypotheses into subclaims using few-shot prompting with GPT-3.5 (Brown et al., 2020; Ouyang et al., 2022), shown in Figure 2. This decomposition resembles past frameworks derived from OpenIE (Stanovsky et al., 2018; Ernst et al., 2021) or Pyramid (Nenkova and Passonneau, 2004; Shapira et al., 2019; Zhang and Bansal, 2021), but avoids relying on annotated data and achieves greater flexibility by using a large language model. By operating at the subclaim level, we simplify both our annotation process and the final entailment prediction task for automatic models. We also show that Claim-Split can improve the entailment classification performance of off-the-shelf models by simplifying long claims.
We evaluate a range of systems on our dataset, including existing short-paragraph entailment models "stretched" to make a document-level entailment judgment out of short-paragraph judgments (Laban et al., 2022; Schuster et al., 2022). We show that chunk-level processing of the long evidence and retrieval-based approaches are a strong starting point for future systems, although current systems still perform below human level on this dataset and retrieval performance is poor.

The WICE Dataset
We aim to annotate claims that are: (1) naturally occurring; we extract claims from Wikipedia and its cited documents, where the noise in citations gives realistic negative examples, (2) in-context with surrounding text; this mirrors real use cases where claims occur in discourse context, and (3) fine-grained; we break down complex claims into multiple subclaims with Claim-Split and provide entailment judgments for both granularities. We also provide token-level annotation for non-supported tokens.
Figure 2 (example):
Original Sentence: The main altar houses a 17th-century fresco of figures interacting with the framed 13th-century icon of the Madonna (1638), painted by Mario Balassi.
Decomposed Subclaims:
• The main altar houses a 17th-century fresco.
• The fresco is of figures interacting with the framed 13th-century icon of the Madonna.
• The icon of the Madonna was painted by Mario Balassi in 1638.

Claim-Split using LLMs
A key idea we argue for in this paper is claim decomposition. Real-world claims, such as those in Wikipedia that constitute WICE, often consist of multiple related pieces of information, each of which may or may not be supported by the evidence. We show an example of this in Figure 1, where the two subclaims derived from the complex claim have different entailment labels. By decomposing claims prior to annotation, and collecting annotations at the subclaim level, we can offer a more fine-grained view into which parts of claims are supported by cited documents. First, we describe our claim decomposition strategy, called Claim-Split. Then, we establish the validity of Claim-Split as a pre-annotation step by manually verifying the generated subclaims.
Claim-Split method Our method prompts an instruction-tuned GPT-3.5 model (text-davinci-002) (Brown et al., 2020; Ouyang et al., 2022) in a few-shot setting to automatically decompose a claim c into multiple subclaims: Claim-Split(c) = {c_1, ..., c_m}. The prompt includes K example splits along with the instruction "Segment the following sentence into individual facts." These examples are designed such that the subclaims provide complete coverage over the corresponding input claims (see the prompt used for WICE in Appendix D.1). Figure 2 shows an example of decomposed claims using our approach.
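A minimal sketch of this procedure, assuming the legacy (pre-1.0) OpenAI completions API; the one-example prompt below is hypothetical, while the actual six-example prompt is given in Appendix D.1:

```python
import openai  # legacy (pre-1.0) OpenAI SDK interface

# Hypothetical one-shot prompt; the real prompt contains six examples (Appendix D.1).
PROMPT_TEMPLATE = """Segment the following sentence into individual facts.

Sentence: Mount Tahat is the highest mountain in Algeria at 3,003 meters.
Facts:
- Mount Tahat is a mountain in Algeria.
- Mount Tahat is the highest mountain in Algeria.
- Mount Tahat is 3,003 meters high.

Sentence: {claim}
Facts:
"""

def claim_split(claim: str) -> list[str]:
    """Decompose a claim into subclaims via few-shot prompting."""
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=PROMPT_TEMPLATE.format(claim=claim),
        temperature=0.0,  # deterministic decomposition
        max_tokens=256,
    )
    text = response["choices"][0]["text"]
    # Each generated fact is emitted as a "- "-prefixed line.
    return [line[2:].strip() for line in text.splitlines() if line.startswith("- ")]
```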
Claim-Split generates subclaims correctly, and the subclaims completely cover the information in the source claim. We recruited Mechanical Turk workers to annotate subclaims decomposed by Claim-Split for 700 Wikipedia claims, for two necessary criteria: (1) Completeness: subclaims {c_1, ..., c_m} must cover all information in c, and (2) Correctness: all subclaims c_i must faithfully present a part of the information in c.
We found that 92.3% of the claims satisfy the completeness criterion and 97.7% of the generated subclaims satisfy the correctness criterion. Analyzing the errors further, we found that one-third were relatively trivial, such as disregarding parentheses in the original claim, and could be solved with targeted prompts. Other errors were more complex and do not have straightforward solutions; see Appendix A for more analysis. In total, only 8.6% of the claims included one of the two types of errors. These results justify our strategy of seeking annotations at the subclaim level and show that building systems at the subclaim level is a viable path for further efforts.
The set of annotated claims from this experiment constitutes the complete dev and test sets of WICE (details in Section 2.3); therefore, we manually fixed the errors detected in this annotation to provide a high-quality evaluation dataset.

Tasks in WICE
Let claim c (analogous to the hypothesis in standard NLI terminology) be a sentence in a Wikipedia article, and let evidence E = {e_1, ..., e_n} (analogous to the premise) refer to sentences from web article(s) cited as a reference for the claim c. Let Claim-Split(c) = {c_1, ..., c_m} be the automatically decomposed subclaims from c.
The human-annotated data we collect supports the following three tasks: 1. Entailment Classification: Given a claim c (or subclaim c_j) and evidence document E, is the claim (or subclaim) entailed by the document? We annotate three-way entailment: {SUPPORTED, PARTIALLY-SUPPORTED, NOT-SUPPORTED}.
2. Evidence Retrieval: Given a claim c (or subclaim c_j) and evidence document E, which subset of evidence sentences {e_1, ..., e_k} ⊂ E support or partially support c (or c_j)?
3. Detecting Non-Supported Tokens: Given subclaim c_j and evidence document E, which tokens {t_1, ..., t_p} ⊂ c_j are not supported by E?
For each of these three sub-tasks, we only collect human annotations at the subclaim level (as shown in Figure 1). Claim-level labels are obtained by automatically projecting subclaim annotations using a natural set of rules described in Section 2.3.
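For concreteness, one annotated instance can be pictured as the following structure. This is a hypothetical serialization (field names are illustrative, not necessarily those of the released data), with content taken from Figure 1:

```python
# Hypothetical serialization of one WiCE instance; field names are illustrative.
example = {
    "claim": 'The Société de transport de Montréal (STM) 747 Shuttle Bus replaced '
             'the "Aerobus" that was privately operated by Groupe La Québécoise.',
    "subclaims": [
        {
            "text": 'The Société de transport de Montréal (STM) 747 Shuttle Bus '
                    'replaced the "Aerobus."',
            "label": "SUPPORTED",
            "supporting_sentences": [[9, 11]],  # indices into the evidence sentence list
            "non_supported_tokens": [],
        },
        {
            "text": 'The "Aerobus" was privately operated by Groupe La Québécoise.',
            "label": "PARTIALLY-SUPPORTED",
            "supporting_sentences": [[11]],
            "non_supported_tokens": ["privately", "Groupe"],
        },
    ],
    "claim_label": "PARTIALLY-SUPPORTED",  # projected from subclaim labels (Section 2.3)
    "evidence": ["<evidence sentence 0>", "<evidence sentence 1>", "..."],
}
```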

Dataset Collection
Base Data We use the same base set of Wikipedia claims as the SIDE dataset (Petroni et al., 2023). For each claim, we re-retrieve the cited web article(s) from Common Crawl and segment them into sentences. This gives us our base set of claim-evidence pairs (c, {e_1, e_2, ..., e_n}).
Claim-Split Next, we split each claim c into subclaims {c_1, c_2, ..., c_m} using Claim-Split, as described in Section 2.1. We use few-shot prompting with six examples (Appendix D.1). Also, we filter claims that are decomposed into either only one or more than six subclaims. For the development set, this filters 19.1% of examples.

Additional Filtering
We use an NLI model (RoBERTa-Large) trained on DocNLI (Yin et al., 2021) to remove datapoints (c, E) for which all subclaims c_j ∈ Claim-Split(c) are classified as entailed. By using a relatively weaker RoBERTa-Large model, we remove trivially entailed claims but avoid making a dataset that is adversarially difficult for larger models (e.g., T5-3B).
We retain 16.3% of the claims after applying all the filtering steps above. Despite this, we observe diverse entailment phenomena in the remaining subset, which we analyze further in Section 2.5.

Human Annotation
We recruited Mechanical Turk workers to annotate evidence-subclaim pairs for each of the three tasks outlined in Section 2.2. First, we ran a qualification task with 3 examples (chosen to include challenging annotations) and qualified 23 workers based on it. Annotators were paid $0.75 per HIT. Each HIT involved annotation of all subclaims corresponding to a single claim.
We collect annotations from 5 unique workers for each example in the development and test sets, and from 3 for the train set. We observe reasonable inter-annotator agreement for the entailment classification task: Krippendorff's α = 0.62 on the development set. We describe how we aggregate annotations from these workers in Appendix B.

Deriving claim labels from subclaim labels
For entailment classification, if all subclaims are SUPPORTED or all are NOT-SUPPORTED, we assign that as the claim-level label. Otherwise, we assign PARTIALLY-SUPPORTED. For supporting sentences, we take the union of subclaim-level supporting sentences as the claim-level annotation. When there are multiple sets of supporting sentences for a subclaim, we take the union of all combinations.
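A sketch of these projection rules (the helper functions below are illustrative; label names are as in the dataset):

```python
from itertools import product

def project_claim_label(subclaim_labels: list[str]) -> str:
    """Derive the claim-level entailment label from subclaim labels."""
    if all(label == "SUPPORTED" for label in subclaim_labels):
        return "SUPPORTED"
    if all(label == "NOT-SUPPORTED" for label in subclaim_labels):
        return "NOT-SUPPORTED"
    return "PARTIALLY-SUPPORTED"

def project_supporting_sets(per_subclaim_sets: list[list[set[int]]]) -> list[set[int]]:
    """Each subclaim may have several alternative gold sets of supporting
    sentences; claim-level sets are unions over all combinations."""
    return [set().union(*combo) for combo in product(*per_subclaim_sets)]
```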

Dataset Statistics
Table 1 shows the overall statistics for the final WICE dataset. On average, claims were decomposed into 3.0 subclaims by the Claim-Split method. Our dataset contains approximately 5.9K subclaim-level (or 2K claim-level) examples for the entailment classification task. Each subclaim (or claim) is supported by an average of 1.9 (or 3.1) evidence sentences when the label is SUPPORTED or PARTIALLY-SUPPORTED.
Table 2 shows the distribution of entailment labels. Roughly 56% of the subclaims are labeled as SUPPORTED. This percentage is much lower at the claim level (33%) since all subclaims must be SUPPORTED. Consequently, a majority of the claims fall into the PARTIALLY-SUPPORTED category. In a typical NLI dataset, both PARTIALLY-SUPPORTED and NOT-SUPPORTED would simply be labeled as neutral, which is not very useful for a system designer attempting to verify a fact. For PARTIALLY-SUPPORTED subclaims, 25.2% of tokens were identified as not supported.

Analysis of Phenomena
We compare with FEVER and VitaminC to show that the in-the-wild claims and cited articles in WICE constitute diverse and challenging verification problems. We bucket verification problems from these datasets into categories and annotate examples selected from the development sets with these categories for the three datasets. In WICE, we annotated 127 subclaims in 50 claims. Table 3 shows the estimated distribution. We can see that natural claims in WICE involve difficult entailment classification problems, often requiring some kind of inference even at the subclaim level. In contrast, relatively few claims in VitaminC involve inference; they mostly require narrower types of reasoning such as calculation.

Experiments on WICE
We have three main questions for our dataset: (1) How well do existing NLI models perform off-the-shelf when using the "stretching" paradigm? (2) Does fine-tuning on our dataset improve accuracy? (3) Would being able to retrieve the relevant supporting sentences improve accuracy further?

Entailment Classification on WICE
We benchmark the performance of NLI models on WICE in both off-the-shelf and fine-tuned settings.
(If more than one category applies, we assign the most "difficult" category, i.e., the latest in our list. Examples are given in Table 13 in the appendix.)
Stretching NLI for document-level entailment WICE's evidence articles are generally much longer than the input length limits of NLI models. Therefore, we adopt the "stretching" technique from prior work (Laban et al., 2022; Schuster et al., 2022) for all our entailment models.
We divide the evidence document E into multiple partitions P_E. These can be individual sentences (e_i) or chunks (contiguous e_i's). We restrict the maximum chunk size to 256 tokens. For each subclaim/claim and partition pair, we use the NLI model to get a "local" entailment score: sc(c_i, p) = P(y = entailed | c_i, p). Then, as the first method of the "stretching" technique, we derive a document-level score by taking the maximum local score: sc(c_i, E) = max_{p ∈ P_E} sc(c_i, p). We call this the MAX entailment strategy.
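A sketch of the MAX strategy; nli_entail_prob (a function returning P(entailed | premise, hypothesis)) and tokenize are assumed to be supplied by whichever NLI model and tokenizer are being stretched:

```python
def chunk_evidence(evidence_sentences, tokenize, max_tokens=256):
    """Greedily pack contiguous evidence sentences into chunks of <= max_tokens.
    (A single over-long sentence becomes its own chunk.)"""
    chunks, current, current_len = [], [], 0
    for sent in evidence_sentences:
        n = len(tokenize(sent))
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def max_entailment_score(claim, evidence_sentences, nli_entail_prob, tokenize):
    """MAX strategy: document-level score is the max local score over partitions."""
    partitions = chunk_evidence(evidence_sentences, tokenize)
    return max(nli_entail_prob(premise, claim) for premise in partitions)
```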
The PARTIALLY-SUPPORTED and NOT-SUPPORTED labels in WICE correspond to the neutral (and sometimes contradiction) category in SNLI, MNLI, and ANLI. On the other hand, DocNLI only includes binary categories (entailed or not entailed). To evaluate models trained on these datasets on WICE, we consider a binary classification task: SUPPORTED or not. We use the predicted probability for entailment as the predicted score for the SUPPORTED class.
We evaluate GPT-3.5 and GPT-4 in Section 3.4 on the oracle retrieval dataset (explained in Appendix G); the stretching approach requires invoking the model for every (document chunk, subclaim) pair, which becomes very expensive to test with large models.
Models fine-tuned on WICE WICE consists of entailment labels corresponding to entire evidence documents, along with supporting sentences for SUPPORTED and PARTIALLY-SUPPORTED cases. To train models that can be stretched as described above, we derive sentence- and chunk-level entailment labels from these WICE annotations (details are in Appendix E.2). Although we evaluate performance on the binary classification task, we fine-tune models on the three-way classification labels in WICE.
Human Performance We manually annotate 50 randomly selected test claims and report that as human performance. As in the crowd annotation, we annotated at the subclaim level and aggregated labels to obtain claim-level judgments.
Results Table 4 outlines off-the-shelf performance on WICE for the binary entailment classification task. It shows that predicting scores at the chunk level works better than at the sentence level using the MAX strategy. Overall, T5-3B trained on ANLI performs best, though it is still substantially lower than human performance (64.3 vs. 83.3 F1 at the claim level). This shows that the realistic claims and document-level setting of WICE differ substantially from previous NLI datasets.
For fine-tuning, we evaluate two settings: fine-tuning only on WICE, or further fine-tuning a T5-3B model already fine-tuned on ANLI. Results are in Table 5. The chunk-level T5-3B model fine-tuned on WICE after ANLI achieves the best performance at both granularity levels. Although it improves over the off-the-shelf results in Table 4, it is still substantially lower than human performance. This suggests that WICE is challenging even for fine-tuned models.

Evidence Retrieval on WICE
First, we benchmark the performance of sentence-level NLI models fine-tuned on WICE on the retrieval task: given claim/subclaim c, retrieve all supporting sentences from evidence E. Then, we evaluate whether a retrieve-then-predict pipeline can improve the performance of entailment classification.
Retrieval using NLI models Our strategy is as follows: derive a score for each evidence sentence and claim/subclaim pair. We use p(entailed) for models trained on prior NLI datasets and p(SUPPORTED) + 0.5 × p(PARTIALLY-SUPPORTED) for models trained on WICE as retrieval scores. Evidence sentences with scores larger than a threshold τ are predicted as supporting sentences.
However, we saw in Table 5 that sentence-level NLI models perform significantly worse than chunk-level models on classification, suggesting that a single sentence without context is insufficient for entailment evaluation. Therefore, we also train another sentence-level variant that includes additional evidence context (128 tokens) as input, in the format "claim <SEP> evidence-context <SEP> evidence-sentence".

Metric We evaluate supported or partially supported claims and subclaims, which include at least one gold supporting sentence. As there can be multiple gold sets of supporting sentences for each claim/subclaim in WICE, we report the maximum F1 score over the gold sets for each claim/subclaim: max_i F1(Ŝ_τ, S_i), where Ŝ_τ is the predicted set and S_i is the i-th gold set. We choose the threshold τ that gives the best F1 score.

Results: Including context from evidence sentences improves retrieval performance. Table 6 reports the performance of the baseline BM25, the best off-the-shelf model from Table 4 (T5-3B on ANLI), and fine-tuned entailment models. The performance of models fine-tuned with evidence context on WICE is shown in the bottom half of the table. We find that this latter category of models performs best, with the best performance reported by the T5-3B model fine-tuned on WICE.
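A sketch of the retrieval scoring and the max-F1 metric described above (the helper functions are illustrative):

```python
def retrieval_score(p_supported, p_partially_supported):
    """Retrieval score for a (claim, evidence sentence) pair with a WICE-trained model."""
    return p_supported + 0.5 * p_partially_supported

def f1(predicted: set, gold: set) -> float:
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def max_f1_over_gold_sets(sentence_scores, gold_sets, tau):
    """Threshold per-sentence scores at tau, then take the maximum F1 over the
    alternative gold sets of supporting sentences."""
    predicted = {i for i, score in enumerate(sentence_scores) if score > tau}
    return max(f1(predicted, gold) for gold in gold_sets)
```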

Entailment Classification using Retrieval
Here, we use retrieve-then-predict rather than the MAX "stretching" strategy from Section 3.1.
Setup We retrieve the top-k (k = 7 in our experiments) sentences using the sentence-level retrieval scores from Section 3.2. These sentences are concatenated to construct a new premise/evidence, which is used by the chunk-level NLI model to make a document-level judgment: sc(c_i, E) = P(y = entailed | c_i, e_{i,1} ... e_{i,k}), where e_{i,1} ... e_{i,k} are the top-k retrieved sentences. This strategy is similar to Nie et al. (2019).
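A sketch of this pipeline; sentence_score (the sentence-level retrieval scorer from Section 3.2) and nli_entail_prob (the chunk-level NLI model) are assumed to be supplied, and restoring document order before concatenation is an illustrative choice:

```python
def retrieve_then_predict(claim, evidence_sentences, sentence_score,
                          nli_entail_prob, k=7):
    """Retrieve the top-k evidence sentences by retrieval score, concatenate
    them into a new premise, and run the NLI model once."""
    ranked = sorted(range(len(evidence_sentences)),
                    key=lambda i: sentence_score(claim, evidence_sentences[i]),
                    reverse=True)
    top_k = sorted(ranked[:k])  # restore document order (an implementation choice)
    premise = " ".join(evidence_sentences[i] for i in top_k)
    return nli_entail_prob(premise, claim)
```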

Results
We report the performance of the chunk-level T5-3B model fine-tuned on ANLI+WICE in Table 7. It shows that the retrieve-then-predict strategy using the retrieval model without evidence context does not work well. However, adding context improves performance significantly. This mirrors our evidence retrieval results from Table 6. As an upper bound, we report results in the oracle setting, i.e., when a gold set of supporting sentences is provided as input to the NLI model. We see substantially improved performance in the oracle setting (71.1 vs. 78.0). This large gap suggests that better retrieval can improve entailment classification performance.

Entailment Classification by GPT
Table 8 shows the entailment classification performance of GPT-3.5 (Brown et al., 2020; Ouyang et al., 2022) and GPT-4 (OpenAI, 2023) on WICE with oracle retrieval, which uses the gold set of supporting sentences. We use a few-shot prompt with three examples (Appendix H). We see that GPT-4 is stronger at entailment classification than our best fine-tuned subclaim-level model, but claim-level classification, which is expected to be more complex, is still challenging for simple few-shot prompting even with oracle retrieval.
Although we evaluate the GPT models with oracle retrieval, we believe that future work can explore how to scale GPT-4 to long-document entailment settings (e.g., tradeoffs of using stretching vs. feeding in contexts up to the maximum size allowed by GPT-4) and reduce its cost so it can be practically deployed for fact verification in real workflows.

Figure 3: Entailment classification using Claim-Split. We decompose a claim into subclaims, predict subclaim-level entailment scores, then aggregate these to get the score for the claim.
Example claim: A broken shackle and chain lay at her feet as she walks forward, commemorating the recent national abolition of slavery.
Subclaims: A broken shackle and chain lay at her feet. / She walks forward. / This is commemorating the recent national abolition of slavery.
Is Claim-Split useful for entailment classification?
We introduced Claim-Split as a method to provide fine-grained annotation by decomposing claims into simpler subclaims, and we used it with human verification to ensure the quality of our dataset. However, the human evaluation in Section 2.1 shows that Claim-Split makes mistakes less than 10% of the time, indicating its robustness. In this section, we aim to demonstrate the potential impact of Claim-Split beyond dataset construction and its applicability to diverse domains. Specifically, we test the hypothesis that Claim-Split makes entailment classification easier and improves the performance of NLI models. To show this, we evaluate a strategy of entailment prediction that aggregates subclaim-level entailment scores, on four datasets.
Setup Given a claim-evidence pair (c, e) and target entailment label y, we compare the performance of two configurations that use identical entailment models for prediction: 1. Standard: As is the typical mode of inference with these models, we directly predict the entailment label for the original claim c. The experiments in Section 3 are also in this setting.
2. w/ Claim-Split: We first predict subclaim-level entailment scores for each subclaim c_j ∈ Claim-Split(c). These are combined into a claim-level entailment score using the harmonic mean (a sketch is given below). Figure 3 describes this configuration. Although aggregating using min is more intuitive, we found that it is quite sensitive to mistakes made by the entailment models or the Claim-Split method; the harmonic mean was a good balance between min and the arithmetic mean.

Test Data In addition to WICE, we report results on the test sets of three datasets from Honovich et al. (2022): VitaminC (fact verification) (Schuster et al., 2021), PAWS (paraphrase) (Zhang et al., 2019), and FRANK-XSum (summarization) (Pagnoni et al., 2021). We evaluate the 500 longest claims in VitaminC and PAWS, using length as a proxy for the complexity that Claim-Split is designed for. To decompose claims using Claim-Split, we use a unique prompt for each dataset that includes 2-4 dataset-specific examples (Appendix D.1).
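A sketch of the harmonic-mean aggregation used in the "w/ Claim-Split" configuration (per-subclaim scores are assumed to come from the MAX strategy of Section 3.1):

```python
def harmonic_mean_aggregate(subclaim_scores, eps=1e-8):
    """Combine subclaim-level entailment scores into a claim-level score.
    Like min, the harmonic mean is dominated by low scores, but it is less
    sensitive to a single spurious low score from the NLI model or Claim-Split."""
    return len(subclaim_scores) / sum(1.0 / max(s, eps) for s in subclaim_scores)
```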

Models
We evaluate the performance of T5 models fine-tuned on ANLI for the off-the-shelf setting and on WICE for the fine-tuned setting. For WICE, we use the MAX setting as described in Section 3.1. Note that we use the same trained models when comparing the standard and w/ Claim-Split settings.

Results
We report the AUROC metric for the entailment classification task in Table 9. For most model-dataset pairs, the "w/ Claim-Split" method outperforms the standard method of using off-the-shelf models. For the smaller T5-Large model, we observe statistically significant improvements using Claim-Split for three datasets. This is intuitive, as reducing the complexity of the problem likely benefits models with lower capability. Our results show that despite the introduction of noise (discussed in Section 2.1), Claim-Split is effective at simplifying the entailment classification task and improving performance. We expect that the use of better prompts and aggregation methods would lead to further improvement.
Discussion: Outstanding Challenges

Explainable entailment. Our end goal is an explainable document-level entailment system that is able to localize non-factuality within claims, as our subclaim and token-level annotations allow. This requires surfacing the right evidence, which we show remains a hard problem.
Unsupported token detection. Although we did not conduct experiments on this, we believe that localization of errors is an important problem that is difficult to address with existing methods (Kamoi et al., 2023). This problem should be further studied in the context of decomposition techniques like Claim-Split.
Better understanding of contextualization. Finally, we believe that the nature of contextualization remains a major unsolved problem. While decontextualizing claims is an appealing possibility (Choi et al., 2021), we found that not all claims were easy to succinctly decontextualize (for example, "The fresco is of figures..." in Figure 2).

Fact-verification datasets A separate line of datasets designed for multi-hop reasoning comes from fact verification (Thorne et al., 2018; Schuster et al., 2021). However, in practice, claims in these datasets rarely require multiple evidence sentences (Thorne et al., 2018, FEVER) or are skewed towards statements about quantities (Schuster et al., 2021, VitaminC). In recent work, Petroni et al. (2023) looked at the attribution task in the context of Wikipedia citations, but only at the coarse level of finding a better supporting document.
Hypothesis Decomposition In summarization, the Pyramid method (Nenkova and Passonneau, 2004) and its recent automated variants (Shapira et al., 2019; Zhang and Bansal, 2021; Liu et al., 2023) decompose a summary into semantic content units, but this line of work is primarily aimed at understanding what content is covered rather than the reliability of that content. More recent frameworks have looked at breaking statements down into propositions (Stanovsky et al., 2018; Ernst et al., 2021; Chen et al., 2022, 2023); our approach is similar but does not rely on supervised judgments and is not restricted to token extraction. Also, purely extractive methods are less suitable for use with off-the-shelf entailment models compared to Claim-Split. Work on factuality in summarization has looked at entailment of sub-sentence units like dependency arcs (Goyal and Durrett, 2020, 2021) or using question-answer pairs to isolate specific pieces of information (Wang et al., 2020; Durmus et al., 2020; Scialom et al., 2021). However, these and related frameworks like QA-SRL (He et al., 2015) are too fine-grained for our annotation scheme.

Conclusion
We collect WICE, a new NLI dataset constructed from Wikipedia. By comparing sentences in Wikipedia against cited evidence documents, we find a rich set of real-world entailment phenomena distinct from those in prior NLI datasets. We also show that decomposing complex claims into subclaims can be a valuable pre-processing step for both annotation and entailment prediction.

Limitations
Scope and Diversity of the Dataset Although we propose WICE to evaluate and improve models for real-world entailment, this dataset only includes claims in English Wikipedia articles and evidence in the cited websites. We observe that WICE includes diverse claims and evidence, but there are many types of real-world claims that are very different from claims in WICE in both style and content, such as political claims and social media posts.
Furthermore, almost all recent language models are pre-trained on Wikipedia articles. As a result, our dataset cannot evaluate truly zero-shot performance, given that models have been exposed to this text before. However, note that pre-training does not necessarily enable a model to know whether a particular fact on Wikipedia is supported by a particular document; we believe that many unsupported claims in our dataset are true, just not supported by the particular evidence documents. The fact that all models in our experiments are pre-trained on Wikipedia, yet do not all perform uniformly well, supports this point. Developing datasets based on brand-new texts is a promising direction for future work, in order to evaluate performance in a truly zero-shot condition.

Baseline Models
The experiments in this paper are mainly conducted on T5-3B, which is smaller than recent large language models (LLMs). Although we evaluate GPT-3.5 and GPT-4 on the oracle retrieval dataset in Section 3.4, we do not evaluate LLMs in the full-pipeline experiments (retrieving supporting sentences from evidence articles for entailment classification, or feeding whole evidence articles to the models). Nevertheless, our dataset can provide a realistic testbed for experiments evaluating the ability of LLMs on long documents.
Context for Claims Although many existing NLI datasets target short and independent claims and evidence, claims and evidence in the WICE dataset are in-context with their surrounding text. We experimentally show that the context of evidence sentences is useful for supporting sentence retrieval. However, our experiments do not show that providing the context of claims improves entailment evaluation performance on the WICE dataset, in spite of our observation that some claims require anaphora resolution. We hypothesize that this can be attributed to the nature of our dataset, where only relevant, cited evidence documents are used. Context would be much more important if, in a case like Figure 2, we were attempting to substantiate these claims based on evidence documents discussing different altars or different frescoes. These mismatches would likely be more prevalent if we used automatic retrieval to find relevant documents. Instead, the linked documents from Wikipedia that form the basis of our dataset are implicitly about the same entities as the claims, so our models do not need to understand the context as thoroughly to evaluate the entailment relationships.

A Human Evaluation of Claim-Split
In Section 2.1, we evaluated the subclaims obtained using Claim-Split for the completeness and correctness criteria. Figure 9 shows the instructions provided to the crowd annotators for this task; we include questions (2) and (4) to improve and check the annotation quality, but we do not use the answers to these questions in the analysis.

We manually characterized the Claim-Split errors in 30 development examples of WICE; these statistics are shown in Table 10. Some of the mistakes are relatively simple, and we expect that better prompts (few-shot examples) can fix them. For example, we observe that 30% of the mistakes are caused by removing the first or intermediate clauses. In the following example, "Before they established themselves in the upper echelon of women's tennis" is missing in the decomposed sentences.
Original Sentence: Before they established themselves in the upper echelon of women's tennis, Dominique Van Roost was the only player in Belgian history to be ranked in the top ten of the ATP or WTA rankings, a mark she did not achieve until 1998 after Clijsters and Henin turned professional.

Decomposed Sentences:
• Dominique Van Roost was the only player in Belgian history to be ranked in the top ten of the ATP or WTA rankings.
• Dominique Van Roost achieved this mark in 1998.
• Clijsters and Henin turned professional before Van Roost achieved this mark.
Another relatively simple mistake is removing parentheses (6.7%). In the following example, the decomposed sentences ignore "(Center for Predictive Engineering and Computational Sciences)" in the original sentence.
Original Sentence: In 2009 he was appointed deputy director of the PECOS center (Center for Predictive Engineering and Computational Sciences) at the University of Texas.

Decomposed Sentences:
• In 2009 he was appointed deputy director of the PECOS center.
• The PECOS center is at the University of Texas.
We expect that the above mistakes, as well as "Missing And" mistakes that ignore some words or clauses connected by "and", could be solved by changing the prompt to include examples featuring these formats.
However, we also find errors that would be difficult to solve. For example, in the following case, the fact that Howard lives with his wife Cerys is missing, although the decomposed sentence "Howard's wife Cerys is a doctor" mentions his wife. Claim-Split with our prompt sometimes makes mistakes when multiple decomposed sentences should be generated for a specific span in the original sentence; for "his wife Cerys", decomposed sentences about two facts should be generated: "Howard lives with his wife Cerys" and "Howard's wife Cerys is a doctor".
Original Sentence: Howard lives in Camden, London with his wife Cerys, a doctor, and their dog, a Jack Russell Terrier named Archie.
Decomposed Sentences:
• Howard lives in Camden, London.
• Howard's wife Cerys is a doctor.
• Howard and Cerys have a dog named Archie.
• Archie is a Jack Russell Terrier.
Another challenging type of mistake is caused by over-splitting. In the following example, the information conveyed by "after" is lost as a result of decomposition. In this case, one candidate for a correct decomposition is to keep the latter two facts together as "Shine Limited was set up by former BSkyB executive Elisabeth Murdoch after she quit as broadcaster".
Original Sentence: The production company that was selected was Shine Limited, which was set up by former BSkyB executive Elisabeth Murdoch after she quit as broadcaster.
Decomposed Sentences:
• The production company that was selected was Shine Limited.
• Shine Limited was set up by former BSkyB executive Elisabeth Murdoch.
• Elisabeth Murdoch quit as broadcaster.
We note that we have manually fixed these mistakes in the dev and test set of the WICE dataset.

B Additional Data Collection Details
Base Dataset As mentioned in Section 2.3, we use the same article-claim pairs from Wikipedia as the SIDE dataset (Petroni et al., 2023), but we do not use their annotations or any other aspects of their pipeline. We re-retrieve the citations from Wikipedia directly because SIDE only contains one supporting evidence document even when there are multiple in the raw data. For our dataset, we use the August 2019 versions of both Wikipedia and Common Crawl. We automatically parse the cited articles' HTML to extract the article text. This process is often quite noisy and may include extraneous sentences like "Click for more" that are not part of the main article body; however, we included them in WICE because real-world entailment classification often requires dealing with noisy data.
Note that for claim sentences with multiple citations, we only include claims with 1 or 2 citations positioned at the end of the sentence. Cases with larger numbers of citations are infrequent (approximately 8.1%) and typically represent either lists or multiple articles all in support of the same base fact.
Additional Filtering In Section 2.3, we outlined additional filtering using a RoBERTa-Large model to filter out trivially entailed claims, i.e., those for which all subclaims are predicted as entailed. Here, we provide more details of our process.
We use a pre-trained model fine-tuned on the DocNLI dataset, provided by the authors of the dataset. To deal with the long WICE evidence documents, we split them into chunks of fewer than 256 tokens each. We predict entailment scores for each chunk-subclaim pair using the NLI model. For aggregation across chunks, we classify a subclaim as entailed if it is classified as entailed by at least one chunk.
Task Interface Figure 4 shows our annotation interface. The left panel shows the evidence articles, which are split into sentences and numbered. The right panel shows the claim along with its preceding context. In the bottom half of the right panel, the subclaims derived using Claim-Split are shown; all annotation is performed on these.
Each HIT includes annotation of one claim, i.e., 2-6 subclaims. The median work time for each HIT was about 5 minutes ($9/h). For each subclaim, the annotators first select the entailment classification label and, if applicable, the supporting sentences. If they select "Partially Supported / Not Supported" in the first step, the annotation interface for non-supported tokens is shown to them (see Figure 5). For these cases, we consider a subclaim as NOT-SUPPORTED if the annotator highlights all subclaim tokens, and PARTIALLY-SUPPORTED otherwise. The "Confirm" button allows annotators to move to the next subclaim.
We build our annotation interface based on the FALTE annotation tool (Goyal et al., 2022).

Filtering Steps
We perform filtering at different stages of our dataset collection process, e.g., during retrieval of evidence articles from Common Crawl, cleaning of the base dataset, and filtering using the NLI model. Table 11 shows the number of data points removed at each step of the filtering process for the development set.
Aggregation of subclaim-level labels from different workers For entailment classification, we take a majority vote among worker labels. If no majority exists but 2 annotators each select PARTIALLY-SUPPORTED and NOT-SUPPORTED in the dev or test set, we choose PARTIALLY-SUPPORTED as the final label. In all other scenarios, we remove the subclaim (and the corresponding claim c) from our dataset, as these cases tend to be quite subjective. This filters out 12.5% of the claims in the development set.
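A sketch of this aggregation rule (an illustrative helper; returning None signals that the subclaim and its claim should be dropped):

```python
from collections import Counter

def aggregate_worker_labels(labels, is_dev_or_test):
    """Majority vote over worker labels, with the dev/test tie-break rule."""
    counts = Counter(labels)
    label, top = counts.most_common(1)[0]
    if top > len(labels) / 2:
        return label
    # Dev/test tie-break: 2 PARTIALLY-SUPPORTED vs. 2 NOT-SUPPORTED (5 workers).
    if (is_dev_or_test and counts["PARTIALLY-SUPPORTED"] == 2
            and counts["NOT-SUPPORTED"] == 2):
        return "PARTIALLY-SUPPORTED"
    return None  # subjective case: drop the subclaim and its claim
```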
For supporting evidence sentences, we retain the individual sets of supporting sentences from all workers who chose the majority entailment label (the label selected in the above step). We prefer this over a union, as there can be multiple different sets of sentences with identical information (e.g., the date can sometimes be ascertained from several different sentences). We found that this subset of workers chose the exact same set of supporting sentences for 56.1% of SUPPORTED cases and 34.4% of PARTIALLY-SUPPORTED cases, which shows high inter-annotator agreement.
To aggregate unsupported tokens in subclaims, we take a token-level union over all workers who chose PARTIALLY-SUPPORTED as the entailment label. For this task, we remove data points if any annotator disagrees with the final set of tokens by more than three tokens (25.3% of PARTIALLY-SUPPORTED subclaims in the development set).

C Additional Dataset Statistics and Analysis
Supporting Sentences Figure 6 shows the distribution of the number of supporting sentences for PARTIALLY-SUPPORTED, SUPPORTED, and the combined set in WICE. Supported and partially supported subclaims have almost the same number of supporting sentences annotated (1.9 on average), but supported claims have more supporting sentences annotated than partially supported claims (averages of 3.4 and 2.9, respectively).
Non-Supported Tokens After filtering out data points with low inter-annotator agreement on the annotation of non-supported tokens, there are 374 partially supported subclaims in the training data, with an average of 12.8 tokens per subclaim (Table 12). Partially supported subclaims have 3.3 non-supported tokens (25.2% of tokens in subclaims) on average in the development set.
Word Overlap Figure 7 shows the word overlap between claims and evidence (the recall of claim bigrams that also appear in the evidence) for WICE, VitaminC, and FEVER. Although VitaminC's claims were manually created so that this overlap is low, we observe that the real claims in WICE have competitively low claim-evidence overlap. Furthermore, examples corresponding to different entailment labels have similar word overlaps, especially at the subclaim level. This suggests that WICE does not suffer from spurious biases.
Analysis of Phenomena In Section 2.5, we categorized the entailment relationships between claims and evidence into several categories. We provide examples for each category in Table 13. Note that the FEVER example, "a character is a person", represents a typical type of inference required in this dataset: simple hypernymy. Additionally, we characterize the distribution of semantic roles that form each claim. We use a taxonomy drawn from Davidsonian event semantics (Truswell, 2019) and semantic roles to characterize what each claim is about: this consists of events, properties (attributes describing participants in an event; these include name, occupation, nationality, quantity, and ordinal), location, time, reason (why something happened), manner, and evidentials. Each claim may carry multiple labels, and subclaims consist of not necessarily disjoint subsets of these categories, e.g., "Jones buttered his toast slowly" (event, manner) and "Jones buttered his toast in the bathroom" (event, location). Table 14 shows this distribution.

D Claim-Split
This section provides further details and results of Claim-Split.

The prompt for WICE ends with the following template:

Sentence: <input claim>
Facts:

For the experiments in Section 4 on VitaminC, PAWS, and FRANK, we use the following instruction with three or four examples to generate subclaims that are suitable for off-the-shelf evaluation by entailment classification models: "Please decompose the following sentence into decontextualized sentences while ensuring all information is retained and the wording is as unchanged as possible (please return the original sentence if it cannot be decomposed): ..." Full prompts with few-shot examples are provided in our GitHub repository.

F Effect of Training Data Size on Entailment Classification Performance
Figure 8 shows the performance of T5-3B fine-tuned on different amounts of training data from WICE. This result suggests that performance is saturating under the training procedure and models used in this paper, or that much larger amounts of training data are needed to improve performance further.

G Oracle Retrieval Dataset
WICE is designed for evaluating entailment classification on long evidence articles, so it requires supporting sentence retrieval as a first step for language models that cannot process very long inputs. To evaluate only the entailment classification capability of language models, we construct the oracle retrieval dataset, which simulates a situation in which we have an ideal retrieval model. This dataset provides a gold set of supporting sentences as input to NLI models and is used in the experiments in Sections 3.3 and 3.4. The oracle retrieval dataset consists of oracle chunks that include all sentences in a gold set of supporting sentences. We note that there can be multiple oracle chunks for each claim/subclaim because different annotators can annotate different sets of supporting sentences, each of which includes sufficient information to support (or partially support) the claim/subclaim.
To avoid biases caused by the number of supporting sentences (e.g., supported claims may have a larger number of supporting sentences than partially supported claims), we add randomly selected sentences from the evidence article to the oracle chunks. Specifically, we add randomly selected sentences until the size of the chunks reaches 256 tokens (the chunks are equal to or shorter than 256 tokens).
For non-supported cases, which do not have any gold supporting sentences, we provide chunks as in the MAX setting in Section 3.1. The chunks in this setting also include at most 256 tokens in most cases. Finally, to avoid biases caused by the number of oracle chunks, we randomly select three oracle chunks for each claim/subclaim: a straightforward way of performing entailment classification on the oracle retrieval dataset is to evaluate the entailment score for every oracle chunk and take the maximum, so claims/subclaims with a larger number of oracle chunks would be likely to receive higher entailment scores. Making every claim/subclaim have the same number of oracle chunks avoids this bias.
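A sketch of the oracle-chunk construction (count_tokens is an assumed tokenizer-based helper; the exact padding procedure is an illustrative reading of the description above):

```python
import random

def build_oracle_chunk(evidence_sentences, gold_set, count_tokens,
                       max_tokens=256, seed=0):
    """Pad a gold set of supporting sentences with randomly selected evidence
    sentences until the chunk reaches max_tokens; return sentence indices in
    document order."""
    rng = random.Random(seed)
    chunk = set(gold_set)
    total = sum(count_tokens(evidence_sentences[i]) for i in chunk)
    candidates = [i for i in range(len(evidence_sentences)) if i not in chunk]
    rng.shuffle(candidates)
    for i in candidates:
        n = count_tokens(evidence_sentences[i])
        if total + n > max_tokens:
            continue  # skip; a shorter sentence may still fit
        chunk.add(i)
        total += n
    return sorted(chunk)
```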

H Entailment Classification by GPT
We provide details of the entailment classification experiment on GPT-3.5 and GPT-4 in Section 3.4.
Models We use gpt-3.5-turbo-0613 and gpt-4-0613 for this experiment.

Dataset We use the first 100 claims/subclaims of the oracle retrieval dataset (Appendix G) for this experiment.
Prompts We provide the prompt used in this experiment below. We use XML as the output format, as in Das et al. (2023), to make post-processing easy. Our prompt includes three examples from the development set of WICE, but here we omit two of the examples and the evidence sentences in the first example. Our GitHub repository includes the full prompt.
Your task is to evaluate if a claim is supported by a provided evidence article snippet.
You need to present your explanation first, and then choose your conclusion from the options [supported, partially_supported, not_supported].
We provide several examples. Your response must be in the same format as the XML in the examples.

I Datasheet for WICE
In this section, we provide a datasheet (Gebru et al., 2021) for the WICE dataset.

I.1 Motivation for Datasheet Creation
The information regarding the individuals or organizations who created or funded the dataset will be included in the camera-ready version.
For what purpose was the dataset created? There are some major challenges when applying modern entailment models to measure real-world attribution and factual consistency. Specifically, existing natural language inference (NLI) models and datasets target relatively short claims and evidence, negative examples are often artificially created, and fine-grained labels have not been studied well. We created the WICE dataset to address these limitations in existing NLI datasets.

I.2 Dataset Composition
What are the instances? Each instance in WICE is a group of subclaims derived from a claim (a sentence in a Wikipedia article) and evidence (cited websites).
Is there a label or target associated with each instance? The annotation for each subclaim includes the entailment label (supported, partially-supported, or not-supported), supporting sentences (a subset of evidence sentences that support or partially support the subclaim), and non-supported tokens (tokens in the subclaim that are not supported by the evidence).
How many instances are there? WICE includes 1,260, 349, and 358 claims in the train, development, and test data, which are decomposed into 3,470, 949, and 958 subclaims, respectively. Detailed dataset statistics are provided in Table 1.

Does the dataset contain all possible instances or is it a sample of instances from a larger set? The claims in WICE are a subset of sentences in Wikipedia articles. The sentences are randomly selected from those used in the SIDE dataset (Petroni et al., 2023).
Is the dataset self-contained? Yes, all resources are included in our release.

I.3 Data Collection Process
How was the data associated with each instance acquired? We acquired claims from Wikipedia and evidence from the web articles cited by Wikipedia (retrieved from Common Crawl), and collected annotations from Mechanical Turk workers (Sections 2.3 and 2.4).

Figure 2: Claim-Split automatically breaks claims into subclaims using an LLM (GPT-3.5 in this work).

The annotation instructions for evaluating Claim-Split (Figure 9) include three examples: one correct split of a claim into subclaims, one where the subclaims fail the completeness criterion, and one where they fail the correctness criterion. Given these instructions, each HIT asks annotators to verify 4 claims (and their corresponding subclaims from Claim-Split). For each tuple of context, claim, and decomposed subclaims, we ask the following questions: (1) Do all decomposed sentences correctly convey the information in the original sentence? [Yes/No]. (2) If you selected No: which decomposed texts include mistakes? [Free Text]. (3) Do the decomposed sentences cover ALL information in the original sentence? [Yes/No]. (4) If you selected No: what information is missing? [Free Text].

Figure 6: Distribution of the number of supporting sentences for each claim and subclaim in the development set of WICE. The figure shows this distribution for PARTIALLY-SUPPORTED, SUPPORTED, and the combined set.

Figure 7: Word overlap (recall of bigrams) between claims and evidence.

Figure 8: Relation between training set size and performance on the WICE entailment classification task for T5-3B.

Table 1: Statistics of the WICE dataset.

Table 4: Off-the-shelf binary entailment classification performance of existing NLI models on WICE using the MAX strategy to combine local entailment scores. We observe the best performance using T5 models trained on ANLI, although there remains a gap between these and human performance. †: Worse than T5-3B trained on ANLI (chunk-level) with p-value < 0.05 according to a paired bootstrap test.

Table 9: Comparison of AUROC scores for the claim-level entailment classification task using the standard and "w/ Claim-Split" methods. Table 15 includes results in F1 and accuracy. *: improvement over the standard method is statistically significant with p-value < 0.05 according to a paired bootstrap test.

Table 10: Error analysis of Claim-Split on WICE. We annotated 30 mistakes in the development set. Each example can be assigned more than one category or left uncategorized (33.0%).

Table 11: Statistics of the filtering of the development set. The original size is 4,545 and the final size is 739. Note that we did not annotate all of these data points; the number of claims in the development set is 349.

Table 13: Examples for the categories of entailment classification problems in Table 3.

Table 14: Estimated distribution of subclaim types in WICE. Each subclaim may have multiple properties.

Table 16: Publicly available models on the Hugging Face Hub used in our experiments in Section 3. VitaminC models are provided by Schuster et al. (2021). These models are also used in experiments in Laban et al. (2022).