PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition

The widely studied task of Natural Language Inference (NLI) requires a system to recognize whether one piece of text is textually entailed by another, i.e. whether the entirety of its meaning can be inferred from the other. In current NLI datasets and models, textual entailment relations are typically defined on the sentence or paragraph level. However, even a simple sentence often contains multiple propositions, i.e. distinct units of meaning conveyed by the sentence. As these propositions can carry different truth values in the context of a given premise, we argue for the need to recognize the textual entailment relation of each proposition in a sentence individually. We propose PropSegmEnt, a corpus of over 45K propositions annotated by expert human raters. Our dataset structure corresponds to the tasks of (1) segmenting sentences within a document into the set of propositions, and (2) classifying the entailment relation of each proposition with respect to a different yet topically-aligned document, i.e. a document describing the same event or entity. We establish strong baselines for the segmentation and entailment tasks. Through case studies on summary hallucination detection and document-level NLI, we demonstrate that our conceptual framework is potentially useful for understanding and explaining the compositionality of NLI labels.


Introduction
Natural Language Inference (NLI), or Recognizing Textual Entailment (RTE), is the task of determining whether the meaning of one text expression can be inferred from another (Dagan and Glickman, 2004). Given two pieces of text (P, H), we say the premise P entails the hypothesis H if the entirety of H's meaning can most likely be inferred true after a human reads P. If some units of meaning in H are contradicted by, or cannot be determined from, P, we describe the relation between the two as contradiction or neutral (de Marneffe et al., 2008) respectively. This fundamentally challenging natural language understanding task provides a general interface for semantic inference and comparison across different sources of textual information.

Table 1: An example instance from the PROPSEGMENT dataset with propositions (marked as token subsets highlighted in blue) and their entailment labels.
In reality, most naturally occurring text expressions are composed of a variable number of propositions, i.e. distinct units of meaning conveyed by the piece of text. Consider the sentence shown in Table 1: "The Andy Warhol Museum in his hometown, Pittsburgh, Pennsylvania, contains an extensive permanent collection of art." Despite being relatively compact, the sentence still contains (at least) three propositions, as listed in Table 1. While the entire hypothesis would be classified as neutral, or not entailed by the premise, one of its propositions, "Andy Warhol's hometown is in Pittsburgh, Pennsylvania", is in fact entailed by the premise, while the premise provides no support for the other two propositions. This phenomenon, namely partial entailment (Levy et al., 2013), is a blind spot for existing sentence- or paragraph-level NLI formulations. When a hypothesis is compositional, NLI labels coarsely defined at the sentence/paragraph level cannot distinguish partial entailment from non-entailment cases.
This work argues for the need to study and model textual entailment relations at the level of propositions. As NLI tasks and applications typically involve different genres of text with variable lengths and numbers of propositions (Yin et al., 2021), decomposing textual entailment to the propositional level provides a finer-grained yet accurate description of the entailment relation between two arbitrary text expressions. Modeling propositional textual entailment provides a more unified inference format across NLI tasks, and could potentially improve the generalization capabilities of NLI models, e.g. with respect to variability in input lengths (Schuster et al., 2022).
We propose PROPSEGMENT, a multi-domain corpus with over 45K human-annotated propositions. We define the tasks of proposition-level segmentation and entailment: given a hypothesis sentence and a premise document, a system is expected to segment the hypothesis into its set of propositions, and recognize whether each proposition can be inferred from the premise.
Interestingly, we observe that existing notions of proposition adopted by Open Information Extraction (OpenIE) or Semantic Role Labeling (SRL) (Baker et al., 1998; Kingsbury and Palmer, 2002; Meyers et al., 2004) often fail to account for the complete set of propositions in a sentence, partly because predicates and arguments in different propositions do not necessarily follow the same granularity (§2). We therefore adopt a more flexible and unified way of representing a proposition as a subset of tokens from the input sentence, without explicitly annotating the semantic roles or predicate-argument structure within the proposition, as illustrated in Table 1. We discuss the motivation and design desiderata in §2.
We construct PROPSEGMENT by sampling clusters of topically-aligned documents, i.e. documents focusing on the same entity or event, from WIKIPEDIA (Schuster et al., 2022) and the news domain (Gu et al., 2020). We train and instruct expert annotators to exhaustively identify all propositions in a document, and to label the textual entailment relation of each proposition with respect to another document in the cluster, viewed as the premise.
We discuss the modeling challenges, and establish strong baselines for the segmentation and entailment tasks. We demonstrate the utility of our dataset and models through downstream case studies on summary hallucination detection (Maynez et al., 2020) and DocNLI (Yin et al., 2021), through which we show that recognizing and decomposing entailment relations at the proposition level can provide fine-grained characterization and explanation for NLI-like tasks, especially with long and compositional hypotheses.
In summary, the main contributions of our paper include: (1) motivating the need to recognize textual entailment relations at the proposition level; (2) introducing the first large-scale dataset for studying proposition-level segmentation and entailment recognition; and (3) leveraging PROPSEGMENT to train Seq2Seq models as strong baselines for the tasks, and demonstrating their utility in document-level NLI and hallucination detection tasks.

Motivations & Design Challenges
Our study concerns the challenges of applying NLI/RTE task formulations and systems in real-world downstream applications and settings. As textual entailment describes the relation between the meanings of two text expressions, one natural type of downstream use case for NLI systems is to identify alignments and discrepancies between the semantic content presented in different documents/sources (Kryscinski et al., 2020; Schuster et al., 2021; Chen et al., 2022).
Our study is motivated by the task of comparing the content of topically-related documents, e.g. news documents covering the same event (Gu et al., 2020), or Wikipedia pages in different languages for similar entities (Schuster et al., 2022). As existing NLI datasets typically define textual entailment relations at the sentence or paragraph level (Bowman et al., 2015; Williams et al., 2018), NLI systems trained on such resources can only recognize whether or not the entirety of a hypothesis sentence/paragraph is entailed by a premise. However, we estimate that, in these two domains, around 90% of the sentences that convey any informational propositions contain more than one proposition (Figure 1). In the presence of multiple propositions, partial entailment (Levy et al., 2013) describes the phenomenon where only a subset of the propositions in the hypothesis is entailed by the premise.
Partial entailment is 3× more common than full-sentence entailment. In our corpus, we observe that, given two topically related documents from news or Wikipedia, 46% of the sentences in one document have at least some information supported by the other document (Figure 2). But 74% of such sentences are only partially entailed, with only some propositions supported by the other document. In this sense, a sentence-level NLI model can only detect about a quarter of the sentences that have meaningful entailment relations. In applications that seek a full understanding of cross-document semantic links, there is thus 4× headroom, a significant blind spot for sentence-level NLI models.
As we observe that most natural sentences are compositional, i.e. contain more than one proposition, we argue for the need to decompose and recognize textual entailment relations at the more granular level of propositions. In other words, instead of assessing the entire hypothesis as one unit in the context of a premise, we propose to evaluate the truth value of each proposition individually, and to aggregate these into the truth value of the hypothesis.
Current predicate-argument based methods often fail to extract all propositions in a sentence. The linguistic notion of a proposition refers to a single, contextualized unit of meaning conveyed in a sentence. In the NLP community, propositions are usually represented by the predicate-argument structure of a sentence. For example, resources like FrameNet (Baker et al., 1998), PropBank (Palmer et al., 2005), and NomBank (Meyers et al., 2004), among others, represent a proposition by a predicate (verbal, nominal, etc.) with arguments filling its thematic proto-roles. Such resources facilitate the development of SRL systems (Palmer et al., 2010) for proposition extraction, with a closed, predefined set of proto-roles. To increase the coverage of extracted propositions, OpenIE formulations (Etzioni et al., 2008; Del Corro and Gemulla, 2013; Cui et al., 2018) were proposed to forgo the limits of fixed semantic roles and account for both explicit and implicit predicates. However, we observe that OpenIE systems often fail to account for the complete set of propositions in a sentence. In many cases, e.g. the Andy Warhol hometown example in Table 1, the arguments of a proposition might not follow the same granularity as the ones in the sentence, e.g. Andy Warhol vs. Andy Warhol Museum. Also, as OpenIE triples are still defined on direct predicate-argument relations, they often fail to produce a decontextualized (Choi et al., 2021) view of a proposition. For example, an OpenIE system would recognize the possessive relation "he has a hometown", but fail to resolve the references he → Andy Warhol and hometown → Pittsburgh.
Furthermore, Gashteovski et al. (2020) and Fatahi Bayat et al. (2022) observe that neural OpenIE systems tend to extract long arguments that could potentially be decomposed into more compact propositions. For textual entailment, we argue for the need to extract the complete set of propositions in their most compact form, since their truth values can vary individually.
To illustrate the difference between OpenIE and our approach, we offer a list of example propositions from our proposed PROPSEGMENT dataset, and compare them to extractions from rule-based and neural OpenIE systems, in Appendix D.

PROPSEGMENT Dataset
We propose PROPSEGMENT, a large-scale dataset featuring clusters of topically similar news and Wikipedia documents, with human-annotated propositions and entailment labels.

Task Definitions
We formulate the task of recognizing propositional textual entailment as two sub-tasks (Fig. 3). Given a hypothesis sentence and a premise document, a system is expected to (1) identify all the propositions within the hypothesis sentence, and (2) classify the textual entailment relation of each proposition with respect to the premise document.
T1: Propositional Segmentation Given a sentence S with tokens [t_0, t_1, ..., t_l] from a document D, a system is expected to identify the set of propositions P ⊆ 2^S, where each proposition p ∈ P is represented by a unique subset of tokens in sentence S. In other words, each proposition can be represented in a sequence labeling format, per the example from Table 1. Each proposition is expected (1) to correspond to a distinct fact that a reader learns directly from reading the given sentence, (2) to include all tokens within the sentence that are relevant to learning this fact, and (3) not to be equivalent to a conjunction of other propositions. We opt for this format as it does not require explicit annotation of the predicate-argument structure. This allows for more expressive power for propositions with implied or implicit predicates (Stern and Dagan, 2014). Also, representing each proposition as a separate sequence effectively accounts for cases with shared predicate or argument spans, and makes evaluation more readily accessible. Since propositions, as we demonstrated earlier, do not necessarily have a unique and identifiable predicate word in the sentence, the typical inference strategy in SRL or OpenIE, which first extracts the set of predicates and then identifies the arguments with respect to each predicate, would not work in this case. For this reason, given an input sentence, we expect a model on the task to directly output all propositions. In such a one-to-set prediction setting, the output propositions of the model are evaluated as an unordered set.
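As a concrete illustration of this token-subset format, the sketch below represents each proposition as a set of token indices and renders it as a 0/1 sequence-labeling mask. The example sentence and index sets are invented for illustration and are not drawn from the dataset.

```python
def proposition_to_tags(tokens, prop_indices):
    """Render one proposition (a subset of token indices) as a binary
    sequence-labeling mask over the sentence, as in Table 1."""
    return [1 if i in prop_indices else 0 for i in range(len(tokens))]

tokens = "The Andy Warhol Museum is in Pittsburgh".split()
# A sentence's propositions form a set of unique token subsets (P ⊆ 2^S).
propositions = [
    frozenset({1, 2, 4, 5, 6}),         # hypothetical: "Andy Warhol is in Pittsburgh"
    frozenset({0, 1, 2, 3, 4, 5, 6}),   # hypothetical: the museum-location fact
]
masks = [proposition_to_tags(tokens, p) for p in propositions]
```

Note that this representation places no constraint on predicate-argument structure: a proposition is simply an unordered subset of the sentence's tokens.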
T2: Propositional Entailment Given a hypothesis proposition p from document D_hyp and a whole premise document D_prem, a system is expected to classify whether the premise entails the proposition, i.e. whether the information conveyed by the proposition can be inferred true from the premise.

Dataset Construction
We sample 250 document clusters from both the Wiki Clusters (Schuster et al., 2022) and NewSHead (Gu et al., 2020) datasets. Each cluster contains the first 10 sentences of three documents: either news articles on the same event, or Wikipedia pages in different languages (machine-translated into English) on the same entity. For each sentence, we train and instruct three human raters to annotate the set of propositions, each represented by a unique subset of tokens from the sentence. Conceptually, we instruct raters to include all the words that (1) pertain to the content of a proposition, and (2) are explicitly present in the sentence. For example, if there is no predicate word for a proposition in the sentence, raters only include the corresponding arguments. Referents present within the sentence are included in addition to pronominal and nominal references. We provide a more detailed description of our rater guidelines, and of how propositions are defined with respect to various linguistic phenomena, in Appendix B.
Given the three sets of propositions from the three raters for a sentence, we reconcile and select the one of the three raters' responses with the highest number of propositions that the other raters also annotate. Since the exact selection of tokens used to mark a proposition may vary across raters, we allow for fuzziness when measuring the match between two propositions. Following FitzGerald et al. (2018) and Roit et al. (2020), we use the Jaccard similarity between the two sets of selected tokens to measure the similarity between two propositions. We say two propositions match if their Jaccard similarity is greater than or equal to a threshold θ = 0.8, and align two raters' responses using unweighted bipartite matching between propositions satisfying the Jaccard threshold.
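The reconciliation step described above can be sketched in pure Python as follows. The dataset's exact implementation is not published here, so this is a minimal version, assuming propositions are token-index sets: an edge is drawn wherever Jaccard similarity meets θ = 0.8, then a maximum unweighted bipartite matching is found via augmenting paths.

```python
THETA = 0.8

def jaccard(a, b):
    """Jaccard similarity between two sets of token indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def match_propositions(props_a, props_b, theta=THETA):
    """Maximum unweighted bipartite matching (augmenting paths) between two
    raters' proposition sets, with an edge wherever Jaccard >= theta.
    Returns (number matched, match_b) where match_b[j] is the index in
    props_a matched to props_b[j], or -1."""
    adj = [[j for j, q in enumerate(props_b) if jaccard(p, q) >= theta]
           for p in props_a]
    match_b = [-1] * len(props_b)

    def augment(i, seen):
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                if match_b[j] == -1 or augment(match_b[j], seen):
                    match_b[j] = i
                    return True
        return False

    matched = sum(augment(i, set()) for i in range(len(props_a)))
    return matched, match_b
```

The same matched count can then be used to score the coverage of one rater's set by another, as in the agreement metric of §3.3.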
Next, for all propositions in a document, we sample one other document from the document cluster as the premise, and ask three raters to label the textual entailment relation between each proposition and the premise, i.e. one of {Entailment, Neutral, Contradiction}. We take the majority vote of the three as the gold entailment label. Interestingly, we observe that only 0.2% of all annotated labels from the raters are contradictions. We speculate that the low presence of contradictions can in part be attributed to the difficulty of establishing reference determinacy (Bowman et al., 2015) between the premise and hypothesis; we discuss more details in Appendix C. For this reason, we consider only the two-way label set {Entailment, Non-Entailment} for the entailment task evaluation.
We create the train/dev/test splits based on clusters, so that the documents in each cluster exclusively belong to only one of the splits. Overall, the dataset features 1,497 documents with ~45K propositions with entailment labels; see Table 2 for more statistics.

Inter-Rater Agreement
For the propositional segmentation task (T1), as the inter-rater agreement involves a set-to-set comparison between the propositions annotated by a pair of raters, we report two different metrics.
First, between each pair of raters, we use the same Jaccard similarity with θ = 0.8 and find the matched set of propositions between the raters with bipartite matching for each example. We measure the coverage of the matched set by either rater with an F1 score, and observe 0.57 F1 among all raters. We use the same metric for model evaluation and human performance estimation, as we will discuss in §5.1. In addition, we measure the token-level agreement on the matched set of propositions among raters with Fleiss' kappa (Fleiss, 1971), i.e. whether raters agree on whether each token should be included in a proposition or not. We observe κ = 0.63, which indicates moderate to substantial agreement among raters.

Propositional Segmentation Baselines
The key challenge of the proposition extraction task lies in its one-to-set structured prediction setting. Our one-to-set prediction format is similar to QA-driven semantic parsing such as QA-SRL (He et al., 2015; Klein et al., 2022), as both involve generating a variable number of units of semantic content with no particular order between them. Since there might not be a unique and identifiable predicate word associated with each proposition, extracting predicates first (e.g. as a sequence tagging task) and then individually producing one proposition for each predicate would not be a sufficient solution in this case.
For this particular one-to-set problem setup, we introduce two classes of baseline models.
Seq2Seq: T5 (Raffel et al., 2020) When an output set is formatted as a sequence, Seq2Seq models have been found to be a strong method for tasks with set outputs, as they use the chain rule to efficiently model the joint probability of the outputs (Vinyals et al., 2016). The obvious caveat of representing set outputs as sequences is that we need an ordering for the outputs. Having a consistent ordering helps a seq2seq model learn to maintain the output set structure (Vinyals et al., 2016), and the best ordering scheme is often both model- and task-specific (Klein et al., 2022). In our experiments, we observe that sorting the propositions by the appearance order of their tokens in the sentence, i.e. the position of the foremost token of each proposition, yields the best performance.
We start from the pretrained T5 1.1 checkpoints from the T5x library (Roberts et al., 2022). Given a sentence as input, we finetune the T5 model to output all propositions in a single sequence. For each input sentence, we sort the output propositions using the aforementioned ordering scheme, and join them with a special separator token. In addition, we evaluate the setting where the model is also given the premise document D_prem, and learns to output the entailment label along with each proposition (T5 w/ Entail. in Table 3).
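The target linearization described above can be sketched as follows. The identity of the special separator token is not specified here, so `[PROP]` is a hypothetical placeholder; propositions are assumed to be sets of token indices.

```python
SEP = " [PROP] "  # hypothetical separator; the actual special token is unspecified

def linearize_propositions(tokens, propositions):
    """Sort propositions by the position of their foremost token in the
    sentence, then join each proposition's tokens into one target string
    for seq2seq training."""
    ordered = sorted(propositions, key=min)  # min index = foremost token
    return SEP.join(" ".join(tokens[i] for i in sorted(p)) for p in ordered)
```

Because the ordering is a deterministic function of the input sentence, the model sees a consistent target structure across training examples, which is what Vinyals et al. (2016) found helpful for set-valued outputs.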
Encoder+Tagger: BERT (Devlin et al., 2019) For comparison, we provide a simpler baseline that does not model the joint probability of the output propositions. On top of the last layer of an encoder model, i.e. BERT, we add k linear layers that each correspond to one output proposition. Given an input sentence, the i-th linear layer produces a binary (0/1) label per token, indicating whether the token is in the i-th proposition or not. k is set to be a sufficiently large number, e.g. k = 20 in our experiments. We use the label of the [CLS] token from the i-th linear layer to indicate whether the i-th proposition should exist in the output, and follow the same ordering of the output propositions as in the seq2seq (T5) baseline setup.

Propositional Entailment Baselines
We formulate the task as a classification problem, and finetune a T5 model as our baseline. The input consists of the hypothesis proposition p with its document context D_hyp, plus the premise document D_prem. The output is one of the three-way labels {Entailment, Neutral, Contradiction}. Due to the low presence of contradictions, we merge the neutral and contradiction outputs from the model as non-entailment during evaluation. Including the document D_hyp of the hypothesis proposition p in the input ensures that the model has access to the essential context information, i.e. a decontextualized view of p, when inferring its textual entailment relation with D_prem.

Evaluation Metrics
Propositional Segmentation We measure the precision and recall between the sets of predicted and gold propositions for a given sentence. As the gold propositions do not follow any particular ordering, we first produce a bipartite matching between the two sets using the Hungarian algorithm (Kuhn, 1955). We again use the Jaccard similarity with threshold θ = 0.8 as a fuzzy match between two propositions (§3.2). We also use exact match, an even more restrictive measure where two propositions match if and only if they have the exact same tokens. We report the macro-averaged precision and recall over sentences in the test set.
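Under the exact-match criterion, the per-sentence precision and recall reduce to set intersection over frozen token sets, since identical propositions match one-to-one; a minimal sketch (the fuzzy Jaccard variant additionally requires the bipartite matching step, omitted here):

```python
def exact_match_prf(pred_props, gold_props):
    """Per-sentence precision/recall/F1 under the strictest criterion:
    two propositions match iff their token sets are identical."""
    pred = {frozenset(p) for p in pred_props}
    gold = {frozenset(g) for g in gold_props}
    hits = len(pred & gold)
    prec = hits / len(pred) if pred else 0.0
    rec = hits / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

The reported numbers are then macro-averages of these per-sentence scores over the test set.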
Propositional Entailment We report the baseline performance under two-way classification in accuracy. We also report the balanced accuracy, i.e. the average of the true positive and true negative rates, due to label imbalance (Table 2). To understand the per-label performance, we also report the F1 score with respect to each of the three-way labels.

Baseline Results
Table 3 shows the evaluation results for the segmentation (T1) and entailment (T2) tasks respectively.
For the segmentation task (T1), the seq2seq T5 setup yields superior performance compared to the simpler encoder+tagger BERT setup. As the encoder+tagger setup predicts each proposition individually, and does not attend to other propositions during inference, the model predicts repeated/redundant propositions in >20% of the input sentences; in the seq2seq T5 setup, the repetition rate is <1%. For both setups, we remove the redundant outputs as a post-processing step. We also evaluate the multi-task setup (i.e. T5 w/ Entail. in Table 3) where the model jointly learns the entailment label with each proposition, and observe no significant improvements. For the entailment task (T2), we see that T5-Large yields the best overall performance, though its performance with respect to the entailment label is lower than for the neutral label.
For both tasks, we estimate the average human expert performance by comparing annotations from three of the authors to the ground truth on 50 randomly sampled examples from the dataset. For the segmentation task T1, we observe that human performance increases after reconciling and selecting the ground truth response (0.57 → 0.67 F1), and that a sizable gap remains between the best model, T5-Large, and human performance. On the entailment task T2, T5-Large exceeds human performance, which is not uncommon among language inference tasks of similar classification formats (Wang et al., 2019).

Document: The incident happened near Dr Gray's Hospital shortly after 10:00. The man was taken to the hospital with what police said were serious but not life-threatening injuries. The A96 was closed in the area for several hours, but it has since reopened.

Summary w/ human-labeled hallucinated spans: A man has been taken to hospital following a one-vehicle crash on the A96 in Aberdeenshire.

[Four predicted propositions over the summary, each with its entailment label, and the predicted hallucinated spans, i.e. the union of spans in non-entailed propositions; the token highlighting is not recoverable in plain text.]

Table 5: An example model-generated summary on the XSum dataset, with human-annotated hallucination spans from Maynez et al. (2020). We show that we can infer the hallucinated spans from the set of four propositions and their entailment labels, predicted by our T5-Large models. More examples can be found in Appendix E.

Cross-Domain Generalization
On the propositional segmentation (T1) task, we evaluate how the best baseline model generalizes across the Wikipedia (Wiki) and News domains. Table 4 shows the results of T5-Large models finetuned on data from each domain, and evaluated on the test splits of both domains.
When applying a model trained on Wiki, we see a larger drop in performance when tested on News, as the News domain features more syntactic and stylistic variations compared to the Wiki domain.

Analysis and Discussion
We exemplify the utility of our propositional segmentation and entailment framework, which we refer to as PropNLI, through the lens of two downstream use cases: summary hallucination detection (§6.1) and document-level NLI with variable-length hypotheses (§6.2).

Application: Hallucination Detection
We first look at the task of summary hallucination detection: given a summary of a source document, identify whether the summary's content is faithful to the document. Entailment-based approaches have been shown to be effective on the task (Kryscinski et al., 2020; Chen et al., 2021). As summaries can be long and compositional, recognizing partial entailment, and identifying which part(s) of a summary are hallucinated, becomes important (Goyal and Durrett, 2020; Laban et al., 2022).
To show that PropNLI can be used for hallucination detection, we experiment on the model-generated summaries of the XSum dataset (Narayan et al., 2018), for which Maynez et al. (2020) provide human annotations of the sets of hallucinated spans (if they exist) in the summaries. Table 5 illustrates our idea. If a proposition in a summary is entailed by the document, then all spans covered by the proposition are faithful. Otherwise, some spans likely contain hallucinated information.
Following this intuition, we first evaluate the performance of our method in a zero-shot setting as a hallucination classifier, i.e. binary classification of whether a summary is hallucinated or not. As a baseline, we use a T5-Large model finetuned on MNLI (Williams et al., 2018) to classify a full summary as entailed (→ faithful) or not (→ hallucinated). As ~89% of the summaries annotated by Maynez et al. (2020) are hallucinated, we again adopt balanced accuracy (§5.1) as the metric.
Next, we study whether the entailment labels of propositions can be composed to detect hallucinated spans in a summary. As in Table 5, we take the union of the spans in non-entailed propositions, and exclude the spans that appear in entailed propositions. The intuition is that hallucinated information likely exists only in the non-entailed propositions, but not the entailed ones.
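This union-and-exclude rule over predicted proposition spans can be sketched as follows, with token indices standing in for spans; this is an illustrative reading of the rule, not the exact implementation.

```python
def hallucinated_tokens(propositions, entailed):
    """propositions: list of token-index sets over the summary;
    entailed: parallel list of booleans from the entailment model.
    A token is predicted hallucinated if it appears in some non-entailed
    proposition and in no entailed one."""
    supported, unsupported = set(), set()
    for prop, is_entailed in zip(propositions, entailed):
        (supported if is_entailed else unsupported).update(prop)
    return unsupported - supported
```

Tokens covered by at least one entailed proposition are treated as faithful even if they also occur in a non-entailed proposition.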
We evaluate hallucinated span detection as a token classification task. For each summary, we evaluate the precision and recall of the faithful and hallucinated sets of predicted tokens respectively against the human-labeled ground truth. We report the macro-averaged precision, recall and F1 score over all 2,500 summaries. We compare our method to a T5-Large model finetuned on MNLI, where we label all tokens as faithful if the summary is predicted to be entailed, and all tokens as hallucinated otherwise. We report the performance with respect to each of the two labels in Table 6. As the MNLI model does not distinguish partial entailment from non-entailment cases, it predicts more tokens to be hallucinated, and thus has low precision and high recall on the hallucinated tokens, and vice versa. Our model, on the other hand, detects the nuance between faithful and hallucinated tokens with good and more balanced performance for both labels. Table 5 shows one example summary and PropNLI's predictions, and we include more examples in Appendix E.

Proposition-Level → Sentence/Paragraph-Level Entailment

We would like to see whether proposition-level entailment labels can be composed to explain sentence/paragraph-level NLI predictions. Given a hypothesis sentence/paragraph and a premise, our PropNLI framework takes three steps. First, we segment the hypothesis into propositions. Then, for each proposition, we infer its entailment relation with the premise. Finally, when multiple propositions exist in the hypothesis, the proposition-level entailment labels are aggregated to obtain the entailment label for the entire hypothesis, similar to ideas presented in Stacey et al. (2022). As a starting point, we assume logical conjunction as the aggregation function, and hypothesize that this will offer a more fine-grained and explainable way of conducting NLI inference.
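The three-step procedure with logical conjunction as the aggregation function can be sketched as follows, where `segment` and `entails` stand in for the finetuned segmentation and entailment models; the stubs in the usage example are toys for illustration only, not the actual models.

```python
def prop_nli(hypothesis, premise, segment, entails):
    """PropNLI sketch: segment the hypothesis into propositions, classify
    each against the premise, then aggregate by logical conjunction.
    Returns the hypothesis-level verdict plus per-proposition labels."""
    propositions = segment(hypothesis)
    labels = [entails(p, premise) for p in propositions]
    return all(labels), list(zip(propositions, labels))

# Toy stubs, purely illustrative:
toy_segment = lambda h: h.split(" and ")   # pretend clauses are propositions
toy_entails = lambda p, prem: p in prem    # pretend substring match is NLI
verdict, per_prop = prop_nli("it rained and it snowed",
                             "it rained all day", toy_segment, toy_entails)
```

The per-proposition labels are what make the aggregate verdict explainable: a non-entailed hypothesis can be traced back to exactly which propositions lack support.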
To demonstrate the utility of the idea, we conduct a case study on DocNLI (Yin et al., 2021), which features premises and hypotheses of different lengths, and thus a varying number and composition of propositions. We take the baseline T5-Large segmentation and entailment models respectively, and use logical conjunction to aggregate the proposition-level entailment predictions. We compare PropNLI in a zero-shot setting against the T5-Large MNLI model, which takes the entire hypothesis and premise as input without any segmentation or decomposition.
The results are shown in Figure 4. We take the development set of DocNLI, split the examples into buckets according to the number of tokens in the hypothesis, and examine the zero-shot performance of the PropNLI setup versus the finetuned MNLI model. With shorter hypotheses (<100 tokens), the two setups demonstrate similar performance, as the hypothesis length is similar to the distribution of the MNLI training set (avg. 21.73 tokens ± 30.70). As the length of the hypothesis increases, the performance of the MNLI model starts to drop, while PropNLI's performance remains relatively stable. These observations suggest the potential of using the PropNLI framework to describe the textual entailment relations between a premise and a hypothesis in a more precise and fine-grained manner. In the realistic case where input hypotheses are compositional, PROPSEGMENT presents an opportunity for developing more generalizable NLI models and solutions.

Conclusion
In this paper, we presented PROPSEGMENT, the first large-scale dataset for studying proposition-level segmentation and entailment. We demonstrate that segmenting a text expression into propositions, i.e. atomic units of meaning, and assessing their truth values provides a finer-grained characterization of the textual entailment relation between two pieces of text. Beyond NLI/RTE tasks, we hypothesize that proposition-level segmentation might be helpful in similar ways for other text classification tasks as well. We hope that PROPSEGMENT will serve as a starting point, and pave a path forward for research along this line.

Limitations
Since the PROPSEGMENT dataset features entailment labels for all propositions in a document, the label distribution is naturally imbalanced, which can pose a challenge for modeling. We observe a low presence of contradiction examples in our dataset construction process, which could be a limiting factor for the utility of the dataset. Unlike in previous NLI datasets (Bowman et al., 2015; Williams et al., 2018), we speculate that reference determinacy, i.e. whether the hypothesis and premise refer to the same scenario at the same time, cannot be guaranteed or safely assumed in our case, which in part leads to the low presence of contradictions during annotation. We offer a detailed discussion on the implications of reference determinacy and contradictions in Appendix C, and leave the exploration of natural contradictions for future work.
Moreover, as the annotation complexity and cost scale quadratically w.r.t. the number of propositions in a document, we truncate the documents in PROPSEGMENT to the first ten sentences of the original document.

Ethical Considerations
In the proposition-level entailment task (T2), inferring the entailment relation between a premise document and a hypothesis proposition relies on the assumption that the premise document is true. This assumption is common to NLI datasets (Dagan et al., 2005; Bowman et al., 2015; Williams et al., 2018), and is necessary for the task's structure. For the documents in PROPSEGMENT, we make this assumption only for the experimental purposes of T2, and make no claim about the actual veracity of the premise documents.
A Model Implementation
T5 We use T5 1.1 checkpoints from the T5X library (Roberts et al., 2022), with the Flaxformer implementation. For all sizes of the T5 model and all tasks, we finetune the model for three epochs, with a learning rate of 1e-3, dropout rate of 0.1, and batch size of 128. We train the models on 16 TPU v3 slices.
BERT We use the BERT English uncased models from TensorFlow (Abadi et al., 2016), in large (24 layers, 16 attention heads, 1024 hidden units) and base (12 layers, 12 attention heads, 768 hidden units) sizes. For both sizes, we finetune the model for five epochs, with a learning rate of 1e-5, dropout rate of 0.1, and batch size of 16. We train the models on 8 TPU v3 slices.

B.1 Segmentation annotation guidelines
There is no unequivocally unique definition of precisely how to segment an English sentence, in the context of a document, into propositions defined as token subsets, due to a variety of complex language phenomena. Our raters were instructed to follow these overall guidelines for the segmentation task:

1. Each proposition is expected to correspond to a distinct fact that a reader learns directly from reading the given sentence.
(a) The raters are instructed to focus on the text's most literal denotation, rather than drawing further inferences from the text based on world knowledge, external knowledge, or common sense.
(b) The raters are instructed to consider factivity, marking only propositions that, in their judgement, the author intends the reader to take as factual from reading the sentence.
(c) With regard to quotes, raters are asked to estimate the author's intent, including the quoted proposition when the reader is expected to take it as factual, and/or the proposition that the quote itself was uttered if the reader is expected to learn that a speaker uttered it.
(d) The raters are instructed to omit text that is clearly non-factual, such as rhetorical flourishes or a first-person account of an article author's emotional response to the topic. This rule is specific to the news and Wikipedia domains, since in other domains of prose, first-person emotions may well be part of the intended informational payload.
2. Each proposition should include all tokens within the sentence that are relevant to learning this fact.
(a) Specifically, the raters are asked to include any tokens in the same sentence that are antecedents of pronouns or other endophora in the proposition, or relevant bridging references.
(b) Raters are asked to ignore punctuation, spacing, and word inflections when selecting tokens, though a number of other minutiae, such as whether to include articles, are left unspecified in the rater instructions.
3. Choose the simplest possible propositions, so that no proposition is equivalent to a conjunction of the other propositions, and so that the union of all of the sentence's propositions gives all the information a reader learns from the sentence.
The raters are also asked to omit propositions from any text that doesn't constitute well-formed sentences, typically arising from parsing errors or from colloquialisms.
Note that the resulting subsets of tokens do not, generally, constitute well-formed English sentences when concatenated directly, but can, in our ad hoc trials, easily be reconstituted into stand-alone sentences by a human reader.
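To make the token-subset representation concrete, the following is a minimal sketch (for illustration only, not our annotation tooling) of how a proposition can be stored as a set of token indices into its source sentence and rendered back as a token sequence. The sentence and index sets are hypothetical examples:

```python
# A proposition is represented as a subset of token indices into its
# source sentence. Direct concatenation of the selected tokens is, as
# noted above, not guaranteed to be a well-formed sentence.

sentence = "Alice and Bob went to the Zoo".split()

# Two propositions over the same sentence, as token-index subsets.
prop_alice = {0, 3, 4, 5, 6}   # "Alice went to the Zoo"
prop_bob = {2, 3, 4, 5, 6}     # "Bob went to the Zoo"

def render(tokens, index_subset):
    """Concatenate the selected tokens in sentence order."""
    return " ".join(tokens[i] for i in sorted(index_subset))

print(render(sentence, prop_alice))  # Alice went to the Zoo
print(render(sentence, prop_bob))    # Bob went to the Zoo
```

In this simple case the concatenations happen to read as stand-alone sentences; in general, as noted above, a human reader may need to lightly rephrase the selected tokens.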

B.2 Entailment annotation guidelines
For the propositional entailment task, our instructions are somewhat similar to the RTE task (Dagan and Glickman, 2004), but specialized to the proposition level.
The raters are asked to read the premise document and decide whether a specific hypothesis proposition is entailed by it, contradicted by it, or neither. In the first two cases, the raters are asked to mark the proposition in the premise document that most closely supports the hypothesis proposition, using the same definition of proposition as above.
The interface nudges the raters to select one of the propositions marked by the segmentation rater, but allows the entailment rater to create a new proposition as well.Note that the choice of a specific supporting proposition is sometimes not well defined.
To judge entailment, the raters are asked: "from reading just the premise document, do we learn that the hypothesis proposition is true, learn that it is false, or neither?" More specifically, the raters are asked:

1. To consider the full document of the hypothesis as the context of the hypothesis proposition, along with the full premise document.
2. To allow straightforward entailment based on "common sense or widely-held world knowledge", but otherwise avoid entailment labels whenever "significant analysis" (any complex reasoning, specialized knowledge, or subjective judgement) is required to align the two texts.
3. To assume that the two documents were written in the same coarse spatiotemporal context -same geographical area, and the same week.
Raters have the option of marking that they don't understand the premise and/or the hypothesis and skipping the question.

C Reference Determinacy and Contradictions
The PROPSEGMENT dataset is constructed in a document-to-document comparison setting. Even though the document clusters are sampled so that documents in a cluster target the same event or entity, the documents typically have different foci. Beyond the factual information, which is mostly consistent across documents, the focus or specific perspective of each document varies widely, which is in part why we observe very few contradictions. In addition, we speculate that the low presence of contradictions can be partly attributed to the difficulty of establishing reference determinacy, i.e. whether the entities and events described in a hypothesis can be assumed to refer to the same ones, happening at the same time, as in the premise. To illustrate the importance of this, consider the following example from SNLI (Bowman et al., 2015).
Premise: A black race car starts up in front of a crowd of people.
Hypothesis: A man is driving down a lonely road.
In SNLI, reference determinacy is assumed to hold. In other words, the human raters assume that the scenarios described in the premise and hypothesis happen in the same context at the same point in time. The example pair is therefore labeled as contradiction, as "lonely road" contradicts "a crowd of people" if we assume both happen on the same road. Without this assumption, the example would likely be labeled as neutral, since there is no extra context indicating that the two events happen in the same setting.
In reality, reference determinacy is often difficult to establish with certainty. Unlike existing NLI/RTE datasets (Dagan et al., 2005; Bowman et al., 2015; Williams et al., 2018), in the creation process of PROPSEGMENT we do not assume reference determinacy between the hypothesis proposition and the premise document, but rather defer the judgement to the human raters, who read the context information presented in the documents. We observe that it is often hard to tell whether a specific proposition within a document establishes reference determinacy with the other document, unless the proposition describes a property that is stationary with respect to time. For this reason, most contradictions, among the few that exist in our dataset, are factual statements. Here is an example from the development split.
Premise: ... The team was founded in 1946 as a founding member of the All-America Football Conference (AAFC) and joined the NFL in 1949 when the leagues merged ...
Hypothesis: The 49ers have been members of the NFL since the AAFC and National Football League (NFL) merged in 1950 ...

We view the lack of contradictions as a potential limitation of the dataset for practical purposes. We argue for the need to circumscribe the exact definition of contradiction (from a practical perspective) when reference determinacy cannot simply be assumed. We leave this for future work.

D Example Propositions From OpenIE vs. PROPSEGMENT

Sentence: The Cleveland Cavaliers got the first choice in the lottery, which was used on 20-year-old forward Anthony Bennett, a freshman from the University of Nevada.
PROPSEGMENT: six propositions (#1-#6), each marked as a highlighted token subset of the sentence above (highlighting not reproducible in plain text).
ClausIE
#1: (The Cleveland Cavaliers, got, the first choice in the lottery)
#2: (the lottery, was used, on 20-year-old forward Anthony Bennett)
#3: (Anthony Bennett, is, a freshman from the University of Nevada)
Neural Bi-LSTM OIE
#1: (The Cleveland Cavaliers, got, the first choice in the lottery, which was used on 20-year-old forward Anthony Bennett, a freshman from the University of Nevada.)
#2: (the lottery, was used, on 20-year-old forward Anthony Bennett, a freshman from the University of Nevada.)

Figure 1: Distribution of proposition counts in sentences with at least one informational proposition, from Wikipedia and news, in PROPSEGMENT.

Figure 2: The percentage of sentences with a partial entailment relation to another topically-related document from Wikipedia or news in PROPSEGMENT. Typically, NLI/RTE datasets do not distinguish partial entailment from the non-entailment categories.
[TARGET]. The spans of tokens included in each proposition are surrounded by special tokens [M] and [/M]. For instance: "[M]Alice[/M] and Bob [M]went to the Zoo[/M]. [TARGET] Alice and [M]Bob went to the Zoo.[/M]".
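Under this input format, the marked string can be produced from contiguous token spans roughly as follows. This is a hypothetical sketch, not our actual preprocessing code: the helper name and the (start, end) span convention are assumptions, and token-level whitespace handling is simplified (the sketch joins everything with spaces, whereas the example above attaches markers without spaces).

```python
# Sketch of building a marked input string from contiguous token spans.
# Spans are (start, end) token-index pairs, end-exclusive.

def mark_spans(tokens, spans):
    """Wrap each (start, end) token span in [M] ... [/M] markers."""
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in starts:
            out.append("[M]")
        out.append(tok)
        if i + 1 in ends:
            out.append("[/M]")
    return " ".join(out)

tokens = "Alice and Bob went to the Zoo .".split()
# Mark "Alice" and "went to the Zoo" as one proposition's spans.
print(mark_spans(tokens, [(0, 1), (3, 7)]))
# [M] Alice [/M] and Bob [M] went to the Zoo [/M] .
```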

Figure 4: Zero-shot performance of the T5-large MNLI model compared to our PropNLI T5-large models (i.e. segmentation → entailment → aggregation) with respect to varying hypothesis length on the DocNLI dev set. The shaded region shows the 95% confidence interval.

..., we use Jaccard similarity, i.e. intersection over union of the two ...

Table 2: Notable statistics from the PROPSEGMENT dataset.
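The Jaccard similarity used in the segmentation evaluation compares two propositions as token(-index) sets, i.e. intersection over union. A minimal sketch, with the example sets chosen purely for illustration:

```python
def jaccard(a: set, b: set) -> float:
    """Intersection over union of two token-index sets.

    Returns 1.0 for two empty sets by convention.
    """
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Predicted vs. gold token subsets for one proposition: 4 shared
# indices out of 5 total distinct indices.
print(jaccard({0, 3, 4, 5}, {0, 3, 4, 5, 6}))  # 0.8
```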

Table 3: Performance of the baseline models on the full (WIKI + NEWS) test set. Due to the low presence of contradictions (32/8643 = 0.4% of test), F1 for contradiction does not reflect a statistically significant improvement.

Table 7: Example proposition segmentations from PROPSEGMENT compared with extractions from ClausIE (Del Corro and Gemulla, 2013) and the neural Bi-LSTM OIE model from Stanovsky et al. (2018).

Sentence: The Andy Warhol Museum in his hometown, Pittsburgh, Pennsylvania, contains an extensive permanent collection of art.
PROPSEGMENT: three propositions (#1-#3), each marked as a highlighted token subset of the sentence above (highlighting not reproducible in plain text).
ClausIE
#1: (his, has, hometown)
#2: (his hometown, is, Pittsburgh Pennsylvania)
#3: (The Andy Warhol Museum in his hometown, contains, an extensive permanent collection of art)
Neural Bi-LSTM OIE
#1: (The Andy Warhol Museum in his hometown Pittsburgh Pennsylvania, contains, an extensive permanent collection of art)