R2F: A General Retrieval, Reading and Fusion Framework for Document-level Natural Language Inference

Document-level natural language inference (DOCNLI) is a new challenging task in natural language processing, aiming at judging the entailment relationship between a pair of hypothesis and premise documents. Current datasets and baselines largely follow sentence-level settings, but fail to address the issues raised by longer documents. In this paper, we establish a general solution, named Retrieval, Reading and Fusion (R2F) framework, and a new setting, by analyzing the main challenges of DOCNLI: interpretability, long-range dependency, and cross-sentence inference. The basic idea of the framework is to simplify document-level task into a set of sentence-level tasks, and improve both performance and interpretability with the power of evidence. For each hypothesis sentence, the framework retrieves evidence sentences from the premise, and reads to estimate its credibility. Then the sentence-level results are fused to judge the relationship between the documents. For the setting, we contribute complementary evidence and entailment label annotation on hypothesis sentences, for interpretability study. Our experimental results show that R2F framework can obtain state-of-the-art performance and is robust for diverse evidence retrieval methods. Moreover, it can give more interpretable prediction results. Our model and code are released at https://github.com/phoenixsecularbird/R2F.


Introduction
Natural Language Inference (NLI) is the task of determining whether a hypothesis is entailed by a premise. While earlier works (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019; Nie et al., 2020) assume that both the hypothesis and the premise are single sentences, recent research pays more attention to the document-level task, namely Document-level NLI (DOCNLI) (Yin et al., 2021). The task enlarges the scope of NLI to judging the variability of semantic expression for many Natural Language Processing (NLP) tasks, e.g., alleviating exposure bias (Bengio et al., 2015; Arora et al., 2022) in text summarization (Sandhaus, 2008; Narayan et al., 2018), and recognizing human-manipulated news articles for automatic fake news detection (Jawahar et al., 2022; Huang et al., 2022).
Compared with sentence-level NLI, DOCNLI poses many new challenges, while there are only a few datasets and models. In terms of datasets, Yin et al. (2021) reformat some mainstream NLP tasks, e.g., text summarization and question answering, and build the first large-scale dataset, DOCNLI, with over 1 million document pairs. However, the dataset does not provide evidence annotation about how the labels are inferred, i.e., which hypothesis sentences lead to semantic inconsistency, or which premise sentences help to decide the entailment relationship. As shown in Figure 1, although the sample is annotated as not entailment, most of the hypothesis sentences are actually entailed. By contrast, the detailed disinformation of "in 1989" in the third hypothesis sentence eventually decides the entailment relationship between the documents. For each hypothesis sentence, only a few premise sentences are enough to serve as the exact evidence to judge its own sentence-level entailment label.
In this paper, we argue that evidence discovery is important and challenging for DOCNLI. Our pilot experiments in Sections 4.3 and 4.5 show that randomly selected evidences can still contribute to comparable performance. Thus, black-box models alone may not be convincing. However, annotating evidence for evaluation is non-trivial. For each hypothesis sentence, on the one hand, it may refer to multiple premise sentences. On the other hand, there may be several evidence groups, where each group can independently support the label prediction. We highlight this as the interpretability challenge.

Label: not_entailment
Hypothesis: ① US cities along the Gulf of Mexico from Alabama to eastern Texas were on alert last night as Hurricane Andrew headed west after hitting southern Florida leaving at least eight dead, causing severe property damage, and leaving 1.2 million homes without electricity. ② Gusts of up to 165 mph were recorded. ③ It is the fiercest hurricane to hit the US in 1989. ④ As Andrew moved across the Gulf there was concern that it might hit New Orleans, which would be particularly susceptible to flooding, or smash into the concentrated offshore oil facilities. ⑤ President Bush authorized federal disaster assistance for the affected areas.
Premise: ① US CITIES along the Gulf of Mexico from Alabama to eastern Texas were on storm watch last night as Hurricane Andrew headed west after sweeping across southern Florida, causing at least eight deaths and severe property damage. ③ The hurricane was one of the fiercest in the US in decades and the first to hit Miami directly in a quarter of a century. • • • ④ However, Hurricane Andrew gathered fresh strength as it moved across the Gulf of Mexico and there was concern last night that it might head towards New Orleans, which is especially low lying and could suffer severe flood
Figure 1: A sample of DOCNLI dataset. For each sample, only the entailment label between the documents is annotated. For display, we mark each hypothesis sentence and its corresponding premise sentences (namely the evidences, not annotated in the original dataset) with the same number and color. The sample is annotated as not entailment due to the disinformation of "in 1989" in the third hypothesis sentence. The premise is partly omitted.
In terms of models, current baselines (Yin et al., 2021; Zhong et al., 2020) still largely follow sentence-level NLI. They either concatenate the two documents for mutual information interaction and classification, or encode them separately for semantic matching with document-level representations. However, apart from the interpretability issue, they leave the following challenges unexplored:
• Long-range Dependency Modeling The task requires handling a pair of long documents at the same time. We observe that 29.81% of the samples in the DOCNLI dataset (Yin et al., 2021) contain more than 500 words, while 10.47% contain more than 1000 words (note that a word may correspond to multiple tokens for PLMs). This not only exceeds the input limit of Pre-trained Language Models (PLMs), but also makes it far more difficult to capture long-range dependency. The necessary information interaction between the hypothesis and some key premise sentences may not be guaranteed. Besides, most contexts are uninformative for entailment inference and only serve as noise.
• Cross-sentence Inference To judge the relationship between the documents, all hypothesis sentences must be considered, and the detailed disinformation issue still remains unsolved. Besides, verifying one hypothesis sentence may require combining multiple, distant premise sentences, unlike the sentence-pair mode in sentence-level NLI. As shown in Figure 1, processing the first hypothesis sentence requires both the first and sixth premise sentences (all in red font).
In this paper, we establish a general solution, named the Retrieval, Reading and Fusion (R2F) framework, and a new setting for the task. The basic idea of the framework is to simplify the document-level classification task into a set of sentence-level tasks, and then improve both performance and interpretability with the power of evidence. Specifically, the framework splits the hypothesis document into sentences. Then, for each hypothesis sentence, it retrieves evidence sentences from the premise, and reads to estimate its credibility score upon the evidences. Finally, it fuses the sentence-level results and judges the entailment relationship between the two documents. For the setting, we contribute complementary fine-grained annotations for interpretability study. For each hypothesis sentence, we manually annotate the entailment label and several evidence groups, where each group is enough to independently infer the label. In summary, our contributions are as follows:
• We propose a Retrieval, Reading and Fusion framework as a general solution for the DOCNLI task.
• We contribute complementary evidence and entailment label annotation for each hypothesis sentence on a subset of DOCNLI dataset for interpretability study.
• Our experimental results on the DOCNLI dataset indicate that the framework obtains state-of-the-art performance. Besides, it is robust for diverse retrieval methods. Moreover, the framework can give more interpretable prediction results.
Figure 2: R2F framework. For each hypothesis sentence, the framework first retrieves evidence sentences from the premise, and then reads to estimate the credibility upon the evidences. Finally, it fuses the sentence-level results and judges the entailment relationship between the two documents. $\hat{y}_i$ and $\hat{y}_{HP}$ are the credibility scores of the i-th hypothesis sentence and the sample, while $Evi_{ij}$ is the j-th evidence of the i-th hypothesis sentence.

R2F Framework
Our R2F framework aims at a general solution for the DOCNLI task with interpretability, i.e., to obtain the corresponding evidence and predict the entailment label for each hypothesis sentence. As shown in Figure 2, the framework consists of 3 components, namely evidence retrieval, reading for credibility estimation, and credibility fusion. For efficiency, the retrieval component is an independent unit that provides evidence input for the other two components, which are optimized jointly.

Task Formulation
Similar to previous sentence-level NLI tasks, for each sample in the DOCNLI task, given a hypothesis document $H$ and a premise document $P$, the goal is to judge the entailment relationship $R$ between the two documents. Here, $R \in \{$"entailment", "not_entailment"$\}$ for the DOCNLI dataset, but the formulation is not restricted to binary classification.

Evidence Retrieval
Given each sample, we split the hypothesis into sentences and retrieve evidence sentences from the premise. Formally, we split the hypothesis $H$ and the premise $P$ into sentences with NLTK (https://www.nltk.org/), i.e., $H = \{H_1, H_2, \cdots, H_m\}$ and $P = \{P_1, P_2, \cdots, P_n\}$. Here, $m$ and $n$ are the sentence numbers.
For each hypothesis sentence $H_i$, we utilize each of the following retrieval methods to calculate its relevance score with all premise sentences. Then, according to the scores, we keep the top $K$ sentences as the corresponding evidence. The value of $K$ is a trade-off between evidence recall and precision. A lower value pursues higher evidence precision, but may miss evidence, while a higher value guarantees higher evidence recall, but may introduce too many uninformative sentences as noise. Moreover, to keep and utilize contextual information, for each hypothesis sentence, we reorder the evidence sentences according to their original order in the premise.
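The top-K selection and reordering step described above can be sketched as follows. This is a minimal illustration rather than the released implementation: `retrieve_top_k` and `word_overlap` are hypothetical names, the toy word-overlap score merely stands in for the real retrieval methods, and breaking ties by earlier position is an assumption.

```python
from typing import Callable, List


def retrieve_top_k(hyp: str, premise_sents: List[str],
                   score: Callable[[str, str], float], k: int = 5) -> List[int]:
    """Return indices of the k best-scoring premise sentences, in premise order."""
    scored = [(score(hyp, sent), idx) for idx, sent in enumerate(premise_sents)]
    # Highest score first; ties go to the earlier sentence (an assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    top = [idx for _, idx in scored[:k]]
    # Restore the original premise order to keep contextual information.
    return sorted(top)


def word_overlap(a: str, b: str) -> float:
    """Toy relevance score: number of shared lowercase word types."""
    return float(len(set(a.lower().split()) & set(b.lower().split())))


premise = [
    "Hurricane Andrew headed west last night.",
    "Shares closed higher in London.",
    "The hurricane was one of the fiercest in decades.",
    "There was concern it might head towards New Orleans.",
]
evidence_ids = retrieve_top_k(
    "Hurricane Andrew was the fiercest hurricane", premise, word_overlap, k=2
)
print(evidence_ids)
```

Returning indices rather than sentences keeps the reordering step trivial and makes it easy to compare the retrieved evidence against annotated evidence positions.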
To calculate the relevance score, we take several sparse and dense retrieval methods into consideration:
• ROUGE-1 Inspired by Mao et al. (2022) and Zhang et al. (2022), we adopt ROUGE-1 retrieval. For a pair of sentences, this sparse retrieval method uses the unigram overlap of the pair to calculate the ROUGE-1 score as the relevance metric. We take it as the main retrieval method.
• BM25 BM25 is one of the most advanced sparse retrieval methods. We take all premise sentences as the corpus. For a pair of sentences, BM25 involves not only the pair itself but also the whole corpus, counting term frequency and inverse document frequency to obtain the relevance score.
• SimCSE We utilize SimCSE (Gao et al., 2021), a strong sentence embedding model, as the dense retrieval method for semantic matching. For a pair of sentences, we take the cosine similarity of the sentence embeddings as the relevance score.
In addition to the above retrieval methods, we also adopt another simple but effective strategy. If a hypothesis sentence is a substring of the premise, then it is naturally entailed by the premise through string matching and does not need further study.
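As an illustration of the sparse scoring and the string-match shortcut, the sketch below computes a unigram-overlap ROUGE-1 score and checks the substring condition. It rests on assumptions not stated in the paper: whitespace tokenization, lowercasing, and the F1 variant of ROUGE-1. The function names are hypothetical.

```python
from collections import Counter


def rouge1_f1(hyp_sent: str, premise_sent: str) -> float:
    """Unigram-overlap F1 between two sentences (whitespace tokens, lowercased)."""
    hyp_counts = Counter(hyp_sent.lower().split())
    prem_counts = Counter(premise_sent.lower().split())
    # Clipped overlap: each token counts at most as often as it occurs in both.
    overlap = sum((hyp_counts & prem_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(prem_counts.values())
    return 2 * precision * recall / (precision + recall)


def is_trivially_entailed(hyp_sent: str, premise_doc: str) -> bool:
    """String-match shortcut: a hypothesis sentence copied verbatim from the premise."""
    stripped = hyp_sent.strip()
    return bool(stripped) and stripped in premise_doc


premise_doc = "Gusts of up to 165 mph were recorded. The storm moved west."
print(rouge1_f1("a b c", "a b d"))
print(is_trivially_entailed("Gusts of up to 165 mph were recorded.", premise_doc))
```

A sentence flagged by `is_trivially_entailed` can skip retrieval and reading entirely, which is where the efficiency gain of the shortcut comes from.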

Reading & Credibility Estimation
For each hypothesis sentence, evidence retrieval concentrates the informative premise sentences and filters out most noisy ones. This component then estimates the credibility of the hypothesis sentence, which involves the sentence itself and several evidence sentences. This differs from conventional sentence-level NLI, which studies the relationship between a pair of sentences. Hence we adopt reading models. Our general R2F framework is compatible with arbitrary reading models, which may be enhanced by several advanced learning technologies, e.g., graph neural networks (Kipf and Welling, 2017; Velickovic et al., 2018), commonsense knowledge injection (Zhang et al., 2019; Wang et al., 2021), and syntactic structure analysis (Kitaev et al., 2022). Herein, we adopt a simple and straightforward one without loss of generality.
Specifically, for a hypothesis sentence $H_i$ and its corresponding evidence $\{Evi_{i1}, Evi_{i2}, \cdots, Evi_{iK}\}$, we concatenate them all as the input:

$$X_i = [\mathrm{CLS}] \oplus H_i \oplus Evi_{i1} \oplus Evi_{i2} \oplus \cdots \oplus Evi_{iK} \quad (1)$$

where $\oplus$ denotes sequence concatenation (separator tokens are omitted here). Then we leverage a transformer-based pre-trained language model, i.e., RoBERTa (Liu et al., 2019), to encode the input sequence. Through the multi-head self-attention mechanism (Vaswani et al., 2017), token-level information interaction among the hypothesis sentence and all its evidences is conducted. Besides, since the evidence is concentrated, it is much easier to handle the multiple evidence combination issue for cross-sentence inference.
Then the credibility score is calculated through a Multi-Layer Perceptron (MLP) with sigmoid activation function:

$$\hat{y}_i = \mathrm{sigmoid}(\mathrm{MLP}(h_i))$$

where $h_i$ is the hidden state of the special [CLS] token, and is taken as the inference vector of $H_i$. Besides, $\hat{y}_i \in [0.0, 1.0]$ is the credibility score of $H_i$, and a higher score means that the sentence is more likely to be entailed by the premise.

Credibility Fusion
After reading, the inference vector $h_i$ and credibility score $\hat{y}_i$ of each hypothesis sentence $H_i$ are obtained. Nevertheless, the reading model cannot be trained directly since the detailed entailment label of $H_i$ is not available. To this end, we fuse the sentence-level results to judge the entailment relationship between the documents, and indirectly train the model through the document-level entailment label. Besides, the fusion process is also expected to solve the detailed disinformation issue for cross-sentence inference, and expand the interpretability of the framework. Herein, we design three fusion methods for comparison.
• Credibility Score Minimum Pooling The logical basis for this method is that if the premise entails the hypothesis, then it entails all the hypothesis sentences, even the one with the lowest credibility score. By contrast, if the premise does not entail the hypothesis, then it conflicts with at least one hypothesis sentence. This sentence is expected to be assigned the lowest credibility score.
Formally, for a pair of documents $H$ and $P$, the credibility scores of the hypothesis sentences are $\{\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_m\}$. Then the credibility score of the sample is calculated as:

$$\hat{y}_{HP} = \min\{\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_m\}$$

For this fusion method, the sample prediction result comes from that of the least credible hypothesis sentence, i.e., the one with the lowest credibility score. The framework is required to conduct internal contrast among the hypothesis sentences to decide the least credible one. Thus, it is forced to understand the entailment relationship between each hypothesis sentence and its corresponding evidences, even without a direct entailment label. During prediction, the credibility score $\hat{y}_i$ is utilized to predict the entailment label of hypothesis sentence $H_i$. We take this as the main fusion method.
• Inference Vector Minimum Pooling For this method, we conduct minimum pooling on the inference vector $h_i$ rather than the credibility score $\hat{y}_i$. For the inference vector of the sample $h_{HP}$, the j-th dimension is:

$$h_{HP}^{j} = \min_{1 \le i \le m} h_i^{j}$$

Then the credibility score of the sample is:

$$\hat{y}_{HP} = \mathrm{sigmoid}(\mathrm{MLP}(h_{HP}))$$

Many conventional neural models tend to adopt this fusion method for better performance, but suffer from low interpretability, since the practical meaning of each dimension of the inference vector can hardly be probed.
• Gaussian Kernel Pooling To further explore the influence of fusion methods, we adopt Gaussian Kernel Pooling (Xiong et al., 2017; Liu et al., 2020; Sheng et al., 2022). Specifically, we utilize $C$ Gaussian kernels $\{K_j\}_{j=1}^{C}$. For a credibility score $\hat{y}_i$, the output of the j-th kernel is:

$$K_j(\hat{y}_i) = \exp\left(-\frac{(\hat{y}_i - \mu_j)^2}{2\sigma_j^2}\right)$$

where $\mu_j$ and $\sigma_j$ are respectively the mean and width of the j-th kernel. In this way, the score $\hat{y}_i$ is projected to a kernel vector $V_i \in \mathbb{R}^C$. The kernel vector of the sample is:

$$V_{HP} = \frac{1}{m}\sum_{i=1}^{m} V_i$$

And the credibility score of the sample is:

$$\hat{y}_{HP} = \mathrm{sigmoid}(\mathrm{MLP}(V_{HP}))$$

This is a mean pooling method, which conducts mean pooling on the corresponding kernel vectors, rather than the credibility scores.
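The three fusion methods above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released code: the function names are hypothetical, the sentence scores and vectors are made up, the kernel uses the standard Gaussian (RBF) form, and the learned layer that maps a pooled vector to the final sample score is omitted.

```python
import math
from typing import List, Sequence, Tuple


def score_min_pool(scores: Sequence[float]) -> Tuple[float, int]:
    """Credibility score minimum pooling: sample score and least credible sentence index."""
    if not scores:
        raise ValueError("need at least one hypothesis sentence")
    idx = min(range(len(scores)), key=lambda i: scores[i])
    return scores[idx], idx


def vector_min_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Inference vector minimum pooling: dimension-wise minimum over sentences."""
    if not vectors:
        raise ValueError("need at least one inference vector")
    return [min(column) for column in zip(*vectors)]


def kernel_vector(score: float, means: Sequence[float], width: float) -> List[float]:
    """Project one credibility score onto Gaussian kernels sharing one width."""
    return [math.exp(-((score - mu) ** 2) / (2.0 * width ** 2)) for mu in means]


def kernel_mean_pool(scores: Sequence[float], means: Sequence[float],
                     width: float) -> List[float]:
    """Gaussian kernel pooling: mean of the per-sentence kernel vectors."""
    if not scores:
        raise ValueError("need at least one hypothesis sentence")
    vectors = [kernel_vector(s, means, width) for s in scores]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up scores for five hypothesis sentences; the third one is the outlier.
sentence_scores = [0.97, 0.95, 0.08, 0.91, 0.88]
doc_score, weakest = score_min_pool(sentence_scores)
pooled = vector_min_pool([[0.2, -1.0, 0.5], [0.1, 0.3, 0.9]])
# Kernel means and width here are illustrative values only.
sample_kernels = kernel_mean_pool([0.0, 1.0], means=[0.0, 0.5, 1.0], width=0.1)
print(doc_score, weakest, pooled, sample_kernels)
```

Only `score_min_pool` returns an index, which reflects why that fusion method is the interpretable one: the sample score can be traced back to a single hypothesis sentence.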
During training, the loss function of the sample is set as the binary cross entropy loss:

$$\mathcal{L} = -\left[y_{HP} \log \hat{y}_{HP} + (1 - y_{HP}) \log (1 - \hat{y}_{HP})\right]$$

where $y_{HP}$ is the sample entailment label, 1 for entailment samples and 0 for not entailment ones. During prediction, we set a threshold on the sample credibility score $\hat{y}_{HP}$ to obtain the result.
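For concreteness, the document-level training signal and the thresholded prediction can be sketched as below. The clipping constant `eps` and the choice of `>=` at the threshold are assumptions made for this illustration; the default threshold of 0.5 follows the experiment setup reported later in the paper.

```python
import math


def bce_loss(y_true: int, y_pred: float, eps: float = 1e-7) -> float:
    """Binary cross entropy for one sample; y_true is 1 (entailment) or 0."""
    # Clip the prediction to avoid log(0); eps is an illustrative choice.
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def predict_label(y_pred: float, threshold: float = 0.5) -> str:
    """Map the sample credibility score to a document-level label."""
    return "entailment" if y_pred >= threshold else "not_entailment"


# A sample whose least credible sentence scored 0.08 (made-up value).
print(bce_loss(0, 0.08), bce_loss(1, 0.08), predict_label(0.08))
```

With minimum pooling, the loss gradient reaches only the least credible sentence of each sample, which is how the document-level label indirectly supervises the sentence-level scores.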

Dataset
We conduct our experiments on DOCNLI dataset (Yin et al., 2021)

Complementary Annotation
For interpretability study, we contribute complementary annotation. The hypothesis sentences may involve cross-sentence inference and require multiple evidences. Besides, they may also correspond to several evidence groups, where each group by itself is enough to independently infer the entailment label. Thus the annotation process involves a heavy workload and great complexity. Herein, we adopt a proposal-and-correction annotation strategy. Specifically, for each hypothesis sentence, we first retrieve candidate evidences through the diverse methods in Section 2.2. Then we manually check them, remove repeated or unrelated ones, add missing ones, and combine them into evidence groups. Finally, we decide the entailment label according to the evidences.

Label: entailment
Hypothesis Sentence: Tony Abbott will withdraw Australia's ambassador to Indonesia.
Evidence Group 1: Prime Minister Tony Abbott said Australia will withdraw its ambassador to Indonesia in an unprecedented diplomatic response to the executions of Myuran Sukumaran and Andrew Chan.
Evidence Group 2: Mr Prasetyo shrugged off diplomatic backlash from Australia after Prime Minister Tony Abbott slammed the executions as "cruel and unnecessary" and announced he would withdraw Australia's ambassador to Indonesia Paul Grigson.
Figure 3: An annotation example. The hypothesis sentence is annotated as entailment with two evidence groups.
Due to the heavy workload and great complexity, we manually annotate 100 longer samples (all over 800 words) randomly selected from the test set, which contain more than 350 hypothesis sentences in total. An annotation example is shown in Figure 3. The hypothesis sentence is annotated as entailment with two evidence groups. More detailed statistics are summarized in Appendix A.

Evaluation Metric
For the DOCNLI task, we adopt micro and macro F1 scores, and attach more importance to the latter. Due to the unbalanced label distribution, even a majority-guess model obtains a high micro F1 score, but a rather low macro F1 score. For sentence-level evaluation, we adopt evidence precision, recall and F1 for retrieval, and micro and macro F1 scores for label prediction. Moreover, inspired by Thorne et al. (2018), we propose a stricter metric, full accuracy. Herein, evidence recall requires finding at least one complete evidence group, and full accuracy further requires correct label prediction.
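The group-based evidence recall and the full accuracy metric can be sketched as follows. This is an illustrative reading of the definitions above, not the official evaluation script: the dictionary keys, the function names, and averaging over hypothesis sentences are assumptions, and the example records are made up.

```python
from typing import Dict, List, Sequence


def hits_complete_group(retrieved: Sequence[int],
                        groups: Sequence[Sequence[int]]) -> bool:
    """True if every sentence of at least one annotated evidence group was retrieved."""
    retrieved_set = set(retrieved)
    return any(set(group) <= retrieved_set for group in groups)


def sentence_level_scores(examples: List[Dict]) -> Dict[str, float]:
    """Evidence recall and full accuracy, averaged over hypothesis sentences."""
    if not examples:
        raise ValueError("need at least one hypothesis sentence")
    hits = [hits_complete_group(ex["retrieved"], ex["groups"]) for ex in examples]
    # Full accuracy additionally requires the predicted label to be correct.
    full = [hit and ex["pred"] == ex["gold"] for hit, ex in zip(hits, examples)]
    n = len(examples)
    return {"evidence_recall": sum(hits) / n, "full_accuracy": sum(full) / n}


examples = [
    {"retrieved": [0, 5, 7], "groups": [[0, 5]],
     "pred": "entailment", "gold": "entailment"},          # hit, correct
    {"retrieved": [1, 2, 3], "groups": [[2], [8, 9]],
     "pred": "not_entailment", "gold": "entailment"},      # hit, wrong label
    {"retrieved": [4, 6], "groups": [[4, 9]],
     "pred": "not_entailment", "gold": "not_entailment"},  # incomplete group
    {"retrieved": [3], "groups": [[1], [3]],
     "pred": "entailment", "gold": "entailment"},          # hit, correct
]
scores = sentence_level_scores(examples)
print(scores)
```

The third record shows why the metric is strict: retrieving only part of a multi-sentence evidence group does not count, even though the label prediction is correct.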

Experiment Setup
Our R2F framework is implemented with PyTorch 1.8.0. We adopt the AdamW optimizer, keep a random seed of 42, set the max input length to 256, and set the mini-batch size to 8 with 4 gradient accumulation steps. For the base encoder, we choose an initial learning rate of 1e-5, while for the large encoder, we choose 5e-6. For evidence retrieval, we set K to 5. During prediction, we adopt a threshold of 0.5. More setup details are given in Appendix B.

Baseline
Since DOCNLI is still a new task, we adopt the concatenation model from Yin et al. (2021), and modify the semantic match model from Zhong et al. (2020) for comparison.Please refer to the original papers and Appendix C for detailed information.

Main Results
Main results of our framework on the test set are displayed in Table 2. The framework obtains the best performance, with the highest micro and macro F1 scores for both base and large encoders, indicating its strength. The situation is similar on the development set in Appendix D. All models show far better performance (higher F1 score) on not entailment samples than on entailment samples, while entailment samples still come with even higher recall. This may be due to the label distribution difference among different sets (in Table 1). Among the baselines, the semantic match model conducts coarse-grained information interaction between the hypothesis and premise through document-level vector representations. Thus it shows rather low performance, although it can avoid the possible loss of key information caused by truncating overlong samples. This indicates that fine-grained information interaction is essential for this task. Besides, the concatenation model with the Longformer encoder, although able to handle much longer inputs, shows much lower performance than the one with the RoBERTa encoder, which truncates the inputs. The simplified multi-head self-attention mechanism in the Longformer encoder (Beltagy et al., 2020) seems insufficient for fine-grained information interaction in the task.

Performance vs Sample Length
To examine the ability of our framework to process long inputs, the performance with varying sample length on the test set is shown in Figure 4. Here, the length of a sample is counted in the number of words. Both models obtain far better performance on shorter samples than on longer ones (more than 10% absolute difference on both micro F1 and macro F1 scores). Hence it is still a relatively difficult problem to process longer samples. Moreover, our R2F framework consistently and greatly outperforms the strongest concatenation baseline on samples of varying length. In particular, it shows much higher performance on longer samples. These results indicate that the framework is able to handle longer samples effectively by breaking the document-level task into sentence-level tasks with the retrieval, reading and fusion process. On the development set, the tendency is similar, and detailed results are displayed in Appendix D.

Influence of Evidence Retrieval
To investigate the influence of evidence retrieval, we focus on different retrieval methods, and take those mentioned in Section 2.2 for comparison. Besides, we also adopt a random baseline, which takes K randomly selected premise sentences as the evidences. As shown in Table 3, the random baseline can still obtain relatively high performance, although it cannot outperform the concatenation baseline. This may be due to the evidence dependency bias discussed in Section 4.5. Besides, for shorter samples with only a few premise sentences, evidence retrieval is less important. All other retrieval methods contribute to higher performance than the concatenation baseline, indicating the strength of the framework on the task and its robustness for diverse retrieval methods. However, supervised SimCSE, although trained on human-annotated NLI benchmarks, shows the lowest performance, which may result from a transfer issue caused by the domain difference between its own training data and the DOCNLI dataset.

Influence of Fusion Method
The performance of different fusion methods is compared in Figure 5. Gaussian kernel pooling tends to incorrectly predict all samples as not entailment and can hardly recognize entailment samples. Thus it comes with a much higher micro F1 score but a far lower macro F1 score under the highly unbalanced label distribution (in Table 1). This mean pooling fusion method does not seem feasible for the task. On both sets, credibility score minimum pooling outperforms all other fusion methods. This fusion method also comes with a clear logical basis that expands the interpretability of the framework.

Interpretability Study
To investigate the interpretability of our framework, we conduct sentence-level evaluation on the subset annotated by ourselves.As shown in Table 4, we focus on evidence retrieval and entailment label prediction for hypothesis sentences.
• Evidence Retrieval For these longer samples (all over 800 words), the random baseline can hardly obtain the evidence, with extremely low evidence recall. All other retrieval methods can find at least one complete evidence group for most of the hypothesis sentences, with relatively high evidence recall of around 85%. However, with rather low evidence precision and F1 score, all these methods introduce plenty of uninformative sentences as noise. Thus, the evidence retrieval component may need further improvement.
• Entailment Label Prediction Despite its poor retrieval performance, the random baseline still shows relatively high performance on entailment label prediction. This is due to evidence dependency bias. Entailment samples strictly require at least one complete evidence group. However, not entailment samples are insensitive to evidence retrieval. With a complete evidence group obtained, they are taken as conflicting with the premise, while with evidence missing, they are taken as not mentioned. Both situations are considered as not entailment. This kind of bias also contributes to the high document-level performance of the random baseline in Table 3. All other retrieval methods obtain much higher performance with the power of evidence. However, it seems that high evidence recall does not guarantee more accurate entailment label prediction for hypothesis sentences. This may also be due to evidence dependency bias. Besides, with evidence precision at only around 30%, evidence noise is also an important issue, the influence of which is difficult to estimate. Moreover, the high full accuracy score means that for more than 65% of hypothesis sentences, our framework can find at least one complete evidence group and correctly predict their entailment label. Therefore, given that sentence-level annotation is not available during training, our R2F framework is able to give more interpretable prediction results and help to locate the semantic inconsistency.

Case Study
To further display the interpretability of our framework, we conduct a case study on the sample in Figure 1. As shown in Figure 6, for this long sample (about 667 words in total), the random baseline is rather weak and cannot find a complete evidence group for any hypothesis sentence, although it obtains the correct prediction for the sample. This also demonstrates the importance of evidence discovery for interpretability study. By contrast, all other retrieval methods can obtain complete evidence groups, and help to correctly predict the sentence-level entailment label for most of the hypothesis sentences. Thus the framework can give more interpretable prediction results, and accurately locate the semantic inconsistency in the third hypothesis sentence. Furthermore, with the two sparse retrieval methods, our framework can even successfully handle all the hypothesis sentences. However, the two dense methods fail to obtain a complete evidence group for the first hypothesis sentence. As shown in Figure 1, this sentence requires two evidences: most of its words are related to the first evidence, while only a few words are related to the second one. Moreover, it rephrases the second one with totally different expressions. Therefore, it is difficult to find the second one. This may be the exact situation that leads to the lower evidence recall of dense retrieval methods compared with sparse ones in Table 4.

Related Work
Natural language inference is a fundamental yet important task in natural language processing. For sentence-level NLI, several benchmarks (Bowman et al., 2015; Williams et al., 2018; Wang et al., 2019; Nie et al., 2020) have been proposed and have attracted much research attention. Moreover, Koreeda and Manning (2021) propose Contract NLI, a small-scale benchmark targeting the legal and business domain. Its premises are long contract documents, while its hypotheses are single sentences. Besides, Tian et al. (2022) propose debiasing natural language inference and understanding models through causal intervention and counterfactual reasoning. Lin et al. (2022) adopt commonsense inference to enhance future event generation. For document-level NLI, Yin et al. (2021) propose the first large-scale benchmark, DOCNLI, based on a set of early benchmarks. Current models for the task still largely follow sentence-level NLI. By contrast, in this paper, we emphasize the importance of evidence discovery and aim at a general solution for the task.

Conclusion
In this paper, we propose the R2F framework as a general solution for the DOCNLI task and contribute complementary annotation on the DOCNLI dataset. Our experimental results show that our framework can obtain state-of-the-art performance. Besides, the framework is robust for diverse retrieval methods, and consistently obtains higher performance on samples of varying length, especially longer samples. Moreover, the framework can give more interpretable prediction results and help to locate the semantic inconsistency. In the future, we will explore an end-to-end framework for the task.

Limitations
The main limitation of our R2F framework is that it is a pipeline rather than an end-to-end framework. For efficiency, the evidence retrieval component is an independent unit, which provides evidence input for the jointly trained reading and fusion components. However, the evidence retrieval component in our pipeline framework does not bring heavy additional computation. First, it can be conducted offline efficiently. The sparse retrieval methods only involve term frequency counting, and the dense ones are based on pretrained sentence embeddings without further fine-tuning. Second, there are many acceleration techniques for retrieval in industry.
Furthermore, we will try to improve the evidence retrieval component in the future. On the one hand, we will try to utilize the document-level entailment label to improve sentence-level evidence retrieval. On the other hand, we will also explore an efficient end-to-end model, which may benefit from reinforcement learning (Williams, 1992; Lei et al., 2016), the reparameterization trick (Maddison et al., 2017; Jang et al., 2017), or the Expectation-Maximization algorithm (Dempster et al., 1977). However, the evidence dependency bias issue, discussed in Section 4.5, will pose a great challenge. That is, entailment samples strictly require complete evidence groups, while not entailment samples are insensitive to evidence retrieval.

B Detailed Experiment Setup
Our R2F framework is implemented with PyTorch 1.8.0 and Hugging Face Transformers. All experiments are conducted on a computation node with Nvidia 40G A100 GPUs. For evidence retrieval, we set K to 5, keeping the top 5 sentences as evidences during retrieval. We adopt RoBERTa (Liu et al., 2019) as the encoder, including the base and large versions. For all experiments, we adopt the AdamW optimizer, keep a random seed of 42, set the max input length to 256, and set the mini-batch size to 8 with 4 gradient accumulation steps. For the base encoder, we choose an initial learning rate of 1e-5, while for the large encoder, we choose an initial learning rate of 5e-6. We train for 5 epochs, evaluate every 3750 steps, and choose the model parameters with the highest performance on the development set. For Gaussian Kernel Pooling, we keep 11 kernels with the same width of 0.01. However, the mean values of the kernels come from a uniform

C Detailed Baseline
Since DOCNLI is still a new task, we adopt the concatenation model from Yin et al. (2021), and modify the semantic match model from Zhong et al. (2020) for comparison.
• Concatenation. We concatenate the hypothesis and the premise documents into a sequence as input. Overlong samples will be truncated to the max input length. Then we adopt the hidden state of the special [CLS] token as the sample representation for binary classification.
• Semantic Match. We respectively encode the hypothesis and premise documents to obtain their own document-level vector representations. For documents exceeding the max input length, we split them into chunks with a sliding window. Then, inspired by Chen et al. (2017), we enhance the vector representations of these two for classification.
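
The two input-handling strategies used by the baselines, truncation and sliding-window chunking, can be sketched as follows. This is an illustrative sketch only: the function names and the `stride` parameter are assumptions, since the window overlap is not specified here, and tokens are represented as plain list items rather than real tokenizer output.

```python
from typing import List, Sequence


def truncate(tokens: Sequence[str], max_len: int) -> List[str]:
    """Keep only the first max_len tokens (concatenation baseline)."""
    return list(tokens[:max_len])


def sliding_window_chunks(
    tokens: Sequence[str], max_len: int, stride: int
) -> List[List[str]]:
    """Split tokens into windows of at most max_len tokens.

    Consecutive windows start `stride` tokens apart, so they overlap
    by max_len - stride tokens. `stride` is a hypothetical parameter.
    """
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("require 0 < stride <= max_len")
    if len(tokens) <= max_len:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the document
        start += stride
    return chunks


doc = [f"w{i}" for i in range(10)]
print(truncate(doc, 4))
print(sliding_window_chunks(doc, max_len=4, stride=2))
```

Truncation discards everything past the limit, whereas the sliding window keeps the whole document at the cost of encoding several chunks.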

D Performance on Development Set
The detailed model performance on the development set is shown in Table 5 and Figure 7. The results are similar to those on the test set. Our framework obtains the highest performance on the development set. Furthermore, except for a slight performance decrease on samples no longer than 150 words, it greatly and consistently outperforms the strongest concatenation baseline on samples of varying length, especially on longer samples. These results also show the strength of our R2F framework on the task.

E Influence of K Value
To investigate the influence of evidence retrieval, we also pay attention to the value of K, the hyperparameter that controls how many sentences are retained during retrieval. As shown in Table 6, the framework is quite sensitive to this value. Specifically, the highest performance is obtained with K as 5. With K as 3, 6 or 7, the framework also obtains better performance than the concatenation baseline. However, with K as 4, it shows similar performance to the baseline. As discussed in Section 2.2, K balances evidence precision and recall. Lower values pursue higher evidence precision, but may miss evidence, while higher values guarantee higher evidence recall, but may introduce too much noise. Besides, the choice of the value is also closely related to the data distribution. The high sensitivity to the value indicates that the evidence retrieval process needs further improvement, where a possible way is to utilize the document-level entailment label.
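
The precision and recall trade-off controlled by K can be illustrated with a toy example. The sketch below is not the retrieval component of the framework: the relevance scores and gold evidence indices are made up for illustration, and the function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring premise sentences, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def precision_recall(
    retrieved: Iterable[int], gold: Iterable[int]
) -> Tuple[float, float]:
    """Evidence precision and recall of a retrieved set against gold evidence."""
    retrieved_set, gold_set = set(retrieved), set(gold)
    hits = len(retrieved_set & gold_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy relevance scores for six premise sentences (hypothetical values).
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
gold_evidence = [0, 4]  # hypothetical gold evidence sentences

for k in (1, 3, 5):
    p, r = precision_recall(top_k_indices(scores, k), gold_evidence)
    print(f"K={k}: precision={p:.2f}, recall={r:.2f}")
```

In this toy case, a small K yields perfect precision but misses one evidence sentence, while a larger K recovers all evidence at the cost of retrieving more irrelevant sentences.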

F Related Work on Long Document Processing
For long document processing, two common methods are to truncate the input sequence to the maximum length, or to cut it into several independent segments with a sliding window. Dai et al. (2019) propose a segment-level recurrence mechanism for information interaction among segments. Beltagy et al. (2020) simplify the self-attention mecha-

Figure 1: A sample of DOCNLI dataset. For each sample, only the entailment label between the documents is annotated. For display, we mark each hypothesis sentence and its corresponding premise sentences (namely the evidences, not annotated in the original dataset) with the same number and color. The sample is annotated as not entailment due to the disinformation of "in 1989" in the third hypothesis sentence. The premise is partly omitted.

Figure 3: Annotation example for hypothesis sentence. The hypothesis sentence is annotated as entailment with two evidence groups.

Figure 4: Model performance with varying sample length on the test set. Concatenation baseline and R2F framework with RoBERTa base encoder are compared. The horizontal axis is the sample length in the number of words. The vertical axis is the performance.

Figure 6: Case study on the sample in Figure 1. For each hypothesis sentence, the elements in the bracket denote whether at least one complete evidence group is obtained (T for True, F for False) and the entailment label prediction result (E for entailment, NE for not entailment). Wrong predictions are in red. The number below is the credibility score. The Golden branch denotes the ground truth.

Figure 7: Model performance with varying sample length on the dev set. Concatenation baseline and R2F framework with RoBERTa base encoder are compared. The horizontal axis is the sample length in the number of words. The vertical axis is the performance.

, which is a newly proposed and the only large-scale dataset in the field. The detailed statistics are shown in Table 1. The training set has a balanced label distribution, while the development and test sets have quite unbalanced ones.

Table 1: Statistic information of DOCNLI dataset.

Table 3: DOCNLI performance with different evidence retrieval methods. * and # respectively denote the unsupervised and supervised versions, and the same below.

Table 5: Model performance on the dev set. ♣ and ♠ denote the original and reproduced results of the model from Yin et al. (2021). An additional symbol denotes the reproduced results of the model modified from Zhong et al. (2020).