LawngNLI: A Long-Premise Benchmark for In-Domain Generalization from Short to Long Contexts and for Implication-Based Retrieval

Natural language inference has trended toward studying contexts beyond the sentence level. An important application area is law: past cases often do not foretell how they apply to new situations and implications must be inferred. This paper introduces LawngNLI, constructed from U.S. legal opinions with automatic labels with high human-validated accuracy. Premises are long and multigranular. Experiments show two use cases. First, LawngNLI can benchmark for in-domain generalization from short to long contexts. It has remained unclear if large-scale long-premise NLI datasets actually need to be constructed: near-top performance on long premises could be achievable by fine-tuning using short premises. Without multigranularity, benchmarks cannot distinguish lack of fine-tuning on long premises versus domain shift between short and long datasets. In contrast, our long and short premises share the same examples and domain. Models fine-tuned using several past NLI datasets and/or our short premises fall short of top performance on our long premises. So for at least certain domains (such as ours), large-scale long-premise datasets are needed. Second, LawngNLI can benchmark for implication-based retrieval. Queries are entailed or contradicted by target documents, allowing users to move between arguments and evidence. Leading retrieval models perform reasonably zero shot on a LawngNLI-derived retrieval task. We compare different systems for re-ranking, including lexical overlap and cross-encoders fine-tuned using a modified LawngNLI or past NLI datasets. LawngNLI can train and test systems for implication-based case retrieval and argumentation.


Introduction
This work proposes a new natural language inference (NLI) benchmark, LawngNLI, constructed from U.S. legal opinions via the Caselaw Access Project (The President and Fellows of Harvard University, 2018) that have been largely cleaned of inline citations in order to read more naturally. It follows the general three-label NLI formulation: given a pair of texts (premise and hypothesis), the goal is to predict the label of whether the premise entails, is neutral toward, or contradicts the hypothesis. LawngNLI's premises are especially long and are multigranular. Its automatic labels derive from the dataset construction using (negation-based) contradiction and (similarity-based) neutralization algorithms. These labels exhibit an accuracy of 88.8% (94.7% for high-confidence human labels) on a subset with human-validated gold labels. Examples are derived from actual inferences from past cases that judges wrote to apply usefully to new situations, relying on their expertise while avoiding crowdsourced labeling.
We conduct two sets of experiments on LawngNLI. First, regarding in-domain generalization to long contexts, we compare top long-sequence and short-sequence NLI models on LawngNLI. Models are transferred with and without intermediate fine-tuning on existing NLI benchmarks. We find that absent fine-tuning directly, models fall substantially short of top performance on our long premises. This continues to hold true when we leverage LawngNLI's multigranularity to control for domain shift, by further fine-tuning models on our own short premises. Top performance on our long premises is achieved by short-sequence models prepended with a standard retrieval method (BM25 (Robertson and Zaragoza, 2009)) filtering across each premise. Absent their fine-tuning that uses long premises as inputs, however, these models would underperform on long premises at inference time. We provide evidence that this gap is likely robust to any deviations between the automatic and gold labels. Thus a large-scale long-premise dataset like LawngNLI is needed.
Second, regarding legal retrieval systems, we conduct a comparison of leading models by Recall@k for selecting target cases via implication/NLI-based retrieval (when a user provides arguments entailed or contradicted by target cases as queries), in our domain. Implication-based retrieval is an underexplored subtask (see, e.g., Schuster et al., 2022). Our zero-shot panel comprises Sentence-Transformers (Reimers and Gurevych, 2021) lightweight bi-encoders pretrained using short-sequence NLI, user web queries, and general semantic relatedness. While baseline models transfer reasonably to our domain zero shot, we further compare re-ranking with models fine-tuned using several previous NLI datasets along with an adjusted retrieval version of LawngNLI. Future improvements can put more evidence within the range of human users' cognitive reach beyond the top result, including for law, where they could help make legal work more affordable and equitably accessible (see Section 6).
Overall, our main contributions are: (1) A new NLI benchmark with multigranular premises much longer than in most existing NLI benchmarks across percentiles (see Table 2).
(2) A benchmarking of models' ability to generalize from short contexts to long contexts on the same domain and examples. It shows that (in our domain at minimum) short-premise NLI models fall substantially short of top performance on long-premise NLI at inference time, unless a large-scale long-premise dataset is created to fine-tune on.
(3) A benchmarking of leading retrieval models on case retrieval with entailed or contradicted arguments as queries, comparing lexical overlap and fine-tuning using a modified LawngNLI or previous NLI datasets. LawngNLI provides a benchmark for future implication-based case retrieval systems.

Significance for NLI and for Law
Regarding in-domain generalization to long contexts, our work stands within a fast-growing research area on how models can learn to reason over long text. Benchmarks for NLI, or Recognizing Textual Entailment (RTE), stretch back to Dagan et al. (2005). Recently, various "efficient" Transformer architectures have been proposed to address the obstacle of quadratic self-attention complexity in scaling to long sequences (Tay et al., 2020). LawngNLI's long premises frequently exceed the limits of such long-sequence models with efficient attention mechanisms, although on our long premises not exceeding those limits, the long-sequence models included here are outperformed by short-sequence models both with and without filtering of premise paragraphs by relevance to the hypotheses. Most existing NLI benchmarks, meanwhile, contain largely short premises. The recent ContractNLI (Koreeda and Manning, 2021) is an exception, containing premises with lengths similar to ours, though with a much smaller number of examples in a different domain (607 contracts as premises, each paired with a common set of 17 shared hypotheses).

Regarding legal retrieval systems, law is an important area where humans perform long-context NLI in practice. The core of legal advocacy is articulating what existing applicable law (cases, legislation, regulations, etc.) implies for a new situation. Practitioners must move between case text and the entailed and contradicted arguments that they aim to support or counter. Automatic systems are developed to help find relevant precedents, which requires filtering the millions of U.S. cases.

LawngNLI Dataset
We construct LawngNLI beginning with all citations with parentheticals in U.S. state and federal case opinions, via the Caselaw Access Project (The President and Fellows of Harvard University, 2018). Its hypotheses derive from the actual inferences from past cases applied by judges to decide between competing parties' positions, across a large cross-section of U.S. opinions. Thus we hope its examples capture the variety of reasoning underlying the useful arguments that legal practitioners aim to support or undermine during their daily work.

Sample twin Entail/Contradict examples with same premise from LawngNLI
Twin hypotheses with same premise, from "analysis" subset • Contradict: city acted affirmatively to create or increase risk of harm on city street by ignoring residents' requests to reduce speed limit or by taking down residents' signs indicating drivers should adhere to a lower speed limit • Entail: city did not act affirmatively to create or increase risk of harm on city street by ignoring residents' requests to reduce speed limit or by taking down residents' signs indicating drivers should adhere to a lower speed limit Additional hypotheses with same premise • Entail: failing to enforce or lower the speed limit on a residential street "did not create a 'special danger' to a discrete class of individuals.. [ed.: excerpted]..as opposed to a general traffic risk to pedestrians and other automobiles" • Contradict: traffic laws and enforcement practices did not pose "a general traffic risk to pedestrians and other automobiles" Relevant excerpts of shared premise • [ed.: Plaintiffs] ...submit that the City of Fort Thomas..violated their son's substantive due process rights by failing to act upon their request (and the requests of others) to lower the speed limit on the street..The police also removed signs posted by residents indicating that drivers should adhere to a 15 mile-per-hour speed limit.. : Plaintiffs] fail to satisfy any of the three requirements for establishing our circuit's "state-created danger" exception to DeShaney.First, the creation of a street and the management of traffic conditions on that street are too attenuated and indirect to count as an "affirmative act"..
Distractor excerpts of same premise • ...After all, the City was told about the risks of not lowering the speed limit to 15 miles per hour (more accidents); it intentionally chose not to heed this warning (taking on the risk of more accidents); and the alleged risk came to pass when..was killed (an accident).. • ...For in one sense, it could be said that all governing bodies act with deliberate indifference when they consider and reject a traffic-safety proposal of this sort that comes with known risks.. See Appendix Tables 9 and 10 for a detailed view.

Construction Steps
Key steps are outlined below. They are presented in Figure 3 and further detailed in Appendix Section A.1. When judges cite another case in an opinion, they may highlight content or takeaways from that case in a parenthetical, which we use as an initial hypothesis.
1. We begin with Entail examples: long premises are the majority opinion cited alongside the parenthetical, and short premises are the cited opinion pages (extracted using Eyecite (Cushman et al., 2021)). Filters screen for hypotheses that are not flagged as conflicting with premises and are valid inputs for our later contradiction algorithm (Arredondo, 2017).
2. The contradiction algorithm converts some hypotheses by adding or removing pivotal negation (building on a method from Bilu et al. (2015)). All examples are filtered for NLI difficulty. Then the neutralization algorithm re-pairs the hypotheses assigned Neutral with alternative similar premises not adjacent in the citation network.
3. The dataset is rebalanced on labels and pivotal negation, citation spans are removed from the premises so that they read more naturally, and long premises are prepended with copied paragraphs from the end (the minimum number including at least 512 tokens) to limit models from relying on cues near the start of the underlying opinion.
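The label-generation logic in steps 2 and 3 can be sketched as follows. This is a deliberately toy illustration, not the released pipeline: the function and field names are hypothetical, and the real contradiction algorithm (Appendix A.1.5) is far more careful than a single-word "not" toggle.

```python
# Toy sketch of the contradiction and neutralization steps described above.
# All names are illustrative, not from the LawngNLI release.

def contradict(hypothesis: str) -> str:
    """Toy negation-based contradiction: remove a pivotal 'not' if present,
    otherwise insert one (the real algorithm is more sophisticated)."""
    words = hypothesis.split()
    if "not" in words:
        words.remove("not")          # removing pivotal negation flips the label
    else:
        words.insert(1, "not")       # naive insertion after the first word
    return " ".join(words)

def neutralize(example, candidate_premises, is_adjacent):
    """Re-pair a hypothesis with a similar premise that is NOT adjacent
    in the citation network, yielding a Neutral example."""
    for premise in candidate_premises:
        if not is_adjacent(example["premise"], premise):
            return {"premise": premise,
                    "hypothesis": example["hypothesis"],
                    "label": "Neutral"}
    return None  # no safe re-pairing found; example would be dropped
```

The key design point is that Contradict examples keep their original premise while the hypothesis is minimally edited, whereas Neutral examples keep their hypothesis while the premise is swapped.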
Table 1 shows sample examples from our dataset. Descriptive statistics (Table 2) show that its long premises skew much longer than premises in other key existing NLI datasets.
Screened Amazon Mechanical Turk workers provided 300 gold labels for LawngNLI examples. Because our long premises are lengthy, there is a particular risk of partly random guessing by workers. Even small frequencies can erroneously multiply the estimated error rates of our labels. As such, for each example, two workers independently chose a label and a confidence level ("probably" or "definitely"). Gold labels were adopted on examples where both workers chose the same label (unanimity). Where the workers chose different labels, no label carries unanimous confidence and the example is not included. This process continued until 300 gold labels were obtained. Detailed steps are outlined in Appendix Section A.3. We find an 88.8% human-validated accuracy (94.7% for high-confidence labels, where both workers chose "definitely"). Table 4 shows human-assessed accuracies of our automatic labels.
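The unanimity rule above can be written as a small decision function. This is a sketch of the adoption logic as described, with hypothetical function and tuple names:

```python
def adopt_gold_label(annotations):
    """Decide whether a gold label is adopted from two workers' annotations.

    annotations: list of two (label, confidence) pairs, where confidence is
    "probably" or "definitely". Returns (gold_label, high_confidence), or
    None when the workers disagree and the example is excluded.
    """
    (label1, conf1), (label2, conf2) = annotations
    if label1 != label2:
        return None  # no unanimous label: example not included
    high_confidence = conf1 == "definitely" and conf2 == "definitely"
    return label1, high_confidence
```

A returned `high_confidence` flag of `True` corresponds to the subset on which the 94.7% accuracy figure is reported.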
To better understand where dataset errors arise, in Appendix Table 11 we present several examples where automatic and gold labels differ. In the first, the citation parser linked the wrong citation to the parenthetical underlying the hypothesis. So the example was incorrectly automatically labeled as Entail, while workers correctly labeled the example as Neutral. In the second example, the hypothesis is the conjunction of two possible conditions for a non-medical source to be given the weight of an acceptable medical source under the applicable regulations. Yet along with the first condition, the premise describes the opposite of the first condition. Thus the automatic label Contradict is arguably correct, while the worker label Entail may have arisen if both workers saw that the second condition is correct while misassessing the first condition. And in the third example, the premise and hypothesis were re-paired by our neutralization algorithm and so automatically labeled Neutral. However, despite safeguards such as excluding majority opinions from adjacent cases in the citation network as candidate premises, the workers recognized that the premise arguably contradicts the hypothesis, despite it deriving from a case not cited by the hypothesis's underlying parenthetical. Thus the automatic label is incorrect and the worker label is correct.

Experimental Evaluation
Our experiments demonstrate two applications of LawngNLI. See implementation details in Appendix Section A.2.

1. Generalization: First, we evaluate whether models perform competitively on LawngNLI's long premises, before and after fine-tuning on existing NLI benchmarks and/or its short premises (with LawngNLI's multigranularity, we can evaluate on long premises within the same domain).

RQ1: Can top NLI models approach top performance on LawngNLI with long premises, absent fine-tuning directly with our long premises?
RQ1A: Is our answer to RQ1 robust to any error rate in LawngNLI's automatic labels?

2. Retrieval: Second, we evaluate retrieval and NLI models on a retrieve-and-re-rank approach to NLI-based (entailed or contradicted arguments as queries) case retrieval.
RQ2: How do models compare on implication-based case retrieval?

In-Domain Generalization to Long Contexts
To compare generalization from short to long contexts, we build a panel of 28 NLI models as follows:

Table 4: Human assessment of a stratified random sample of LawngNLI's "analysis" subset (sequence length of long premise at most 4096). The split refers to pivotal negation. Provided accuracies are balanced (macro-averaged recall). High-confidence labels are those where the two workers who labeled a given example both clicked "definitely" (versus "probably") for the label.
1. We begin with 7 pretrained models that are top performing on key existing NLI benchmarks. HuggingFace (Wolf et al., 2020) model names with leaderboard positions on existing NLI benchmarks are listed in Appendix Section A.4.
Past NLI benchmarks have included artifacts spuriously correlated with their labels. To check for such artifacts in LawngNLI, we fine-tune and evaluate our models on hypotheses and premises only in Appendix Table 12 (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Yin et al., 2021). Our labels show some modest predictability above random from our hypotheses, at around 0.55 at the highest (in line with other NLI datasets); from our long premises, at 0.45 at the highest (slightly higher than another imperfectly symmetric benchmark, ConTRoL (Liu et al., 2021), at 38.56%); and from our short premises, at around 0.52 at the highest.
Especially for short-sequence models, some difficulty on long premises may be due to distraction by less relevant text. We address this by adding a version of each model with a filtering step prepended, selecting the 5 paragraphs with the highest BM25 (Robertson and Zaragoza, 2009) score across the premise when querying with the hypothesis. This filtering step allows our models to rival their performance on LawngNLI's short premises, when inputting only its unexcerpted long premises.
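The filtering step can be sketched in pure Python. This is an illustrative re-implementation of Okapi BM25 paragraph filtering, not the paper's actual code; the tokenizer, function names, and parameter defaults (`k1=1.5`, `b=0.75`) are assumptions:

```python
import math
from collections import Counter

def tokenize(text):
    # Minimal tokenizer for illustration only.
    return text.lower().replace(".", "").replace(",", "").split()

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each document (here, premise paragraph)
    for the query (here, the hypothesis)."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()                       # document frequency of each term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores

def filter_premise(paragraphs, hypothesis, top_k=5):
    """Keep the top_k paragraphs by BM25 score against the hypothesis,
    preserving their original order within the premise."""
    docs = [tokenize(p) for p in paragraphs]
    scores = bm25_scores(tokenize(hypothesis), docs)
    keep = sorted(sorted(range(len(docs)), key=lambda i: -scores[i])[:top_k])
    return [paragraphs[i] for i in keep]
```

The filtered paragraphs are then concatenated and passed to a short-sequence NLI model in place of the full long premise.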

RQ1: Are NLI Models Competitive on
LawngNLI's Long Premises, Absent Fine-Tuning with Our Long Premises?
Among our NLI models fine-tuned with existing benchmarks only and/or LawngNLI with short premises, the top model trails the top model fine-tuned with our long premises in accuracy by 5 percentage points (bolded in columns (4) and (6) among all 21 three-label models in Appendix Table 13; the top 6 models among these on both our short and long premises, plus all long-sequence models, are shortlisted in Table 5). Thus these models fall substantially short of top performance on long-premise NLI, unless a large-scale long-premise NLI dataset is constructed and used for fine-tuning. Models have more to learn from long contexts than short contexts alone can teach.
Generalizing from short to long premises, we control for much example structure by keeping the same domain and examples.However, judge excerpting could create some long versus short premise differences besides length, and it is possible that these differences are impacting our results.Such differences may be somewhat unavoidable when excerpting NLI premises that can cover many arguments.Relevant passages depend on the given hypothesis and so may be variably imbalanced relative to the long premise.In contrast to multilevel summaries, our motivation is comparing nested premises that read like premises naturally occurring in the domain at their respective lengths.
We would thus argue that differences arising from human excerpting need not confound drawing wider parallels to our results. Counterpart (if differently structured) differences between naturally occurring short contexts versus long contexts likely exist in various domains. These differences are arguably properly includable in our experiments, since they impact model generalization from short to long contexts in such other domains as well. Still, it remains to be studied whether large-scale long-context datasets are needed to perform competitively on other long-context NLP tasks and domains. (Appendix Tables 14 and 15 show minimal differences on examples with hypotheses with versus without pivotal negation, and slightly lower performance on examples with above-median length premises.)

RQ1A: Is RQ1 Robust to Any Error
Rate in LawngNLI's Automatic Labels?
On our human-validated subset, we check how the gap in balanced accuracies from RQ1 (i.e., between our top models fine-tuned using long premises versus those fine-tuned using short premises) is impacted by evaluating against our automatic labels, as compared to against our gold labels. While sampling variation leads to some divergence of this human-validated subset from the test set, this bias in balanced accuracies due to automatic labels (on our human-validated subset) provides a noisy estimate of the same bias for the test set. In Appendix Table 16, we can check this impact of the automatic labels on the performance gap between the top models fine-tuned using long premises (column 6) versus top models fine-tuned using short premises (column 4) from Table 5. For the single model (among these top 6) with the largest performance gap from Table 5, this gap is biased on the human-validated subset by the automatic labels by -0.066. Averaging across all top 6 models, this bias is -0.016. These non-positive biases provide evidence that the performance gap from RQ1 likewise likely does not arise from bias due to the automatic labels.
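The bias computation above can be made concrete. The following sketch (with hypothetical names, assuming three-label predictions as strings) computes balanced accuracy as macro-averaged recall and then the bias in the long-versus-short gap from evaluating against automatic rather than gold labels:

```python
def balanced_accuracy(preds, labels, classes=("Entail", "Neutral", "Contradict")):
    """Macro-averaged recall over the label classes present in `labels`."""
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(labels) if y == c]
        if idx:
            recalls.append(sum(preds[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def label_bias(preds_long, preds_short, auto_labels, gold_labels):
    """Bias in the (long-premise minus short-premise) performance gap that
    is introduced by evaluating against automatic instead of gold labels,
    on the human-validated subset."""
    gap_auto = (balanced_accuracy(preds_long, auto_labels)
                - balanced_accuracy(preds_short, auto_labels))
    gap_gold = (balanced_accuracy(preds_long, gold_labels)
                - balanced_accuracy(preds_short, gold_labels))
    return gap_auto - gap_gold
```

A negative return value corresponds to the non-positive biases reported above: the automatic labels, if anything, understate the gap measured against gold labels.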

Implication-Based Legal Retrieval
Major legal retrieval systems (including leading commercial systems, according to publicly available information) rely on signals from lexical keywords, semantic similarity, and question answering (Section 5). But to our knowledge, such systems do not utilize signals from legal NLI. In contrast, LawngNLI's hypotheses are derived from the actual implications from the premise case that judges write within new cases when applying the premise case as precedent.
Suppose a user wished to retrieve documents that support or refute an argument: can leading retrieval and NLI models perform well (zero shot or with fine-tuning) on retrieving the target document when given that argument as the query? We test this question by comparing models on implication-based retrieval in our domain.

RQ2: How Do Models Compare on
Implication-Based Case Retrieval?
We proceed in three steps: cheap retrieval, bi-encoder ranking, and cross-encoder re-ranking.
(1) Building our test set, using cheap retrieval across all candidate cases: Our retrieval test set comprises the LawngNLI test set's Entail and Contradict long premises (majority opinions, here before dataset processing) as positive examples, pooling each with 999 other majority opinions from the same state (or the federal level) selected by highest all-mpnet-base-v2 (Song et al., 2020) embedding dot-product similarity with the hypothesis as negative examples. An effective retrieval system would rank the premise case highly as correct, and we can compare models on Recall@k. So each non-Neutral LawngNLI test set hypothesis is a query paired with 1000 candidate documents. This step corresponds to the initial retrieval step in a standard retrieve-and-re-rank approach to the retrieval problem. We control for this first step by uniformly applying this method and bracket a search for improvements. This allows our comparison to focus on which models rank target cases most highly in steps 2 and 3, with the aspiration of rankings high enough to be within users' reach with minimal skimming beyond the top result.
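The pooling and evaluation logic of this step can be sketched as follows. The function names are hypothetical, and the embedding-based similarity is reduced to a plain dot product over precomputed vectors:

```python
def pool_candidates(query_emb, doc_embs, n=1000):
    """Rank candidate documents by dot-product similarity to the query
    embedding and keep the top n (the negative-pooling step above)."""
    sims = [sum(q * d for q, d in zip(query_emb, emb)) for emb in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: -sims[i])[:n]

def recall_at_k(ranked_lists, positives, k):
    """Fraction of queries whose positive (premise case) document id
    appears among the top k ranked candidates."""
    hits = sum(pos in ranked[:k] for ranked, pos in zip(ranked_lists, positives))
    return hits / len(positives)
```

In the actual setup, `query_emb` and `doc_embs` would come from all-mpnet-base-v2, each query's pool contains the one positive plus 999 pooled negatives, and models in steps 2 and 3 are compared by `recall_at_k` over all queries.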
(2) Zero-shot bi-encoder ranking of retrieved top 1000: Our zero-shot panel comprises a BM25 (Robertson and Zaragoza, 2009) baseline and four Sentence-Transformers (Reimers and Gurevych, 2021) lightweight bi-encoders: msmarco-distilroberta-base-v2, nli-distilroberta-base-v2, all-distilroberta-v1, and all-mpnet-base-v2 (Sanh et al., 2019; Liu et al., 2019; Bajaj et al., 2016; Song et al., 2020). The first three were chosen for comparable setups besides their training task, while the last is a top model across semantic search evaluations. They are respectively pretrained for actual user queries from the Bing search engine, NLI, and a combined dataset of over 1 billion pairs of related sentences covering many tasks. As in Section 4.1, we also evaluate each model version while prepending a module that filters each candidate document to the 5 paragraphs with the highest BM25 similarity to the query.
Appendix Table 17 shows that leading, lightweight bi-encoders and BM25 (Robertson and Zaragoza, 2009) can rank the target case reasonably well zero shot while processing a large number of candidate documents. Prepending BM25 filtering improves performance. Since Entail and Contradict hypotheses are twinned, it is not so surprising that performance is similar between those two subsets. Still, these retrieval models seem capable of recognizing long-context inference (as a subtype of relevance) close to equally between entailed and contradicted queries. While Recall@10 of the target premise case reaches about 0.3, Recall@100 still only reaches about 0.5 (versus an expected value of 0.1), showing that leading retrieval systems would miss important documents for legal retrieval zero shot.
Our model panel starts with albert-xxlarge-v2 (Lan et al., 2019) fine-tuned on each of our three included previous NLI datasets (ANLI (Nie et al., 2020), ConTRoL (Liu et al., 2021), and DocNLI (Yin et al., 2021)). We also add cross-encoder/ms-marco-MiniLM-L-6-v2 (Wang et al., 2020; Reimers and Gurevych, 2021; Bajaj et al., 2016) as an additional zero-shot baseline. Among our dense baselines, fine-tuning on ANLI provides the top re-ranking performance. We then fine-tune the ANLI model on an adjusted retrieval version of LawngNLI. We use the same setup as for our NLI intermediate fine-tuning on ANLI, except with a learning rate of 1e-6.

Table 5: Accuracy on LawngNLI's "analysis" subset with long premises: Top 6 three-label models on both short and long premises plus long-sequence models (for all 28 models, see Appendix Table 13). The error provided is the larger deviation of the Clopper-Pearson (Clopper and Pearson, 1934) exact binomial 95% confidence bounds.
All p-values round to zero (<0.0005) from an exact binomial McNemar's (McNemar, 1947) test for a statistically significant difference in accuracies between each version fine-tuning using short premises as inputs (1-4) and the best version fine-tuning using long premises as inputs (6). For (2), 512 tokens is the overall sequence limit.

Table 6: Recall@k of model panel, when re-ranking the bi-encoder top 100 (ranked by all-distilroberta-v1 (Sanh et al., 2019; Liu et al., 2019) prepended with BM25 (Robertson and Zaragoza, 2009) filtering from Appendix Table 17) for implication-based retrieval. The error provided is the larger of the two deviations of the Clopper-Pearson (Clopper and Pearson, 1934) exact binomial 95% confidence bounds from the point estimate.
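The exact binomial McNemar's test used for Table 5 compares two models via the examples on which exactly one of them is correct. A minimal sketch (hypothetical function name; the two-sided p-value is the doubled smaller binomial tail under p=0.5, clamped at 1):

```python
import math

def mcnemar_exact(n01, n10):
    """Exact binomial McNemar's test.

    n01: count of examples only model A gets right.
    n10: count of examples only model B gets right.
    Returns the two-sided p-value under Binomial(n01 + n10, 0.5).
    """
    n = n01 + n10
    k = min(n01, n10)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With, say, 0 versus 10 discordant examples, the p-value is 2/1024, i.e. well below 0.0005 only once the discordant counts are larger; the near-zero p-values in Table 5 reflect many more discordant examples in the full test set.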
The fine-tuned cross-encoders' re-ranking diverges from BM25 (Robertson and Zaragoza, 2009) on Recall@1 and Recall@10 (compare Appendix Table 17). However, fine-tuning on our adjusted version of LawngNLI draws re-ranking performance toward BM25. Refining our approach to fine-tuning using LawngNLI may yield future improvements.
Related Work

Prior work (e.g., Zheng et al., 2021) constructs datasets for a distinct task of predicting holdings from other cases that support the arguments in the nearby context in the citing case. These holdings exhibit an argument support relation with respect to their surrounding context, as opposed to necessarily any NLI relation. Recently, Shen et al. (2022) introduced a multi-document summarization dataset for documents around civil rights cases, with multiple granularities of summaries. The legal tasks closest to ours are from the annual COLIEE workshop. However, these tasks do not fully map to three-label NLI. For, e.g., the relevant 2021 Tasks 2 and 4, their training corpora (in the hundreds of examples) are ballpark 1000 times smaller than usual single-sentence benchmarks, making supervised learning alone insufficient for reliably training models to generalize (Hudzina et al., 2020; Rabelo et al., 2021; Kim et al., 2021; Schilder et al., 2021).

Conclusion and Future Work
This work presents LawngNLI, a new NLI benchmark with multigranular long premises, each containing a shorter version. Experiments demonstrate some use cases. First, we show that leading NLI models fall substantially short of competitive performance when generalizing to LawngNLI with its long premises, even after fine-tuning using existing NLI benchmarks and/or LawngNLI with short premises (with the same domain and examples as the long-premise evaluation). Unconfounded by domain shift, these results show the need for a large-scale long-premise dataset like ours at fine-tuning time.
Second, we show that leading lightweight retrieval models can reasonably handle implication-based retrieval on LawngNLI zero shot, with both entailed and contradicted arguments as queries. We then compare re-ranking by lexical overlap and models fine-tuned using a modified LawngNLI or several previous NLI datasets. Multiple other aspects of LawngNLI are left for future study.

Limitations
LawngNLI contains automatic labels, derived from the construction process. Its Entail labels are effectively annotated by judges, who wrote Entail hypotheses as parentheticals asserted by the cited premise. Neutral and Contradict examples are derived from Entail examples by, respectively, re-pairing with a different non-adjacent premise in the citation network and by adding or removing pivotal negation (Appendix Section A.1). These steps could introduce some error rate, which we validate by human assessment (Section 3.2). And using a human-validated subset, Section 4.1.2 evinces that our conclusions from our generalization experiment likely do not arise from such differences between automatic and gold labels.
Both experiments test standard approaches when applied to distinct challenges: first, short-premise models on long-premise NLI and, second, semantic search models on implication-based retrieval.For the experiment on generalization to long contexts, we demonstrate that these standard approaches do not always suffice, but only in one (albeit important) counterexample domain: law.We have not established if these shortcomings extend to other domains more broadly.

Ethics Statement
Considerations for general NLI have been explored elsewhere (e.g., for gender bias by Sharma et al. (2021)). We discuss some considerations for the legal aspect. On the benefit side, NLI is a principal cognitive task in law, so progress here also stands to benefit the legal community: building court cases and advising clients essentially is arguing for and against different natural language inferences from legal texts and facts. Implications may not be directly stated in the text or annotations (e.g., those at a different level of specificity or requiring compositional reasoning). Instead, holdings and rules inferable from case text must be extracted through costly human annotation and curation. All around the legal system, the pay grade and spare bandwidth of legal counsel is frequently starkly imbalanced between parties with adversarial interests: whether people in the courtroom or settlement conference, consumers or companies in a negotiation boardroom, or in everyday society where behavior is shaped by prospects of legal action. Anything that makes legal research and thus legal counsel cheaper, including more lightweight or task-tailored case retrieval systems, can contribute toward fairer access to legal representation and justice regardless of financial means. Models that perform well on LawngNLI's retrieval setup could crosswalk between cases as premises and implications as hypotheses, performing implication-based retrieval automatically. We describe the state of legal retrieval systems (including limited public information about leading proprietary commercial algorithms) relative to an NLI-based approach in Section 5. Legal services overall comprise about 1.3% of U.S. GDP. The legal research industry's annual revenue meanwhile is in the multiple billions of dollars.
And the full societal cost of suboptimal case retrieval should include the time and resources expended by human legal researchers in the loop (paralegals and lawyers) in unnecessary iterating with any suboptimal retrieval systems. Indeed, junior lawyers (less than 10 years of experience) spend almost a third (28%) of their working time on case research (Poje, 2014).

On the risk side, while prospective human reliance for decision making on erroneous model predictions is an ever-present consideration in NLP, we do not view this as a practical risk for LawngNLI. Everyday people can turn to numerous simple articles online summarizing the law, without digging into complex case retrieval and jurisprudence. And regarding advising others, lawyers bound by professional duties are exclusively authorized to practice law in the U.S. and around the world. Nothing can even be done just knowing the most relevant cases or implications; they must be synthesized by human judgment into an argument sound enough to pass the muster of judges and juries. In other words, legal NLI models are in no way lawyers. Instead, they can work as screening tools for practitioners, who then must apply their own judgment to make the results useful. In this way, legal NLI models could help save the resources of lawyers and clients and help improve the quality of legal representation.

Acknowledgments

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the BETTER Program. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, the Department of Defense, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. This research is supported by a Focused Award from Google.

parenthetical and included as an example. In this paper, we only include examples from citations including a resolvable pincite (e.g., one that does not contain letters).
The short version of the premise consists of the resolvable cited pages within the cited case's majority opinion, while the long version of the premise consists of the cited case's full majority opinion.

A.1.2 Initial Filters
Examples are dropped or modified by simple "accuracy" filters. First, as an overbroad criterion to exclude examples whose (converted or unconverted) original Entail hypothesis was a parenthetical in a case that was later overturned, we drop all examples with hypotheses from cases where a later case shares the same party names in the same or reverse order.
Fourth, verbs ending in "-ing" followed by "that" at the beginning of remaining hypotheses often take a supporting stance toward the subsequent subordinate clause; to adapt such hypotheses to read more like standalone sentences, we remove the initial verb and the subsequent "that".
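This trimming step can be sketched as a single regular-expression pass. The paper does not specify the implementation (it may rely on a parser rather than a regex), so the helper below is only an illustrative approximation, with a name of our own choosing:

```python
import re

def strip_leading_gerund_that(hypothesis: str) -> str:
    """If the hypothesis opens with an '-ing' verb followed by 'that'
    (e.g., 'holding that ...', 'noting that ...'), drop both words so the
    remainder reads as a standalone sentence."""
    return re.sub(r"^\w+ing\s+that\s+", "", hypothesis, count=1)

strip_leading_gerund_that("holding that the statute applies")
# e.g., yields "the statute applies"
```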

A.1.3 Identifying (Pivotal) Negation in Hypotheses
Next, the Entail examples are automatically labeled by whether their hypotheses contain (pivotal) negation or not, depending on whether the contradiction algorithm described in Appendix Section A.1.5 removes or adds negation, respectively. Pairs with hypotheses rejected for processing by our contradiction algorithm are dropped from the dataset. Since the absence versus presence of such negation in the hypothesis results in contradictory truth values (and thus also flips the NLI label between Entail and Contradict), such negation can be called "pivotal." Negation is defined this way throughout the paper, except in Table 2 when comparing to other datasets, since our contradiction algorithm might exhibit a different error rate on those datasets and confound the comparison. For this reason, greater than 50% of LawngNLI's hypotheses contain negation in Table 2, even though the dataset is constructed to contain 50% (pivotal-)negation hypotheses.

A.1.4 NLI Label Split
Within examples from cases from each state (or federal jurisdiction), and within pivotal negation or not, Entail examples are randomly assigned to be 1/3 kept as Entail, 1/3 converted to Neutral, and 1/3 converted to Contradict.

A.1.5 Converting Entail Examples to Contradict Examples: Contradiction Algorithm
For examples labeled Contradict in Appendix Section A.1.4, we use our contradiction algorithm to add or remove pivotal negation from the hypothesis, toward aligning the NLI relation with the label.
Our contradiction algorithm builds on the negation algorithm outlined in Section 4.2 of Bilu et al. (2015), which in their paper was annotated by majority vote to have generated an opposing claim with probability 0.79. The algorithm chooses a random sentence for adding or removing negation and leaves the others unchanged. It finds a non-compound independent clause within the chosen sentence and then makes the first applicable change in the list below. If none of the changes' conditions apply, the hypothesis is rejected for processing by the algorithm. This includes rejecting hypotheses consisting of verb phrases not nested within independent clauses; since these are rarely found in negated form in the original dataset, including them would leave an artifact of the contradiction algorithm. So for these hypotheses, we prioritize balance across labels over coverage of candidate examples.
1. If there are any contradictable indefinite pronouns in the first highest-level noun phrase, the first one is changed to a contradictory pronoun (e.g., "some" to "none" or "neither" to "either").
2. If there are any verb phrases, the first highest-level verb phrase is contradicted using a modified version (e.g., also reversing negation by removing "do"/"does"/"did" + "not") of the negation algorithm from Bilu et al. (2015) mentioned above.
3. If there are any adjective phrases, the first "no"/"not"/"never" is removed from, or else a "not" is added to, the first highest-level adjective phrase or past participle.
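The cascade above can be sketched as follows. The actual algorithm operates on constituency parses (noun, verb, and adjective phrases), so the token-level heuristics below are only illustrative, and the dictionaries are small hypothetical samples:

```python
# Hypothetical sketch of the three-rule cascade on whitespace tokens; the
# real algorithm works over parse trees, so this is only an approximation.
PRONOUN_SWAPS = {"some": "none", "none": "some", "neither": "either", "either": "neither"}
AUXILIARIES = {"is", "was", "are", "were", "can", "does", "did", "do"}

def contradict(tokens):
    # Rule 1 (simplified): swap the first contradictable indefinite pronoun.
    for i, tok in enumerate(tokens):
        if tok.lower() in PRONOUN_SWAPS:
            return tokens[:i] + [PRONOUN_SWAPS[tok.lower()]] + tokens[i + 1:]
    # Rule 2a: if the clause is already negated, remove the negation
    # (cf. reversing "do"/"does"/"did" + "not" in the paper's version).
    lowered = [t.lower() for t in tokens]
    if "not" in lowered:
        i = lowered.index("not")
        return tokens[:i] + tokens[i + 1:]
    # Rule 2b: otherwise insert "not" after the first auxiliary verb.
    for i, tok in enumerate(tokens):
        if tok.lower() in AUXILIARIES:
            return tokens[:i + 1] + ["not"] + tokens[i + 1:]
    # Rule 3 (adjective phrases) is omitted here; no rule applied -> reject.
    return None
```

Rejection (returning `None`) corresponds to dropping the hypothesis from the dataset, as described above.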

A.1.6 Filtering
Now we apply simple "difficulty" filters: examples whose hypotheses contain quotation marks, have fewer than four words, or have at least 50% bigram overlap with their premise are dropped.
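These difficulty filters can be sketched as below. Tokenization and normalization details are not specified in the paper, so simple whitespace splitting and lowercasing are assumed here:

```python
def bigrams(text):
    """Bigram set over lowercased whitespace tokens (an assumed tokenization)."""
    toks = text.lower().split()
    return {(a, b) for a, b in zip(toks, toks[1:])}

def keep_example(premise, hypothesis, max_overlap=0.5, min_words=4):
    """Difficulty-filter sketch: drop hypotheses with quotation marks,
    fewer than four words, or >= 50% bigram overlap with the premise."""
    if '"' in hypothesis or len(hypothesis.split()) < min_words:
        return False
    hyp_bigrams = bigrams(hypothesis)
    if not hyp_bigrams:
        return False
    overlap = len(hyp_bigrams & bigrams(premise)) / len(hyp_bigrams)
    return overlap < max_overlap
```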

A.1.7 Converting Entail Examples to Neutral Examples: Neutralization Algorithm
For examples labeled Neutral in Appendix Section A.1.4, we use our neutralization algorithm to match the hypothesis with a different premise, toward aligning the NLI relation with the label. To balance attrition, the neutralization algorithm is applied to all examples regardless of NLI label, but only the hypotheses from Neutral examples are actually re-paired with the assigned premise.
The candidates for matching with each hypothesis are the premises from all examples that are from cases in the same state as the original premise (or from a federal case if the original premise is from a federal case). Excluded from candidacy are premises from cases citing or cited by the case containing the original hypothesis.
A hypothesis is paired with a candidate premise as follows. The short version of the premise is used for this step.
Second, candidate premises with which a hypothesis has at least 50% bigram overlap are dropped. This step preserves, through the re-pairing of Neutral examples, the filter applied earlier to all examples.
Finally, Neutral hypotheses only are paired with the remaining candidate premise with the highest BM25 (Robertson and Zaragoza, 2009) score, computed via Gensim 3.8.3 (Rehurek and Sojka, 2010). For hypotheses of all labels, if no candidate premises remain, the example is dropped.
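The BM25 matching step can be sketched in plain Python. The paper uses Gensim 3.8.3, whose exact scoring parameters may differ; the version below follows the standard Okapi BM25 formulation (Robertson and Zaragoza, 2009) for illustration:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 scores of one tokenized query against a tokenized corpus.
    k1 and b are standard defaults, not necessarily Gensim's."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    df = Counter(t for d in corpus_tokens for t in set(d))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores
```

A Neutral hypothesis would then be re-paired with the surviving candidate premise at `max(range(len(candidates)), key=lambda i: scores[i])`.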

A.1.8 Balancing
We split the dataset into "analysis"/non-"analysis" subsets by the inclusion criterion for this paper's experimental evaluation (Section 4): whether the sequence length of an example's long premise is at most 4096 tokens, via a RoBERTa (Liu et al., 2019) tokenizer.
Within each of the "analysis"/non-"analysis" subsets, the dataset is then downsampled by randomly sampling each of the three label-plus-negation groups closed under the contradiction operation (Entail+negation plus Contradict+non-negation; Contradict+negation plus Entail+non-negation; Neutral+negation plus Neutral+non-negation) down to the minimum of their example counts. A 90/5/5 train/val/test split is stratified by the "analysis"/non-"analysis" subsets and these groups.
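The group-balanced downsampling above can be sketched as follows, for one subset at a time. The dictionary field names (`label`, `negated`) are illustrative, not the dataset's actual schema:

```python
import random
from collections import defaultdict

def group_id(ex):
    # The three label-plus-negation groups closed under the contradiction
    # operation (field names here are our own, for illustration).
    if ex["label"] == "Neutral":
        return "Neutral (either negation)"
    if (ex["label"] == "Entail") == ex["negated"]:
        return "Entail+negation / Contradict+non-negation"
    return "Contradict+negation / Entail+non-negation"

def downsample(examples, seed=0):
    """Randomly sample every group down to the smallest group's size."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for ex in examples:
        groups[group_id(ex)].append(ex)
    n = min(len(g) for g in groups.values())
    return [ex for g in groups.values() for ex in rng.sample(g, n)]
```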
Each example is then complemented with its contradictory twin: the same premise paired with the hypothesis modified by adding or removing pivotal negation (i.e., by applying the contradiction algorithm). Neutral labels are unchanged from the original example, while Entail and Contradict labels are flipped. This twinning balances the dataset, within the "analysis"/non-"analysis" subsets, on NLI label by pivotal negation versus not.
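The twinning step can be sketched as below. Here `contradict_fn` stands in for the contradiction algorithm, and the field names are illustrative:

```python
FLIP = {"Entail": "Contradict", "Contradict": "Entail", "Neutral": "Neutral"}

def make_twin(example, contradict_fn):
    """Sketch of twinning: the same premise paired with the hypothesis after
    adding/removing pivotal negation, with Entail <-> Contradict flipped and
    Neutral unchanged."""
    return {
        "premise": example["premise"],
        "hypothesis": contradict_fn(example["hypothesis"]),
        "label": FLIP[example["label"]],
        "negated": not example["negated"],
    }
```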

A.1.9 Citation Removal Algorithm and Prepending
Our algorithm here attempts to remove as many inline citations from premises as it can, so that the premises read more like customary English-language texts. The processed premises are the ones studied in this paper, but the dataset obtainable from the code to be released will also include the premises from before this processing, for future study. Finally, we copy and prepend at the beginning of the long premises the minimum number of paragraphs from the end that contain 512 tokens, to limit models from relying on cues for the NLI label near the start.

…shows retrieval using dot-product similarity on this model's embeddings to perform best among several models on TREC-DL 2019 (Craswell et al., 2020) and the MS MARCO Passage Retrieval dataset (Bajaj et al., 2016).
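The prepending step can be sketched as below. `token_len` stands in for a tokenizer-specific length function (the paper's token counts come from a RoBERTa tokenizer), which is assumed here:

```python
def prepend_tail_paragraphs(paragraphs, token_len, budget=512):
    """Copy the minimum number of final paragraphs that together contain at
    least `budget` tokens and prepend them, so cues for the NLI label are
    not concentrated near the start of the long premise."""
    tail, total = [], 0
    for para in reversed(paragraphs):
        tail.insert(0, para)  # keep the tail paragraphs in original order
        total += token_len(para)
        if total >= budget:
            break
    return tail + paragraphs
```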

A.2 Implementation Details
External code is from GitHub repositories, with repository forking permitted under GitHub's contemporaneous Terms of Service. External models are from HuggingFace Transformers (Wolf et al., 2020), with use contemporaneously permitted. In particular, if necessary to ensure this compliance, we will share code for constructing our datasets rather than the datasets themselves. NVIDIA 12GB TITAN Xp, 11GB GeForce GTX 1080 Ti, 11GB GeForce RTX 2080 Ti, 24GB TITAN RTX, and 48GB RTX A6000 GPUs were used for all fine-tuning.

A.2.1 Evaluation 1
For our intermediate fine-tuning, we adapt the code and largely follow the respective model hyperparameters and fine-tuning settings of the three existing NLI benchmarks. The settings that we modify rather than follow are: attention gradient checkpointing, GPU setup (while not changing accumulated batch size), and maximum sequence length (given our long sequence lengths, we also train for 3 epochs instead of 5 on DocNLI (Yin et al., 2021)). Maximum sequence lengths for intermediate fine-tuning are the lesser of the model maximum and 2048 (except for a maximum sequence length of 156 for pretrained short-sequence models fine-tuned on ANLI, consistent with Nie et al. (2020)).
After intermediate fine-tuning, the long-sequence models' maximum sequence lengths are increased to 4096 for further fine-tuning on LawngNLI. We adapt the code from Xiong et al. (2021). We adapted this code to allow compatibility with their suite of efficient Transformers, but ultimately we did not pretrain them and did not further explore including them, after several (initialized with copied RoBERTa-base (Liu et al., 2019) embeddings) did not rise far above random accuracy when fine-tuned on LawngNLI under some initial hyperparameters explored. This reflects little on these models, since we did not pretrain them.
For fine-tuning on LawngNLI in our NLI experiments, we use a batch size of 32, a learning rate of 1e-5, 4 epochs (2 epochs for DocNLI; see next paragraph), a learning rate schedule adapted from Xiong et al. (2021) (Adam optimizer with β1 = 0.9, β2 = 0.999, L2 weight decay of 0.01, warm-up over the first 10,000 steps, and linear decay), and half precision. We explored hyperparameters among those explored by RoBERTa (Liu et al., 2019) for GLUE (Wang et al., 2018), along with batch size 128, so that all of our models in Appendix Section A.4 would start to converge during fine-tuning from their initial losses and accuracies.
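The learning rate schedule (linear warm-up then linear decay, adapted from Xiong et al. (2021)) can be written out as a simple function; the `total_steps` value below is illustrative, not the paper's actual step count:

```python
def lr_at_step(step, peak_lr=1e-5, warmup_steps=10_000, total_steps=100_000):
    """Linear warm-up to peak_lr over warmup_steps, then linear decay to 0
    at total_steps (total_steps here is an illustrative value)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - frac)
```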
To transfer learning from two-label DocNLI, the models intermediate-fine-tuned on DocNLI are further fine-tuned and evaluated on a two-label version of LawngNLI (where the Entail examples are duplicated and then the (Entail, Neutral, Contradict) labels are mapped to (Entail, Not Entail)). This construction balances the two-label version between Entail and Not Entail. For further fine-tuning these models on LawngNLI, the number of epochs is then halved. This is equivalent to splitting the Neutral and Contradict examples (now labeled Not Entail) in the original three-label dataset in half across pairs of consecutive original epochs (1 and 2, 3 and 4, and so on), so that the fine-tuning example count is 2/3 of the original dataset size times the original number of epochs, except that example shuffling also pools examples between these consecutive original epochs.
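The two-label conversion can be sketched as below (field names illustrative). Duplicating Entail examples balances the converted dataset, since the original is balanced across the three labels:

```python
def to_two_label(examples):
    """Sketch: duplicate Entail examples, then map
    (Entail, Neutral, Contradict) -> (Entail, Not Entail)."""
    out = []
    for ex in examples:
        if ex["label"] == "Entail":
            out.append({**ex, "label": "Entail"})
            out.append({**ex, "label": "Entail"})  # duplicate to rebalance
        else:
            out.append({**ex, "label": "Not Entail"})
    return out
```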

A.3 Procedure for Human Assessment
Human assessment was limited to Amazon Mechanical Turk Master Workers based in the U.S.
Assessed accuracy of examples with long premises is lower than for those with short premises, even though the former arguably should have higher accuracy against the ground truth: long premises are a superset of the information in the short premise, providing additional context while being written to be internally consistent. It may be, then, that the human-assessed error rates for the automatic labels are themselves imperfect against the ground truth, especially for more difficult examples.
Human assessment proceeded as follows:
• Examples were each reviewed by two workers in batches of 28 examples, drawn from a first and then a second set of 504 examples with sequence length at most 4096. Each set consists of a stratified random sample of test examples. The stratification is as follows: first, balance over the Cartesian product of the automatic label and pivotal negation versus not; then half using the short premise and half using the long premise.
• Workers provided NLI labels for batches effectively without a time limit (batches were due 1 week after assignment). Batches were issued until there were 300 non-screening examples whose two worker labels were in agreement. The accuracy of these examples' automatic labels was then evaluated against those agreed labels (as gold).
• Workers were advised that they were providing NLI labels to be used in an academic analysis evaluating a new dataset.
• A co-author provided NLI labels for a predetermined random sample of "screening" examples, which did not enter the dataset with human-assessed gold labels or the error rate calculations and so do not directly impact them. They were scattered throughout and not separately identified to workers. Performance formulae using only the screening examples were used to calculate worker bonuses and to exclude (ultimately two) workers who appeared to be guessing frequently.
• Workers were paid above the U.S. federal minimum wage based on "reasonable" (as opposed to actual) time spent: 2 hours per batch, though workers may have spent more or less time on any batch, up to 1 week. In addition, a performance bonus was provided for each label deemed correct on a screening example.
A.5 Appendix Tables

Table 7: Interface for the main NLI task.
Table 8: Interface for the pre-screen task. An illustrative example included in the interface is likewise omitted here. Note that some earlier workers saw earlier versions.

Sample twin Entail/Contradict examples with same premise from LawngNLI
Twin hypotheses with same premise, from "analysis" subset
• Contradict: city acted affirmatively to create or increase risk of harm on city street by ignoring residents' requests to reduce speed limit or by taking down residents' signs indicating drivers should adhere to a lower speed limit
• Entail: city did not act affirmatively to create or increase risk of harm on city street by ignoring residents' requests to reduce speed limit or by taking down residents' signs indicating drivers should adhere to a lower speed limit
Some additional hypotheses with same premise
• Entail: failing to enforce or lower the speed limit on a residential street "did not create a 'special danger' to a discrete class of individuals..[ed.: excerpted]..as opposed to a general traffic risk to pedestrians and other automobiles"
• Contradict: traffic laws and enforcement practices did not pose "a general traffic risk to pedestrians and other automobiles"
Relevant excerpts of shared premise
• [ed.: Plaintiffs] ...submit that the City of Fort Thomas..violated their son's substantive due process rights by failing to act upon their request (and the requests of others) to lower the speed limit on the street..The police also removed signs posted by residents indicating that drivers should adhere to a 15 mile-per-hour speed limit..
• [ed.: Plaintiffs] ...alleged that the City's failure to maintain safe conditions on Garrison Avenue violated their son's substantive due process rights..established a "state-created danger" under DeShaney..
• ...DeShaney's holding..precludes [ed.: Plaintiffs'] argument that the Due Process Clause constitutionalizes a locality's choices about what speed limit to adopt for a given street or how to enforce that speed limit..
• There are two exceptions to the DeShaney rule..Under the second exception..a plaintiff may bring a substantive due process claim by establishing (1) an affirmative act by the State that either created or increased the risk that the plaintiff would be exposed to private acts of violence..
• [ed.: Plaintiffs] fail to satisfy any of the three requirements for establishing our circuit's "state-created danger" exception to DeShaney. First, the creation of a street and the management of traffic conditions on that street are too attenuated and indirect to count as an "affirmative act"..
Distractor excerpts of same premise
• ...After all, the City was told about the risks of not lowering the speed limit to 15 miles per hour (more accidents); it intentionally chose not to heed this warning (taking on the risk of more accidents); and the alleged risk came to pass when..was killed (an accident)..
• ...For in one sense, it could be said that all governing bodies act with deliberate indifference when they consider and reject a traffic-safety proposal of this sort that comes with known risks..

Table 9: Sample twin Entail/Contradict examples with same premise from LawngNLI, also in the "analysis" subset analyzed in our experiments (Section 4): sequence length of long premise at most 4096. Each hypothesis pairs with the excerpted premise in a separate example. For the specific "additional hypotheses" above, the examples containing them are in unfiltered-LawngNLI2 (see GitHub link in first footnote) but not in LawngNLI, the core dataset studied in this paper. See also Table 1.

Sample twin Neutral examples with same premise from LawngNLI
Twin hypotheses with same premise, from "analysis" subset
• Neutral: a parade permit requirement did not violate the First Amendment
• Neutral: a parade permit requirement violated the First Amendment
Distractor excerpts of same premise
• ...Section 13k prohibits two distinct activities: it is unlawful either "to parade, stand, or move in processions or assemblages in the Supreme Court Building or grounds,"..
• ...we shall address only whether the proscriptions of 13k are constitutional as applied to the public sidewalks..

Table 1: Sample twin Entail/Contradict examples from LawngNLI. See Appendix Tables

Table 3: Major steps in the LawngNLI dataset construction process. Steps are detailed in Appendix Section A.1.

Table 10: Sample twin Neutral examples from LawngNLI, also in the "analysis" subset analyzed in our experiments (Section 4): sequence length of long premise at most 4096. Each hypothesis pairs with the excerpted premise in a separate example. See also Table 1.