Factoring Statutory Reasoning as Language Understanding Challenges

Statutory reasoning is the task of determining whether a legal statute, stated in natural language, applies to the text description of a case. Prior work introduced a resource that approached statutory reasoning as a monolithic textual entailment problem, with neural baselines performing nearly at-chance. To address this challenge, we decompose statutory reasoning into four types of language-understanding challenge problems, through the introduction of concepts and structure found in Prolog programs. Augmenting an existing benchmark, we provide annotations for the four tasks, and baselines for three of them. Models for statutory reasoning are shown to benefit from the additional structure, improving on prior baselines. Further, the decomposition into subtasks facilitates finer-grained model diagnostics and clearer incremental progress.


Introduction
As more data becomes available, Natural Language Processing (NLP) techniques are increasingly being applied to the legal domain, including for the prediction of case outcomes (Xiao et al., 2018; Vacek et al., 2019; Chalkidis et al., 2019a). In the US, cases are decided based on previous case outcomes, but also on the legal statutes compiled in the US code. For our purposes, a case is a set of facts described in natural language, as in Figure 1, in blue. The US code is a set of documents called statutes, themselves decomposed into subsections. Taken together, subsections can be viewed as a body of interdependent rules specified in natural language, prescribing how case outcomes are to be determined. Statutory reasoning is the task of determining whether a given subsection of a statute applies to a given case, where both are expressed in natural language. Subsections are implicitly framed as predicates, which may be true or false of a given case. Holzenberger et al. (2020) introduced SARA, a benchmark for the task of statutory reasoning, as well as two different approaches to solving this problem. First, a manually-crafted symbolic reasoner based on Prolog is shown to perfectly solve the task, at the expense of experts writing the Prolog code and translating the natural language case descriptions into Prolog-understandable facts. The second approach is based on statistical machine learning models. While these models can be induced computationally, they perform poorly because the complexity of the task far surpasses the amount of training data available.
We posit that statutory reasoning as presented to statistical models is underspecified, in that it was cast as Recognizing Textual Entailment (Dagan et al., 2005) and linear regression. Taking inspiration from the structure of Prolog programs, we re-frame statutory reasoning as a sequence of four tasks, prompting us to introduce a novel extension of the SARA dataset (Section 2), referred to as SARA v2. Beyond improving the model's performance, as shown in Section 3, the additional structure makes it more interpretable, and so more suitable for practical applications. We put our results in perspective in Section 4 and review related work in Section 5.

SARA v2
The symbolic solver requires experts to translate the statutes and each new case's description into Prolog. In contrast, a machine learning-based model has the potential to generalize to unseen cases and to changing legislation, a significant advantage for practical applications. In the following, we argue that legal statutes share features with the symbolic solver's first-order logic. We formalize this connection in a series of four challenge tasks, described in this section, and depicted in Figure 1. We hope they provide structure to the problem, and a more efficient inductive bias for machine learning algorithms. The annotations mentioned throughout the remainder of this section were developed by the authors, entirely by hand, with regular guidance from a legal scholar¹. Examples for each task are given in Appendix A. Statistics are shown in Figure 2 and further detailed in Appendix B.
Argument identification This first task, in conjunction with the second, aims to identify the arguments of the predicate that a given subsection represents. Some terms in a subsection refer to something concrete, such as "the United States" or "April 24th, 2017". Other terms can take a range of values depending on the case at hand, and act as placeholders. For example, in the top left box of Figure 1, the terms "a taxpayer" and "the taxable year" can take different values based on the context, while the terms "section 152" and "this paragraph" have concrete, immutable values. Formally, given a sequence of tokens t_1, ..., t_n, the task is to return a set of start and end indices (s, e) ∈ {1, 2, ..., n}², where each pair represents a span. We borrow from the terminology of predicate argument alignment (Roth and Frank, 2012; Wolfe et al., 2013) and call these placeholders arguments. The first task, which we call argument identification, is tagging which parts of a subsection denote such placeholders. We provide annotations for argument identification as character-level spans representing arguments. Since each span is a pointer to the corresponding argument, we made each span the shortest meaningful phrase. Figure 2(b) shows corpus statistics about placeholders.
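As an illustration, character-level argument spans can be converted to token-level BIO tags, the format used by the tagging baseline in Section 3.1. The sketch below assumes a simple whitespace tokenizer and 0-indexed character offsets, both our own simplifications:

```python
def spans_to_bio(text, spans):
    """Convert character-level (start, end) argument spans into
    token-level BIO tags, using whitespace tokenization."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)  # character offset of this token
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    tags = []
    for tok_start, tok_end in offsets:
        tag = "O"
        for span_start, span_end in spans:
            if tok_start == span_start:
                tag = "B"  # token opens an argument span
            elif span_start < tok_start < span_end:
                tag = "I"  # token continues an argument span
        tags.append(tag)
    return tokens, tags
```

For the phrase "a taxpayer" in the example above, the two tokens receive tags B and I, and every other token receives O.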
Argument coreference Some arguments detected in the previous task may appear multiple times within the same subsection. For instance, in the top left of Figure 1, the variable representing the taxpayer in §2(a)(1)(B) is referred to twice. We refer to the task of resolving this coreference problem at the level of the subsection as argument coreference. While this coreference can span across subsections, as is the case in Figure 1, we intentionally leave that to the next task. Keeping the notation of the above paragraph, given a set of spans {(s_i, e_i)}_{i=1}^S, the task is to return a matrix C ∈ {0, 1}^{S×S}, where C_{i,j} = 1 if spans (s_i, e_i) and (s_j, e_j) denote the same variable, and 0 otherwise. Corpus statistics about argument coreference can be found in Figure 2(a). After these first two tasks, we can extract a set of arguments for every subsection. In Figure 1, for §2(a)(1)(A), that would be {Taxp, Taxy, Spouse, Years}, as shown in the bottom left of Figure 1.
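For concreteness, the matrix C can be built from annotated coreference clusters as follows; representing clusters as lists of spans is our own illustrative choice, not the released annotation format:

```python
def clusters_to_matrix(spans, clusters):
    """Build the S x S binary matrix C with C[i][j] = 1 iff
    spans i and j belong to the same coreference cluster.
    `clusters` is assumed to partition `spans` (singletons included)."""
    cluster_id = {s: k for k, cluster in enumerate(clusters) for s in cluster}
    S = len(spans)
    return [[1 if cluster_id[spans[i]] == cluster_id[spans[j]] else 0
             for j in range(S)]
            for i in range(S)]
```

Note that the diagonal is always 1, since every span trivially denotes the same variable as itself.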
Structure extraction A prominent feature of legal statutes is the presence of references, implicit and explicit, to other parts of the statutes. Resolving references and their logical connections, and passing arguments appropriately from one subsection to the other, are major steps in statutory reasoning. We refer to this as structure extraction. This mapping can be trivial, with the taxpayer and taxable year generally staying the same across subsections. Some mappings are more involved, such as the taxpayer from §152(b)(1) becoming the dependent in §152(a). Providing annotations for this task in general requires expert knowledge, as many references are implicit, and some must be resolved using guidance from Treasury Regulations. Our approach contrasts with recent efforts in breaking down complex questions into atomic questions, with the possibility of referring to previous answers (Wolfson et al., 2020). Statutes contain their own breakdown into atomic questions. In addition, our structure is interpretable by a Prolog engine.
We provide structure extraction annotations for SARA in the style of Horn clauses (Horn, 1951), using common logical operators, as shown in the bottom left of Figure 1. We also provide character offsets for the start and end of each subsection. Argument identification and coreference, and structure extraction can be done with the statutes only. They correspond to extracting a shallow version of the symbolic solver of Holzenberger et al. (2020).
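As an illustration of what a Horn-clause-style annotation carries, a clause such as §2(a)(1) :- §2(a)(1)(A) AND §2(a)(1)(B) might be encoded as nested tuples; the encoding and the predicate names below are hypothetical, not the released format:

```python
# Hypothetical encoding of the Figure 1 structure annotation for §2(a)(1):
# a clause maps a head predicate to a logical combination of body predicates.
clause = {
    "head": ("s2_a_1", ["Taxp", "Taxy", "Spouse", "Years",
                        "Household", "Dependent", "Deduction", "Cost"]),
    "body": ("AND",
             ("s2_a_1_A", ["Taxp", "Taxy", "Spouse", "Years"]),
             ("s2_a_1_B", ["Taxp", "Taxy", "Household", "Dependent", "Deduction"])),
}

def body_predicates(body):
    """Collect the subsection predicates referenced in a clause body,
    recursing through logical operators."""
    if body[0] in {"AND", "OR", "NOT"}:
        preds = []
        for child in body[1:]:
            preds.extend(body_predicates(child))
        return preds
    return [body[0]]
```

Such a representation is directly interpretable by a logic engine, which is the property the structure extraction annotations aim to preserve.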

Argument instantiation
We frame legal statutes as a set of predicates specified in natural language. Each subsection has a number of arguments, provided by the preceding tasks. Given the description of a case, each argument may or may not be associated with a value. Each subsection has an @truth argument, with possible values True or False, reflecting whether the subsection applies or not. Concretely, the input is (1) the string representation of the subsection, (2) the annotations from the first three tasks, and (3) values for some or all of its arguments. Arguments and values are represented as an array of key-value pairs, where the names of arguments specified in the structure annotations are used as keys. In Figure 1, compare the names of arguments in the green box with the key names in the blue boxes. The output is values for its arguments, in particular for the @truth argument. In the example of the top right in Figure 1, the input values are taxpayer = Alice and taxable year = 2017, and one expected output is @truth = True. We refer to this task as argument instantiation. Values for arguments can be found as spans in the case description, or must be predicted based on the case description. The latter happens often for dollar amounts, where incomes must be added, or tax must be computed. Figure 1 shows two examples of this task, in blue.

[Figure 1 (excerpt): the structure annotations for §2(a)(1), §2(a)(1)(B) and §151(c), and the text of §2(a) with argument mentions marked by superscripts.]
Before determining whether a subsection applies, it may be necessary to infer the values of unspecified arguments. For example, in the top of Figure 1, it is necessary to determine who Alice's deceased spouse and who the dependent mentioned in §2(a)(1)(B) are. If applicable, we provide values for these arguments, not as inputs, but as additional supervision for the model. We provide manual annotations for all (subsection, case) pairs in SARA. In addition, we run the Prolog solver of Holzenberger et al. (2020) to generate annotations for all possible (subsection, case) pairs, to be used as a silver standard, in contrast to the gold manual annotations. We exclude from the silver data any (subsection, case) pair where the case is part of the test set. This increases the amount of available training data by a factor of 210.

Baseline models
We provide baselines for three tasks, omitting structure extraction because it is the task with the highest return on human annotation effort². In other words, if humans could annotate for only one of these four tasks, structure extraction is where we posit their involvement would be the most worthwhile. Further, Pertierra et al. (2017) have shown that the related task of semantic parsing of legal statutes is difficult, calling for a complex model.

Argument identification
We run the Stanford parser (Socher et al., 2013) on the statutes, and extract all noun phrases as spans: specifically, all NNP, NNPS, PRP$, NP and NML constituents. While de-formatting legal text can boost parser performance (Morgenstern, 2014), we found it made little difference in our case.
As an orthogonal approach, we train a BERT-based CRF model for the task of BIO tagging. With the 9 sections in the SARA v2 statutes, we create 7 equally-sized splits by grouping §68, 3301 and 7703 into a single split. We run a 7-fold cross-validation, using 1 split as a dev set, 1 split as a test set, and the remaining as training data. We embed each paragraph using BERT, classify each contextual subword embedding into a 3-dimensional logit with a linear layer, and run a CRF (Lafferty et al., 2001). The model is trained with gradient descent to maximize the log-likelihood of the sequence of gold tags. We experiment with Legal BERT (Holzenberger et al., 2020) and BERT-base-cased (Devlin et al., 2019) as our BERT model. We freeze its parameters and optionally unfreeze the last layer. We use a batch size of 32 paragraphs, a learning rate of 10^-3 and the Adam optimizer (Kingma and Ba, 2015). Based on F1 score measured on the dev set, the best model uses Legal BERT and unfreezes its last layer. Test results are shown in Table 1.
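For reference, a minimal sketch of span-level precision, recall and F1; exact-span matching is our assumption about the scoring convention behind Table 1:

```python
def span_f1(gold_spans, pred_spans):
    """Span-level precision, recall and F1 under exact span match."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # spans predicted with exactly correct boundaries
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```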

Argument coreference
Argument coreference differs from the usual coreference task (Pradhan et al., 2014), even though we use similar terminology and frame it in a similar way. In argument coreference, it is equally as important to link two coreferent argument mentions as it is not to link two different arguments. In contrast, regular coreference emphasizes the prediction of links between mentions. We thus report a different metric in Tables 2 and 4, exact match coreference, which gives credit for returning a cluster of mentions that corresponds exactly to an argument. In Figure 1, a system would be rewarded for linking together both mentions of the taxpayer in §2(a)(1)(B), but not if either of the two mentions were linked to any other mention within §2(a)(1)(B). This custom metric gives as much credit for correctly resolving a single-mention argument (no links) as for a 5-mention argument (10 links).

²Code for the experiments can be found at https://github.com/SgfdDttt/sara_v2
Single mention baseline Here, we predict no coreference links. Under usual coreference metrics, this system can have low performance.
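The exact match coreference metric can be sketched as follows, scoring predicted clusters against gold argument clusters:

```python
def exact_match_coref(gold_clusters, pred_clusters):
    """Fraction of gold argument clusters that are returned exactly.
    A singleton cluster counts the same as a multi-mention one."""
    gold = {frozenset(c) for c in gold_clusters}
    pred = {frozenset(c) for c in pred_clusters}
    return len(gold & pred) / len(gold)
```

Under this metric, the single mention baseline gets full credit for every argument that has exactly one mention, which explains its relatively strong showing.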
String matching baseline This baseline predicts a coreference link if the placeholder strings of two arguments are identical, up to the presence of the words such, a, an, the, any, his and every. We also provide usual coreference metrics in Table 3, using the code associated with Pradhan et al. (2014). This baseline perfectly resolves coreference for 80.8% of subsections, versus 68.9% for the single mention baseline. In addition, we provide a cascade of the best methods for argument identification and coreference, and report results in Table 4.
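The string matching baseline can be sketched as follows; lower-casing and whitespace tokenization are our own simplifications:

```python
# Determiner-like words ignored when comparing placeholder strings.
IGNORED = {"such", "a", "an", "the", "any", "his", "every"}

def normalize(placeholder):
    """Normalized form of a placeholder string, dropping ignored words."""
    return tuple(w for w in placeholder.lower().split() if w not in IGNORED)

def string_match_clusters(placeholders):
    """Cluster argument mentions whose normalized strings are identical."""
    clusters = {}
    for idx, p in enumerate(placeholders):
        clusters.setdefault(normalize(p), []).append(idx)
    return list(clusters.values())
```

For example, "the taxpayer" and "such taxpayer" normalize to the same form and are linked, while "a household" stays in its own cluster.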

Argument instantiation
Argument instantiation takes into account the information provided by previous tasks. We start by instantiating the arguments of a single subsection, without regard to the structure of the statutes. We then describe how the structure information is incorporated into the model.

Algorithm 1 Argument instantiation for a single subsection
Require: argument spans with coreference information A, input argument-value pairs D, subsection text s, case description c
Ensure: output argument-value pairs P
1: function ARGINSTANTIATION(A, D, s, c)
2:   P ← ∅
3:   for each uninstantiated argument a in A do
4:     r ← INSERTVALUES(s, A, D, P)
5:     y ← BERT(c, r)
6:     x ← COMPUTEATTENTIVEREPS(y, a)
7:     v ← PREDICTVALUE(x, a)
8:     P ← P ∪ (a, v)
9:   end for
10:  r ← INSERTVALUES(s, A, D, P)
11:  y ← BERT_CLS(c, r)
12:  t ← TRUTHPREDICTOR(y)
13:  P ← P ∪ (@truth, t)
14:  return P
15: end function

Single subsection We follow the paradigm of Chen et al. (2020): we iteratively modify the text of the subsection by inserting argument values, and predict values for uninstantiated arguments. Throughout the following, we refer to Algorithm 1 and its notation.
For each argument whose value is provided, we replace the argument's placeholders in subsection s by the argument's value, using INSERTVALUES (line 4). This yields mostly grammatical sentences, with occasional hiccups. With §2(a)(1)(A) and the top right case from Figure 1, we obtain "(A) Alice spouse died during either of the two years immediately preceding 2017".
We concatenate the text of the case c with the modified text of the subsection r, and embed it using BERT (line 5), yielding a sequence of contextual subword embeddings y = {y_i ∈ R^768 | i = 1, ..., n}. Keeping with the notation of Chen et al. (2020), assume that the embedded case is represented by the sequence of vectors t_1, ..., t_m and the embedded subsection by s_1, ..., s_n. For a given argument a, compute its attentive representations s̃_1, ..., s̃_m and its augmented feature vectors x_1, ..., x_m. This operation, described by Chen et al. (2020), is performed by COMPUTEATTENTIVEREPS (line 6). The augmented feature vectors x_1, ..., x_m represent the argument's placeholder, conditioned on the text of the statute and case.
Based on the name of the argument span, we predict its value v either as an integer or as a span from the case description, using PREDICTVALUE (line 7). For integers, as part of model training, we run k-means clustering on the set of all integer values in the training set, with enough centroids such that returning the closest centroid instead of the true value yields a numerical accuracy of 1 (see below). For any argument requiring an integer (e.g. tax), the model returns a weighted average of the centroids. The weights are predicted by a linear layer followed by a softmax, taking as input an average-pooling and a max-pooling of x_1, ..., x_m. For a span from the case description, we follow the standard procedure for fine-tuning BERT on SQuAD (Devlin et al., 2019). The unnormalized probability of the span from token i to token j is given by e^{l·x_i + r·x_j}, where l, r are learnable parameters.
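The integer-prediction component can be sketched as follows. The greedy centroid selection below is our own illustrative stand-in for k-means with the centroid count chosen so that nearest-centroid prediction reaches numerical accuracy 1:

```python
def delta(y, y_hat):
    """Relative error used by the numerical accuracy metric."""
    return abs(y - y_hat) / max(0.1 * y, 5000)

def pick_centroids(train_values):
    """Greedily add training values as centroids until every training
    value is within the numerical-accuracy tolerance (delta < 1) of
    its nearest centroid. Illustrative substitute for k-means."""
    centroids = []
    for v in sorted(train_values):
        if not centroids or min(delta(v, c) for c in centroids) >= 1:
            centroids.append(v)
    return centroids

def predict_integer(weights, centroids):
    """Model output for an integer argument: a softmax-weighted
    average of the centroids."""
    assert abs(sum(weights) - 1) < 1e-6, "weights must sum to 1"
    return sum(w * c for w, c in zip(weights, centroids))
```

Small dollar amounts cluster together easily (the tolerance floor is $5000), while very large amounts force additional centroids.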
The predicted value v is added to the set of predictions P (line 8), and will be used in subsequent iterations to replace the argument's placeholder in the subsection. We repeat this process until a value has been predicted for every argument, except @truth (lines 3-9). Arguments are processed in order of appearance in the subsection. Finally, we concatenate the case and fully grounded subsection and embed them with BERT (lines 10-11), then use a linear predictor on top of the representation for the [CLS] token to predict the value for the @truth argument (line 12).

Algorithm 2 Argument instantiation with dependencies
Require: argument spans with coreference information A, structure information T, input argument-value pairs D, subsection s, case description c
Ensure: output argument-value pairs P
1: function ARGINSTANTIATIONFULL(A, T, D, s, c)
2:   t ← BUILDDEPENDENCYTREE(s, T)
3:   t ← POPULATEARGVALUES(t, D)
4:   Q ← depth-first traversal of t
5:   for q in Q do
6:     if q is a subsection and a leaf node then
7:       D_q ← GETARGVALUEPAIRS(q)
8:       s̄ ← GETSUBSECTIONTEXT(q)
9:       q ← ARGINSTANTIATION(A, D_q, s̄, c)
10:    else if q is a subsection and not a leaf node then
11:      D_q ← GETARGVALUEPAIRS(q)
12:      x ← GETCHILD(q)
13:      D_x ← GETARGVALUEPAIRS(x)
14:      D_q ← D_q ∪ D_x
15:      s̄ ← GETSUBSECTIONTEXT(q)
16:      q ← ARGINSTANTIATION(A, D_q, s̄, c)
17:    else if q ∈ {AND, OR, NOT} then
18:      C ← GETCHILDREN(q)
19:      q ← DOOPERATION(C, q)
20:    end if
21:  end for
22:  x ← ROOT(t)
23:  P ← GETARGVALUEPAIRS(x)
24:  return P
25: end function

Subsection with dependencies At a high level, we use the structure of the statutes to build a computational graph, where nodes are either subsections with argument-value pairs, or logical operations. We resolve nodes one by one, depth first. We treat the single-subsection model described above as a function, taking as input a set of argument-value pairs, a string representation of a subsection, and a string representation of a case, and returning a set of argument-value pairs. Algorithm 2 and Figure 3 summarize the following.
We start by building the subsection's dependency tree, as specified by the structure annotations (lines 2-4). First, we build the tree structure using BUILDDEPENDENCYTREE. Then, values for arguments are propagated from parent to child, from the root down, with POPULATEARGVALUES. The tree is optionally capped to a predefined depth. Each node is either an input for the single-subsection function, its output, or a logical operation. We then traverse the tree depth first, performing the following operations and replacing each node with the result of its operation:
• If the node q is a leaf, resolve it using the single-subsection function ARGINSTANTIATION (lines 6-9 in Algorithm 2; step 1 in Figure 3).
• If the node q is a subsection that is not a leaf, find its child node x (GETCHILD, line 12), and the corresponding argument-value pairs other than @truth, D_x (GETARGVALUEPAIRS, line 13). Merge D_x with D_q, the argument-value pairs of the main node q (line 14). Finally, resolve the parent node q using the single-subsection function (lines 15-16; step 3 in Figure 3).
• If node q is a logical operation (line 17), get its children C (GETCHILDREN, line 18), to which the operation will be applied with DOOPERATION (line 19) as follows:
  - If q == NOT, assign the negation of the child's @truth value to q.
  - If q == OR, pick the child with the highest @truth value, and assign its arguments' values to q.
  - If q == AND, transfer the argument-value pairs from all its children to q. In case of conflicting values, use the value associated with the lower @truth value. This operation can be seen in step 4 of Figure 3.
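These rules can be sketched as a DOOPERATION function. We assume @truth is a score in [0, 1], so that negation is 1 − t and AND takes the minimum over children; both combination rules are our own assumptions where the description above leaves them unspecified:

```python
def do_operation(op, children):
    """Combine children's argument-value pairs (each a dict containing
    an '@truth' score) under a logical operator."""
    if op == "NOT":
        (child,) = children
        return {"@truth": 1.0 - child["@truth"]}  # negate the child's truth
    if op == "OR":
        # keep the argument values of the most confident child
        best = max(children, key=lambda c: c["@truth"])
        return dict(best)
    if op == "AND":
        merged = {"@truth": min(c["@truth"] for c in children)}
        # iterate from highest to lowest @truth, so on conflicting values
        # the value associated with the lower @truth wins (overwrites last)
        for child in sorted(children, key=lambda c: c["@truth"], reverse=True):
            for key, value in child.items():
                if key != "@truth":
                    merged[key] = value
        return merged
    raise ValueError("unknown operator: %s" % op)
```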
This procedure follows the formalism of neural module networks (Andreas et al., 2016) and is illustrated in Figure 3. Reentrancy into the dependency tree is not possible, so that a decision made earlier cannot be backtracked on at a later stage. One could imagine doing joint inference, or using heuristics for revisiting decisions, for example with a limited number of reentrancies. Humans are generally able to resolve this task in the order of the text, and we assume it should be possible for a computational model too. Our solution is meant to be computationally efficient, with the hope of not sacrificing too much performance. Revisiting this assumption is left for future work.
Metrics and evaluation Arguments whose value needs to be predicted fall into three categories. The @truth argument calls for a binary truth value, and we score a model's output using binary accuracy. The values of some arguments, such as gross income, are dollar amounts. We score such values using numerical accuracy: a prediction scores 1 if Δ(y, ŷ) = |y − ŷ| / max(0.1·y, 5000) < 1, and 0 otherwise, where ŷ is the prediction and y the target. All other argument values are treated as strings. In those cases, we compute accuracy as exact match between predicted and gold value. Each of these three metrics defines a form of accuracy. We average the three metrics, weighted by the number of samples, to obtain a unified accuracy metric, used to compare the performance of models.
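The numerical and unified accuracy metrics can be sketched as:

```python
def numerical_accuracy(y, y_hat):
    """1 if the relative error delta(y, y_hat) is below 1, else 0."""
    return 1 if abs(y - y_hat) / max(0.1 * y, 5000) < 1 else 0

def unified_accuracy(binary, numeric, string):
    """Sample-weighted average of the three per-type accuracies,
    each given as a list of 0/1 scores."""
    scores = binary + numeric + string
    return sum(scores) / len(scores)
```

Note the tolerance floor of $5000: small amounts are scored leniently in absolute terms, large amounts leniently in relative terms.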
Training Based on the type of value expected, we use different loss functions. For @truth, we use binary cross-entropy. For numerical values, we use the hinge loss max(Δ(y, ŷ) − 1, 0). For strings, let S be the set of all spans in the case description equal to the expected value. The loss function is log(Σ_{i≤j} e^{l·x_i + r·x_j}) − log(Σ_{(i,j)∈S} e^{l·x_i + r·x_j}) (Clark and Gardner, 2018). The model is trained end-to-end with gradient descent.
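The numerical hinge loss, written out with Δ as defined in the metrics paragraph, is zero exactly when the prediction already counts as numerically accurate:

```python
def dollar_hinge_loss(y, y_hat):
    """Hinge loss on the relative error: incurs no penalty once the
    prediction is within the numerical-accuracy tolerance (delta < 1)."""
    delta = abs(y - y_hat) / max(0.1 * y, 5000)
    return max(delta - 1.0, 0.0)
```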
We start by training models on the silver data, as a pre-training step. We sweep the values of the learning rate in {10^-2, 10^-3, 10^-4, 10^-5} and the batch size in {64, 128, 256}. We try both BERT-base-cased and Legal BERT, allowing updates to the parameters of the top layer. We set aside 10% of the silver data as a dev set, and select the best model based on the unified accuracy on the dev set. Training is split into three stages. The single-subsection model iteratively inserts values for arguments into the text of the subsection. In the first stage, regardless of the predicted value, we insert the gold value for the argument, as in teacher forcing (Kolen and Kremer, 2001). In the second and third stages, we insert the value predicted by the model. When initializing the model from one stage to the next, we pick the model with the highest unified accuracy on the dev set. In the first two stages, we ignore the structure of the statutes, which effectively caps the depth of each dependency tree at 1.

Picking the best model from this pre-training step, we perform fine-tuning on the gold data. We take a k-fold cross-validation approach (Stone, 1974). We randomly split the SARA v2 training set into 10 splits, taking care to put pairs of cases testing the same subsection into the same split. Each split contains nearly exactly the same proportion of binary and numerical cases. We sweep the values of the learning rate and batch size in the same ranges as above, and optionally allow updates to the parameters of BERT's top layer. For a given set of hyperparameters, we run training on each split, using the dev set and the unified metric for early stopping. We use the performance on the dev set averaged across the 10 splits to evaluate a given set of hyperparameters. Using that criterion, we pick the best set of hyperparameters. We then pick the final model as that which achieves median performance on the dev set, across the 10 splits.
We report the performance of that model on the test set.
In Table 5, we report the relevant argument instantiation metrics, under @truth, dollar amount and string. For comparison, we also report the binary and numerical accuracy metrics defined in Holzenberger et al. (2020).

Table 5: Argument instantiation. We report accuracies, in %, and the 90% confidence interval. Right of the bar are accuracy metrics proposed with the initial release of the dataset. Blue cells use the silver data, brown cells do not. "BERT" is the model described in Section 3.3. Ablations to it are marked with a "-" sign.
The baseline has three parameters. For @truth, it returns the most common value for that argument on the train set. For arguments that call for a dollar amount, it returns the one number that minimizes the dollar amount hinge loss on the training set. For all other arguments, it returns the most common string answer in the training set. Those parameters vary depending on whether the training set is augmented with the silver data.
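This three-parameter baseline can be sketched as follows; searching for the best constant over the training values themselves is our own simplification:

```python
from collections import Counter

def fit_baseline(train):
    """Fit the three-parameter baseline.
    train: list of (arg_type, gold_value) pairs, with arg_type in
    {'truth', 'dollar', 'string'} (our own labels for the categories)."""
    truths = [v for t, v in train if t == "truth"]
    dollars = [v for t, v in train if t == "dollar"]
    strings = [v for t, v in train if t == "string"]
    hinge = lambda y, c: max(abs(y - c) / max(0.1 * y, 5000) - 1, 0)
    return {
        # most common truth value on the training set
        "truth": Counter(truths).most_common(1)[0][0],
        # constant minimizing the total dollar-amount hinge loss,
        # searched over the training values themselves
        "dollar": min(dollars, key=lambda c: sum(hinge(y, c) for y in dollars)),
        # most common string answer on the training set
        "string": Counter(strings).most_common(1)[0][0],
    }
```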

Discussion
Our goal in providing the baselines of Section 3 is to identify performance bottlenecks in the proposed sequence of tasks. Argument identification poses a moderate challenge, with a language model-based approach achieving non-trivial F1 score. The simple parser-based method is not a sufficient solution, but with its high recall could serve as the backbone of a statistical method.

Argument coreference is a simpler task, with string matching perfectly resolving nearly 80% of the subsections. This is in line with the intuition that legal language is very explicit about disambiguating coreference. As reported in Table 3, usual coreference metrics seem lower, but they only reflect a subset of the full task: coreference metrics are only concerned with links, so that arguments appearing exactly once bear no weight under those metrics, unless they are wrongly linked to another argument.

Argument instantiation is by far the most challenging task, as the model needs strong natural language understanding capabilities. Simple baselines can achieve accuracies above 50% for @truth, since for all numerical cases, @truth = True. We observe a slight boost in binary accuracy from using the proposed paradigm, improving on previous results on this benchmark. Compared to the baseline, the models mostly lag behind on the dollar amount and numerical accuracies, which can be explained by the lack of a dedicated numerical solver, and sparse data. Further, we have made a number of simplifying assumptions, which may be keeping the model from taking advantage of the structure information: arguments are instantiated in order of appearance, forbidding joint prediction; revisiting past predictions is disallowed, forcing the model to commit to wrong decisions made earlier; the depth of the dependency tree is capped at 3; and finally, information is passed along the dependency tree in the form of argument values, as opposed to dense, high-dimensional vector representations.
The latter limits both the flow of information and the learning signal. This could also explain why the use of dependencies is detrimental in some cases. Future work would involve joint prediction (Chan et al., 2019), and more careful use of structure information.
Looking at the errors made by the best model in Table 5 for binary accuracy, we note that for 39 positive and negative case pairs, it answers both cases in a pair identically, thus yielding exactly 39 correct answers (one per pair). In the remaining 11 pairs, there are 10 pairs where it gets both cases right. This suggests it may be guessing randomly on 39 pairs, and understanding 10. The best BERT-based model for dollar amounts predicts the same number for each case, as does the baseline. The best models for string arguments generally make predictions that match the category of the expected answer (date, person, etc.) while failing to predict the correct string.
Performance gains from silver data are noticeable and generally consistent, as can be seen by comparing brown and blue cells in Table 5. The silver data came from running a human-written Prolog program, which is costly to produce. A possible substitute is to find mentions of applicable statutes in large corpora of legal cases (Caselaw, 2019), for example using high-precision rules (Ratner et al., 2017), which has been successful for extracting information from cases (Boniol et al., 2020).
In this work, each task uses the gold annotations from upstream tasks. Ultimately, the goal is to pass the outputs of models from one task to the next.
Clark et al. (2019) as well as preceding work (Friedland et al., 2004;Gunning et al., 2010) tackle a similar problem in the science domain, with the goal of using the prescriptive knowledge from science textbooks to answer exam questions. The core of their model relies on several NLP and specialized reasoning techniques, with contextualized language models playing a major role. Clark et al. (2019) take the route of sorting questions into different types, and working on specialized solvers. In contrast, our approach is to treat each question identically, but to decompose the process of answering into a sequence of subtasks.
The language of statutes is related to procedural language, which describes steps in a process.  Weller et al. (2020) propose models that generalize to new task descriptions.
Argument instantiation is closest to the task of aligning predicate argument structures (Roth and Frank, 2012; Wolfe et al., 2013). We frame argument instantiation as iteratively completing a statement in natural language. Chen et al. (2020) refine generic statements by copying strings from input text, with the goal of detecting events. Chan et al. (2019) extend transformer-based language models to permit inserting tokens anywhere in a sequence, thus allowing modification of an existing sequence. For argument instantiation, we make use of neural module networks (Andreas et al., 2016), which have been used in the visual (Yi et al., 2018) and textual domains. In that context, arguments and their values can be thought of as the hints from Khot et al. (2020). The Prolog-based data augmentation is related to data augmentation for semantic parsing (Campagna et al., 2019; Weir et al., 2019).

Conclusion
Solutions to statutory reasoning may range from high-structure, high-human-involvement expert systems, to less structured, largely self-supervised language models. Here, taking inspiration from Prolog programs, we introduce a novel paradigm, breaking statutory reasoning down into a sequence of tasks. Each task can be annotated with far less expertise than would be required to translate legal language into code, and comes with its own performance metrics. Our contribution enables finer-grained scoring and debugging of models for statutory reasoning, which facilitates incremental progress and the identification of performance bottlenecks. In addition, argument instantiation and explicit resolution of dependencies introduce further interpretability. This novel approach could inform the design of models that reason with rules specified in natural language, for legal NLP and beyond.

A Task examples
In the following, we provide several examples for each of the tasks defined in Section 2.

A.1 Argument identification
For ease of reading, the spans mentioned in the output are underlined in the input.

Input ( §63(c)(5)) In the case of an individual with respect to whom a deduction under section 151 is allowable to another taxpayer for a taxable year beginning in the calendar year in which the individual's taxable year begins, the basic standard deduction applicable to such individual for such individual's taxable year shall not exceed the greater of-

Input 3 ( §1(d)(iv)) (iv) $31,172, plus 36% of the excess over $115,000 if the taxable income is over $115,000 but not over $250,000;

Output 3 {(5, 45), (50, 67)}

A.2 Argument coreference
We report the full matrix C. In addition, for ease of reading, coreference clusters are marked with superscripts in the input.
Input ( §63(c)(5)) In the case of an individual 1 with respect to whom a deduction 2 under section 151 is allowable to another taxpayer 3 for a taxable year 4 beginning in the calendar year 5 in which the individual 1 's taxable year 6 begins, the basic standard deduction 7 applicable to such individual 1 for such individual 1 's taxable year 6 shall not exceed the greater of-

Input 3 ( §1(d)(iv)) (iv) $31,172, plus 36% of the excess over $115,000 1 if the taxable income 2 is over $115,000 but not over $250,000; {(5, 45), (50, 67)}

Output 3
1 0
0 1

A.3 Structure extraction
To clarify the link between the input and the output, we add superscripts to argument names in the output. While the output is represented as plain text, a graph-based representation would likely be used in a practical system, to facilitate learning and inference. Arguments are keyword-based. For example, in Output 2, the value of the Taxp argument of §63(c)(5) is passed to the Spouse argument of §151(b). If no equal sign is specified, the argument names match. For example, part of Output 2 could have been rewritten more explicitly as §151(b)(Spouse=Taxp, Taxp=S45, Taxy=Taxy).

Output 1 §3306(a)(1)(B)(Caly 2 , S16 7 , Workday 1 , Employment 6 , Preccaly 3 , Employee 5 , S13A 4 , Employer, Service) :- §3306(c)(Employee, Employer, Service).

A.4 Argument instantiation
The following are example cases. In addition to the case description, subsection to apply and input argument-value pairs, the agent has access to the output of Argument identification, Argument coreference and Structure extraction, for the entirety of the statutes.