Multi-Defendant Legal Judgment Prediction via Hierarchical Reasoning

Multiple defendants in a criminal fact description generally exhibit complex interactions that cannot be well handled by existing Legal Judgment Prediction (LJP) methods, which focus on predicting judgment results (e.g., law articles, charges, and terms of penalty) for single-defendant cases. To address this problem, we propose the task of multi-defendant LJP, which aims to automatically predict the judgment results for each defendant of multi-defendant cases. Two challenges arise with the task of multi-defendant LJP: (1) indistinguishable judgment results among various defendants; and (2) the lack of a real-world dataset for training and evaluation. To tackle the first challenge, we formalize the multi-defendant judgment process as hierarchical reasoning chains and introduce a multi-defendant LJP method, named Hierarchical Reasoning Network (HRN), which follows the hierarchical reasoning chains to determine criminal relationships, sentencing circumstances, law articles, charges, and terms of penalty for each defendant. To tackle the second challenge, we collect a real-world multi-defendant LJP dataset, namely MultiLJP, to accelerate the relevant research in the future. Extensive experiments on MultiLJP verify the effectiveness of our proposed HRN.

Figure 1: An illustration of multi-defendant LJP. Generally, a judge needs to reason over the fact description to clarify the complex interactions among different defendants and make accurate judgments for each defendant.
Despite these successful efforts, single-defendant LJP suffers from an inevitable restriction in practice: a large number of fact descriptions involve multiple defendants. According to statistics derived from published legal documents sampled from legal case information disclosure, multi-defendant cases constitute at least 30% of all cases (Pan et al., 2019). As shown in Figure 1, the multi-defendant LJP task aims at predicting law articles, charges, and terms of penalty for each defendant in multi-defendant cases. Since multiple defendants are mentioned in the fact description and exhibit complex interactions, the multi-defendant LJP task requires clarifying these interactions and making accurate judgments for each defendant, which is intuitively beyond the reach of single-defendant LJP methods. Hence, there is a pressing need to extend LJP from single-defendant to multi-defendant scenarios.
However, two main challenges arise with the task of multi-defendant LJP: • Indistinguishable judgment results among various defendants. As complicated interactions exist among multiple defendants, fact descriptions of various defendants are usually mixed together. Thus it is difficult to distinguish the judgment results of various defendants so as to make accurate judgments for each defendant.
As shown in Figure 1, in order to distinguish the judgment results of various defendants, a judge has to clarify the criminal relationships among defendants, which determine whether defendants share the same law articles and charges, and the sentencing circumstances, which affect the term of penalty for each defendant. Based on these intermediate reasoning results, the judge determines and verifies the judgment results (law articles, charges, and terms of penalty) for each defendant, following a forward and backward order. The motivation behind forward prediction and backward verification is rooted in the complex nature of legal reasoning, where evidence and conclusions can be interdependent (Zhong et al., 2018; Yang et al., 2019). Overall, the multi-defendant judgment process requires simulating the judicial logic of human judges and modeling complex reasoning chains.
• Lack of real-world multi-defendant LJP datasets. Existing datasets for LJP either only have single-defendant cases annotated for multiple LJP subtasks or multi-defendant cases for a single LJP subtask. Xiao et al. (2018) collect a real-world LJP dataset, CAIL, which only retains single-defendant cases. Pan et al. (2019) only annotate multi-defendant cases with the charge prediction subtask and ignore information about criminal relationships and sentencing circumstances that can distinguish the judgment results of multiple defendants in real scenarios. In order to accelerate research on multi-defendant LJP, we urgently need a real-world multi-defendant LJP dataset.
To tackle the first challenge, we formalize the multi-defendant judgment process as hierarchical reasoning chains and propose a method for multi-defendant LJP, named Hierarchical Reasoning Network (HRN), which follows the hierarchical reasoning chains to distinguish the judgment results of various defendants. Specifically, the hierarchical reasoning chains are divided into two levels. The first-level reasoning chain identifies the relationships between defendants and determines the sentencing circumstances for each defendant. The second-level reasoning chain predicts and verifies the law articles, charges, and terms of penalty for each defendant, using a forward prediction process and a backward verification process, respectively. Since generative language models have shown a great ability to reason (Talmor et al., 2020; Yao et al., 2021; Hase and Bansal, 2021), we convert these reasoning chains into Sequence-to-Sequence (Seq2Seq) generation tasks and apply mT5 (Xue et al., 2021) to model them. Furthermore, we adopt Fusion-in-Decoder (FID) (Izacard and Grave, 2021) to efficiently process multi-defendant fact descriptions with thousands of tokens.
To tackle the second challenge, we collect a real-world dataset, namely MultiLJP, with 23,717 real-world multi-defendant LJP cases. Eight professional annotators are involved in manually editing law articles, charges, terms of penalty, criminal relationships, and sentencing circumstances for each defendant. In 89.58 percent of these cases, the defendants have different judgment results for at least one of the subtasks of the multi-defendant LJP task. MultiLJP requires accurate distinction of the judgment results for each defendant. This makes MultiLJP different from existing single-defendant LJP datasets. Our work provides the first benchmark for the multi-defendant LJP task.
Using MultiLJP, we evaluate the effectiveness of HRN for multi-defendant LJP on various subtasks. The results show that HRN can significantly outperform all the baselines. In summary, our main contributions are: • We focus on the multi-defendant LJP task and formalize the multi-defendant judgment process as hierarchical reasoning chains for the multi-defendant LJP task. • We introduce HRN, a novel method that follows the hierarchical reasoning chains to distinguish the judgment results for each defendant in multi-defendant LJP.

Related work

Multi-step reasoning with language models
Multi-step reasoning by training or fine-tuning language models to generate intermediate steps has been shown to improve performance (Zaidan et al., 2007; Talmor et al., 2020; Yao et al., 2021; Hase and Bansal, 2021; Zhang et al., 2023b; Gu et al., 2021). Ling et al. (2017) generate natural language rationales as intermediate steps for solving algebraic word problems. However, these methods are not designed for more realistic legal reasoning applications (Huang and Chang, 2022). In this paper, we aim to use generative language models to capture hierarchical reasoning chains for the multi-defendant LJP task.

Dataset
In this section, we describe the construction process and analyze various aspects of MultiLJP to provide a deeper understanding of the dataset.

Dataset construction
To the best of our knowledge, existing LJP datasets only focus on single-defendant cases or on charge prediction for multi-defendant cases. Thus, we construct a Multi-defendant Legal Judgment Prediction (MultiLJP) dataset from the published legal documents in China Judgements Online. Instead of extracting labels using regular expressions as in existing works (Xiao et al., 2018; Pan et al., 2019), we hire eight professional annotators to manually produce law articles, charges, terms of penalty, criminal relationships, and sentencing circumstances for each defendant in multi-defendant cases. The annotators are native Chinese speakers who have passed China's Unified Qualification Exam for Legal Professionals. All data is evaluated by two annotators repeatedly to eliminate bias. Since second-instance cases and retrial cases are too complicated, we only retain first-instance cases. Besides, we anonymize sensitive information (e.g., names, locations, etc.) for multi-defendant cases to avoid potential risks of structured social biases (Pitoura et al., 2017) and to protect personal privacy. After preprocessing and manual annotation, MultiLJP consists of 23,717 multi-defendant cases. Statistics of MultiLJP can be found in Table 1.

Dataset analysis
Analysis of the number of defendants. MultiLJP only contains multi-defendant cases. The number of defendants per case is distributed as follows: 49.40 percent of cases have two defendants, 21.41 percent have three defendants, 11.22 percent have four defendants, and 17.97 percent have more than four defendants. The MultiLJP dataset has 80,477 defendants in total. On average, each multi-defendant case has 3.39 defendants.
Analysis of multi-defendant judgment results. In 89.58 percent of the cases, the defendants have different judgment results for at least one of the subtasks of the multi-defendant LJP task. Specifically, 18.91 percent of cases apply different law articles to different defendants, 26.80 percent of cases impose different charges on different defendants, and 88.54 percent of cases assign different terms of penalty to different defendants.
Analysis of criminal relationships and sentencing circumstances. Based on the gold labels of criminal relationships and sentencing circumstances, ideally, a judge can distinguish 69.73 percent of defendants with different judgment results (law articles, charges, and terms of penalty). Specifically, based on criminal relationships, a judge can distinguish 70.28 percent of defendants with different law articles and 72.50 percent of defendants with different charges; based on sentencing circumstances, a judge can distinguish 96.28 percent of defendants with different terms of penalty.

Method
In this section, we describe the HRN method. First, we formulate our research problem. Then, we introduce the Sequence-to-Sequence (Seq2Seq) generation framework for hierarchical reasoning. Next, we introduce the hierarchical reasoning chains of multi-defendant cases in detail. Finally, the training process with Fusion-in-Decoder for HRN is explained.

Problem formulation
We first formulate the multi-defendant LJP task. The fact description of a multi-defendant case can be seen as a word sequence x = {w_1, w_2, ..., w_n}, where n represents the number of words. Each multi-defendant case has a set of defendant names E = {e_1, e_2, ..., e_|E|}, where each name is a sequence of words e = {w_1, w_2, ..., w_|e|}. Given the fact description x and a defendant name e of a multi-defendant case, the multi-defendant LJP task aims to predict the judgment results, i.e., multiple applicable law articles, multiple charges, and a term of penalty. The law article prediction and charge prediction subtasks are multi-label classification problems, and the term of penalty prediction subtask is a multi-class classification problem.
We introduce criminal relationship prediction and sentencing circumstance prediction as intermediate tasks to model the hierarchical reasoning chains for multi-defendant LJP and improve the prediction of the main judgment results. Criminal relationships refer to the relationships between defendants, specifically whether one defendant assisted other co-defendants during the commission of the crime. Sentencing circumstances refer to specific behaviors (such as confession and recidivism) or factors (such as being an accessory or being blind) that can influence the severity or leniency of a penalty. These two tasks are also multi-label classification problems. We denote the labels of law articles, charges, terms of penalty, criminal relationships, and sentencing circumstances as word sequences y_l, y_c, y_t, y_r, and y_s, respectively.
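The formulation above can be made concrete with a small, entirely hypothetical data structure; the field names and label values below are illustrative placeholders, not taken from MultiLJP.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class DefendantLabels:
    law_articles: List[str]              # y_l: multi-label
    charges: List[str]                   # y_c: multi-label
    term_of_penalty: str                 # y_t: single class
    criminal_relationships: List[str]    # y_r: multi-label
    sentencing_circumstances: List[str]  # y_s: multi-label

@dataclass
class MultiDefendantCase:
    fact: str                            # x: the fact description
    labels: Dict[str, DefendantLabels]   # defendant name e -> labels

# Hypothetical toy case: two defendants share law articles and charges
# but receive different terms of penalty, driven by their sentencing
# circumstances and criminal relationships.
case = MultiDefendantCase(
    fact="Defendant A instigated the assault while defendant B assisted A ...",
    labels={
        "A": DefendantLabels(["Article 234"], ["intentional injury"],
                             "12 months", ["principal of B"], ["recidivism"]),
        "B": DefendantLabels(["Article 234"], ["intentional injury"],
                             "6 months", ["assisted A"],
                             ["accessory", "confession"]),
    },
)

assert case.labels["A"].term_of_penalty != case.labels["B"].term_of_penalty
```

This mirrors the annotation granularity of MultiLJP: every label is attached to a specific defendant rather than to the case as a whole.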

Sequence-to-sequence generation
From the perspective of Sequence-to-Sequence (Seq2Seq) generation, each task can be modeled as finding an optimal label sequence y that maximizes the conditional probability p(y|x, e, d) based on the fact description x, a specific defendant name e, and a specific task description d, which is calculated as:

p(y|x, e, d) = ∏_{i=1}^{m} p(y_i | y_{<i}, x, e, d), (1)

where m denotes the length of the label sequence, and the specific task description d is a semantic prompt that allows Seq2Seq generation models to execute the desired task. To accomplish the Seq2Seq generation tasks, we apply the Seq2Seq language model mT5 (Xue et al., 2021) to generate label sequences as follows:

ŷ = DEC(ENC(x, e, d)), (2)

where ENC refers to the encoder of the language model, DEC denotes the decoder of the language model, and ŷ is the prediction result composed of words. We use special [SEP] tokens to separate the different pieces of information to form the input of the encoder.
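A minimal sketch of how the encoder input might be assembled and how Eq. (1) factorizes a sequence probability. The joining order (fact, defendant name, task prompt) and the exact [SEP] formatting are assumptions based on the text, not the paper's verbatim preprocessing.

```python
import math

SEP = " [SEP] "

def build_input(fact: str, defendant: str, task_description: str) -> str:
    """Form the encoder input by joining the fact description, the
    defendant name, and the task prompt with [SEP] tokens, as in
    y_hat = DEC(ENC(x, e, d))."""
    return SEP.join([fact, defendant, task_description])

def sequence_log_prob(step_probs):
    """Chain-rule factorization of Eq. (1): the log-probability of a
    label sequence is the sum of per-token conditional log-probabilities
    log p(y_i | y_<i, x, e, d)."""
    return sum(math.log(p) for p in step_probs)

inp = build_input("A and B jointly stole ...", "A", "predict charges")
assert inp.count("[SEP]") == 2
```

In practice, `inp` would be tokenized and fed to an mT5 encoder-decoder (e.g., via a Hugging Face checkpoint); the sketch only fixes the input contract.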

Hierarchical reasoning chains
To distinguish different judgment results among various defendants, our method HRN follows hierarchical reasoning chains to determine each defendant's criminal relationships, sentencing circumstances, law articles, charges, and terms of penalty.
As shown in Figure 2, the hierarchical reasoning chains consist of two levels.

The first-level reasoning is for the intermediate tasks. The first-level reasoning chain first identifies the relationships between defendants based on the fact description, the names of all defendants E, and the criminal relationship prediction task description d_r as follows:

ŷ_r = DEC(ENC(x, E, d_r)). (3)

Then, we determine the sentencing circumstances for defendant e based on the fact description, the name of defendant e, the prediction results of criminal relationships, and the sentencing circumstance prediction task description d_s, that is:

ŷ_s = DEC(ENC(x, e, ŷ_r, d_s)). (4)

The second-level reasoning is for the judgment prediction tasks. The second-level reasoning chain consists of a forward prediction process and a backward verification process. The forward prediction process predicts law articles, charges, and terms of penalty (in that order) based on the fact description, the name of defendant e, the first-level reasoning results, and the forward prediction task description d_lct as follows:

ŷ_lct = DEC(ENC(x, e, ŷ_r, ŷ_s, d_lct)). (5)

Then, the backward verification process verifies these judgment results in reverse order based on the fact description, the name of defendant e, the first-level reasoning results, and the backward verification task description d_tcl, that is:

ŷ_tcl = DEC(ENC(x, e, ŷ_r, ŷ_s, d_tcl)). (6)
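The two-level chain can be sketched as plain control flow. Here `generate` stands in for the mT5 encoder-decoder (it maps one input string to one output label string), and the angle-bracketed task descriptions are placeholders, not the paper's actual prompts.

```python
SEP = " [SEP] "

def hierarchical_reasoning(generate, fact, all_defendants, defendant):
    """Follow the hierarchical reasoning chains of Eqs. (3)-(6)."""
    # First level: criminal relationships over all defendants ...
    y_r = generate(SEP.join([fact, ", ".join(all_defendants),
                             "<criminal relationship prediction>"]))
    # ... then sentencing circumstances for the target defendant,
    # conditioned on the predicted relationships.
    y_s = generate(SEP.join([fact, defendant, y_r,
                             "<sentencing circumstance prediction>"]))
    # Second level: forward prediction (law -> charge -> term) ...
    y_lct = generate(SEP.join([fact, defendant, y_r, y_s,
                               "<forward: law, charge, term>"]))
    # ... and backward verification (term -> charge -> law).
    y_tcl = generate(SEP.join([fact, defendant, y_r, y_s,
                               "<backward: term, charge, law>"]))
    return y_r, y_s, y_lct, y_tcl
```

The key point the sketch captures is the conditioning structure: each later generation step sees the outputs of the earlier steps in its input.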

Training with fusion-in-decoder
To handle multi-defendant fact descriptions whose average length exceeds the length limit of the encoder, we adopt Fusion-in-Decoder (FID) (Izacard and Grave, 2021) to encode multiple paragraphs split from a fact description. We first split the fact description x into K paragraphs of M words each. Then, we combine the paragraph representations from the encoder, and the decoder generates prediction results by attending to all of them as follows:

ŷ = DEC([h_1; h_2; ...; h_K]),

where h_i denotes the representation of the i-th paragraph of the fact description x produced by the encoder. Since all tasks are formulated as sequence-to-sequence generation tasks, we follow Raffel et al. (2020) to train the model by standard maximum likelihood and calculate the cross-entropy loss for each task. The overall loss function is computed as:

L = λ_r L_r + λ_s L_s + λ_lct L_lct + λ_tcl L_tcl,

where the hyperparameters λ determine the trade-off between the subtask losses, and L_r, L_s, L_lct, and L_tcl denote the cross-entropy losses of the criminal relationship prediction task, the sentencing circumstance prediction task, the forward prediction process, and the backward verification process, respectively. At test time, we apply greedy decoding to generate forward and backward prediction results. Finally, the chain with the highest confidence is chosen as the final prediction.
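The training and inference steps above can be sketched as follows. K, M, the lambda weights, and the confidence measure (sequence log-probability) are placeholders; real FID additionally encodes each paragraph separately and concatenates the token representations before decoding, which the sketch only hints at via the input split.

```python
def split_into_paragraphs(words, K, M):
    """Split a long fact description into at most K paragraphs of up to
    M words each, so every piece fits the encoder's length limit
    (the FID-style input split)."""
    paras = [words[i:i + M] for i in range(0, len(words), M)]
    return paras[:K]

def overall_loss(l_r, l_s, l_lct, l_tcl, lam=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four subtask cross-entropy losses; the lambda
    weights are hyperparameters (the values here are placeholders)."""
    return lam[0] * l_r + lam[1] * l_s + lam[2] * l_lct + lam[3] * l_tcl

def pick_final_chain(forward, backward):
    """At test time, keep whichever decoded chain (forward prediction or
    backward verification) has the higher confidence, taken here to be
    the sequence log-probability."""
    return max([forward, backward], key=lambda c: c["log_prob"])

paras = split_into_paragraphs(list(range(10)), K=3, M=4)
assert [len(p) for p in paras] == [4, 4, 2]
```

The weighted-sum loss keeps all four chains trained jointly, while the final-chain selection resolves any disagreement between the forward and backward decodes.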

Research questions
We aim to answer the following research questions with our experiments: (RQ1) How does our proposed method, HRN, perform on multi-defendant LJP cases? (RQ2) How do the different levels of reasoning chains affect the performance of HRN on multi-defendant LJP?

Baselines
To verify the effectiveness of our method HRN on multi-defendant LJP, we compare it with a variety of methods, which can be summarized in the following three groups: • Single-defendant LJP methods, including Topjudge (Zhong et al., 2018), a topological dependency learning framework for single-defendant LJP that formalizes the explicit dependencies over subtasks as a directed acyclic graph; MPBFN (Yang et al., 2019), a single-defendant LJP method that utilizes forward and backward dependencies among multiple LJP subtasks; LADAN (Xu et al., 2020), a graph neural network based method that automatically captures subtle differences among confusing law articles; and NeurJudge (Yue et al., 2021), which utilizes the results of intermediate subtasks to separate the fact statement into different circumstances and exploits them to make predictions for the other subtasks. • Pre-trained language models, including BERT (Cui et al., 2021), a Transformer-based model pre-trained on Chinese Wikipedia documents; mT5 (Xue et al., 2021), a multilingual model pre-trained by converting several language tasks into "text-to-text" tasks, including Chinese data; and Lawformer (Xiao et al., 2021b), a Transformer-based model pre-trained on large-scale Chinese legal long case documents. • Multi-defendant charge prediction method: MAMD (Pan et al., 2019), a multi-defendant charge prediction method that leverages multi-scale attention to recognize fact descriptions for different defendants.
We adapt single-defendant LJP methods to multi-defendant LJP by concatenating a defendant's name and a fact description as input and training models to predict judgment results. However, we exclude some state-of-the-art single-defendant approaches unsuitable for multi-defendant settings. Few-shot (Hu et al., 2018), EPM (Feng et al., 2022b), and CEEN (Lyu et al., 2022) rely on extra attributes annotated for single-defendant datasets, which are not easily transferable to MultiLJP. Also, CTM (Liu et al., 2022) and CL4LJP (Zhang et al., 2023a) design specific sampling strategies for contrastive learning of single-defendant cases, which are hard to generalize to multi-defendant cases.

Implementation details
To accommodate the length of multi-defendant fact descriptions, we set the maximum fact length to 2,304 tokens. Due to the constraints of the model input, BERT's input length is limited to 512. For training, we employ the AdamW (Loshchilov and Hutter, 2019) optimizer and use a linear learning rate schedule with warmup. The warmup ratio is set to 0.01, and the maximum learning rate is set to 1 × 10^-3. We set the batch size to 128 and adopt a gradient accumulation strategy. All models are trained for a maximum of 20 epochs, and the model that performs best on the validation set is selected.
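The learning-rate schedule described above (warmup ratio 0.01, peak 1 × 10^-3) can be sketched as a plain function. The linear decay to zero after warmup is our assumption; the paper only states a linear schedule with warmup, and in practice one would use an off-the-shelf scheduler.

```python
def lr_at_step(step, total_steps, max_lr=1e-3, warmup_ratio=0.01):
    """Linear warmup to max_lr over the first warmup_ratio of training
    steps, then linear decay (decay-to-zero endpoint is an assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 to max_lr.
        return max_lr * step / warmup_steps
    # Decay phase: ramp linearly from max_lr down to 0 at total_steps.
    remaining = total_steps - step
    return max_lr * max(0.0, remaining / (total_steps - warmup_steps))
```

Libraries such as Hugging Face Transformers ship an equivalent linear-warmup scheduler, which would typically be preferred over a hand-rolled one.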

Experimental results and analysis
In this section, we first conduct multi-defendant legal judgment prediction experiments and ablation studies to answer the research questions listed in Section 5.1. In addition, we conduct a case study to intuitively evaluate the importance of hierarchical reasoning.

Multi-defendant judgment results (RQ1)
Table 2 shows the evaluation results on the multi-defendant LJP subtasks. Generally, HRN achieves the best performance in terms of all metrics for all multi-defendant LJP subtasks. Based on the results, we have three main observations: LJP by improving the first-level reasoning.

Ablation studies (RQ2)
To analyze the effect of the different levels of reasoning chains in HRN, we conduct an ablation study. Table 3 shows that all levels of reasoning chains help HRN, as removing any of them decreases performance: • Removing the first-level reasoning chain. We observe that removing either component of the first-level reasoning chain decreases the performance of HRN. Specifically, removing criminal relationships (CR) negatively impacts performance, especially on law article and charge prediction, which means criminal relationships are helpful for distinguishing law articles and charges; removing sentencing circumstances (SC) negatively impacts performance, especially on terms of penalty, which means sentencing circumstances are helpful for distinguishing terms of penalty.
• Removing the second-level reasoning chain.
We observe that the model without the second-level forward prediction process (FP) or the second-level backward verification process (BV) faces a large performance degradation in multi-defendant LJP. Although the model still performs the first-level reasoning, the absence of modeling forward or backward dependencies between LJP subtasks leads to poor LJP performance.
• Removing all reasoning chains. When removing all reasoning chains from HRN, there is a substantial drop in the performance of multi-defendant LJP. Experimental results show that hierarchical reasoning chains are critical for multi-defendant LJP.

Case study
We also conduct a case study to show how multi-defendant reasoning chains help the model distinguish the judgment results of different defendants and make correct predictions. Figure 3 shows prediction results for two defendants, where red and green represent incorrect and correct predictions, respectively. Defendant B helped A beat the victim, but the fact description does not show a direct crime by B against the victim. Without determining criminal relationships and sentencing circumstances, MAMD focuses on the tailing behavior around A and misclassifies A's law articles, charges, and terms of penalty as Article 264, theft, and 11 months. In contrast, by following the reasoning chains to determine criminal relationships, sentencing circumstances, law articles, charges, and terms of penalty, HRN distinguishes the different judgment results of the two defendants.

Conclusions
In this paper, we studied the legal judgment prediction problem for multi-defendant cases. We proposed the task of multi-defendant LJP to promote LJP systems from single-defendant to multi-defendant scenarios. To distinguish the confusing judgment results of different defendants, we proposed a Hierarchical Reasoning Network (HRN) that determines criminal relationships, sentencing circumstances, law articles, charges, and terms of penalty for each defendant. As there was no benchmark dataset for multi-defendant LJP, we collected a real-world dataset, MultiLJP, and conducted extensive experiments on it. Experimental results verify the effectiveness of our proposed method: HRN outperforms all baselines.

Limitations
Although our work distinguishes the judgment results of multiple defendants by hierarchical reasoning, in real life there exist many confusing charge pairs, such as the crime of intentional injury and the crime of intentional homicide. The fact descriptions of these confusing charge pairs are very similar, which makes it difficult for the multi-defendant LJP model to distinguish between them. We leave this challenge for future work.

Figure 2 :
Figure 2: Overview of our proposed HRN. HRN leverages a Sequence-to-Sequence (Seq2Seq) generation framework and follows hierarchical reasoning chains to generate prediction results.

Figure 3 :
Figure 3: Case study for intuitive comparisons. Red and green represent incorrect and correct judgment results, respectively. Blue denotes descriptions of criminal relationships.

Table 1 :
Statistics of the MultiLJP.
Wei et al. (2022) fine-tune pre-trained language models to solve competition mathematics problems by generating multi-step solutions. Nye et al. (2021) train language models to predict the final outputs of programs by predicting intermediate computational results. Recently, Wei et al. (2022) propose chain-of-thought prompting, which feeds large language models with step-by-step reasoning examples, without fine-tuning, to improve model performance.