Precedent-Enhanced Legal Judgment Prediction with LLM and Domain-Model Collaboration

Legal Judgment Prediction (LJP) has become an increasingly crucial task in Legal AI, i.e., predicting the judgment of a case from its fact description. Precedents are previous legal cases with similar facts, which are the basis for judging subsequent cases in national legal systems. Thus, it is worthwhile to explore the utilization of precedents in LJP. Recent advances in deep learning have enabled a variety of techniques to be used to solve the LJP task, which can be divided into two categories: large language models (LLMs) and domain-specific models. LLMs are capable of interpreting and generating complex natural language, while domain models are efficient in learning task-specific information. In this paper, we propose the precedent-enhanced LJP framework (PLJP), a system that leverages the strengths of both LLMs and domain models in the context of precedents. Specifically, the domain models provide candidate labels and find the proper precedents efficiently, and the LLMs make the final prediction through an in-context precedent comprehension. Experiments on a real-world dataset demonstrate the effectiveness of PLJP. Moreover, our work shows a promising direction for LLM and domain-model collaboration that can be generalized to other vertical domains.


Introduction
Legal AI has been the subject of research for several decades, with the aim of assisting individuals in various legal tasks, including legal QA (Monroy et al., 2009), court view generation (Wu et al., 2020), legal entity recognition (Cardellino et al., 2017), and so on. As one of the most important legal tasks, legal judgment prediction (LJP) aims to predict the legal judgment of a case based on the case fact description. The legal judgment typically includes the law article, charge and prison term. Precedents, which refer to previous cases with similar fact descriptions, hold a crucial position within national legal systems (Guillaume, 2011). On a more macro level, precedents are known as the collective body of judge-made laws in a nation (Garner, 2001). They serve the purpose of ensuring consistency in judicial decisions, providing greater legal guidance to judges, and facilitating legal progress and evolution to meet dynamic legal demands. In the Common Law system, precedents are the mandatory basis of the judgment of subsequent cases (Rigoni, 2014). In the Civil Law system, judge-made laws are perceived as secondary legal sources while written laws are the basic legal sources (Larenz, 1992). In the contemporary era, there is also a growing trend to treat precedents as a source of "soft" law (Fon and Parisi, 2006), and judges are expected to take them into account when reaching a decision (Guillaume, 2011). Thus, it is worthwhile to explore the utilization of precedents in legal judgment prediction.
With the development of deep learning, many technologies have been adopted in the LJP task, which can be split into two categories: large language models (LLMs) and domain-specific models (Ge et al., 2023). Owing to extensive training, LLMs are good at understanding and generating complex natural language, as well as in-context learning. On the other hand, domain-specific models are designed to cater to specific tasks and offer cost-effective solutions. However, when it comes to incorporating precedents into the LJP task, both categories of models face certain limitations. LLMs, constrained by their prompt length, struggle to grasp the meaning of numerous abstract labels and accurately select the appropriate one. Domain models, though trained with label annotations, have a limited ability to comprehend and distinguish the similarities and differences between the precedents and the given case.
In this paper, as Fig. 1 shows, we combine LLMs with domain-specific models and propose a novel precedent-enhanced legal judgment prediction framework (PLJP). Specifically, the domain models contribute by providing candidate labels and finding the proper precedents from the case database effectively; the LLMs make the final prediction through an in-context precedent comprehension.
Following previous LJP works (Zhong et al., 2018; Yue et al., 2021; Dong and Niu, 2021), our experiments are conducted on a publicly available real-world legal dataset. To prevent any potential data leakage during the training of the LLMs, where the model may have already encountered the test cases, we create a new test set comprising cases that occurred after 2022. This is necessary because the LLMs we utilize have been trained on a corpus collected only until September 2021. By doing so, we ensure a fair evaluation of the PLJP framework. Remarkably, our proposed PLJP framework achieves state-of-the-art (SOTA) performance on both the original test set and the additional test set.
To sum up, our main contributions are as follows: • We address the important task of legal judgment prediction (LJP) by taking precedents into consideration.
• We propose a novel precedent-enhanced legal judgment prediction (PLJP) framework that leverages the strength of both LLM and domain models.
• We conduct extensive experiments on the real-world dataset and create an additional test set to ensure the absence of data leakage during LLM training. The results obtained on both the original and additional test sets validate the effectiveness of the PLJP framework.
• Our work shows a promising direction for LLM and domain-model collaboration that can be generalized to other vertical domains. We make all the code and data publicly available to motivate other scholars to investigate this novel and interesting research direction.
Related Work

Legal AI
Legal Artificial Intelligence (Legal AI) aims to enhance tasks within the legal domain through the utilization of artificial intelligence techniques (Zhong et al., 2020; Katz et al., 2023). Collaborative efforts between researchers in the law and computer science fields have long explored the potential of Legal AI and its applications across various legal tasks. These tasks encompass areas such as legal question answering (QA) (Monroy et al., 2009), legal entity recognition (Cardellino et al., 2017), court view generation (Wu et al., 2020), legal summarization (Hachey and Grover, 2006; Bhattacharya et al., 2019), legal language understanding (Chalkidis et al., 2022), and so on.
In this work, we focus on the task of legal judgment prediction, which is one of the most common tasks in Legal AI.

Legal Judgment Prediction
Legal judgment prediction (LJP) aims to predict judgment results based on the fact descriptions automatically (Lin et al., 2012; Chalkidis et al., 2019; Yue et al., 2021; Xu et al., 2020; Niklaus et al., 2021; Malik et al., 2021; Feng et al., 2022; Lyu et al., 2022; Gan et al., 2022). The LJP methods of earlier years required manually extracted features (Keown, 1980), which is simple but costly. Owing to the prosperity of machine learning (Wu et al., 2022; Shen et al., 2022; Li et al., 2022a,b; Zhang et al., 2022; Li et al., 2023; Zhang et al., 2023), researchers began to formalize the LJP problem with machine learning methods. These data-driven methods can learn the features with far less labor (e.g., only the final labels are required). Sulea et al. (2017) developed an ensemble system that averages the output of multiple SVMs to improve the performance of LJP. Luo et al. (2017) utilized an attention mechanism in LJP. Zhong et al. (2018) considered the dependency of the sub-tasks in LJP. Yue et al. (2021) investigated the problem by separating the representation of the fact description into different embeddings. Liu et al. (2022) used contrastive learning in LJP.
However, these existing LJP methods tend to overlook the significance of precedents. In this study, we propose a precedent-enhanced LJP framework (PLJP) that leverages the collaboration between domain-specific models and large language models (LLMs) to address the LJP task.

Precedent Retrieval
The precedent is the basis of judgment in the Common Law system, and also an important reference for decision-making in the Civil Law system. Therefore, precedent retrieval is another valuable task in Legal AI (Althammer et al., 2021). There are two main kinds of precedent retrieval models: expert knowledge-based models and natural language processing (NLP)-based models (Bench-Capon et al., 2012). Expert knowledge-based models use designed sub-elements to represent the legal cases (Saravanan et al., 2009), while NLP-based models mainly convert the text into embeddings and then calculate the similarity at the embedding level (Ma et al., 2021; Chalkidis et al., 2020).
Most retrieval models require additional annotation and thus cannot be directly applied to the LJP task. In our paper, we use an unsupervised dense retrieval model (Izacard et al., 2022) to get the precedents, which can be replaced by other retrieval models if needed.

Large Language Models
Large language models (LLMs), such as ChatGPT, have attracted widespread attention from society (Zhao et al., 2023). With pre-training over large-scale corpora, LLMs show strong capabilities in interpreting and generating complex natural language, as well as reasoning (e.g., in-context learning). The technical evolution of LLMs has been making an important impact on the fields of natural language processing (Brown et al., 2020; Touvron et al., 2023), computer vision (Shao et al., 2023; Wu et al., 2023), and reinforcement learning (Du et al., 2023). In the legal domain, LLMs can also be used for many tasks such as legal document analysis and legal document writing (Sun, 2023). However, in prediction tasks, which can involve dozens of abstract labels, the performance of LLMs is not as good as in generation tasks, due to the limited prompt length. In this paper, we explore the utilization of LLMs in the LJP task through collaboration with domain-specific models.

Problem Formulation
In this work, we focus on the problem of legal judgment prediction. We first clarify the definitions of the terms as follows.
• Fact Description refers to a concise narrative of the case, which typically includes the timeline of events, the actions or conduct of each party, and any other essential details that are relevant to the case. Here we define it as a token sequence f = {w_t}_{t=1}^{l_f}, where l_f is the length.
• Judgment is the final decision made by a judge in a legal case based on the facts and the precedents. It typically consists of the law article, the charge, and the prison term. We represent the judgment of a case as j = (a, c, t), where a, c, t refer to the labels of article, charge and prison term, respectively.
• Precedent is a previous case with similar facts. The judgments of the precedents are important references for the current case. Here, a precedent is defined as p = (f_p, j_p), where f_p is its fact description and j_p is its judgment. For a given case, there can be several precedents, denoted as P = {p_1, p_2, ..., p_n}, where n is the number of precedents.
Then the problem can be defined as: Problem 1 (Legal Judgment Prediction). Given the fact description f, our task is to get and comprehend the precedents P, then predict the judgment j = (a, c, t).
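The terms above can be captured in a minimal data-structure sketch. The class and field names here are our own illustration for clarity, not identifiers from the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Judgment:
    article: str      # law article label a
    charge: str       # charge label c
    prison_term: str  # prison term label t

@dataclass
class Precedent:
    fact: str           # fact description f_p of the previous case
    judgment: Judgment  # its judgment j_p

@dataclass
class Case:
    fact: str                   # fact description f of the given case
    precedents: List[Precedent] # P = {p_1, ..., p_n}

# A toy instance of Problem 1: given f and P, predict j = (a, c, t).
p1 = Precedent(fact="defendant stole a bicycle",
               judgment=Judgment("Art. 264", "Theft", "6 months"))
case = Case(fact="defendant stole a phone", precedents=[p1])
```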

Precedent-Enhanced LJP (PLJP)
In this section, we describe our precedent-enhanced legal judgment prediction framework (PLJP); Fig. 2 shows the overall framework.

Case Database Construction
Before we use the precedents, we have to collect a large number of previous cases to construct a case database.Since the fact descriptions are usually long and elaborate, it is difficult for the models to get the proper precedents.To this end, we reorganize the fact description of these previous cases with the help of LLMs.

Fact Reorganization
Given a fact description of a case, we summarize it from three aspects: subjective motivation, objective behavior, and ex post facto circumstances. The reorganization doesn't require human annotation and is completed by the LLMs with the following prompt: "A fact description can be categorized into subjective motivation, objective behavior, and ex post facto circumstances. Subjective motivation refers to the psychological attitude of the perpetrator towards their harmful actions and their consequences, including intent, negligence, and purposes of the crime. Objective behavior pertains to the necessary conditions for constituting a crime in terms of observable activities, including harmful conduct, harmful results, and the causal relationship between the conduct and the results. Ex post facto circumstances are various factual situations considered when determining the severity of penalties. Mitigating circumstances for lenient punishment include voluntary surrender and meritorious conduct, while aggravating circumstances for harsher punishment include recidivism. Based on the provided information, your task is to summarize the following facts." The reorganization reduces the length of the facts and makes the precedents easy to get and comprehend in PLJP.
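As a rough sketch of how the reorganization step might be wired up: the function below prepends the instruction (abridged here) to a raw fact, and the parser assumes the LLM answers with one labeled line per aspect. Both the abridged template and the response format are our own assumptions for illustration; the paper only specifies the prompt text itself:

```python
# Abridged stand-in for the full reorganization instruction quoted in the paper.
REORG_INSTRUCTION = (
    "A fact description can be categorized into subjective motivation, "
    "objective behavior, and ex post facto circumstances. [...] "
    "Based on the provided information, your task is to summarize the following facts."
)

def build_reorganization_prompt(fact: str) -> str:
    # The instruction is prepended to the raw fact description.
    return f"{REORG_INSTRUCTION}\n\nFacts: {fact}"

def parse_triplet(response: str):
    """Parse a hypothetical LLM response of the form
    'Subjective: ...\\nObjective: ...\\nEx post facto: ...'
    into the (sub, obj, ex) triplet stored in the case database."""
    fields = {"Subjective": "", "Objective": "", "Ex post facto": ""}
    for line in response.splitlines():
        for key in fields:
            if line.startswith(key + ":"):
                fields[key] = line.split(":", 1)[1].strip()
    return fields["Subjective"], fields["Objective"], fields["Ex post facto"]

sub, obj, ex = parse_triplet(
    "Subjective: intent to defraud\n"
    "Objective: sold rented loaders\n"
    "Ex post facto: surrendered"
)
```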
After the reorganization, the fact description f is translated into a triplet (sub, obj, ex), which indicates the subjective motivation, objective behavior, and ex post facto circumstances, respectively. Finally, a previous case in the case database is stored as a pair of the reorganized facts and the judgment.

Legal Judgment Prediction
Next, we describe the collaboration of the LLM and domain models in legal judgment prediction.

Domain Models
The domain models are trained on specific datasets, aiming to solve certain tasks. Here, we use two kinds of domain models: the predictive model and the retrieval model.
Predictive model. The predictive model takes the fact description as the input and outputs the candidate labels for the three sub-tasks (i.e., law article, charge, prison term). Since the fact description f = {w_t}_{t=1}^{l_f} is a sequence of words, we first transform it into an embedding sequence H_f ∈ R^{l_f × d} with an encoder: H_f = Encoder(f), where d is the dimension of the embedding.
We take a max-pooling operation to obtain the pooled hidden vector h_f ∈ R^d and then feed it into a fully-connected network with softmax activation to obtain the label probability distribution P ∈ R^m: P = softmax(W_p h_f + b_p), where W_p ∈ R^{m×d} and b_p ∈ R^m are learnable parameters. Note that m varies across the sub-tasks.
Then, each sub-task gets its candidate labels according to the probability distribution P, and the number of candidate labels is equal to the number of precedents n.
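The prediction head described above can be sketched in a few lines of numpy. The encoder is stubbed out with random embeddings purely for illustration; in PLJP it would be a trained CNN or BERT:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tokens, d=8):
    # Stub encoder: maps each token to a d-dimensional vector, giving
    # H_f with shape (l_f, d). A real system would use CNN/BERT here.
    return rng.standard_normal((len(tokens), d))

def predict_candidates(tokens, W_p, b_p, n=3):
    H_f = encode(tokens, d=W_p.shape[1])   # (l_f, d)
    h_f = H_f.max(axis=0)                  # max-pooling over tokens -> (d,)
    logits = W_p @ h_f + b_p               # fully-connected layer -> (m,)
    P = np.exp(logits - logits.max())
    P /= P.sum()                           # softmax -> label distribution
    # Candidate labels: the top-n indices, one per precedent to retrieve.
    return np.argsort(P)[::-1][:n], P

m, d = 5, 8                                # m labels for this sub-task
W_p = rng.standard_normal((m, d))
b_p = rng.standard_normal(m)
candidates, P = predict_candidates(["the", "defendant", "stole"], W_p, b_p, n=3)
```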
Retrieval model.The retrieval model aims to get the proper precedents of the given case based on its reorganized fact (sub, obj, ex).
Formally, to get the similarity score of any two texts D_1 and D_2, we first encode each of them independently using the same encoder and take the inner product of the two embeddings as the score: s(D_1, D_2) = ⟨Encoder(D_1), Encoder(D_2)⟩. Here we concatenate the sub, obj and ex into a whole text to calculate the similarity score between the given case and the cases in the case database.
For each candidate label, we pick one case as the precedent: the case that has the highest similarity score and has the same label.For example, if the label "Theft" is in the candidate labels in the charge prediction, we will find the most similar previous case with the same label as the corresponding precedent.The one-to-one relationship between the candidate label and precedent helps the LLM distinguish the differences among the labels.In other words, the precedent serves as a supplementary explanation of the label.
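The label-constrained selection can be sketched as follows; cosine similarity over toy vectors stands in for the dense retriever, and the database layout is our own assumption:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pick_precedents(query_vec, database, candidate_labels):
    """database: list of (embedding, label, case_id) tuples.
    For each candidate label, return the most similar case carrying
    that label, giving the one-to-one label/precedent pairing."""
    precedents = {}
    for label in candidate_labels:
        scored = [(cosine(query_vec, emb), cid)
                  for emb, lab, cid in database if lab == label]
        if scored:
            precedents[label] = max(scored)[1]  # highest-scoring case id
    return precedents

db = [
    (np.array([1.0, 0.0]), "Theft", "case-1"),
    (np.array([0.9, 0.1]), "Theft", "case-2"),
    (np.array([0.0, 1.0]), "Fraud", "case-3"),
]
query = np.array([1.0, 0.05])  # embedding of the given case's (sub, obj, ex)
chosen = pick_precedents(query, db, ["Theft", "Fraud"])
```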

LLMs
Large language models are models with billions of parameters, trained on large-scale corpora, that show strong capabilities in interpreting and generating complex natural language. LLMs contribute to PLJP through fact reorganization and in-context precedent comprehension.
Fact Reorganization The fact reorganization is described in the case database construction (Sec. 4.1.1); it aims to summarize the fact description from three aspects with the LLMs. Besides the database construction, as Fig. 2 shows, when a new test case comes, the LLMs reorganize its fact description with the same prompt.
In-Context Precedent Comprehension Since LLMs are capable of understanding complex natural language, we stack the given case with its precedents and let the LLMs make the final prediction through an in-context precedent comprehension. Specifically, the prompt for law article prediction is designed as follows: "Based on the facts, we select the candidate law articles by the domain models and select the following three precedents based on the candidate law articles. Please comprehend the difference among the precedents, then compare them with the facts of this case, and choose the final label." Considering the topological dependencies among the three sub-tasks (Zhong et al., 2018), in the prediction of the charge, we add the predicted law article to the prompt; in the prediction of the prison term, we add the predicted law article and charge.
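The prompt-stacking step might look like the sketch below. The layout of the stacked context (labeled fact lines, one precedent per candidate label, prior sub-task outputs appended for the dependency chain) is our own assumption about how the pieces are combined; only the quoted instruction comes from the paper:

```python
INSTRUCTION = (
    "Based on the facts, we select the candidate law articles by the domain "
    "models and select the following three precedents based on the candidate "
    "law articles. Please comprehend the difference among the precedents, then "
    "compare them with the facts of this case, and choose the final label."
)

def build_prediction_prompt(fact_triplet, precedent_texts, candidates,
                            prior_judgments=None):
    """Stack the reorganized facts, the per-label precedents, and the candidate
    labels into one prompt. prior_judgments carries earlier sub-task outputs:
    the predicted article for charge prediction; article and charge for the
    prison-term prediction (the topological dependency)."""
    sub, obj, ex = fact_triplet
    lines = [f"Subjective: {sub}", f"Objective: {obj}", f"Ex post facto: {ex}"]
    for label, ptext in zip(candidates, precedent_texts):
        lines.append(f"Precedent for {label}: {ptext}")
    lines.append(f"Candidate labels: {', '.join(candidates)}")
    if prior_judgments:
        lines.append(f"Previously predicted: {', '.join(prior_judgments)}")
    lines.append(INSTRUCTION)
    return "\n".join(lines)

prompt = build_prediction_prompt(
    ("intent to defraud", "sold rented loaders", "surrendered"),
    ["precedent text A", "precedent text B"],
    ["Fraud", "Contract Fraud"],
    prior_judgments=["Article 266"],  # hypothetical predicted law article
)
```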

Training
In PLJP, considering realizability, we train the domain models on legal datasets and leave the LLMs unchanged. To train the predictive models, the cross-entropy loss is employed. As for the retrieval models, a contrastive loss is used, following Izacard et al. (2022). Each case in the dataset includes a fact description accompanied by a complete judgment encompassing three labels: law article, charge, and prison term.
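The two training objectives can be sketched in numpy: a standard cross-entropy for the predictive model, and an InfoNCE-style contrastive loss of the kind used for dense retrievers (Izacard et al., 2022), which pulls a query embedding toward its positive passage and away from negatives. The temperature value is an illustrative choice, not a hyperparameter from the paper:

```python
import numpy as np

def cross_entropy(logits, target):
    # Cross-entropy loss for one example of the predictive model.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def contrastive_loss(query, positive, negatives, tau=0.05):
    # InfoNCE-style loss: the positive passage competes against the
    # negatives under a softmax over similarity scores.
    sims = np.array([query @ positive] + [query @ n for n in negatives]) / tau
    z = sims - sims.max()
    return -(z[0] - np.log(np.exp(z).sum()))

q = np.array([1.0, 0.0])
# Loss is small when the positive is close to the query...
loss_close = contrastive_loss(q, np.array([0.9, 0.1]), [np.array([0.0, 1.0])])
# ...and large when the positive is far and a negative is close.
loss_far = contrastive_loss(q, np.array([0.0, 1.0]), [np.array([0.9, 0.1])])
```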

Experiments
To mitigate the potential data leakage during the training of LLMs, which were trained on corpora collected until September 2021, we have compiled a new dataset called CJO22. This dataset exclusively contains legal cases that occurred after 2022, sourced from the same origin as CAIL2018. However, due to its limited size, the newly collected CJO22 dataset is inadequate for training the domain models. Consequently, we utilize it solely as an additional test set. To facilitate meaningful comparisons, we retain only the labels that are common to both datasets, considering that the labels may not be entirely aligned.
Tab. 1 shows the statistics of the processed datasets, and all the experiments are conducted on the same datasets. For the CAIL2018 dataset, we randomly divide it into training, validation and test sets according to the ratio of 8:1:1. The previous cases in the case database are sampled from the training set, and we set their number to 4,000.
For PLJP, we take CNN and BERT as the predictive models, and take text-davinci-003 as the implementation of the LLM, yielding PLJP(CNN) and PLJP(BERT). The top-k accuracy of CNN and BERT is shown in the Appendix. Considering the length limit of the prompt, we set the number of precedents to 3.
We also run ablation experiments as follows: PLJP w/o p refers to the removal of precedents, where labels are predicted solely from the candidate labels using the LLM; PLJP w/o c denotes that we remove the candidate labels and predict the label only with the fact description and precedents; PLJP w/o d means we predict the three labels independently instead of considering the dependencies among the three sub-tasks; PLJP w/o r denotes that we find precedents based on the raw fact instead of the reorganized fact; PLJP w/ e means we let the LLMs generate an explanation of the prediction as well.

Experiment Settings
Here we describe the implementation of PLJP in our experiments. Note that all the LLMs and domain models are replaceable in the PLJP framework.
In the experiments, for the LLMs, we directly use the APIs provided by OpenAI. For the domain models, we use the unsupervised dense retrieval model (Izacard et al., 2022) in precedent retrieval, which gets the precedents from the case database according to the reorganized facts. For other domain models such as TopJudge and NeurJudge, we use the training settings from the original papers.

Given Case (example from Fig. 4): On June 23, 2021, defendant A leased a loader from D for the Z Project, agreeing to a monthly rent of $15,000. A signed a machinery leasing contract and subsequently sold the loader the next day for $70,000. A paid a total of $32,000 in rent to D. The Price Service Center determined that the deceived loader was worth $61,167. Around July 23, 2021, defendant A leased another loader from F for Project Z, agreeing to a monthly rent of $15,000. After towing the loader to E, A sold it to H on July 24 for $54,000. On July 26, 2021, A leased a Y loader from F for the same project, agreeing to a monthly rent of $15,000. On the same day, A towed the vehicle to E and sold it to H for $40,000. Later, at F's request, A and F signed a loader lease contract. A paid a total of $28,000 in rent to F. The Price Service Center determined that the deceived loaders were worth $54,167 and $46,200, respectively. The fraudulently obtained funds were used by A to settle debts, cover living expenses, and indulge in lavish spending. It was also confirmed that defendant A surrendered to the public security authorities on August 3, 2021.

Experiment Results
We analyze the experimental results in this section.
Results of judgment prediction: From Tab. 2, Tab. 3 and Tab. 4, we have the following observations: 1) The LLMs alone do not perform well on the prediction tasks, especially when the label has no actual meaning (e.g., the index of the law article and the prison term). 2) By applying our PLJP framework with the collaboration of LLMs and domain models, the simple models (e.g., CNN, BERT) gain significant improvement. 3) The model performance on CJO22 is lower than that on CAIL2018, which shows the challenge of the newly constructed test set. 4) PLJP(BERT) achieves the best performance on almost all evaluation metrics on both the CAIL2018 and CJO22 test sets, which proves the effectiveness of PLJP. 5) Compared to the prediction of the law article and charge, the prediction of the prison term remains a more challenging task. 6) The reported results of the LJP baselines are not as good as in the original papers; this may be because we keep all the low-frequency labels instead of removing them as the original papers did.
Results of ablation experiments: From Tab. 5, we can conclude that: 1) The performance gap between PLJP w/o p and PLJP demonstrates the effect of the precedents. 2) The results of PLJP w/o c prove the importance of the candidate labels.
3) Considering the topological dependence of the three sub-tasks benefits the model performance, as PLJP w/o d shows. 4) When we use the raw fact instead of the reorganized fact, the performance drops (e.g., the Acc of the prison term on CJO22 drops from 45.32% to 36.27%). 5) If we force the LLMs to generate an explanation of the prediction, the performance also drops a bit. We put cases with explanations in the Appendix.
From Fig. 3, we can see that the performance of PLJP improves as the number of precedents increases, which further proves the effectiveness of injecting precedents into LJP.

Case Study
Fig. 4 shows an intuitive comparison among the three methods in the process of charge prediction. Based on the fact description of the given case, the domain models provide candidate charges with the corresponding precedents. As the case shows, the defendant committed fraud by selling loaders that were rented from other people. However, since the fact description contains the word "contract", baselines (e.g., R-Former and BERT) can be misled and predict the wrong charge of "Contract Fraud". Through an in-context precedent comprehension by the LLMs, PLJP(BERT) distinguishes the differences between the precedents and the given case (e.g., the crime does not occur during the contracting process, and the contract is only a means to commit the crime), and gives the right result of "Fraud".

Conclusion and Future Work
In this paper, we address the important task of legal judgment prediction (LJP) by taking precedents into consideration. We propose a novel framework called precedent-enhanced legal judgment prediction (PLJP), which combines the strengths of both LLMs and domain models to better utilize (i.e., retrieve and comprehend) the precedents. Experiments on the real-world dataset prove the effectiveness of PLJP.
Based on PLJP, in the future, we can explore the following directions: 1) Develop methods to identify and mitigate any biases that could affect the predictions and ensure fair and equitable outcomes. 2) Validate the effectiveness of LLM and domain-model collaboration in other vertical domains such as medicine and education.

Ethical Discussion
With the increasing adoption of Legal AI in the field of legal justice, there has been a growing awareness of the ethical implications involved.The potential for even minor errors or biases in AI-powered systems can lead to significant consequences.
In light of these concerns, we have to make clear that our work is an algorithmic exploration and will not be directly used in court for now. Our goal is to provide suggestions to judges rather than to make final judgments without human intervention. In practical use, human judges should be the final safeguard to protect judicial fairness. In the future, we plan to study how to identify and mitigate potential biases to ensure the fairness of the model.

Limitations
In this section, we discuss the limitations of our work as follows: • We only interact with the LLMs for one round at a time. The LLMs are capable of multi-round interaction (e.g., Chain of Thought), which may help them better understand the LJP task.
• We validate the effectiveness of LLM and domain-model collaboration in the legal domain. It is worthwhile to explore such collaboration in other vertical domains such as medicine and education, as well as on other legal datasets (e.g., datasets from the Common Law system).

A Appendices
A.1 Top-k Accuracy

Figure 1: An illustration of the judicial process. Our motivation is to promote the collaboration between the domain model and the LLM (right part) to simulate the judicial process of the human judge (left).

Figure 2: The overall framework of PLJP, where sub, obj and ex refer to the subjective motivation, objective behavior and ex post facto circumstances, respectively. The solid lines show the precedent retrieval process, while the dotted lines represent the prediction process.

Figure 3: The Ma-F of PLJP with different numbers of precedents.
Candidate labels from domain models: Contract Fraud, Fraud, Extortion.
Precedent for Extortion — Sub: Defendant A deliberately attacked the victim. Obj: Defendant A entered the home of victim B with a knife, beat him, and destroyed the property in B's home, causing mental and property damage. Ex: None.
Precedent for Fraud — Sub: Defendant A deliberately defrauded the victim for the purpose of illegal possession. Obj: Defendant A defrauded the victim of $3,000 on the grounds that he could apply for a driver's license and squandered the stolen money. Ex: Defendant surrendered to the public security.
Precedent for Contract Fraud — Sub: Defendant A deliberately defrauded the car in the name of another person. Obj: In the process of signing the contract, the defendant fraudulently used the name of another person to obtain 2 cars. Ex: None.
Predicted Judgment — PLJP(BERT): Fraud ✅; BERT: Contract Fraud ❌; R-Former: Contract Fraud ❌

Figure 4: The charge prediction of a given case. The green parts are useful information for prediction, while the red parts are content that can be confused by the domain models.

Figure 8: The top-k accuracy of BERT on the CJO22 dataset.

Table 1: Statistics of the datasets.

Table 2: Results of law article prediction; the best is bolded and the second best is underlined.

Table 3: Results of charge prediction; the best is bolded and the second best is underlined.

Table 4: Results of prison term prediction; the best is bolded and the second best is underlined.

Table 5: Results of ablation experiments; the best is bolded and the second best is underlined.
Figure: The top-k accuracy of CNN on the CJO22 dataset.
Figure: The top-k accuracy of BERT on the CAIL dataset.