Exploiting Contrastive Learning and Numerical Evidence for Confusing Legal Judgment Prediction

Given the fact description of a legal case, legal judgment prediction (LJP) aims to predict the case's charge, applicable law article, and term of penalty. A core problem of LJP is how to distinguish confusing legal cases, where only subtle textual differences exist. Previous studies fail to distinguish different classification errors with a standard cross-entropy classification loss and ignore the numbers in the fact description when predicting the term of penalty. To tackle these issues, we first propose a MoCo-based supervised contrastive learning method to learn distinguishable representations, and explore the best strategy for constructing positive example pairs so as to benefit all three subtasks of LJP simultaneously. Second, to exploit the numbers in legal cases for predicting the penalty terms of certain cases, we further enhance the representation of the fact description with extracted crime amounts, which are encoded by a pre-trained numeracy model. Extensive experiments on public benchmarks show that the proposed method achieves new state-of-the-art results, especially on confusing legal cases. Ablation studies further demonstrate the effectiveness of each component.

One core problem hindering LJP performance is confusing legal cases, which have subtle textual or numerical differences but entirely different charges, applicable law articles, or terms of penalty. Figure 1 shows two examples of confusing legal cases. Figure 1(a) shows an example whose gold charge label is Crime of Picking a Quarrel, which is easily misclassified as the confusing charge Crime of Intentional Injury. Figure 1(b) shows an erroneous prediction of the term of penalty made without exploiting the amounts related to the crime. A series of studies has addressed this problem, including manual annotation of discriminative legal attributes (Hu et al., 2018), learning distinguishable representations via graph neural networks (Xu et al., 2020), and separating the fact description into different circumstances for different subtasks (Yue et al., 2021).
However, we argue that these studies have two drawbacks. First, all of them use a standard cross-entropy classification loss, which cannot distinguish different kinds of classification errors. For example, misclassifying the charge Crime of Picking a Quarrel as its confusing charge Crime of Intentional Injury is penalized the same as misclassifying it as an unrelated charge such as Crime of Rape, whereas the model should be punished more for confusing a charge with its corresponding confusing charges. Second, the crime amounts in the fact description are crucial evidence for predicting the penalty terms of certain types of cases, such as financial legal cases. However, these amounts are scattered randomly throughout the fact description, making it difficult for the model to directly deduce the precise total crime amount and predict the correct penalty term from the scattered numbers.
To tackle these issues, we present a framework that leverages numerical evidence and momentum contrast-based supervised contrastive learning for confusing LJP. First, the framework formulates the extraction of numerical evidence (the total crime amount) from the fact description as a named entity recognition (NER) task for predicting the term of penalty, where the recognized numbers make up the total crime amount. This formulation sidesteps the difficulty of directly deducing the precise total crime amount from a fact description in which numbers are scattered randomly and only some of them belong to the crime amount. The extracted numerical evidence is then infused into the term-of-penalty prediction model while preserving its numeracy, which is achieved with a pre-trained number encoder (Sundararaman et al., 2020).
Second, in order to pull fact representations of the same class closer and push apart fact representations of confusing charges, the framework introduces momentum contrast-based supervised contrastive learning (SCL; Khosla et al., 2020; Gunel et al., 2020; Suresh and Ong, 2021) and explores the best strategy for constructing positive example pairs so as to benefit all three subtasks of LJP simultaneously. The proposed momentum contrast-based SCL addresses two challenges in applying the original in-batch SCL to LJP. The first challenge is that the number of charge classes is significantly greater than in previous studies (Gunel et al., 2020) (e.g., 119 classes for charge prediction), which makes it difficult to find sufficient negative examples within mini-batches. To address this, we introduce a large momentum-updated queue for SCL, which provides sufficient negative examples. The second challenge is a Contradictory Phenomenon that arises when applying the original single-task SCL to multi-task LJP: instances with the same charge label may have different applicable law or penalty term labels. If we pair training instances with the same charge label as positive examples, the resulting learned shared features will benefit charge prediction but degrade the performance of the other two tasks. To tackle this challenge, we explore the best way to construct positive examples so that all three subtasks of LJP benefit simultaneously.
The proposed framework offers the following merits for confusing LJP: 1) it is model-agnostic and can be used to improve any existing LJP model; 2) the extracted numerical evidence makes the predictions of the term of penalty more interpretable, which is critical for legal judgment prediction; and 3) compared with previous studies (Hu et al., 2018), the use of supervised contrastive learning requires no additional manual annotation.
We conduct extensive experiments on two real-world datasets (i.e., CAIL-Small and CAIL-Big). The experimental results demonstrate that the proposed framework achieves new state-of-the-art results, obtaining up to a 1.6 F1-score improvement on confusing legal cases and a 3.73 F1-score improvement on number-sensitive legal cases. Ablation studies also demonstrate the effectiveness of each component of the framework.

Legal Judgment Prediction
In recent years, with the increasing availability of public benchmark datasets (Xiao et al., 2018a; Feng et al., 2022b) and the development of deep learning, LJP has become one of the hottest topics in legal artificial intelligence (Yang et al., 2019; Zhong et al., 2020; Gan et al., 2021b; Cui et al., 2022; Feng et al., 2022a; Lyu et al., 2022). Our work focuses on confusing legal judgment prediction, a typical difficulty in LJP. To address this challenge, Hu et al. (2018) manually annotate discriminative attributes for legal cases and generate attribute-aware representations for confusing charges via an attention mechanism. LADAN (Xu et al., 2020) extracts distinguishable features for law articles by removing similar features between nodes through a graph neural network-based operator. NeurJudge (Yue et al., 2021) utilizes the results of intermediate subtasks to separate the fact description into different circumstances and exploits them to make the predictions of the other subtasks.

Contrastive Learning
The purpose of contrastive learning (Chopra et al., 2005) is to bring similar examples closer together and push dissimilar examples further apart in the feature space. Contrastive learning has been widely explored for self-supervised/unsupervised representation learning (Wu et al., 2018; Hjelm et al., 2018; Bachman et al., 2019; Chen et al., 2020; He et al., 2020; Nguyen and Luu, 2021; Wu et al., 2022).
Recently, several studies have extended contrastive learning to supervised settings (Gunel et al., 2020; Khosla et al., 2020; Suresh and Ong, 2021; Zhang et al., 2022; Nguyen et al., 2022), where examples in the mini-batch that share a label are regarded as positive examples for computing an additional contrastive loss. In contrast to previous studies, we present a framework that leverages MoCo-based supervised contrastive learning together with numerical evidence, which earlier studies on confusing LJP have neglected.

Background
In this section, we formalize the LJP task and its multi-task learning framework.

Problem Formulation
Let $f = \{s_1, s_2, \dots, s_N\}$ denote the fact description of a case, where each sentence $s_i = \{w_1, w_2, \dots, w_M\}$ contains $M$ words and $N$ is the number of sentences. Given a fact description $f$, the LJP task aims to predict the case's charge $y^c \in C$, applicable law article $y^l \in L$, and term of penalty $y^t \in T$.

Multi-Task Learning Framework of LJP
While previous studies have designed various neural architectures for LJP, these models can be boiled down to the multi-task learning framework shown in Figure 2. Specifically, first, a shared fact encoder encodes $f$ into a basic legal document representation. Second, task-specific encoders transform this shared representation into the task-specific representations $H_c$, $H_l$, and $H_t$.
Third, based on these task-specific representations, separate classification heads (e.g., multilayer perceptrons) with cross-entropy classification losses compute the losses $\ell_c$, $\ell_l$, and $\ell_t$ for the three tasks. The training objective is the sum of the per-task losses: $\ell = \ell_c + \ell_l + \ell_t$.

Figure 3 provides an overview of the proposed framework. Given a fact description $f$, on the one hand, a well-trained BERT-CRF-based named entity recognition model extracts the total crime amount from $f$ as numerical evidence, which is then encoded into representations for predicting the term of penalty of $f$. On the other hand, MoCo-based supervised contrastive learning for LJP, together with two strategies for constructing positive example pairs, is introduced to compute the contrastive loss. The final training loss is the weighted sum of the contrastive loss and the three standard cross-entropy classification losses of the subtasks.
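As a concrete sketch of the multi-task objective above, the overall loss is just the sum of three per-task cross-entropy losses. The following plain-Python example uses hypothetical logit vectors rather than a real encoder:

```python
import math

def cross_entropy(logits, gold):
    """Cross-entropy for one instance: -log softmax(logits)[gold]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold]

def ljp_loss(charge_logits, law_logits, term_logits, y_c, y_l, y_t):
    """Multi-task training objective: the sum of the three per-task
    cross-entropy losses, l = l_c + l_l + l_t."""
    return (cross_entropy(charge_logits, y_c)
            + cross_entropy(law_logits, y_l)
            + cross_entropy(term_logits, y_t))
```

Because the three losses are simply summed, no task is weighted over another in this baseline objective.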

Numerical Evidence for Term of Penalty Prediction
Numerical Evidence Extraction as NER. In the LJP datasets, no explicit total crime amount is provided for each instance. To address this, we formalize the calculation of the total crime amount as a named entity recognition (NER) task, in which the scattered numbers in the fact description that are part of the total crime amount are recognized as named entities. The sum of the recognized numbers is then regarded as the final crime amount. The rationale for this formalization is that recognizing which numbers are part of the crime amount is easier than directly computing the total crime amount from the fact description. Specifically, we train the numerical extraction model on the dataset of the Crime Amount Extraction (CAE) task (http://data.court.gov.cn/pages/laic2021.html). To train the NER model, given an instance (f, T) in CAE, where f and T denote the fact description and the crime amount, respectively, we need to convert each instance into the NER format. To reduce expensive manual annotation costs, we propose a 0-1 knapsack algorithm to automatically label named entities in f: the algorithm finds a set of sentences in f whose numbers sum to the crime amount T, and the numbers in the selected sentences are then labeled as named entities. Algorithm 1 illustrates this construction process, and Figure 6 shows an example of converting a CAE instance into the NER format. The converted dataset, named CAE-NER, is used to train a state-of-the-art BERT-CRF NER model, referred to as the numerical evidence extraction model.

Algorithm 1: The 0-1 knapsack algorithm used for automatically constructing the CAE-NER dataset.
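The sentence-selection core of Algorithm 1 can be sketched as a subset-sum search (the special case of 0-1 knapsack used here). The sketch below simplifies by assuming one amount per sentence; the function names are illustrative, not from the paper's code:

```python
def select_sentences(sentence_amounts, total):
    """Find a subset of sentences whose amounts sum exactly to the crime
    amount `total` (subset-sum, a 0-1 knapsack special case).
    Returns the indices of the selected sentences, or None."""
    # reachable[s] = indices of sentences achieving the partial sum s
    reachable = {0: []}
    for i, amt in enumerate(sentence_amounts):
        # iterate over a snapshot of current sums, largest first,
        # so each sentence is used at most once
        for s in sorted(reachable.keys(), reverse=True):
            t = s + amt
            if t <= total and t not in reachable:
                reachable[t] = reachable[s] + [i]
    return reachable.get(total)

def label_entities(sentence_amounts, total):
    """Weak NER labels: numbers in the selected sentences become entities."""
    chosen = select_sentences(sentence_amounts, total) or []
    return ["MONEY" if i in chosen else "O" for i in range(len(sentence_amounts))]
```

For instance, with sentence amounts 900, 2,600, and 2,000 and a crime amount of 3,500 (the Figure 6 example), the first two numbers are labeled as entities while the returned 2,000 is excluded.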
Numerical Evidence Encoder. Given the extracted numerical evidence, we need a numerical evidence encoder (NumEncoder) capable of encoding the evidence into hidden representations while preserving its numerical significance. To achieve this, we pre-train NumEncoder with the following principle: the cosine similarity of the learned representations of a pair of numbers should be linearly related to their numerical distance. Specifically, given automatically generated training data of number pairs $(x_i, y_i)$, we optimize the parameters of the LSTM-based NumEncoder with a training objective $\ell_{num}$ that drives the cosine similarity $\cos(h_{x_i}, h_{y_i})$ of the two numbers' representations toward a target that decreases linearly with their numerical distance $|x_i - y_i|$, where $\cos$ denotes the cosine similarity function.
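A toy version of this pre-training objective can be written down directly from the linearity principle. The exact loss form and the distance normalization below are illustrative assumptions, and `encode` stands in for the LSTM-based NumEncoder:

```python
import math

def cos_sim(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def numeracy_loss(pairs, encode, max_dist=10_000.0):
    """Illustrative pre-training objective: push the cosine similarity of
    two number embeddings toward a target that decreases linearly with
    their numerical distance (the paper's exact loss may differ)."""
    loss = 0.0
    for x, y in pairs:
        target = 1.0 - min(abs(x - y), max_dist) / max_dist
        loss += (cos_sim(encode(x), encode(y)) - target) ** 2
    return loss / len(pairs)
```

Under this objective, identical numbers should receive identical embeddings (target similarity 1), while distant numbers are pushed toward orthogonal representations.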
Infusing Numerical Evidence for Predicting the Term of Penalty. Lastly, we infuse the representations of the numerical evidence into the term-of-penalty prediction model. Specifically, given a training instance $(f, y^c, y^l, y^t, m)$, its numerical evidence $m$ is encoded by the pre-trained number encoder and then fused into the term-of-penalty prediction head via concatenation, $[H_t; \mathrm{NumEncoder}(m)]$, where $[;]$ denotes the concatenation operation.
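A minimal sketch of this fusion step, assuming a simple linear prediction head for clarity (the actual head is a classification head such as an MLP; the names here are illustrative):

```python
def linear_head(x, weights, bias):
    """A minimal classification head: logits = W x + b."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def predict_term_logits(h_t, num_repr, weights, bias):
    """Term-of-penalty head over the fused representation [H_t ; NumEncoder(m)].
    Here vectors are plain lists, so `+` is the concatenation [;]."""
    fused = list(h_t) + list(num_repr)
    return linear_head(fused, weights, bias)
```

The key point is that the head now sees both the fact representation and the numeracy-preserving amount embedding, so penalty classes that hinge on the crime amount become linearly separable in the fused space.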

MoCo-based Supervised Contrastive Learning for Confusing Judgment Prediction
To address the challenges of the large number of classes and the multi-task nature of LJP when applying the original in-batch SCL, we introduce momentum contrast (MoCo) (He et al., 2020)-based SCL. Furthermore, we explore the best way to construct positive examples so that they benefit all three subtasks of LJP simultaneously.
First, we augment the standard in-batch SCL with a large momentum-updated queue (He et al., 2020), which provides sufficient samples for computing the contrastive loss. Specifically, we maintain a feature queue $Q$ and a label queue $L$ to store sample features and the corresponding labels. For each example $\langle e_i, l_i \rangle$ in the mini-batch $I$, we select positive and negative samples from $Q$ based on the labels in $L$ and compute the supervised contrastive loss as follows:

$$\ell_{sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(q_i \cdot k_p / \tau)}{\sum_{a \in A(i)} \exp(q_i \cdot k_a / \tau)}, \qquad (11)$$

where $P(i) = \{t \mid y_t = y_i, t \in L\}$, $A(i) = \{t \mid t \in L\}$, and $\tau$ is a temperature hyperparameter. Here $q_i$ is the query feature encoded by a query encoder $f_q(\cdot\,; \theta_q)$, and $k_p, k_a \in Q$ are key features encoded by a key encoder $f_k(\cdot\,; \theta_k)$. The parameters $\theta_k$ are smoothly updated as

$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q,$$

where $m$ is the momentum coefficient. Samples in $Q$ and $L$ are progressively replaced by the current mini-batch following a first-in-first-out strategy. In the ablation section, this MoCo-based SCL shows clear advantages over the standard in-batch SCL.
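A plain-Python sketch of the queue-based loss of Eq. 11 (for a single query, with the queue assumed to already hold key features and labels) and of the momentum update:

```python
import math

def supcon_loss(query, query_label, queue_feats, queue_labels, tau=0.07):
    """Supervised contrastive loss for one query against a MoCo queue:
    positives P(i) are queue entries sharing the query's label, and the
    denominator runs over all queue entries A(i)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    logits = [dot(query, k) / tau for k in queue_feats]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    positives = [x for x, y in zip(logits, queue_labels) if y == query_label]
    if not positives:
        return 0.0
    return -sum(x - log_denom for x in positives) / len(positives)

def momentum_update(theta_k, theta_q, m=0.999):
    """Key-encoder parameters track the query encoder smoothly:
    theta_k <- m * theta_k + (1 - m) * theta_q."""
    return [m * k + (1 - m) * q for k, q in zip(theta_k, theta_q)]
```

The loss is small when the query aligns with its same-label keys and large when it aligns with keys of other labels, which is exactly the pull-together/push-apart behavior the framework needs for confusing charges.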
Next, we explore two strategies for constructing positive example pairs that address the multi-task learning challenge of LJP.
Strategy I. A straightforward strategy is to compute a contrastive loss for each subtask of LJP and sum them into one loss. Formally, three feature queues $Q_c$, $Q_l$, and $Q_t$ store the task-specific features $H_c$, $H_l$, and $H_t$, and three label queues $L_c$, $L_l$, and $L_t$ store the subtask labels $y^c$, $y^l$, and $y^t$. The overall contrastive loss is

$$\ell_{cl} = \ell^c_{sup} + \ell^l_{sup} + \ell^t_{sup}, \qquad (13)$$

where $\ell^c_{sup}$, $\ell^l_{sup}$, and $\ell^t_{sup}$ are the per-subtask contrastive losses computed by Eq. 11. The final training objective of Strategy I is

$$\ell_{I} = \ell_c + \ell_l + \ell_t + \lambda\,\ell_{cl}, \qquad (14)$$

where $\lambda$ is a hyperparameter.

Strategy II. Closely examining Eq. 13, we observe the Contradictory Phenomenon discussed in Sec. 1. In Strategy I, $\ell^{task}_{sup}$ treats instances with the same subtask label (e.g., the same charge label) as positive examples. However, these instances may differ in their other subtask labels (e.g., applicable law or term-of-penalty labels). As a result, $\ell^{task}_{sup}$ forces the shared fact encoder to learn features that benefit one subtask but degrade the performance of the other two.
To solve this problem, we treat as positive examples only those instances whose three subtask labels are all the same, and impose the MoCo-based SCL on the shared features $H_f$. Specifically, we use a feature queue $Q_B$ to store the shared features $H_f$, and three label queues $L_c$, $L_l$, and $L_t$ to store the three subtask labels. The positive sample set for instance $i$ is then $P(i) = \{q \mid L_c(q) = y^c_i, L_l(q) = y^l_i, L_t(q) = y^t_i, q \in Q_B\}$, where $L_{task}(q)$ denotes the label at index $q$ in the corresponding task label queue. Based on $P(i)$, we compute the contrastive loss with Eq. 11, denoted $\ell^B_{sup}$. Strategy II resolves the Contradictory Phenomenon of Strategy I and improves the performance of all three subtasks.
The final training objective of Strategy II is

$$\ell_{II} = \ell_c + \ell_l + \ell_t + \lambda\,\ell^B_{sup}, \qquad (15)$$

where $\lambda$ is a hyperparameter.
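The difference between the two strategies comes down to how positives are selected. A sketch of Strategy II's selection rule, where a queue entry counts as a positive only if all three subtask labels match (the data layout here is an illustrative assumption):

```python
def strategy2_positives(i, labels, queue_labels):
    """Strategy II: a queue entry q is a positive for instance i only if its
    charge, law-article, and term labels ALL equal instance i's labels.
    `labels[i]` and each `queue_labels[q]` are (y_c, y_l, y_t) triples."""
    y_c, y_l, y_t = labels[i]
    return [q for q, (c, l, t) in enumerate(queue_labels)
            if c == y_c and l == y_l and t == y_t]
```

Requiring agreement on all three labels ensures that pulling two instances together in the shared feature space can never push any single subtask in the wrong direction, which is why Strategy II avoids the Contradictory Phenomenon.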

Datasets
To evaluate the effectiveness of our framework, we conduct experiments on two real-world datasets, CAIL-Small and CAIL-Big (Xiao et al., 2018b). Each instance in both datasets contains one fact description, one applicable law article, one charge, and one term of penalty. To ensure a fair comparison, we use the code released by Xu et al. (2020) to process the data, and all models are trained on the same data. The Crime Amount Extraction (CAE) dataset is also a real-world dataset, from the Chinese Legal AI challenge. Table 1 shows the statistics of the datasets used.
To specifically evaluate our framework on confusing legal cases, we define a set of confusing and number-sensitive charges. Due to page limitations, the definitions and statistics of these charges are listed in Table 9 in the Appendix.
Further training details for the BERT-CRF NER model, the LJP model, and the NumEncoder model are given in Table 8 in the Appendix.

Development Experiments
To empirically determine which strategy is better for performing SCL in multi-task LJP, we conduct development experiments on CAIL-Small with the LADAN backbone. The results are listed in Table 2. First, both Strategy I and Strategy II improve LADAN's performance across the three subtasks. However, the gains of Strategy I are much smaller than those of Strategy II, which verifies the existence of the Contradictory Phenomenon in Strategy I. We also explore the effect of combining the two strategies, i.e., using $\ell_{cl} + \ell^B_{sup}$ as the supervised contrastive loss; the improvement from this combination is not significant. Consequently, in the remaining experiments we adopt Strategy II for the contrastive loss unless otherwise specified. The final method combining the MoCo-based SCL and numerical evidence is denoted NumSCL.

Main Results
To evaluate the effectiveness of the proposed framework, we augment each baseline with NumSCL and conduct experiments on the CAIL-Small and CAIL-Big datasets. Due to the expensive training cost and the large size of the training dataset, we do not evaluate CrimeBERT on CAIL-Big, following Yue et al. (2021). The results are listed in Table 3 and Table 4.

| Model | Acc. | MP | MR | F1 | Acc. | MP | MR | F1 | Acc. | MP | MR | F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HARNN | 84.54 | 82.56 | 82.94 | 82.26 | 80.09 | 76.46 | 77.69 | 75.95 | 38.38 | 36.12 | 33.99 | 34.32 |
| w/ NumSCL | 85.26 | 83.93 | 83.76 | 83.39 | 81.07 | 77.95 | 78.52 | 77.11 | 39.18 | 37.32 | 34.50 | 35.03 |
| LADAN_MTL | 84.90 | 82.55 | 83.26 | 82.42 | 80.38 | 75.84 | 77.84 | 75.67 | 38.21 | 35.95 | 34.01 | 34.28 |
| w/ NumSCL | 85.37 | 83.91 | 84.04 | 83.57 | 81.32 | 78.06 | 78.59 | 77.24 | 39.38 | 37.95 | 35.23 | 35.95 |
| NeurJudge+ | 83.25 | 82.11 | 81.69 | 81.30 | 80.95 | 77.93 | 78.59 | 77.00 | 37.88 | 37.20 | 33.82 | 34.92 |
| w/ NumSCL | 84.45 | 83.30 | 83.55 | 82.88 | 81.12 | 78.10 | 78.98 | 77.32 | 39.65 | 39.48 | 34.65 | 36.22 |
| CrimeBERT | 86.61 | 85.04 | 84.72 | 84.51 | 82.33 | 79.38 | 79.72 | 78.46 | 39.34 | 38.66 | 35.48 | 36.58 |
| w/ NumSCL | 85.91 | 85.71 | 85.98 | 85.54 | 82.63 | 80.10 | 80.88 | 79.50 | 39.72 | 38.50 | 35.84 | 36.67 |

Table 3: Main results on CAIL-Small; the three column groups report charge, law article, and term-of-penalty prediction, respectively. Acc., MP, and MR are short for accuracy, macro precision, and macro recall.

From Table 3 and Table 4, we make the following observations. First, the proposed framework improves all the baselines and achieves new state-of-the-art results on the two datasets. Specifically, on CAIL-Small the absolute improvements reach up to 1.15, 1.57, and 1.67 F1 points for charge, law article, and term-of-penalty prediction, respectively. Second, on CAIL-Big the gains are smaller, with absolute improvements of 0.54, 1.16, and 0.53 F1 points for charge, law article, and term-of-penalty prediction, respectively. Third, on CAIL-Big, NeurJudge+ performs worse than the other baselines; we hypothesize that its complex neural architecture leads to overfitting on the CAIL-Big dataset. Lastly, for CrimeBERT, our framework still obtains absolute improvements of 1.03 and 1.04 F1 points
for charge and law article prediction. The overall gain on term-of-penalty prediction is slight; however, when evaluating specifically on number-sensitive legal cases, the improvement still reaches 1.79 F1 points, as shown in Table 5.

Ablation Studies
Effect of Numerical Evidence for Number-Sensitive Legal Cases. To examine the effect of numerical evidence on predicting the term of penalty, we conduct ablative experiments.

Table 5: Effects of the proposed method on term-of-penalty prediction for number-sensitive legal cases. Num. F1 is the F1 score on the defined number-sensitive charges.

As
shown in Table 5, the improvement from MoCo-based SCL alone is relatively small, yielding only 0.11 and 0.05 F1-point improvements for HARNN and NeurJudge+ on number-sensitive legal cases. However, when the models are additionally provided with the extracted numerical evidence, the F1 scores of all the baselines improve considerably; in particular, LADAN_MTL obtains a 3.73 F1-point improvement on number-sensitive cases. These results show that the extracted crime amount is especially beneficial for number-sensitive legal cases.

Effect of the Momentum Contrast Queue. We use LADAN as the backbone to compare NumSCL with the in-batch SCL, which takes the current mini-batch as the lookup dictionary for computing the contrastive loss. The in-batch SCL is trained with the same parameters as the MoCo-based SCL. As shown in Table 7, the in-batch SCL improves LADAN but underperforms NumSCL on the charge and law article prediction tasks, which highlights the advantage of employing a large queue as the lookup dictionary in SCL. We also observe that term-of-penalty performance is not significantly affected by the choice of lookup dictionary, which aligns with the earlier finding that the extracted numerical evidence matters more than SCL for term-of-penalty prediction.

Effect of Contrastive Learning for Confusing Charges. We conduct experiments to validate the effectiveness of the proposed contrastive learning on confusing charges; the results are reported in Table 6.
Effect of λ. We carry out experiments to verify the impact of λ in Eq. 15. As illustrated in Figure 5, as λ increases, the performance of charge and law article prediction improves correspondingly. We also observe fluctuation in term-of-penalty prediction, again showing that the extracted numerical evidence plays a more significant role than SCL in predicting the term of penalty.

Example 1
Fact Description: At XXX, the defendant XXX fought with the victim XXX over trivial matters in a market in XXX Town, XXX City, and wanted to take revenge on the victim XXX. At XXX of the same year, the defendant XXX gathered three people, including XXX, XXX, and XXX, and fled to the open-air parking lot next to a market. The defendant XXX slashed the victim XXX's right arm with a watermelon knife, and then they fled the scene. According to the forensic identification, the victim XXX suffered minor injuries.

Example 2
Fact Description: From XXX to XXX, the defendant XXX committed three thefts in XXX City. 1. At XXX, 2015, the defendant XXX stole a "Dayang brand" electric tricycle worth $5,600 from the victim XXX. 2. On XXX, 2015, the defendant XXX stole a "Dayang" electric tricycle worth $4,800 from the victim XXX in XXX City. 3. At XXX, 2016, the defendant XXX attempted to steal an electric vehicle battery from the victim XXX, but was discovered.

Case Studies
Figure 4 shows two cases that qualitatively demonstrate the effect of the proposed framework. In the first case, LADAN_MTL incorrectly predicts the case's charge as the confusing charge Crime of Intentional Injury instead of the correct Crime of Picking a Quarrel; with the proposed framework, this error is corrected. The second case demonstrates the effect of numerical evidence. LADAN_MTL incorrectly predicts the case's term of penalty as label 9, meaning a sentence of fewer than 6 months. Given the accurately extracted crime amount of $10,400, which is relatively large, the model correctly predicts the term of penalty as label 7, meaning a sentence of more than 9 months but less than 12 months.

Conclusion
In this paper, we present a framework that introduces MoCo-based supervised contrastive learning and weakly supervised numerical evidence for confusing legal judgment prediction. The framework automatically extracts numerical evidence for predicting number-sensitive cases and learns distinguishable representations that benefit all three subtasks of LJP simultaneously. Extensive experiments validate the effectiveness of the framework.

Limitations
While the 0-1 knapsack algorithm has the merit of automatically constructing a training dataset for the NER model, it cannot accurately calculate the crime amount when suspects return some property to the victims, since returned property should be subtracted from the crime amount. More sophisticated techniques could be developed to calculate the crime amount more precisely.
Our LJP research focuses on Chinese legal documents under the jurisdiction of the People's Republic of China. While the framework was developed and tested specifically for the three-task Chinese legal judgment prediction setting, we believe the underlying methodology could generalize to other LJP tasks, including those from different jurisdictions. However, this would likely require modifications to account for the unique characteristics and complexities of each jurisdiction's legal system. We leave this for future work.

Ethical Concerns
Due to the sensitive nature of the legal domain, applying artificial intelligence technology to the legal field should be treated with caution. To alleviate ethical concerns, we undertake the following initiatives. First, to prevent the leakage of personal private information from the evaluated real-world datasets, sensitive information such as the names of individuals and locations has been anonymized. Second, we suggest that the predictions generated by our model serve as supportive references to help judges make judgments more efficiently, rather than solely determining the judgments.
Fact Description: The court found that at XXX, the defendant XXX stole a black rectangular wallet and a white cell phone (Meitu M4 brand, worth $900 [MONEY]) from the victim XXX while he was asleep. The wallet contained cash of $2,600 [MONEY]. After being arrested, the defendant returned the aforementioned cell phone and cash of $2,000 to the victim.

Crime Amount: $3,500

The batch size is set to 128. We train the model for 16 epochs and select the best model on the validation set for testing. In contrastive learning, the MoCo queue size and the temperature τ are set to 65536 and 0.07, respectively. We run each experiment with five different seeds and report the averaged results.
Table 1 shows the detailed statistics of the used datasets.

D Definition of Confusing Charges.
To specifically evaluate our method on confusing legal cases, we define confusing charges using the predicted results of the baseline model LADAN_MTL. Concretely, if the number of instances that the model misclassifies from class A into class B exceeds a pre-defined threshold, classes A and B are added to the set of confusing charges. The definition of the number-sensitive charges is determined by an experienced legal expert. The statistics of the defined confusing charges and number-sensitive charges are listed in Table 9.
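The confusing-charge construction described above can be sketched as follows (the threshold value and charge names are illustrative):

```python
from collections import Counter

def confusing_charges(errors, threshold):
    """Derive the confusing-charge set from a model's errors: if class A is
    misclassified as class B more than `threshold` times, both A and B are
    marked as confusing. `errors` is a list of (gold, predicted) pairs."""
    counts = Counter((g, p) for g, p in errors if g != p)
    confusing = set()
    for (g, p), n in counts.items():
        if n > threshold:
            confusing.update((g, p))
    return confusing
```

Tying the definition to a specific baseline's error counts means the resulting set directly targets the class pairs that the model actually confuses, rather than pairs that merely look similar on paper.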

Figure 1: Two examples of confusing legal cases. Figure 1(a) shows an example whose gold charge label is Crime of Picking a Quarrel, which is easily misclassified as the confusing charge Crime of Intentional Injury. Figure 1(b) shows an erroneous prediction of the term of penalty made without exploiting the amounts related to the crime.

Figure 2: The multi-task learning framework of legal judgment prediction.

Figure 3: (a) Overview of the proposed framework. (b) Illustration of the training process of the numerical evidence extraction model. (c) Pre-training of the number encoder.
Figure 3(b) illustrates the training process of the numerical evidence extraction model.
Now, each instance in the LJP dataset can obtain a pseudo crime amount label m annotated by the well-trained numerical evidence extraction model, and each instance is denoted as a five-tuple (f, y^c, y^l, y^t, m).
Figure 3(c) illustrates the pre-training process of NumEncoder.

Figure 4: Qualitative examples demonstrating the effect of the proposed framework.

Figure 5: Effect of the hyperparameter λ on the performance of the three subtasks.

Figure 6: An example of converting an instance in the CAE dataset into a training sample for the NER model.


We use the code released by Xu et al. (2020) to conduct data preprocessing. The THULAC tool is used to segment Chinese text into words. The word embedding layer in the neural network is initialized with pre-trained word embeddings.

Table 2: Effects of different supervised contrastive learning strategies.

Table 6: Effects of the proposed method on the charge prediction of confusing legal cases. Conf. F1 is the F1 score on the defined confusing charges.

Table 9: Statistics of the defined confusing charges and number-sensitive charges.