Argumentation-Driven Evidence Association in Criminal Cases

Evidence association in criminal cases divides a set of judicial evidence into several non-overlapping subsets, improving the interpretability and legality of conviction. We observe that evidence assigned to the same subset usually supports the same claim. We therefore propose an argumentation-driven supervised learning method that calculates the distance between evidence pairs for the subsequent evidence association step. Experimental results on a real-world dataset demonstrate the effectiveness of our method.


Introduction
Previous work has put forward legal assistant systems with various functions, such as retrieving relevant cases for a given query (Chen et al., 2013) and predicting the legal judgement (Ye et al., 2018). Despite promising results in this area, research on judicial evidence in criminal cases has been largely neglected in recent years. The role of judicial evidence is to support several sub-claims in favour of conviction, and the evidence description is an essential part of a criminal judgement document. However, the organization of evidence varies across legal documents. Evidence descriptions mainly take one of two forms, the collection form and the argumentation-driven form, as shown in Figure 1. In most current criminal judgement documents, the evidence is only listed as a collection without explicit claims, which we call the collection form. Only in around 5% of criminal judgement documents is the evidence collection divided into several subsets according to the related claims, which we call the argumentation-driven form.
As shown in Figure 1, evidence divided into the same subset supports the same claim, and legal documents of this kind are more readable. Inspired by this observation, we propose to study the problem of evidence association in this paper.
Evidence association divides a set of judicial evidence into several non-overlapping subsets according to their corresponding claims, improving the interpretability and legality of conviction. To our knowledge, there has been very limited research on evidence association in the legal field.
Evidence association can be treated as a clustering problem. Existing short text clustering methods broadly fall into two categories: representation-based methods and semantic textual similarity methods (Xu et al., 2017; Reimers et al., 2019). Representation-based methods concentrate on extracting rich semantic representations and then calculate the cosine distance between text representations. Semantic textual similarity methods predict the distance between texts directly through supervised learning. However, the former perform poorly on very short texts, and the latter require manually labelled data from the same field. Because judicial evidence is short, we learn the distance metric directly from the probability that an evidence pair supports the same claim, which we regard as an argumentation-driven method. Another challenge is that the number of clusters varies from case to case. In this paper, we use agglomerative hierarchical clustering with a learned stopping threshold to avoid specifying the number of clusters. Our contributions are as follows:
1. We propose the task of evidence association in criminal cases, which is significant but has not been well studied before, and release a real-world dataset for this task.
2. We learn the distance metric with a supervised argumentation-driven method for subsequent clustering, without extra manual annotation.
3. Extensive experiments conducted on a real-world dataset show the effectiveness of our methods and provide a simple baseline for future research.
Figure 1: A real-world example of evidence descriptions in argumentation-driven form and collection form. The part before "prove that" is the evidence subset and the part after "prove that" is the corresponding claim.

Related works
The evidence association task is motivated by previous research on legal assistant systems, especially work on improving the interpretability of charge prediction (Ye et al., 2018). To our knowledge, there has been very limited research on evidence association in the legal field. One of the most closely related works is that of Poudyal et al. (2018), who use clustering techniques to identify argumentative sentences in legal documents; theirs, however, is a sentence-level task. As a part of argument mining, argument clustering aims to identify similar arguments. Boltužić and Šnajder (2015) identify similar arguments in online debates using semantic textual similarity. Ajjour et al. (2019) group arguments that emphasize a specific aspect of a controversial topic. Contextualized word embedding methods have been introduced for the classification and clustering of arguments in recent years (Reimers et al., 2019). In this paper, we mainly use the BERT (Devlin et al., 2019) and ESIM (Chen et al., 2017) models to learn the distance metric between evidence pairs.

Methodology
Given a set of evidence $E = \{e_1, e_2, \dots, e_n\}$ involved in a criminal case, we expect to split the set into several non-overlapping subsets. Each non-overlapping subset of evidence $E_k$ proves the same claim $c_k$. We first study latent argumentation-driven evidence association, where explicit claims are lacking. We also explore how to associate evidence more accurately when the explicit claim set $C = \{c_1, c_2, \dots, c_m\}$ involved in the criminal case is given; we define this as explicit argumentation-driven evidence association. A suitable clustering method and a meaningful distance between evidence pairs are crucial for evidence association.

Clustering Method
The number of clusters varies from case to case, so we cannot fix a specific cluster number in advance as the K-Means method requires. Instead, we cluster evidence via agglomerative hierarchical clustering (Day and Edelsbrunner, 1984) and learn a stopping threshold that determines when to stop merging two clusters, so the number of clusters never has to be specified.
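The threshold-based stopping rule can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes SciPy is available, uses average linkage (the paper does not state the linkage criterion), and the function name `cluster_evidence` and the toy distance matrix are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_evidence(dist, threshold, method="average"):
    """Group evidence by agglomerative clustering, cutting at `threshold`.

    dist: symmetric (n, n) matrix of pairwise evidence distances.
    method: linkage criterion; "average" is an assumption, not from the paper.
    Returns an array of n cluster ids (numbered from 1).
    """
    dist = np.asarray(dist, dtype=float)
    if dist.shape[0] < 2:
        # Nothing to merge: each item (if any) forms its own cluster.
        return np.ones(dist.shape[0], dtype=int)
    # linkage expects the condensed (upper-triangular) form of the matrix.
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method=method)
    # Merges whose linkage distance exceeds the threshold are not applied,
    # so the number of clusters falls out of the threshold.
    return fcluster(tree, t=threshold, criterion="distance")


# Toy case: items 0 and 1 are close, items 2 and 3 are close.
toy = np.array([
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.05],
    [0.80, 0.90, 0.05, 0.00],
])
labels = cluster_evidence(toy, threshold=0.5)
```

With this toy matrix the cut at 0.5 leaves two clusters, because the two groups would only merge at an average distance well above the threshold.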

Latent Distance
Without explicit claims, we can only use the information in the evidence pairs themselves to calculate the distance between them. Nogueira and Cho (2019) define the correlation between relevant query-passage pairs as 0 and irrelevant query-passage pairs as 1 on account of the lack of a labeled dataset. Similarly, we assume a smaller distance between two pieces of evidence that support the same claim. For simplicity, the distance between two pieces of evidence that support the same claim is labeled 0, and the distance between two pieces of evidence in the same criminal case that prove different claims is labeled 1. If p is the probability, predicted by the model, that the distance between an evidence pair is 0, then we simply regard the latent distance between the pair as 1 − p.
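The labeling scheme and the 1 − p conversion can be illustrated with a short sketch. The helper names `make_pair_labels` and `latent_distance_matrix`, the toy evidence strings, and the probabilities below are hypothetical; the trained pair model that would produce p is not shown.

```python
from itertools import combinations

import numpy as np


def make_pair_labels(evidence_subsets):
    """Build (evidence_a, evidence_b, label) triples for one case.

    evidence_subsets: list of lists; each inner list holds the evidence
    texts that support one claim. Label 0 = same claim, 1 = different claims.
    """
    tagged = [(text, k)
              for k, subset in enumerate(evidence_subsets)
              for text in subset]
    return [(a, b, 0 if ka == kb else 1)
            for (a, ka), (b, kb) in combinations(tagged, 2)]


def latent_distance_matrix(n, same_claim_probs):
    """Turn predicted same-claim probabilities into a distance matrix.

    same_claim_probs: dict mapping (i, j) with i < j to the model's
    probability p that the pair supports the same claim.
    """
    dist = np.zeros((n, n))
    for (i, j), p in same_claim_probs.items():
        dist[i, j] = dist[j, i] = 1.0 - p  # latent distance is 1 - p
    return dist


case = [["death certificate", "autopsy report"], ["scene photos"]]
pairs = make_pair_labels(case)
# Placeholder probabilities standing in for model predictions.
dist = latent_distance_matrix(3, {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.1})
```

Because every pair is drawn from within a single case, the labels come directly from the argumentation-driven documents and need no extra annotation.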

Explicit Distance
There is strong relevance between a piece of evidence and its corresponding claim. For example, the traffic accident responsibility certificate can support the division of responsibility for a traffic accident. Therefore, we assume a higher relevance score between evidence and the corresponding claim. Similar to the labeling scheme described above, the relevance score between a piece of evidence and its corresponding claim is labeled 1, and the relevance score between that evidence and any other claim is labeled 0.
For a given criminal case, there is an evidence set denoted as $E = \{e_1, e_2, \dots, e_n\}$ and a claim set denoted as $C = \{c_1, c_2, \dots, c_m\}$. Models predict a relevance score matrix denoted as $A \in \mathbb{R}^{n \times m}$. Each element $a_{ij}$ of $A$ is the relevance score between the evidence $e_i$ and the claim $c_j$. We assume that pieces of evidence belonging to the same cluster have similar relevance score distributions. More specifically, suppose the relevance score distribution of evidence $e_1$ is $P \in \mathbb{R}^{1 \times m}$, where each element $P_j$ is the relevance score between evidence $e_1$ and claim $c_j$. Similarly, $Q \in \mathbb{R}^{1 \times m}$ is the relevance score distribution of evidence $e_2$. We take the Jensen-Shannon divergence (Endres and Schindelin, 2003) between these two distributions as the explicit distance between $e_1$ and $e_2$.
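A minimal sketch of the explicit distance follows. Several details are assumptions rather than statements from the paper: each row of the score matrix is normalized to sum to 1 before comparison, a small `eps` avoids taking the logarithm of zero, and base-2 logarithms are used so the value lies in [0, 1]. The paper may equally intend the square root of this quantity, which is the metric studied by Endres and Schindelin. The function names are hypothetical.

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two score vectors."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    # Assumption: rescale raw relevance scores into probability vectors.
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2(p / m))
    kl_qm = np.sum(q * np.log2(q / m))
    return float(0.5 * (kl_pm + kl_qm))


def explicit_distance_matrix(scores):
    """Pairwise JS divergence between the rows of an (n, m) score matrix."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = js_divergence(scores[i], scores[j])
    return dist


# Toy relevance scores for 3 pieces of evidence against 2 claims.
A = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9]])
D = explicit_distance_matrix(A)
```

In the toy matrix the first two rows favour the same claim, so their divergence is much smaller than either one's divergence from the third row.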

Ensemble Distance
The latent distance uses only the semantic information between the evidence texts. The explicit distance uses only the inference relationship between evidence and claim. To exploit both the semantic information between pieces of evidence and the inference information between evidence and claims, we fuse the two and define the ensemble distance as the weighted sum of the latent and explicit distances.
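The fusion can be written in a few lines. This sketch assumes a convex combination with a single weight in [0, 1]; the paper only states that a weighted sum is used, so the parameterization and the name `ensemble_distance` are illustrative.

```python
import numpy as np


def ensemble_distance(latent, explicit, weight=0.5):
    """Weighted sum of latent and explicit distance matrices.

    weight: share given to the latent distance; the explicit distance
    receives 1 - weight (an assumed parameterization).
    """
    latent = np.asarray(latent, dtype=float)
    explicit = np.asarray(explicit, dtype=float)
    if latent.shape != explicit.shape:
        raise ValueError("distance matrices must have the same shape")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * latent + (1.0 - weight) * explicit


latent = np.array([[0.0, 0.2], [0.2, 0.0]])
explicit = np.array([[0.0, 0.6], [0.6, 0.0]])
fused = ensemble_distance(latent, explicit, weight=0.25)
```

Keeping the weights summing to 1 leaves the fused distance on the same scale as its inputs, which makes a single stopping threshold easier to interpret.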

Datasets
We construct a new dataset from the legal documents published on China Judgements Online. We select the legal documents whose evidence description is in the argumentation-driven form, as shown in Figure 1, for our experiments. For evidence descriptions in the argumentation-driven form, we can easily extract the evidence and the corresponding claims without manual annotation. A subset of evidence and its corresponding claim are always on the same line. The part before "prove that" is the evidence subset and the part after "prove that" is the corresponding claim. Pieces of evidence in the same subset are usually separated by punctuation marks. After pre-processing, each judicial evidence description sample consists of an evidence set and a claim set, as illustrated in Figure 2. We select 500 cases of the Traffic Accident Crime, which is one of the most frequent criminal charges. We compute the average number of pieces of evidence and claims per case, as well as the average length of evidence and claims in Chinese characters. The detailed statistics of the dataset are shown in Table 1.
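The extraction rule described above can be sketched as a small parser. The marker string "prove that" follows the English rendering used in the paper; the actual Chinese marker and the exact set of separating punctuation marks are not given, so both defaults below are assumptions, as is the function name `parse_evidence_line`.

```python
import re


def parse_evidence_line(line, marker="prove that", delimiters=",;、，；"):
    """Split one line into (evidence_list, claim).

    Returns None when the marker is absent, i.e. the line is not in
    argumentation-driven form.
    """
    head, sep, tail = line.partition(marker)
    if not sep:
        return None
    # Evidence items in the same subset are separated by punctuation.
    pattern = "[" + re.escape(delimiters) + "]"
    evidence = [item.strip() for item in re.split(pattern, head) if item.strip()]
    claim = tail.strip().rstrip(".。")
    return evidence, claim


example = ("The household registration certificate, the death certificate "
           "prove that the victim died in the accident.")
parsed = parse_evidence_line(example)
```

Each parsed line yields one evidence subset and its claim, so the cluster labels for a case come directly from the line an item appears on.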

Experimental Setup
As a baseline, we calculate the cosine distance between the averaged GloVe word embeddings of each evidence pair. We mainly adopt ESIM and BERT to predict the distance via supervised learning.
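The unsupervised baseline can be sketched as below. The tiny three-dimensional vectors stand in for the 300-dimensional GloVe embeddings and are invented for illustration; skipping out-of-vocabulary tokens and returning distance 1 for an all-zero vector are also assumptions.

```python
import numpy as np


def average_embedding(tokens, vectors, dim):
    """Mean of the word vectors of `tokens`; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    if norm == 0.0:
        return 1.0
    return 1.0 - float(np.dot(u, v) / norm)


# Hypothetical toy embeddings (real GloVe vectors would have 300 dimensions).
toy_vectors = {
    "death": np.array([1.0, 0.0, 0.0]),
    "certificate": np.array([0.0, 1.0, 0.0]),
    "photo": np.array([0.0, 0.0, 1.0]),
}
a = average_embedding(["death", "certificate"], toy_vectors, dim=3)
b = average_embedding(["certificate", "death"], toy_vectors, dim=3)
c = average_embedding(["photo"], toy_vectors, dim=3)
```

Averaging discards word order, so the first two toy texts receive distance 0, which hints at why this baseline struggles on short evidence texts.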
ESIM. We tokenize the Chinese texts with the open-source tool HanLP and use GloVe (Pennington et al., 2014) word embeddings trained on a corpus crawled from China Judgements Online, with an embedding size of 300. We train the model for 20 epochs with a learning rate of 1e-4, a hidden size of 300, and a batch size of 32.
Figure 3: An example of clustering results via different distance metrics. The superscript of the evidence text indicates which claim the evidence supports, and evidence with the same superscript should be grouped together.
BERT. We concatenate the two texts of an evidence pair (or of an evidence-claim pair when calculating the explicit distance), separated by a special [SEP] token, and add a sigmoid layer on top of the special [CLS] token. We fine-tune only the last two layers of the BERT model for 10 epochs with a learning rate of 5e-5 and a batch size of 32.
We choose the weights between latent and explicit distance after testing the results of different proportions.
The agglomerative hierarchical clustering method has a stopping threshold parameter. We choose the best parameter on the validation dataset in the range of 0 to 0.2 with a step size of 0.001. To ensure the stability of the experimental results, we evaluate methods via 5-fold cross-validation.
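The grid search over the stopping threshold can be sketched as follows. The sketch assumes SciPy and scikit-learn are available, uses average linkage and mean ARI as the selection criterion (neither is specified in the paper), and expects every validation case to contain at least two pieces of evidence. The name `select_threshold` and the toy case are hypothetical, and the 5-fold loop around this step is omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import adjusted_rand_score


def select_threshold(val_cases, start=0.0, stop=0.2, step=0.001):
    """Pick the threshold with the best mean ARI on validation cases.

    val_cases: list of (distance_matrix, gold_labels) pairs, one per case.
    Returns (best_threshold, best_mean_ari).
    """
    # Build each dendrogram once; only the cut height changes in the loop.
    trees = []
    for dist, gold in val_cases:
        condensed = squareform(np.asarray(dist, dtype=float), checks=False)
        trees.append((linkage(condensed, method="average"), gold))

    best_t, best_score = start, -np.inf
    for t in np.arange(start, stop + step / 2, step):
        scores = [adjusted_rand_score(gold, fcluster(tree, t=t, criterion="distance"))
                  for tree, gold in trees]
        mean_score = float(np.mean(scores))
        if mean_score > best_score:  # ties keep the smallest threshold
            best_t, best_score = float(t), mean_score
    return best_t, best_score


# Toy validation case: two tight pairs, far apart from each other.
toy_dist = np.array([
    [0.0000, 0.0455, 0.1500, 0.1500],
    [0.0455, 0.0000, 0.1500, 0.1500],
    [0.1500, 0.1500, 0.0000, 0.0305],
    [0.1500, 0.1500, 0.0305, 0.0000],
])
best_t, best_ari = select_threshold([(toy_dist, [0, 0, 1, 1])])
```

On the toy case the search settles on the smallest threshold that merges both tight pairs while keeping the two groups apart.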

Result and Analysis
Since the constructed dataset includes ground-truth cluster labels, we adopt the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) and the Adjusted Mutual Information (AMI) (Vinh et al., 2009) to evaluate the clustering performance. Table 2 presents the experimental results. Encouragingly, every supervised method performs much better than the unsupervised baseline. Meanwhile, the BERT model outperforms the ESIM model. One possible reason is that BERT is a deeper neural network. Another possible reason is that evidence pairs supporting the same claim tend to co-occur, a tendency that could be captured by the next sentence prediction task of the BERT model. The latent distance performs better than the explicit distance because it utilizes the semantic information between evidence pairs. Clustering with the ensemble distance improves considerably over either single distance, owing to the integration of the relationships between evidence pairs and between evidence-claim pairs.
As shown in Figure 3, claims 1 and 2 represent the victim's date of birth and death, respectively. Both the victim's household registration certificate and the victim's death certificate can partly support the victim's identification information. With the latent distance they are clustered together by mistake, because no explicit claims are given and only the semantic relationship between evidence pairs is used. Claims 4 and 5 are similar: both describe the scene of a traffic accident. With the explicit distance, the defendant Wang's confession and the testimony of witness Dong are clustered together by mistake, because almost no semantic relationship between evidence pairs is considered. The clustering result with the ensemble distance is correct, since it combines the semantic relationship between evidence pairs with the information introduced by explicit claims.
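The two evaluation metrics named at the start of this section can be computed with scikit-learn, as in this small sketch. The label vectors are invented for illustration and do not come from the dataset.

```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Gold clusters for five pieces of evidence in one hypothetical case.
gold = [0, 0, 1, 1, 2]
# Same partition with different cluster ids: both metrics ignore the naming.
pred_perfect = [2, 2, 0, 0, 1]
# The first two gold clusters merged by mistake.
pred_merged = [0, 0, 0, 0, 1]

ari_perfect = adjusted_rand_score(gold, pred_perfect)
ami_perfect = adjusted_mutual_info_score(gold, pred_perfect)
ari_merged = adjusted_rand_score(gold, pred_merged)
ami_merged = adjusted_mutual_info_score(gold, pred_merged)
```

Both scores are adjusted for chance, so a perfect partition scores 1 regardless of how the clusters are numbered, while the wrongly merged partition scores lower.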

Conclusion
In this paper, we propose a novel task of evidence association. The experimental results show that supervised methods significantly improve the clustering results even with little training data. The clustering results are further improved by introducing the information from explicit claims. Since explicit claims are not given in most cases, we are now studying how to model the claims from the fact description of the case, in order to benefit from explicit claims when they are not stated.

Ethics Statement
The dataset constructed in this paper is from China Judgements Online, which is an official legal documents website. The names of all participants in the dataset were anonymized before being published online, and many datasets constructed from this website are already used in Chinese law-related research. We perform analysis at the evidence level rather than the user level, which is less intrusive for specific people. Finally, this technology mainly plays an auxiliary role, providing a reference for judges rather than playing a decisive role.