HLDC: Hindi Legal Documents Corpus

Many populous countries, including India, are burdened with a considerable backlog of legal cases. The development of automated systems that can process legal documents and augment legal practitioners can mitigate this. However, there is a dearth of the high-quality corpora needed to develop such data-driven systems. The problem gets even more pronounced in the case of low-resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. Documents are cleaned and structured to enable the development of downstream applications. Further, as a use-case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for this task. The MTL model uses summarization as an auxiliary task along with bail prediction as the main task. Experiments with different models indicate the need for further research in this area.


Introduction
In recent times, the legal system in many populous countries (e.g., India) has been inundated with a large number of legal documents and pending cases (Katju, 2019). There is an imminent need for automated systems to process legal documents and help augment the legal procedures. For example, if a system could readily extract the required information from a legal document for a legal practitioner, then it would help expedite the legal process. However, the processing of legal documents is challenging and is quite different from conventional text processing tasks. For example, legal documents are typically quite long (tens of pages), highly unstructured and noisy (with spelling and grammar mistakes, since these are typed), and use domain-specific language and jargon; consequently, pre-trained language models do not perform well on these (Malik et al., 2021b). Thus, to develop legal text processing systems and address the challenges associated with the legal domain, there is a need for creating specialized legal domain corpora.
In recent times, there have been efforts to develop such corpora. For example, Chalkidis et al. (2019) have developed an English corpus of European Court of Justice documents, while Malik et al. (2021b) have developed an English corpus of Indian Supreme Court documents. Xiao et al. (2018) have developed a Chinese legal document corpus. However, to the best of our knowledge, there does not exist any legal document corpus for the Hindi language (a language belonging to the Indo-European family and predominantly spoken in India). Hindi uses the Devanagari script for its writing system (Wikipedia contributors, 2021). Hindi is spoken by approximately 567 million people in the world (WorldData, 2021). Most of the lower (district) courts in northern India use Hindi as the official language. However, most of the legal NLP systems that currently exist in India have been developed for English, and these do not work on Hindi legal documents (Malik et al., 2021b). To address this problem, in this paper, we release a large corpus of Hindi legal documents (HINDI LEGAL DOCUMENTS CORPUS or HLDC) that can be used for developing NLP systems that could augment legal practitioners by automating some of the legal processes. Further, we show a use case for the proposed corpus via a new task of bail prediction.
India follows a Common Law system and has a three-tiered court system with District Courts (along with Subordinate Courts) at the lowest level (districts), followed by High Courts at the state level, and the Supreme Court of India at the highest level. In terms of number of cases, district courts handle the majority. According to India's National Judicial Data Grid, as of November 2021, there are approximately 40 million cases pending in District Courts (National Judicial Data Grid, 2021) as opposed to 5 million cases pending in High Courts. These statistics show an immediate need for developing models that could address the problems at the grass-root levels of the Indian legal system. Out of the 40 million pending cases, approximately 20 million are from courts where the official language is Hindi (National Judicial Data Grid, 2021). In this resource paper, we create a large corpus of 912,568 Hindi legal documents. In particular, we collect documents from the state of Uttar Pradesh, the most populous state of India with a population of approximately 237 million (PopulationU, 2021). The Hindi Legal Documents Corpus (HLDC) can be used for a number of legal applications, and as a use case, in this paper, we propose the task of Bail Prediction.
Given a legal document with the facts of the case, the task of bail prediction requires an automated system to predict if the accused should be granted bail or not. The motivation behind the task is not to replace a human judge but rather to augment them in the judicial process. Given the volume of cases, if a system could present an initial analysis of the case, it would expedite the process. As told to us by legal experts and practitioners, given the economies of scale, even a small improvement in efficiency would result in a large impact. We develop baseline models for addressing the task of bail prediction.
In a nutshell, we make the following main contributions in this resource paper:
• We create a Hindi Legal Documents Corpus (HLDC) of 912,568 documents. These documents are cleaned and structured to make them usable for downstream NLP/IR applications. Moreover, this is a growing corpus as we continue to add more legal documents to HLDC. We release the corpus and model implementation code with this paper: https://github.com/Exploration-Lab/HLDC.
• As a use-case for the applicability of the corpus for developing legal systems, we propose the task of Bail Prediction.
• For the task of bail prediction, we experiment with a variety of deep learning models. We propose a multi-task learning model based on the transformer architecture. The proposed model uses extractive summarization as an auxiliary task and bail prediction as the main task.

Related Work
In recent years, there has been active interest in the application of NLP techniques to the legal domain (Zhong et al., 2020a). A number of tasks and models have been proposed, inter alia, Legal Judgment Prediction (Chalkidis et al., 2019), Legal Summarization (Bhattacharya et al., 2019; Tran et al., 2019), Prior Case Retrieval (Jackson et al., 2003; Shao et al., 2020), Legal Question Answering (Kim and Goebel, 2017), Catchphrase Extraction (Galgani et al., 2012), and Semantic Segmentation (Kalamkar et al., 2022; Malik et al., 2021a). The majority of corpora for Legal-NLP tasks have been in English; recently, there have been efforts to address other languages as well. For example, Xiao et al. (2018) have created a large-scale Chinese criminal judgment prediction dataset with over 2.68 million legal documents. Work on Legal-NLP in languages other than English is still in its incipient stages. Our paper contributes towards these efforts by releasing a corpus in Hindi.
The majority of the work in the legal domain has focused on the higher courts (Malik et al., 2021b; Strickson and De La Iglesia, 2020; Zhong et al., 2020b); however, the lower courts handle the maximum number of cases. We try to address this gap by releasing a large corpus of district-court-level legal documents. Work on legal question answering has augmented QA datasets with a database containing the legal knowledge required to answer the questions, along with meta information assigned to each question for in-depth analysis. Xiao et al. (2019) proposed CAIL2019-SCM, a dataset containing 8,964 triplets of case documents, with the objective of identifying which two cases in a triplet are more similar. Similar case matching has a crucial application as it helps to identify comparable historical cases. A historical case with similar facts often serves as a legal precedent and influences the judgement. Such historical information can be used to make legal judgement prediction models more robust. Kleinberg et al. (2017) proposed bail decision prediction as a good proxy to gauge if machine learning can improve human decision making. A large number of bail documents along with the binary decision (granted or denied) makes it an ideal task for automation. In this paper, we also propose the bail prediction task using the HLDC corpus.

Hindi Legal Documents Corpus
Hindi Legal Documents Corpus (HLDC) is a corpus of 912,568 Indian legal case documents in the Hindi language. The corpus is created by downloading data from the e-Courts website (a publicly available website: https://districts.ecourts.gov.in/). All the legal documents we consider are in the public domain. We download case documents pertaining to the district courts located in the northern Indian state of Uttar Pradesh (U.P.). We focus mainly on the state of U.P. as it is the most populous state of India, resulting in the filing of a large number of cases in district courts. U.P. has 71 districts and about 161 district courts. U.P. is a predominantly Hindi-speaking state, and consequently, the official language used in district courts is Hindi. We crawled case documents from all districts of U.P. corresponding to cases filed over two years, from May 01, 2019 to May 01, 2021. Figure 2 shows the map of U.P. and the district-wise variation in the number of cases. As can be seen in the plot, the western side of the state has more cases; this is possibly due to the higher population and greater urbanization in the western part. Table 1 shows the percentage-wise division of different case types in HLDC. As evident from the table, the majority of documents pertain to bail applications. The HLDC corpus has a total of 3,797,817 unique tokens, and on average, each document has 764 tokens.
HLDC Creation Pipeline: We outline the entire pipeline used to create the corpus in Figure 1. The documents on the website are originally typed in Hindi (in Devanagari script), scanned to PDF format, and uploaded. The first step in HLDC creation is downloading the documents from the e-Courts website. We downloaded a total of 1,221,950 documents. To extract Hindi text from these, we perform OCR (Optical Character Recognition) via the Tesseract tool. Tesseract worked well for our use case as the majority of case documents were well-typed, and it outperformed other OCR libraries. The obtained text documents were further
cleaned to remove noisy documents, e.g., documents that are too short (< 32 bytes) or too long (> 8096 bytes), duplicates, and English documents (details in Appendix B). This resulted in a total of 912,568 documents in HLDC. We anonymized the corpus with respect to names and locations. We used a gazetteer along with regex-based rules for NER to anonymize the data. First names, last names, middle names, locations, titles like Pandit (title of priest) and Sir, month names, and day names were normalized to the <naam> (<name>) tag. The gazetteer also had some common ambiguous words (these words can be names or sometimes verbs), such as Prathna (can refer to prayer, the act of requesting, or a name), Gaya (can refer to a location name or a verb), Kiya (can refer to the infinitive 'to do' or a name), and Liya (can refer to the infinitive 'to take' or a name). These were removed. Further, we ran an RNN-based Hindi NER model on a subset of documents to find additional entities, and these were subsequently used to augment our gazetteer (details in Appendix C). Phone numbers were detected using regex patterns and replaced with a <phone-number> tag; numbers written in both English and Hindi were considered.
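The filtering and anonymization steps above can be sketched as follows. The length thresholds match those stated in the text; the gazetteer entries and phone-number pattern are illustrative assumptions (the actual pipeline uses a much larger gazetteer augmented by an NER model):

```python
import re

# Hypothetical filters and a toy gazetteer illustrating the cleaning and
# anonymization steps; the real gazetteer is far larger.
MIN_BYTES, MAX_BYTES = 32, 8096
GAZETTEER = {"राम", "मोहन"}  # names to mask (toy subset)

# Matches phone numbers written with either ASCII or Devanagari digits.
PHONE_RE = re.compile(r"[0-9\u0966-\u096F][0-9\u0966-\u096F\- ]{8,}")

def keep_document(text: str) -> bool:
    """Drop documents that are too short or too long (in bytes)."""
    n = len(text.encode("utf-8"))
    return MIN_BYTES <= n <= MAX_BYTES

def anonymize(text: str) -> str:
    """Mask phone numbers and gazetteer names with placeholder tags."""
    text = PHONE_RE.sub("<phone-number>", text)
    for name in GAZETTEER:
        text = text.replace(name, "<naam>")
    return text
```

Running the filters on a raw OCR dump and then anonymizing the survivors mirrors the order used in the pipeline (filter first, so the gazetteer pass only touches documents that will be kept).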
Legal documents, particularly in lower courts, are highly unstructured and lack standardization with respect to format and sometimes even the terms used. We converted the unstructured documents to semi-structured documents. We segmented each document into a header and a body. The header contains the meta-information related to the case, for example, the case number, court identifier, and applicable sections of the law. The body contains the facts of the case, arguments, the judge's summary, the case decision, and other information related to the final decision. The documents were segmented using regex and rule-based approaches, as described in Appendix D.
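A minimal sketch of such a rule-based header/body split is shown below. The metadata pattern here is a hypothetical stand-in: it treats leading lines that look like key:value pairs or case-number citations as the header; the actual rules are in Appendix D:

```python
import re

# Hypothetical metadata pattern: "Key: value" lines or tokens like "123/2020"
# count as header material; the first line matching neither starts the body.
META_RE = re.compile(r"^\s*([^:]{1,40}:|\S+\s*/\s*\d{4})")

def segment(document: str):
    """Split a case document into (header, body) using leading metadata lines."""
    lines = document.splitlines()
    split = 0
    for i, line in enumerate(lines):
        if line.strip() and not META_RE.match(line):
            break  # first non-metadata, non-empty line: body starts here
        split = i + 1
    return "\n".join(lines[:split]), "\n".join(lines[split:])
```

This is deliberately conservative: blank lines are absorbed into the header so that the body begins at the first substantive paragraph.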
Case Type Identification: HLDC documents were processed to obtain different case types (e.g., Bail Applications, Criminal Cases). The case type was identified via the meta-data that comes with each document. However, different districts use variations of the same case type name (e.g., Bail Application vs. Bail App.). We resolved these standardization issues via manual inspection and regex-based patterns, resulting in a final list of 300 unique case types.
Lexical Analysis: Although Hindi is the official language, U.P. is a large and populous state, and we observe considerable lexical variation across its districts. For example, one dialectal variant of the word for 'motionless' is used most often only in East U.P. Similarly, the word Gaushiya (cow and related animals) is mostly used in North-Western U.P. (Rampur, Pilibhit, Jyotiba Phule Nagar (Amroha), Bijnor, Budaun, Bareilly, Moradabad). Three districts (Muzaffarnagar, Kanshiramnagar and Pratapgarh) account for 81.5% of the occurrences of the word Dand (punishment). These districts are, however, spread across U.P. An important thing to note is that words specific to particular districts/areas are colloquial and not part of the standard Hindi lexicon. This makes it difficult for a prediction model to generalize across districts (§7).
Corpus of Bail Documents: Bail is the provisional release of a suspect in a criminal offence on payment of a bail bond and/or under additional restrictions. Bail cases form a large majority of cases in the lower courts, as seen in Table 1. Additionally, they are very time-sensitive, as they require quick decisions. For HLDC, the ratio of bail documents to total cases in each district is shown in Figure 3. As a use-case for the corpus, we further investigated the subset of the corpus containing only bail application documents (henceforth, the Bail Corpus).
Bail Document Segmentation: For the bail documents, besides the header and body, we further segmented the body into more subsections (Figure 4). The body is segmented into Facts and Arguments, Judge's Summary, and Case Result. Facts and Arguments contains the facts of the case and the defendant's and prosecutor's arguments. Most bail documents have a concluding paragraph where the judge summarizes their view of the case; this constitutes the Judge's Summary sub-section. The Case Result sub-section contains the final decision given by the judge. More details about document segmentation are in Appendix D.
Bail Decision Extraction: The decision was extracted from the Case Result section using a rule-based approach (details in Appendix E).
Bail Amount Extraction: If bail was granted, it usually has some bail amount associated with it. We extracted this bail amount using regex patterns (details in Appendix F).
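The decision and amount extraction steps can be sketched together. The keyword lists and amount pattern below are assumptions for illustration (the actual rules are in Appendices E and F), and Devanagari numerals are ignored here for brevity:

```python
import re

# Illustrative keyword lists (assumed, not the paper's actual rules).
GRANT_WORDS = ("स्वीकार", "granted")
DENY_WORDS = ("निरस्त", "खारिज", "rejected")

# Matches amounts like "Rs. 50,000" or "₹50000" (ASCII digits only).
AMOUNT_RE = re.compile(r"(?:Rs\.?|₹|रु\.?)\s*([\d,]+)")

def extract_decision(result_text: str):
    """Return 'granted', 'denied', or None from the case-result section."""
    text = result_text.lower()
    if any(w in text for w in DENY_WORDS):
        return "denied"
    if any(w in text for w in GRANT_WORDS):
        return "granted"
    return None

def extract_amount(result_text: str):
    """Return the bail amount as an int, or None if no amount is found."""
    m = AMOUNT_RE.search(result_text)
    return int(m.group(1).replace(",", "")) if m else None
```

Checking denial keywords before grant keywords is a deliberate choice: result sections sometimes quote the bail request before rejecting it.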
We verified each step of the corpus creation pipeline (detailed analysis in Appendix G) to ensure the quality of the data. We initially started with 363,003 bail documents across all 71 districts of U.P., and after removing documents having segmentation errors, we have a Bail Corpus with 176,849 bail documents. The Bail Corpus has a total of 2,342,073 unique tokens, and on average, each document has 614 tokens. A sample document segmented into various sections is shown in Appendix I.

HLDC: Ethical Aspects
We create HLDC to promote research and automation in the legal domain for under-researched and low-resource languages like Hindi. The documents that are part of HLDC are in the public domain and hence accessible to all. Given the volume of pending cases in the lower courts, our efforts are aimed towards improving the legal system, which in turn would be beneficial for millions of people. Our work is in line with some of the previous work on legal NLP, e.g., legal corpora creation and legal judgement prediction (section 2). Nevertheless, we are aware that, if not handled correctly, legal AI systems developed on legal corpora can negatively impact individuals and society at large. Consequently, we took all possible steps to remove personal information and biases from the corpus. We anonymized the corpus (section 3) with respect to names, gender information, titles, locations, times, judges' names, and petitioners' and appellants' names. As observed in previous work (Malik et al., 2021b), anonymization of a judge's name is important as there is a correlation between a case outcome and the judge's name. Along with HLDC, we also introduce the task of Bail Prediction. Bail applications constitute the bulk of the cases (§3), and augmentation by an AI system can help in this case. The bail prediction task aims not to promote the development of systems that replace humans but rather systems that augment them. The bail prediction task provides only the facts of the case to predict the final decision and avoids any biases that may affect the final decision. Moreover, the Bail Corpus and corresponding bail prediction systems can promote the development of explainable systems (Malik et al., 2021b); we leave research on such explainable systems for future work. The legal domain is a relatively new area in NLP research, and more research and investigation are required in this area, especially concerning biases and societal impacts; for this to happen, there is a need for corpora, and in this paper, we make initial steps towards these goals.

Bail Prediction Task
To demonstrate a possible application of HLDC, we propose the Bail Prediction Task: given the facts of a case, the goal is to predict whether bail would be granted or denied. Formally, consider a corpus of bail documents where each bail document is segmented as b_i = (h_i, f_i, j_i, y_i). Here, h_i, f_i, j_i and y_i represent the header, facts, judge's summary and bail decision of the document, respectively. Additionally, the facts of every document contain k sentences; more formally, f_i = (s_i^1, s_i^2, ..., s_i^k), where s_i^k represents the k-th sentence of the i-th bail document. We formulate the bail prediction task as a binary classification problem. We are interested in modelling p_θ(y_i | f_i), the probability of the outcome y_i given the facts of a case f_i. Here, y_i ∈ {0, 1}, i.e., 0 if bail is denied and 1 if bail is granted.
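The formulation above maps directly onto a simple data model; the field names below are our own, mirroring the tuple b_i = (h_i, f_i, j_i, y_i):

```python
from dataclasses import dataclass
from typing import List

# Minimal data model for one bail document b_i = (h_i, f_i, j_i, y_i).
@dataclass
class BailDocument:
    header: str             # h_i: case meta-information
    facts: List[str]        # f_i: sentences s_i^1 ... s_i^k
    judge_summary: str      # j_i: judge's concluding summary
    decision: int           # y_i: 0 = bail denied, 1 = bail granted
```

A classifier for the task then models p_θ(decision | facts), with header and judge_summary deliberately withheld at prediction time.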

Bail Prediction Models
We initially experimented with off-the-shelf pre-trained models trained on general-purpose texts. However, as outlined earlier (§1), the legal domain comes with its own challenges, viz. specialized legal lexicon, long documents, and unstructured and noisy texts. Moreover, our corpus is from an under-resourced language (Hindi). Nevertheless, we experimented with existing fine-tuned (pre-trained) models and finally propose a multi-task model for the bail prediction task.

Embedding Based Models
We experimented with the classical embedding-based model Doc2Vec (Le and Mikolov, 2014) and the transformer-based contextualized embedding model IndicBERT (Kakwani et al., 2020). Doc2Vec embeddings, in our case, are trained on the train set of our corpus. The embeddings go as input to SVM and XGBoost classifiers. IndicBERT is a transformer language model trained on 12 major Indian languages. However, IndicBERT, akin to other transformer LMs, has a limitation on the input length (number of tokens). Inspired by Malik et al. (2021b) and Chalkidis et al. (2019), we experimented with fine-tuning IndicBERT in two settings: the first 512 tokens and the last 512 tokens of the document. The fine-tuned transformer with a classification head is used for bail prediction.
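The two truncation settings for IndicBERT amount to slicing an already-tokenized document at the 512-token limit; a small sketch (tokenization itself is assumed to have happened upstream):

```python
# The two fine-tuning settings truncate long documents to the model's
# 512-token input limit: keep either the first or the last 512 tokens.
MAX_LEN = 512

def truncate(tokens, mode: str):
    """Return the first or last MAX_LEN tokens of a tokenized document."""
    if mode == "first":
        return tokens[:MAX_LEN]
    if mode == "last":
        return tokens[-MAX_LEN:]
    raise ValueError(f"unknown mode: {mode}")
```

The "last 512" setting is motivated by the observation that the judge's summary and decision-relevant material tend to appear near the end of a case document.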

Summarization Based Models
Given the long lengths of the documents, we experimented with prediction models that use summarization as an intermediate step. In particular, an extractive summary of a document goes as input to a fine-tuned transformer-based classifier (IndicBERT). Besides reducing the length of the document, extractive summarization helps to identify the salient sentences in a legal document and is a step towards developing explainable models. We experimented with both unsupervised and supervised extractive summarization models.
For unsupervised approaches, we experimented with TF-IDF (Ramos, 2003) and TextRank (a graph-based method for extracting the most important sentences) (Mihalcea and Tarau, 2004). For the supervised approach, inspired by Bajaj et al. (2021), we propose the use of a sentence salience classifier to extract important sentences from the document. Each document (b_i = (h_i, f_i, j_i, y_i), §5) comes with a judge's summary j_i. For each sentence in the facts of the document (f_i), we calculate its cosine similarity with the judge's summary (j_i). Formally, the salience of the k-th sentence s_i^k is given by: salience(s_i^k) = cos(h_{j_i}, h_{s_i^k}). Here, h_{j_i} is the contextualized distributed representation for j_i obtained using a multilingual sentence encoder (Reimers and Gurevych, 2020). Similarly, h_{s_i^k} is the representation for the sentence s_i^k. The cosine similarities provide a ranked list of sentences, and we select the top 50% of sentences as salient. The salient sentences are used to train (and fine-tune) an IndicBERT-based classifier.
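The salience-labelling step can be sketched with plain vectors standing in for the sentence-encoder embeddings; only the ranking and top-50% selection logic is shown:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def salient_sentences(sent_embs, summary_emb):
    """Indices of the top-50% sentences ranked by similarity to the
    judge's-summary embedding (at least one sentence is always kept)."""
    scores = [(cosine(e, summary_emb), i) for i, e in enumerate(sent_embs)]
    scores.sort(reverse=True)
    k = max(1, len(scores) // 2)
    return sorted(i for _, i in scores[:k])
```

In the actual pipeline, the embeddings come from a multilingual sentence encoder (Reimers and Gurevych, 2020) rather than toy vectors, and the selected sentences feed the IndicBERT classifier.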

Multi-Task Learning (MTL) Model
As observed during our experiments, summarization-based models show improved results (§7).
Inspired by this, we propose a multi-task framework (Figure 5), where bail prediction is the main task and sentence salience classification is the auxiliary task. The intuition is that predicting the important sentences via the auxiliary task forces the model to make better predictions, and vice-versa. The input to the model is the sequence of sentences corresponding to the facts of a case: s_i^1, s_i^2, ..., s_i^k. A multilingual sentence encoder (Reimers and Gurevych, 2020) is used to get a contextualized representation of each sentence: h_{s_i^1}, h_{s_i^2}, ..., h_{s_i^k}. In addition, we add to the sentence representations a special, randomly initialized CLS embedding (Devlin et al., 2019) that gets updated during model training. The CLS and sentence embeddings are fed into a standard single-layer transformer architecture (the shared transformer).
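A minimal PyTorch sketch of this architecture is below, under stated assumptions: dimensions are illustrative, the CLS embedding is prepended to the sentence sequence, and the two heads correspond to the bail and salience tasks described in the following subsections (their losses are equally weighted, L = L_bail + L_salience):

```python
import torch
import torch.nn as nn

# Sketch of the multi-task model: pre-computed sentence embeddings plus a
# learned CLS embedding pass through a shared single-layer transformer;
# a bail head reads the CLS position, a salience head reads each sentence.
class MTLBailModel(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=1)
        self.bail_head = nn.Linear(dim, 2)      # grant / deny
        self.salience_head = nn.Linear(dim, 1)  # per-sentence salience

    def forward(self, sent_embs):               # sent_embs: (batch, k, dim)
        b = sent_embs.size(0)
        x = torch.cat([self.cls.expand(b, -1, -1), sent_embs], dim=1)
        h = self.shared(x)
        bail_logits = self.bail_head(h[:, 0])               # CLS position
        salience_logits = self.salience_head(h[:, 1:]).squeeze(-1)
        return bail_logits, salience_logits
```

Training applies cross-entropy to bail_logits and binary cross-entropy to salience_logits, summing the two losses with equal weight.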

Bail Prediction Task
A classification head (a fully connected MLP layer) on top of the transformer's CLS embedding is used to perform bail prediction. We use the standard cross-entropy loss (L_bail) for training.

Salience Classification Task
We use a salience prediction head (MLP) on top of the sentence representations at the output of the shared transformer. For training the auxiliary task, we use the sentence salience scores obtained via cosine similarity (these come from the supervised summarization-based model). For each sentence, we use a binary cross-entropy loss (L_salience) to predict the salience. Based on our empirical investigations, both losses are equally weighted, and the total loss is given by L = L_bail + L_salience.

Experiments and Results

Dataset Splits
We evaluate the models in two settings: all-district performance and district-wise performance. In the first setting, the model is trained and tested on documents coming from all districts, with a train, validation and test split of 70:10:20. The district-wise setting tests the generalization capabilities of the model. In this setting, the documents from 44 districts (randomly chosen) are used for training. Testing is done on a different set of 17 districts not present in the train set, and the validation set has another set of 10 districts. This split also corresponds to a 70:10:20 ratio. Table 2 provides the number of documents across splits. The corpus is unbalanced for the prediction class, with about a 60:40 ratio of positive to negative class (Table 2). All models are evaluated using the standard accuracy and F1-score metrics (Appendix H.1).
Implementation Details: All models are trained using GeForce RTX 2080Ti GPUs. Models are tuned for hyper-parameters using the validation set (details in Appendix H.2).
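The district-wise split partitions districts, not documents, so that test districts are entirely unseen at training time; a sketch of the 44/10/17 partition (the seed and shuffling scheme are our assumptions):

```python
import random

def district_split(districts, seed: int = 0):
    """Partition the 71 U.P. districts into 44 train, 10 validation and
    17 test districts; documents then follow their district's split."""
    districts = sorted(districts)       # make the shuffle deterministic
    rng = random.Random(seed)
    rng.shuffle(districts)
    return districts[:44], districts[44:54], districts[54:71]
```

Any document-level leakage between splits is ruled out by construction, which is what makes this setting a test of cross-district generalization.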

Results
The results are shown in Table 3. As can be observed, in general, the performance of models is lower in the district-wise setting. This is possibly due to the lexical variation (section 3) across districts, which makes it difficult for the model to generalize. Moreover, this lexical variation corresponds to the usage of words from different dialects of Hindi. Another thing to note from the results is that, in general, summarization-based models perform better than Doc2Vec and transformer-based models, highlighting the importance of the summarization step in the bail prediction task. The proposed end-to-end multi-task model outperforms all the baselines in the district-wise setting with 78.53% accuracy. The auxiliary task of sentence salience classification helps learn robust features during training and adds a regularization effect on the main task of bail prediction, leading to improved performance over the two-step baselines. However, in the case of the all-district split, the MTL model fails to beat simpler baselines like TF-IDF + IndicBERT. We hypothesize that this is because the sentence salience training data may not be entirely correct, since it is based on the cosine similarity heuristic, which may induce some noise in the auxiliary task. Additionally, there is lexical diversity across documents from different districts. Since documents of all districts are combined in this setting, this may introduce diverse sentences, which are harder to encode for the salience classifier, while TF-IDF is able to look at the distribution of words across all documents and districts to extract salient sentences.

Error Analysis
We further analysed the model outputs to understand failure points and identify improvements to the bail prediction system. After examining the misclassified examples, we observed the following. First, the lack of standardization can manifest in unique ways. In one of the documents, we observed that all the facts and arguments seemed to point to the decision of bail granted. Our model also gauged this correctly and predicted bail granted. However, the actual result of the document showed that, even though bail was initially granted, because the accused failed to show up on multiple occasions, the judge overturned the decision, and the final verdict was bail denied. In some instances, we also observed that even if the facts of the cases are similar, the judgements can differ. We observed two cases about the illegal possession of drugs that differed only slightly in the quantity seized but had different decisions. The model is trained only on the documents and has no access to legal knowledge, hence it is not able to capture such legal nuances. We also performed quantitative analysis of the model output to better understand the performance. Our model outputs a probabilistic score in the range [0, 1].
A score closer to 0 indicates that our model is confident that bail would be denied, while a score closer to 1 means bail granted. In Figure 6, we plot the ROC curve to showcase the capability of the model at different classification thresholds. The ROC plots the True Positive and False Positive rates at different thresholds, and the area under the ROC curve (AUC) is a measure of aggregated classification performance. Our proposed model has an AUC score of 0.85, indicating high classification accuracy for a challenging problem. We also plot (Figure 7) the density functions corresponding to True Positives (bail correctly granted), True Negatives (bail correctly denied), False Positives (bail incorrectly granted) and False Negatives (bail incorrectly denied). We observe that the correct bail granted predictions are shifted towards 1, and the correct bail denied predictions are shifted towards 0. Additionally, the incorrect samples are concentrated near the middle (≈ 0.5), which shows that our model was able to identify these as borderline cases.
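The AUC metric used here has a simple rank-based interpretation: it is the probability that a randomly chosen bail-granted document receives a higher score than a randomly chosen bail-denied one. A dependency-free sketch of that computation:

```python
def auc_score(scores, labels):
    """Rank-based AUC: the fraction of (positive, negative) pairs where the
    positive document is scored higher; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise form is equivalent to integrating the ROC curve, which is how standard libraries compute it in practice.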

Future Work and Conclusion
In this paper, we introduced a large corpus of legal documents for the under-resourced language Hindi: the Hindi Legal Documents Corpus (HLDC). We semi-structure the documents to make them amenable for further use in downstream applications. As a use-case for HLDC, we introduce the task of Bail Prediction. We experimented with several models and proposed a multi-task learning based model that predicts salient sentences as an auxiliary task and bail prediction as the main task. The results show scope for improvement that we plan to explore in future work. We also plan to expand HLDC by covering other Hindi-speaking Indian states. Furthermore, as a future direction, we plan to collect legal documents in other Indian languages. India has 22 official languages, but for the majority of languages, there are no legal corpora. Another interesting future direction that we would like to explore is the development of deep models infused with legal knowledge so that models are able to capture legal nuances. We plan to use the HLDC corpus for other legal tasks such as summarization and prior case retrieval.

Figure 2: Variation in the number of case documents per district in the state of U.P. Prominent districts are marked.

Figure 3: Ratio of the number of bail applications to the total number of applications in U.P.

Figure 5: Overview of our multi-task learning approach.

Figure 6: ROC curve for the proposed model. The total AUC (area under the curve) is 0.85.

Figure 7: Density functions for True Positive, True Negative, False Positive and False Negative predictions.
Header: This chunk of the document contains meta-information related to the case, like the court hearing date, IPC sections attached, police station of the complaint, etc.

Table 1: Case types in HLDC. Out of around 300 different case types, we only show the prominent ones. The majority of the case documents correspond to bail applications.

Table 2: Number of documents across each split.

Table 3: Model results. For the TF-IDF and TextRank models, we take the sum of the token embeddings.

Result: This chunk of the document contains the decision made by the judge on the case.

Table 10: A sample segmented document.