Semantic Segmentation of Legal Documents via Rhetorical Roles

Legal documents are unstructured, use legal jargon, and are of considerable length, making them difficult to process automatically via conventional text processing techniques. A legal document processing system would benefit substantially if the documents could be segmented into coherent information units. This paper proposes a new corpus of legal documents annotated (with the help of legal experts) with a set of 13 semantically coherent unit labels (referred to as Rhetorical Roles), e.g., facts, arguments, statute, issue, precedent, ruling, and ratio. We perform a thorough analysis of the corpus and the annotations. For automatically segmenting the legal documents, we experiment with the task of rhetorical role prediction: given a document, predict the text segments corresponding to various roles. Using the created corpus, we experiment extensively with various deep learning-based baseline models for the task. Further, we develop a multi-task learning (MTL) based deep model with document rhetorical role label shift as an auxiliary task for segmenting a legal document. The proposed model shows superior performance over the existing models. We also experiment with domain transfer and model distillation techniques to assess model performance in limited data conditions.


Introduction
In recent times, courts in many populous countries (e.g., India) have been inundated with a large number of pending cases (Katju, 2019). There is an imminent need for automated systems to process legal documents and help augment the legal procedures. For example, if a system could readily extract the required information from a legal document for a legal practitioner, then it would help them expedite the legal process. However, the processing of legal documents is challenging and is quite different from conventional text processing tasks. This can be attributed to the following reasons. Firstly, legal texts are long (e.g., court judgment documents are tens of pages long) compared to the much shorter documents used for training conventional NLP models. Long documents are challenging to process: information is spread across the document, and it needs to be retained and recalled appropriately depending on the task. Secondly, legal documents (e.g., in common law countries like India) are highly unstructured and noisy (e.g., grammatical and spelling mistakes), as many of these are manually typed. Thirdly, legal documents use natural language; however, they use legal-domain-specific lexicons. Moreover, commonly occurring words in a natural language can have specialized meanings in legal parlance. The use of a specialized lexicon and the different semantics associated with words make pre-trained neural models (e.g., transformer-based models) ineffective. Fourthly, the legal domain has several sub-domains within it (corresponding to different laws, e.g., criminal law, income tax law). Although some of the fundamental legal principles are common, the overlap between different sub-domains is low; hence, systems developed for one law (e.g., income tax law) may not directly work for another (e.g., criminal law), leading to the problem of domain shift. Due to the aforementioned reasons, conventional text processing models do not perform well on legal texts.
In this paper, we target legal case proceedings in the form of judgment documents. To analyze a judgment and understand why and how the judge(s) reached a particular decision (based on the facts of the case, the statutes applicable, the existing precedents about similar factual scenarios, and the arguments made by the lawyers of the parties involved), it is essential that the body of the judgment be divided and categorized into different parts. Each of these parts (text segments) plays a different role towards the final judgment. To aid the processing of legal documents, we propose a method of semantically segmenting a legal document into coherent units of information referred to as Rhetorical Roles. In particular, we propose eight main Rhetorical Roles (Sec. 3). We create a new corpus of legal text annotated with Rhetorical Roles (RR) and propose a new model for segmenting documents into RRs, as conventional text classification models fail to perform well on legal documents. RRs can be useful for various legal applications, e.g., legal text summarization and crime classification. Legal documents are fairly long, and dividing them into rhetorical role units can help summarize documents effectively. In the task of legal judgment prediction, for example, using RRs, one could extract the relevant portions of the case that contribute towards the final decision. In this paper, we address an application of RRs to the task of judgment prediction. RRs can also address the explainability requirement of an AI-based legal processing system, since they provide semantically coherent units that could serve as an explanation in various tasks. RRs can be useful for legal information extraction, e.g., they can help extract the facts of a case. Similarly, prior cases similar to a given case could be retrieved by comparing different rhetorical role units. In this work, we make the following contributions:

1. We create a new corpus of legal documents annotated with rhetorical role labels. To the best of our knowledge, this is the largest RR corpus. We release the corpus, model implementations, and experiment code.
2. We propose a new multi-task learning (MTL) based deep learning model with document-level rhetorical role label shift as an auxiliary task for segmenting the document into rhetorical role units. We experiment with various text classification models and show that the proposed model performs better than the existing models. We further show that our method is robust under domain transfer.
3. Given that annotating legal documents with RR is a tedious process, we perform model distillation experiments with the proposed MTL model and attempt to leverage unlabeled data to enhance the performance.
4. To the best of our knowledge, we are the first to show an application of RR for the task of judgment prediction.

Related Work
Legal text processing has been an active area of research in recent times. A number of datasets, applications, and tasks have been proposed, for example, Argument Mining (Wyner et al., 2010), Information Extraction and Retrieval (Tran et al., 2019), Event Extraction (Lagos et al., 2010), Prior Case Retrieval (Jackson et al., 2003), Summarization (Moens et al., 1999), and Case Prediction (Malik et al., 2021; Chalkidis et al., 2019; Strickson and De La Iglesia, 2020).


Rhetorical Roles Corpus

We focus on two sub-domains of the Indian legal system: Competition Law (CL) (also called Anti-Trust Law in the US and Anti-Monopoly Law in China) and Income Tax (IT). CL deals with regulating the conduct of companies, particularly concerning competition. With the help of legal experts, we narrowed down the cases pertinent to CL and IT from the crawled corpus. We also scraped CL documents from Indian tribunal court cases (National Company Law Appellate Tribunal (NCLAT), Competition Appellate Tribunal (COMPAT), Competition Commission of India (CCI)). Choice of CL and IT domains: India has a common law system where a decision may not be exactly as per the statutes; the judiciary may come up with its own interpretation and overrule existing precedents. This introduces a degree of subjectivity. One of the biggest problems faced during the task of identifying the rhetorical roles in a judgment is the element of subjectivity involved in the judicial perception and interpretation of different rhetorical roles, ranging from the factual matrix, to statutory applicability and interpretation, to determining the fitness of a particular judicial precedent to the case at hand. To overcome this obstacle, we focus on specific legal domains (CL and IT) that display a relatively greater degree of consistency and objectivity in terms of judicial reliance on statutory provisions to reach decisions. Corpus Statistics: We randomly selected a set of 50 documents each for CL and IT from the set of acquired documents.
These 100 documents were annotated with RR labels by a team of legal experts. Our corpus is double the size of the existing RR corpus. The CL documents have 13,328 sentences (avg. of 266 per document), and IT has a total of 7,856 sentences (avg. of 157 per document). Label-wise distributions for IT and CL documents are provided in Appendix A.1. Annotating legal documents with RRs is a tedious as well as challenging task, even for legal experts. Annotating just 100 documents took almost three months. Nevertheless, this is a growing corpus, and we plan to add more annotated documents. Given the complexity of annotation, the RR labeling task also points towards model distillation (Sec. 5) and zero-shot learning-based methods.
Annotation Setup: The annotation team (legal team) consisted of two law professors from prestigious law schools and six graduate-level law student researchers. Based on detailed discussions with the legal team, we arrived at the following eight main rhetorical roles plus one 'none' label. During the annotation, the documents were annotated with thirteen fine-grained labels, since some roles could be sub-divided into more fine-grained classes. These included Fact (FAC) (further subdivided into Fact (FAC) and Issues (ISS)), Argument (ARG) (further subdivided into Argument Petitioner (ARG-P) and Argument Respondent (ARG-R)), Statute (STA), Dissent (DIS), Precedent (PRE) (further subdivided into Precedent Relied Upon (PRE-R), Precedent Not Relied Upon (PRE-NR), and Precedent Overruled (PRE-O)), Ruling by Lower Court (RLC), Ratio of the Decision (ROD), Ruling by the Present Court (RPC), and None (NON) (role definitions in Appendix A.2). The dataset was annotated by six legal experts (graduate law student researchers): three annotated the 50 CL documents, and the remaining three annotated the 50 IT documents. We used WebAnno (de Castilho et al., 2016) as the annotation framework. Each legal expert was asked to assign one of the 13 rhetorical roles to each sentence of a document. Note that we initially experimented with different levels of granularity (e.g., phrase level, paragraph level); based on a pilot study, we decided to go with sentence-level annotations, as this maintains the balance (from the perspective of topical coherence) between too short (having no labels) and too long (having too many labels) texts. Legal experts pointed out that a single sentence can sometimes represent multiple rhetorical roles (although this is not common). To handle such scenarios and motivate future research, each expert could also assign secondary and tertiary rhetorical roles to a single sentence. As an example, suppose a sentence is a 'Fact' but could also be an 'Argument' according to the legal expert.
In that case, the expert could assign the rhetorical roles 'Primary Fact' and 'Secondary Argument' to that sentence. We extended this to the tertiary level as well to handle rare cases. Adjudication and Data Compilation: Annotating RRs is not a trivial task, and there are disagreements among annotators. We followed a majority voting strategy over primary labels to determine the gold label from the experts' annotations. There were a few cases (approx. 5%) where all three legal experts assigned a different role to the same sentence. In such cases, we asked the law professors to finalize the primary label. If the law professors decided to go with a label completely different from the three annotated labels, we went with their verdict; however, such cases were not frequent (approx. 4% of adjudicated cases). In the current paper, for RR prediction, we concentrate on the primary labels. Given the complexity of the task, we leave secondary and tertiary labels for future work.
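The adjudication rule described above (majority vote over primary labels, with the law professors breaking three-way ties) can be sketched as follows; the function name and the tie-breaking argument are illustrative, not taken from the paper's codebase.

```python
from collections import Counter

def adjudicate(primary_labels, adjudicator_label=None):
    """Derive a gold RR label from three annotators' primary labels.

    A majority (2 of 3) wins; when all three experts disagree
    (approx. 5% of sentences in the paper), defer to the law
    professors' decision, which may even be a fourth label.
    """
    label, freq = Counter(primary_labels).most_common(1)[0]
    if freq >= 2:
        return label
    return adjudicator_label

# Majority case:   adjudicate(["FAC", "FAC", "ARG"])        -> "FAC"
# Three-way tie:   adjudicate(["FAC", "ARG", "PRE"], "ROD") -> "ROD"
```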
Inter-annotator Agreements: The Fleiss kappa (Fleiss et al., 2013) between the annotators is 0.65 for the IT domain and 0.87 for the CL domain, indicating substantial agreement between annotators. To determine the agreement between the three annotators A_1, A_2, A_3 (each for the IT and CL domains), we calculate the pairwise F1 scores (Appendix B) between annotator pairs (A_1, A_2), (A_2, A_3), and (A_3, A_1). We average these pairwise scores for each label and then macro-average across labels. We report the label-wise F1 and macro F1 in Table 1. The table shows that the agreements between domains differ (0.73 for IT vs. 0.88 for CL). This is mainly due to (as pointed out by the law professors) the presence of more precedents and a greater number of statutory provisions in IT laws. These factors combine to produce more subjectivity (relative to CL) when it comes to interpreting and retracing judicial decisions. The confusion matrix between annotators (A_1, A_3) is shown in Figure 1. Confusion matrices between the other annotators are given in Appendix A.3. Analysis: Annotation of judgments to identify the rhetorical roles is a challenging task, even for legal experts. Several factors contribute to this challenge. Annotators need to glean and combine information non-trivially (e.g., the facts and arguments presented by each party, the implicit setting, and the context under which the events described in the case happened) to arrive at the label. Moreover, the annotator only has access to the current document, which is a secondary account of what actually happened in the court. These limitations make the task of the annotator more difficult and leave them with no choice other than to make certain educated guesses when it comes to understanding the various nuances, both ostensible and probable, of certain rhetorical roles. It should, however, be noted that such variation need not occur for every rhetorical role, since not all the roles are equally susceptible to it.
A cumulative effect of the aforementioned factors can be observed in the results of the annotation. The annotations provided by the three annotators in the case of CL bear close resemblance to each other. On the other hand, in the case of IT, the annotations provided by Users 1 and 3 bear greater resemblance to each other than those of Users 1 and 2, or of Users 2 and 3. On a different note, it is also observed that the rhetorical role on which the annotators have differed the most is the Ruling made by the Lower Court, followed by the Ratio. This also ties in with the argument that not all rhetorical roles are equally susceptible to the variation caused by the varying levels of success achieved by the different annotators in retracing the judicial thought pattern (more details and case studies in Appendix A.4).
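The agreement computation used above (pairwise F1 between annotator pairs, averaged per label and then macro-averaged) can be sketched as follows; treating one annotator's labels as the reference for the other is an assumption about the convention, since the paper defers the exact definition to Appendix B.

```python
def label_f1(ref, other, label):
    """F1 for one label, treating `ref` as the reference annotations."""
    tp = sum(r == label and o == label for r, o in zip(ref, other))
    fp = sum(r != label and o == label for r, o in zip(ref, other))
    fn = sum(r == label and o != label for r, o in zip(ref, other))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def agreement(ann, labels):
    """Pairwise F1 over the three annotator pairs, then macro-average."""
    pairs = [(0, 1), (1, 2), (2, 0)]
    per_label = {
        lab: sum(label_f1(ann[i], ann[j], lab) for i, j in pairs) / len(pairs)
        for lab in labels
    }
    macro = sum(per_label.values()) / len(per_label)
    return per_label, macro
```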

Rhetorical Roles Labelling/Prediction
Task Definition: Given a legal document D containing the sentences [s_1, s_2, ..., s_n], the task of rhetorical role prediction is to predict the label (or role) y_i for each sentence s_i ∈ D.

Table 2: Results for the auxiliary task LSP on different datasets (converted to LSP form) and the models SBERT-Shift and BERT-SC.
Baseline Models: For the first set of baseline models, the task is modeled as a single-sentence prediction task: given a sentence s, the model predicts its rhetorical role. In this case, the context is ignored. We consider pre-trained BERT (Devlin et al., 2019) and LEGAL-BERT (Chalkidis et al., 2020) models for this. As another set of baseline models, we treat the task as a sequence labeling task, where the sequence of all the sentences in the document is given as input, and the model has to predict the RR label for each sentence. We used a CRF with hand-crafted features (Bhattacharya et al., 2019) and a BiLSTM. Label Shift Prediction: Rhetorical role labels do not change abruptly across sentences in a document, and the text tends to maintain topical coherence. Given the label y for a sentence s_i in the document, we hypothesize that the chance of a shift (change) in the label for the next sentence s_{i+1} is low. We manually verified this using the training set and observed that, on average in a document, if the label of sentence s_i is y, then 88% of the time the label of the next sentence s_{i+1} is also y. Note that this is true only for consecutive sentences; in general, label shift inertia fades as we try to predict beyond the second consecutive sentence. Since we are performing a sequence prediction task, this alone is not a good model for label prediction. Nevertheless, we think that this label shift inertia can provide a signal (via an auxiliary task) to the main sequence prediction model. Based on this observation, we define an auxiliary binary classification task, Label Shift Prediction (LSP), which aims to model the relationship between two sentences s_i and s_{i+1} and predict whether the labels y_i for s_i and y_{i+1} for s_{i+1} are different (a shift occurs) or not.
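The label-shift statistic quoted above (roughly 88% of consecutive sentence pairs share a label on the training set) and the binary shift targets it motivates can be computed with a short sketch (toy labels for illustration, not corpus data):

```python
def shift_labels(labels):
    """Binary LSP targets: 1 wherever the role changes from s_i to s_{i+1}."""
    return [int(a != b) for a, b in zip(labels, labels[1:])]

def same_label_rate(docs):
    """Fraction of consecutive sentence pairs sharing a rhetorical role,
    averaged over documents (the paper reports ~88% on its train set)."""
    rates = []
    for labels in docs:
        shifts = shift_labels(labels)
        if shifts:
            rates.append(1 - sum(shifts) / len(shifts))
    return sum(rates) / len(rates)
```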
In particular, for each document D and each sentence pair S = {s_i, s_{i+1}} in D, we define the label of the LSP task as z_i = 1 if y_i ≠ y_{i+1} (a shift occurs) and z_i = 0 otherwise. Note that for the full model (described later), at inference time we are not provided with the true label of a sentence; hence, predicting a shift in the label makes more sense than a binary prediction of whether the next sentence has the same label or not. We model the LSP task via two different models. SBERT-Shift: We model the label shift via a Siamese network. In particular, we use the pre-trained SBERT model (Reimers and Gurevych, 2019) to encode sentences s_i and s_{i+1} to get representations e_i and e_{i+1}. The combination of these representations (e_i ⊕ e_{i+1} ⊕ (e_i − e_{i+1})) is passed through a feed-forward network to predict the shift. BERT-SC: We use the pre-trained BERT model and fine-tune it for the task of LSP. We model the input in the form of a sentence semantic coherence task to make the final prediction for the shift. In general, the BERT-SC model performs better than SBERT-Shift (Table 2). Proposed Model: The model consists of (Figure 2) a BiLSTM-CRF model with a specialized input representation. Let the sentence embedding (from pre-trained BERT) corresponding to the i-th sentence be b_i, and let c_i be the representation of the label shift (the layer before the softmax layer in the LSP model) between the current sentence and the previous sentence. The combined sentence representation (b_i ⊕ c_i) goes as input to the BiLSTM-CRF model for RR prediction.
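The SBERT-Shift input combination described above (e_i ⊕ e_{i+1} ⊕ (e_i − e_{i+1})) is straightforward to construct once the two sentence embeddings are available; the sketch below shows only the feature construction, with the SBERT encoder and the downstream feed-forward classifier omitted.

```python
import numpy as np

def shift_features(e_i, e_next):
    """Concatenate the two sentence embeddings and their difference,
    a common Siamese combination fed to the shift classifier."""
    return np.concatenate([e_i, e_next, e_i - e_next])

# With 768-dim SBERT embeddings, the classifier input is 3 * 768 = 2304 dims.
```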
Multi-Task Learning (MTL): We use the framework of multi-task learning, where rhetorical role prediction is the main task and label shift prediction is the auxiliary task. Sharing representations between the main and related tasks helps in better generalization on the main task. The intuition is that the label shift signal would help the rhetorical role component make the correct prediction based on the prospective shift. The MTL model is trained with the loss L = L_RR + λ * L_shift, where L_shift is the loss corresponding to label shift prediction, L_RR is the loss corresponding to rhetorical role prediction, and the hyperparameter λ balances the importance of each task. If λ is set to zero, we are back to our baseline BiLSTM-CRF model. Since there are two components, we experimented with sending the same encodings of sentences to both components (E_1 = E_2), as well as sending different encodings of the same sentence to each component (E_1 ≠ E_2).
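The λ-weighted joint objective can be sketched with per-head cross-entropy losses as numpy stand-ins; the paper's actual heads are a CRF for RR and a softmax for LSP, so take this only as an illustration of how the two losses combine.

```python
import numpy as np

def cross_entropy(logits, target):
    """Softmax cross-entropy for a single example."""
    z = logits - logits.max()          # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(probs[target])

def mtl_loss(rr_logits, rr_label, shift_logits, shift_label, lam=0.6):
    """L = L_RR + lam * L_shift; with lam = 0 only the RR loss remains,
    recovering the plain BiLSTM-CRF baseline."""
    return cross_entropy(rr_logits, rr_label) + lam * cross_entropy(shift_logits, shift_label)
```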

Experiments, Results and Analysis
For RR prediction, we consider seven main labels (FAC, ARG, PRE, ROD, RPC, RLC, and STA). We do not consider sentences with the NON (None) label (about 4% for IT and 0.5% for CL). Analysis of the annotated dataset reveals that the IT domain does not have any instance of the dissent (DIS) label. Only three documents (out of 50) in the CL domain have a few instances of the dissent label. Moreover, the instances of the dissent label were present as a contiguous chunk of sentences at the end of the document. Hence, we discarded the sentences with dissent labels. Furthermore, law experts told us that the dissent phenomenon is rare; from a practical (application) point of view, these labels can be discarded. To compare with prior work, we also experimented with an RR dataset of 50 documents (referred to as G) provided by Bhattacharya et al. (2019). Note that the G dataset comes from a different legal sub-domain (criminal and civil cases) with very little overlap with IT and CL. We randomly split the IT/CL/G datasets into 80% train, 10% validation, and 10% test sets. We also experiment with a combined dataset of IT and CL (IT+CL); the splits are made by combining the individual train/val/test splits of IT and CL. We experimented with a number of baseline models, both with pre-trained BERT embeddings and with MLM embeddings. We fine-tuned BERT with the Masked Language Modeling (MLM) objective on the train set to obtain MLM embeddings (CLS embedding) for each of the sentences (refer to Appendix C for hyperparameters, training schedule, and compute settings). We use the macro F1 metric to evaluate performance (definition in Appendix B). We tuned the hyperparameter λ of the MTL loss function using the validation set. We trained the MTL model with λ ∈ [0.1, 0.9] with strides of 0.1 (Figure 4). λ = 0.6 performs the best for the IT domain and performs competitively on the combined domains.
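The λ tuning described above amounts to a small grid search on validation macro F1. A sketch follows; the `eval_fn` callback, which stands in for a full train-plus-validate run at a given λ, is an assumption rather than the paper's interface.

```python
def tune_lambda(eval_fn, lams=None):
    """Pick the lambda in {0.1, ..., 0.9} maximizing validation macro F1.

    eval_fn(lam) is assumed to train the MTL model with that lambda and
    return the validation macro F1 (the paper found 0.6 best for IT).
    """
    lams = lams or [round(0.1 * k, 1) for k in range(1, 10)]
    return max(lams, key=eval_fn)
```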
Results and Analysis: Among the baseline models (Table 3), we note that LEGAL-BERT performs slightly better on the CL domain but slightly worse on the IT domain when compared to pre-trained BERT. This might be attributed to the fact that LEGAL-BERT (trained on EU legal documents, which include European competition law) is not trained on Indian IT law documents. Using BERT embeddings with BiLSTM-CRF provides better results. Both the proposed approaches outperform the previous approaches by a significant margin. The MTL approach (with λ = 0.6) provides the best results on both datasets, with an average (over six runs) F1 score of 0.70 (standard deviation of 0.02) on the IT domain, an average F1 of 0.69 (±0.01) on the CL domain, and an average F1 score of 0.71 (±0.01) on the combined domain. The MTL model shows variance across runs; hence we average the results. Other models were reasonably stable across runs.
We use the LSP shift component with BERT-SC as the encoder E_1 and the pre-trained BERT model as the encoder E_2 in our MTL architecture. We did not use SBERT, since it was under-performing compared to BERT-SC. We provide the label-wise F1 scores for the MTL model in Table 5. Note the high performance on the FAC label and low performance on the RLC label; this is similar to what we observe for the annotators (see Table 1). Also, the MTL model performs significantly better on the ARG label in the CL domain than in the IT domain; an opposite trend can be observed for the RLC label. The contribution of the LSP task is evident from the superior performance. We conduct an ablation study of our MTL architecture from multiple aspects. Instead of using shift embeddings from BERT-SC as the encoder E_1, we use a BERT model fine-tuned with the MLM task on the IT and CL domains. However, we obtain a comparatively lower score (see Appendix C). This observation again points towards the significance of LSP for the task of rhetorical role prediction (results with other encoders in Appendix C). The results yield two interesting observations. Firstly, MTL model performance on IT cases comes close to the average inter-annotator agreement; in the case of CL, there is a gap. Secondly, the model performs better on the IT domain than on the CL domain, but for the annotators the opposite trend was observed. We do not know the exact reason for this, but the legal experts pointed out that this is possible because the selected documents might be restricted to specific sections of the IT law, and the model learned solely from these documents without any other external knowledge, whereas the annotators, having knowledge of the entire IT law, might have looked at the documents from a broader perspective. Domain Transfer: We can observe that the MTL model generalizes better across the domains compared to the baseline model. Both models perform better on the G test set when the combined (IT+CL) train set is used.
This points towards better generalization.

Model Distillation: RR annotation is a tedious process; however, there is an abundance of unlabelled legal documents. We experimented with semi-supervised techniques to leverage the unlabelled data. In particular, we tried a self-training based approach (Xie et al., 2020). The idea is to learn a teacher model θ_tea on the labelled data D_L. The teacher model is then used to generate hard labels on the unlabeled sentences s_u ∈ d_i. Next, a student model θ_stu is learned on the labeled and pseudo-labeled sentences, with the student loss weighing the unlabelled portion by a hyperparameter α_U (details in Appendix C). The process can be iterated, and the final distilled model is used for prediction. The results of model distillation are shown in Table 7 for two iterations (initializing the teacher model of the current iteration as the learned student model of the previous iteration; further iterations do not improve results). The MTL model was run just once; due to variance, it shows an F1 of 0.68. The results improve for the majority of labels, with an increment of 0.11 F1 score for the RLC label in the first iteration. Also, the variance of F1 scores across labels decreases significantly.

RR Application: Judgment Prediction: To check the applicability of RR in downstream applications, we experimented with how RR could contribute towards judgment prediction. We use the legal judgment corpus (ILDC) provided by Malik et al. (2021) and fine-tune a pre-trained BERT model on the train set of ILDC for the task of judgment prediction on the last 512 tokens of the documents. Malik et al. (2021) observed that training on the last 512 tokens (also the max size of the input to BERT) of a legal document gives the best results; we use the same setting. We use this trained model directly for predicting the outcome on 84 IT/CL cases. We removed the text corresponding to the final decisions (and extracted gold decisions) from these documents with the help of legal experts. In the first experiment, we use the last 512 tokens of the IT/CL cases for prediction. To study the effect of RRs, in another experiment, we extract the sentences corresponding to the gold ratio (ROD) and ruling (RPC) RR labels in the IT/CL documents and use these as input to the BERT model. There were no ROD or RPC labels for some documents (16 out of 100 across IT and CL); we removed these in both experiments. The results are shown in Table 8.

Model | F1 (IT+CL docs)
BERT-ILDC (last 512 tokens) | 0.55
BERT-ILDC (gold ROD & RPC) | 0.58

Table 8: Judgment prediction using RR.

Using the gold RR gives a boost to the F1 score. We also experimented with using predicted RR, and the performance was comparable to that of the BERT model. Since RR prediction for ROD and RPC is not perfect, improving it would enhance the results.
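The distillation loop described in this section (a teacher hard-labels unlabeled sentences; a student retrains on both pools and becomes the next teacher) can be sketched as follows. The `train_fn` callback and the way α_U is threaded through it are assumptions, since the paper leaves the exact training interface to its Appendix C.

```python
def self_train(train_fn, labeled, unlabeled, alpha_u=0.5, rounds=2):
    """Iterative self-training in the style of Xie et al. (2020).

    train_fn(labeled, pseudo, alpha_u) -> model is assumed to weight the
    pseudo-labeled loss by alpha_u; model(sentence) -> hard label.
    """
    model = train_fn(labeled, [], alpha_u)            # initial teacher on gold data
    for _ in range(rounds):                           # paper: two rounds suffice
        pseudo = [(s, model(s)) for s in unlabeled]   # teacher's hard pseudo-labels
        model = train_fn(labeled, pseudo, alpha_u)    # student becomes next teacher
    return model
```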

Conclusion
In this paper, we introduced a new corpus annotated with rhetorical roles. We proposed a new MTL model that uses label shift information for predicting the labels. We further showed, via domain transfer experiments, the generalizability of the model. Since RRs are tedious to annotate, we showed the possibility of using model distillation techniques to improve the system; we plan further research in this direction. We also plan to explore cross-domain transfer techniques to perform RR identification in legal documents in other Indian languages. We plan to grow the corpus and to apply RR models to other legal tasks such as summarization and information extraction.

Ethical Concerns
To the best of our knowledge, the proposed corpus and methods do not have direct ethical consequences. The corpus is created from publicly available data from a public resource: www.indiankanoon.org. The website allows free downloads, and no copyrights were violated. The annotators voluntarily participated in the annotations and were compensated. The cases were selected randomly to avoid any bias towards any entity, situation, or laws. Any meta-information related to individuals, organizations, and judges was removed so as to avoid introducing bias. For the task of judgment prediction, we took all the steps (name anonymization and other meta-information removal) outlined in Malik et al. (2021).

A.1 Dataset Details

We collect the dataset of IT and CL cases from the Supreme Court of India and the Bombay and Kolkata High Courts. For CL cases, we also use cases from the tribunals NCLAT, CCI, and COMPAT. Since the IT laws are 50 years old and relatively dynamic, we stick to certain sections of the IT domain only, whereas we use all sections for the CL domain. We restrict ourselves to IT cases based on Section 147, Section 92C, and Section 14A only, to limit the subjectivity in the cases. We randomly select 50 cases each from the IT and CL domains to be annotated. We used regular expressions in Python to remove the auxiliary information in the documents (for example: dates, appellant and respondent names, judge names, etc.) and filter out the main judgment of the document. We use the NLTK sentence tokenizer to split each document into sentences. The annotators were asked to annotate these sentences with the rhetorical roles.
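A minimal sketch of the preprocessing pipeline described above follows. The date pattern is one hypothetical example of the auxiliary information removed (the paper also strips party and judge names), and the regex splitter merely stands in for the NLTK Punkt tokenizer the paper uses.

```python
import re

# Hypothetical pattern for one kind of auxiliary information (dates).
DATE_RE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")

def clean_judgment(text):
    """Remove auxiliary metadata before annotation."""
    return DATE_RE.sub("", text).strip()

def split_sentences(text):
    """Naive stand-in for nltk.sent_tokenize: split after ., ?, or !."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]
```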

A.2 Rhetorical Role Definitions
• Fact (FAC): The facts specific to the case, based on which the arguments have been made and the judgment has been issued. In addition to Fact, we also have the fine-grained label Issues (ISS): the issues which have been framed/accepted by the present court for adjudication.
• Argument (ARG): The arguments in the case, divided into two fine-grained sub-labels. Argument Petitioner (ARG-P): arguments which have been put forward by the petitioner/appellant in the case before the present court and by the same party in the lower courts (where it may have been petitioner/respondent). Argument Respondent (ARG-R): arguments which have been put forward by the respondent in the case before the present court and by the same party in the lower courts (where it may have been petitioner/respondent).
• Statute (STA): The laws referred to in the case.

A.3 Inter-annotator Agreement
The Fleiss kappa over all (fine-grained) labels is 0.59 for IT and 0.87 for CL, indicating substantial agreement. We provide the inter-annotator agreement (averaged pairwise macro F1 between annotators) for the 13 fine-grained labels in Table 9. We also provide the pairwise confusion matrices of annotators (A_1, A_2) and (A_2, A_3) for both the IT and CL domains in Figure 6.

Table 9: Label-wise inter-annotator agreement for all 13 fine-grained labels.

A.4 Annotation Analysis
Annotation of judgments in order to identify and distinguish between the rhetorical roles played by their various parts is in itself a challenging task, even for legal experts. Several factors in the exercise require the annotator to retrace the judicial decision making and recreate the impact left by the inputs available to the judge, such as certain specific facts of the case, a particular piece of argument advanced by the lawyer representing one of the parties, or a judicial precedent from a higher court deemed applicable in the current case by the lawyer(s), the judge, or both. Moreover, the annotator only has access to the current document, which is a secondary account of what actually happened in the court. These limitations make the task of the annotator more difficult and leave them with no choice other than to make certain educated guesses when it comes to understanding the various nuances, both ostensible and probable, of certain rhetorical roles. It should, however, be noted that such variation need not occur for every rhetorical role, since not all the roles are equally susceptible to it. For instance, the facts of the case as laid down by the judge are more readily and objectively ascertainable by more than one annotator, whereas the boundaries between the issues framed by the judge and those deemed relevant as per the arguments advanced by the lawyers may blur more, especially if the judge happens to agree with one of the lawyers and adopts their argument as part of the judicial reasoning itself.
Similarly, it should also be noted that despite differing in their views of the nature and extent of the rhetorical role played by a certain part of the judgment, the annotators may still agree with each other when it comes to identifying and segregating the final ruling made by the judge in that case. This phenomenon of using two different routes to arrive at the same destination is not uncommon in the reenactment or ex post facto analysis of a judicial hearing and decision-making process. A cumulative effect of the aforementioned factors can be observed in the results of the annotation. The annotations provided by the three annotators in the case of competition law closely resemble each other. On the other hand, in the case of income tax law, the annotations provided by Users 1 and 3 resemble each other more closely than those of Users 1 and 2, or of Users 2 and 3. On a different note, the rhetorical role on which the annotators differed the most is the Ruling by the Lower Court, followed by the Ratio. This ties in with the aforesaid argument that not all rhetorical roles are equally susceptible to the variation caused by the varying levels of success achieved by the different annotators in retracing the judicial thought pattern.

A.5 Annotation Case Studies
Along with law professors, we analyzed some of the case documents. Please refer to the data files for the actual judgments.
Among the CL cases, the closest resemblance is achieved in SC_Competition Commission of India vs Fast Way Transmission Pvt Ltd and Ors 24012018 SC.txt. This judgment is written so as to provide specific indicators before every rhetorical role. For instance, before the Ruling by Lower Court starts, it is mentioned that this is the opinion given by the Competition Commission of India (the lower court in the relevant domain). Similarly, before the Arguments made by Petitioner/Respondent, it is mentioned that this is the argument made by the lawyer representing the petitioner/respondent. This judgment also exhibits a consistent flow, following the arrangement of the rhetorical roles in order. Its relatively smaller size also indicates a lower level of complexity (although there need not always be a consistent correlation between the two). On the other hand, in the case with the least resemblance in the competition law domain, SC_Excel Crop Care Limited vs Competition Commission of India and Ors 08052017 SC(1).txt, such specific indicators are usually absent, leaving scope for individual discretion and interpretation; the judgment goes back and forth between certain rhetorical roles (Issue, Ruling by Lower Court, Ratio by Present Court, Argument by Petitioner/Respondent, Precedent Relied Upon); and its relatively larger size involves additional complexity and analysis, which makes room for further nuances as described above.

B Evaluation Metrics
We use the Macro F1 metric to evaluate the performance of models on the task of rhetorical role labelling. Macro F1 is the unweighted mean of the label-wise F1 scores. Given the true positives (TP), false positives (FP), and false negatives (FN) for a label, its F1 score is calculated as:

F1 = 2TP / (2TP + FP + FN)

The pairwise inter-annotator agreement F1 between two annotators A and B is calculated by considering the annotations by annotator A as the true labels and the annotations by annotator B as the predicted labels.
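The per-label F1 and its macro average, as used for both model evaluation and pairwise inter-annotator agreement, can be sketched in a few lines of Python (a minimal illustration; the function name is ours):

```python
def macro_f1(true_labels, pred_labels):
    """Macro F1: unweighted mean of per-label F1 scores.

    For pairwise inter-annotator agreement, annotator A's labels are
    passed as `true_labels` and annotator B's as `pred_labels`.
    """
    labels = set(true_labels) | set(pred_labels)
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(true_labels, pred_labels))
        fp = sum(t != lab and p == lab for t, p in zip(true_labels, pred_labels))
        fn = sum(t == lab and p != lab for t, p in zip(true_labels, pred_labels))
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```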
We also calculate Fleiss' kappa to measure the overall inter-annotator agreement.

C Model Training Details
All of our baseline experiments and the training of the label shift prediction models (SBERT-Shift and BERT-SC) were conducted on Google Colab, using the default single Tesla P100-PCIE-16GB GPU provided by Colab. Our proposed models were trained on a single 11GB GeForce RTX 2080 Ti. We used the SBERT model provided in the sentence-transformers library, and the Huggingface implementations of the BERT-base and LEGAL-BERT models.
For SBERT-Shift, we kept the SBERT model frozen and tuned the three linear layers on top. We used the binary cross-entropy loss function with the Adam optimizer to train the model on the LSP task.
For BERT-SC, we fine-tuned the pre-trained BERT-base model on the LSP task. We used a maximum sequence length of 256 tokens and a learning rate of 2e-5, and trained for 5 epochs. We used the same loss function and optimizer as for the SBERT-Shift model.

C.1 Single Sentence Classification Baselines
We train single sentence classification models for the task of rhetorical role labelling. We use the BERT-base-uncased and LEGAL-BERT models and fine-tune them on the sentence classification task. We also try a variant that uses context sentences (the left and right neighbouring sentences) along with the current sentence for classification; we call this method BERT-neighbor. We use cross-entropy loss as the criterion and Adam as the optimizer. We use a batch size of 32 with a learning rate of 2e-5 and fine-tune for 5 epochs in all our experiments. Refer to Tables 10, 11 and 12 for results and more information about the hyperparameters.
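The input construction for BERT-neighbor can be sketched as follows (the `[SEP]`-joining scheme and the function name are our assumptions for illustration; boundary sentences get an empty neighbour):

```python
def neighbor_inputs(sentences, sep=" [SEP] "):
    """Pair each sentence with its left and right neighbours to form
    the context-augmented classification inputs."""
    out = []
    for i, sent in enumerate(sentences):
        left = sentences[i - 1] if i > 0 else ""
        right = sentences[i + 1] if i < len(sentences) - 1 else ""
        out.append(sep.join([left, sent, right]))
    return out
```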

C.2 Sequence Classification Baselines
We experiment with sequence classification baselines: a CRF with handcrafted features, a BiLSTM with sent2vec embeddings, and different versions of BiLSTM-CRF in which we varied the input embeddings. We experimented with sent2vec embeddings fine-tuned on Supreme Court Cases of India (same as in Bhattacharya et al. (2019)). We also tried sentence embeddings obtained from the BERT-base model. In another experiment, we fine-tuned a pre-trained BERT model on the Masked Language Modelling (MLM) task over the unlabelled documents of the IT and CL domains, and used this model to extract sentence embeddings for the BiLSTM-CRF model.
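One common way to derive fixed-size sentence embeddings from BERT's token outputs is mask-aware mean pooling, sketched below (the pooling choice is our assumption for illustration, not necessarily the one used here):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors into one sentence vector, ignoring
    padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for j, v in enumerate(vec):
                summed[j] += v
    return [s / count for s in summed]
```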
We used the same implementation of BiLSTM-CRF as Bhattacharya et al. (2019), with the Adam optimizer and the NLL loss function. Refer to Tables 10, 11 and 12 for experiment-wise hyperparameters.

C.3 LSP-BiLSTM-CRF and MTL-BiLSTM-CRF models
In our proposed approach of LSP-BiLSTM-CRF, we experiment with two methods of generating shift embeddings, namely BERT-SC and SBERT-Shift. These embeddings were then used as input to train a BiLSTM-CRF with similar training schedules. Refer to Tables 10, 11 and 12 for the other hyperparameters.
For the MTL models, we experimented with different encoders E1 and E2. We experimented with using shift embeddings from BERT-SC (or BERT embeddings of sentences obtained from the pre-trained BERT model) in both components. However, the best performing model used shift embeddings for the shift component and BERT embeddings for the RR component. We used the NLL loss in both components of the MTL model, weighted by the hyperparameter λ, and the Adam optimizer for training. We provide dataset-wise hyperparameters and results in Tables 10, 11 and 12.
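The λ-weighted combination of the two component losses can be sketched as below; the exact weighting form (convex combination of the auxiliary shift loss and the main RR loss) is our assumption for illustration:

```python
def mtl_loss(loss_shift, loss_rr, lam):
    """Combine the auxiliary label-shift loss and the main rhetorical-role
    loss, traded off by the hyperparameter lambda (assumed convex form)."""
    return lam * loss_shift + (1 - lam) * loss_rr
```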