SemEval-2023 Task 6: LegalEval - Understanding Legal Texts

In populous countries, pending legal cases have been growing exponentially. There is a need to develop NLP-based techniques for processing and automatically understanding legal documents. To promote research in the area of Legal NLP, we organized the shared task LegalEval - Understanding Legal Texts at SemEval 2023. The LegalEval task has three sub-tasks: Task-A (Rhetorical Roles Labeling) is about automatically structuring legal documents into semantically coherent units, Task-B (Legal Named Entity Recognition) deals with identifying relevant entities in a legal document, and Task-C (Court Judgment Prediction with Explanation) explores the possibility of automatically predicting the outcome of a legal case along with providing an explanation for the prediction. In total, 26 teams (approx. 100 participants spread across the world) submitted system papers. In each of the sub-tasks, the proposed systems outperformed the baselines; however, there is still much scope for improvement. This paper describes the tasks and analyzes the techniques proposed by the various teams.


Introduction
In populous countries (e.g., India), pending legal cases have grown exponentially. Due to the nature of the legal domain, it may not be possible to automate the entire judicial pipeline; nevertheless, many intermediate tasks can be automated to augment legal practitioners and hence expedite the system. For example, legal documents can be processed with the help of Natural Language Processing (NLP) techniques to organize and structure the data so that it is amenable to automatic search and retrieval. However, legal texts differ from the commonly occurring texts typically used to train NLP models. Legal documents are quite long (tens and sometimes hundreds of pages), making them challenging to process (Malik et al., 2021). Another challenge is that legal documents use a different lexicon compared to standard texts. Moreover, legal documents are manually typed in countries like India and are highly unstructured and noisy (e.g., spelling and grammatical mistakes). The above-mentioned challenges make it difficult to apply existing NLP models and techniques directly, which calls for the development of legal domain-specific techniques.
To promote research and development in the area of Legal-AI, we proposed shared tasks via "LegalEval - Understanding Legal Texts". This paper gives details about the shared task and our experience in organizing LegalEval. In total, 26 teams submitted system papers in the shared task. Though the documents pertained to the Indian legal system, teams from across the globe participated, showing an interest in developing legal-NLP techniques that could later be generalized to other legal systems as well.
The paper is organized as follows. In Section 2, we outline work done in the legal-NLP domain. Section 3 describes the tasks in detail. This is followed by details of the corpora used for the various sub-tasks in Section 4. Details of the models developed by each of the teams are discussed in Section 5. Section 6 analyzes the results obtained by the various techniques for each of the sub-tasks. Finally, Section 7 summarizes and concludes the paper.

Related Work
In recent years, Legal NLP has become an active area of research. Researchers have worked on various problems pertaining to the legal domain, such as Legal Judgment Prediction (Zhong et al., 2020; Malik et al., 2021; Chalkidis et al., 2019; Aletras et al., 2016; Long et al., 2019; Xu et al., 2020; Yang et al., 2019a; Kapoor et al., 2022), Rhetorical Role Labeling (Bhattacharya et al., 2019; Saravanan et al., 2008; Malik et al., 2022a; Kalamkar et al., 2022b), Summarization (Tran et al., 2019), and Prior Case Retrieval and Statute Retrieval (Rabelo et al., 2020, 2022; Kim et al., 2019). Due to space limitations, we do not go into the details of these tasks and instead refer readers to surveys on Legal-AI by Westermann et al. (2022); Villata et al. (2022); Colonna (2021). There has been some work in the areas of Rhetorical Role Labeling and Legal Judgment Prediction; however, these research problems remain far from solved, hence we organized shared tasks on them. Moreover, Legal-NER (Named Entity Recognition) is an unexplored area, and we would like to draw the community's attention to it.

Figure 1: Example of document segmentation via Rhetorical Role labels. On the left is an excerpt from a legal document, and on the right is the document segmented and labeled with rhetorical role labels.

Tasks Description
In LegalEval 2023, we proposed the following three shared tasks, which focus on automatically understanding legal documents. The documents were in English.

Task A: Rhetorical Roles (RR) Prediction: Given that legal documents are long and unstructured, we proposed a task for automatically segmenting legal judgment documents into semantically coherent text segments, where each segment is assigned a label such as preamble, fact, ratio, arguments, etc. These labels are referred to as Rhetorical Roles (RR). Concretely, the task is to segment a given legal document by predicting the rhetorical role label for each sentence (Figure 1) (Malik et al., 2022b; Kalamkar et al., 2022b). We target 13 RRs, as outlined in our previous work (Kalamkar et al., 2022b). RR Task Relevance: The purpose of creating a rhetorical role corpus is to enable automated understanding of legal documents by segmenting them into topically coherent units (Figure 1). This segmentation is a fundamental building block for many legal AI applications such as judgment summarization, judgment outcome prediction, and precedent search.

Task B: Legal Named Entity Recognition (L-NER): Named Entity Recognition is a widely studied problem in Natural Language Processing (Li et al., 2018; Dozier et al., 2010). However, legal documents contain peculiar entities such as the names of the petitioner, respondent, court, statute, provision, and precedents. These entity types are not recognized by standard named entity recognizers. Hence there is a need to develop a legal NER system; consequently, we proposed Task B as Legal Named Entity Recognition (L-NER). In this task, we target 14 legal named entities, as described in our previous work (Kalamkar et al., 2022a) and shown in Table 1. An example of the task is shown in Figure 2.
Figure 2: Example of Named Entities in a court judgment.

Blackstone (https://github.com/ICLRandD/Blackstone) provides a library for L-NER; however, it is limited to UK cases and does not generalize well to case documents from other countries; moreover, the set of entities it covers is quite limited.

Task C: Court Judgment Prediction with Explanation (CJPE): In recent years, the task of legal judgment prediction has been an active area of research aimed at augmenting a judge in predicting the outcome of a case (Malik et al., 2021). However, given the nature of the legal process, judgment prediction without an accompanying explanation is of little practical relevance to legal practitioners. We proposed the task of Legal Judgment Prediction with Explanation: given a legal judgment document, the task is to automatically predict the outcome (binary: accepted or denied) of the case and also provide an explanation for the prediction. The explanations are in the form of relevant sentences in the document that contribute to the decision. This task is an extension of one of our previous works (Malik et al., 2021). We divided the task into two subtasks: (1) Court Judgment Prediction and (2) Explanations for the Prediction. As in our previous work, we took steps to remove bias in the dataset and to address the ethical concerns associated with the task. Task Relevance: The volume of legal cases has been growing exponentially in many countries; a system that augments judges in the process would be of immense help and would expedite proceedings. Moreover, if the system provides an explanation for the decision, it would help a judge make an informed decision about the final outcome of the case.

Evaluation
The rhetorical roles task (a multiclass prediction problem) is evaluated using a weighted F1 score on the test data. For the CJPE task, judgment prediction (binary classification) is evaluated using the standard F1 score metric, and for the explanation sub-task, we use the ROUGE-2 score (Malik et al., 2021) to evaluate the machine explanations with respect to the gold annotations. For L-NER, we use the standard F1 score metric (Segura-Bedmar et al., 2013). Automatic evaluations of the submitted models are done using CodaLab, and a leaderboard is maintained (https://codalab.lisn.upsaclay.fr/competitions/9558).
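To make the Task-A metric concrete, the weighted F1 score described above can be sketched in a few lines of pure Python: per-class F1 scores are averaged with weights proportional to each class's support in the gold labels. The label names below are illustrative, not the shared-task label set.

```python
from collections import Counter

def f1_per_class(gold, pred, label):
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def weighted_f1(gold, pred):
    # Each class's F1 is weighted by its support (its count in the gold labels).
    support = Counter(gold)
    return sum(f1_per_class(gold, pred, c) * n for c, n in support.items()) / len(gold)

# Hypothetical rhetorical-role labels, for illustration only.
gold = ["FAC", "FAC", "ARG", "RATIO"]
pred = ["FAC", "ARG", "ARG", "RATIO"]
score = weighted_f1(gold, pred)  # 0.75
```

This matches the behavior of `sklearn.metrics.f1_score(..., average="weighted")`, which is the usual implementation choice for such a multiclass setting.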

Corpora
We focus on freely and publicly available Indian legal documents in English. The documents were obtained from the Indian legal search portal IndianKanoon (https://indiankanoon.org/). India follows a common-law system, which is similar to the ones in the U.S. and the U.K. The task organizers, with the help of legal experts, have previously created annotated datasets for RR (Kalamkar et al., 2022b; Malik et al., 2022b), L-NER (Kalamkar et al., 2022a) and CJPE (Malik et al., 2021). The data was annotated by senior law students, professors, and legal practitioners, thus ensuring its high quality. We annotated around 250 judgments, comprising roughly 850k tokens and about 10k entities. The RR data has about 265 legal documents with 841,648 tokens and 26,304 sentences, an average of 3,176 tokens per document. It is split into train, validation, and test sets with an approximate 70-10-20 split. The L-NER corpus has about 14,444 judgment documents annotated with 17,485 named entities belonging to 14 categories. The CJPE corpus has about 34K documents, which are split into train, validation, and test sets.

Participating Systems
We organized the shared task using the CodaLab platform. In total, 40 teams participated in the various sub-tasks, and 26 teams (approx. 100 participants) eventually submitted system description papers. Overall, there were 17 submissions for Task-A (RR) and 11 submissions each for Task-B (NER) and Task-C (CJPE). Teams from outside India also participated in the shared task. The demographic distribution of teams is shown in Figure 3. Next, we summarize the approaches proposed by the various teams for each of the tasks.

Task-A (Rhetorical Role Prediction)
In total, seventeen teams provided a description of their approach. We observed a wide variety of approaches for Task-A, including BiLSTM-based, Transformer-based, and GNN-based architectures, of which BiLSTM was one of the popular choices (6 teams). Another common trend was the use of pre-trained transformer-based architectures as a backbone, e.g., BERT (Devlin et al., 2019), RoBERTa, LegalBERT (Chalkidis et al., 2020), and InLegalBERT (Paul et al., 2022).
AntContentTech (Huo et al., 2023) proposed a system based on Bi-LSTM with CRF layers and used domain-adaptive pre-training with auxiliary-task learning, with pre-trained Legal-BERT as a backbone. They also experimented with augmentation strategies like shuffling precedents among the documents and back translation. These augmentations helped improve performance.
VTCC-NER (Tran and Doan, 2023) followed a unique strategy of using Graph Neural Nets with sentence embeddings. Moreover, the proposed method used multitask learning with label shift prediction and contrastive learning to separate close labels like ARG_petitioner and ARG_respondent. Combining entity-based GNN (GAT (Veličković et al., 2018)) with multitask learning and contrastive learning helped to improve performance.
TeamShakespeare (Jin and Wang, 2023) highlighted the significance of considering inter- and intra-sentence representations in RR prediction. The method also reported improvements achieved by replacing BERT-base with LegalBERT as the backbone in the architecture.
NLP-Titan (Kataria and Gupta, 2023) followed a two-stage hierarchical strategy where, in the first stage, similar classes were merged and a sequential sentence classification algorithm based on a Hierarchical Sequential Labeling Network (HSLN) was used. In the second stage, a SetFit model was trained over the fine-grained merged classes.
YNU-HPCC (Chen et al., 2023) proposed the use of LegalBERT with attention pooling and a CRF module. Predictions were made using the deep ensembles (Lakshminarayanan et al., 2017) strategy, where multiple models were trained with different seeds, followed by a voting scheme.
TeamUnibo (Noviello et al., 2023) used a pretrained BERT with two BiLSTM layers followed by average pooling and a linear layer. The proposed method was motivated to maintain topical coherence by taking the surrounding sentences as context for RR classification.
IRIT_IRIS_(A) (Lima et al., 2023) proposed a unique style of segmenting long documents referred to as DFCSC (Dynamic Filled Contextualized Sentence Chunks), where every chunk contained a variable number of core and edge sentences. The paper further described 3 variants of the proposed method and reported the results.
LRL_NC (Tandon and Chatterjee, 2023) used a hierarchical sequence encoder with contextual sentence representations followed by a CRF layer for classifying sentences into RR categories.
ResearchTeam_HCN (Singh et al., 2023) used back translations to balance the imbalanced classes and proposed using attention pooling with sentence transformers.
NITK_Legal (Sindhu et al., 2023) also used a BiLSTM architecture with a CRF layer in a hierarchical structure. The paper also compared various sentence encoding approaches, such as BiLSTM embeddings, sentence2vec (sent2vec) (Moghadasi and Zhuang, 2020), the Universal Sentence Encoder (Cer et al., 2018), Sentence-BERT (SBERT) (Reimers and Gurevych, 2019), trigram embeddings, weighted trigrams, and handcrafted features + sent2vec, and reported the importance of handcrafted features along with sent2vec embeddings for RR classification.
TüeReuth-Legal (Manegold and Girrbach, 2023) explored learning a continuous distribution over positions in the document to indicate the likelihood of respective RRs appearing at certain positions. The method further incorporated domain knowledge through hard-coded lexical rules to improve performance.
NITS_Legal (Jain et al., 2023) created a different version of the dataset for capturing local context and used a BiLSTM-based sentence classification approach, highlighting the significance of context in RR prediction.
Steno AI created a graphical structure for capturing the relationships between the sentences in a document and applied GNN-based classification to predict the rhetorical roles.
CSECU-DSG (Tasneem et al., 2023) framed the task as a multi-class sequence classification task. In the proposed approach the authors utilized the finetuned Legal-BERT for classifying legal sentences into 13 rhetorical roles.
LTRC (Baswani et al., 2023) used several algorithms to classify sentence embeddings into different classes of Rhetorical Roles. The authors showed that using 512 instead of 256 as embedding size gives better results. Further, they showed that enriching the embeddings and adding a CRF layer to the classifier helps to improve results for classes with fewer samples.
UO-LouTAL (Minard et al., 2023) proposed a Bag-of-Words-based system and used NER labels to replace entities with their types to reduce sentence variability.

Task-B (Named Entity Recognition)
For the NER task, out of 17 participating teams, 11 submitted system description papers. Most teams (7) used variations of BERT-based transformers such as RoBERTa, DeBERTaV3 (He et al., 2021), LegalBERT, and ALBERT (Lan et al., 2019). Techniques based on XLNet (Yang et al., 2019b), LUKE-base, and Legal-LUKE were also implemented. BiLSTM and CRF layers were used as the final layers on top of the transformers.
ResearchTeam_HCN (Singh et al., 2023) proposed two baseline models and one main model with an MTL (Multi-Task Learning) framework. They used data augmentation (back-translation) as a way to make up for the imbalance in class data. Further, they used a general-purpose NER system to aid the Legal-NER model. For example, using rule-based techniques they assigned labels to an entity predicted by general-purpose NER.
VTCC-NER (Tran and Doan, 2023) proposed the use of ensembles to improve NER predictions via two approaches. The first approach used multiple transformer backbones (RoBERTa, BERT, Legal-BERT, InLegal-BERT, and XLM-RoBERTa) in the spaCy framework to make predictions, which were then combined using a weighted voting scheme. In the second approach, they adapted MultiCoNER (Multilingual Complex Named Entity Recognition (Malmasi et al., 2022)) with multiple pre-trained backbones to generate contextualized features, which were concatenated with dependency parsing and part-of-speech features. The concatenated features were then passed through a CRF module to generate the predictions.
Jus Mundi (Cabrera-Diego and Gheewala, 2023) proposed the method FEDA (Frustratingly Easy Domain Adaptation), and trained it over multiple legal corpora. The FEDA layer consisted of a GeLU activation layer, followed by a linear layer and layer normalization. The method extracted features from DeBERTa V3 and passed them through two splits: the general FEDA layer and the specialized FEDA layer. The specialized FEDA layer also incorporated features from the general FEDA layer, and the final output was passed through a CRF module to generate the predictions.
UOttawa.NLP23 (Almuslim et al., 2023) converted the text to B-I-O format and tokenized it using Legal-BERT (Chalkidis et al., 2020). The authors proposed a token classification approach where each token was classified into one of the entity types in the dataset.
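The B-I-O conversion used by several teams reduces to tagging the first token of each entity span with `B-` and continuation tokens with `I-`. A minimal sketch, with hypothetical token spans and labels that are not taken from the shared-task data:

```python
def to_bio(tokens, spans):
    # spans: list of (start, end_exclusive, label) over token indices.
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # continuation tokens
    return tags

tokens = ["the", "Kerala", "High", "Court", "granted", "bail"]
tags = to_bio(tokens, [(1, 4, "COURT")])
# tags == ["O", "B-COURT", "I-COURT", "I-COURT", "O", "O"]
```

Real systems then align these word-level tags to subword tokens produced by the transformer tokenizer.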
TeamUnibo (Noviello et al., 2023) presented an approach that handles raw text without any preprocessing. A sliding window approach handled long texts and avoided input truncation. The system used the B-I-O labeling format and mapped labels from words to tokens. Several baseline models, including RoBERTa, InLegalBERT, and XLNet, were tested, along with two custom models that use RoBERTa or XLNet as the transformer backbone followed by BiLSTM and CRF layers. The best-performing models were RoBERTa-BiLSTM-CRF and XLNet-BiLSTM-CRF.
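A sliding-window split of a long token sequence, of the kind TeamUnibo describe for avoiding input truncation, can be sketched as follows. The window size and stride are illustrative defaults, not the values reported by the team:

```python
def chunk_tokens(tokens, size=512, stride=256):
    """Split a long token sequence into overlapping windows so that no
    input is truncated; consecutive windows share size - stride tokens."""
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window reaches the end
            break
    return chunks

chunks = chunk_tokens(list(range(1000)))
# 3 windows: tokens 0-511, 256-767, and 512-999
```

Each window is then encoded independently, and per-token predictions from overlapping regions are typically reconciled in post-processing.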
AntContentTech (Huo et al., 2023) used SciBERT-HSLN to get word embeddings for each sentence, which were then passed through a Bi-LSTM layer and an attention-based pooling layer to obtain sentence representations. These representations were directly fed into a CRF layer to perform the span classification.
PoliToHFI (Benedetto et al., 2023) applied the entity-aware attention mechanism implemented in the LUKE model (Yamada et al., 2020).

Ginn-Khamov (Ginn and Khamov, 2023) proposed a three-step augmented approach. First, a classification model was trained in which the sentence class was added as an input token. Then, a twin-transformer model with two identical NER sub-models was used: each batch of sentences was split into preamble and judgment using the classification model, and the sentence class was used to pick one of the two sub-models. Finally, a modified token classification linear layer took the sentence class provided by the classifier predictions to learn the differences between documents.

TeamShakespeare (Jin and Wang, 2023) introduced Legal-LUKE, a legal entity recognition model based on the bidirectional transformer encoder of LUKE (Yamada et al., 2020). The model generated legal contextual embeddings for words and entities by summing token, type, and position embeddings.

UO-LouTAL (Minard et al., 2023) aimed to achieve interpretability in the L-NER task by using a standard CRF. The input to the CRF model was a structure containing token features retrieved from the text using spaCy, together with the features of the preceding and succeeding tokens. Optionally, post-processing was applied to extract specific patterns such as dates or case numbers.

Legal_try (Zhao et al., 2023) proposed the use of multiple heterogeneous models, which were first filtered using a recall score. All the selected models were then used in a voting scheme to generate the predictions. The heterogeneous models included ALBERT-BiLSTM-CRF, ALBERT-IDCNN-CRF, and RoBERTa (dynamic fusion)-BiLSTM/IDCNN-CRF.
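The (optionally weighted) voting schemes that Legal_try, VTCC-NER, and others describe can be sketched as a per-token tally over the label sequences produced by each model. The labels and weights below are purely illustrative:

```python
from collections import Counter

def vote(predictions, weights=None):
    """Combine per-token label sequences from several models by
    (optionally weighted) voting; ties go to the first-listed model."""
    weights = weights or [1.0] * len(predictions)
    merged = []
    for token_preds in zip(*predictions):
        scores = Counter()
        for label, w in zip(token_preds, weights):
            scores[label] += w
        merged.append(max(scores, key=scores.get))
    return merged

# Hypothetical per-token predictions from three NER models.
preds = [["B-COURT", "O", "O"],
         ["B-COURT", "B-JUDGE", "O"],
         ["O", "B-JUDGE", "O"]]
merged = vote(preds)  # ["B-COURT", "B-JUDGE", "O"]
```

In practice, weights are often set from each model's validation score, so stronger models dominate disagreements.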
Nonet (Nigam et al., 2023) used a combination of the results produced by a custom-trained spaCy model and a fine-tuned BERT-CRF model for the final prediction. The spaCy model was trained by replacing the RoBERTa embeddings with law2vec embeddings, and LegalBERT was used in the pipeline for fine-tuning the model. The BERT-CRF model was trained over the pre-processed text using the BIO format. If the spans of the entities predicted by the two models intersected, the entity with the larger span was kept; for the rest, the union of the results produced by both models was taken.

Task C1 and C2: Legal Judgment Prediction and Legal Judgment Prediction with Explanation
For the Legal Judgment Prediction task (C1), transformers were used as the base for 7 out of 8 participating systems; Longformer, RoBERTa, InLegalBERT, LegalBERT, and CaseLawBERT were used. One team, nclu_team, proposed the use of a CNN to solve this task. In the Explanation for Judgment Prediction task (C2), various techniques were employed to generate the explanations, including occlusion, clustering of similar sentences and picking sentences from the clusters, relevance scores based on the proportion of legal entities, attention scores, extractive summarization, span lengths, and generation using a T5-based transformer model.

Viettel-AI (Hoang et al., 2023) used Longformer (Beltagy et al., 2020) to process documents. High- and low-frequency words were filtered out, and a TF-IDF representation of the document was created. The TF-IDF vector representation was used along with Longformer to learn representations in an end-to-end fashion. The authors further extended the proposed method to explanations by taking the trained judgment prediction model and masking the sentences present in the document. With the help of this occlusion technique, the authors ranked the sentences based on the difference in the judgment prediction probability scores.
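The occlusion idea used by several Task-C2 systems can be sketched as follows. Here `predict_proba` is a stand-in for any trained judgment-prediction model; the `toy_model` scorer below exists only to make the sketch runnable:

```python
def occlusion_ranking(sentences, predict_proba):
    """Rank sentence indices by how much the predicted probability of the
    outcome drops when that sentence is removed (occlusion sensitivity)."""
    base = predict_proba(sentences)
    drops = []
    for i in range(len(sentences)):
        masked = sentences[:i] + sentences[i + 1:]
        drops.append((base - predict_proba(masked), i))
    # Largest drop first: these sentences influence the prediction the most.
    return [i for _, i in sorted(drops, reverse=True)]

# Toy stand-in scorer, for illustration only (a real system would call a
# trained judgment-prediction model here).
toy_model = lambda sents: sum("bail" in s for s in sents) / 3.0
ranking = occlusion_ranking(["bail granted", "no merit", "bail upheld"], toy_model)
```

The top-ranked sentences then serve as the extractive explanation; the cost is one forward pass per sentence, which teams typically mitigate by masking chunks rather than single sentences.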
IRIT_IRIS_(C) (Prasad et al., 2023) proposed a multi-level encoder-based classifier using a pretrained BERT-based encoder as a backbone and fine-tuned over the chunked version of the documents. Further, the extracted chunk embeddings were used to represent the entire document, followed by another transformer and recurrent layer in level-3 to learn the relation between the chunk representations. For explanations, the proposed method followed the idea of occlusion sensitivity as an extension to the judgment prediction module.
UOttawa.NLP23 (Almuslim et al., 2023) explored the performance of different transformer-based models and hierarchical BiLSTM-CRF models for legal text classification on the Indian Legal Document Corpus (ILDC). Different models, including Legal-BERT, CaseLawBERT, and InCaseLawBERT, were fine-tuned and used as classifiers. The best-performing model was CaseLawBERT + BiGRU. The stronger performance of CaseLawBERT over InCaseLawBERT can be attributed to the pretraining corpus used. However, the models struggled to generalize to longer texts, and the validation results were better than the test results due to the longer texts in the test set. The authors used a masking technique with the CaseLawBERT and BiGRU models to extract explanations for legal outcomes. They clustered similar sentences and picked one or two sentences from each cluster to form explanations, using two different approaches. The authors concluded that their low score was due to selecting only one or two sentences per chunk, which resulted in lower overlap.
PoliToHFI (Benedetto et al., 2023) used a three-staged approach: in the first stage, 4 transformer-based encoders were employed to find the section exerting the most significant influence on classification performance, and the best-performing encoder was identified. In the second stage, a document representation was obtained by average-pooling the sentence representations produced by the chosen model; this representation was passed to linear layers to compute the final prediction. Finally, an ensemble prediction for the test set was generated from a combination of hierarchical models. For explanations, they adopted post hoc input attribution methods that leverage the input document and the model. Further, a relevance score based on each sentence's proportion of legal entities was assigned to boost the explanations.
ResearchTeam_HCN (Singh et al., 2023) followed a hierarchical model approach in which a Legal pre-trained RoBERTa model was fine-tuned using the last 512 tokens from each document. Later, the features were extracted using the fine-tuned Legal-RoBERTa model and were passed through the BiGRU model and a classification layer. For explanation, they generated relevant sentences using the T5-based transformer model and considered the task as a SEQ2SEQ approach.
AntContentTech (Huo et al., 2023) used a self-attention layer to generate contextualized sentence representations. A global token pre-pended to the input sequence was utilized to generate the final prediction of the judgment. The attention scores were used to extract the explanations. Based on these attention scores provided by the first token, the top 30% of sentences were selected.
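Selecting the top 30% of sentences by attention score, as AntContentTech describe, reduces to a simple ranking step. The scores below are made up for illustration:

```python
def top_k_percent(sentences, attn_scores, pct=0.30):
    """Keep the top pct fraction of sentences by attention score,
    preserving the original document order."""
    k = max(1, round(len(sentences) * pct))
    top = sorted(range(len(sentences)),
                 key=lambda i: attn_scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

sents = [f"sent{i}" for i in range(10)]
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.1, 0.15, 0.05]
selected = top_k_percent(sents, scores)  # ["sent1", "sent3", "sent6"]
```

Re-sorting the selected indices keeps the explanation readable as a contiguous narrative rather than a score-ordered list.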
nclu_team (Rusnachenko et al., 2023) used a CNN-based model to generate the explanations. Their use of a sliding window over the most attentive part of the document is unique in exploiting the ranking order of the obtained attention scores, and it provided results comparable to, and in some cases better than, other explanation approaches.
Nonet (Nigam et al., 2023) used transformers such as XLNet, RoBERTa, LegalBERT, InLawBERT, InCaseLaw, and BERT-Large, and also tried a hierarchical transformer architecture with a moving-window approach to reduce the document into chunks. The CLS representations of these chunks were then used as input to sequential models (Bi-GRU + attention). For the explanation task, various keyword-based matching techniques, along with Rhetorical Roles (ratio), were used to find the court judgment decision, and different span lengths from the ending portions of the legal judgments were considered for the explanation.
Results and Discussion

Baselines
We used transformer-based models as baselines for each of the tasks. The baseline model for the RR task was SciBERT-HSLN (Brack et al., 2021) (model provided at: https://github.com/Legal-NLP-EkStep/rhetorical-role-baseline).
We used RoBERTa base with a transition-based parser as the baseline for the NER task (model provided at https://github.com/Legal-NLP-EkStep/legal_NER). Similarly, the baseline model for CJPE was a hierarchical model with XLNet at the first level and Bi-GRU at the next level (model available at: https://github.com/Exploration-Lab/CJPE).

Task A

Results for Task A are shown in Table 2. Overall, the use of BiLSTM layers with a CRF module was popular and showed promising results. In general, augmenting the dataset proved a good choice: it yielded promising improvements and more generalized predictions, and worked well on the unseen test set. One of the teams (VTCC-NER) also used GNNs to capture the relationships between entities to classify sentences into RR categories. The best result on the test set was around 86% (team AntContentTech), followed by scores ranging from 84% to 80%, where the majority of the methods used pre-trained transformer-based architectures trained on legal text (e.g., LegalBERT, InLegalBERT) as the primary backbone for legal text representations. The baseline model has a performance of 79% F1 score. The proposed methods show an improvement over the baseline; however, there is further scope for improvement.

Task B
Results for Task B are shown in Table 3. Most of the teams used transformer-based backbones (e.g., RoBERTa, LegalBERT). Surprisingly, Legal-BERT-based models perform similarly to standard pre-trained RoBERTa, highlighting the generic nature of the named entity recognition task, which may not require specialized legal knowledge. Augmentation techniques like back-translation seem to improve the performance of NER systems on legal text. The best-performing system makes use of ensembles to further improve its results. In comparison to the baseline (F1 score of 91.1%), the best-performing system did not improve much. This highlights the need for better techniques to further improve Legal-NER systems.
Task C

Table 4 and Table 5 show the results for the tasks of Legal Judgment Prediction (LJP) and Legal Judgment Prediction with Explanation (LJPE), respectively. For the LJP task, all the systems used transformer-based models, with the best-performing model using Longformer, which was able to capture long-range dependencies in legal documents. For the LJPE task, systems used occlusion-based and attention-based methods for the explanation. The best-performing system used Rhetorical Roles to extract the ratio of the case, which was subsequently used to provide the explanation; this is similar to the approach proposed by Malik et al. (2022a). The Viettel-AI (Hoang et al., 2023) group achieved the highest F1 score of 74.85% in task C1, and Nonet (Nigam et al., 2023) achieved the highest score of 54.18% in task C2.

Conclusion
In this paper, we presented the findings of 'SemEval 2023 Task 6: LegalEval - Understanding Legal Texts', consisting of three subtasks: i) Subtask A, Rhetorical Role identification, aimed at segmenting a legal document by assigning a rhetorical role to each sentence present in the document; ii) Subtask B, Named Entity Recognition, for identifying NER tags present in legal text; and iii) Subtask C, Legal Judgment Prediction with Explanation, which aims to predict the final judgment of a legal case and, in addition, provide explanations for the prediction. In total, 40 teams participated, and 26 teams (approximately 100 participants) submitted system papers presenting various models for each task. We provided an overview of the methods proposed by the teams. Analysis of the systems and results points to considerable scope for improvement. We hope these findings will provide valuable insights for improving approaches to each of the sub-tasks.