LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

Laws and their interpretations, legal arguments and agreements are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.


Introduction
Law is a field of human endeavor dominated by the use of language. As part of their professional training, law students consume large bodies of text as they seek to tune their understanding of the law and its application to help manage human behavior. Virtually every modern legal system produces massive volumes of textual data (Katz et al., 2020). Lawyers, judges, and regulators continuously author legal documents such as briefs, memos, statutes, regulations, contracts, patents and judicial decisions (Coupette et al., 2021). Beyond the consumption and production of language, law and the art of lawyering is also an exercise centered around the analysis and interpretation of text.
Pre-trained Transformers, including BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), BART (Lewis et al., 2020), DeBERTa (He et al., 2021), and numerous variants, are currently the state of the art in most natural language processing (NLP) tasks. Rapid performance improvements have been witnessed, to the extent that ambitious multi-task benchmarks (Wang et al., 2018, 2019b) are considered almost 'solved' a few years after their release and need to be made more challenging (Wang et al., 2019a). Recently, Bommasani et al. (2021) named these pre-trained models (e.g., BERT, DALL-E, GPT-3) foundation models. The term may be controversial, but it emphasizes the paradigm shift these models have caused and their interdisciplinary potential. Studying the latter includes the question of how to adapt these models to legal text (Bommarito et al., 2021). As discussed by Zhong et al. (2020b) and Chalkidis et al. (2020b), legal text has distinct characteristics, such as terms that are uncommon in generic corpora (e.g., 'restrictive covenant', 'promissory estoppel', 'tort', 'novation'), terms that have different meanings than in everyday language (e.g., an 'executed' contract is signed and effective, a 'party' is a legal entity), older expressions (e.g., pronominal adverbs like 'herein', 'hereto', 'wherefore'), uncommon expressions from other languages (e.g., 'laches', 'voir dire', 'certiorari', 'sub judice'), and long sentences with unusual word order (e.g., "the provisions for termination hereinafter appearing or will at the cost of the borrower forthwith comply with the same"), to the extent that legal language is often classified as a 'sublanguage' (Tiersma, 1999; Williams, 2007; Haigh, 2018). Furthermore, legal documents are often much longer than the maximum length state-of-the-art deep learning models can handle, including those designed to handle long text (Beltagy et al., 2020; Zaheer et al., 2020; Yang et al., 2020).
Inspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset (Wang et al., 2018, 2019b), the subsequent more difficult SuperGLUE (Wang et al., 2019a), other previous multi-task NLP benchmarks (Conneau and Kiela, 2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce LexGLUE, a benchmark dataset to evaluate the performance of NLP methods in legal tasks. LexGLUE is based on seven existing English legal NLP datasets, selected using criteria largely from SuperGLUE (discussed in Section 3.1).
We anticipate that more datasets, tasks, and languages will be added in later versions of LexGLUE. 1 As more legal NLP datasets become available, we also plan to favor datasets checked thoroughly for validity (scores reflecting real-life performance), annotation quality, statistical power, and social bias (Bowman and Dahl, 2021).
As in GLUE and SuperGLUE (Wang et al., 2019b,a), one of our goals is to push towards generic (or 'foundation') models that can cope with multiple NLP tasks, in our case legal NLP tasks, possibly with limited task-specific fine-tuning. Another goal is to provide a convenient and informative entry point for NLP researchers and practitioners wishing to explore or develop methods for legal NLP. Having these goals in mind, the datasets we include in LexGLUE and the tasks they address have been simplified in several ways, discussed below, to make it easier for newcomers and generic models to address all tasks. We provide Python APIs integrated with Hugging Face (Wolf et al., 2020; Lhoest et al., 2021) to easily import all the datasets we experiment with and evaluate the performance of different models (Section 4.4). By unifying and facilitating the access to a set of law-related datasets and tasks, we hope to attract not only more NLP experts, but also more interdisciplinary researchers (e.g., law doctoral students willing to take NLP courses). More broadly, we hope LexGLUE will speed up the adoption and transparent evaluation of new legal NLP methods and approaches in the commercial sector, too. Indeed, there have been many commercial press releases in the legal tech industry on high-performing systems, but almost no independent evaluation of the performance of machine learning and NLP-based tools. A standard publicly available benchmark would also allay concerns of undue influence in predictive models, including the use of metadata which the relevant law expressly disregards.
A core task in this area has been legal judgment prediction (forecasting), where the goal is to predict the outcome (verdict) of a court case. In this direction, there have been at least three lines of work. The first one (Aletras et al., 2016; Chalkidis et al., 2019a; Medvedeva et al., 2020, 2021) predicts violations of human rights in cases of the European Court of Human Rights (ECtHR). The second line of work (Luo et al., 2017; Zhong et al., 2018; Yang et al., 2019) considers Chinese criminal cases where the goal is to predict relevant law articles, criminal charges, and the term of the penalty. The third line of work (Ruger et al., 2004; Katz et al., 2017; Kaufman et al., 2019) includes methods for predicting the outcomes of cases of the Supreme Court of the United States (SCOTUS).
Another popular task is legal topic classification. Nallapati and Manning (2008) highlighted the challenges of legal document classification compared to more generic text classification by using a dataset including docket entries of US court cases. Chalkidis et al. (2020a) classify EU laws into EuroVoc concepts, a task earlier introduced by Mencia and Fürnkranz (2007), with a special interest in few- and zero-shot learning. Luz de Araujo et al. (2020) also studied topic classification using a dataset of Brazilian Supreme Court cases. There are similar interesting applications in contract law (Lippi et al., 2019; Tuggener et al., 2020).
Several studies (Chalkidis et al., 2018, 2019c; Hendrycks et al., 2021) explored information extraction from contracts, to extract important information such as the contracting parties, agreed payment amount, start and end dates, applicable law, etc. Other studies focus on extracting information from legislation (Cardellino et al., 2017; Angelidis et al., 2018) or court cases (Leitner et al., 2019).
Legal Question Answering (QA) is another task of interest in legal NLP, where the goal is to train models for answering legal questions (Kim et al., 2015; Ravichander et al., 2019; Kien et al., 2020; Zhong et al., 2020a,c; Louis and Spanakis, 2022). Not only is this task interesting for researchers, but it could also support efforts to help laypeople better understand their legal rights. In the general task setting, this requires identifying relevant legislation, case law, or other legal documents, and extracting elements of those documents that answer a particular question. A notable venue for legal QA has been the Competition on Legal Information Extraction and Entailment (COLIEE) (Kim et al., 2016; Kano et al., 2017, 2018). More recently, there have also been efforts to pre-train Transformer-based language models on legal corpora (Chalkidis et al., 2020b; Zheng et al., 2021; Xiao et al., 2021), leading to state-of-the-art results in several legal NLP tasks, compared to models pre-trained on generic corpora.
Overall, the legal NLP literature is overwhelming, and the resources are scattered. Documentation is often not available, and evaluation measures vary across articles studying the same task. Our goal is to create the first unified benchmark to assess the performance of NLP models on legal NLU. As a first step, we selected a representative group of tasks, using datasets in English that are also publicly available, adequately documented, and of an appropriate size for developing modern NLP methods. We also introduce several simplifications to make the new benchmark more standardized and easily accessible, as already noted.

LexGLUE Tasks and Datasets
We present the Legal General Language Understanding 2 Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks.

Dataset Desiderata
The datasets of LexGLUE were selected to satisfy the following desiderata:

• Language: In this first version of LexGLUE, we only consider English datasets, which also makes experimentation easier for researchers across the globe. We hope to include other languages in future versions of LexGLUE.
• Substance: 3 The datasets should check the ability of systems to understand and reason about legal text to a certain extent, in order to perform tasks that are meaningful for legal practitioners.

• Difficulty: The tasks should be challenging enough to leave substantial headroom for improvement (unlike GLUE, where top-ranked models now achieve average scores higher than 90%). Unlike SuperGLUE (Wang et al., 2019a), we did not rule out, but rather favored, datasets requiring domain (in our case legal) expertise.
• Availability & Size: We consider only publicly available datasets, documented by published articles, avoiding proprietary, untested, poorly documented datasets. We also excluded very small datasets, e.g., with fewer than 5K documents. Although large pre-trained models often perform well with relatively few task-specific training instances, newcomers may wish to experiment with simpler models that may perform disappointingly with small training sets. Small test sets may also lead to unstable and unreliable results.

Tasks and Datasets
LexGLUE comprises seven datasets. Table 1 shows core information for each of the LexGLUE datasets and tasks, described in detail below. 4

ECtHR Tasks

For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to the articles of the ECHR that were violated (if any). In Task A, the input to a model is the list of facts of a case, and the output is the set of violated articles. In the most recent version of the dataset (Chalkidis et al., 2021c), each case is also mapped to the articles of the ECHR that were allegedly violated (considered by the court). In Task B, the input is again the list of facts of a case, but the output is the set of allegedly violated articles.
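For illustration, a single training instance for the two ECtHR tasks can be pictured as follows. The field names are hypothetical, chosen for readability; they do not necessarily match the released dataset schema:

```python
# Hypothetical layout of one ECtHR instance (names illustrative, not the
# actual dataset schema): the input is a list of factual paragraphs, the
# outputs are sets of ECHR articles.
instance = {
    "facts": [
        "4. The applicant was arrested during a demonstration ...",
        "5. On 3 May 2011 the regional court ordered his detention ...",
    ],
    "task_a_labels": ["Article 5", "Article 11"],               # found violated
    "task_b_labels": ["Article 5", "Article 6", "Article 11"],  # allegedly violated
}
```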
The total number of ECHR articles is currently 66. Several articles, however, cannot be violated, are rarely (or never) discussed in practice, or concern procedural technicalities that do not depend on the facts of a case. Thus, we use a simplified version of the label set (ECHR articles) in both Tasks A and B, including only 10 ECHR articles that can be violated and depend on the case's facts.

SCOTUS
The US Supreme Court (SCOTUS) 5 is the highest federal court in the United States of America and generally hears only the most controversial or otherwise complex cases that have not been sufficiently well resolved by lower courts. We release a new dataset combining information from SCOTUS opinions 6 with the Supreme Court Database (SCDB) 7 (Spaeth et al., 2020). SCDB provides metadata (e.g., decisions, issues, decision directions) for all cases from 1946 up to 2020. We opted to use SCDB to classify the court opinions into the available 14 issue areas (e.g., Criminal Procedure, Civil Rights, Economic Activity, etc.). This is a single-label multi-class classification task (Table 1). The 14 issue areas cluster 278 issues whose focus is on the subject matter of the controversy (dispute). The SCOTUS cases are chronologically split into training (5k, 1946-1982), development (1.4k, 1982-1991), and test (1.4k, 1991-2016) sets.

Table 2: Key specifications of the examined models. We report the number of parameters, the size of the vocabulary, the maximum sequence length, the core pre-training specifications (training steps and batch size), and the training corpora (OWT = OpenWebText, BC = BookCorpus). Starred models have been warm-started from RoBERTa.

EUR-LEX

The dataset (Chalkidis et al., 2021a) contains 65k EU laws from the EUR-Lex portal, annotated with concepts from EuroVoc, the multilingual thesaurus maintained by the EU's Publications Office. Given a document, the task is to predict its EuroVoc labels (concepts). The dataset is chronologically split into training (55k, 1958-2010), development (5k, 2010-2012), and test (5k, 2012-2016) sets.

CaseHOLD

The CaseHOLD (Case Holdings on Legal Decisions) dataset (Zheng et al., 2021) contains approx. 53k multiple-choice questions about holdings of US court cases from the Harvard Law Library case law corpus. Holdings are short summaries of legal rulings that accompany referenced decisions relevant to the present case, e.g.:

". . . to act pursuant to City policy, re d 503, 506-07 (3d Cir. 1985) (holding that for purposes of a class certification motion the court must accept as true all factual allegations in the complaint and may draw reasonable inferences therefrom)."

The input consists of an excerpt (or prompt) from a court decision, containing a reference to a particular case, where the holding statement (in boldface) is masked out. The model must identify the correct (masked) holding statement from a selection of five choices. We split the dataset into training (45k), development (3.9k), and test (3.9k) sets, excluding samples that are shorter than 256 tokens. Chronological information is missing from CaseHOLD, thus we cannot perform a chronological re-split.

Pre-trained Transformer Models
We experiment with Transformer-based (Vaswani et al., 2017) pre-trained language models, which achieve state-of-the-art performance in most NLP tasks (Bommasani et al., 2021) and NLU benchmarks (Wang et al., 2019a). These models are pre-trained on very large unlabeled corpora to predict masked tokens (masked language modeling) and typically also to perform other pre-training tasks that do not require any manual annotation (e.g., predicting whether two sentences were adjacent in the corpus or not, dubbed next sentence prediction). The pre-trained models are then fine-tuned (further trained) on task-specific (typically much smaller) annotated datasets, after adding task-specific layers. We fine-tune and evaluate the performance of the following publicly available models (Table 2).
BERT (Devlin et al., 2019) is the best-known pretrained Transformer-based language model. It is pre-trained to perform masked language modeling and next sentence prediction.

RoBERTa (Liu et al., 2019) is also a pre-trained Transformer-based language model. Unlike BERT, RoBERTa uses dynamic masking, eliminates the next sentence prediction pre-training task, uses a larger vocabulary, and has been pre-trained on much larger corpora. Liu et al. (2019) reported improved results on NLU benchmarks using RoBERTa, compared to BERT.
DeBERTa (He et al., 2021) is another improved BERT model that uses disentangled attention, i.e., four separate attention mechanisms considering the content and the relative position of each token, and an enhanced mask decoder, which explicitly considers the absolute position of the tokens. DeBERTa has been reported to outperform BERT and RoBERTa in several NLP tasks (He et al., 2021).
Longformer (Beltagy et al., 2020) extends Transformer-based models to support longer sequences, using sparse attention. The latter is a combination of local (window-based) attention and global attention that reduces the computational complexity of the model and thus can be applied to longer documents (up to 4096 tokens). Longformer outperforms RoBERTa on long-document tasks and QA benchmarks.
BigBird (Zaheer et al., 2020) is another sparse-attention based Transformer that uses a combination of local (window-based), global, and random attention, i.e., all tokens also attend to a number of random tokens on top of those in the same neighborhood (window) and the global ones. BigBird has been reported to outperform Longformer on QA and summarization tasks.
Legal-BERT (Chalkidis et al., 2020b) is a BERT model pre-trained on English legal corpora, consisting of legislation, contracts, and court cases. It uses the original pre-training BERT configuration. The sub-word vocabulary of Legal-BERT is built from scratch, to better support legal terminology.
CaseLaw-BERT (Zheng et al., 2021) is another law-specific BERT model. It also uses the original pre-training BERT configuration and has been pre-trained from scratch on the Harvard Law case corpus, 12 which comprises 3.4M legal decisions from US federal and state courts. This model is called Custom Legal-BERT by Zheng et al. (2021). We call it CaseLaw-BERT to distinguish it from the previously published Legal-BERT of Chalkidis et al. (2020b) and to better signal that it is trained exclusively on case law (court opinions).
Hierarchical Variants Legal documents are usually much longer (i.e., consisting of thousands of words) than other text types (e.g., tweets, customer reviews, news articles) often considered in various NLP tasks. Thus, standard Transformer-based models that can typically process up to 512 subword units cannot be directly applied across all LexGLUE datasets, unless documents are severely truncated to the model's limit. Figure 2 shows the distribution of text input length across all LexGLUE datasets. Even for Transformer-based models specifically designed to handle long text (e.g., Longformer, BigBird), handling longer legal documents remains a challenge.
Given the length of the text input in three of the seven LexGLUE tasks, i.e., ECtHR (A and B) and SCOTUS, we employ a hierarchical variant of each pre-trained Transformer-based model that has not been designed for longer text (BERT, RoBERTa, DeBERTa, Legal-BERT, CaseLaw-BERT) during fine-tuning and inference. The hierarchical variants are similar to those of Chalkidis et al. (2021c). They use the corresponding pre-trained Transformer-based model to encode each paragraph of the input text independently and obtain the top-level representation h [cls] of each paragraph. A second-level shallow (2-layered) Transformer encoder, with the same specifications (e.g., hidden units, number of attention heads) across all models (BERT, RoBERTa, DeBERTa, etc.), is fed with the paragraph representations to make them context-aware (aware of the surrounding paragraphs). We then max-pool over the context-aware paragraph representations to obtain a document representation, which is fed to a classification layer. 13
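As an illustration, here is a minimal PyTorch sketch of such a hierarchical variant. It is a simplification (e.g., paragraph position embeddings and padding masks for the second-level encoder are omitted), not the exact released implementation:

```python
import torch.nn as nn
from transformers import AutoModel

class HierarchicalClassifier(nn.Module):
    """Sketch of the hierarchical variant: a pre-trained encoder reads each
    paragraph independently; a shallow 2-layer Transformer contextualizes the
    paragraph [cls] embeddings; max-pooling yields the document embedding."""

    def __init__(self, model_name="bert-base-uncased", num_labels=10):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=12, dim_feedforward=4 * hidden, batch_first=True
        )
        self.paragraph_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids: (batch, n_paragraphs, par_len), e.g., up to 64 x 128 tokens
        b, p, t = input_ids.size()
        out = self.encoder(
            input_ids=input_ids.view(b * p, t),
            attention_mask=attention_mask.view(b * p, t),
        )
        # [cls] embedding of each paragraph: (batch, n_paragraphs, hidden)
        par_embs = out.last_hidden_state[:, 0, :].view(b, p, -1)
        par_embs = self.paragraph_encoder(par_embs)  # context-aware paragraphs
        doc_emb = par_embs.max(dim=1).values         # max-pool over paragraphs
        return self.classifier(doc_emb)              # logits, one per label
```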

Task-Specific Fine-Tuning
Text Classification Tasks For the EUR-LEX, LEDGAR and UNFAIR-ToS tasks, we feed each document to the pre-trained model (e.g., BERT) and obtain the top-level representation h [cls] of the special [cls] token as the document representation, following Devlin et al. (2019). The latter goes through a dense layer of L output units, one per label, followed by a sigmoid (EUR-LEX, UNFAIR-ToS) or softmax (LEDGAR) activation. For the two ECtHR tasks (A and B) and SCOTUS, where the hierarchical variants are employed, we feed the max-pooled (over paragraphs) document representation to a linear classification layer, again followed by a sigmoid (ECtHR) or softmax (SCOTUS) activation.
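In PyTorch terms, the two task types differ only in the output activation and the loss; a sketch:

```python
import torch.nn as nn

# Multi-label tasks (ECtHR A/B, EUR-LEX, UNFAIR-ToS): one sigmoid per label.
# BCEWithLogitsLoss applies the sigmoid internally, so the classification
# layer outputs raw logits, one per label.
multi_label_loss = nn.BCEWithLogitsLoss()

# Single-label multi-class tasks (SCOTUS, LEDGAR): a softmax over all classes.
# CrossEntropyLoss applies the (log-)softmax internally.
multi_class_loss = nn.CrossEntropyLoss()
```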
Multiple-Choice QA Task For CaseHOLD, we convert each training (or test) instance (the prompt and the five candidate answers) into five input pairs following Zheng et al. (2021). Each pair consists of the prompt and one of the five candidate answers, separated by the special delimiter token [sep]. The top-level representation h [cls] of each pair is fed to a linear layer to obtain a logit, and the five logits are then passed through a softmax yielding a probability distribution over the five candidate answers.
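A minimal sketch of this conversion and scoring follows; the `scorer` layer is our illustrative assumption (trained jointly during fine-tuning), and the released code is authoritative:

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
scorer = nn.Linear(encoder.config.hidden_size, 1)  # one logit per (prompt, choice) pair

def score_choices(prompt, choices):
    """Pair the prompt with each of the five candidate holdings ([sep]-separated)
    and softmax the five logits into a distribution over the choices."""
    batch = tokenizer([prompt] * len(choices), choices,
                      padding=True, truncation=True, return_tensors="pt")
    h_cls = encoder(**batch).last_hidden_state[:, 0, :]  # (5, hidden): one [cls] per pair
    logits = scorer(h_cls).squeeze(-1)                   # (5,)
    return logits.softmax(dim=-1)                        # probabilities over choices
```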

Data Repository and Code
For reproducibility purposes and to facilitate future experimentation with other models, we pre-process and release all datasets on Hugging Face Datasets (Lhoest et al., 2021). 14 We also release the code 15 of our experiments, which relies on the Hugging Face Transformers (Wolf et al., 2020) library. 16 Appendix A explains how to load the datasets and run experiments with our code.

13 In Appendix D, we present results from preliminary experiments using the standard version of BERT for ECtHR Task A (-12.2%), Task B (-10.6%), and SCOTUS (-3.5%).
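For example, assuming the lex_glue dataset identifier on the Hugging Face Hub (task configuration names follow Table 1):

```python
from datasets import load_dataset

# Each LexGLUE task is a configuration of the same Hub dataset.
dataset = load_dataset("lex_glue", "ecthr_a")  # or: "ecthr_b", "scotus", "eurlex",
                                               #     "ledgar", "unfair_tos", "case_hold"
print(dataset)  # DatasetDict with train/validation/test splits
```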

Experimental Set Up
For TF-IDF-based linear SVM models, we use the implementation of scikit-learn (Pedregosa et al., 2011) and grid-search for hyperparameters (number of features, C, and loss function). For all the pre-trained models, we use publicly available Hugging Face checkpoints. 17 We use the *-base configuration of each pre-trained model, i.e., 12 Transformer blocks, 768 hidden units, and 12 attention heads. We train models with the Adam optimizer (Kingma and Ba, 2015) and an initial learning rate of 3e-5 for up to 20 epochs, using early stopping on development data. We use mixed precision (fp16) to decrease the memory footprint in training, and gradient accumulation for all hierarchical models. The hierarchical models can read up to 64 paragraphs of 128 tokens each. We use Longformer and BigBird in default settings, i.e., Longformer uses windows of 512 tokens and a single global token ([cls]), while BigBird uses blocks of 64 tokens (windows: 3× block, random: 3× block, global: 2× initial block; each token attends to 512 tokens in total). The batch size is 8 in all experiments. We run five repetitions with different random seeds and report the test scores based on the seed with the best scores on development data. We evaluate performance using micro-F1 (µ-F1) and macro-F1 (m-F1) across all datasets to take class imbalance into account. For completeness, we also report the arithmetic, harmonic, and geometric mean across tasks, following Shavrina and Malykh (2021). 18
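For instance, the three aggregates can be computed with the Python standard library; the scores below are placeholders, not our reported results:

```python
from statistics import mean, harmonic_mean, geometric_mean  # Python 3.8+

# Placeholder per-task m-F1 scores, one per LexGLUE task.
task_scores = [70.0, 79.7, 76.6, 71.6, 87.9, 81.3, 70.8]

print(f"arithmetic mean: {mean(task_scores):.1f}")           # treats all tasks equally
print(f"harmonic mean:   {harmonic_mean(task_scores):.1f}")  # penalizes low outliers most
print(f"geometric mean:  {geometric_mean(task_scores):.1f}") # between the other two
```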

Main Results

Table 3 presents the test results for all models across all LexGLUE tasks, while Table 4 presents the aggregated (averaged) results. We observe that the two legal-oriented pre-trained models (Legal-BERT, CaseLaw-BERT) perform better overall, especially considering m-F1, which accounts for class imbalance (considers all classes equally important). Their in-domain (legal) knowledge seems to be more critical in the two datasets relying on US case law data (SCOTUS, CaseHOLD), with an improvement of approx. +2-4 p.p. (m-F1) over equally sized Transformer-based models pre-trained on generic corpora. These results are explained by the fact that these tasks are more domain-specific in terms of language, compared to the rest. No single model performs best in all tasks, and the results of Table 3 show that there is still large scope for improvement (Section 6).
An exception to the dominance of the pre-trained Transformer models is the SCOTUS dataset, where the TF-IDF-based linear SVM performs best. We suspect that the large size of the SCOTUS opinions (Figure 2) is the main reason, i.e., in many cases full paragraphs or parts of them are not considered by the hierarchical models (limited to 64 paragraphs of 128 tokens each).
Legal-oriented Models Interestingly, the performance of Legal-BERT and CaseLaw-BERT, the two legal-oriented pre-trained models, is almost identical on CaseHOLD, despite the fact that CaseLaw-BERT is solely trained on US case law, whereas Legal-BERT has been exposed to a wider variety of legal corpora, including EU and UK legislation, ECtHR, ECJ and US court cases, and US contracts. Legal-BERT performs as well as or better than CaseLaw-BERT on all datasets. These results suggest that domain-specific pre-training (and learning a domain-specific sub-word vocabulary) is beneficial, but over-fitting to a specific (niche) sub-domain (e.g., US case law), in line with the findings of Zheng et al. (2021), has no clear benefits.

Vision - Future Considerations
Beyond the scope of this work and the examined baseline models, we identify four major factors that could potentially advance the state of the art in LexGLUE and legal NLP more generally:

Long Documents: Several Transformer-based models (Beltagy et al., 2020; Zaheer et al., 2020; Liu et al., 2022) have been proposed to handle long documents by exploring sparse attention mechanisms. These models can handle sequences of up to 4096 sub-words, a limit largely exceeded in three out of seven LexGLUE tasks (Figure 2). In contrast, the hierarchical model of Section 4.2 can handle sequences of up to 8192 sub-words in our experiments, but a part of the model (the additional Transformer blocks that make the paragraph embeddings aware of the other paragraphs) is not pre-trained, which possibly harms performance.

Structured Text: Current models for long documents, like Longformer and BigBird, do not consider the document structure (e.g., sentences, paragraphs, sections). For example, window-based attention may consider a sequence of sentences across paragraph boundaries or even consider truncated sentences. To exploit the document structure, Yang et al. (2020) proposed SMITH, a hierarchical Transformer model that encodes increasingly larger blocks (e.g., words, sentences, documents). SMITH is very similar to the hierarchical model of Section 4.2, but it is pre-trained end-to-end with two objectives: token-level masked language modeling and sentence block language modeling.
Large-scale Legal Pre-training: Recent studies (Chalkidis et al., 2020b;Zheng et al., 2021;Bambroo and Awasthi, 2021;Xiao et al., 2021) introduced language models pre-trained on legal corpora, but of relatively small sizes, i.e., 12-36 GB. In the work of Zheng et al. (2021), the pre-training corpus covered only a narrowly defined area of legal documents, US court opinions. The same applies to Lawformer (Xiao et al., 2021), which was pre-trained on Chinese court opinions. Future work could curate and release a legal version of the C4 corpus (Raffel et al., 2020), containing multijurisdictional legislation, court decisions, contracts and legal literature at a size of hundreds of GBs. Given such a corpus, a large language model capable of processing long structured text could be pre-trained and it might excel in LexGLUE.
Even Larger Language Models: Scaling up the capacity of pre-trained models has led to increasingly better results in general NLU benchmarks (Kaplan et al., 2020), and models have been scaled up to billions of parameters (Brown et al., 2020; Raffel et al., 2020; He et al., 2021). In Appendix E, we observe that using the large version of RoBERTa leads to substantial performance improvements compared to the base version. The results are comparable to, or in some cases better than, those of the legal-oriented language models (Legal-BERT, CaseLaw-BERT). Considering that the two legal-oriented models are much smaller and have been pre-trained with 5-10× less data (Section 2), we have a strong indication that pre-training larger legal-oriented models on larger legal corpora would yield performance gains.

Limitations and Future Work
Although our benchmark inevitably cannot cover "everything in the whole wide (legal) world" (Raji et al., 2021), we include a representative collection of English datasets that are also grounded, to a certain degree, in practically interesting applications.
In its current version, LexGLUE can only be used to evaluate English models. As legal documents are typically written in the official language of the particular country of origin, there is an increasing need for developing models for other languages. The current scarcity of datasets in other languages (with the exception of Chinese) makes a multilingual extension of LexGLUE challenging, but an interesting avenue for future research.
Beyond language barriers, legal restrictions currently inhibit the creation of more datasets. Important document types, such as contracts and scholarly publications, are protected by copyright or considered trade secrets. As a result, their owners are concerned about data leakage when these documents are used for model training and evaluation. Providing both legal and technical solutions, e.g., using privacy-aware infrastructure and models (Downie, 2004; Feyisetan et al., 2020), is a challenge to be addressed.
Access to court decisions can also be hindered by bureaucratic inertia, outdated technology, and data protection concerns, which collectively result in these otherwise public decisions not being publicly available (Pah et al., 2020). While the anonymization of personal data provides a solution to this problem, it is itself an open challenge for legal NLP (Jana and Biemann, 2021). In the absence of suitable datasets and benchmarks, we have refrained from including anonymization in this version of LexGLUE, but plan to do so at a later stage.
Another limitation of the current version of LexGLUE is that human evaluation is missing. All datasets rely on ground-truth labels automatically extracted from data (e.g., court decisions) produced as part of official judicial or archival procedures. These resources should be highly reliable (valid), but we cannot statistically assess their quality. In the future, re-annotating part of the datasets with multiple legal experts would provide an estimation of human-level performance and inter-annotator agreement, though the cost would be high because of the required legal expertise.
While LexGLUE offers a much needed unified testbed for legal NLU, there are several other critical aspects that need to be studied carefully. These include multi-disciplinary research to better understand the limitations and challenges of applying NLP to law (Binns, 2020), while also considering fairness and robustness (Angwin et al., 2016; Dressel and Farid, 2018; Baker Gillis, 2021; Wang et al., 2021; Chalkidis et al., 2022), and broader legal considerations of AI technologies in general (Schwemer et al., 2021; Tsarapatsanis and Aletras, 2021; Delacroix, 2022).

Acknowledgments

This work was partly supported by the German Federal Ministry of Education and Research (BMBF) kmu-innovativ program under funding code 01IS18085. We would like to thank Desmond Elliott for providing valuable feedback (baselines for truncated documents presented in Appendix D), and Xiang Dai and Joel Niklaus for reviewing and pointing out issues in the new resources (code, datasets).

Ethics Statement

Original Work Attribution
All datasets included in LexGLUE, except SCOTUS, are publicly available and have been previously published. If datasets or the papers that introduced them were not compiled or written by ourselves, we referenced the original work and encourage LexGLUE users to do so as well. In fact, we believe this work should only be referenced, in addition to citing the original work, when experimenting with multiple LexGLUE datasets and using the LexGLUE evaluation infrastructure. Otherwise only the original work should be cited.

Social Impact
We believe that this work does not contain any grounds for ethical concerns. A transparent and rigorous benchmark for NLP in the legal domain might serve as an orientation for scholars and industry researchers. As a result, the capabilities of tools that are trained using natural language data from the legal domain will become clearer, thereby helping their users to better understand them. This increased certainty would also raise the awareness within research and industry communities to potential risks associated with the use of these tools. We regard this contribution to a more realistic, more informed discussion as an important use case of the work presented. Ideally, it could help both beginners and seasoned professionals to understand the limitations of using NLP tools in the legal domain and thereby prevent exaggerated expectations and potential applications that might risk endangering fundamental rights or the rule of law. We currently cannot imagine use cases of this particular work that would lead to ethical concerns or potential harm (Tsarapatsanis and Aletras, 2021).

Licensing & Personal Information
LexGLUE comprises seven datasets: ECtHR Tasks A and B, SCOTUS, EUR-LEX, LEDGAR, UNFAIR-ToS, and CaseHOLD, all available for re-use and re-sharing with appropriate attribution. The data is in general partially anonymized in accordance with the applicable national law. The data is considered to be in the public sphere from a privacy perspective. This is a very sensitive matter, as the courts try to keep a balance between transparency (the public's right to know) and privacy (respect for private and family life).
ECtHR contains personal data of the parties and other people involved in the legal proceedings. Its data is processed and made public in accordance with the European data protection laws. This includes either implied consent or legitimate interest to process the data for research purposes. As a result, their processing by us or other future users of the benchmark is not likely to raise ethical concerns.
SCOTUS contains personal data of a similar nature. Again, the data is processed and made available by the US Supreme Court, whose proceedings are public. While this ensures compliance with US law, it is very likely that, similarly to the ECtHR data, any processing could be justified by either implied consent or legitimate interest under European law.
EUR-LEX by contrast is merely a collection of legislation material and therefore not likely to contain personal data, except signatory information (e.g., president of EC). It is openly published by the European Union and processed by the EU's Publication Office. In addition, since our work qualifies as research, it is privileged pursuant to Art. 6 (1) (f) GDPR.
LEDGAR contains publicly available contract provisions published in the EDGAR database of the US Securities and Exchange Commission (SEC). As far as personal information might be contained, it should equally fall into the public sphere and be covered by research privilege. Our processing does not focus on personal information at all, rather attributing content labels to provisions.
UNFAIR-ToS contains Terms of Service from business entities such as YouTube, eBay, Facebook, etc., which makes it unlikely for the data to include personal information. These companies keep user data separate from contractual provisions, so such data is, to the best of our knowledge, not contained in this dataset.
CaseHOLD contains parts of legal decisions from US courts, obtained from the Harvard Law Library case law corpus. All of the decisions were previously published in compliance with US law. In addition, most instances (case snippets) are too short to contain identifiable information. Should such data be contained, its processing would equally be covered either by implicit consent or a public interest exception. We use all datasets in accordance with copyright terms and under the licenses set forth by their creators.

Limitations & Potential Harms
We have not employed any crowd-workers or annotators for this work. The paper outlines the main limitations with regard to speaker population (English) and generalizability in a dedicated section (Section 7). As a benchmark paper, our claims naturally match the results of the experiments, which, given the current detail of instructions, should be easily reproduced. We provide several ways of accessing the datasets and running the experiments, both with and without Hugging Face infrastructure. We do not currently foresee any potential harms for vulnerable or marginalized populations, and we do not use, to the best of our knowledge, any identifying characteristics for populations of these kinds.

B Evaluation of Unlabeled Instances

In ECtHR Task A, some cases are not labeled with any violated ECHR article (no violation). In ECtHR Task B, there is originally at least one allegedly violated article (considered by the court) in every case; however, unlabeled cases do arise, albeit rarely, after the simplifications we introduced, i.e., some cases were originally labeled only with rare labels that were excluded from our benchmark (Section 3.2). In UNFAIR-ToS, the vast majority of sentences are not labeled with any type of unfairness (unfair term against users), i.e., most sentences do not raise any questions of possible violations of the European consumer law.
In multi-label classification, the set of labels per instance is represented as a one-hot vector Y = [y_1, y_2, ..., y_L], where y_i = 1 if the instance is labeled with the i-th class, and y_i = 0 otherwise. If an instance is not labeled with any class, its Y includes only zeros. During training, binary cross-entropy correctly penalizes such instances if the predictions (Ŷ = [ŷ_1, ŷ_2, ..., ŷ_L]) diverge from zeros. During evaluation, however, the F1-score (F1 = 2·P·R / (P+R), computed over the positive labels) does not reward a model for correctly predicting that an instance has no labels. In order to make F1 sensitive to the correct labeling of such examples, during evaluation (not training) we include an additional label (y_0 or ŷ_0) in both targets (Y) and predictions (Ŷ), whose value is 1 (positive) if the original Y (respectively Ŷ, without y_0, ŷ_0) is [0, 0, ..., 0], and 0 (negative) otherwise. This is particularly important for proper evaluation, as across three datasets a considerable portion of the examples are unlabeled (11.5% in ECtHR Task A, 1.6% in ECtHR Task B, and 95.5% in UNFAIR-ToS).
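A sketch of this evaluation adjustment with scikit-learn, assuming binary indicator matrices for targets and predictions (the helper name is ours, for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_with_no_label_class(y_true, y_pred, average="micro"):
    """Prepend an extra label y_0 that is positive iff the original label
    vector is all zeros, so correctly predicting 'no labels' is rewarded."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    true_empty = (y_true.sum(axis=1) == 0).astype(int)[:, None]
    pred_empty = (y_pred.sum(axis=1) == 0).astype(int)[:, None]
    return f1_score(np.hstack([true_empty, y_true]),
                    np.hstack([pred_empty, y_pred]),
                    average=average, zero_division=0)

# A correctly predicted unlabeled instance now counts as a true positive:
print(f1_with_no_label_class([[0, 0, 0]], [[0, 0, 0]]))  # 1.0 instead of 0.0
```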

C Additional Results
Tables 5 and 6 show development results for all examined models across datasets. We report the mean and standard deviation (±) for the three seeds (among the five used) with the best development scores per model, to exclude catastrophic failures, i.e., runs with severely low performance. The standard deviations are relatively low across models and datasets (up to 0.5% for µ-F1 and up to 1% for m-F1). In many cases, the development results are higher than the test ones (cf. Table 3), as one would expect.

Table 6: Development m-F1 results for all examined models across all LexGLUE tasks. We report the mean and standard deviation (±) for the three seeds with the best development scores per model. In starred datasets, we use the hierarchical variant of each model, except for Longformer and BigBird, as discussed in Section 4.2.

Table 7 reports training times per dataset and model: both the time per epoch (T/e) and the total training time (T) across all epochs. All full-attention models, i.e., all except Longformer and BigBird, have comparable times, with the exception of DeBERTa, which has four separate attention mechanisms. We observe that when the hierarchical variant of these models is deployed, i.e., in ECtHR tasks and SCOTUS, it is approximately twice (2×) as fast as Longformer and BigBird.

Figure 3: Development m-F1 scores of standard BERT (up to 512 tokens) and its hierarchical variant (Section 4.2, 64×128 tokens) in ECtHR (Tasks A and B) and SCOTUS, i.e., the datasets with long documents. Light blue denotes the average score across 5 runs for the hierarchical variant (used in Table 3 for these datasets), while dark blue corresponds to standard BERT (not used in Table 3 for these datasets). The error bars show the standard error.

D Use of 512-token BERT models
In Figure 3, we show results for the standard BERT model of Devlin et al. (2019), which can process up to 512 tokens, compared to its hierarchical variant (Section 4.2), which can process up to 64×128 tokens. We observe that across all datasets that contain long documents (ECtHR A & B, SCOTUS, cf. Fig. 2(a)), the hierarchical variant clearly outperforms the standard model fed with truncated documents (ECtHR A: +10.2 p.p., ECtHR B: +7.5 p.p., SCOTUS: +4.9 p.p.). Compared to the ECtHR tasks, the gains are lower in SCOTUS, a topic classification task where long-range reasoning is not needed; by contrast, for ECtHR multiple distant facts need to be combined. Based on these results, we conclude that using severely truncated documents is not a viable option for LexGLUE, and other directions for processing long documents should be considered in the future, ideally fully pre-trained hierarchical models, contrary to our semi-pre-trained hierarchical models (Section 6).
E Use of Larger Models

For the larger model, we use the AdamW optimizer with a 1e-5 maximum learning rate, a warm-up ratio of 0.1, a weight decay rate of 0.06, and a similar mini-batch size of 8 examples. 20 Table 8 reports the development and test results using the seed (run) with the best development scores. We observe that using the large version of RoBERTa, dubbed RoBERTa (L), with more than 2× the parameters (355M), leads to substantial performance improvements in many tasks compared to the base version, dubbed RoBERTa (B). The results are comparable to, or in some cases better than, those of the legal-oriented language models (Legal-BERT, CaseLaw-BERT). Considering that the two legal-oriented models are much smaller and have been pre-trained with 5-10× less data (Section 2), we have a strong indication that pre-training larger legal-oriented models on larger legal corpora would yield performance gains (Section 6).