LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain


Specifically, the emergence of pre-trained Language Models (PLMs) has led to significant performance boosts on popular benchmarks like GLUE (Wang et al., 2019b) or SuperGLUE (Wang et al., 2019a), emphasizing the need for more challenging benchmarks to measure progress. Legal benchmark suites have also been developed to systematically evaluate the performance of PLMs, showcasing the superiority of legal-oriented models over generic ones on downstream tasks such as legal document classification or question answering (Chalkidis et al., 2022a; Hwang et al., 2022).
Even though these PLMs are shown to be effective for numerous downstream tasks, they are general-purpose models trained on broad-domain resources, such as Wikipedia or news, and can therefore be insufficient for tasks specific to the legal domain (Chalkidis et al., 2020; Hua et al., 2022; Niklaus and Giofre, 2023). Indeed, the legal domain is strongly characterized both by its lexicon and by specific knowledge typically not available outside of specialized domain resources. Laypeople even sometimes call the language used in legal documents "legalese" or "legal jargon", emphasizing its complexity. Moreover, the length of a legal document usually exceeds the length of a Wikipedia or news article, and in some tasks the relationships between its entities may span the entire document. Therefore, it is necessary to develop specialized Legal PLMs trained on extensive collections of legal documents and to evaluate them on standardized legal benchmarks. While new PLMs capable of handling long documents have been developed in recent years, they are predominantly trained for the general domain and on English data only.
The rising need to build NLP systems for languages other than English, the lack of textual resources for such languages, and the widespread use of code-switching in many cultures (Torres Cacoullos, 2020) are pushing researchers to train models on massively multilingual data (Conneau et al., 2020). Nonetheless, to the best of our knowledge, no multilingual legal language model has been proposed so far. Consequently, there is a need for standardized multilingual benchmarks that can be used to evaluate existing models and to assess whether more research effort should be directed toward the development of domain-specific models. This is particularly important for legal NLP, where inherently multinational (European Union, Council of Europe) or multilingual (Canada, Switzerland) legal systems are prevalent.
In this work, we propose a challenging multilingual benchmark for the legal domain, named LEXTREME. We survey the literature from 2010 to 2022 and select 11 relevant NLU datasets covering 24 languages in 8 subdivisions (Germanic, Romance, Slavic, Baltic, Greek, Celtic, Finnic, and Hungarian) from two language families (Indo-European and Uralic). We evaluate five widely used multilingual encoder-based language models, as shown in Figure 1, and observe a correlation between model size and performance on LEXTREME. Surprisingly, at the low end, DistilBERT (Sanh et al., 2019) strongly outperforms MiniLM (Wang et al., 2020) (36.7 vs. 19.0 LEXTREME score) while having only marginally more parameters (135M vs. 118M).
For easy evaluation of future models, we release the aggregate dataset on the Hugging Face Hub along with a public leaderboard, and the necessary code to run experiments on GitHub. Knowing that our work cannot encompass "Everything in the Whole Wide Legal World" (Raji et al., 2021), we design LEXTREME as a living benchmark, provide detailed guidelines on our repository, and encourage the community to contribute high-quality multilingual legal datasets. Finally, we integrated LEXTREME together with the popular English legal benchmark LexGLUE (Chalkidis et al., 2022a) into HELM (Liang et al., 2022), an effort to evaluate language models holistically using a large number of datasets from diverse tasks, to ease the adoption of curated legal benchmarks also for the evaluation of large language models such as GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022), or LLaMA (Touvron et al., 2023). The contributions of this paper are two-fold:
1. We review the legal NLP literature to find relevant legal datasets and compile a multilingual legal benchmark of 11 datasets in 24 languages from 8 language groups.
2. We evaluate several baselines on LEXTREME to provide a reference point for researchers and practitioners to compare to.
2 Related Work

Benchmarks
Benchmarking is an established method to enable easy and systematic comparison of approaches. GLUE (Wang et al., 2019b) is one of the first benchmarks to evaluate general-purpose neural language models. It is a set of supervised sentence-understanding predictive tasks in English that were created through aggregation and curation of several existing datasets. However, it quickly became obsolete due to advanced contextual language models, such as BERT (Devlin et al., 2019), which excelled on most tasks. Subsequently, its updated version, named SUPERGLUE (Wang et al., 2019a), was proposed, incorporating new predictive tasks that are solvable by humans but are difficult for machines. Both benchmarks proposed an evaluation score computed as an aggregation of the scores obtained by the same model on each task. They are also agnostic regarding the pre-training of the model and do not provide a specific corpus for it. Inspired by these works, numerous benchmarks have been proposed over the years. We describe some well-known ones in Table 1.
The MMLU benchmark is specifically designed to evaluate the knowledge acquired during the pre-training phase of the model by featuring only zero-shot and few-shot learning tasks (Hendrycks et al., 2021). Similarly, SUPERB (Yang et al., 2021) and SUPERB-SG (Tsai et al., 2022) were proposed for speech data, unifying well-known datasets. However, they mainly vary in tasks; e.g., SUPERB-SG includes both predictive and generative tasks, which makes it different from the other benchmarks discussed in this section. Additionally, SUPERB-SG includes diverse tasks, such as speech translation and cross-lingual automatic speech recognition, which require knowledge of languages other than English. Neither of the two (SUPERB or SUPERB-SG) proposes an aggregated score.
XTREME (Hu et al., 2020) is a benchmark specifically designed to evaluate the cross-lingual generalization ability of models. It includes six cross-lingual predictive tasks over ten datasets of miscellaneous texts, covering a total of 40 languages. While some of its original datasets were already designed for cross-lingual tasks, others were extended by translating part of the data using human professionals and automatic methods.

Benchmarks for the Legal Domain
LEXGLUE (Chalkidis et al., 2022a) is the first benchmark for the legal domain and covers six predictive tasks over five datasets made of textual documents in English from the US, EU, and Council of Europe. While some tasks may not require specific legal knowledge to be solved, others would probably need, or at least benefit from, information regarding EU or US legislation on the specific topic. Among the main limitations of their benchmark, Chalkidis et al. highlight its monolingual nature and remark that "there is an increasing need for developing models for other languages". In parallel, Chalkidis et al. (2022b) released FairLex, a multilingual benchmark for the evaluation of fairness in legal NLP tasks. With a similar aim, Hwang et al. (2022) released the LBOX benchmark, covering two classification tasks, two legal judgment prediction tasks, and one Korean summarization task. Motivated by LEXGLUE and LBOX, we propose a benchmark to encourage multilingual models, diverse tasks, and datasets for the legal domain. Guha et al. (2022) proposed the LEGALBENCH initiative, which aims to establish an open and collaborative legal reasoning benchmark for few-shot evaluation of LLMs, where legal practitioners and other domain experts can contribute by submitting tasks. At its creation, the authors had already added 44 lightweight tasks. While most tasks require legal reasoning based on the common law system (mostly prevalent in the UK and former colonies), there is also a clause classification task. For a more comprehensive overview of the many tasks related to automated legal text analysis, we recommend the works of Chalkidis et al. (2022a) and Zhong et al. (2020).

Dataset and Task Selection
To find relevant datasets for the LEXTREME benchmark, we explore the NLP and legal-domain literature, identifying relevant venues such as ACL, EACL, NAACL, EMNLP, LREC, ICAIL, and the NLLP workshop. We search the literature in these venues for the years 2010 to 2022, using common case-insensitive keywords related to the legal domain (e.g., criminal, judicial, judgment, jurisdictions, law, legal, legislation) and to datasets (e.g., dataset, corpus), combined via their union. These keywords help select 108 potentially relevant papers. We then formulate several criteria to select the datasets. Finally, three authors analyze the candidate papers and perform the selection; disagreements between authors were resolved through discussion and majority voting.

Exclusion criteria:
E1: The dataset is not publicly available.
E2: The dataset does not have a public license or does not allow data redistribution.
E3: The dataset contains labels generated with ML systems.
E4: It is not described in a peer-reviewed paper.

After applying the above criteria, we select 11 datasets from the 108 papers. We provide the list of all these datasets in our repository.

LEXTREME Datasets
In the following, we briefly describe the selected datasets. Table 2 provides more information about the number of examples and label classes per split for each task. For a detailed overview of the jurisdictions as well as the number of languages covered by each dataset, see Table 3. Each dataset can have several configurations (tasks), which are the basis of our analyses, i.e., the pre-trained models have always been fine-tuned on a single task. LEXTREME consists of three task types: single-label text classification (SLTC), multi-label text classification (MLTC), and named entity recognition (NER).

Brazilian Court Decisions (BCD). Legal systems are often huge and complex, and the information is scattered across various sources; predicting case outcomes from vast volumes of litigation is therefore a difficult task. Lage-Freitas et al. (2022) propose an approach to predict Brazilian legal decisions to support legal practitioners. We use their dataset from the State Supreme Court of Alagoas (Brazil). The input to the models is always the case description. We perform two SLTC tasks: in the BCD-J subset, models predict the approval or dismissal of the case or appeal with the three labels no, partial, and yes; in the BCD-U subset, models predict the judges' unanimity on the decision with the two labels unanimity and not-unanimity.

Greek Legal Code (GLC). The input to the models is the entire document, and the output is one of several topic categories. The dataset is used to perform three different SLTC tasks: at the volume level (GLC-V), chapter level (GLC-C), and subject level (GLC-S).

German Argument Mining (GAM).

Swiss Judgment Prediction (SJP). Niklaus et al. (2021, 2022) focus on predicting the judgment outcome of 85K cases from the Swiss Federal Supreme Court (FSCS). The input to the models is the appeal description, and the output is whether the appeal is approved or dismissed (SLTC task).
Online Terms of Service (OTS). While the benefits of multilingualism (e.g., cultural diversity) in the EU legal world are well known (Commission, 2005), creating an official version of every legal act in 24 languages raises interpretative challenges. Drawzeski et al. (2021) attempt to automatically detect unfair clauses in terms-of-service documents. We use their dataset of 100 contracts to perform an SLTC and an MLTC task. For the SLTC task (OTS-UL), model inputs are sentences, and outputs are classifications into three unfairness levels: clearly fair, potentially unfair, and clearly unfair. The MLTC task (OTS-CT) involves labeling sentences with nine clause topics.
COVID19 Emergency Event (C19). The COVID-19 pandemic led governments worldwide to take various exceptional measures to contain the virus. Tziafas et al. (2021) presented a dataset, also known as EXCEPTIUS, that contains legal documents with sentence-level annotations from several European countries to automatically identify such measures. We use their dataset to perform the MLTC task of identifying the type of measure described in a sentence. The input to the models is the sentence, and the output is none or at least one of the measure types.

MultiEURLEX (MEU).
Multilingual transfer learning has recently gained significant attention due to its increasing applications in NLP tasks. Chalkidis et al. (2021a) explored cross-lingual transfer for legal NLP and presented a corpus of 65K EU laws annotated with multiple labels from the EUROVOC taxonomy. We perform an MLTC task to identify labels (given in the taxonomy) for each document. Since the taxonomy exists on multiple levels, we prepare configurations according to three levels (MEU-1, MEU-2, MEU-3).
Greek Legal NER (GLN). Identifying named entities in natural language text plays an important role in Natural Language Understanding (NLU). Angelidis et al. (2018) compiled an annotated dataset for NER in Greek legal documents. The source material is 254 daily issues of the Greek Government Gazette over the period 2000-2017. In all NER tasks of LEXTREME, the input to the models is the list of tokens, and the output is an entity label for each token.

Models Considered
Since our benchmark only contains NLU tasks, we consider encoder-only models for simplicity. Due to resource constraints, we did not fine-tune models larger than 1B parameters.

Multilingual
We considered the five multilingual models listed in Table 4, each trained on at least 100 languages (more details are in Appendix B). For XLM-R we considered both the base and the large version. Furthermore, we used ChatGPT (gpt-3.5-turbo) for zero-shot evaluation of the text classification tasks with fewer than 50 labels. To be fair across tasks, we did not consider few-shot evaluation or more sophisticated prompting techniques because of prohibitively long inputs in many tasks.

Monolingual
In addition to the multilingual models, we also fine-tuned available monolingual models on the language-specific subsets. We chose monolingual models only if a certain language was represented in at least three datasets. We distinguish between general-purpose models, i.e., models that have been pre-trained on generic data (NativeBERTs), and legal models, i.e., models that have been trained (primarily) on legal data (NativeLegalBERTs). A list of the monolingual models can be found in Table 8 in the appendix.

Hierarchical Variants
A significant part of the datasets consists of very long documents, the best examples being all variants of MultiEURLEX; we provide detailed length statistics computed with different tokenizers on all datasets in our online repository. However, the models we evaluated were all pre-trained with a maximum sequence length of only 512 tokens. Directly applying pre-trained language models to lengthy legal documents may necessitate substantial truncation, severely restricting the models. To overcome this limitation, we use hierarchical variants of the pre-trained models for datasets containing long documents.
The hierarchical variants used here are broadly equivalent to those in Chalkidis et al. (2021b) and Niklaus et al. (2022). First, we divide each document into chunks of 128 tokens each. Second, we use the model under evaluation to encode each of these chunks in parallel and obtain the [CLS] embedding of each chunk, which can be used as a context-unaware chunk representation. To make these representations context-aware, i.e., aware of the surrounding chunks, they are fed into a 2-layer transformer encoder. Finally, max-pooling over the context-aware chunk representations yields a document representation that is fed to a classification layer.
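For illustration, the hierarchical scheme above can be sketched in PyTorch as follows. The class and argument names are ours, not those of the released code, and the pre-trained encoder that produces the per-chunk [CLS] embeddings is assumed to run upstream:

```python
# Sketch of the hierarchical variant: per-chunk [CLS] embeddings are
# contextualized by a small 2-layer transformer, then max-pooled into a
# single document representation fed to a classification layer.
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    def __init__(self, encoder, hidden_size, num_labels, chunk_len=128):
        super().__init__()
        self.encoder = encoder      # pre-trained chunk encoder (used upstream)
        self.chunk_len = chunk_len
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True)
        self.chunk_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, chunk_cls_embeddings):
        # chunk_cls_embeddings: (batch, num_chunks, hidden) -- the [CLS]
        # vector of each 128-token chunk, produced by self.encoder upstream.
        contextualized = self.chunk_encoder(chunk_cls_embeddings)
        doc_repr, _ = contextualized.max(dim=1)  # max-pool over chunks
        return self.classifier(doc_repr)
```

Note that only the small chunk-level transformer and the classifier are new parameters; the underlying pre-trained encoder is shared across chunks.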
Unfortunately, to the best of our knowledge, models capable of handling longer contexts out of the box, such as Longformer (Beltagy et al., 2020) and SLED (Ivgi et al., 2023), are not available multilingually and are predominantly trained on English data only.

Experimental Setup
Multilingual models were fine-tuned on all languages of a given dataset; monolingual models used only the given model's language subset. Some datasets are highly imbalanced, one of the best examples being BCD-U with a minority-class proportion of about 2%. Therefore, we applied random oversampling on all SLTC datasets, except for GLC, since all its subsets have too many labels, which would have led to a drastic increase in data size and thus in the computational cost of fine-tuning. For each run, we used the same hyperparameters, as described in Section A.3.
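A minimal sketch of what random oversampling amounts to on an imbalanced SLTC training set (the function name and seed handling are ours, for illustration only):

```python
# Random oversampling: duplicate minority-class examples (sampled with
# replacement) until every class matches the majority-class count.
import random
from collections import defaultdict

def random_oversample(examples, labels, seed=1):
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, y in zip(examples, labels):
        by_label[y].append(ex)
    target = max(len(v) for v in by_label.values())
    out_x, out_y = [], []
    for y, exs in by_label.items():
        resampled = exs + [rng.choice(exs) for _ in range(target - len(exs))]
        out_x += resampled
        out_y += [y] * target
    return out_x, out_y
```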
As described in Section 4.3, some tasks contain very long documents, requiring hierarchical variants to process sequence lengths of 1024 to 4096 tokens. Based on the distribution of sequence lengths per example for each task (cf. Appendix H), we decided on a suitable sequence length for each task before fine-tuning. A list of the chosen sequence lengths is given in Appendix A.1.

Evaluation Metrics.
We use the macro-F1 score for all datasets to ensure comparability across the entire benchmark, since it can be computed for both text classification and NER tasks. Matthews Correlation Coefficient (MCC) (Matthews, 1975) is a suitable score for evaluating text classification tasks, but its applicability to NER tasks is unclear. For brevity, we do not display additional scores here, but more detailed scores (such as precision and recall, and scores per seed) and additional scores (such as MCC) can be found online on our Weights and Biases project.
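As a reminder, macro-F1 averages per-class F1 scores with equal weight, so rare classes count as much as frequent ones. A plain-Python sketch (not the evaluation code used for the benchmark):

```python
# Macro-F1: compute F1 per class, then take the unweighted mean.
def macro_f1(y_true, y_pred):
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)
```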

Aggregate Score
We acknowledge that the datasets included in LEX-TREME are diverse and hard to compare due to variations in the number of samples and task complexity (Raji et al., 2021).This is why we always report the scores for each dataset subset, enabling a fine-grained analysis.However, we believe that by taking the following three measures, an aggregate score can provide more benefits than drawbacks, encouraging the community to evaluate multilingual legal models on a curated benchmark, thus easing comparisons.
We (a) evaluate all datasets with the same score (macro-F1), making aggregation more intuitive and easier to interpret; (b) aggregate the F1 scores using the harmonic mean, since F1 scores are themselves rates obtained via the harmonic mean of precision and recall, following Tatiana and Valentin (2021); and (c) base our final aggregate score on two intermediate aggregate scores, the dataset aggregate and the language aggregate score, thus weighing datasets and languages equally and promoting model fairness, following Tatiana and Valentin (2021) and Chalkidis et al. (2022a).
The final LEXTREME score is computed as the harmonic mean of the dataset and the language aggregate score. We calculate the dataset aggregate by successively taking the harmonic mean of (i) the languages within a configuration (e.g., de, fr, it in SJP), (ii) the configurations within a dataset (e.g., OTS-UL, OTS-CT in OTS), and (iii) the datasets in LEXTREME (BCD, GAM, etc.). The language aggregate score is computed similarly: by taking the harmonic mean of (i) the configurations within a dataset, (ii) the datasets for each language (e.g., MAP, MEU for lv), and (iii) the languages in LEXTREME (bg, cs, etc.).
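The successive harmonic means above can be sketched as follows (illustrative values; function names are ours, and `harmonic_mean` comes from the Python standard library):

```python
# Hierarchical harmonic-mean aggregation: low scores on any subset pull
# the aggregate down more than an arithmetic mean would.
from statistics import harmonic_mean

def dataset_aggregate(config_lang_scores):
    # config_lang_scores: {config: [per-language macro-F1 scores]}
    per_config = [harmonic_mean(v) for v in config_lang_scores.values()]
    return harmonic_mean(per_config)

def lextreme_score(dataset_agg, language_agg):
    # Final score: harmonic mean of the two intermediate aggregates.
    return harmonic_mean([dataset_agg, language_agg])
```

Because each level uses the harmonic mean, a single weak configuration (e.g., a 20 next to an 80) drags its dataset aggregate well below the arithmetic mean.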
We do not address the dimension of the jurisdiction, which we consider beyond the scope of this work.

Results
In this section, we discuss the baseline evaluations. Scores and standard deviations for the validation and test sets across seeds are on our Weights and Biases project and in Tables 11, 12, 13, and 14. Comparisons with prior results on each dataset can be drawn from the tables provided in Section G in the appendix. Aggregated results by dataset and language are in Tables 5 and 6.

Larger models are better. For both datasets and languages, we see a clear trend that larger models perform better. However, when looking at individual datasets and languages, the scores are more erratic. Note that XLM-R Base underperforms on MEU (especially on MEU-3; see Tables 11 and 12), leading to a low dataset aggregate score due to the harmonic mean. Additionally, the low performance on MEU-3 has a large impact on its language aggregate score, since it affects all 24 languages.

Differing model variance across datasets
We observe significant variation across datasets such as GLC, OTS, and C19, with differences as large as 52 points (on OTS) between the worst-performing MiniLM and the best-performing XLM-R Large. MiniLM seems to struggle greatly with these three datasets, while even achieving the best performance on GAM. On other datasets, such as GAM, SJP, and MAP, the models are very close together (less than 6 points between the best and worst model). Even though XLM-R Large takes the top spot on aggregate, it achieves the best performance in only six out of eleven datasets.

Less variability across languages
In contrast to the inconsistent results across datasets, XLM-R Large outperforms the other multilingual models on most languages (21 out of 24). Additionally, we note that model variability within a language is similar to the variability within a dataset; however, we do not see extreme cases such as GLC, OTS, or C19.
Monolingual models are strong. Monolingual general-purpose models (NativeBERT) show strong performance with only a few exceptions (on Bulgarian, Spanish, Polish, and Slovak). In 13 out of 19 available languages they reach the top performance, leading to the top language aggregate score. The few available models pre-trained on legal data (NativeLegalBERT) slightly outperform multilingual models of the same size.
ChatGPT underperforms. We show a comparison of ChatGPT with the best-performing multilingual model, XLM-R Large, in Table 7. To save costs, we limited the evaluation to 1000 samples for ChatGPT. We use the validation set instead of the test set to avoid leaking test data into ChatGPT, which could affect future evaluations. Chalkidis (2023) showed that ChatGPT is still outperformed by supervised approaches on LexGLUE. Similarly, we find that much smaller supervised models clearly outperform ChatGPT in all tested tasks, with very large gaps in GAM and OTS.

Conclusions and Future Work
Conclusions. We survey the literature and select 11 datasets out of 108 papers with rigorous criteria to compile the first multilingual benchmark for legal NLP. By open-sourcing both the dataset and the code, we invite researchers and practitioners to evaluate any future multilingual models on our benchmark. We provide baselines for five popular multilingual encoder-based language models of different sizes. We hope that this benchmark will foster the creation of novel multilingual legal models and thereby contribute to the progress of natural legal language processing. We imagine this work as a living benchmark and invite the community to extend it with new suitable datasets.

Future Work
In future work, we plan to extend this benchmark with other NLU tasks and also with generation tasks such as summarization, simplification, or translation. Additionally, a deeper analysis of the differences in behavior between monolingual general-purpose models and models trained on legal data could provide useful insights for the development of new models. Another relevant aspect that deserves further study is the impact of the jurisdiction, and whether jurisdiction information is predominantly learned as part of the LLM or instead during fine-tuning. Finally, extending datasets to more languages and evaluating other models such as mT5 (Xue et al., 2021) are other promising directions.

Limitations
It is important not to overstate the capabilities of language models or the ambitions of benchmarks: many recent works have addressed the limits of these tools and analyzed the consequences of their misuse. For example, Bender and Koller (2020) argue that language models do not really learn "meaning". Bender et al. (2021) further expand the discussion by addressing the risks related to these technologies and proposing mitigation methods. Koch et al. (2021) evaluate the use of datasets inside scientific communities and highlight that many machine learning sub-communities focus on very few datasets and that these datasets are often "borrowed" from other communities. Raji et al. (2021) offer a detailed exploration of the limits of popular "general" benchmarks, such as GLUE (Wang et al., 2019b) and ImageNet (Deng et al., 2009). Their analysis covers three aspects: limited task design, de-contextualized data and performance reporting, and inappropriate community use.
The first problem concerns the fact that tasks are typically not chosen on the basis of proper theories, with a selection designed to prove generality. Instead, they are limited to what the community considers interesting, what is available, or similar criteria. These considerations also hold for our work. Therefore, we cannot claim that our benchmark can be used to assess the "generality" of a model or to prove that it "understands natural legal language".
The second point addresses the fact that any task, data, or metric is limited to its context; therefore "data benchmarks are closed and inherently subjective, localized constructions". In particular, the content of the data can be too different from real data, and the format of the tasks can be too homogeneous compared to human activities. Moreover, any dataset inherently contains biases. We tackle this limitation by including only tasks and data that are based on real-world scenarios, in an effort to minimize the difference between the performance of a model on our benchmark and its performance on a real-world problem.
The last aspect regards the negative consequences that benchmarks can have. Competitive testing may encourage misbehavior, and aggregated performance evaluation can create a mirage of cross-domain comparability. The presence of popular benchmarks can influence a scientific community to the point of steering it toward techniques that perform well on that specific benchmark, to the detriment of those that do not. Finally, benchmarks can be misused in marketing to promote commercial products while hiding their flaws. These behaviors obviously cannot be forecast in advance, but we hope that this analysis of the shortcomings of our work will be sufficient to prevent misuse of our benchmark and will also inspire research directions for complementary future works. As for aggregated evaluations specifically, they provide an intuitive but imprecise understanding of the performance of a model. While we do not deny their potential downsides, we believe that their responsible use is beneficial, especially when compared to evaluating a model on only an arbitrarily selected set of datasets. Therefore, we opted to provide an aggregated evaluation and to weigh languages and tasks equally to make it as robust and fair as possible.
While Raji et al. and Koch et al. argue against the misrepresentation and misuse of benchmarks and datasets, they do not argue against their usefulness. On the contrary, they consider the creation and adoption of novel benchmarks a sign of a healthy scientific community.
Finally, we want to remark that for many datasets the task of outcome prediction is based not on the documents provided by the parties, but on the document written by the judge along with the decision. For example, Semo et al. (2022) provide a more realistic setup for judgment prediction than other datasets, using actual complaints as inputs. However, due to very limited access to complaint documents, especially multilingually, creating such datasets is extremely challenging. Thus, most recent works have used text from court decisions as a proxy. However, predicting the judgment outcome based on text written by the court itself can still be a hard task (as evidenced by results on these datasets). Moreover, it may still require legal reasoning capabilities from models because of the need to pick out the correct information. Additionally, we believe that these tasks can also be interesting for post hoc analyses of decisions.

Ethics Statement
The scope of this work is to release a unified multilingual legal NLP benchmark to accelerate the development and evaluation of multilingual legal language models. A transparent multilingual and multinational benchmark for NLP in the legal domain might serve as an orientation for scholars and industry researchers by broadening the discussion and helping practitioners to build assisting technology for legal professionals and laypersons. We believe that this is an important application field, where research should be conducted (Tsarapatsanis and Aletras, 2021) to improve legal services and democratize law, while also highlighting the various shortcomings of such technologies and seeking a responsible and ethical (fair) deployment of legal-oriented technologies.
Nonetheless, irresponsible use (deployment) of such technology is a plausible risk, as in any other application (e.g., online content moderation) and domain (e.g., medical).We believe that similar technologies should only be deployed to assist human experts (e.g., legal scholars in research, or legal professionals in forecasting or assessing legal case complexity) with notices on their limitations.
All datasets included in LEXTREME are publicly available and have been previously published. We reference the original work and encourage LEXTREME users to do so as well. In fact, we believe this work should only be referenced, in addition to citing the original work, when experimenting with multiple LEXTREME datasets and using the LEXTREME evaluation infrastructure. Otherwise, only the original work should be cited.

A.2 Total compute
We used a total of 689 GPU days.

A.3 Hyperparameters
We used a learning rate of 1e-5 for all models and datasets without tuning. We ran all experiments with 3 random seeds (1-3) and always used a batch size of 64. In cases where GPU memory was insufficient, we additionally used gradient accumulation.
We trained using early stopping on the validation loss with an early-stopping patience of 5 epochs. Because MultiEURLEX is very large and the experiments very long, we trained for only 1 epoch and evaluated after every 1000th step when fine-tuning multilingual models on the entire dataset. For fine-tuning the monolingual models on language-specific subsets of MultiEURLEX, we evaluated on the basis of epochs. We used AMP mixed-precision training and evaluation to reduce costs. Mixed precision was not used in combination with microsoft/mdeberta-v3-base because it led to errors. For the experiments we used the following NVIDIA GPUs: 24GB RTX 3090, 32GB V100, and 80GB A100.
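The early-stopping rule described here reduces to the following framework-agnostic loop (a sketch with names of our own choosing, not the actual training code):

```python
# Early stopping on validation loss: stop once the loss has not improved
# for `patience` consecutive epochs; keep track of the best epoch.
def train_with_early_stopping(evaluate_epoch, max_epochs, patience=5):
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        val_loss = evaluate_epoch(epoch)  # train one epoch, return val loss
        if val_loss < best_loss:
            best_loss, best_epoch, bad_epochs = val_loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_loss
```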

B Model Descriptions
MiniLM. MiniLM (Wang et al., 2020) is the result of a task-agnostic compression technique, also called distillation, in which a compact model, the student, is trained to reproduce the behavior of a larger pre-trained model, the teacher. This is achieved by deep self-attention distillation, i.e., only the self-attention module of the last Transformer layer of the teacher, which stores a lot of contextual information (Jawahar et al., 2019), is distilled. The student is trained by closely imitating the teacher's final Transformer layer's self-attention behavior.

Table 13: Arithmetic mean of macro-F1 and the standard deviation over all seeds for monolingual models on the validation set.

Table 14: Arithmetic mean of macro-F1 and the standard deviation over all seeds for monolingual models on the test set.

G Original Paper Results
In this section, we present an overview of the scores for each configuration of the LEXTREME dataset as provided in the original papers. When certain configurations were not available, no scores were obtained.
It should be noted that different papers provide varying scores, making direct comparisons with our results challenging. Additionally, the variability in the training and evaluation procedures used across different papers may impact the resulting scores, which is an important factor to consider. To gain a better understanding of the training and evaluation procedures, please refer to the cited references. The LEXTREME scores are calculated by taking the arithmetic mean over the seeds (three in total).

H Histograms
In the following, we provide histograms of the distribution of the input sequence length (sentence or entire document) for each dataset. The length is measured by counting tokens using the tokenizers of the multilingual models, i.e., DistilBERT, MiniLM, mDeBERTa v3, XLM-R base, and XLM-R large. We only display the distribution within the 99th percentile; the rest is grouped together at the end.
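The tail grouping described above can be sketched as follows. We use a nearest-rank percentile and our own helper name; the paper does not specify the exact percentile method.

```python
import math

# Sketch: clip sequence lengths at the 99th percentile so that everything
# beyond it collapses into the final histogram bin (helper name is ours).

def clip_at_percentile(lengths, pct=99):
    """Return lengths with values above the pct-th percentile replaced by it."""
    ordered = sorted(lengths)
    # nearest-rank percentile: value at ceil(pct/100 * n), 1-indexed
    rank = min(len(ordered) - 1, math.ceil(pct / 100 * len(ordered)) - 1)
    cutoff = ordered[rank]
    return [min(length, cutoff) for length in lengths], cutoff

lengths = list(range(1, 101)) + [5000]  # one extreme outlier
clipped, cutoff = clip_at_percentile(lengths)
```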

Figure 1 :
Figure 1: Overview of multilingual models on LEXTREME. The bubble size and the text inside indicate the parameter count. The bold number below the model name indicates the LEXTREME score (harmonic mean of the language agg. score and the dataset agg. score).
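The aggregation in the caption can be made concrete with a short sketch: the LEXTREME score is the harmonic mean of the two aggregates, which penalizes imbalance between them more strongly than an arithmetic mean would. The input values below are illustrative, not actual results.

```python
# Sketch: LEXTREME score as the harmonic mean of the language-aggregate
# and dataset-aggregate scores (input values below are illustrative).

def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

language_agg, dataset_agg = 60.0, 40.0
lextreme_score = harmonic_mean([language_agg, dataset_agg])  # -> 48.0
```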

Table 1 :
Characteristics of popular existing NLP benchmarks.

Table 2 :
Dataset and task overview.# Examples and # Labels show values for train, validation, and test splits.
Similar to GLN, Pais et al. (2021) manually annotated Romanian legal documents for various named entities. The dataset is derived from 370 documents from the larger MARCELL Romanian legislative subcorpus.
LeNER-Br (LNB). Luz de Araujo et al. (2018) compiled a dataset for NER on Brazilian legal documents. 66 legal documents from several Brazilian courts and four legislation documents were collected, resulting in a total of 70 documents annotated for named entities.

MAPA (MAP). de Gibert et al. (2022) built a multilingual corpus based on EUR-Lex (Baisa

Table 4 :
Multilingual models.All models support a maximum sequence length of 512 tokens.The third column shows the total number of parameters, including the embedding layer.

Table 5 :
Dataset aggregate scores for multilingual models. The best scores are in bold. (Columns: Model, BCD, GAM, GLC, SJP, OTS, C19, MEU, GLN, LNR, LNB, MAP, Agg.)

Table 6 :
Language aggregate scores for multilingual models. The best scores are in bold. For each language, we also list the best-performing monolingual legal model under NativeLegalBERT and the best-performing monolingual non-legal model under NativeBERT. Missing values indicate that no suitable models were found.

Table 7 :
Results with ChatGPT on the validation sets performed on June 15, 2023.Best results are in bold.
also introduce the self-attention value-relation transfer in addition to the self-attention distributions. The addition of a teacher assistant results in further improvements. For the training of multilingual MiniLM, XLM-R base was used.
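The core of deep self-attention distillation is matching the student's last-layer attention distributions to the teacher's, typically via a KL-divergence loss. A toy sketch of that objective on a single attention row follows; the values and names are illustrative only, and the actual MiniLM loss additionally covers value relations and averages over heads and positions.

```python
import math

# Toy sketch: KL divergence between a teacher's and a student's attention
# distribution over 4 positions (the numbers are illustrative only).

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_attn = [0.7, 0.1, 0.1, 0.1]
student_attn = [0.4, 0.2, 0.2, 0.2]
attn_loss = kl_divergence(teacher_attn, student_attn)  # > 0; zero iff identical
```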

Table 8 :
Monolingual models.BS is short for batch size.For a detailed overview of the pretraining corpora, we refer to the publications.For some models we were not able to find publications/specs.

Table 15 :
BCD-J.The best scores are in bold.

Table 16 :
BCD-U.The best scores are in bold.

Table 17 :
C19.The best scores are in bold.

Table 18 :
GAM.The best scores are in bold.

Table 19 :
GLC-C.The best scores are in bold.

Table 20 :
GLC-S.The best scores are in bold.

Table 21 :
GLC-V.The best scores are in bold.

Table 22 :
GLN.The best scores are in bold.

Table 23 :
LNR.The best scores are in bold.

Table 24 :
LNB.The best scores are in bold.