MultiEURLEX - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer

We introduce MULTI-EURLEX, a new multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated into 23 languages and annotated with multiple labels from the EUROVOC taxonomy. We highlight the effect of temporal concept drift and the importance of chronological, instead of random, splits. We use the dataset as a testbed for zero-shot cross-lingual transfer, where we exploit annotated training documents in one language (source) to classify documents in another language (target). We find that fine-tuning a multilingually pretrained model (XLM-ROBERTA, MT5) in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer to other languages. Adaptation strategies, namely partial fine-tuning, adapters, BITFIT, and LNFIT, originally proposed to accelerate fine-tuning for new end-tasks, help retain multilingual knowledge from pretraining, substantially improving zero-shot cross-lingual transfer, but their impact also depends on the pretrained model used and the size of the label set.


Introduction
Multilingual learning is an active field of research in NLP. Starting from neural machine translation (Stahlberg, 2020), multilingual neural models are increasingly being considered across NLP tasks, and multilingual benchmark datasets for cross-lingual language understanding are becoming available (Hu et al., 2020; Ruder et al., 2021), complementing previous monolingual benchmarks (Wang et al., 2018). The initial paradigm of multilingual word embeddings (Ruder et al., 2017) was rapidly expanded to pretrained multilingual models (Conneau et al., 2018), including work on zero-shot cross-lingual transfer (Artetxe and Schwenk, 2019). Multilingual models based on TRANSFORMERs (Vaswani et al., 2017), jointly pretrained on large corpora across multiple languages, have significantly advanced the state-of-the-art in cross-lingual tasks (Conneau et al., 2020; Xue et al., 2021).

Figure 1: The languages of MULTI-EURLEX (Table 1) from 7 families, illustrated per EU country on a map. The UK was an EU member until 2020. The map should not be taken to imply that no other languages are spoken in EU countries.
In another interesting direction, legal NLP (Aletras et al., 2019; Zhong et al., 2020) is an emerging field targeting tasks such as legal judgment prediction (Aletras et al., 2016), legal topic classification (Chalkidis et al., 2019), legal question answering (Kim et al., 2015), and contract understanding (Hendrycks et al., 2021), to name a few. Generic pretrained language models for legal text in particular have also been introduced (Chalkidis et al., 2020b). Despite this rapid growth, cross-lingual transfer has not yet been explored in legal NLP. To facilitate research on cross-lingual transfer for text classification, and legal topic classification in particular, we introduce a new multilingual dataset, MULTI-EURLEX, which includes 65k European Union (EU) laws, officially translated into the 23 EU official languages (Fig. 1). Each document is annotated with multiple labels from EUROVOC, where concepts are organized hierarchically (Fig. 2).1

Figure 2: Examples from levels (L_i) 1 to 3 of the EUROVOC hierarchy. Concepts become more specific as we move from higher to lower levels.

We use the dataset as a testbed for zero-shot cross-lingual transfer in cases where we wish to exploit labeled training documents in one language (source) to classify documents in another language (target). This would allow, e.g., classifiers trained in resource-rich languages to be reused in languages with fewer or no training instances.
We experiment with monolingual and multilingual TRANSFORMER-based models, i.e., monolingual BERT models (Devlin et al., 2019), XLM-ROBERTA (Conneau et al., 2020), and MT5 (Xue et al., 2021). We find that fine-tuning a multilingual model in a single source language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer to target languages. We show that adaptation strategies, namely not fine-tuning some layers, adapters (Houlsby et al., 2019), BITFIT (Zaken et al., 2021), and LNFIT, inspired by Frankle et al. (2021), originally proposed to accelerate fine-tuning for new end-tasks, help retain multilingual knowledge from pretraining, substantially improving zero-shot cross-lingual transfer, but their impact also depends on the particular pretrained model used and the size of the label set. We also compare chronological vs. random splits, highlighting the impact of temporal concept drift in legal topic classification, which causes random splits to over-estimate performance (Søgaard et al., 2021). Our main contributions are:

• A parallel multilingual annotated dataset for legal topic classification with 65k EU laws in 23 languages, which can be used as a testbed for cross-lingual multi-label classification.

• Extensive experiments with state-of-the-art monolingual and multilingual models in 23 languages, which establish strong baselines for research on cross-lingual (legal) text classification.

• Experiments with several adaptation strategies showing that adaptation is beneficial in zero-shot cross-lingual transfer, apart from task transfer.

• Comparison of chronological vs. random splits, showing the temporal concept drift in legal topic classification and the problems with random splits.

Related Work
Legal topic classification has been studied for EU legislation (Mencia and Fürnkranz, 2007; Chalkidis et al., 2019) in a monolingual setting (English). While there are several legal NLP studies with non-English datasets (Kim et al., 2015; Waltl et al., 2017; Nguyen et al., 2018; Angelidis et al., 2018; Luz de Araujo et al., 2020), cross-lingual transfer has not been studied in the legal domain. Cross-lingual transfer is a very active area of wider NLP research, currently dominated by large multilingually pretrained models (Conneau et al., 2018; Eisenschlos et al., 2019; Liu et al., 2020; Xue et al., 2021). Recent work explores adapter modules (Houlsby et al., 2019) to transfer monolingually pretrained (Artetxe et al., 2020) or multilingually pretrained (Pfeiffer et al., 2020) models to new (target) languages. We examine more adaptation strategies, apart from adapter modules, in truly zero-shot cross-lingual transfer. Unlike Pfeiffer et al. (2020), we do not train language-specific adapters per target language; we use adapters to fine-tune a single multilingual model on the source language, which is then used in all target languages.
In the broader field of multilingual legal studies, Gonçalves and Quaresma (2010) examined legal topic classification with a dataset comprising 2.7k EU laws in 4 languages (English, German, Spanish, Portuguese). They experimented with monolingual SVM classifiers and their combination in a multilingual ensemble. More recently, Galassi et al. (2020) transferred sentence-level gold labels from annotated English to non-annotated German sentences, for the task of identifying unfair clauses in Terms of Service (2.7k sentences) and Privacy Policy documents (1.8k sentences). They experimented with similarity-based methods aligning the English sentences to machine translations of the German sentences. We experiment with state-of-the-art multilingual TRANSFORMER-based models, considering many more languages (23) and a much larger dataset (65k EU laws). Although MULTI-EURLEX is largely parallel, we use it as a testbed for zero-shot cross-lingual transfer, without requiring parallel training data or machine translation systems.

The MULTI-EURLEX Dataset

Multi-granular Labeling: EUROVOC has eight levels of concepts (Fig. 2 illustrates three). Each document is assigned one or more concepts (labels). If a document is assigned a concept, the ancestors and descendants of that concept are typically not assigned to the same document. The documents were originally annotated with concepts from levels 3 to 8. We created three alternative sets of labels per document, by replacing each assigned concept with its ancestor from level 1, 2, or 3, respectively. Thus, we provide four sets of gold labels per document, one for each of the first three levels of the hierarchy, plus the original sparse label assignment.5 Table 2 presents the distribution of labels across label sets.
Supported Tasks: Similarly to EURLEX57K (Chalkidis et al., 2019), MULTI-EURLEX can be used for legal topic classification, a multi-label classification task where legal documents need to be assigned concepts (in our case, from EUROVOC) reflecting their topics. Unlike EURLEX57K, however, MULTI-EURLEX supports labels from three different granularities (EUROVOC levels). More importantly, apart from monolingual (one-to-one) experiments, it can be used to study cross-lingual transfer scenarios, including one-to-many (systems trained in one language and used in other languages with no training data), and many-to-one or many-to-many (systems jointly trained in multiple languages and used in one or more other languages).
Data Split and Concept Drift: MULTI-EURLEX is chronologically split into training (55k documents, 1958-2010), development (5k, 2010-2012), and test (5k, 2012-2016) subsets.

To verify that the chronological split of MULTI-EURLEX into training, development, and test subsets leads to temporal concept drift, we compare the KL-divergence between the label distributions of the subsets under the chronological vs. a random split. Table 3 shows that a random split leads to almost zero divergence for levels 1-3 and low divergence when using all labels. With the chronological split, the divergence increases as the number of labels increases, and is larger between the train and test subsets, which have a larger temporal distance than the train and development subsets.
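For concreteness, a small sketch of such a comparison follows (our own illustrative code, not the released implementation; `train_labels` and `test_labels` are assumed to be lists of per-document label-id lists, and the smoothing constant is arbitrary):

```python
import numpy as np
from scipy.stats import entropy

def label_distribution(label_lists, num_labels, eps=1e-9):
    """Estimate a (smoothed) label distribution from per-document label ids."""
    counts = np.full(num_labels, eps)
    for labels in label_lists:
        for label in labels:
            counts[label] += 1
    return counts / counts.sum()

# KL(train || test): larger values indicate a stronger shift between subsets.
p = label_distribution(train_labels, num_labels=567)   # e.g. the level-3 label set
q = label_distribution(test_labels, num_labels=567)
kl_train_test = entropy(p, q)
```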

Table 4: Results of MULTI-EURLEX for the original sparse annotation (7,390 labels) with BERT, using a random or chronological split. Here the model is fine-tuned and tested on English data only (one-to-one).
To further highlight the temporal concept drift, we fine-tune BERT (Devlin et al., 2019) on the English part of MULTI-EURLEX using all labels, following Chalkidis et al. (2019). Table 4 shows that although the performance on training data is very high with both splits, it deteriorates more rapidly on development data with the chronological split. Also, performance is stable when moving from development to test data with the random split, since both subsets contain randomly sampled unseen documents; but with the chronological split, performance continues to decline on test data. This confirms our hypothesis of temporal concept drift and shows that the random split over-estimates real performance, contrary to the chronological split.

MT5 (Xue et al., 2021) follows a text-to-text format, treating all tasks (including classification) as text generation. This approach (text-to-text) is reasonable in single-label multi-class classification tasks like those of GLUE (Wang et al., 2018), where the output is expected to be the textual descriptor of a single class. But in our case, we have a multi-label task with 5 labels per document on average and label sets containing hundreds or thousands of labels; hence a textual output would be unnecessarily complex. Also, requiring a sequence of labels as output would be problematic, since the correct labels are not ordered. Hence, we use only the encoder of MT5. Similarly to XLM-ROBERTA, we add a [cls] special token, always at the beginning of the sequence, and use its top-level hidden state to represent the document.8
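As a sketch of this encoder-only use of MT5 (illustrative code, not the released implementation; the checkpoint name and the way the [cls] token is added to the vocabulary are assumptions):

```python
import tensorflow as tf
from transformers import TFMT5EncoderModel

encoder = TFMT5EncoderModel.from_pretrained("google/mt5-base")
classifier = tf.keras.layers.Dense(567, activation="sigmoid")  # level-3 label set

def classify(input_ids, attention_mask):
    # input_ids are assumed to start with the id of the added [cls] token
    hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state
    cls_state = hidden[:, 0, :]        # top-level hidden state of [cls]
    return classifier(cls_state)       # one probability per EUROVOC concept
```

With sigmoid outputs, the model can assign multiple concepts per document, avoiding the label-ordering problem of a generative (text-to-text) formulation.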

Cross-lingual Adaptation Strategies
We mainly study zero-shot cross-lingual transfer, where we fine-tune (further train) a multilingual model (pretrained on a multilingual corpus) only on annotated documents of a source language, and evaluate it (without any further training) on test (and development) documents in the other 22 languages (one-to-many). To avoid catastrophically forgetting the multilingual pretraining when fine-tuning only on the source language, we examine adaptation strategies, where the model is only partially fine-tuned. These were originally proposed to accelerate fine-tuning when moving to new end-tasks, but we employ them to retain multilingual knowledge. The four strategies are the following:

Frozen layers: In this case, we follow Rosenfeld and Tsotsos (2019) and do not update the parameters of the first N or all (N = 12) stacked TRANSFORMER blocks during fine-tuning; we also never update any input embeddings (of tokens, positions, segments). We experiment with N = 3, 6, 9, 12.
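As an illustration of the layer-freezing strategy, the sketch below disables updates for the input embeddings and the first N encoder blocks of a Hugging Face TF model (illustrative only; the attribute paths, e.g. `model.roberta.encoder.layer`, are an assumption and may differ across library versions):

```python
from transformers import TFXLMRobertaModel

def freeze_bottom_blocks(model, n):
    """Freeze the input embeddings and the first n TRANSFORMER blocks;
    the remaining blocks (and the classification head) stay trainable."""
    model.roberta.embeddings.trainable = False          # tokens, positions, segments
    for block in model.roberta.encoder.layer[:n]:
        block.trainable = False

model = TFXLMRobertaModel.from_pretrained("xlm-roberta-base")
freeze_bottom_blocks(model, n=6)                         # e.g. N = 6 of the 12 blocks
```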
Adapter modules: In this case, we follow Houlsby et al. (2019), placing adapter modules after each feed-forward layer (FFNN) inside each TRANSFORMER encoder block. Each block contains two FFNN layers: one after the attention layer and one at the very end. An adapter module consists of a down-projection dense layer (W_down ∈ R^{D_h × K}, assuming row vectors, where K ≪ D_h) and a consecutive up-projection (W_up ∈ R^{K × D_h}), followed by a residual connection (He et al., 2016). The rest of the TRANSFORMER block is not updated, except for the layer normalization components (Ba et al., 2016).
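A minimal sketch of such a bottleneck adapter is shown below (our own illustrative code, not the released implementation); Houlsby et al. (2019) also place a non-linearity between the two projections, and the bottleneck size K is a hyper-parameter (the value 64 below is only a placeholder, see Appendix A):

```python
import tensorflow as tf

class Adapter(tf.keras.layers.Layer):
    """Bottleneck adapter: down-project to K dims, non-linearity,
    up-project back to D_h, then add a residual connection."""

    def __init__(self, d_h=768, k=64, **kwargs):   # k is a placeholder value
        super().__init__(**kwargs)
        self.down = tf.keras.layers.Dense(k, activation="gelu")
        self.up = tf.keras.layers.Dense(d_h)

    def call(self, hidden_states):
        return hidden_states + self.up(self.down(hidden_states))
```

During fine-tuning, only the adapter parameters, the layer normalization components, and the classification layer are updated; the rest of the encoder stays frozen.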
BitFit: BITFIT (Zaken et al., 2021) keeps the whole network frozen during fine-tuning, except for the bias terms. Zaken et al. showed that applying BITFIT to the English BERT (updating 0.09% of the parameters) is competitive with fully fine-tuning the entire model on the GLUE benchmark (Wang et al., 2018).

LnFit: Inspired by Frankle et al. (2021), LNFIT fine-tunes only the parameters of the layer normalization components (Ba et al., 2016), an even smaller fraction of the model's parameters than BITFIT.

In all four strategies, the randomly-initialized classification (dense) layer on top of the encoder is always fine-tuned.
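The sketch below illustrates how such selective updates could be implemented in a custom TF training step, by filtering the variables passed to the optimizer (illustrative only; the variable-name suffixes used for the filter follow the usual Hugging Face TF naming and are an assumption, as is the learning rate):

```python
import tensorflow as tf
from transformers import TFXLMRobertaModel

model = TFXLMRobertaModel.from_pretrained("xlm-roberta-base")
classifier = tf.keras.layers.Dense(567, activation="sigmoid")   # level-3 labels
optimizer = tf.keras.optimizers.Adam(1e-4)   # placeholder; learning rates are grid-searched

# BITFIT keeps only the encoder's bias terms trainable; LNFIT would instead
# keep the layer normalization variables (names containing "LayerNorm").
bitfit_vars = [v for v in model.trainable_variables if v.name.endswith("bias:0")]

def train_step(input_ids, attention_mask, labels):
    with tf.GradientTape() as tape:
        hidden = model(input_ids, attention_mask=attention_mask).last_hidden_state
        probs = classifier(hidden[:, 0, :])                      # [cls]-like pooling
        loss = tf.reduce_mean(
            tf.keras.losses.binary_crossentropy(labels, probs))
    # The classification layer is always fine-tuned, in addition to the biases.
    train_vars = bitfit_vars + classifier.trainable_variables
    grads = tape.gradient(loss, train_vars)
    optimizer.apply_gradients(zip(grads, train_vars))
```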

Configuration of Models and Training Details:
We implemented all methods in TENSORFLOW 2, obtaining pretrained models from the Hugging Face library. We release our code and data for reproducibility.9 All models follow the BASE configuration with 12 stacked TRANSFORMER encoder blocks, each with D_h = 768 and 12 attention heads. We use the Adam optimizer (Kingma and Ba, 2015) across all experiments. We grid-search the learning rate per method, considering classification performance on development data.10

Evaluation: Given the large number and skewed distribution of labels, retrieval measures have been favored in the large-scale multi-label text classification literature (Mullenbach et al., 2018; Chalkidis et al., 2019). Following Chalkidis et al. (2019, 2020a), we report mean R-Precision (mRP) (Manning et al., 2009). That is, for each document, the model ranks the labels it selects by decreasing confidence, and we compute Precision@k, where k is the document's number of gold labels; we then average over documents. For all experiments, we use the chronological data split and report the average across three runs. Unless stated otherwise, we use level 3 with L = 567 labels (Table 2), which has a highly skewed (long-tail) label distribution and temporal concept drift (Table 3). In Section 6.2, we also consider label sets from the other levels.

9 Our code is available on GitHub (https://github.com/nlpaueb/multi-eurlex).
10 See Appendix A for details on hyper-parameter tuning.
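For clarity, here is a minimal NumPy sketch of the mRP computation described above (our own illustrative code; `y_true` is a binary gold-label matrix and `y_scores` the model's confidence scores, both of shape documents × labels):

```python
import numpy as np

def mean_r_precision(y_true, y_scores):
    """Per document, take the top-k labels by confidence, where k is the
    number of gold labels, compute Precision@k, and average over documents."""
    r_precisions = []
    for gold, scores in zip(y_true, y_scores):
        k = int(gold.sum())
        if k == 0:
            continue                      # skip documents without gold labels
        top_k = np.argsort(scores)[::-1][:k]
        r_precisions.append(gold[top_k].sum() / k)
    return float(np.mean(r_precisions))
```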

Experiments and Discussion
In the main experiments, we use XLM-ROBERTA mainly in a one-to-many setting (fine-tuning in English, testing in all languages). We also report key MT5 results for completeness. As a ceiling for cross-lingual transfer, in Section 6.1 we first evaluate monolingual (native) BERT models and XLM-ROBERTA, both in a one-to-one manner (fine-tuning and testing in the same language), which requires annotated training data in the target language. Finally, in Section 6.3, we also report many-to-many results, where XLM-ROBERTA is jointly fine-tuned and tested in all languages.

Monolingual Classification (one-to-one)
Table 5 (top) shows that in the one-to-one setting, XLM-ROBERTA is competitive with native (monolingually pretrained) BERTs, with a minor decrease of 0.7 mRP on average across languages. Of course, the one-to-one setting requires training data in the target language. We report these results as an upper bound for zero-shot cross-lingual transfer. Also, the native BERTs are pretrained on corpora of different sizes and quality, which explains why they are not consistently better than XLM-ROBERTA.
Cross-lingual Transfer (one-to-many)

XLM-ROBERTA adaptation: In the one-to-many setting, where we fine-tune in English and test in all languages, Table 5 (middle) shows that all adaptation strategies vastly improve the performance of XLM-ROBERTA across languages (up to a 6.8 point increase in All mRP), compared to no adaptation (end-to-end fine-tuning), while remaining competitive in English (source). This indicates that not fine-tuning the full set of parameters helps the model retain more of its multilingual knowledge obtained during pretraining. We observe no big difference among the block-freezing strategies for N = 3, 6, 9, but performance deteriorates substantially when all blocks are frozen (N = 12).11 We speculate there is a trade-off between freezing more blocks to retain multilingual knowledge and freezing fewer blocks to benefit end-task (classification) performance. Adapters consistently lead to the best results in all languages and overall (All mRP 56.1), with practically no decrease in English (source) performance (67.3). BITFIT, which only fine-tunes the bias terms (4e-2% of the parameters), and LNFIT, which fine-tunes even fewer parameters (1e-2%), are the second and third best strategies. These results highlight the expressive power of the few parameters BITFIT and LNFIT modify; this observation has also been discussed in previous studies (Frankle et al., 2021; Zaken et al., 2021), but not in a multilingual setting. Overall, fine-tuning in a single language leads to substantial forgetting of multilingual knowledge, but adaptation strategies, especially adapter modules, alleviate this problem and improve cross-lingual end-task performance.

Figure 3: Test results (mRP, %) for Level 3 (567 labels) with XLM-ROBERTA, when fine-tuning in one language (source, rows) and testing in all languages (columns), without adaptation (end-to-end, left) and with adapter modules (right). The languages are grouped (framed) into language families (Germanic, Romance, Slavic, Uralic).

Table 6: Test results of MT5 fine-tuned in English (en). We show mRP (%) in English (Src), and averaged across all 23 languages (All). We also report the number of trainable parameters (excl. the classification layer).
MT5 adaptation: Table 6 reports the corresponding results for MT5. Here too, freezing all layers (N = 12) harms performance. Surprisingly, adapter modules, which are the best adaptation strategy for XLM-ROBERTA (Table 5, middle), lead to very poor performance (average mRP 44); there are similar results with LNFIT (average mRP 38.7). We speculate this happens because the encoder of MT5 needs to 're-program' itself during fine-tuning to perform as a stand-alone encoder; with adapters and LNFIT, this 're-programming' is only facilitated by very few parameters, and the model is 'forced' (due to its low adaptable capacity) to discard multilingual knowledge aggressively. XLM-ROBERTA follows the opposite pattern (fewer parameters lead to better cross-lingual transfer), because it is pretrained as a stand-alone encoder. We leave a more thorough investigation of the trade-off between the number of trainable parameters vs. end-task (Src/All) performance for future work.
Different source languages: In the cross-lingual experiments so far, we fine-tuned the model in English (source) and evaluated it in all 23 languages. In Fig. 3, we repeat these experiments using a different source language in each repetition (rows), evaluating again in all languages (columns).13 We use XLM-ROBERTA without adaptation (end-to-end, left) or with adapter modules (right). Despite the dominance of English in the multilingual NLP literature, we observe that using alternative source languages (e.g., Romanian or French) leads to better target results. Similar results have been presented by Turc et al. (2021) for other NLP tasks. As in the previous one-to-many experiments with XLM-ROBERTA (Table 5, middle), adapters vastly improve cross-lingual transfer across all cases (e.g., English-en to Danish-da improves from 57 to 62 mRP), with occasionally slightly lower monolingual performance (e.g., German-de drops from 68 to 67). Cross-lingual transfer performs overall better when the source and target languages are in the same family (frames of Fig. 3), especially for Romance languages (Fig. 4, diagonal).13 Also, when using adapters, cross-lingual performance often drops less abruptly when moving outside the family of the source language. For example, when fine-tuning in Danish-da, if the test set changes from Swedish-sv to Spanish-es, performance drops from 58 to 50 without adapters (Fig. 3, left), but the change is smoother, from 62 to 59, with adapters (Fig. 3, right). This is better illustrated in the right part of Fig. 4 (smoother changes across cells per row). These results confirm that adapter modules help retain more multilingual knowledge.
Transfer from one family to another: In the previous experiment ("Different source languages"), we used a different source language in each repetition and evaluated in all languages. To better understand how linguistic proximity between families affects performance, in Figure 5 we present additional experiments in a many-to-many setting, where each model is trained on all languages of the same family (source) and evaluated on all languages. We again use XLM-ROBERTA without adaptation (end-to-end, left) or with adapter modules (right). We observe (Fig. 5, left) that cross-lingual transfer performs overall better when the source and target families are the same. Also, when using adapters (Fig. 5, right), cross-lingual performance drops less abruptly when moving to a family different from the one (source) whose languages were used for fine-tuning. As expected, the cross-lingual performance of these models (jointly fine-tuned on a language family) is substantially higher than that of the models trained in a one-to-one setting (Fig. 3-4), and closer to that of the models jointly fine-tuned in all 23 languages (many-to-many, results reported in the lower part of Table 5).

Removing digits and shared words: In an ablation study, during inference we remove digits and words that are shared across languages, to see to what extent label predictions depend on them. Initially, we eliminate digits, which constitute approx. 10% of the average document length measured in whitespace-separated tokens. Digits often participate in legal references (e.g., "established by Regulation No 1468/81") or other coding schemes that may hint at EUROVOC concepts (e.g., when specific laws are highly cited). Moreover, inspecting training documents, we observe that vocabulary words (e.g., of Latin origin) are shared to a substantial degree (23% on average) across languages; thus, as a second step, we remove approx. 25k words used more than 25 times in English documents, to break direct cross-lingual alignment. Table 7 shows that removing digits leads to a small decrease in one-to-one performance (-0.2) and a larger, though still small, decrease in one-to-many performance (-1.1). Eliminating shared words (present in the English vocabulary) leads to a further decrease (-3.5) in cross-lingual performance, and English performance of course plunges, as the remaining text is very short and severely corrupted.
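A rough sketch of this inference-time ablation follows (our own illustrative code; the tokenization is simple whitespace splitting and the 25-occurrence threshold is taken from the description above):

```python
import re
from collections import Counter

def frequent_english_words(english_docs, min_count=25):
    """Words used more than min_count times in the English training documents."""
    counts = Counter(word for doc in english_docs for word in doc.lower().split())
    return {word for word, count in counts.items() if count > min_count}

def ablate(text, shared_vocab=None):
    """Drop tokens containing digits; optionally also drop words that appear
    in the (English) shared vocabulary, to break direct cross-lingual alignment."""
    tokens = [t for t in text.split() if not re.search(r"\d", t)]
    if shared_vocab is not None:
        tokens = [t for t in tokens if t.lower() not in shared_vocab]
    return " ".join(tokens)
```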
Table 8: Test results of XLM-ROBERTA fine-tuned in English, for all adaptation strategies and different label granularities (EUROVOC levels, Table 2). We show mRP results (%) for English (Src) and averaged over all 23 languages (All). We also count the trainable parameters, excl. the classification layer, which remains the same.

Different label granularities: Table 8 shows XLM-ROBERTA results with labels from different EUROVOC levels (Table 2) for all adaptation strategies. As expected, performance deteriorates (approx. 5-10% per level) as the size of the label set increases. Nonetheless, we observe consistent improvements with adaptation strategies compared to full (end-to-end) fine-tuning for all label sets, with the exception of the fully (all 12 blocks) frozen model (last row). Adapters have the best overall performance, but the ranking and impact of the different adaptation strategies vary across levels. Specifically, as the size of the label set increases, the average (Table 8, second-to-last line) zero-shot (All) performance of the adaptation strategies: (a) improves compared to no adaptation (end-to-end fine-tuning), approx. +0.3 → +1.6 → +4.2 → +3.6, as we move from level 1 to the full (original) label set, with a small drop from level 3 to the full label set; and (b) deteriorates more aggressively relative to English (Src) performance, approx. -6.4 → -10.3 → -12.0 → -15.8. The latter (b) is due to the need to model increasingly finer concepts (labels), which complicates cross-lingual concept alignment and, hence, hurts transfer, leaving more scope for adaptation strategies to make a difference (a).

Multilingual Fine-tuning (many-to-many)
In the lower part of Table 5, we report results for XLM-ROBERTA, fine-tuned end-to-end or using adapters, when the model is jointly fine-tuned in all languages. In this case, for each epoch and batch we randomly select one of the available languages for each document; not all documents are available in all 23 languages (Table 1). Adapter modules again consistently improve performance. The many-to-many models largely outperform the one-to-many models (Table 5).
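A minimal sketch of this per-document language sampling (illustrative only; `doc["text"]` maps 2-letter ISO codes to the available translations, as in the data fields of Appendix D):

```python
import random

def sample_translation(doc):
    """Pick one of the document's available languages uniformly at random;
    not every document is available in all 23 languages."""
    lang = random.choice(sorted(doc["text"]))
    return lang, doc["text"][lang]
```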

Conclusions and Future Work
We introduced MULTI-EURLEX, a new multilingual legal topic classification dataset with 65k documents (EU laws) in 23 languages, where each document is annotated with multiple labels (concepts) from the EUROVOC taxonomy, with alternative label granularities. To the best of our knowledge, this is one of the most linguistically diverse classification datasets. We mainly used the dataset as a testbed for zero-shot cross-lingual transfer. Experimental results showed that fine-tuning a multilingually pretrained model (XLM-ROBERTA, MT5) in a single language leads to catastrophic forgetting of multilingual knowledge and, consequently, poor zero-shot transfer. We found that adaptation strategies, originally proposed to accelerate fine-tuning for end-tasks, help retain multilingual knowledge from pretraining, substantially improving zero-shot cross-lingual transfer. However, their impact depends on the size of the label set, i.e., the gains increase as the label set grows. Interestingly, even adaptation strategies (BITFIT, LNFIT) that fine-tune a very small fraction of the parameters (<0.05%) are competitive. Experimental results also showed that multilingual models are competitive with monolingual models in the one-to-one set-up, and that a single multilingual model jointly fine-tuned in all languages is also competitive. We also used MULTI-EURLEX to highlight the effect of temporal concept drift and the importance of chronological, instead of random, splits.
In future work, we would like to examine alternative cross-lingual adaptation strategies (Pfeiffer et al., 2020, 2021) and distributionally robust optimization techniques (Sagawa et al., 2020; Koh et al., 2021) to address the temporal concept drift.

Ethics Statement
The dataset contains publicly available EU laws that do not include personal or sensitive information, with the exception of trivial information presented by consent, e.g., the names of the active presidents of the European Parliament, European Council, or other official administration bodies. The collected data is licensed under the Creative Commons Attribution 4.0 International licence (https://eur-lex.europa.eu/content/…).

Table 9: Development results for different values of K in adapter modules. We show mRP results (%) on English development data (Src), and development mRP averaged over all 23 languages (All). We also report the number of trainable parameters (in millions).

A.2 Other Technical Details
Given the large length of the documents (450 tokens on average), presented in …

Table 11: Comparing MT5 variants. The first two variants use only the encoder; the latter two use both the encoder and the decoder. We show mRP results (%) on English development data (Src), and development mRP averaged over all 23 languages (All). We also report the number of trainable parameters (in millions), and training time in epochs (e) and hours (h).
In the generative variant, the decoder performs multiple timesteps, and at each one it is fed with the output generated so far (or the corresponding gold output up to the previous timestep during training). In effect, in its single timestep, the decoder of decode-cls iteratively (at each decoder block) performs cross-attention over the encoder's output, using an updated query (the [cls] representation). We pass the final representation of the decoder's [cls] to the same classification layer we use in the encoder-only variant of MT5 (Section 4). Both decode-cls and generative use 12 encoder and 12 decoder blocks (391M parameters).
first-pool, last-pool: Finally, we examine another encoder-only variant of MT5, last-pool, in addition to the encoder-only variant of Section 4, which we now call first-pool to highlight the difference between them. In last-pool, we use the encoder's top-level representation of the </s> special token of MT5, which is always at the end of the input, to represent the document. Since the position of </s> is not always the same, its representation is also affected by its positional embedding. By contrast, in first-pool the [cls] token is always first, hence its positional embedding does not vary. Table 11 reports results on development data. As expected, the generative version of MT5 performs terribly (mRP 2.5), as the model tries to learn an unnecessary label ordering; in fact, the model cannot learn and stops training after five epochs due to early stopping. By contrast, the decode-cls variant, which feeds the decoder only with the [cls] token and uses its output embedding, has performance comparable to the encoder-only variants (first-pool, last-pool). It uses, however, approximately 40% more parameters, because of the additional cross-attention layers in the decoder blocks.
Both encoder-only variants of MT5 are comparable with XLM-ROBERTA (English mRP approx. 73; the All mRP scores are also comparable or better). These results show that the encoder of MT5 can be used alone (without the decoder) for text classification, similarly to TRANSFORMER-based encoder-only models (Devlin et al., 2019; Liu et al., 2019), despite its text-to-text generative pretraining, unlike the generative fine-tuning proposed by the creators of MT5 (Xue et al., 2021).

C Monolingual BERT Models

Table 12: Monolingual (native) BERT models used. We also report the training corpora used to pretrain each model.

Table 12 lists all native BERT models used in the experiments of Section 6.1 in the one-to-one set-up. All models are hosted by Hugging Face (https://huggingface.co/models). All models follow the BASE configuration with 12 stacked TRANSFORMER layers, each with D_h = 768 hidden units and 12 attention heads. We use case-sensitive models, when available. We cannot guarantee the quality of the different models, as they come from different sources (organizations or individuals), although we tried to select the best possible options, i.e., those trained on more data for a longer period, when there were many alternatives. We found 16 monolingual models; we found no monolingual models for Bulgarian, Slovak, Croatian, Slovene, Lithuanian, Latvian, or Maltese.

Most monolingual BERT models use a vocabulary of approx. 30k sub-words and have approx. 110M parameters in total (24M for the embeddings and 86M for the TRANSFORMER blocks), while XLM-ROBERTA has a much larger vocabulary of 250k sub-words, to support 100 languages, and 278M parameters (192M for the embeddings and 86M for the TRANSFORMER blocks). Similarly, MT5 uses a vocabulary of equal size, thus its encoder has 86M parameters, while its decoder has 120M parameters; as in the work of Vaswani et al. (2017), the decoder TRANSFORMER blocks of MT5 have more parameters than the encoder blocks, as they use additional cross-attention layers. Based on the aforementioned details, the encoder's capacity is almost identical across the examined models.
Languages: The EU has 24 official languages. When new members join the EU, the set of official languages usually expands, unless the new languages are already included. MULTI-EURLEX covers 23 languages from seven language families (Germanic, Romance, Slavic, Uralic, Baltic, Semitic, Hellenic). EU laws are published in all official languages except Irish, for resource-related reasons.15 This wide coverage makes MULTI-EURLEX a valuable testbed for cross-lingual transfer. All languages use the Latin script, except for Bulgarian (Cyrillic script) and Greek. Several other languages are also spoken in EU countries. The EU is home to over 60 additional indigenous regional or minority languages, e.g., Basque, Catalan, Frisian, Saami, and Yiddish, spoken by approx. 40 million people, but these additional languages are not considered official at the EU level, and EU laws are not translated into them.
Annotation: All the documents of the dataset have been annotated by the Publications Office of the EU (https://publications.europa.eu/en) with multiple concepts from EUROVOC (http://eurovoc.europa.eu/). EUROVOC has eight levels of concepts. Each document is assigned one or more concepts (labels). If a document is assigned a concept, the ancestors and descendants of that concept are typically not assigned to the same document. The documents were originally annotated with concepts from levels 3 to 8. We augmented the annotation with three alternative sets of labels per document, replacing each assigned concept by its ancestor from level 1, 2, or 3, respectively. Thus, we provide four sets of gold labels per document, one for each of the first three levels of the hierarchy, plus the original sparse label assignment.16

Data Split and Concept Drift: MULTI-EURLEX is chronologically split into training (55k), development (5k), and test (5k) subsets, using the English documents. The test subset contains the same 5k documents in all 23 languages (Table 1).17 For the official languages of the seven oldest member countries, the same 55k training documents are available; for the other languages, only a subset of the 55k training documents is available (Table 1). Compared to EURLEX57K (Chalkidis et al., 2019), MULTI-EURLEX is not only larger (8k more documents) and multilingual; it is also more challenging, as the chronological split leads to temporal, real-world concept drift across the training, development, and test subsets, i.e., differences in label distribution and phrasing, representing a realistic temporal generalization problem (Lazaridou et al., 2021). Søgaard et al. (2021) showed that this setup is more realistic, as it does not over-estimate real performance, contrary to random splits (Gorman and Bedrick, 2019).
Supported Tasks: MULTI-EURLEX can be used for legal topic classification, a multi-label classification task where legal documents need to be assigned concepts reflecting their topics. MULTI-EURLEX supports labels from three different granularities (EUROVOC levels). More importantly, apart from monolingual (one-to-one) experiments, it can be used to study cross-lingual transfer scenarios, including one-to-many (systems trained in one language and used in other languages with no training data), and many-to-one or many-to-many (systems jointly trained in multiple languages and used in one or more other languages).

16 Levels 4 to 8 cannot be used independently, as many documents have gold concepts from the third level; thus many documents would be mislabeled if we discarded level 3.

17 The development subset also contains the same 5k documents in 23 languages, except Croatian. Croatia is the most recent EU member (2013); older laws are gradually translated.
Data Fields: The following data fields are provided for all documents of MULTI-EURLEX:

• 'celex_id': (str) The official ID of the document. The CELEX number is the unique identifier for all publications in both EUR-LEX and CELLAR, the EU Publications Office's common repository of metadata and content.
• 'text': (dict[str]) A dictionary with (key, value) pairs, where the key is the 2-letter ISO code of each language and the value is the content of each document in this language.
• 'eurovoc_concepts': (dict[List[str]]) A dictionary with (key, value) pairs, where the key is the label set (level 1-3) and the value is a list of the relevant EUROVOC concepts (labels).
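For illustration, a single record with the fields above could be consumed as follows (a hypothetical sketch; the concrete key names for the label sets and the loading mechanism are assumptions, see the released code for the supported access methods):

```python
# One MULTI-EURLEX record, following the data fields described above.
doc = {
    "celex_id": "…",                                    # official CELEX number
    "text": {"en": "…", "de": "…", "fr": "…"},           # 2-letter ISO code -> content
    "eurovoc_concepts": {"level_1": ["…"], "level_2": ["…"], "level_3": ["…"]},
}

english_text = doc["text"]["en"]                         # English version of the law
level3_labels = doc["eurovoc_concepts"]["level_3"]       # gold concepts at level 3
```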

D.2 Initial Data Collection and Normalization
The original data are available at the EUR-LEX portal (https://eur-lex.europa.eu) in unprocessed formats (HTML, XML, RDF). The documents were downloaded from the EUR-LEX portal in HTML. The relevant EUROVOC concepts were downloaded from the SPARQL endpoint of the Publications Office of the EU (http://publications.europa.eu/webapi/rdf/sparql). We stripped HTML markup to provide the documents in plain text format. We inferred the labels for EUROVOC levels 1-3 by backtracking the EUROVOC hierarchy branches, from the originally assigned labels to their ancestors in levels 1-3, respectively.
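As an illustration of this backtracking step, the sketch below maps an originally assigned concept to its ancestors at levels 1-3; `parent` and `level` are hypothetical lookup tables that would be built from the EUROVOC hierarchy (they are not part of the released data):

```python
def level_ancestors(concept, parent, level):
    """Walk up the EUROVOC hierarchy from an assigned concept (level 3-8)
    and collect its ancestors at levels 1-3.

    parent: dict mapping a concept id to its direct broader concept (or None)
    level:  dict mapping a concept id to its depth in the hierarchy (1-8)
    """
    ancestors = {}
    node = concept
    while node is not None:
        if level[node] <= 3:
            ancestors[level[node]] = node
        node = parent.get(node)
    return ancestors   # e.g. {3: <level-3 id>, 2: <level-2 id>, 1: <level-1 id>}
```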

D.3 Personal and Sensitive Information
The dataset contains publicly available EU laws that do not include personal or sensitive information, with the exception of trivial information presented by consent, e.g., the names of the current presidents of the European Parliament and European Council, and other administration bodies.

D.4 Licensing Information
We provide MULTI-EURLEX with the same licensing as the original EU data (CC-BY-4.0): The Commission's document reuse policy is based on Decision 2011/833/EU. Unless otherwise specified, you can re-use the legal documents published in EUR-LEX for commercial or non-commercial purposes.
The copyright for the editorial content of this website, the summaries of EU legislation and the consolidated texts, which is owned by the EU, is licensed under the Creative Commons Attribution 4.0 International licence. This means that you can re-use the content provided you acknowledge the source and indicate any changes you have made.

E More Detailed Results
For completeness, in Table 14 we present detailed results across all 23 languages for XLM-ROBERTA fine-tuned end-to-end or using the alternative adaptation strategies in the one-to-many setting with English as the source. We observe that (a) native BERT models have the best results in 12 out of 15 languages; (b) XLM-ROBERTA trained in a monolingual (one-to-one) setting has competitive results; and (c) fine-tuning with adapter modules leads to the best overall results in cross-lingual transfer and in the many-to-many setting. Tables 15-16 show the results when fine-tuning end-to-end or using adapters, considering each one of the 23 languages as a source language in a one-to-many setting. Table 18 shows XLM-ROBERTA results for all EUROVOC levels across all 23 languages.