Regulatory Compliance through Doc2Doc Information Retrieval: A case study in EU/UK legislation where text similarity has limitations



Abstract
Major scandals in corporate history have urged the need for regulatory compliance, where organizations need to ensure that their controls (processes) comply with relevant laws, regulations, and policies. However, keeping track of the constantly changing legislation is difficult, thus organizations are increasingly adopting Regulatory Technology (RegTech) to facilitate the process. To this end, we introduce regulatory information retrieval (REG-IR), an application of document-to-document information retrieval (DOC2DOC IR), where the query is an entire document, making the task more challenging than traditional IR, where the queries are short. Furthermore, we compile and release two datasets based on the relationships between EU directives and UK legislation. We experiment on these datasets using a typical two-step pipeline approach comprising a pre-fetcher and a neural re-ranker. Experimenting with various pre-fetchers, from BM25 to k nearest neighbors over representations from several BERT models, we show that fine-tuning a BERT model on an in-domain classification task produces the best representations for IR. We also show that neural re-rankers under-perform due to contradicting supervision, i.e., similar query-document pairs with opposite labels, and are thus biased towards the pre-fetcher's score. Interestingly, applying a date filter further improves performance, showcasing the importance of the time dimension.

Introduction
Major scandals in corporate history, from Enron to Tyco International, Olympus, and Tesco, have led to the emergence of stricter regulatory mandates and highlighted the need for regulatory compliance, where organizations need to ensure that they comply with relevant laws, regulations, and policies (Lin, 2016). However, keeping track of the constantly changing legislation (Figure 1) is hard, thus organizations are increasingly adopting Regulatory Technology (RegTech) to facilitate the process.
Typically, a compliance regimen includes three distinct but related types of measures: corrective, detective, and preventive (Sadiq and Governatori, 2015). Corrective measures are usually undertaken when new regulations are introduced, to update existing controls. Detective measures ensure "after-the-fact" compliance, i.e., following a procedure, a manual or automated check is carried out to ensure that every step of the procedure complied with the corresponding regulations. Finally, preventive measures ensure compliance "by design", i.e., during the creation of new controls. All types of measures include an underlying information retrieval (IR) task, where laws need to be retrieved given a control or vice versa. We identify two use cases:

1. Given a new law, retrieve all the controls of the organization affected by this law. The organization can then apply corrective measures to ensure compliance for these controls.

2. Given a control, retrieve all relevant laws the control should comply with. This is useful for ensuring compliance after a procedure has been carried out (detective measures) or when creating new controls (preventive measures).
Regulatory information retrieval (REG-IR), similarly to other applications of document-to-document (DOC2DOC) IR, is much more challenging than traditional IR, where the query typically contains a few informative words and the documents are relatively small (Table 1). In DOC2DOC IR the query is a long document (e.g., a regulation) containing thousands of words, most of which are uninformative. Consequently, matching the query with other long documents, where the informative words are also sparse, becomes extremely difficult. Although legislation is available, organizations' controls are strictly private and very hard to obtain. Fortunately, the European Union (EU) has a legislation scheme analogous to regulatory compliance for organizations. According to the Treaty on the Functioning of the European Union (TFEU, Articles 291(1) and 288, paragraph 3), all published EU directives must take effect at the national level. Thus, all EU member states must adopt a law to transpose a newly issued directive within the period set by the directive (typically 2 years). Notably, the United Kingdom (UK), having a high compliance level with the EU (Figure 2), is a good test-bed for REG-IR. Thus we compile and release two datasets for REG-IR, EU2UK and UK2EU, containing EU directives and UK regulations, which can serve both as queries and documents, under the ground truth assumption that a UK law is relevant to the EU directives it transposes and vice versa. (Data for Figures 1 and 2 were obtained from ec.europa.eu/internal_market/scoreboard/performance_by_governance_tool/eu_pilot.)

Dataset
[Table 1: Average query and document lengths of IR datasets in the literature, including TREC ROBUST (Voorhees, 2005), BIOASQ (Tsatsaronis et al., 2015), GOV2 (Clarke et al., 2004), and WT10G (Chiang et al., 2005), compared to our REG-IR datasets.]

Since REG-IR is a new task, our starting point is the two-step pipeline approach followed by most modern neural information retrieval systems (Guo et al., 2016; Hui et al., 2017; McDonald et al., 2018). First, a conventional IR system (pre-fetcher) retrieves the k most prominent documents. Then a neural model attempts to rank relevant documents higher than irrelevant ones. In most approaches, the pre-fetcher is based on Okapi BM25 (Robertson et al., 1995), a bag-of-words scoring function that does not consider possible synonyms or contextual information. To overcome the first limitation, we follow Brokos et al. (2016), who employed k nearest neighbors over tf-idf weighted centroids of word embeddings, without however improving the results, probably because the centroids are noisy, given the many uninformative words. Furthermore, we employ BERT (Devlin et al., 2019) to extract contextualized representations for queries and documents, but again the results are worse than BM25. We also experiment with S-BERT (Reimers and Gurevych, 2019).

Transpositions:
We have retrieved all transposition relations (approx. 3.7K) between EU directives and UK laws from the CELLAR database. CELLAR only provides the mapping between the CELLAR ids of EU directives and the titles of UK laws. Therefore, we aligned the CELLAR ids with the official UK ids based on the law title. One or more UK laws may transpose one or more EU directives.

Datasets compilation
Let E, U be the sets of EU directives and UK laws, respectively. We define REG-IR as the task where the query q is a document, e.g., an EU directive, and the objective is to retrieve a set of relevant documents, R_q, from the pool of all available documents, e.g., all UK laws. We create two datasets: EU2UK, where the queries are EU directives and the pool contains UK laws, and UK2EU, where the roles are reversed. Table 3 shows the statistics for the two datasets, which are split into three parts (train, development, and test), retaining a chronological order for the queries. EU2UK has a much larger pool of available documents than UK2EU (52.5K vs. 3.9K), which may impose an extra difficulty during retrieval. More importantly, the average number of relevant documents per query is small (at most 2) for both datasets, as our ground truth assumption is strict, i.e., relevant documents are those linked to the query with a transposition relation. Also, EU legislation is frequently amended (Figure 1), which also imposes difficulty in the retrieval task. Let d₁ ∈ E be a directive transposed by u₁ ∈ U and d₂ ∈ E be a directive amending d₁. The UK must adopt a law, u₂, to transpose d₂. Both d₂ and u₂ cover similar concepts to those of d₁ (d₂ is an amendment and u₂ must comply with d₂), but, strictly speaking, u₂ is relevant only to d₂. Table 2 shows an example from EU2UK, where the top-5 documents seem very similar to the query but are not considered relevant. Note that the documents ranked 1st, 3rd, and 5th are amendments of the relevant documents.

IR pipelines
Modern neural IR systems usually follow a two-step pipeline approach. First, a conventional IR system (pre-fetcher) retrieves the top-k most prominent documents, aiming to maximize its recall. Then a neural model attempts to re-rank the documents, scoring relevant ones higher than irrelevant ones. While this configuration is widely adopted in the literature, the re-ranking step can be omitted given a sufficiently effective pre-fetching mechanism, i.e., the pre-fetcher then acts as an end-to-end IR system.

Document pre-fetching
Okapi BM25 (Robertson et al., 1995) is a bag-of-words scoring function estimating the relevance of a document d to a query q, based on the query terms appearing in d, regardless of their proximity within d:

BM25(q, d) = Σ_i idf(q_i) · tf(q_i, d) · (k₁ + 1) / (tf(q_i, d) + k₁ · (1 − b + b · L/L̄))

where q_i is the i-th query term, idf(q_i) its inverse document frequency, and tf(q_i, d) its term frequency in d. L is the length of d in words, L̄ is the average length of the documents in the collection, k₁ is a parameter that favors high tf scores, and b is a parameter penalizing long documents.

W2V-CENT: Following Brokos et al. (2016), we represent query/document terms with pre-trained embeddings. For each query/document we calculate the tf-idf weighted centroid of its embeddings:

centroid(t) = Σ_i tf-idf(x_i) · e(x_i) / Σ_i tf-idf(x_i)

where t is a text (query or document), x_i is the i-th term of t, and e(x_i) its embedding. The documents are ranked, with respect to the query, by a k nearest neighbours (kNN) algorithm using cosine distance between the centroids.

BERT, similarly to W2V-CENT, relies on pre-trained representations, which are now extracted from BERT and are thus context-aware. A text can be represented by its [cls] token or by the centroid of its token embeddings. In the latter case, the embeddings can be extracted from any of the 12 layers of BERT. Note that the texts in our datasets do not fit entirely in BERT. We thus split them into c chunks (2 to 3 per text) and pass each chunk through BERT to obtain a list of token embeddings per layer (i.e., the concatenation of c token embedding lists) or c [cls] tokens. The final representation is either the centroid of the token embeddings or the centroid of the [cls] tokens.
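As a concrete illustration, the BM25 scoring function above can be sketched in a few lines of Python over a toy corpus (the function and variable names are our own, and a standard smoothed idf stands in for whatever idf variant a real retrieval library would use):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (token list) against a query over a toy corpus."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N  # L-bar in the formula
    tf = Counter(doc_terms)
    L = len(doc_terms)
    score = 0.0
    for q in set(query_terms):
        df = sum(1 for d in corpus if q in d)
        if df == 0:
            continue  # unseen terms contribute nothing
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # smoothed, always positive
        t = tf[q]
        score += idf * t * (k1 + 1) / (t + k1 * (1 - b + b * L / avg_len))
    return score

corpus = [
    ["waste", "water", "directive"],
    ["company", "tax", "law"],
    ["water", "quality", "law"],
]
both_terms = bm25_score(["water", "law"], corpus[2], corpus)  # doc matches both terms
one_term = bm25_score(["water", "law"], corpus[1], corpus)    # doc matches one term
```

As expected, the document matching both query terms scores higher than the one matching only one of them.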
S-BERT (Reimers and Gurevych, 2019) is a BERT model fine-tuned for NLI. According to the authors, training S-BERT for NLI results in better representations than BERT for tasks involving text comparison, like IR. We use the same setting as in BERT.
LEGAL-BERT: Our datasets come from the legal domain, which has distinct characteristics compared to generic corpora, such as specialized vocabulary, particularly formal syntax, semantics based on extensive domain-specific knowledge, etc., to the extent that legal language is often classified as a 'sublanguage' (Tiersma, 1999; Williams, 2007; Haigh, 2018). BERT and S-BERT were trained on generic corpora and may fail to capture the nuances of legal language. Thus we used a BERT model further pre-trained on EU legislation (Chalkidis et al., 2020), dubbed here LEGAL-BERT, in the same fashion as the previous BERT-based methods.
C-BERT: EU laws are annotated with EUROVOC concepts covering the core subjects of EU legislation (e.g., environment, trade, etc.). Our intuition is that a UK law transposing an EU directive will most probably cover the same subjects. Thus we expect that a BERT model, fine-tuned to predict EUROVOC concepts, will learn rich representations describing these concepts, which may be useful for pre-fetching. We fine-tune BERT following Chalkidis et al. (2019) and use the resulting model to extract query and document representations similarly to the previous BERT-based methods.
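The intuition behind the multi-label concept head used for C-BERT can be illustrated with a minimal NumPy sketch, where a random vector stands in for the pooled BERT representation and 10 concepts stand in for the full EUROVOC thesaurus (all names and dimensions here are illustrative, not the actual fine-tuning setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def concept_probs(doc_embedding, W, b):
    """Multi-label head: an independent sigmoid per EUROVOC concept."""
    logits = doc_embedding @ W + b
    return 1.0 / (1.0 + np.exp(-logits))

hidden, n_concepts = 768, 10            # 10 concepts stand in for the full thesaurus
W = rng.normal(scale=0.02, size=(hidden, n_concepts))
b = np.zeros(n_concepts)
doc_embedding = rng.normal(size=hidden)  # stand-in for a pooled BERT vector
probs = concept_probs(doc_embedding, W, b)
assigned = probs > 0.5                   # concepts predicted for the document
```

Training such a head with a binary cross-entropy loss per concept is what encourages the underlying encoder to produce concept-rich representations.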
ENSEMBLE is simply a combination of our best two pre-fetchers, C-BERT and BM25:

score(q, d) = α · CB(q, d) + (1 − α) · BM25(q, d)

where CB is the score of C-BERT, α is tuned on development data, and the scores of the pre-fetchers are normalized in [0, 1].
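A minimal sketch of the ensemble, assuming min-max normalization of the two score lists (the paper only states that scores are normalized to [0, 1], so min-max is one plausible choice; document ids and scores are illustrative):

```python
def minmax(scores):
    """Normalize a dict of scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def ensemble(bm25_scores, cbert_scores, alpha=0.5):
    # alpha is tuned on development data in the paper
    nb, nc = minmax(bm25_scores), minmax(cbert_scores)
    docs = set(nb) | set(nc)
    return {d: alpha * nc.get(d, 0.0) + (1 - alpha) * nb.get(d, 0.0) for d in docs}

bm25 = {"d1": 12.0, "d2": 3.0, "d3": 7.5}
cbert = {"d1": 0.2, "d2": 0.9, "d3": 0.4}
combined = ensemble(bm25, cbert, alpha=0.6)
ranking = sorted(combined, key=combined.get, reverse=True)
```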

Document re-ranking
Modern neural re-rankers operate on pairs of the form (q, d) to produce a relevance score, rel(q, d), for a document d with respect to a query q. Note, however, that the main objective is to rank relevant documents higher than irrelevant ones. Thus, during training the loss is calculated as:

L(q, d⁺, d⁻) = max(0, 1 − rel(q, d⁺) + rel(q, d⁻))

where d⁺ is a relevant document and d⁻ is an irrelevant one. We have experimented with several neural re-ranking methods, each producing a relevance score s_r for each of the top-k documents returned by the best pre-fetcher. The final relevance score of a document is calculated as rel(q, d) = w_r · s_r + w_p · s_p, where s_p is the normalized score of the pre-fetcher and w_r, w_p are learned during training. Given the concerns on the strictness of the ground truth assumption raised in Section 2.2, we hypothesize that re-rankers will eventually over-utilize the pre-fetcher score, s_p, when calculating document relevance, rel(q, d). As shown in Table 2, in many cases both relevant and irrelevant documents may have high similarity with the query. This in turn may confuse and therefore degenerate the re-ranker's term matching mechanism, i.e., MLPs or CNNs over term similarity matrices.

DRMM (Guo et al., 2016) uses pre-trained word embeddings to represent query and document terms. A histogram captures the cosine similarities of a query term, q_i, with all the terms of a particular document. An MLP then consumes the histograms to produce a document-aware score for each q_i, which is weighted by a gating mechanism assessing the importance of q_i. The sum of the weighted scores is the relevance score of the document. A caveat of DRMM is that it completely ignores the context of the terms, which could be of particular importance in our datasets, where texts are long.
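The pairwise hinge loss and the score combination above can be sketched as follows (the weights w_r, w_p are fixed here for illustration; in the actual models they are learned):

```python
def hinge_loss(rel_pos, rel_neg, margin=1.0):
    """Pairwise hinge: zero once the relevant document outscores the irrelevant one by the margin."""
    return max(0.0, margin - rel_pos + rel_neg)

def rel(s_r, s_p, w_r=0.5, w_p=0.5):
    """Final relevance: mix of re-ranker score s_r and pre-fetcher score s_p."""
    return w_r * s_r + w_p * s_p

# relevant document scored below an irrelevant one -> large loss
loss_bad = hinge_loss(rel(0.2, 0.3), rel(0.8, 0.9))
# relevant document scored well above the irrelevant one -> small loss
loss_ok = hinge_loss(rel(0.9, 0.9), rel(0.1, 0.1))
```

When similar pairs carry opposite labels, this loss pushes the model toward whatever signal separates them most reliably, which in our case turns out to be the pre-fetcher score s_p.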
PACRR (Hui et al., 2017) represents query and document terms with pre-trained embeddings and calculates a matrix S containing the cosine similarities of all query-document term pairs. A row-wise k-max pooling operation on S keeps the highest similarities per query term (matrix S_k). Then, wide convolutions of different kernel (filter) sizes (n × n), with multiple filters per size, are applied on S. Each filter of size n × n attempts to capture n-gram similarities between queries and documents. A max-pooling operation keeps the strongest signals across filters, and a row-wise k-max pooling keeps the strongest signals per query n-gram, resulting in the matrix S_{n,k}. Subsequently, a row-wise concatenation of S_k with all S_{n,k} matrices (for different values of n) is performed, and a column containing the softmax-normalized idf scores of the query terms is concatenated to the resulting matrix (S_sim). In effect, each row of the matrix contains different n-gram based similarity views of the corresponding query term, q_i, along with an idf-based importance score. The relevance score is produced as the last hidden state of an LSTM with one hidden unit, which consumes the rows of S_sim. PACRR tries to take into account the context of the query and document terms using n-grams, but this context sensitivity is weak, and we do not expect much benefit in our datasets, which contain long texts.
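The first two PACRR stages, the similarity matrix S and the row-wise k-max pooling producing S_k, can be sketched with NumPy (toy dimensions; random vectors stand in for real term embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)

def sim_matrix(Q, D):
    """Cosine similarities of all query-document term pairs: shape (|q|, |d|)."""
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    return Qn @ Dn.T

def row_kmax(S, k):
    """Keep the k highest similarities per query term, sorted descending."""
    return -np.sort(-S, axis=1)[:, :k]

Q = rng.normal(size=(5, 50))    # 5 query terms, toy 50-dim embeddings
D = rng.normal(size=(200, 50))  # 200 document terms
S = sim_matrix(Q, D)
S_k = row_kmax(S, k=3)
```

The convolutional n-gram views and the final LSTM scoring are omitted here; the point is only the shape of the intermediate signals the model pools over.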
BERT-based re-rankers: Recent work tries to exploit BERT to improve re-ranking. Following MacAvaney et al. (2019), we use DRMM and PACRR on top of contextualized embeddings derived from BERT. Based on the results of Figure 4, we use C-BERT as the most promising BERT model. We call these two models C-BERT-DRMM and C-BERT-PACRR. We also experiment with two settings, depending on whether the C-BERT weights are updated (tuned) or not (frozen) during training. As several methods rely on word embeddings, we trained a new WORD2VEC model (Mikolov et al., 2013) on both corpora (EU and UK legislation) to better accommodate legal language. Preliminary experiments showed that domain-specific embeddings perform better than generic 200-dimensional GloVe embeddings (Pennington et al., 2014) on development data (EU2UK: 66.5 vs. 59.3 R@100; UK2EU: 72.6 vs. 69.8 R@100). All BERT (pre-fetching) encoders and BERT-based re-rankers use the -BASE version, i.e., 12 layers, 768 hidden units, and 12 attention heads, similar to that of Devlin et al. (2019).

Pre-processing -document denoising
One of the major challenges in DOC2DOC IR, as opposed to traditional IR, is the length of the queries and documents, which may introduce noise (many uninformative words) during retrieval. Thus we applied several filters (stop-word, punctuation, and digit elimination) to both queries and documents, reducing their length by approx. 55% (to 778 words for UK laws and 1,222 words for EU directives, on average). Furthermore, we filtered both queries and documents by eliminating words with an idf score lower than the average idf score of the stop-words. Our intuition is that words with such a small idf score (e.g., regulation, EU, law) are uninformative. Still, the texts are much longer (387 words for UK laws and 631 words for EU directives, on average) than the queries used in traditional IR (Table 1). As an alternative way to drastically decrease the query size, we experimented with using only the title of a legislative act as a query, but the results were worse, i.e., approx. 5-20% lower R@100 on average across datasets, indicating that the full text is more informative, although the information is sparse. Hence, we only consider the full text, including the title, in the rest of the experiments.
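The idf-based denoising step can be sketched as follows (a fixed threshold stands in for the average stop-word idf used in the paper; corpus and words are illustrative):

```python
import math
from collections import Counter

def idf_table(corpus):
    """idf(t) = log(N / df(t)) over a corpus of token lists."""
    N = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(N / df[t]) for t in df}

def denoise(doc, idf, threshold):
    # drop words whose idf falls below the threshold; unseen words are kept
    return [t for t in doc if idf.get(t, threshold) >= threshold]

corpus = [
    ["regulation", "waste", "water"],
    ["regulation", "tax"],
    ["regulation", "water", "quality"],
]
idf = idf_table(corpus)
filtered = denoise(corpus[0], idf, threshold=0.5)
```

Here "regulation" appears in every document (idf = 0) and is dropped, while the rarer, more informative "waste" survives.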

Evaluation measures
Pre-fetching aims to bring all the relevant documents into the top-k, thus we report R@k. We observe that for k > 100 the best pre-fetchers show no significant gains in performance on development data, thus we select k = 100 as a reasonable threshold. For re-ranking we report R@20, nDCG@20, and R-Precision (RP), following the literature (Manning et al., 2009). For neural re-rankers, we report the average and standard deviation across three runs, using the best set of hyper-parameters on development data.
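The two recall-oriented measures can be computed directly from a ranking (toy ids below; R-Precision is precision at rank R, where R is the number of relevant documents):

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top-k of the ranking."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def r_precision(ranked, relevant):
    """Precision at rank R, where R = number of relevant documents."""
    R = len(relevant)
    return len(set(ranked[:R]) & relevant) / R

ranked = ["d3", "d1", "d9", "d2", "d7"]
relevant = {"d1", "d2"}
r_at_4 = recall_at_k(ranked, relevant, 4)  # both relevant docs are in the top-4
rp = r_precision(ranked, relevant)         # only one of the top-2 is relevant
```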

Tuning BM25: The case of DOC2DOC IR
The effectiveness of BM25 is highly dependent on properly selecting the values of k1 and b. In traditional (ad-hoc) IR, k1 is typically evaluated in the range [0, 3] (usually k1 ∈ [0.5, 2.0]), and b needs to be in [0, 1] (usually b ∈ [0.3, 0.9]) (Taylor et al., 2006; Trotman et al., 2014; Lipani et al., 2015). As a general rule of thumb, BM25 with k1 = 1.2 and b = 0.75 seems to give good results in most cases (Trotman et al., 2014). We observe that in DOC2DOC IR, where the queries are much longer, the optimal values are outside the proposed ranges (Figure 3). In both datasets the optimal values for k1 and b are relatively high, favoring terms with high tf while penalizing long documents. In effect, BM25 uses k1 and b as a denoising regularizer, emphasizing highly frequent query terms normalized by document length.
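The effect of k1 and b on a single term's weight can be illustrated by factoring idf out of the BM25 formula (the "doc2doc" values below are illustrative high settings outside the usual ranges, not the exact optima from Figure 3):

```python
def bm25_term_weight(tf, L, avg_len, k1, b):
    """Per-term BM25 weight (idf factored out): k1 controls tf saturation, b length penalty."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * L / avg_len))

# a frequent term (tf = 20) in a long document (twice the average length)
default = bm25_term_weight(20, 2000, 1000, k1=1.2, b=0.75)  # rule-of-thumb values
doc2doc = bm25_term_weight(20, 2000, 1000, k1=6.0, b=1.0)   # illustrative high values
```

Even with full length normalization (b = 1.0), the higher k1 lets high-tf terms dominate, which matches the behavior we observe on our datasets.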

Extracting representations from BERT
Recently there has been a lot of research on understanding the effectiveness of BERT's different layers (Liu et al., 2019). Figure 4 shows heat bars comparing representations extracted from different layers of the various BERT-based pre-fetchers we experimented with. LEGAL-BERT and C-BERT, which have been adapted to the legal domain, perform much better than BERT and S-BERT, which were trained on generic corpora. An interesting observation is that the [cls] token is a powerful representation only in C-BERT, where it was trained to predict EUROVOC concepts. Also, in UK2EU the embedding layer produces the best representations in all BERT variants except C-BERT, where the embedding layer achieves results comparable to the top-2 representations ([cls], Layer-12). This is an indication that context is not as important in this dataset as in EU2UK.
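Extracting a layer-wise centroid representation from chunked hidden states can be sketched as follows (random arrays with toy dimensions stand in for actual BERT hidden states, which for BERT-BASE would be 13 layers, up to 512 tokens per chunk, and 768 dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)

def layer_centroid(chunk_states, layer):
    """chunk_states: one array per chunk, shaped (n_layers + 1, chunk_len, hidden),
    where index 0 is the embedding layer. Concatenate the chosen layer's token
    embeddings across chunks and average them into one document vector."""
    tokens = np.concatenate([h[layer] for h in chunk_states], axis=0)
    return tokens.mean(axis=0)

# a long document split into two chunks; toy sizes instead of (13, 512, 768)
chunks = [rng.normal(size=(13, 8, 64)), rng.normal(size=(13, 5, 64))]
layer12_repr = layer_centroid(chunks, layer=12)    # centroid over layer-12 tokens
embedding_repr = layer_centroid(chunks, layer=0)   # centroid over the embedding layer
```

Sweeping the `layer` argument over 0-12 yields the per-layer comparison shown in Figure 4.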

Implementation details
All neural models were implemented using the TensorFlow 2 framework. Hyper-parameters were tuned on development data, using early stopping and the Adam optimizer (Kingma and Ba, 2015). Recall that a text can be represented by its [cls] token or by the centroid of its token embeddings, which can be extracted from any of the 12 layers of BERT.

Experimental results
Pre-fetching: Table 4 shows R@100 on the test datasets for the various pre-fetchers considered.
On EU2UK, C-BERT is the best method by a large margin, followed by S-BERT and LEGAL-BERT, verifying our assumption that the concept classification task is a good proxy for obtaining rich representations with respect to IR. Both S-BERT and LEGAL-BERT are better than BERT, for different reasons. LEGAL-BERT was adapted to the legal domain and is, therefore, able to capture the nuances of legal language. S-BERT was trained to produce representations suitable for comparing texts with cosine similarity, a task highly related to IR. Nonetheless, having been trained on generic corpora with small texts, it performs much worse than C-BERT. Interestingly, BM25 is comparable to both S-BERT and LEGAL-BERT, despite its simplicity. As expected, combining C-BERT with BM25 further improves the results. In UK2EU, R@100 is much higher compared to EU2UK, probably because of the shorter queries. Also, as discussed in Section 4.5, the contextual information is not so critical in this dataset, thus we expect the context-unaware BM25 and W2V-CENT to perform well. Indeed, BM25 achieves the best results, followed closely by C-BERT and LEGAL-BERT, while W2V-CENT outperforms S-BERT and BERT. Again, the ENSEMBLE improves the results.

[Table 5: Re-ranking results across test datasets. The upper zone shows the results of neural re-rankers on top of the best pre-fetchers with respect to (w_r, w_p). It also reports re-ranking results of the best pre-fetchers. The lower zone reports the re-ranking results after applying temporal filtering.]

Re-ranking: Table 5 shows the ranking results on test data for EU2UK and UK2EU. We also report results for BM25, C-BERT, ENSEMBLE, and an ORACLE, which re-ranks the top-k documents returned by the pre-fetcher, placing all relevant documents at the top. On EU2UK, ENSEMBLE performs better than the other two pre-fetchers. Interestingly, neural re-rankers fall short of improving performance and are comparable (or even identical) to ENSEMBLE in most cases, possibly because very similar documents may or may not be relevant (Section 2.2, Table 2), leading to contradicting supervision, i.e., similar training query-document pairs with opposite labels. As we hypothesized (Section 3.2), re-rankers over-utilize the pre-fetcher score when calculating document relevance, as a defense mechanism (bias) against contradicting supervision, which eventually leads to the degeneration of the re-ranker's term matching mechanism. Inspecting the corresponding weights of the models, we observe that indeed w_p >> w_r across all methods. This effect seems more intense in BERT-based re-rankers (C-BERT + DRMM or PACRR), especially those that fine-tune C-BERT, possibly because these models perform term matching over sub-word units instead of full words. In other words, relying on the neural relevance score (s_r) is catastrophic. Similar observations can be made for UK2EU. In both datasets all methods have a large performance gap compared to the ORACLE, indicating that there is still large room for improvement, possibly by utilizing information beyond text.

Filtering by year: We have already highlighted the difficulties imposed on our datasets by the frequently amended EU directives (Section 2.2, Table 2). Also, recall that each EU directive defines a deadline (typically 2 years) for the transposition to take place.
On the other hand, as we observe in Figure 5, EU directives may already be transposed by earlier legislative acts of member states (the member states act in a proactive manner), or the member states may delay the transposition for political reasons. In effect, the relevance of a document to a query depends both on the textual content and on the time the laws were published. Thus, we filter out documents that are outside a predefined distance (in years) from the query, in two ways: pre-filtering and post-filtering. Pre-filtering is applied to the pre-fetcher, i.e., prior to re-ranking, while post-filtering is applied after re-ranking. Note that our main goal is to improve re-ranking. We thus apply the filtering scheme to the ENSEMBLE, DRMM, and PACRR. The lower zone of Table 5 shows the results of the whole process. In EU2UK, the harder of the two datasets, time filtering has a positive impact, improving the results by a large margin. On the other hand, filtering seems to have a minor effect in UK2EU.
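A minimal sketch of the date filter, assuming a hypothetical symmetric window around the query's publication year (the actual distance and its direction are design choices; years and ids below are illustrative):

```python
def date_filter(ranked, pub_year, query_year, max_dist=2):
    """Keep only documents published within max_dist years of the query.
    A symmetric window accommodates both proactive and delayed transpositions."""
    return [d for d in ranked if abs(pub_year[d] - query_year) <= max_dist]

ranked = ["d1", "d2", "d3", "d4"]
pub_year = {"d1": 1994, "d2": 2004, "d3": 2005, "d4": 2010}
kept = date_filter(ranked, pub_year, query_year=2004, max_dist=2)
```

Applied before re-ranking this shrinks the candidate pool (pre-filtering); applied to the re-ranked list it prunes the final output (post-filtering).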

EU2UK ≠ UK2EU
Across experiments, we observe that best practices vary between the EU2UK and UK2EU datasets. EU2UK benefits from C-BERT representations, while in UK2EU the context-unaware and domain-agnostic BM25 has comparable or better performance than C-BERT. Similarly, we observe that time filtering further improves performance in EU2UK, while it has only a minor effect in UK2EU. Given the overall results, we conclude that the two datasets have quite different characteristics. Thus, it is important to consider EU2UK and UK2EU independently, although one may initially consider them to be symmetric.
Related work

IR in the legal domain is widely connected with the Competition on Legal Information Extraction/Entailment (COLIEE). From 2015 to 2017 (Kim et al., 2015, 2016; Kano et al., 2017), the task was to retrieve Japanese Civil Code articles given a question, while in COLIEE 2018 and 2019 (Kano et al., 2018; Rabelo et al., 2019), the task was to retrieve supporting cases given a short description of an unseen case. However, the texts of these competitions are small compared to our datasets. Also, most submitted systems do not consider recent advances in IR, i.e., neural ranking models (Guo et al., 2016; Hui et al., 2017; McDonald et al., 2018; MacAvaney et al., 2019), which have recently managed to improve the rankings of conventional IR, or recently proposed end-to-end neural models (Fan et al., 2018; Khattab and Zaharia, 2020). Again, these end-to-end methods were applied to small texts. On the other hand, there has been some work trying to cope with larger queries, i.e., verbose or expanded queries (Paik and Oard, 2014; Gupta and Bendersky, 2015; Cummins, 2016). Nonetheless, the considered queries are at most 60 tokens long, contrary to our datasets, where, depending on the setting, the average query length is 1.8K or 2.6K tokens (Table 1). Neural methods greatly rely on text representations, thus Reimers and Gurevych (2019) proposed S-BERT, which is trained to compare texts for an NLI task and can thus be used to extract representations suitable for IR. In the same direction, Chang et al. (2020) experimented with several auxiliary tasks to extract better representations. However, the latter two methods have been evaluated on datasets with much smaller texts than the ones we consider.

Conclusions and future work
We proposed DOC2DOC IR, a new family of IR tasks where the query is an entire document, which is more challenging than traditional IR. This family of tasks is particularly useful in regulatory compliance, where organizations need to ensure that their controls comply with existing legislation. In the absence of publicly available DOC2DOC datasets, we compile and release two datasets, containing EU directives and UK laws transposing these directives. Experimenting with conventional (BM25) and neural pre-fetchers, we showed that a BERT model fine-tuned on an in-domain classification task, i.e., predicting EUROVOC concepts, is by far the best pre-fetcher on our datasets. We also showed that neural re-rankers fail to improve performance, as their term matching mechanisms degenerate and they over-utilize the pre-fetcher score. In the future, we would like to investigate alternatives for exploiting additional information that may be critical in the newly introduced tasks (EU2UK, UK2EU). In this direction, naively utilizing chronological information already leads to a vast performance improvement on the EU2UK dataset. One possible direction is to model the cross-document relations (e.g., amendments) using Graph Convolutional Networks (Kipf and Welling, 2016), while better modeling the dimension of time (i.e., the chronological difference between a query and a document) is also crucial. Furthermore, to better deal with long documents, we plan to investigate text summarization by employing a state-of-the-art neural summarizer, e.g., BART (Lewis et al., 2020), or sentence selection techniques, e.g., rationale extraction (Lei et al., 2016), to find the most important sections or sentences and create shorter and more informative versions of queries/documents.

A Dataset Compilation: Technical Details
In this section, we present the technical details associated with the compilation of both datasets described in the main paper. More specifically, we present the procedure for creating both corpora, as well as for modelling the transposition relations between EU and UK entries.

A.1 EU corpus
The compilation of the EU corpus is more straightforward than that of its UK counterpart, but involves some in-domain knowledge to filter unwanted legislation.
• We initially download the core metadata associated with each document in the EU corpus by utilizing the SPARQL endpoint of the CELLAR database.

A.2 UK corpus
Compiling the UK corpus is not as trivial, since the legislation.gov.uk API is not as evolved, and we therefore have to manually crawl large parts of the database to build our corpus.
• The collected UK laws from the legislation.gov.uk portal form the initial corpus, which includes approximately 100k documents.
• Similarly to our processing of the EU corpus, we only retain documents in specific legislation types (UK Public General Acts, UK Local Acts, UK Statutory Instruments and UK Ministerial Acts). We then eliminate laws that aim to align English legislation with the rest of the United Kingdom's, more specifically Scotland, Northern Ireland and Wales. The final UK corpus includes 52K UK entries.

A.3 EU2UK Transpositions
Transpositions are relations between entries in the EU and UK corpora which we use to define relevance for our retrieval tasks. Processing these relations is the most challenging aspect of compiling our datasets and involves several steps.
• We use the aforementioned SPARQL endpoint to retrieve the transpositions between EU directives and the corresponding UK regulations that implement them. We initially collect approximately 10k EU2UK pairs. In these pairs, the transposed EU law is referred to by its unique portal ID, but the transposing UK law is referred to by its title. This is the primary challenge in modelling the transposition relations, since mapping legislation titles to unique entries in our UK corpus is not trivial. We hypothesize that these relations are manually inserted in the database, and human errors therefore often make exact matches impossible. Apart from the matching difficulties, some of the pairs in the pool were inserted mistakenly and hence need to be filtered.
• We first filter the noisy pairs. Pairs are considered noisy either because they are duplicates or because they do not meet some manually set criteria. In turn, duplication can occur either because identical pairs are inserted more than once or because pairs in which the UK title is mildly paraphrased are erroneously considered different. Our pool is reduced to 8k pairs after resolving the former and to 7k pairs after also resolving the latter. We further reduce the pool size by filtering pairs in which the UK title refers to non-English legislation (Scotland, Northern Ireland, Wales, or Gibraltar), since non-English legislation usually has an almost identical counterpart within the purely English corpus (see, e.g., https://www.legislation.gov.uk/uksi/2017/407/contents and https://www.legislation.gov.uk/nisr/2017/81/contents), or in which the title does not contain certain keywords (e.g., Act, Regulation, Order, Rule). Documents that do not contain any of these keywords are not officially published in the legislation.gov.uk portal; most of these are official releases from national governmental bodies, e.g., Ministries.
For instance, the First Annual Report of the Inter-Departmental Ministerial Group on Human Trafficking is not part of the UK's national legislation.
• To resolve the matching challenge, we employ a matching scheme in which, for each pair, we gradually normalize the UK title until we find either a single match or multiple matches. In the latter case, we resolve the matches with heuristics. Our normalizations include lower-casing, leading and trailing phrase removal, punctuation elimination, date removal, and manually inserted substitutions.
• After reducing our pair pool and then implementing our matching scheme, we can, with high confidence, present 4k transposition pairs, which we use in our datasets.
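The progressive title normalization and matching described above can be sketched as follows (the normalization steps shown are a subset of the actual scheme, and the index entry and law id are illustrative, not real identifiers):

```python
import re
import string

def normalize_steps(title):
    """Yield progressively more aggressive normalizations of a UK law title."""
    t = title.lower()
    yield t
    t = t.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    yield t
    t = re.sub(r"\b(19|20)\d\d\b", "", t).strip()               # drop years
    yield re.sub(r"\s+", " ", t)

def match(title, index):
    """index: normalized title -> UK law id; stop at the first form that hits."""
    for form in normalize_steps(title):
        if form in index:
            return index[form]
    return None

index = {"the waste regulations": "uksi/2011/988"}  # illustrative entry
law_id = match("The Waste Regulations 2011", index)
```

Each pair is matched at the mildest normalization level that succeeds, which limits the risk of mapping a title to the wrong law.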

B BERT models
All BERT variants (BERT, S-BERT, LEGAL-BERT) are publicly available from Hugging Face:

• S-BERT: This is the original BERT fine-tuned on the NLI and STS-B datasets. Available at https://