Automatic Anonymization of Swiss Federal Supreme Court Rulings

Releasing court decisions to the public relies on proper anonymization to protect all involved parties, where necessary. The Swiss Federal Supreme Court relies on an existing system that combines different traditional computational methods with human experts. In this work, we enhance the existing anonymization software using a large dataset annotated with entities to be anonymized. We compared BERT-based models with models pre-trained on in-domain data. Our results show that using in-domain data to pre-train the models further improves the F1-score by more than 5% compared to existing models. Our work demonstrates that combining existing anonymization methods, such as regular expressions, with machine learning can further reduce manual labor and enhance automatic suggestions.


Introduction
The Swiss Federal Supreme Court (SFSC) is the highest judicial authority in Switzerland. It is the final arbiter in legal disputes and ensures the uniform application of federal law throughout the country. It consists of several divisions specialized in different areas of law, including civil, criminal, administrative, and social security matters (Glaser et al., 2021). The SFSC handles roughly 7K cases per year and publishes its rulings. In this process, personal information must be removed from the rulings to protect the involved parties. Traditionally, court rulings are anonymized by skilled experts. This task is highly complex because whether a word must be removed or anonymized depends on the context in which it appears. For example, "Zurich" needs to be removed if it is part of the name of the legal entity "Zurich Insurance Group", but not if it refers to the city. At the SFSC, experts are already supported in their work by an application called Anom2 (see Figure 1). Anom2 provides access to various methods and algorithms for finding and replacing text entities (e.g., with regular expressions). The aim of this work is to extend Anom2 with machine learning capabilities that suggest additional text elements to be anonymized. Our results show that this approach allows users to find more elements that require anonymization.

Related Work
To identify elements that might require anonymization, a process called Named Entity Recognition (NER) is employed. Traditionally, NER recognizes and categorizes text parts according to a set of semantic categories such as Location (LOC), Organization (ORG), or Person (PER) (Benikova et al., 2014). As these classes are not sufficient for the anonymization of court cases, Leitner et al. (2020) suggested expanding this list to seven coarse and 19 fine-grained classes, including entities such as Judge (RR) or Lawyer (AN). Using this dataset, Darji et al. (2023) finetuned GermanBERT (Chan et al., 2020), clearly outperforming a BiLSTM-CRF+ model. Similar approaches have been applied and tested in other languages, such as Romanian (Pais et al., 2021), Greek (Angelidis et al., 2018), Portuguese (Luz de Araujo et al., 2018), and multilingually (de Gibert et al., 2022; Niklaus et al., 2023a).
Domain-specific pretraining has flourished in the legal domain recently. Chalkidis et al. (2020) pretrained LegalBERT on EU and UK legislation, ECHR and US cases, and US contracts. Zheng et al. (2021) pretrained CaseHoldBERT on US case law, while Henderson et al. (2022) trained PoL-BERT on the 256 GB Pile of Law corpus. Niklaus and Giofré (2022) pretrained Longformer (Beltagy et al., 2020) models using the Replaced Token Detection (RTD) task on the Pile of Law. Hua et al. (2022) used RTD to pretrain Reformer (Kitaev et al., 2020) models on 6 GB of US case law. Finally, Niklaus et al. (2023b) released a large multilingual legal corpus and trained various legal models. We continue pretraining the German, French, and Italian models for 800K and 300K steps more for base and large models, respectively. Rasiah et al. (2023) pretrain models on Swiss legal data, termed Legal-Swiss-RoBERTa.
Document anonymization has a long tradition in the medical domain, where personal data need to be removed from documents. Initially, this task was handled using methods such as semantic lexicons (Ruch et al., 2000) or regular expressions to replace text occurrences. Recently, this has been expanded to include BERT-style models as well (Mao and Liu, 2019). In the legal domain, Glaser et al. (2021) worked on 1400 anonymized German rulings. Using already anonymized rulings, they trained different Recurrent Neural Networks (RNN) using BERT embeddings. Using this approach, they achieved a maximum of 68.9% precision and 79.1% recall. Garat and Wonsever (2022) performed similar work on 80K documents from Uruguayan courts.
Our work specifically tackles court decisions by the SFSC. We compare the generic cased mBERT model (Devlin et al., 2019) with models pre-trained on in-domain data (such as Legal-Swiss-RoBERTa-base (Rasiah et al., 2023)). We also investigate monolingual model performance in the three languages of the SFSC rulings: German, French, and Italian.
Much prior work used SFSC cases as data for their research because of wide availability in three languages, giving good coverage of the most important Swiss case law. Niklaus et al. (2021, 2022)

Dataset
We used 119156 Supreme Court rulings (77262 German, 40099 French, 6795 Italian) and split them into sentences using spaCy (Honnibal et al., 2020). We prepared the decisions for NER based on the manual labels from the paralegals who performed manual anonymizations. The histograms in Figure 2 illustrate the distribution of four key measures, namely Number of Tokens, Anonymized Tokens, Entities, and Anonymized Entities, in three languages: German (de), French (fr), and Italian (it). Different color schemes for each language enhance the visual interpretability of the plots. Measures concerning tokens and entities exhibit a long-tailed distribution, signifying a concentration of instances at the lower end of the value spectrum. Specifically, the distribution of Number of Tokens and Number of Entities is examined within a 10 to 100,000 range, capturing their broad spread. In contrast, anonymized tokens and entities are evaluated within a 1 to 10,000 range, reflecting their more constrained distribution.
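The preparation of NER training data from manual anonymization labels, mentioned at the start of this section, is not described in detail. As a minimal sketch only, assuming the manual labels are available as character spans and assuming a BIO tagging scheme with a single hypothetical label `ANON`, the conversion for one sentence could look as follows. The whitespace tokenization and the function name `bio_tags` are illustrative choices, not the pipeline used for the SFSC data.

```python
import re

def bio_tags(sentence, spans, label="ANON"):
    """Tag whitespace tokens with BIO labels.

    spans: list of (start, end) character offsets (end exclusive) that
    annotators marked for anonymization. The label name is hypothetical.
    """
    tokens, tags = [], []
    for m in re.finditer(r"\S+", sentence):
        tag = "O"
        for start, end in spans:
            # A token belongs to a span if the two character ranges overlap.
            if m.start() < end and start < m.end():
                # The first token touching the span gets B-, later ones I-.
                tag = ("B-" if m.start() <= start else "I-") + label
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags

tokens, tags = bio_tags("A. Muster wohnt in Bern.", [(0, 9)])
```

Here the annotated span covers the first two tokens, so only those receive non-`O` tags; the city name stays untagged because it was not marked by the annotator.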

Legal Pretraining
To improve the SFSC anonymization system, we pretrained legal-specific models on diverse legal text in German, French, and Italian.
(a) We warm-start (initialize) our models from the original XLM-R checkpoints (base or large) of Conneau and Lample (2019). Model recycling is a standard process followed by many (Wei et al., 2021; Ouyang et al., 2022) to benefit from starting from an available "well-trained" PLM, rather than from scratch (random). XLM-R was trained on 2.

Anonymization System
The SFSC employs an anonymization system, Anom2, to assist paralegals in anonymizing rulings for public access. The main UI is shown in Figure 1. Upon loading a ruling, the application automatically identifies terms requiring anonymization and lists them on the left, along with replacement text. The search function allows terms to be marked directly for anonymization. Anom2 uses different algorithms to search for text that needs to be anonymized:
Conventional is based on a statistical analysis of the loaded ruling. Using polyglot, an initial set of named entities is detected. Using specific knowledge of the format, the rubrum is dynamically detected, allowing for the labelling of important names and addresses.
BERT performs the recognition of entities to be anonymized using a BERT (Devlin et al., 2019) model fine-tuned for NER. Entity recognition is performed at the sentence level, as the rulings are often too long for the model. This approach can lead to inconsistencies in recognition, as a term identified in one sentence might not be identified in another. This is solved in post-processing, where any identified term is automatically anonymized in the whole document.
Legal-Swiss-RoBERTa-base works analogously to the BERT method, but uses a fine-tuned Legal-Swiss-RoBERTa-base (Rasiah et al., 2023) model.
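The post-processing step described above, in which a term found in one sentence is anonymized throughout the document, can be sketched as follows. This is an illustrative sketch under stated assumptions, not Anom2's implementation: the function names, the whole-word matching rule, and the placeholder string are all hypothetical.

```python
import re

def find_all_occurrences(text, detected_terms):
    """Return sorted, non-overlapping (start, end, term) spans for every
    whole-word occurrence of each detected term in the document."""
    spans = []
    # Longer terms first, so a multi-word name wins over a shorter term
    # contained in it.
    for term in sorted(set(detected_terms), key=len, reverse=True):
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)")
        for m in pattern.finditer(text):
            overlaps = any(m.start() < e and s < m.end() for s, e, _ in spans)
            if not overlaps:
                spans.append((m.start(), m.end(), term))
    return sorted(spans)

def anonymize(text, spans, placeholder="A.________"):
    """Replace each span with a placeholder (hypothetical replacement text)."""
    out, cursor = [], 0
    for start, end, _ in spans:
        out.append(text[cursor:start])
        out.append(placeholder)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

ruling = "Herr Muster klagte. Die Vorinstanz wies Muster ab."
spans = find_all_occurrences(ruling, ["Muster"])  # term detected in one sentence
anonymized = anonymize(ruling, spans, placeholder="X.")
```

Because the propagation ignores context, a term that was correctly flagged in one sentence (for example as part of a company name) is also replaced where it is harmless (for example as a city name). This is one plausible reason why such propagation trades precision for recall.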

Experimental Setup
We used the following hyperparameters for all evaluated models: batch size of 64, learning rate of 5e-5, and weight decay of 0.01. We employed the seqeval metric for evaluation. We set the maximum sequence length to 192 tokens, which we determined to be the best trade-off between average sentence size and training time. We used early stopping based on the F1-score on the validation set, which constitutes 10% of the entire dataset, following an 80-10-10 split for the training, validation, and test sets, respectively. Training ceases once the F1-score on the validation set starts to decline. Due to resource constraints (we only had two Tesla T4 GPUs), we could only run one random seed per model. We define and configure two special parameters:
1) TruncationStrideRatio: We set this parameter to 0.5. When a sentence exceeds 192 tokens, we truncate it using a specific overlap strategy. The overlap consists of half of the previous snippet and half of the next snippet.
2) NonAnonymizedSentencesRatioToAnonymizedSentences: We set this ratio to 1.5, i.e., for every sentence that contains an anonymization example we include at most 1.5 sentences that contain none. This minimizes data redundancy and maximizes utility.
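The two special parameters can be illustrated with a short sketch. It rests on assumptions: the overlap strategy is read here as consecutive windows that share half of their tokens, and the non-anonymized sentences are assumed to be down-sampled at random. The function names and the seed are hypothetical.

```python
import random

def sliding_windows(tokens, max_len=192, stride_ratio=0.5):
    """Split a token list into windows of at most max_len tokens.
    Consecutive windows overlap by stride_ratio * max_len tokens."""
    if len(tokens) <= max_len:
        return [tokens]
    step = max(1, int(max_len * (1 - stride_ratio)))
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return windows

def downsample_negatives(sentences, ratio=1.5, seed=0):
    """sentences: list of (tokens, tags) pairs. Keep every sentence with at
    least one non-'O' tag and at most ratio times as many without any."""
    positives = [s for s in sentences if any(t != "O" for t in s[1])]
    negatives = [s for s in sentences if all(t == "O" for t in s[1])]
    keep = min(len(negatives), int(ratio * len(positives)))
    rng = random.Random(seed)  # fixed seed only for reproducibility
    return positives + rng.sample(negatives, keep)

windows = sliding_windows(list(range(400)))  # 400 dummy token ids
```

With 400 tokens, a window length of 192, and a ratio of 0.5, the windows start every 96 tokens, so each token away from the edges is seen in two windows.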

Results
Table 1 presents an evaluation of various BERT- and RoBERTa-based models under two conditions: Normal and Uniformizing. For the Normal condition, in the multilingual setting, Legal-XLM-RoBERTa-base exhibits the highest Precision at 94.84%, while Legal-Swiss-RoBERTa-base achieves the highest Recall and F1-Score, at 92.57% and 92.42% respectively. Uniformizing denotes the process of forcing the system to replace all occurrences of a detected term across the whole document. This approach leads to better Recall, but reduces Precision. In the Uniformized case, Legal-XLM-RoBERTa-base again shows the highest Precision, while mBERT achieves the highest Recall and F1-Score. The improved Recall and F1-Score in the Normal condition show that pre-training on legal data can improve the performance of models. We observe similar behavior for the monolingual models. All models pre-trained on legal data achieve a higher F1-Score than generic monolingual models.
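To make the reported Precision, Recall, and F1-Score concrete, the following sketch computes entity-level scores from BIO tag sequences, where a predicted entity counts as correct only if its type and boundaries both match the gold annotation. It is a simplified stand-in written for illustration, not the seqeval code used in the experiments, and the tag names are hypothetical.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence (end exclusive)."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                entities.append((etype, start, i))
            # Any non-'O' tag that does not continue an entity opens a new one.
            start, etype = (i, tag[2:]) if tag != "O" else (None, None)
    return entities

def precision_recall_f1(gold_tags, pred_tags):
    gold = set(extract_entities(gold_tags))
    pred = set(extract_entities(pred_tags))
    tp = len(gold & pred)  # exact matches of type and boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = ["B-PER", "I-PER", "O", "B-ORG", "O"]
pred = ["B-PER", "I-PER", "O", "O", "B-ORG"]
scores = precision_recall_f1(gold, pred)
```

In this toy case the person entity is matched exactly while the organization is predicted at the wrong position, so one of two predictions and one of two gold entities count as correct.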

Discussion
We pretrained models on Swiss legal data and performed a detailed comparison of legal and generic models, both multilingually and monolingually, on the ruling anonymization task. Our experiments indicate that pretraining on legal data improves the performance of models significantly compared to generic multi- or monolingual models.
To reduce errors in sentence splitting, we suggest that future work use legal-specific sentence splitters (Brugger et al., 2023). Due to computational constraints, we only experimented with base-size encoder models. Future work may expand this by also testing larger models.

Figure 1: Main window of Anom2. Anonymizations are configured on the left, and the anonymized court ruling appears on the right. The system highlights completed anonymizations in gold and the current setting in yellow.
introduced and studied judgment prediction on SFSC rulings. Brugger et al. (2023) investigated and improved multilingual sentence boundary detection in the legal domain using SFSC decisions. Christen et al. (2023) studied negation scope resolution, and Nyffenegger et al. (2023) investigated how easily LLMs can re-identify persons occurring in anonymized SFSC decisions. Rasiah et al. (2023) created a large benchmark of ten text classification tasks, two text generation tasks, an information retrieval task, and a citation extraction task.

Figure 2: Histograms illustrating the distribution of (anonymized) tokens and entities across the three languages.
5 TB of cleaned CommonCrawl data in 100 languages.
(b) We train a new tokenizer of 32K BPEs on the training subsets to better cover legal language. However, we reuse the original XLM-R embeddings for all lexically overlapping tokens (Pfeiffer et al., 2021), i.e., we warm-start word embeddings for tokens that already exist in the original XLM-R vocabulary, and use random ones for the rest.
(c) We continue pretraining our monolingual models on our pretraining corpus with batches of 512 samples for an additional 1M/500K steps for the base/large model. We do initial warm-up steps for the first 5% of the total training steps with a linearly increasing learning rate up to 1e−4, and then follow a cosine decay scheduling, following recent trends. For half of the warm-up phase (2.5%), the Transformer encoder is frozen, and only the embeddings, shared between input and output (MLM), are updated. We also use an increased 20/30% masking rate for base/large models respectively, where also 100% of the predictions are based on masked tokens, compared to Devlin et al. (2019), based on the findings of Wettig et al. (2023).
(d) We consider mixed cased models, i.e., both upper- and lowercase letters covered, similar to recently developed large PLMs (Conneau and Lample, 2019; Raffel et al., 2020; Brown et al., 2020).
(e) This leaves us with two models for each language (base and large). Additionally, we consider the multilingual legal models pretrained by Niklaus et al. (2023b) and the Swiss legal models pretrained by Rasiah et al. (2023).
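Steps (b) and (c) can be illustrated with a small, framework-free sketch. It shows the embedding warm-start for lexically overlapping tokens and the learning-rate schedule with linear warm-up, cosine decay, and an initially frozen encoder. Several details are assumptions made for illustration: the decay is taken to end at zero, random embeddings are drawn from a normal distribution with standard deviation 0.02, and all function names are hypothetical.

```python
import math
import random

def warm_start_embeddings(new_vocab, old_vocab, old_embeddings, dim, seed=0):
    """Copy the old vector for tokens present in the old vocabulary,
    and draw a random vector for all other tokens.

    new_vocab: list of tokens; old_vocab: dict token -> row index."""
    rng = random.Random(seed)
    rows = []
    for token in new_vocab:
        if token in old_vocab:
            rows.append(list(old_embeddings[old_vocab[token]]))
        else:
            # Assumed initialization; the paper only says "random".
            rows.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return rows

def learning_rate(step, total_steps, peak_lr=1e-4, warmup_frac=0.05):
    """Linear warm-up to peak_lr, then cosine decay (assumed to reach zero)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def encoder_frozen(step, total_steps, freeze_frac=0.025):
    """True during the first half of the warm-up, when only embeddings train."""
    return step < round(total_steps * freeze_frac)

total = 1_000_000  # additional steps for a base model
schedule = [(s, learning_rate(s, total), encoder_frozen(s, total))
            for s in (0, 25_000, 50_000, 500_000, total)]
```

For a base model with 1M additional steps, this reading gives 50K warm-up steps, of which the first 25K update only the embeddings.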

Table 1: Evaluation Results. Best results per setup are in bold.