Parameter-Efficient Legal Domain Adaptation

Seeking legal advice is often expensive. Recent advancements in machine learning for solving complex problems can be leveraged to help make legal services more accessible to the public. However, real-life applications encounter significant challenges. State-of-the-art language models are growing increasingly large, making parameter-efficient learning increasingly important. Unfortunately, parameter-efficient methods perform poorly with small amounts of data, which are common in the legal domain (where data labelling costs are high). To address these challenges, we propose parameter-efficient legal domain adaptation, which uses vast unsupervised legal data from public legal forums to perform legal pre-training. This method exceeds or matches the fewshot performance of existing models such as LEGAL-BERT on various legal tasks while tuning only approximately 0.1% of model parameters. Additionally, we show that our method can achieve calibration comparable to existing methods across several tasks. To the best of our knowledge, this work is among the first to explore parameter-efficient methods of tuning language models in the legal domain.


Introduction
Seeking legal advice from lawyers can be expensive. However, a machine learning system that can help answer legal questions could greatly aid laypersons in making informed legal decisions. Existing legal forums, such as Legal Advice Reddit and Law Stack Exchange, are valuable data sources for various legal tasks. On one hand, they provide good sources of labelled data, such as mapping legal questions to their areas of law (for classification), as shown in Figure 1. On the other hand, they contain hundreds of thousands of legal questions that can be leveraged for domain adaptation. Furthermore, questions on these forums can serve as a starting point for tasks that do not have labels found directly in the dataset, such as classifying the severity of a legal question. In this paper, we show that this vast unlabeled corpus can improve performance on question classification, opening up the possibility of studying other tasks on these public legal forums.
In the past few years, large language models have shown effectiveness in legal tasks (Chalkidis et al., 2022). A widespread method used to train these models is finetuning. Although finetuning is very effective, it is prohibitively expensive: training all the parameters requires large amounts of memory, and a full copy of the language model must be saved for each task. Recently, prefix tuning (Li and Liang, 2021) has shown great promise by tuning under 1% of the parameters while still achieving performance comparable to finetuning. Unfortunately, prefix tuning performs poorly in low-data (i.e., fewshot) settings (Gu et al., 2022), which are common in the legal domain. Conveniently, domain adaptation using large public datasets is an ideal fit for the legal domain, which has abundant unlabelled data (from public forums) and limited labelled data. To this end, we introduce prefix domain adaptation, which performs domain adaptation for prompt tuning to improve fewshot performance on various legal tasks.
Overall, our main contributions are as follows: • We introduce prefix adaptation, a method of domain adaptation using a prompt-based learning approach.
• We show empirically that performance and calibration of prefix adaptation matches or exceeds LEGAL-BERT in fewshot settings while only tuning approximately 0.1% of the model parameters.
• We contribute two new datasets to facilitate different legal NLP tasks on the questions asked by laypersons, towards the ultimate objective of helping make legal services more accessible to the public.

Related Works
Forums-based Datasets Public forums have been used extensively as sources of data for machine learning. Sites like Stack Overflow and Quora have been used for duplicate question detection (Wang et al., 2020; Sharma et al., 2019). Additionally, many prior works have used posts from specific sub-communities (each called a "subreddit") on Reddit for NLP tasks, likely due to the diversity of communities and large amount of data provided. Barnes et al. (2021) used a large number of internet memes from multiple meme-related subreddits to predict how likely a meme is to be popular. Other works, such as Basaldella et al. (2020), label posts from biomedical subreddits for biomedical entity linking. Similar to the legal judgement prediction task, Lourie et al. (2021) suggest using "crowdsourced data" from Reddit to perform ethical judgement prediction; that is, they use votes from the "r/AmITheAsshole" subreddit to classify who is "in the wrong" for a given real-life anecdote. We explore using data from Stack Exchange and Reddit, which has been vastly underexplored in previous works for the legal domain.
Full Domain Adaptation Previous works, such as BioBERT (Lee et al., 2020) and LEGAL-BERT (Chalkidis et al., 2020), perform full domain adaptation: all parameters of a pre-trained language model are further trained on a large domain-specific corpus before finetuning on downstream tasks.

Parameter-efficient Learning Language models have scaled to billions of parameters (He et al., 2021; Brown et al., 2020), making research memory and storage intensive. Recently, parameter-efficient training methods (techniques that tune only a small percentage of the parameters in a neural network) have become a prominent research topic in natural language processing. In particular, prefix tuning (Li and Liang, 2021) has attracted much attention due to its simplicity, ease of implementation, and effectiveness. In this paper, we use P-Tuning v2, which includes an implementation of prefix tuning. Previously, Gu et al. (2022) explored improving prefix tuning's fewshot performance with pre-training by rewriting downstream tasks as a multiple-choice answering task (their "unified PPT") and by synthesizing multiple-choice pre-training data (from OpenWebText). Unlike them, we focus on domain adaptation rather than general pre-training, and we show a much simpler method of prompt pre-training that uses the masked language modelling (MLM) task while preserving the format of downstream tasks. Ge et al. (2022) domain adapt continuous prompts (not prefix tuning) to improve performance with vision-transformer models for different image types (e.g., "clipart", "photo", or "product"). Zhang et al. (2021) domain adapt an adapter (Houlsby et al., 2019), another type of parameter-efficient training method in which small neural networks placed between layers of the large language model are trained. Vu et al. (2022) explored the transferability of prompts between tasks: they trained a general prompt for the "prefix LM" objective (Raffel et al., 2020) on the Colossal Clean Crawled Corpus (Raffel et al., 2020), but did not study the efficacy of their general-purpose prompt in fewshot scenarios.
Though we use a similar unsupervised language modelling task (Devlin et al., 2019), we do not try to train a "general-purpose prompt" but instead aim to train a domain adapted prompt.

Background
Legal Forums Seeking legal advice from a lawyer can be prohibitively expensive. Public legal forums, however, are readily accessible places for laypersons to ask legal questions. One popular community is the Legal Advice Reddit community (2M+ members), where users can freely ask personal legal questions. Typically, the questions asked on the Legal Advice subreddit are written informally and receive informal answers. Another forum is the Law Stack Exchange, a community for questions about the law. Questions there are more formal than on Reddit; additionally, users are not allowed to ask about a specific case and must ask about the law more hypothetically, as specified in the rules.
In particular, data from the Legal Advice Subreddit is especially helpful in training machine learning models to help laypersons in law, as questions are in the format and language that regular people would write in (see Figure 1). We run experiments on Law Stack Exchange (LSE) for comprehensiveness, though we believe that the non-personal nature of LSE data makes it less valuable than Reddit data in helping laypersons.
Prefix Tuning As language models grow very large, storage and memory constraints make training impractical or very expensive. Deep prefix tuning addresses these issues by prepending continuous prompts to the transformer. These continuous prefix prompts, which are prepended to each attention layer in the model, and a task-specific linear head (such as a classification head) are trained.
More formally, for each attention layer L_i (as per Vaswani et al., 2017) in BERT's encoder, we prepend trainable prefixes P_k (trained key prefix) and P_v (trained value prefix) of length n to the key and value matrices. Letting W_q, W_k, and W_v denote the query, key, and value projection matrices of the attention at layer i, and x the input to layer i, the attention output becomes

    head_i = Attn(x W_q, [P_k; x W_k], [P_v; x W_v])    (1)

where [· ; ·] denotes row-wise concatenation. Here, we assume single-headed attention for simplicity.
Note that in Equation 1 we do not need to left-pad any query values, as the shape of the query matrix does not need to match that of the key and value matrices.
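The mechanism in Equation 1 can be sketched in a few lines. Below is a minimal single-head illustration in NumPy (our own toy sketch, not the P-Tuning v2 implementation); `W_q`, `W_k`, `W_v` stand for the frozen projection matrices, and `P_k`, `P_v` for the trainable prefixes of length n:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def prefix_attention(x, W_q, W_k, W_v, P_k, P_v):
    """Single-head attention with key/value prefixes prepended (Equation 1).

    x: (seq_len, d) layer input; W_*: (d, d) frozen projections;
    P_k, P_v: (n, d) trainable prefix rows.
    """
    q = x @ W_q                                  # queries are never padded
    k = np.concatenate([P_k, x @ W_k], axis=0)   # (n + seq_len, d)
    v = np.concatenate([P_v, x @ W_v], axis=0)   # (n + seq_len, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (seq_len, n + seq_len)
    return softmax(scores) @ v                   # (seq_len, d)
```

As noted above, only the key and value matrices grow by n rows; the query matrix keeps its original shape.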
Expected Calibration Error First suggested in Pakdaman Naeini et al. (2015) and later used for neural networks in Guo et al. (2017), expected calibration error (ECE) can determine how well a model is calibrated. In other words, ECE evaluates how closely a model's logit weights reflect the actual accuracy for that prediction. Calibration is important for two main reasons. First, having a properly calibrated model reduces misuse of the model; if output logits accurately reflect their real-world likelihood, then software systems using such models can better handle cases where the model is uncertain. Second, better calibration improves the interpretability of a model, as we can better understand how confident a model is under different scenarios (Guo et al., 2017). Bhambhoria et al. (2022) used ECE in the legal domain, where it is especially important due to the high-stakes nature of legal decision making.
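Concretely, ECE bins predictions by confidence and averages the gap between each bin's mean confidence and its accuracy, weighted by bin size. A minimal sketch (ours, using equal-width bins as in Guo et al., 2017):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average |accuracy - mean confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece
```

A perfectly calibrated model gets an ECE of 0; a model that is always 95% confident but always wrong gets 0.95.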

Methods
Here we outline our approach and other baselines for comparison.

RoBERTa
To establish a baseline, we train RoBERTa (Liu et al., 2019) for downstream tasks using full model tuning. In addition to the state-of-the-art performance that RoBERTa achieves in many general NLP tasks, it has also shown very strong performance in legal tasks (Shaheen et al., 2020; Bhambhoria et al., 2022). Unlike some transformer models, RoBERTa has an encoder-only architecture and is normally pre-trained on the masked language modelling task (Devlin et al., 2019). We evaluate the model on both of its size variants, RoBERTa-base (approximately 125M parameters) and RoBERTa-large (approximately 355M parameters).

LEGAL-BERT We evaluate the effectiveness of our approach against LEGAL-BERT, a fully domain-adapted version of BERT for the legal domain (Chalkidis et al., 2020). In our experiments, we further perform full finetuning for each downstream task. The number of parameters in LEGAL-BERT (109M) is comparable to RoBERTa-base (125M), as used in our other experiments.

P-Tuning v2
We compare our approach against P-Tuning v2, an "alternative to finetuning" that only optimizes a fractional percentage of parameters (0.1%-3%). It works by freezing the entire model and prepending trainable continuous prompts at each layer; only these prefixes (applied to the key and value matrices of the self-attention mechanism) and the task head are trained. We use P-Tuning v2 as a baseline, as it is the original parameter-efficient training method that we base our study on.
Prefix Domain Adaptation Inspired by domain adaptation, we introduce prefix domain adaptation (also referred to as "domain PA"), which domain adapts a deep prompt (Li and Liang, 2021) to better initialize it for downstream tasks. As the domain adapted deep prompt is very small (approximately 0.1% the size of the base model), it is easy to store and distribute. Once trained, the deep prompt is used as a starting point for downstream tasks. More specifically, we first train a deep prompt with prefix tuning for the masked language modelling task (Devlin et al., 2019) on a large, domain-specific unsupervised corpus, as shown in Figure 2(a). Next, we use this pre-trained prompt and randomly initialize a task-specific head (such as a classification head for a classification task) for each downstream task. Finally, we train the resulting model for the downstream task, using the same prompt tuning approach as before. To the best of our knowledge, no prior works have trained a prefix prompt for a specific domain to better initialize it for downstream tasks using an unsupervised pre-training task (masked language modelling).
Formally, we can treat a prefix-tuned model as having a trained prefix P and a trained task-specific head H. We group the downstream tasks into m domains { D_1, D_2, ..., D_m }, such that there is some overlap between the tasks in each domain D_i. For each domain D_i, we use a domain-specific corpus C_i to train a prefix P_i for the masked language modelling task with prompt tuning (Figure 3(a)). Then, for each downstream task in D_i, we use the deep prefix P_i to initialize the prompts, while randomly initializing the task-specific head H_i (Figure 3(b)).
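The two-phase procedure can be summarised structurally in code. The sketch below is purely illustrative (hypothetical names; the actual training loops are stubbed out as comments): it shows which parameters are shared across tasks (the adapted prefix P_i) and which are re-initialised per task (the head H_i):

```python
import copy

def prefix_domain_adapt(domain_corpus, downstream_tasks,
                        n_layers=12, prefix_len=8, dim=768):
    """Structural sketch of prefix domain adaptation (not the authors' code).

    Phase 1 tunes only the per-layer key/value prefixes on masked language
    modelling over `domain_corpus`; the backbone stays frozen throughout.
    """
    domain_prefix = {
        f"layer_{i}": {"P_k": [[0.0] * dim for _ in range(prefix_len)],
                       "P_v": [[0.0] * dim for _ in range(prefix_len)]}
        for i in range(n_layers)
    }
    # ... Phase 1: run MLM prompt tuning over `domain_corpus`, updating
    # only `domain_prefix` (plus a temporary MLM head) ...

    # Phase 2: every downstream task starts from the adapted prefix,
    # paired with a freshly (randomly) initialised task-specific head.
    models = {}
    for task_name, num_classes in downstream_tasks.items():
        models[task_name] = {
            "prefix": copy.deepcopy(domain_prefix),  # P_i: shared initialisation
            "head": [[0.0] * num_classes],           # H_i: re-initialised per task
        }
    return models
```

Each downstream task then continues prompt tuning from its own copy of the adapted prefix.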

General Prefix Adaptation
In addition to prefix domain adaptation, we conduct experiments using our approach in general settings, inspired by work done in Vu et al. (2022) and Gu et al. (2022). That is, we test the performance of initializing a prompt with the masked language modelling task on a subset of the Colossal Clean Crawled Corpus (Raffel et al., 2020), instead of domain-specific texts (illustrated in Figure 2(b)). We call this method general prefix adaptation (or "general PA" for short). Formally, we use the same prefix domain adaptation approach as previously described, but group all tasks under one "General" domain D, and thus train only one prefix P.

Datasets
We evaluate each of the approaches listed above on three different datasets.
Legal Advice Reddit We introduce a new dataset from the Legal Advice Reddit community (known as "/r/legaladvice"), sourcing the Reddit posts from the Pushshift Reddit dataset (Baumgartner et al., 2020). The dataset maps the text and title of each legal question posted into one of eleven classes, based on the original Reddit post's "flair" (i.e., tag). Questions are typically informal and use non-legal-specific language. Per the Legal Advice Reddit rules, posts must be about actual personal circumstances or situations. We limit the labels to the top eleven classes and remove the other samples from the dataset (more details in Appendix B). To prefix adapt the model for Reddit posts, we use samples from the Legal Advice subreddit that are not labelled or do not fall under the top eleven classes. We use the provided "flair" of each question for a legal area classification task (Soh et al., 2019), as illustrated in Figure 1.

European Court of Human Rights
We use the European Court of Human Rights (ECHR) dataset (Chalkidis et al., 2019), which consists of a list of facts specific to a legal case, labelled with violated human rights articles (if any). Specifically, we evaluate our approach on the binary violation prediction task, where the task is to predict whether a given case violates any human rights articles given a list of facts. We undersample this relatively large dataset to simulate a fewshot learning environment.
To prefix adapt the model for ECHR cases, we use the original corpus of unlabelled cases (similar to what was done in Chalkidis et al., 2020). As the average document length is 700 words (above BERT's maximum length limit), we truncate the text to 500 tokens, concatenating the title and facts of the case together.

Law Stack Exchange
We also introduce a dataset from the Law Stack Exchange (LSE). Its language is generally more formal (shown in Figure 1), and questions are generally more theoretical or hypothetical. We link the questions with their associated tags (e.g., "copyright" or "criminal-law"), and perform the multi-label classification task. Though posts can have multiple tags, we use only the questions with a single tag among the top 16 most frequent tags (excluding tags associated with countries). Similarly to the Legal Advice Reddit dataset, we use the remaining unused questions from the Law Stack Exchange to prefix domain adapt the model.

Experimental Setup
We test our approaches under a fewshot setting, where prompt tuning is known to perform poorly (Gu et al., 2022). We use RoBERTa-base and RoBERTa-large (Liu et al., 2019) for our experiments. To simulate a fewshot learning scenario, we randomly undersample the train and validation sets for each dataset, ensuring that the distributions of train and validation data roughly match. Additionally, we vary the amount of data undersampled to study how fewshot size affects performance. In these tasks, we use a validation size of 256 (much smaller than the original) to better represent true fewshot learning (Perez et al., 2021). As fewshot learning is quite unstable, we run all of our experiments five times, using the seeds { 10, 20, 30, 40, 50 }. We provide more training details in Appendix A.
There is often confusion around whether fewshot sizes represent the number of samples per class or the total number of samples (Perez et al., 2021). In our results, the fewshot sizes we report are the exact number of training samples used (i.e., total training samples), as listed in Table 1. To keep the number of samples per class roughly equivalent, we use fewer total samples for the ECHR task, which has only two classes.
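The undersampling procedure can be illustrated as follows. This is a sketch of the idea (our own helper, not the exact script used in the experiments): fix a seed, keep roughly the same number of samples per class, and cap the subset at the total fewshot size:

```python
import random
from collections import defaultdict

def fewshot_subsample(examples, labels, total, seed):
    """Seeded undersampling with a roughly balanced class distribution."""
    rng = random.Random(seed)  # one of the fixed seeds, e.g. 10/20/30/40/50
    by_class = defaultdict(list)
    for ex, y in zip(examples, labels):
        by_class[y].append(ex)
    per_class = max(1, total // len(by_class))
    subset = []
    for y, items in sorted(by_class.items()):
        rng.shuffle(items)                       # deterministic given the seed
        subset.extend((ex, y) for ex in items[:per_class])
    rng.shuffle(subset)
    return subset[:total]                        # cap at the total fewshot size
```

Running with the same seed reproduces the same subset, which is what makes the five-seed protocol repeatable.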

Results and Discussion
We make a few observations on our results, shown in Table 2. We observe that our method, prefix domain adaptation, outperforms both regular prefix tuning and full finetuning in most tasks across fewshot sizes, despite training considerably fewer parameters. We find that prefix adaptation is comparable to full domain adaptation; in some settings (such as ECHR and some Reddit fewshot settings), prefix adaptation even outperforms full domain adaptation. We argue that prefix domain adaptation achieves better fewshot performance relative to regular prefix tuning because the pre-trained prompts are closer to an effective prompt after our domain adaptation step. This is similar to full domain adaptation, which improves performance on downstream tasks relative to a base model (Chalkidis et al., 2020) by moving parameters closer to optimal parameters. Consistent with Gu et al. (2022), we find that regular prefix tuning falls behind full parameter tuning in fewshot settings. Additionally, we find that LEGAL-BERT performs worse than other techniques on datasets with more informal language (such as the Reddit dataset). As LEGAL-BERT-SC (the model we use) was only trained on very formal legal text, it did not see during training many of the colloquialisms and slang that are prevalent in informal text. For this reason, we do not think LEGAL-BERT would be effective as initialization for tasks involving legal questions asked by laypersons, which typically do not use formal legal language.

Table 2: Classification results with RoBERTa-base (or similarly sized models), with fewshot sizes listed as italic numbers in the second row. Experiments are run five times with different seeds; subscripts give the standard deviation over the five runs. Bolded results represent the best performance for the fewshot size, and underlined results represent second best. RoBERTa from Liu et al. (2019).
In contrast to the other datasets, the ECHR dataset's train and test splits have different distributions. In fewshot scenarios with very little data (i.e., 4-16 examples), we find that prefix tuning based approaches perform better than full finetuning; this suggests that prefix tuning approaches are more robust to changes in distribution (and possibly noise). We also note that BERT with truncation (maximum token length of 500) performs far better than initially reported in Chalkidis et al. (2019), who report an F1 worse than random guessing (macro F1 of 66.5 in ours, 17 in theirs). We believe this underperformance of finetuning BERT could be caused by a mistake in their training process.
In Figure 4, we show the trend of performance on Reddit data as the number of samples increases. Prefix domain adaptation is comparable to finetuning, consistently outperforming regular prefix tuning. As shown by the larger shaded area around the lines, the stability of finetuning is worse than that of prefix domain adaptation for this task. Performance gradually converges as more data is given to each method.
Larger models typically provide better performance on various tasks. Thus, we run experiments using RoBERTa-large (over 2x larger than RoBERTa-base) to see how our approach scales to larger models. As seen in Table 3, our approach is still comparable to or outperforms full finetuning with larger models. Impressively, at fewshot sizes 32-128, prefix domain adaptation with RoBERTa-base is even comparable to full finetuning with RoBERTa-large. Additionally, we note that full domain adaptation is more sensitive to learning rates in larger models, explaining its weaker performance at fewshot sizes 32 and 64. Due to limitations in computational resources, we leave more extensive hyperparameter search as future work.

Calibration
When providing predictions to laypersons, it is vital that the distribution of the output logits accurately reflects the model's confidence. Thus, we use the expected calibration error (ECE) (Pakdaman Naeini et al., 2015) to measure the calibration of the model resulting from each method. We show that the calibration of our approach is better than finetuning across tasks, as seen in Table 4. Additionally, we observe that our approach is comparable to LEGAL-BERT across tasks. In the case where questions are well formulated (i.e., in the LSE dataset), we find that legal models are better calibrated. However, on Reddit data, which is central to helping laypersons with legal questions, our approach is very competitive.

Sample Efficiency
We study the effect of training time (i.e., number of training steps) for the domain-adapted prompt on downstream performance. To analyze the effect of additional training steps on the domain adapted prefix's performance, we initialize models using pre-trained prefixes from specific steps and plot the performance (over five runs) in Figure 5. We find that more optimization steps during the prefix adaptation step lead to better downstream performance. Intuitively, this makes sense as a longer training time means the prefix starts closer to an ideal one for a downstream task. Though each optimization step is faster with regular prefix tuning, it converges slowly and thus is not necessarily faster than finetuning. As shown in Figure 6, our approach converges faster than regular prefix tuning. Again, we argue that this is expected as the prompts are closer to a desired solution when compared to regular prefix tuning, meaning fewer training steps are needed to reach an effective solution.

Conclusions
In this paper, we propose a novel training framework, prefix adaptation, which domain adapts a prompt using a large corpus of domain-specific text. We show that our approach matches or outperforms LEGAL-BERT and related techniques in performance while training far fewer parameters (approximately 0.1%). With our technique, we improve fewshot performance and convergence time compared to other parameter-efficient methods. We believe this will make fewshot data more usable (and thus reduce data labelling costs) while using parameter-efficient methods to reduce computational and storage costs.
Additionally, we introduce two new datasets (Legal Advice Reddit and Law Stack Exchange) to lay foundations for future work in legal decision-making systems; as opposed to the formal documents in ECHR, our two datasets are closer to legal questions asked by laypersons, helping to promote access to justice for all.

Table 5: Learning rates searched for each configuration. The suffix "PT" means prompt tuning based methods, and "FT" finetuning based methods.

A Additional Training Details
We use the AdamW optimizer and a grid search of learning rates as in Table 5, mostly following Gu et al. (2022). For all of our experiments, we truncate the sequence to a length of 500 tokens (as opposed to 512 tokens) to allow space for a tuned deep prefix prompt. We report the calibration and general results using the checkpoint with the best validation macro F1, for each fewshot size and method.
Given that RoBERTa-base (~125M parameters) and RoBERTa-large (~355M parameters) can fit in a single NVIDIA 1080Ti GPU (using a smaller batch size), we do not perform any model or data parallelism. We use an effective batch size (i.e., factoring in gradient accumulation steps) of 32 for experiments on roberta-base, and, due to memory constraints, an effective batch size of 24 for experiments on roberta-large. As the number of samples is low, we train for 100 epochs. However, for the domain adaptation and prefix adaptation training steps, we train for 20 epochs, as much more data is available (and therefore more optimization steps are run in each epoch).
We use a prefix length of 8. Including the tuned linear head for classification, the largest number of parameters we tune for RoBERTa-base is 160K (varying slightly for each task depending on the number of classes), or ~0.13% of the model's parameters.
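This parameter count can be reproduced with back-of-the-envelope arithmetic (assuming RoBERTa-base's 12 layers and hidden size 768, a length-8 key and value prefix per layer, and an 11-class linear head as in the Reddit task; the small remaining gap to the quoted ~160K presumably comes from implementation details of the head):

```python
# Back-of-the-envelope count of tuned parameters for RoBERTa-base.
n_layers, hidden, prefix_len = 12, 768, 8
n_classes, backbone = 11, 125_000_000  # Reddit task; ~125M total parameters

prefix_params = n_layers * 2 * prefix_len * hidden  # key + value prefixes
head_params = hidden * n_classes + n_classes        # linear head weights + bias
total_tuned = prefix_params + head_params

print(prefix_params)                           # 147456
print(total_tuned)                             # 155915
print(round(100 * total_tuned / backbone, 2))  # 0.12 (% of model parameters)
```

This is consistent with the ~0.13% figure quoted above.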

B Data Processing
For Reddit data we take the top 11 classes that are not countries. We concatenate the title of the Reddit post and body text together, then use this combination to train our models for the masked language modelling and flair classification task.
For Stack Exchange data, we take only the questions with a single tag, as described above. The Stack Exchange data, taken from the Internet Archive, includes the post body as HTML. As our base models were not trained on HTML-formatted text, we convert the HTML to Markdown to make it much more similar to human-readable text.
For the ECHR dataset, we concatenate each fact from the legal case together, along with the title of the case. Additionally, we found that some documents had numbered facts (such as "1. <fact>"), while some documents were not numbered. We used a simple regular expression to remove this inconsistency which could possibly create biases in the model (e.g., if numbered facts were more likely to mean a violation).
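A regular expression of the following shape can perform this normalisation (our reconstruction; the paper does not give its exact pattern):

```python
import re

def strip_fact_numbers(text):
    """Drop leading enumeration such as '12. ' at the start of each fact line."""
    return re.sub(r"^\s*\d+\.\s*", "", text, flags=re.MULTILINE)

example = "1. The applicant was arrested.\n2. The trial began in May."
print(strip_fact_numbers(example))
```

Unnumbered facts pass through unchanged, so both document styles end up in the same format.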
In our domain adaptation experiments, we use all the data (i.e., including questions/posts that were previously filtered out because they lacked top tags) for each dataset. We use the domain adapted checkpoint with the best validation cross-entropy loss for downstream tasks.