Do Language Models Learn about Legal Entity Types during Pretraining?

Language Models (LMs) have proven their ability to acquire diverse linguistic knowledge during the pretraining phase, potentially serving as a valuable source of incidental supervision for downstream tasks. However, there has been limited research on the retrieval of domain-specific knowledge, and legal knowledge in particular. We propose to explore the task of Entity Typing, which serves as a proxy for evaluating legal knowledge, an essential aspect of text comprehension, and a foundational task for numerous downstream legal NLP applications. Through systematic evaluation and analysis with two types of prompting (cloze sentences and QA-based templates), and to clarify the nature of the acquired cues, we compare diverse types and lengths of entities (both general and domain-specific), semantic and syntactic signals, and different LM pretraining corpora (generic and legal-oriented) and architectures (encoder-only BERT-based and decoder-only with Llama2). We show that (1) Llama2 performs well on certain entities and exhibits potential for substantial improvement with optimized prompt templates, (2) law-oriented LMs show inconsistent performance, possibly due to variations in their training corpora, (3) LMs demonstrate the ability to type entities even in the case of multi-token entities, (4) all models struggle with entities belonging to sub-domains of the law, and (5) Llama2 appears to frequently overlook syntactic cues, a shortcoming less present in BERT-based architectures.


Introduction
During the initial phase of pretraining, language models (LMs) are exposed to an extensive corpus of textual data, allowing them to acquire the capacity to represent the probabilistic structure of language. In this process, it has been theorized that they incidentally learn various linguistic signals and patterns, both syntactic and semantic. Work by Petroni et al. (2019) and subsequent studies (Jiang et al., 2020b) hypothesize that a side effect of the pretraining stage is that LMs also learn factual knowledge. On the other hand, Gururangan et al. (2020) demonstrate the significance of domain- and task-specific pretraining: pretraining a model with a specific focus on a particular task or a limited domain corpus yields notable advantages in model performance and adaptability.
Entity typing and extraction are crucial for a range of use cases including named-entity recognition (NER), relation extraction, summarization, structuring raw data, and, specifically in law, legal search and past-case retrieval. To gain more insight into entity typing and extraction, entity probing tasks have been designed for bidirectional LSTM conditional random field models (Augenstein et al., 2017), masked language models (Petroni et al., 2019; Jiang et al., 2020b), and autoregressive LMs (Epure and Hennequin, 2022), using GPT-2.
Conversely, one notable bottleneck for the application of NLP within the legal domain is the lack of resources and annotated datasets. It is therefore of particular interest to explore the extent to which LMs, during their pretraining phase, acquire a sufficient understanding of legal entities, serving as a surrogate for legal knowledge. Ultimately, LMs could be exploited as a source of weak and indirect supervision in downstream tasks such as legal NER or question answering (QA), as their pretraining stage makes them a good proxy for incidentally leveraging natural text. Indeed, humans do not exclusively rely on exhaustive supervision but instead make use of occasional feedback and learn from incidental signals originating from various sources. This approach holds potential for increased flexibility in terms of entity types, in contrast to supervised methods, and presents an alternative to existing automated annotation extraction approaches (Tedeschi and Navigli, 2022; Savelka, 2023), which are limited in the set of entity types they support. It presents several advantages: it does not require human annotation, it can easily be combined with other sources of supervision such as legal knowledge bases, and it would support an open set of entities and user queries. It would offer the advantage of seamless and fast application to new datasets while facilitating transfer of knowledge between datasets and even potentially between different domains. In this paper, we study the intersection of entity knowledge and legal knowledge embedded within LMs, evaluated on AsyLex, a dataset of Canadian refugee decisions.

Table 1: Overview of the models used. The table reports the description of the pretraining corpora, the number of parameters, the total number of tokens, the size of the corpus, and the vocabulary size.

Legal LMs:
CaseHOLD (Zheng et al., 2021): Harvard Case Law; 110M parameters; 43B tokens; 37 GB; 32K vocabulary
Pile of Law (Henderson et al., 2022): US, Canadian, ECtHR; 340M parameters; 130B tokens; 256 GB; 32K vocabulary
LexLM (Chalkidis et al., 2023): US, Canada, EU, UK, India; 125M parameters; 2T + 256B tokens; 175 GB; 50K vocabulary

Generic LMs:
RoBERTa (Liu et al., 2019): BookCorpus (Zhu et al., 2015), CC_news (Nagel, 2016), OpenWebText (Radford et al., 2019), and Stories (Trinh and Le, 2018), of 16GB, 76GB, 38GB, and 31GB respectively; 125M parameters; 2T tokens; 160GB; 50K vocabulary
DeBERTa (He et al., 2023): same generic corpora; 86M parameters; 2T tokens; 160GB; 128K vocabulary
Llama 2 (Touvron et al., 2023): data from publicly available sources; 7B parameters; 2T tokens; corpus size not reported; 32K vocabulary

Research questions
We are interested in evaluating the quality of the entity knowledge learned during pretraining in off-the-shelf LMs, specifically for domain-specific entities such as those pertinent to the legal field. Unlike LegalLAMA (Chalkidis et al., 2023), which investigates eight distinct legal knowledge probing tasks with a focus on legislation and legal terminology, we focus on Legal Entity Types. To be clear, we ask the LM to predict the entity type, similarly to Epure and Hennequin (2022), and not the actual entity. For example, in the prompt <Mask> is the capital of Germany, we expect the answer to be City or Location rather than Berlin.
In addition, we adopt a comprehensive interpretation of entity types, aligned with the work of Barale et al. (2023), encompassing both essential factual knowledge (e.g., locations and dates) and more abstract legal concepts, such as the credibility of a claimant and the rationale behind a judgment. Moreover, our approach diverges by allowing longer entities to be masked (Figure 4), where previous work was limited either to single-token (Petroni et al., 2019) or two-token entities (Jiang et al., 2020a; Chalkidis et al., 2023). Where most previous work focuses on models with a masked language modeling (MLM) objective, we introduce the use of an autoregressive LM (Llama2) in a zero-shot setting, similar to the approach employed in Epure and Hennequin (2022).
We hypothesize that pretrained LMs inherently contain structured knowledge about specific domains, which could be leveraged to generate incidental training instances. We seek to investigate the depth of a model's knowledge, its nature, and whether the model predominantly acquires knowledge from semantic or syntactic cues.
We first conduct, in section 3, an analysis of the pretraining corpora of the selected models, both generic and legal LMs. We then prompt the LMs with two different styles of prompts, cloze text and question-based, for the task of Entity Typing (section 5). After evaluating the experimental results in section 6, we analyze the types of failure cases (6.5) to highlight the strengths and weaknesses of the learning process and to sketch directions for future work.
Our contributions are as follows:
• We propose two new experiments on the task of Legal Entity Typing in a zero-shot setting on a large set of entity types: Experiment MLM, which evaluates generic and legal BERT-based LMs on cloze sentences, and Experiment Llama2, which evaluates Llama2 on QA-style prompts.
• We report the results of both experiments and show that Llama2 exhibits good performance on specific entities and has the potential for improvement with optimized prompts. However, law-oriented LMs display inconsistent results, likely influenced by variations in their training corpora, and struggle with refugee-law-specific vocabulary.
• We propose an in-depth analysis of the failure modes of the models on this task, opening the way for future work.
2 Background and related work

Legal NLP and Legal LMs
A range of tasks and use cases have been investigated in legal NLP (Zhong et al., 2018), including summarization, information retrieval and extraction, and question answering. It is worth emphasizing that entity typing is foundational for many of these tasks.
The legal domain presents numerous challenges for self-supervised learning, primarily due to the specificity of legal language in contrast to ordinary language. This can lead to ambiguity in contextual meaning (which we aim to assess in this paper), potential implicit meanings, and variations in the significance of a term: a term that may be decisive in a legal context, such as "appeal," might not carry the same weight in a generic domain.
Given these specific challenges and the demonstrated benefits of pretraining LMs on legal text to achieve better performance on downstream tasks (Barale et al., 2023), there has been interest in pretraining models on legal texts (Zhong et al., 2018; Xiao et al., 2021). These models typically use an encoder-only architecture based on BERT: LegalBERT (Chalkidis et al., 2020), CaseHOLD (Zheng et al., 2021), Pile of Law (Henderson et al., 2022), and LexLM (Chalkidis et al., 2023), which we use in our first experiment (details in Table 1). To the best of our knowledge, there is no decoder-only legal LM, which is why we limit our second experiment to Llama2.

Probing LMs for Entity Typing
The idea of latent language representations derived from pretrained LMs holds promise as a source of structured knowledge. Similar to human learning, LMs accumulate domain-specific and linguistic knowledge, along with general pattern-recognition capabilities, through their pretraining experiences (Brown et al., 2020). As noted in the introduction, our work follows Petroni et al. (2019)'s LAnguage Model Analysis framework (LAMA) and LegalLAMA (Chalkidis et al., 2023). Several probing methods have been investigated (Yin et al., 2023), evaluating multilingual extraction (Jiang et al., 2020a) as well as effective prompting for factual knowledge extraction (Haviv et al., 2021; Qin and Eisner, 2021; Blevins et al., 2023). Various types of tasks have been targeted by these works, including relation extraction, NER, and entity typing (Shen et al., 2023), our task of interest. Concurrently, there have been efforts to enhance entity typing pipelines, particularly to expand the range of entities beyond traditional categories like locations or dates (Choi et al., 2018; Dai et al., 2021) or to entities unseen during training (Epure and Hennequin, 2022; Lin et al., 2020), as well as approaches leveraging supervision from other tasks such as QA (Zhang et al., 2022). However, to the best of our knowledge, no work in the legal domain specifically addresses entity typing, and no prior research on entity typing in this domain has made use of prompts in the form of questions.

Vocabulary
To understand the differences between pretraining corpora across LMs, we conduct an exploratory vocabulary analysis inspired by Gururangan et al. (2020), which investigates the impact of domain-specific pretraining on a range of downstream tasks. This preliminary study is intended to offer insights that will help explain the results of the experiments presented in section 5. We select a total of fifty thousand documents for each language model, perform basic cleaning, tokenize the text, and remove stopwords, which gives us a list of tokens per LM. From this list, we select the most common ten thousand tokens, which constitute the final vocabulary for that LM.
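A minimal sketch of this vocabulary-construction step is given below; the NLTK tokenizer, the English stopword list, and the exact cleaning are our assumptions, since the paper does not name the preprocessing tools used.

```python
import random
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def build_vocabulary(documents, n_docs=50_000, vocab_size=10_000, seed=0):
    """Sample documents, clean and tokenize them, drop stopwords,
    and keep the most frequent tokens as the corpus vocabulary."""
    random.seed(seed)
    sample = random.sample(documents, min(n_docs, len(documents)))
    counts = Counter()
    for doc in sample:
        tokens = word_tokenize(doc.lower())
        counts.update(t for t in tokens if t.isalpha() and t not in STOPWORDS)
    return {token for token, _ in counts.most_common(vocab_size)}
```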
For the three legal LMs, as the datasets used for pretraining are directly available, we randomly select the fifty thousand documents. To construct a generic pretraining corpus, we reconstitute a vocabulary based on the RoBERTa and DeBERTa corpora as indicated in Table 1. For these generic models, we gather fifty thousand entries, selected proportionally to the size of each corpus. That is to say, we select 5,000 documents from BookCorpus, which constitutes 10% of RoBERTa's pretraining data, 23,750 entries from CC_news, 11,875 entries from OpenWebText (using the open-source version (Gokaslan and Cohen, 2019)), and 9,688 entries from CC_stories. Given our limited knowledge of the precise composition of Llama2's pretraining corpus, we propose utilizing the generic vocabularies of RoBERTa and DeBERTa as suitable proxies for our analysis.
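The per-corpus document counts reported above correspond to a proportional allocation over the generic corpus; the small sketch below reproduces this arithmetic, assuming the 160GB reference size from Table 1.

```python
# Corpus sizes in GB, taken from Table 1 (RoBERTa/DeBERTa generic pretraining data).
corpus_sizes_gb = {
    "BookCorpus": 16,
    "CC_news": 76,
    "OpenWebText": 38,
    "CC_stories": 31,
}

total_gb = 160   # reference size of the generic pretraining corpus
n_entries = 50_000

# Number of entries sampled from each corpus, proportional to its size.
allocation = {
    name: round(n_entries * size / total_gb)
    for name, size in corpus_sizes_gb.items()
}
print(allocation)
# {'BookCorpus': 5000, 'CC_news': 23750, 'OpenWebText': 11875, 'CC_stories': 9688}
```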

Vocabulary Overlap
The vocabulary overlap is represented, in percentage, in the matrix in Figure 1. As anticipated, the legal LMs exhibit a greater overlap in vocabulary compared to their counterparts with generic training. However, significant disparities emerge among the legal LMs. For example, CaseHOLD shares only 44.5% of its vocabulary with LexLM. This observation may be attributed to the more extensive and more diverse set of jurisdictions included in the LexLM pretraining corpus. This aligns with the fact that LexLM shows a higher percentage of vocabulary overlap with Pile of Law, which, in contrast to CaseHOLD which is limited to the United States, also includes legal documents from a broader range of jurisdictions.
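For reference, a minimal sketch of how such a pairwise overlap matrix can be computed from the per-model vocabularies built above; defining overlap as the share of one model's vocabulary found in the other's is our reading of the figure.

```python
from itertools import product

def overlap_matrix(vocabs):
    """Pairwise vocabulary overlap, in percent: share of model A's
    vocabulary that also appears in model B's vocabulary."""
    names = list(vocabs)
    matrix = {}
    for a, b in product(names, names):
        shared = len(vocabs[a] & vocabs[b])
        matrix[(a, b)] = 100.0 * shared / len(vocabs[a])
    return matrix

# Usage: vocabs = {"CaseHOLD": v1, "Pile of Law": v2, "LexLM": v3, ...}
# where each value is a set built with build_vocabulary() above.
```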

Dataset
We use AsyLex, a dataset of refugee decisions from Canada curated for entity typing and extraction. This publicly available dataset comprises 19,115 human-annotated instances, encompassing 20 distinct categories of entities that hold legal relevance, as explained in Barale et al. (2023). These categories have been identified as categories of interest in collaboration with legal professionals and experts in the field of refugee law. AsyLex comprises 59,112 historical decision documents, with entity types ranging from generic ones to highly domain-specific ones (e.g., legal_ground, law_report). This diversity in entity scope presents an opportunity for assessing the impact of pretraining, particularly in scenarios where entity types have received varying exposure during the pretraining phase.

Legal Entity Types
The selection of legal entity types within this dataset is intended to encapsulate characteristics that can reflect similarities among various refugee cases (see Appendix A for an exhaustive description of the types) and for which we have precise gold-standard annotations (Barale et al., 2023). The set of 14 entity types is pre-defined and closed for both experiments. To extend the coverage of each entity type and extract specific entities, we use a synonym generator to produce synonyms for each of our 14 entity types. As a result, when prompted, the model has to choose among a total of 151 entity types, increasing the difficulty but also the interest of the task. For example, location accepts city or country. The complete list of synonyms generated per entity type is available in Appendix B.
In our evaluation process, we assess predictions across the 14 entity types. For instance, if a prediction is country, it is categorized as location and evaluated against a gold answer that specifies location. Contrary to previous work, we do not limit the entities' length to a single token (Petroni et al., 2019) or to entities spanning only two tokens (Jiang et al., 2020a). On the contrary, one of our objectives is to use entity types in a broader way for extracting information from text. We are therefore interested in identifying multi-token entities that cover short spans of text and can be longer than two tokens, which is often the case, for instance, when explaining a decision (entity type: explanation).
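To make this evaluation mapping concrete, the sketch below shows how a predicted synonym can be mapped back to one of the 14 canonical types before scoring; the dictionary contains only a few illustrative entries (the full 151-label list is in Appendix B), and the helper name is ours.

```python
# Illustrative subset of the synonym expansion: each of the 14 canonical
# entity types accepts several synonyms (151 candidate labels in total).
SYNONYM_TO_TYPE = {
    "location": "location",
    "city": "location",
    "country": "location",
    "date": "date",
    "explanation": "explanation",
    # ... remaining entries from Appendix B
}

def canonicalize(predicted_label: str) -> str | None:
    """Map a predicted label (possibly a synonym) back to one of the
    14 canonical entity types before scoring against the gold type."""
    return SYNONYM_TO_TYPE.get(predicted_label.strip().lower())
```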
The length of the entity per entity type is presented in Figure 4, which provides explicit numerical values for both single-token entities and entities longer than three tokens.
5 Proposed Entity Typing Methodology

Task description
The goal of this task is to classify legal entities mentioned in text documents or sentences into specific types. Legal entities can include various organizations, companies, and government bodies (org), or more abstract concepts such as the credibility assessment made in the context of a refugee claim (credibility). The task involves extracting and categorizing these entities based on their attributes or context within the text. As input, we use text documents split into sentences that contain mentions of legal entities. We then categorize the legal entity type of each mention found in the input text. Let $E = \{e_1, e_2, \ldots, e_n\}$ be the set of possible entity types, $S$ the set of sentences or text segments, $T$ the set of masked tokens within the sentences, and $P(e_i \mid t_j, s_k)$ the conditional probability that masked token $t_j$ in sentence $s_k$ belongs to entity type $e_i$. The goal is to find, for each masked token $t_j$ in each sentence $s_k$, the entity type that maximizes this conditional probability, $\hat{e}(t_j, s_k) = \arg\max_{e_i \in E} P(e_i \mid t_j, s_k)$. In other words, the objective is to find the entity type that is most likely for each masked token in each sentence.

Language models used
For the first experiment, Experiment MLM with BERT-based LMs, we experiment with two generic models optimized for MLM, RoBERTa and DeBERTa, and three legal-oriented LMs (see Table 1). For the second experiment, Experiment Llama2, we use the open-source model Llama2, together with its fine-tuned variant Llama2-chat, optimized for dialogue use cases. Both experiments take the list of entity types as an input argument, making the task a multiple-choice one.

Cloze prompts with BERT-based models
For the first experimental setting, we use cloze-style prompts that are a natural fit for masked language models (MLMs). We replace the entities in the sentences with a masked token and use BERT-based models with an MLM objective. Multi-token entities are substituted with a single masked token. If multiple entities appear in the same sentence, only the first entity occurring in the sentence is considered. The model's answers are limited to the predefined list of 151 entity types. We do not provide more context than what is contained in the input sentence. We randomly select ten thousand sentences per entity type for which we have ground-truth annotations (the actual number of prompts after cleaning is given in the column # prompts in Table 2). An example of a cloze-style prompt is given in Figure 2. With this Experiment MLM, our objective is to assess whether the models can predict the type of entity to expect based on contextual and syntactic cues. For instance, in the example presented in Figure 2, we assume that a human reader could deduce from the context that a location is the expected entity to fill the masked portion. Can an LM do the same?
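A minimal sketch of this constrained cloze prediction is given below, assuming a HuggingFace checkpoint; scoring each candidate label by the probability of its first sub-token at the masked position is our simplification, since the paper does not detail how multi-token candidate labels are scored.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "roberta-base"  # or a legal checkpoint such as CaseHOLD / Pile of Law
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def type_masked_entity(sentence: str, candidate_types: list[str]) -> str:
    """The entity span is already replaced by a single <mask>; return the
    candidate type whose first sub-token gets the highest probability."""
    inputs = tokenizer(sentence, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = logits.softmax(dim=-1)

    def first_subtoken_prob(label: str) -> float:
        # Leading space matches RoBERTa's BPE segmentation of a mid-sentence word.
        ids = tokenizer(" " + label, add_special_tokens=False)["input_ids"]
        return probs[ids[0]].item()

    return max(candidate_types, key=first_subtoken_prob)

# Example (cloze prompt with the entity already masked):
# type_masked_entity("The claimant is a citizen of <mask>.", ["location", "date", "organization"])
```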

QA prompts with Llama2
For the second experimental setting, Experiment Llama2, we use a template that briefly explains the task to the model and provides the predefined list of 151 entity types. To provide a simple task framing, we prompt the language model with the following template: "What entity type is {entity}?", to which the model is asked to answer with the most probable entity type. Because of the format of the prompt, we use a text generation objective with an open-source, state-of-the-art autoregressive LM, Llama2. We use the smallest available version of the model (7B parameters, to spare computing resources) and its fine-tuned version Llama2-chat, which is optimized for dialogue use cases. An example of this QA-style prompt is presented in Figure 3. In this experiment, the prompt explicitly mentions the entity, e.g., "What entity type is Nice?", which makes the task simpler than that of Experiment MLM.
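A minimal sketch of this QA-style prompting, assuming the HuggingFace checkpoint meta-llama/Llama-2-7b-chat-hf; the instruction preamble and decoding settings are illustrative, only the question template follows the paper.

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
)

def ask_entity_type(entity: str, candidate_types: list[str]) -> str:
    # Brief task explanation plus the closed list of 151 candidate labels,
    # followed by the question template used in Experiment Llama2.
    prompt = (
        "Answer with exactly one entity type chosen from this list: "
        + ", ".join(candidate_types)
        + f".\nWhat entity type is {entity}?"
    )
    output = generator(prompt, max_new_tokens=10, do_sample=False)
    # generated_text contains the prompt followed by the completion.
    return output[0]["generated_text"][len(prompt):].strip()

# Example: ask_entity_type("Nice", list_of_151_entity_types)
```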

Experimental Results
We evaluate the results in terms of recall, since we want to capture as many true positives as possible, and F1 score, to assess overall performance on the task. In this section, we compare the results in terms of the LM used, the length of the input entity, the prompt type, and the entity type, before conducting an error analysis in section 6.5. The results of Experiment MLM are presented in Table 2 and the results of Experiment Llama2 in Table 3.
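A small sketch of the scoring step, assuming scikit-learn and gold/predicted labels already mapped to the 14 canonical types; how the reported averages are aggregated across types is not specified, so the macro averaging below is an assumption.

```python
from sklearn.metrics import f1_score, recall_score

def score_predictions(gold: list[str], predicted: list[str], entity_types: list[str]):
    """Per-type and macro-averaged recall / F1 over the canonical entity types."""
    per_type_recall = recall_score(gold, predicted, labels=entity_types, average=None, zero_division=0)
    per_type_f1 = f1_score(gold, predicted, labels=entity_types, average=None, zero_division=0)
    macro = {
        "recall": recall_score(gold, predicted, labels=entity_types, average="macro", zero_division=0),
        "f1": f1_score(gold, predicted, labels=entity_types, average="macro", zero_division=0),
    }
    per_type = dict(zip(entity_types, zip(per_type_recall, per_type_f1)))
    return per_type, macro
```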

Language Models Comparison
Given the high difficulty of the task, the choice among 151 entity types once the synonym list is accounted for, and the absence of entity descriptions or additional context, it is no surprise that the scores are relatively low. However, the goal of this work is not to reach the best accuracy, but rather to explore where the models succeed or fail. In Experiment MLM, results are generally lower than in Experiment Llama2, which is first explained by the greater difficulty of the Experiment MLM task and the relative lack of context provided. In this experiment, Pile of Law is the model that performs best on average, with 16.36% F1 and 9.47% recall. The second best performing model is CaseHOLD, with 15.29% average F1 and 8.58% average recall. Despite its larger size, LexLM is the model that performs worst overall, being outperformed even by the generic LMs RoBERTa and DeBERTa. For all entities except one, the model that achieved the best recall also achieved the best F1, highlighting consistency in precision. The only exception is the type procedure, for which the best F1 is reached with Pile of Law and the best recall with LexLM.

Single-token vs Multi-token
An interesting point is that we did not impose any restrictions on the length of entities; the entities that tend to be longer are typically more abstract and closer to a piece of legal commonsense knowledge and reasoning, for example explanation, determination, credibility, and legal_ground. Interestingly, the best overall F1 score in Experiment MLM is achieved for the type determination, reaching 55.5% (RoBERTa). For instance, an entity flagged as determination can be as long as: claimants are not convention refugees and not persons in need of protection. While the other models achieve lower scores for this entity type, it is worth noting that the disparity between these relatively lengthy multi-token entities and those that are typically single tokens is not substantial (see Figure 4). This may be due to the nature of the task, which may mitigate such disparities compared to tasks like NER where the model has to retrieve the actual entity. In Experiment Llama2, shorter entities (which are also the most generic ones) are well recognized (location, date), with good scores also achieved for the types determination, claimant_info, and procedure.
Overall, for both experiments, and likely due to the nature of the entity typing task, the length of the entity to categorize does not appear to have an impact on the results.

Prompt Templates Comparison
The scores are on average higher in Experiment Llama2, with a total average F1 score of 30.86%, whereas Experiment MLM reaches an average of 14.46% across all types. However, based on the predicted entity types, it appears that the template used for Experiment Llama2 is not consistently well understood, resulting in a lack of clarity regarding the task. In some cases, the model returns not just one entity type but several, leading to incorrect predictions. Since manually crafted prompts and templates have been acknowledged to be sub-optimal (Jiang et al., 2020a), there is significant room for improvement in this regard.

Entity Types Comparison
For this evaluation, we categorize the entity types into three groups: those that can be encountered in any text with the same meaning, those that are commonly found in legal texts, and those that are highly specific to the domain of the dataset, refugee law. The combined results are summarized in Table 5. Entities related to refugee law tend to yield the lowest performance across all models and settings. Pile of Law outperforms the other models even on generic entities. At the same time, RoBERTa and DeBERTa surpass models specifically trained on legal data on generic legal entities, possibly due to larger exposure and a larger vocabulary.

Failure Cases Analysis
We identify five types of errors across the two experiments: 1. Random Prediction: this refers to cases where the predicted entity type is entirely random and unrelated to the context.
2. Contextually Accurate: this describes situations where the predicted entity type is incorrect, but within the context of the sentences, it is plausible in terms of syntax and meaning.
3. Closely Related: instances where the predicted entity class is incorrect, yet it is closely related to the actual gold entity type. For example, the model misclassifies a legal ground (which is precisely one of the five reasons for being granted refugee status, see Appendix A) as an explanation of the decision (which is more generic).
4. False Negative: the model predicts an entity type that is not in the list of entity types given as input.
5. Prompt Error: if the answer provided deviates from the prompt instruction, we categorize it as incorrect; we consistently consider an answer of more than five tokens incorrect, as it signifies that the response extends beyond providing just the entity type (a minimal flagging sketch follows this list).
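A minimal sketch of how the last two error categories can be flagged automatically, using the five-token threshold defined above; the function name and whitespace tokenization are our assumptions.

```python
def flag_answer(answer: str, allowed_types: set[str]) -> str | None:
    """Flag Prompt Error / False Negative answers before manual error analysis."""
    tokens = answer.strip().split()
    if len(tokens) > 5:
        # Responses longer than five tokens go beyond naming an entity type.
        return "prompt_error"
    if answer.strip().lower() not in allowed_types:
        # Predicted label is not in the closed list given in the prompt.
        return "false_negative"
    # Remaining errors (random / contextually accurate / closely related)
    # require manual inspection of the sentence context.
    return None
```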
Experiment MLM errors. To assess the occurrence of error types, we sample 10 errors per entity type per model, i.e., a total of 700 errors for this experiment. Table 4 presents the findings and an example per error type. There is no instance of a False Negative error: the models never predict an entity type outside the input predefined list, as we constrain the model to a multiple-choice task over our pre-defined list of entity types. The most common error is simply a random, incorrect prediction, with the second most frequent being the prediction of a closely related entity. This may be due to the choice of categories, some of which express subtle legal nuances. Another positive sign is that more than 10% of incorrect predictions are nevertheless accurate in the context of the input sentence.
Experiment Llama2 errors. Similarly, we sample 10 errors per entity type, i.e., a total of 135 errors. Table 4 presents the findings. It is worth noting that the use of QA-style prompts leads to a significant number of prompt errors and false negatives, which we believe could be mitigated to some extent by improving the initial prompt template in future work. Additionally, a common misclassification pattern occurs with norp entities, which are always adjectives but are misclassified as their noun counterparts, as illustrated in the example provided in Table 4. Similarly, acronyms for tribunals (e.g., RPD for Refugee Protection Division) are classified as units of length, an issue that might be rectified by providing more contextual information. Finally, entities like consistent explanation are occasionally misclassified as explanation when they should be categorized as a credibility assessment, possibly due to missing adjectives or entity length-related challenges.

Entity type descriptions and examples:
CLAIMANT_INFO: age, gender, citizenship, occupation ("28 year old", "citizen of Iran", "female")
PROCEDURE: steps in the claim and legal procedure events ("removal order", "sponsorship for application")
DOC_EVIDENCE: pieces of evidence, proofs, supporting documents ("passport", "medical record", "marriage certificate")
EXPLANATION: reasons given by the panel for the determination ("fear of persecution", "no protection by the state")
LEGAL_GROUND: referring to the Convention, refugee status is granted for reasons of race, religion, nationality, membership of a particular social group or political opinion ("homosexual", "christian")
LAW: citations of legislation and international conventions ("section 1(a) of the convention")
LAW_CASE: citations of case law and past decided cases ("xxx v. minister of canada, 1994")
LAW_REPORT: country reports written by NGOs or the United Nations ("amnesty international: police and military torture of women in mexico, 2016")

Figure 2: Experiment MLM prompt example

Conclusion

Table 2: Entity type prediction scores in a zero-shot setting, on cloze sentences, measured in recall and F1 score across 2 generic LMs (RoBERTa and DeBERTa-V3) and 3 legal LMs (CaseHOLD, Pile of Law, and LexLM)

Table 4: Error cases and the ratio of the different error types for both experiments, across all tested models

Table 7: Extended pre-defined list of legal entity types (151 types)

Table 8: Error type counts (number of occurrences) and percentages for all studied LMs