Automated Refugee Case Analysis: An NLP Pipeline for Supporting Legal Practitioners

In this paper, we introduce an end-to-end pipeline for retrieving, processing, and extracting targeted information from legal cases. We investigate an under-studied legal domain with a case study on refugee law in Canada. Searching case law for past similar cases is a key part of legal work for both lawyers and judges, the potential end-users of our prototype. While traditional named-entity recognition labels such as dates provide meaningful information in legal work, we propose to extend existing models and retrieve a total of 19 useful categories of items from refugee cases. After creating a novel data set of cases, we perform information extraction based on state-of-the-art neural named-entity recognition (NER). We test different architectures including two transformer models, using contextual and non-contextual embeddings, and compare general purpose versus domain-specific pre-training. The results demonstrate that models pre-trained on legal data perform best despite their smaller size, suggesting that domain matching had a larger effect than network architecture. We achieve an F1 score above 90% on five of the targeted categories and over 80% on four further categories.


Introduction
The retrieval of similar cases and their analysis is a task at the core of legal work. Legal search tools are widely used by lawyers and counsels to write applications and by judges to inform their decision-making process. However, this task poses a series of challenges to legal professionals: (i) it is an expensive and time-consuming task that accounts for 30% of the legal work on average (Poje, 2014), (ii) databases can be very large, with legal search tools gathering billions of documents, and (iii) selection of cases can be imprecise and may return many irrelevant cases, which creates the need to read more cases than necessary.
In Canada, from the date of the first claim to the final decision outcome, a claimant can expect to wait 24 months for refugee claims and 12 months for refugee appeals. Long processing times are due to a significant backlog and to the amount of work required from counsels who help claimants file their claims, and who are frequently legal aid or NGO employees.
We find that these challenges are well-suited for NLP-based solutions and investigate the feasibility of automating all steps of the legal search for past similar cases. We construct an end-to-end pipeline that aims at facilitating this multi-step process, thereby supporting and speeding up the work of both lawyers and judges in Refugee Status Determination (RSD). We provide a level of granularity and precision that goes beyond that of existing legal search tools such as Westlaw, LexisNexis, or Refworld (Custis et al., 2019), which operate at the document level. Refworld is an online database maintained by the United Nations which helps retrieve relevant precedent cases and legislation. However, the level of precision with which one can search for cases is limited. Moreover, our pipeline guarantees increased transparency, enabling end users to choose the criteria of legal search they find most relevant to their task among the proposed categories that act as filters for a search.
Specific literature studying refugee law and AI is sparse. Attention has been given to the classification and prediction of asylum cases in the United States (Chen and Eagel, 2017; Dunn et al., 2017). On Canadian data, research has been conducted to analyze the disparities in refugee decisions using statistical analysis (Rehaag, 2007, 2019; Cameron et al., 2021). However, those studies rely mostly on tabular data. We propose to work directly on the text of refugee cases. To the best of our knowledge, no previous work implements an end-to-end pipeline and state-of-the-art NLP methods in the field of refugee law.
We provide an NLP-based end-to-end prototype for automating refugee case analysis built on historical (already decided) cases, which are currently available only in unstructured or semi-structured formats and which represent the input data to our pipeline. The end goal of our approach is to add structure to the database of cases by extracting the targeted information described in table 1 from the case documents and providing the results in a structured format, significantly enriching the search options for cases. The input data set of cases is thereby described in a structured manner based on our extracted categories of items, adding extensive capabilities for legal search.
The pipeline described in figure 1 begins by searching and downloading cases (information retrieval, paragraph 4.1), pre-processing them (paragraph 4.2), and extracting items previously identified as relevant by legal professionals. It then outputs a structured, precise database of refugee cases (information extraction, paragraph 4.3). In the information extraction step, we test different training and pre-training architectures in order to determine the best methods to apply to the refugee case documents. We construct each step with the aim of minimizing the need for human effort in creating labeled training data, while aiming to achieve the best possible accuracy on each extracted information item. We discuss technical choices and methodologies in section 5. Finally, we evaluate the information extraction step on precision, recall, and F1 score, and present detailed results in section 6.
We demonstrate that annotation can be sped up by the use of a terminology base while incorporating domain knowledge and semi-automated annotation tools. We find that domain matching is important for training to achieve the highest possible scores. We reach satisfactory token classification results on a majority of our chosen categories. The contributions of this paper are as follows:
1. First, we retrieve 59,112 historic decision documents (dated from 1996 to 2022) from online services of the Canadian Legal Information Institute (CanLII) based on context-based indexing and metadata to curate a collection of federal Refugee Status Determination (RSD) cases. Our automated retrieval process is exhaustive and comprises all available cases. It is superior to human-based manual retrieval in terms of error proneness and processing time.
2. Second, we propose an information extraction pipeline that involves pre-processing, construction of a terminology base, labeling data, and using word vectors and NER models to augment the data with structured information. We fine-tune state-of-the-art neural network models on the corpus of our retrieved cases by training on newly created gold-standard text annotations specific to our defined categories of interest.
3. Lastly, we extract the targeted category items from the retrieved cases and create a structured database from our results. We introduce structure to the world of unstructured legal RSD cases and thereby increase the transparency of stated legal grounds, judge reasoning, and decision outcomes across all processed cases.

Background and motivation
At the core of the ongoing refugee crisis is the legal and administrative procedure of Refugee Status Determination (RSD), which can be summarized in three sub-procedures: (i) a formal claim for refugee protection by a claimant who is commonly supported by a lawyer, (ii) the decision-making process, and (iii) the final decision outcome. Refugee protection decisions are high-stakes procedures that concern 4.6 million asylum seekers worldwide as of mid-2022. In Canada alone, 48,014 new claims and 10,055 appeals were filed in 2021. As stated in the introduction, processing times of refugee claims vary and range from a few months to several years. One of the reasons for the long processing times is the effort required for similar case search. Case research is an essential part of the counsel's work in preparation for a new claim file. This search involves retrieving citations and references to previous, ideally successful RSD cases that exhibit similarities to the case in preparation, such as the country of origin or the reason for the claim. Equally, judges rely on researching previous cases to justify their reasoning and ensure coherency across rulings.
While each case exhibits individual characteristics and details, legal practitioners typically search for similarities based on the constitution of the panel, the country of origin and the characteristics of the claimant, the year the claim was made in relation to a particular geopolitical situation, the legal procedures involved, the grounds for the decision, the legislation, as well as other cases or reports that are cited.
Our work aims to support legal practitioners, both lawyers preparing the application file and judges having to reach a decision, by automating the time-consuming search for similar legal cases, referred to here as refugee case analysis. As a case study, we work on first instance and appeal decisions made by the Immigration and Refugee Board of Canada. A common approach used by legal practitioners is to manually search and filter past RSD cases on online services such as CanLII or Refworld by elementary document text search, which is a keyword-based exact-match search, or by date.
Our defined categories of interest are described in table 1. The labels have been defined and decided upon with the help of three experienced refugee lawyers. From the interviews, we curated a list of keywords, grounds, and legal elements determining a decision. Moreover, we analyzed a sample of 50 Canadian refugee cases recommended by the interviewees as representative across claim years and tribunals.
We use the pre-defined labels provided by spaCy's state-of-the-art EntityRecognizer class including DATE, PERSON, GPE, ORG, NORP, LAW and extend this list with new additional labels that we created and trained from scratch.
Each case document comprises a case cover page (the first page) and the main text, which differ in the type and format of their information content. Therefore, we chose separate labels for the case cover. The case cover contains general information about the case (cf. example in Appendix A). While the main text is presented as full-body text, the case cover page consists of semi-structured information that could be roughly described as tabular, except that it does not follow a clear layout. Based on the case cover page, we aim to extract meta-information about each claim using four labels (table 1).
For the main text, we chose 15 labels that represent characteristics reflective of similarity among different cases. To link cases to each other and later facilitate similar case retrieval, we also extract three categories of citations, i.e. LAW for legal texts, LAW_CASES for other mentioned past cases, and LAW_REPORT for external sources of information such as country reports. Additionally, the CREDIBILITY label retrieves mentions made of credibility concerns in the claimant's allegations, which tend to be among the deciding factors for the success of a claim and are hence essential to understand the reasoning that led to the legal determination at hand.
A successful implementation of a system capable of extracting this information reliably would provide several benefits to legal practitioners: (i) facilitating, speeding up, and focusing legal search, (ii) reducing the time spent on a claim and on providing relevant references, potentially resulting in a file that has more chances of being accepted, and (iii) for judges, to ensure consistent outcomes across time and different jurisdictions or claimant populations.

Research approach
Our approach is guided by the hypothesis that NER methods can be used to extract structured information from legal cases, i.e. we want to determine whether state-of-the-art methods can improve the transparency and processing of refugee cases. Consistency of the decision-making process and thorough assessment of legal procedure steps are crucial to ensuring that legal decision outcomes are transparent, high-quality, and well-informed. Consequently, the key research questions we address include:
Training data requirements: How many labeled samples are needed? Can keyword-matching methods or terminology databases be leveraged to reduce the need for human annotation?
Extraction: What methods are best suited to identify and extract the target information from legal cases?
Replicability: To what extent might our methods generalize to other legal data sets (other legal fields or other jurisdictions)?
Pre-training: How important is the pre-training step? How important is domain matching: does domain-specific pre-training perform better than general-purpose embeddings, despite smaller model sizes?
Architectures: How important is the architecture applied to the information extraction tasks, in terms of F1 score, precision, and recall?

Pipeline details and experimental setup

In this section, we detail each step of the pipeline as presented in figure 1 and how it compares to the current legal search process. Subsequently, in section 5 we describe the training data creation process and the network architectures tested. The code for our implementation and experiments can be found at https://github.com/clairebarale/refugee_cases_ner.

Information retrieval: case search
We retrieve 59,112 cases processed by the Immigration and Refugee Board of Canada, ranging from 1996 to 2022. The case documents have been collected from CanLII in two formats, PDF and HTML. The CanLII web interface serves queries through their web API accessible at the endpoint with URL https://www.canlii.org/en/search/ajaxSearch.do.
For meaningful queries, the web API exposes a number of HTTP GET request parameters and corresponding values, which are appended to the URL after a single question mark and concatenated by a single ampersand each. For instance, with the parameter=value pairs in the following example, the keyword search exactly matches the text REFUGEE, and we retrieve the second page of a paginated list of decisions from March 2004 sorted by descending date, which returns a JSON object (full query: https://www.canlii.org/en/search/ajaxSearch.do?type=decision&ccId=cisr&text=EXACT(REFUGEE)&startDate=2004-03-01&endDate=2004-03-31&sort=decisionDateDesc&page=2). Note that CanLII applies pagination to the search results in order to limit the size of returned objects per request.
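The query construction described above can be sketched in a few lines. The endpoint and parameter names are taken from the example query in the text; the helper function itself is illustrative:

```python
from urllib.parse import urlencode

# Base endpoint of CanLII's search API (from the example query above).
BASE_URL = "https://www.canlii.org/en/search/ajaxSearch.do"

def build_canlii_query(start_date: str, end_date: str, page: int = 1) -> str:
    """Assemble a paginated decision-search URL for the CISR tribunal.

    Parameter names mirror the example query in the text; values other
    than the dates and page number are kept fixed in this sketch.
    """
    params = {
        "type": "decision",
        "ccId": "cisr",            # Immigration and Refugee Board of Canada
        "text": "EXACT(REFUGEE)",  # exact keyword match
        "startDate": start_date,
        "endDate": end_date,
        "sort": "decisionDateDesc",
        "page": page,              # CanLII paginates the result list
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_canlii_query("2004-03-01", "2004-03-31", page=2)
```

Each such URL would then be fetched and its JSON payload parsed to collect document links and metadata, page by page.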

Preprocessing
We obtain two sets: (1) a set of case covers that consists of semi-structured data and displays meta-information and (2) a set of main texts that contains the body of each case, in full text.
Generally, the CanLII database renders the decision case documents as HTML pages for display in modern web browsers but also provides PDF files. We use PyPDF2 for parsing the contents of PDF files as text. To parse the contents of HTML files as text input to our NLP pipeline, we use the BeautifulSoup python library.
The choice between PDF and HTML format rests on several considerations, as each format has its own advantages and disadvantages. First, depending on the text format, PyPDF2 occasionally adds excessive white space between letters of the same word. Also, the PDF document is parsed line-by-line from left to right, top to bottom. Therefore, multi-column text is often mistakenly concatenated into a single line of text. However, the available PDF documents are separated by pages, and PyPDF2 provides functionality to select individual document pages, which we used to select the case cover page that provides case details for each document. HTML as a markup language provides exact anchors with HTML tags, which, in most cases, are denoted by opening and closing tag parts such as <p> and </p> for enclosing a paragraph.
When processing the main text of each case document, we parse the HTML files using BeautifulSoup, remove the case cover to keep only the full-body text, and tokenize the text by sentence using NLTK. Tokenizing by sentence facilitates the annotation process while keeping the majority of the context. We also experimented with splitting by paragraph, which yielded relatively large chunks of text, whereas splitting by phrase did not keep enough context during the annotation process. To gather results, we create a pandas DataFrame with one sentence per row and save it to a CSV file.
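As a rough sketch of this step, the following stand-in splits a case body into sentences and writes one row per sentence. The paper uses NLTK's sentence tokenizer and pandas; this simplified version relies only on the standard library, and the regex splitter and column names are illustrative (a naive splitter like this mishandles abbreviations such as "No. 5", which is exactly why NLTK is preferable in practice):

```python
import csv
import re

def split_sentences(text: str) -> list[str]:
    # Simplified stand-in for NLTK's sent_tokenize: split on sentence-final
    # punctuation followed by whitespace and an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def main_text_to_csv(case_id: str, body: str, path: str) -> int:
    """Write one (case_id, sentence) row per sentence, mirroring the
    pandas-based step described in the text. Returns the sentence count."""
    sentences = split_sentences(body)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["case_id", "sentence"])
        for s in sentences:
            writer.writerow([case_id, s])
    return len(sentences)
```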
For the case cover, we exploit PyPDF2's functionality to extract the text of the first page from the PDF format. In contrast, when using BeautifulSoup we could not rely on HTML tags (neither through generic tag selection nor by CSS identifier (ID) or CSS class) to retrieve the first page of the document robustly. After extracting this page for each case, we parse the PDF files as plain text. Combined with the metadata from the document retrieval provided by CanLII, we derive the case identifier number and assign it to the corresponding PDF file. As a next step, and similar to the procedure for the main body of each document, we create a pandas DataFrame from the extracted data and save it as a CSV file with case identifier numbers and their associated case cover.
For both file formats, we perform basic text cleaning, converting letters to lowercase, and removing excessive white space and random newlines.
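A minimal version of this cleaning step might look as follows; the exact cleaning rules used in the pipeline may differ:

```python
import re

def clean_text(text: str) -> str:
    """Basic cleaning as described in the text: lowercase, drop stray
    newlines introduced by PDF parsing, and collapse excessive whitespace."""
    text = text.lower()
    text = text.replace("\n", " ")       # random newlines -> spaces
    text = re.sub(r"\s{2,}", " ", text)  # collapse runs of whitespace
    return text.strip()
```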

Information extraction
The goal of our pipeline is not only to retrieve the cases but to structure them with a high level of precision and specificity, and to output a tabular file where each column stores specific information of each of our target types for each case. Using such structured information, legal practitioners can find similar cases with ease by selecting attributes in one or several of the extracted categories, instead of carefully reading through many cases to find a few that resemble their own.
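To illustrate how the structured output supports attribute-based search, here is a hypothetical excerpt of the tabular file and a small filter over it. The column names and values are illustrative, not the paper's exact schema:

```python
import csv
import io

# Hypothetical excerpt of the structured output; columns named after
# a few of the extraction categories for illustration only.
STRUCTURED_CSV = """case_id,GPE,DETERMINATION,CREDIBILITY
2010-001,Nigeria,allowed,credibility concerns noted
2011-042,Hungary,dismissed,
2012-117,Nigeria,dismissed,inconsistent testimony
"""

def filter_cases(csv_text: str, **criteria: str) -> list[dict]:
    """Return rows matching every given attribute filter, mimicking a
    practitioner selecting extracted categories as search filters."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [r for r in rows
            if all(r.get(col) == val for col, val in criteria.items())]

hits = filter_cases(STRUCTURED_CSV, GPE="Nigeria", DETERMINATION="dismissed")
```

A search combining several categories narrows the candidate set before any case is read in full.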
We chose neural network approaches to perform the information extraction step. After some experimentation, approaches such as simple matching and regular expression search proved too narrow and unsuitable for our data: given the diversity of formulations and layouts, capturing context is quite important. Similarly, we discard unsupervised approaches based on the similarity of the text at the document or paragraph level because we favor transparency to the end user, in order to enable leveraging legal practitioners' knowledge and expertise.
Extraction of target information can be done using sequence-labeling classification. NER methods are well-suited to the task of extracting keywords and short phrases from a text. To this end, we create a training set of annotated samples, as explained in section 5.1. Labeled sentences are collected in jsonlines format, which we convert to the binary spaCy-required format and use as training and validation data for our NER pipeline.
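For illustration, one labeled sentence in the jsonlines span format looks roughly like the record below, together with a sanity check of character offsets before conversion to the binary format. The field names follow the common Prodigy/spaCy convention, and the example sentence is invented:

```python
import json

# One labeled sentence in jsonlines span format: character offsets into
# "text" plus a label per span (a sketch, not the full schema).
record_line = json.dumps({
    "text": "The claimant is a citizen of Nigeria.",
    "spans": [{"start": 29, "end": 36, "label": "GPE"}],
})

def validate_record(line: str) -> bool:
    """Check that every span's offsets select a non-empty substring of the
    sentence, a useful sanity check before conversion to spaCy's format."""
    rec = json.loads(line)
    text = rec["text"]
    return all(0 <= s["start"] < s["end"] <= len(text) for s in rec["spans"])
```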

Training data creation
We choose to use a machine learning component for text similarity to reinforce the consistency of the annotations. In line with our pre-processing step, we annotate the case cover section and the main text separately. While we decided to annotate the whole page of the case cover, because the semi-structured nature of the text makes tokenization approximate, we perform annotation of the main text as a sentence-based task, preserving some context. We use the Prodigy annotation tool, which provides semi-automatic annotations and active learning in order to speed up and improve the manual labeling work in terms of consistency and accuracy of annotation. We convert the two pandas DataFrames containing the preprocessed text to jsonlines, which is the preferred format for Prodigy. We annotate 346 case covers and 2,436 sentences for the main text, chosen from the corpus at random.
To collect annotated samples on traditional NER labels (DATE, ORG, GPE, PERSON, NORP, LAW), we use suggestions from general purpose pre-trained embeddings. For the remaining labels (CLAIMANT_INFO, CLAIMANT_EVENT, PROCEDURE, DOC_EVIDENCE, EXPLANATION, DETERMINATION, CREDIBILITY), and still with the aim of improving consistency of annotation, we create a terminology base (as shown in the pipeline description, figure 1). At annotation time, patterns are matched against the displayed sentences. The human annotator only corrects them, creating a gold standard set of sentences and considerably speeding up the labeling task.
To create a terminology base for each target category, we first extract keywords describing cases from CanLII metadata retrieved during the information retrieval step. To this initial list of tokens, we add a list of tokens that were manually flagged in cases by legal professionals. We delete duplicates and some irrelevant or too general words such as "claimant" or "refugee", and manually assign the selected keywords to the appropriate label to obtain a list of tokens and short phrases per label. In order to extend our terminology base, we use the sense2vec model (based on word2vec (Mikolov et al., 2013)) to generate similar words and phrases. We select every word that is at least 70% similar to the original keyword in terms of cosine similarity and obtain a JSON file that contains 1,001 collected patterns. This method allows us to create a larger amount of labeled data compared to fully manual annotation in the same amount of time.
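The 70% cosine-similarity cut-off can be sketched as follows. The toy three-dimensional vectors stand in for sense2vec embeddings, and the vocabulary is invented:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_terminology(seed: str, vectors: dict, threshold: float = 0.7):
    """Keep every candidate at least `threshold` cosine-similar to the
    seed keyword, as done with sense2vec in the terminology-base step."""
    seed_vec = vectors[seed]
    return sorted(w for w, vec in vectors.items()
                  if w != seed and cosine(seed_vec, vec) >= threshold)

# Toy 3-d vectors standing in for real sense2vec embeddings.
toy = {
    "affidavit":  (0.9, 0.1, 0.0),
    "deposition": (0.8, 0.2, 0.1),
    "holiday":    (0.0, 0.1, 0.9),
}
similar = expand_terminology("affidavit", toy)
```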
Table 1 describes the breakdown of labels in our annotated data. There is a clear imbalance across categories of items, with some labels being infrequent (NORP, DETERMINATION, PERSON, LAW_REPORT, LAW_CASE). Some labels appear very few times per case: DETERMINATION occurs only once per case, and PERSON does not occur frequently since most cases are anonymized.

Experimental conditions and architectures
Train, dev, test split We trained the NER models using 80% of the labeled data as our training set (276 case covers and 1,951 sentences for the main text, respectively), 10% of the labeled data as our development set (35 case covers and 244 sentences) and 10% of the labeled data as the test set for evaluation (35 case covers and 244 sentences).
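A simple way to reproduce such an 80/10/10 split is sketched below; the paper does not specify its exact shuffling procedure, so the fixed seed here is an assumption for reproducibility:

```python
import random

def train_dev_test_split(items, seed: int = 42):
    """Shuffle and split a collection 80/10/10, as in the setup above."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(items)
    n = len(items)
    n_train = int(0.8 * n)
    n_dev = int(0.1 * n)
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test
```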

Pre-training static and contextual embeddings
As the first layer of the NER network, we add pre-trained character-level embeddings in order to isolate the effect of pre-training from the effect of the architecture and improve the F1 score on target items. We fine-tune GloVe vectors ((Pennington et al., 2014), 6B tokens, 400K vocabulary, uncased, 50 dimensions) on our data using the Mittens python package (Dingwall and Potts, 2018) and create 970 static vectors. On top of the generated static vectors, we add dynamic contextualized vectors using pre-trained embeddings based on BERT (Devlin et al., 2019), updating weights on our corpus of cases. Because the text of the case cover is presented in a semi-structured format, we consider it unnecessary to perform pre-training there, given the lack of context around the target items.
Architectures We experiment with five different architectures on the case cover and seven different architectures on the main text: five based on convolutional neural networks (CNN) using different word embeddings and two transformer architectures. We train a CNN without added vectors as a baseline. Only the transformer architectures require training on a GPU. We use the spaCy pipelines (tokenizer, CNN, and transformer) and the HuggingFace datasets. All CNNs use an Adam optimizer. Since the sentence-labeling task is well-suited to the masked language modeling objective, we chose to experiment with roBERTa (Liu et al., 2019) and LegalBERT (Chalkidis et al., 2020) in order to compare performance between a general content and a legal content model. We train separately on the case cover, the traditional NER labels (GPE, NORP, ORG, DATE, PERSON, LAW), and the labels we created from scratch, since we observed that labels trained from scratch benefit from a lower learning rate (0.0005 versus 0.001 for the traditional labels).

Results and evaluation
Our experimental results are presented in table 2 in absolute terms and relative to the baseline in figure 3 below. Our chosen baseline is a CNN with no additional vectors. We present results per label because of the disparities in the scores. The upper rows contain results on the case cover and the lower rows results on the main text. The evaluation metrics applied serve a dual purpose: for future research, achieving a high F1 score and precision-recall balance is key, while for our potential legal end users we assume that the recall measure is much more important, as it measures how many of the true entities were retrieved.
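The entity-level metrics used throughout this section follow the usual definitions and can be computed from true-positive, false-positive, and false-negative counts:

```python
def prf1(tp: int, fp: int, fn: int):
    """Entity-level metrics: precision is the share of predicted entities
    that are correct; recall is the share of gold entities that were found;
    F1 is their harmonic mean."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```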
For the case cover, we obtain satisfactory results on all labels, with F1 scores above 90% for three of them and 84.78% for name extraction. Apart from names, CNN architectures perform better, with dates achieving the highest score with randomly initialized embeddings. We explain this with the specific layout of this page (Annex A). The only gain of using a transformer-based model is a higher recall compared to the CNN-based architectures.
For the main text, results vary across labels: we obtain a score above 80% for DATE, GPE, PERSON, and ORG with the best score on roBERTa, but legal-bert-base-uncased scores lower than 60% on EXPLANATION, LAW, and LAW_CASE. Overall, when using transformers, we observe a better precision-recall balance.
Results on three labels, DETERMINATION, LAW_REPORT, and NORP, are unreliable because of the limited sample both for training and testing. DETERMINATION appears only once per case, and LAW_REPORT appears in a few cases only. Further annotation would require selecting the paragraphs of cases where these items appear to augment the size of the sample. We leave this task to future work.
Explanations for other low scores are partly to be found in the tokenization errors reported during the human-labeling task. Figure 2 shows an example of wrong tokenization on two categories, LAW and LAW_CASE, for which we believe bad tokenization is the primary explanation for low scores (similarly reported by Sanchez). In the first sentence of the figure, words are not correctly split between "under" and "section" and between the section number and "of". In the lower part of the figure, sentence tokenization does not correctly split the case reference because it is confused by the dot present in the middle. In this example, the case name is displayed as three different sentences, making the labeling task impossible.
The most appropriate pre-training varies across labels: for categories on which the CNN performs best, such as CREDIBILITY, DOC_EVIDENCE, and LAW, we find that fine-tuning static vectors performs better than randomly initialized embeddings or dynamic vectors, which suggests that context was not essential when retrieving these spans of text (pre-training relies on tri-grams). This could derive from the terminology-based annotation method used for those labels. While the target items may contain particular vocabulary, such as "personal information form" for DOC_EVIDENCE, context is of minimal importance since those phrases would not appear in another context or under another label. On the contrary, context seems much more important for retrieving procedural steps (PROCEDURE), which is the only category where the pre-training layer with contextual embeddings significantly increases the F1 score.
In the majority of categories, we find that the content of the pre-training is important (CLAIMANT_EVENT, CREDIBILITY, DATE, DOC_EVIDENCE, EXPLANATION, LAW, PROCEDURE): on these labels, the domain-specific LegalBERT achieves the best results. In other categories, roBERTa performs better than LegalBERT and CNNs, suggesting that the size of the pre-trained model is more important than domain matching there. While LegalBERT was pre-trained on 12GB of text, roBERTa used over 160GB and outperforms LegalBERT on traditional NER labels (GPE, ORG, PERSON, and also CLAIMANT_INFO and LAW_CASE).
Looking at recall measures only, the superiority of transformer architectures over CNNs is more significant, with only three categories (DOC_EVIDENCE, CLAIMANT_INFO, LAW) achieving their best recall score with a CNN architecture and legal pre-training. Comparing results on recall, we reach the same conclusion as with F1, i.e. that domain matching allows us to achieve higher scores on target categories. Indeed, for seven out of the 12 categories analyzed for the main text, the best scores are achieved by two architectures that differ in their pre-training domain. The higher F1 and recall scores, obtained through comparison and observation, enable us to attribute the improved performance primarily to the domain of the training data.

Related work
Because of the importance of language and written text, applications of NLP in law hold great promise in supporting legal work, as extensively reviewed by Zhong et al. However, because of the specificity of legal language and the diversity of legal domains, as demonstrated in our work by the results of the LegalBERT-based transformer, general approaches aiming at structuring legal text such as LexNLP (Bommarito II et al., 2021) or general legal information extraction (Brüninghaus and Ashley, 2001) are unfit for specific domains such as international refugee law and are not able to achieve a high degree of granularity.
Earlier methods of statistical information extraction in law include the use of linear models such as maximum entropy models (Bender et al., 2003; Clark, 2003) and hidden Markov models (Mayfield et al., 2003). However, state-of-the-art results are produced by methods able to capture some context, with an active research community investigating the use of conditional random fields (Benikova et al., 2015; Faruqui et al., 2010; Finkel et al., 2005) and BiLSTMs (Chiu and Nichols, 2016; Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Leitner et al., 2019) for legal applications.
Scope and performance increased with the introduction of new deep learning architectures using recurrent neural networks (RNN), CNNs, and attention mechanisms, as demonstrated by Chalkidis et al., even though we find that transformers do not always perform best on our data. NER has also been applied to legal cases (Vardhan et al., 2021), which achieved a total F1 score of 59.31% across labels, citations, as well as events, below our reported scores. Similar case matching is a well-known application of NLP methods, especially in common law systems (Trappey et al., 2020) and in domains such as international law. The Competition on Legal Information Extraction/Entailment includes a case retrieval task, which proves that there is much interest in this area from both researchers and developers of applications. While research has been conducted to match cases at the paragraph level (Tang and Clematide, 2021; Hu et al., 2022), we find that our approach is more transparent and shifts the decisions regarding which filters to choose to legal practitioners, which we believe is appropriate to enable productive human-machine collaboration in this high-stakes application domain.

Conclusion and future work
Our pipeline identifies and extracts diverse text spans, which may vary in quality across different categories. We acknowledge that certain entities we identify are more valuable than others for legal search purposes. Additionally, due to the complexity of the text, some noise is to be expected. However, this noise does not hinder the search for relevant items. Users have the flexibility to search and retrieve cases using any combination of our 19 categories of flagged entities. Additionally, work is required to evaluate the prototype with legal practitioners beyond traditional machine learning metrics (Barale, 2022). However, we believe the work presented here is an important first step and has the potential to inform future NLP applications in refugee law. Our approach provides significant contributions with newly collected data, newly created labels for NER, and a structure given to each case based on lawyers' requirements, with nine categories of information being retrieved with an F1 score higher than 80%. Compared to existing case retrieval tools, our pipeline enables end users to decide what to search for based on defined categories and to answer the question: what are the criteria of similarity to my new input case?

Limitations
In this section, we enumerate a few limitations of our work: • We believe that the need to train transformer architectures on a GPU is an obstacle to the adoption of this pipeline, which is intended to be used not in an academic environment but by legal practitioners.
• Because of the specificity of each jurisdiction, generalizing to other countries may not be possible for all labels with the exact same models (for example, in extracting the names of tribunals).
• The manual annotation process is a weakness: while it results in gold-standard annotations, it is very time-consuming. We acknowledge that the amount of training data presented in this work is low and that collecting more annotations in the future would improve the quality of the results. It would be interesting to look at self-supervised methods, weak supervision, and annotation generation. The need for labeled data also prevents easy replication of the pipeline on new data sets, which would likewise require manual annotation.
• More precisely on the extracted categories, some categories lack precision and would require additional processing steps to achieve satisfactory results. For example, the category PERSON sometimes refers to the claimant or their family, but sometimes refers to the name of the judge.

The tools and artifacts used are publicly available. Prodigy is the only tool used that is not freely available; we were granted an academic research license.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?
The retrieved documents are to be used for research purposes, which is the use we made of the dataset. The created artifacts (code and NER model) do not give direct access to the retrieved documents.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it? This work does not directly give access to the data but to trained models of named-entity recognition. The data is provided by the Canadian Legal Information Institute and is publicly available online.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Not applicable. Left blank.
B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.
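The point about small test sets can be made concrete with a paired bootstrap check. This is a sketch on synthetic data, not an experiment from this paper; the function name and the 50-example setup are our own invention.

```python
import random

# Paired bootstrap: how often does a small observed accuracy gap
# between two models survive resampling of a small test set?
# Synthetic data only -- not results from this paper.

def bootstrap_gap_confidence(correct_a, correct_b, n_resamples=2000, seed=0):
    """Fraction of bootstrap resamples in which model A stays ahead of B.
    correct_a / correct_b are aligned per-example 0/1 correctness lists."""
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gap = sum(correct_a[i] - correct_b[i] for i in idx)
        if gap > 0:
            wins += 1
    return wins / n_resamples

# 50-example test set: A is right on 40 examples, B on 37 of the same ones.
correct_a = [1] * 40 + [0] * 10
correct_b = [1] * 37 + [0] * 13
print(bootstrap_gap_confidence(correct_a, correct_b))
```

On a test set this small, a 3-point gap often fails to reach conventional confidence thresholds, which is exactly why split sizes must be reported alongside scores.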

Figure 2: Example of an error in tokenization

Figure 3: Comparison to the baseline on the F1 score on the main text, per targeted category (x-axis), for seven network architectures: baseline CNN model (baseline), CNN with random static vectors from en_core_web_lg (CNN+rsv), CNN with fine-tuned static vectors (CNN+fts), CNN with random static vectors and pre-training (CNN+rsv+pt), CNN with fine-tuned static vectors and pre-training (CNN+fts+pt), RoBERTa-based transformer, and LegalBERT-based transformer

Figure 4: Comparison to the baseline on the F1 score on the case cover, per targeted category (x-axis), for four network architectures: baseline CNN model (baseline), CNN with random static vectors from en_core_web_lg (CNN+rsv), RoBERTa-based transformer, and LegalBERT-based transformer

A1. Did you describe the limitations of your work? Not numbered, after the conclusion: 'Limitations' section

A2. Did you discuss any potential risks of your work? Section 8 and 'Limitations'

A3. Do the abstract and introduction summarize the paper's main claims? Abstract and section 1

A4. Have you used AI writing assistants when working on this paper? Left blank.

B. Did you use or create scientific artifacts? Section 4 and 5

B1. Did you cite the creators of artifacts you used? Section 4 and 5 (and footnotes)

B2. Did you discuss the license or terms for use and / or distribution of any artifacts?

Table 2: Precision (P), Recall (R), and F1 score (in %) on the cover page and the main text for seven network architectures: baseline CNN model (baseline), CNN with random static vectors from en_core_web_lg (CNN+rsv), CNN with fine-tuned static vectors (CNN+fts), CNN with random static vectors and pre-training (CNN+rsv+pt), CNN with fine-tuned static vectors and pre-training (CNN+fts+pt), RoBERTa-based transformer, and LegalBERT-based transformer

These results suggest that the domain of the pre-training data has a larger effect than differences in network architecture. More precisely, it seems that, on some categories (CREDIBILITY, DOC_EVIDENCE, LAW, PROCEDURE), pre-training on our own data is more effective than training on a general legal data set as in LegalBERT. This can be explained by the content LegalBERT is pre-trained on, which contains only US, European, and UK texts, no Canadian texts, and does not include any refugee cases.
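The per-category precision, recall, and F1 scores reported in Table 2 follow from span-level counts of true positives, false positives, and false negatives. A minimal sketch (the counts below are invented for illustration, not the paper's results; DATE and LAW are two of the paper's categories):

```python
# Compute per-category precision, recall, and F1 (in %) from
# span-level true positive / false positive / false negative counts.
# The counts here are made up for illustration.

def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return round(100 * p, 2), round(100 * r, 2), round(100 * f1, 2)

counts = {
    "DATE": (90, 5, 5),    # tp, fp, fn
    "LAW": (40, 10, 10),
}
for category, (tp, fp, fn) in counts.items():
    print(category, prf(tp, fp, fn))
```

Scores are computed per category rather than micro-averaged across all spans, which is what allows statements such as "F1 above 90% on five of the targeted categories".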