VECHR: A Dataset for Explainable and Robust Classification of Vulnerability Type in the European Court of Human Rights



Introduction
Vulnerability encompasses a state of susceptibility to harm or exploitation, particularly among individuals or groups who face a higher likelihood of experiencing adverse outcomes due to factors such as age, health, disability, or marginalized social position (Mackenzie et al., 2013; Fineman, 2016). While it is impossible to eliminate vulnerability, society has the capacity to mitigate its impact. The European Court of Human Rights (ECtHR) interprets the European Convention on Human Rights (ECHR) to address the specific contextual needs of individuals and provide effective protection. This is achieved through various means, such as displaying flexibility in admissibility issues and shifting the burden of proof (Heri, 2021). However, the concept of vulnerability remains elusive within the ECtHR. While legal scholars have explored vulnerability as a component of legal reasoning (Peroni and Timmer, 2013), empirical work in this area remains scarce and predominantly relies on laborious manual processes (Bommasani et al., 2022, §3.2). To address this challenge, NLP can offer valuable tools to assist experts in efficiently classifying and analyzing textual data. Beyond high classification performance, the true utility of NLP in the legal field lies in its ability to identify relevant aspects related to vulnerability in court cases. These aspects can be extracted, grouped into patterns, and used to inform a litigation strategy. Even so, a significant obstacle to progress in this area is the lack of appropriate datasets. To bridge these research gaps, we present the dataset VECHR, which comprises cases dealing with allegations under Article 3 and is obtained from a legal expert's empirical study. Our proposed task is to identify which type of vulnerability (if any) is involved in an ECtHR

Dependency
Including that of minors, the elderly, and those with physical, psychosocial and cognitive disabilities (i.e. mental illness and intellectual disability).

State Control
Including that of detainees, military conscripts, and persons in state institutions.

Victimisation
Due to victimisation, including by domestic and sexual abuse, other violations, or because of a feeling of vulnerability.

Migration
In the migration context, applies to detention and expulsion of asylum-seekers.

Discrimination
Due to discrimination and marginalisation, which covers ethnic, political and religious minorities, LGBTQI people, and those living with HIV/AIDS.

Reproductive Health
Due to pregnancy or situations of precarious reproductive health.

Unpopular Views
Due to the espousal of unpopular views.

Intersection
Intersecting vulnerabilities.

Table 1: Description of each type of vulnerability. For more details, please refer to App A.
case. Previous work (Chalkidis et al., 2021) investigated the explainability of ECtHR allegation prediction on an annotated dataset of 50 cases by identifying the allegation-relevant paragraphs of judgment facts sections, and found fairly limited alignment between the models' and the experts' rationales. Santosh et al. (2022) show that these models are drawn to shallow spurious predictors if not deconfounded. Both works derived and evaluated the explanation rationales at the fairly coarse paragraph level, which may over-estimate alignment. As model explainability is crucial for establishing trust, we extend the dataset with VECHR explain, a token-level explanation dataset annotated by domain experts on a subset of VECHR. Token-level annotation avoids the potential overestimation of alignment in these paragraph-level evaluations (Chalkidis et al., 2021; Santosh et al., 2022). Furthermore, the understanding and application of vulnerability in court proceedings change over time, reflecting societal shifts and expanding to encompass a wider range of types (Fig 1a

Vulnerability Typology in ECtHR
The inescapable and universal nature of vulnerability, as posited by Fineman (2016), underscores its significance in legal reasoning. For instance, the European Union has acknowledged vulnerability by establishing a definition for vulnerable individuals (Dir, 2013). However, the concept of vulnerability remains undefined within the context of the ECtHR.
To facilitate an examination of vulnerability and its application within the ECtHR, it is crucial to establish a typology of vulnerability recognized by the Court. Several scholars have endeavored to effectively categorize vulnerability for this purpose (Timmer, 2016; Limantė, 2022).

Vulnerability Type Annotation
We follow the typology and methodology presented by Heri (2021). She considered cases as "vulnerable-related" only when "vulnerability had effectively been employed by the Court in its reasoning". For cases under Article 3, we used the labelling provided by Heri's protocol.
For VECHR challenge, we asked two expert annotators to label the cases following Heri's methodology. Each annotator annotated 141 cases.
Inter-Annotator Agreement: To ensure consistency with Heri's methodology, we conducted a two-round pilot study before proceeding with the annotation of the challenge set (details in App G). In each round, the two annotators independently labelled 20 randomly selected cases under Article 3, and we compared their annotations with Heri's labels. The inter-annotator agreement was calculated using Fleiss' kappa. It increased from 0.39 in the first round to 0.64 in the second round, indicating substantial agreement in a multi-label setting with seven labels and three annotators.
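The agreement computation described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes that the multi-label annotations are flattened into one binary rating per (case, label) pair before Fleiss' kappa is applied, which is only one of several plausible ways to handle the multi-label setting. The function names are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (same rater count per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_category)

    # Undefined (division by zero) if all ratings fall into one category.
    return (p_observed - p_expected) / (1.0 - p_expected)


def multilabel_to_counts(annotations, n_labels):
    """Assumption: treat each (case, label) pair as one item with the two
    categories [absent, present]. annotations[r][c] is the set of label
    indices that rater r assigned to case c."""
    n_raters = len(annotations)
    n_cases = len(annotations[0])
    counts = []
    for case in range(n_cases):
        for label in range(n_labels):
            present = sum(1 for rater in annotations if label in rater[case])
            counts.append([n_raters - present, present])
    return counts
```

With three raters (Heri plus the two annotators) and seven labels, `multilabel_to_counts(annotations, 7)` would yield 7 items per case, each rated by all three.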

Explanation Annotation Process
The explanation annotation process was done using the GLOSS annotation tool (Savelka and Ashley, 2018); see App H for details. Based on the case facts, the annotator was instructed to identify relevant text segments that indicate the involvement of a specific vulnerability type in the court's reasoning. The annotator was permitted to highlight the same text span as an explanation for multiple vulnerability types.

Dataset Analysis
Tab 2 presents the key statistics of our dataset. VECHR comprises a total of 1,070 documents, with an average of 4,765 tokens per case (σ = 4,167). Of these, 788 cases fall under Article 3 and 282 are non-Article 3 cases. Among all documents, 530 are considered "non-vulnerable", meaning they are not labelled with any of the seven vulnerability types. In the vulnerable-related cases, the average number of labels assigned per document is 1.54. We observe a strong label distribution imbalance within the dataset. The label "state control" dominates the dataset, accounting for 33% of the cases, while the least common label, "reproductive health", represents only 3% of the cases. Tab 3 presents details regarding the label imbalances. App I presents more detailed statistics of our dataset.
et al., 2020). We finetune pre-trained models on our dataset with a multi-label classification head, truncating the input to BERT's maximum length of 512 tokens.

Longformer (Beltagy et al., 2020): We finetune Longformer on our dataset. It allows for processing up to 4,096 tokens, using a sparse-attention mechanism that scales linearly, instead of quadratically, with the input length.
Hierarchical: We employ a hierarchical variant of pre-trained LegalBERT to deal with the input-length limitation. We use a greedy input packing strategy in which we merge multiple paragraphs into one packet until it reaches the maximum of 512 tokens. We use pre-trained transformer models to independently encode each packet of the input text and obtain a representation (h_[CLS]) for each packet. Then we apply a transformer encoder (non-pretrained) to make the packet representations context-aware, considering the surrounding packets. Finally, we apply max-pooling over the context-aware packet representations to obtain the final representation of the case facts, which is then passed through a classification layer. For details on all models' configuration and training, please refer to App J.
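The greedy packing step can be illustrated with a short sketch. It is not the authors' implementation. The whitespace token counter is a stand-in for the model tokenizer, and letting an over-long paragraph occupy its own packet (to be truncated later by the encoder) is an assumption, since the text does not say how such paragraphs are handled.

```python
def greedy_pack(paragraphs, max_tokens=512,
                count_tokens=lambda text: len(text.split())):
    """Merge consecutive paragraphs into packets of at most max_tokens.

    count_tokens is a hypothetical stand-in for a real tokenizer.
    A single paragraph longer than max_tokens ends up alone in a packet.
    """
    packets, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n = count_tokens(paragraph)
        # Close the current packet if this paragraph would overflow it.
        if current and current_len + n > max_tokens:
            packets.append(current)
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n
    if current:
        packets.append(current)
    return packets
```

Because paragraphs are never split or reordered, each packet stays a contiguous stretch of the case facts, which keeps the later packet-level position embeddings meaningful.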
Evaluation Metrics: We report micro-F1 (mic-F1) and macro-F1 (mac-F1) scores for 7+1 labels, where 7 labels correspond to the 7 vulnerability types under consideration and one additional label is added during evaluation to indicate non-vulnerable cases.
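As a rough illustration of this 7+1 evaluation, the sketch below appends the extra "non-vulnerable" column and computes both scores in plain Python. The helper names are hypothetical, and scoring a label with no true or predicted positives as 0 in the macro average is an assumption.

```python
def add_non_vulnerable(rows):
    """Append one column that is 1 only when no vulnerability type is set."""
    return [row + [0 if any(row) else 1] for row in rows]


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumption: empty label scores 0


def micro_macro_f1(y_true, y_pred):
    """y_true and y_pred are lists of equal-length binary label vectors."""
    n_labels = len(y_true[0])
    stats = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] and p[j])
        fp = sum(1 for t, p in zip(y_true, y_pred) if not t[j] and p[j])
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] and not p[j])
        stats.append((tp, fp, fn))
    # Micro-F1 pools the counts over labels; macro-F1 averages per-label F1.
    micro = f1(*(sum(s[i] for s in stats) for i in range(3)))
    macro = sum(f1(*s) for s in stats) / n_labels
    return micro, macro
```

Given the strong label imbalance in the dataset, the macro score weights rare labels such as "reproductive health" as heavily as "state control", while the micro score is dominated by the frequent labels.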
Results: Tab 4 reports the classification performance. We observe that legal-specific pre-training improved performance over general pre-training. However, BERT models still face the input-length constraint. Both the Longformer and Hierarchical models improved over the truncated variants and are comparable to each other. Overall, performance is low across models, highlighting how challenging the task is.

Vulnerability Type Explanation
We use Integrated Gradients (IG) (Sundararajan et al., 2017) to obtain token-level importance from the model with respect to each vulnerability type under consideration. We max-pool over sub-words to convert token-level IG scores into word-level scores, followed by a threshold-based binarization. Tab 4 reports explainability performance, expressed as the average κ between the models' focus and the experts' annotations over the test instances. We observe that explainability scores across models follow the trend of their classification scores and likewise reflect the challenging nature of the task.
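The post-processing of the attribution scores can be sketched as below. The IG scores themselves are taken as given. The `word_ids` convention (one word index per sub-word token, `None` for special tokens) is an assumed input format, the threshold value is left to the caller because the text does not state it, and using Cohen's kappa for the binary model-versus-expert comparison is also an assumption.

```python
def pool_to_words(token_scores, word_ids):
    """Max-pool sub-word scores into one score per word.

    word_ids[i] is the index of the word that token i belongs to,
    or None for special tokens, which are skipped.
    """
    pooled = {}
    for score, wid in zip(token_scores, word_ids):
        if wid is None:
            continue
        pooled[wid] = max(score, pooled.get(wid, float("-inf")))
    return [pooled[w] for w in sorted(pooled)]


def binarize(scores, threshold):
    """Mark a word as part of the rationale if its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def cohen_kappa(a, b):
    """Cohen's kappa between two binary sequences of equal length.

    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)
```

The per-instance kappa values would then be averaged over the test instances, as described above.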

Robustness to Distributional Shifts
We assess the robustness of models to distributional shift using the VECHR challenge set and present the performance in Tab 5. Notably, we observe a drop in performance on VECHR challenge compared to the test set. We attribute this to the models relying on suboptimal information about vulnerability types, which is primarily derived from the factual content rather than from a true understanding of the underlying concept. To address this limitation, we propose a Concept-aware Hierarchical model that considers both the case facts and the description of the vulnerability type to determine whether the facts align with the specified vulnerability type. We employ the greedy packing strategy described earlier and a hierarchical model to obtain context-aware packet representations for the facts and for the concept description separately. Subsequently, we apply a scaled dot-product cross-attention mechanism between the packet vectors of the facts (as queries) and of the concepts (as keys and values), generating a concept-aware representation of the packets in the facts. We then employ a transformer layer to capture the contextual information of the updated packet vectors, followed by max-pooling to obtain the concept-aware representation of the case facts, which is passed through a classification layer to obtain the binary label. The concept-aware model improved on the challenge set owing to the incorporation of the vulnerability type descriptions, which increases its robustness to distributional shifts. Overall, our results show promise for the feasibility of the task yet indicate room for improvement.
We present VECHR, a dataset consisting of 1,070 cases for classification of vulnerability type and 40 cases for explanation. We also release a set of baselines as a benchmark, revealing the challenges of achieving accuracy, explainability, and robustness in vulnerability classification. We hope that VECHR and the associated tasks will provide a challenging and interesting resource for Legal NLP researchers to advance research in vulnerability within the ECtHR, ultimately contributing to effective human rights protection.

Limitations
In our task, the length and complexity of the legal text require annotators with a deep understanding of ECtHR jurisprudence to identify vulnerability types. As a result, acquiring a large amount of annotation through crowdsourcing is not feasible, leading to limited-sized annotated datasets. Additionally, the high workload restricts us to collecting only one annotation per case. It is important to acknowledge that there is a growing body of work in mainstream NLP that highlights the presence of irreconcilable Human Label Variation (HLV) (Plank, 2022; Basile et al., 2021) in subjective tasks like natural language inference (NLI) (Pavlick and Kwiatkowski, 2019) and toxic language detection (Sap et al., 2022). Future work should address this limitation and strive to incorporate multiple annotations to capture a more comprehensive understanding of the concept of vulnerability.
This limitation is particularly pronounced because of the self-referential wording of the ECtHR (Fikfak, 2021). As the court uses similar phrases in cases against the same respondent state or alleging the same violation, the model may learn that these are particularly relevant, even though this does not represent the legal reality. In this regard, it is questionable whether cases of the ECtHR can be considered "natural language". Moreover, the wording of case documents is likely to be influenced by the decision or judgement of the court, because the documents are composed by court staff after the verdict. Awareness of the case's conclusion could affect the way its facts are presented, leading to the removal of irrelevant information or the highlighting of facts that were discovered during an investigation and are pertinent to the result (Medvedeva et al.). Ideally, the communicated cases, which are often published years before the case is judged, should be used instead. However, these come with their own limitations, as they only represent the facts as presented by the applicant and not by the defendant state. There are also significantly fewer communicated cases than decisions and judgements.
One of the main challenges when working with corpora in the legal domain is their extensive length. To overcome this issue, we employ hierarchical models, which have the limitation that tokens across long distances cannot directly interact with each other. This limitation of hierarchical models remains relatively unexplored, although some preliminary studies are available (e.g., see Chalkidis et al. 2022). Additionally, we choose to freeze the weights in the LegalBERT sentence encoder. This decision conserves computational resources and reduces the model's vulnerability to superficial cues.

Ethics Statement
The assessment of the ethical implications of the dataset is based on the Data Statements by Bender and Friedman (2018). Through this, we aim to establish transparency and a more profound understanding of limitations and biases. The curation is limited to the Article 3 documents in English. The speakers and annotators are legally trained scholars, proficient in the English language. "Speaker" here refers to the authors of the case documents, who are staff of the Court, rather than applicants. We do not believe that the labelling of vulnerable applicants is harmful because it is done from a legally theoretical perspective, intending to support applicants. The underlying data is based exclusively on the publicly available ECtHR documents on HUDOC. The documents are not anonymized and contain the real names of the individuals involved. However, we do not consider presenting the dataset harmful, given that this information is already publicly available.
Ethical considerations are of particular importance because the dataset deals with vulnerability and thus with people in need of special protection. In general, particular attention needs to be paid to ethics in the legal context to ensure that the values of equal treatment, justification and explanation of outcomes, and freedom from bias are upheld (Surden, 2019). We are conscious that, by adapting pre-trained encoders, our models inherit any biases they contain. The results we observed do not appear to relate substantially to such encoded bias. Nonetheless, attention should be paid to how models on vulnerability are employed in practice.
In light of the aforementioned limitations and the high stakes in a human rights court, we have evaluated the potential for misuse of the vulnerability classification models. Medvedeva et al. (2020) mention the possibility of reverse engineering the model to better prepare applications or defences. This approach is, however, only applicable in a fully automated system using a model with high accuracy. As this is not the case for the models presented, we assume the risk of circumventing legal reasoning to be low. On the contrary, we believe employing a high-recall vulnerability model could aid applicants and strengthen their legal reasoning. In a scholarly setting focused on vulnerability research, we do not think the model can be used in a detrimental way. Our research group is strongly committed to research on legal NLP models as a means to derive insight from legal data for purposes of increasing transparency, accountability, and explainability of data-driven systems in the legal domain.
We performed no pre-training on large datasets. Computational resources were used for fine-tuning and training the models, as well as for assessing the dataset. Based on partial logging of computational hours and idle time, we estimate an upper bound for the carbon footprint of 30 kg CO2 equivalents, which we consider an insignificant environmental impact.

A Vulnerability Typology and Descriptions
Here is the typology of vulnerability in the ECtHR (Heri, 2021):
• Dependency: dependency-based vulnerability, which concerns minors, the elderly, and those with physical, psychosocial and cognitive disabilities (i.e. mental illness and intellectual disability).
• State Control: vulnerability due to state control, including that of detainees, military conscripts, and persons in state institutions.
• Victimisation: vulnerability due to victimisation, including by domestic and sexual abuse, other violations, or because of a feeling of vulnerability.
• Migration: vulnerability in the migration context, which applies to detention and expulsion of asylum-seekers.
• Discrimination: vulnerability due to discrimination and marginalisation, which covers ethnic, political and religious minorities, LGBTQI people, and those living with HIV/AIDS.
• Reproductive Health: vulnerability due to pregnancy or situations of precarious reproductive health.
• Unpopular Views: vulnerability due to the espousal of unpopular views.
• Intersection: intersecting vulnerabilities.
The following is a detailed description of each type:
Dependency: Dependency-based vulnerability derives from the inner characteristics of the applicant and thus concerns minors, elderly people, as well as persons with physical, psychosocial and cognitive disabilities (i.e. mental illness and intellectual disability). The Court has built around these categories special requirements to be fulfilled by States as part of their obligations under the Convention.
Minors: The Court often refers to children as a paradigmatic example of vulnerable people and has made use of the concept of vulnerability to require States to display particular diligence in cases concerning child protection, given, on the one hand, their reduced ability and/or willingness to complain of ill-treatment and, on the other hand, their susceptibility to traumatic experiences/treatment.
Elderly: In many ways, vulnerability due to advanced age is a continuation of the vulnerability of children. All humans experience dependency at the beginning of life, and many experience it near the end.
Intellectual and Psychosocial Disabilities: Intellectual disability may render those affected dependent on others, be it the state or others who provide them with support. Having regard to their special needs in exercising legal capacity and going about their lives, the Court considered that such situations were likely to attract abuse. Persons living with intellectual disabilities experience difficulties in responding to or even protesting against violations of their rights. In addition, persons with severe cognitive disabilities may experience a legal power imbalance because they do not enjoy legal capacity.
State Control: Vulnerability due to state control includes detainees, military conscripts, and persons in state institutions. This type of vulnerability includes persons in detention, but also those who are institutionalised or otherwise under the sole authority of the state. When people are deprived of their liberty, they are vulnerable because they depend on the authorities both to guarantee their safety and to provide them with access to essential resources like food, hygienic conditions, and health care. In addition, the state often controls the flow of information and access to proof. Hence, the Court automatically applies the presumption of state responsibility when harm comes to those deprived of their liberty.
Victimisation: Vulnerability due to victimisation refers to situations in which harm is inflicted by someone else. This type of vulnerability applies to situations of domestic and sexual abuse, and other types of abuse. The Court has also found that a Convention violation may, in and of itself, render someone vulnerable. Crime victims who are particularly vulnerable, through the circumstances of the crime, can benefit from special measures best suited to their situation. Sexual and domestic violence are expressions of power and control over the victim, and inflict particularly intense forms of trauma from a psychological standpoint. Failing to recognise the suffering of the victims of sexual and domestic violence or engaging in a stigmatising response (such as, for example, the perpetuation of so-called 'rape myths') represents a secondary victimisation or 'revictimisation' of the victims by the legal system.
Migration: Vulnerability in the context of migration applies to detention and expulsion of asylum-seekers. Applicants who are asylum-seekers are considered particularly vulnerable based solely on the experience of migration and the trauma they were likely to have endured previously. The feelings of arbitrariness, inferiority and anxiety often associated with migration, as well as the profound effect of conditions of detention in special centres, indubitably affect a person's dignity. The status of the applicants as asylum-seekers is considered to require special protection because of their underprivileged (and vulnerable) status.
Discrimination: Vulnerability due to discrimination and marginalisation covers ethnic, political and religious minorities, LGBTQI people, and those living with HIV/AIDS. The Court recognises that the general situation of these groups (the usual conditions of their interaction with members of the majority or with the authorities) is particularly difficult and at odds with discriminatory attitudes. Similarly to dependency-based vulnerability, this type of vulnerability imposes special duties on states and depends not solely on the inner characteristics of applicants but also on their choices which, in most cases, states have to balance against other choices and interests.
Reproductive Health: Vulnerability due to pregnancy or situations of precarious reproductive health concerns situations in which women may find themselves in particularly vulnerable situations, even if the Court does not consider women vulnerable as such. This may be due to an experience of victimisation, for example in the form of gender-based violence, or due to various contexts that particularly affect women. Sometimes, depending on the circumstances, pregnancy may be reason enough for vulnerability, while at other times vulnerability is linked to the needs of the unborn children.
Unpopular Views: Vulnerability due to the espousal of unpopular views includes demonstrators, dissidents, and journalists exposed to ill-treatment by state actors due to the espousal of unpopular views. Where an extradition request shows that an applicant stands accused of religiously and politically motivated crimes, the Court considers this sufficient to demonstrate that the applicant is a member of a vulnerable group. Similarly to the case of victimisation, it is also the applicant's choice to display such views, and vulnerability comes from particular measures that states undertake when regulating or disregarding such choices.

B More details on Data Source and Collection Process
Heri (2021) reported the details of her case sampling process. The following is a summary of her methodology: She used the regular expression "vulne*" to retrieve all relevant English documents related to Article 3 from HUDOC, the public database of the ECtHR, excluding communicated cases and legal summaries, for the time span between the inception of the Court and 28 February 2019. This yielded 1,147 results. Heri recorded her labelling in an Excel sheet including the application number for each case. The application number serves as a unique identifier for individual applications submitted to the ECtHR.
To create VECHR, we fetch all the corresponding case documents from HUDOC, including their metadata.
During post-processing, when one case has multiple documents, we keep the latest document and discard the rest, which yields 788 documents.
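The keep-latest rule in this post-processing step can be sketched as follows. The record layout is hypothetical: the field names `application_number` and `date` are illustrative and are not taken from the HUDOC metadata schema.

```python
from datetime import date


def keep_latest(documents):
    """Keep one document per application number, preferring the latest date.

    Each document is a dict with the hypothetical keys
    'application_number' and 'date' (a datetime.date).
    """
    latest = {}
    for doc in documents:
        key = doc["application_number"]
        if key not in latest or doc["date"] > latest[key]["date"]:
            latest[key] = doc
    return list(latest.values())
```

Grouping by application number relies on it being a unique identifier for each application, as noted above.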

C Definition of Vulnerable-related
Heri (2021) specified that only cases where "vulnerability had effectively been employed by the Court in its reasoning" are regarded as "vulnerable-related". Cases in which vulnerability was used only in its common definition, or used only in the context of other ECHR rights or by third parties, were excluded. To ensure clarity and alignment with Heri's perspective, we communicated with her during the annotation process to clarify the definition of "vulnerable-related". As a result, we determined that vulnerability is labeled (primarily) in situations where:
• Vulnerability is part of the court's main legal reasoning.
• The alleged victims (or their children) exhibit vulnerability.
• The ECtHR, rather than domestic courts or other parties, considers the alleged victims vulnerable.

D Omitting "Intersectionality" Label
The vulnerability type "intersectionality" was omitted because of its unclear usage in cases. Even more than the other types, it is a highly nuanced concept without a clear legal definition. Furthermore, Heri (2021, 117) says that the ECtHR does not engage with the concept of intersectionality in this form. Given that there are only 11 cases exclusively annotated as "intersectional" (out of a total of 59), the effect of disregarding it in the following work is negligible. Omission does not suggest that intersectionality fails to play a role in the reasoning of the ECtHR or that we deem it irrelevant. Instead, we suggest exploring the concept of vulnerability as a combination of the other seven vulnerabilities.

F Justification of Article Applicability
Heri's (2021) typology is limited to Article 3 of the ECHR, which pertains to the Prohibition of Torture. The concept of vulnerability was first coined by the ECtHR under Article 3, given that it deals with inhuman, degrading treatment and torture, which represent prototypical contexts of vulnerability. As such, an initial exploration under Article 3 is reasonable.
Applying Heri's procedure to non-Article 3 cases is nonetheless justified according to our legal expert because Heri's underlying typology is based on Timmer (2016) and relates to all articles. Furthermore, vulnerability is now a concept that is not limited to a single article, and which the ECtHR applies across articles.

G Pilot study for Annotating Non-Article 3 Cases
In the first round, both annotators independently labeled 20 randomly selected cases under Article 3. After completing the labeling process, they compared their annotations with Heri's labels and engaged in a discussion to address any discrepancies and clarify their understanding of the vulnerability concept. In the second round, the annotators independently labeled another 20 randomly selected cases. We calculated the inter-annotator agreement using Fleiss' kappa to measure the consistency between Heri's labels and the annotations provided by our two annotators. Fleiss' kappa increased from 0.39 in the first round to 0.64 in the second round, which we consider to be substantial agreement in a multi-label setting involving seven vulnerability types and three annotators.

H GLOSS Annotation Tool
The task of explanation annotation was done using the GLOSS annotation tool (Savelka and Ashley, 2018).

I Dataset Statistics
The dataset comprises 1,070 cases. On average, each case involves 0.78 vulnerability types across the whole dataset, and 1.54 vulnerability types when only vulnerable-related cases are considered.
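The two averages are consistent with each other. As a quick check, assuming the count of 530 non-vulnerable cases reported in the Dataset Analysis section (so that 540 cases carry at least one label):

```latex
\[
\frac{(1070 - 530) \times 1.54}{1070}
  = \frac{540 \times 1.54}{1070}
  = \frac{831.6}{1070}
  \approx 0.78
\]
```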
Hyperparameter & Overfitting measures: For the BERT-based models, we perform a grid search over batch size [4, 8, 16] and learning_rate [5e-6, 1e-5, 5e-5, 1e-4]. We train models with the Adam optimizer for up to 8 epochs. We determine the best hyperparameters on the dev set and use early stopping based on the dev set macro-F1 score. For the Hierarchical models, we employ a maximum sentence length of 128 and a maximum document length (number of sentences) of 80. The dropout rate in all layers is 0.1. We perform a grid search over batch size [2, 4] and learning_rate [1e-6, 5e-6, 1e-5]. We train models with the Adam optimizer for up to 20 epochs. We determine the best hyperparameters on the dev set and use early stopping based on the dev set macro-F1 score. We use PyTorch (Paszke et al., 2019) version 2.0.1.
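The search spaces above can be enumerated as in the sketch below, which also shows one common form of early stopping on the dev macro-F1. The grids are taken from the text. The helper names are hypothetical, and the `patience` value is an assumption, since the stopping criterion is not specified beyond being based on dev macro-F1.

```python
from itertools import product

# Search spaces as stated in the text.
BERT_GRID = {"batch_size": [4, 8, 16],
             "learning_rate": [5e-6, 1e-5, 5e-5, 1e-4]}
HIER_GRID = {"batch_size": [2, 4],
             "learning_rate": [1e-6, 5e-6, 1e-5]}


def configurations(grid):
    """Return every combination of the grid as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


def best_epoch_with_early_stopping(dev_macro_f1_per_epoch, patience=2):
    """Stop once the dev macro-F1 has not improved for `patience` epochs.

    patience=2 is an assumed value. Returns (best_epoch, best_score),
    with epochs counted from 0.
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(dev_macro_f1_per_epoch):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score
```

The BERT grid yields 12 configurations and the Hierarchical grid 6, so an exhaustive search remains tractable for a dataset of this size.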

K Model Architecture
Hierarchical Model: The case facts are greedily packed into m packets x = {x_1, x_2, ..., x_m}, where packet x_i = {x_i1, x_i2, ..., x_in} consists of n tokens. We pass each packet x_i independently into the pre-trained LegalBERT model (Chalkidis et al., 2020) to extract the [CLS] representation h_i for each packet. The packet representations h = {h_1, h_2, ..., h_m}, along with learnable position embeddings, are passed through a transformer encoder to make them aware of the surrounding context. We thus obtain context-aware packet representations, which are then max-pooled to obtain the final representation of the case facts. This is passed through a classification layer. We employ a binary cross-entropy loss over each label type, given the multi-label nature of the task.
Concept-aware Hierarchical model: We cast the multi-label problem as a binary classification problem in which we pair the case facts with each vulnerability type and predict whether this vulnerability type is aligned with the given facts. Note that we transform these binary labels into a multi-label vector for evaluation, to allow a fair comparison with the multi-label models. This is inspired by the article-aware outcome prediction setting of Tyss et al. (2023). Concept-aware models take as input the case facts, greedily packed into x = {x_1, x_2, ..., x_m} with m packets, and the vulnerability concept description text c = {c_1, c_2, ..., c_k} with k packets. Packet x_i = {x_i1, x_i2, ..., x_in} has n tokens and packet c_i = {c_i1, c_i2, ..., c_ip} has p tokens.
As in the hierarchical model, we apply a pre-trained LegalBERT model (Chalkidis et al., 2020) to each packet in the case facts and the concept description independently and extract the [CLS] representation for each packet. To enhance context information, those packet representations are passed through a non-pretrained transformer model to obtain context-aware packet representations f = {f_1, f_2, ..., f_m} and g = {g_1, g_2, ..., g_k} for the case facts and the concept description, respectively.
The packet representations of the facts and of the concept description then interact with each other through multi-head scaled dot-product cross-attention (Vaswani et al., 2017), similar to the decoder in the transformer: the case-fact packets (f) serve as the queries (Q), and the keys and values come from the concept packets (g).
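For reference, the standard single-head form of scaled dot-product attention from Vaswani et al. (2017), instantiated with the roles described here, can be written as below. The projection matrices W^Q, W^K, W^V and the key dimension d_k follow the usual transformer notation and are assumptions, as they are not named in the text.

```latex
\[
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
Q = f\,W^{Q},\quad K = g\,W^{K},\quad V = g\,W^{V}
\]
```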
We thus obtain a concept-aware representation of each packet of the case facts, d = {d_1, d_2, ..., d_m}. These are once again passed through a non-pretrained transformer module to enrich them with the surrounding contextual information. Finally, we apply a max-pooling operation to obtain the final concept-aware representation of the case facts, which is sent to a classification layer to predict the binary label indicating whether the case facts align with the given vulnerability concept.
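The cross-attention and pooling steps can be illustrated with a small NumPy sketch. This is a simplified stand-in rather than the authors' model: it uses a single attention head, random matrices in place of learned projections, and omits the LegalBERT encoder, the transformer layers and the classification layer. All names and dimensions are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(f, g, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    f: (m, dim) fact packets used as queries.
    g: (k, dim) concept packets used as keys and values.
    """
    q, k, v = f @ w_q, g @ w_k, g @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each fact packet attends over concepts
    return weights @ v, weights


def max_pool(packets):
    """Element-wise maximum over the packet axis."""
    return packets.max(axis=0)


rng = np.random.default_rng(0)
m, k, dim = 4, 2, 8                      # illustrative sizes
f = rng.normal(size=(m, dim))            # stand-in for fact packet vectors
g = rng.normal(size=(k, dim))            # stand-in for concept packet vectors
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

d, weights = cross_attention(f, g, w_q, w_k, w_v)
case_vector = max_pool(d)                # one vector per (case, concept) pair
```

Because the concept description enters only through the keys and values, the same fact encoder can be paired with each of the seven type descriptions in turn, which matches the binary casting of the task described above.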
Figure 1: (a) Evolving distribution of types of vulnerability. (b) Increase in application of the vulnerability type "migration" between 2010 and 2018.
Fig 2a demonstrates the detailed architecture of the hierarchical model.
Figure 2: Visualization of Hierarchical and Concept-aware Hierarchical Model architectures.
Fig 2b demonstrates the detailed architecture of the concept-aware model.Further details about the model architecture can be found in App K.
Fig 4 presents the distribution of the number of annotated labels per document. In Tab 3 we report the imbalance characteristics for each label. Fig 5 demonstrates the difference in frequency of vulnerability annotations between Article 3 and non-Article 3 cases. Fig 6 demonstrates the difference in frequency of each vulnerability label between all four datasets. Tab 6 shows the statistics for the explanation dataset. The hierarchical nature of the dataset is based on the naturally occurring paragraphs in the legal cases. On average, each case consists of 71.54 paragraphs (σ = 67.54). The distribution of the number of paragraphs by vulnerability type is shown in Fig 7. The mean token count is 4,765; its distribution by vulnerability type is depicted in Fig 8. The distribution of the mean token count per paragraph by vulnerability type is shown in Fig 9.

Figure 3: Screenshot of the GLOSS annotation interface.

Figure 4: Distribution of number of annotated labels per document.

Figure 5: Difference in frequency of vulnerability annotations between Article 3 and non-Article 3.

Figure 7: Box plot of the number of paragraphs per typology.

Figure 8: Box plot of the token count per typology.

Figure 9: Box plot of the mean number of tokens per paragraph per typology.

Table 4: Classification and explanation results. We report F1s for classification performance and Kappa score with standard error for explanation agreement.

Table 5: Results on the challenge dataset.

Table 6: Statistics for the explanation dataset.

Fig 3 demonstrates the GLOSS annotation interface.