Walking a Tightrope – Evaluating Large Language Models in High-Risk Domains

High-risk domains pose unique challenges that require language models to provide accurate and safe responses. Despite the great success of large language models (LLMs), such as ChatGPT and its variants, their performance in high-risk domains remains unclear. Our study presents an in-depth analysis of the performance of instruction-tuned LLMs, focusing on factual accuracy and safety adherence. To comprehensively assess the capabilities of LLMs, we conduct experiments on six NLP datasets including question answering and summarization tasks within two high-risk domains: legal and medical. Further qualitative analysis highlights the existing limitations inherent in current LLMs when they are evaluated in high-risk domains. This underscores the importance of not only improving LLM capabilities but also prioritizing the refinement of domain-specific metrics and embracing a more human-centric approach to enhance safety and factual reliability. Our findings advance the field by addressing the concerns of properly evaluating LLMs in high-risk domains, aiming to steer the adaptability of LLMs in fulfilling societal obligations and aligning with forthcoming regulations, such as the EU AI Act.


Introduction
Large language models (LLMs) have revolutionized how the world views NLP (Wei et al., 2022b; Kojima et al., 2022). Their astonishing performance on many tasks has led to an exponential increase in real-world applications of LLM-based technology. However, LLMs have a tendency to generate plausible but erroneous information, commonly referred to as hallucinations (Ji et al., 2023). This phenomenon proves to be particularly detrimental within high-risk domains, underscoring the importance of accurate and safe model outputs (Nori et al., 2023).
In addition, upcoming regulations, such as the EU AI Act (European Commission, 2021), further underline the necessity of properly analyzing and evaluating LLMs. The EU AI Act is expected to become the first law worldwide that regulates the deployment of AI in the European Union and is therefore expected to set a precedent for the rest of the world. According to the current draft, AI systems in high-risk domains, e.g., systems that have an impact on human life, will be subject to strict obligations, such as extensive testing and risk mitigation, prior to the system deployment (see Figure 1).
In the era of LLMs, instruction-tuning (Mishra et al., 2022; Wei et al., 2022a) has been proposed to efficiently solve various tasks like question answering (QA), summarization, and code generation (Scialom et al., 2022; Wang et al., 2023). However, these models, trained on heterogeneous internet data, lack domain-specific knowledge crucial for accurate and reliable responses in high-risk domains, including up-to-date regulations, industry practices, and domain nuances (Sallam, 2023). Furthermore, the quality of the training data is seldom quantified (Zhou et al., 2023). Consequently, they exhibit limitations in terms of domain expertise and adherence to safety and regulatory compliance.
1 Figure is based on https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai.
In the study conducted by Hupkes et al. (2022), a comprehensive perspective was introduced, advocating for the consideration of multiple facets in assessing generalization across diverse data distributions and scenarios. Building on the imperative of benchmarking generalization in the field of NLP and underscoring the importance of fairness in practical applications, our research delves into a specific yet pivotal dimension: how well can LLMs generalize in high-risk domains?
Our investigation is centered around two essential dimensions of generalizability: (a) the capability of LLMs to generalize to new high-risk domains (i.e., general vs. high-risk domains) and new tasks (i.e., with and without instruction-tuning); and (b) the assessment of evaluation metrics' capability to generalize and accurately measure the performance of LLMs in high-risk domain tasks. Our study entails a robust empirical assessment of the performance of both out-of-the-box LLMs and those fine-tuned through specific instructions tailored for high-risk contexts. To gauge their efficacy, the evaluation involves two prominent high-risk domains (medical, legal) and encompasses a diverse set of tasks, including QA and summarization.
We evaluate model outputs with regard to two key aspects, as depicted in Figure 2: (1) factuality: are LLM outputs factually correct for high-risk domains? (2) safety: do LLMs successfully avoid producing harmful outputs? These aspects are essential for ensuring that LLMs generate reliable and trustworthy information while avoiding outputs that could be detrimental. To evaluate this, we employ existing metrics for factuality (Fabbri et al., 2022; Zhong et al., 2022) and safety (Hanu and Unitary team, 2020; Dinan et al., 2022) concerns. Additionally, we conduct a qualitative analysis to evaluate whether the metrics are capable of accurately assessing LLMs on tasks in high-risk domains. Finally, we discuss the challenges that must be overcome before LLMs are deemed suitable for applications in high-risk domains and thereby contribute to the broader conversation on generalization in high-risk domains.
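Contributions. Our contributions are summarized as follows: (i) we robustly evaluate the outputs of out-of-the-box and instruction-tuned LLMs in two high-risk domains on 6 datasets across QA and summarization tasks in terms of safety and factuality concerns; (ii) we conduct a qualitative investigation to identify shortcomings of existing metrics; (iii) we discuss open challenges that need to be solved in order to solidify trust in the generalization capability of LLMs in high-risk domains; and (iv) we advocate for the need of human-centric NLP systems that are capable of giving the final control to human users in order to build trustworthy applications in high-risk domains.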

Domain-adaptive Instruction-tuning
The emergence of GPT (Radford et al., 2018) has led to a multitude of generative LLMs. One line of work improves LLM performance by increasing the number of model parameters (Chowdhery et al., 2022). Researchers and practitioners have also explored diverse data sources and training objectives to enhance the capabilities of LLMs while reducing the model size and computational burden. Another line of work focuses on training smaller foundation models (e.g., GPT-J (Wang and Komatsuzaki, 2021), LLaMA (Touvron et al., 2023), MPT (MosaicML NLP, 2023)).
The adoption of smaller foundation models enables researchers and practitioners to conduct more efficient investigations into novel methods, explore new domain-specific applications, and streamline deployment. Crucially, the emphasis on smaller models is in accordance with the utilization of the instruction-tuning (Mishra et al., 2022) method, enabling efficient customization and adjustment of LLMs for particular domains or tasks (Anand et al., 2023; Hu et al., 2023).
In our experiments, we rely on a series of smaller LLMs for reasons of efficiency and cost, and incorporate domain knowledge for high-risk domains via instruction-tuning. By leveraging explicit instructions during the training process, instruction-tuning has been shown to enhance the model's ability for generalization (Wei et al., 2022a) and domain adaptability (Gupta et al., 2022; Wang et al., 2023). The domain-adaptive instruction-tuning approach explores how effectively smaller models can adapt to high-risk domains (Yunxiang et al., 2023).
To efficiently incorporate domain knowledge, we employ QLoRA (Dettmers et al., 2023), a method based on LoRA (Hu et al., 2021), which compresses models using 4-bit quantization while maintaining performance parity. This reduces memory usage and enables efficient domain-adaptive instruction-tuning.
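To make the memory argument concrete, the following is a back-of-the-envelope sketch, not the configuration used in the experiments. It assumes the standard LoRA formulation, in which a frozen d x k weight matrix receives a trainable low-rank update BA with B of shape d x r and A of shape r x k. The layer shape, the rank r = 8, and the 6B parameter count are illustrative assumptions.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters that LoRA adds to one frozen d x k weight matrix."""
    # B has shape (d, r) and A has shape (r, k); only these two are trained.
    return r * (d + k)


def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    d = k = 4096  # hypothetical square projection layer
    r = 8         # hypothetical LoRA rank
    full = d * k
    lora = lora_trainable_params(d, k, r)
    print(f"full layer: {full:,} params, LoRA update: {lora:,} params "
          f"({100 * lora / full:.2f}% of the layer)")

    n = 6_000_000_000  # hypothetical 6B-parameter base model
    print(f"16-bit weights: {weight_memory_gib(n, 16):.1f} GiB")
    print(f" 4-bit weights: {weight_memory_gib(n, 4):.1f} GiB")
```

The estimate covers stored weights only; activations, optimizer state for the adapter, and quantization metadata add to the real footprint.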

Experimental Setup
Instruction-tuning Data. To implement instruction-tuning, we collect in-domain datasets for legal and medical domains. To create the instructions for domain-adaptive instruction-tuning, we consider 4 datasets for each of the legal and medical domains. An overview of the collected datasets is shown in Table 1. According to recent work, instruction-tuning datasets typically range from 10K to 100K instances. The dataset sizes are subject to variations based on domain-specific applications, the nature of evaluation tasks, and the practical feasibility of the curated datasets. To mitigate plausibility concerns, our approach does not rely on machine-generated instructions. Instead, we emphasize the use of human-annotated data, a decision that aligns with our commitment to maintaining the reliability of the instruction datasets. To ensure the efficacy of the domain-adaptive instruction-tuning approach, we follow the steps from Wei et al. (2022a) and construct templates for each of the datasets to form the final instructions. We also explicitly control the number of instructions for both domains (13K) to have a fair comparison among approaches. Due to the scarcity of resources in the legal domain for instructions, the medical domain data is downsampled to match the number of instances in the legal domain. We ensure that the selected number of instances for each dataset is well-aligned with the tasks and sources.
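The template and downsampling steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the template text, the field names `question` and `answer`, and the helper names are hypothetical and are not the templates or code used in this study.

```python
import random

# Hypothetical template, for illustration only.
QA_TEMPLATE = "Answer the following {domain} question.\n\nQuestion: {question}\nAnswer:"


def to_instruction(example: dict, domain: str) -> dict:
    """Turn one human-annotated QA pair into a prompt/response instruction."""
    prompt = QA_TEMPLATE.format(domain=domain, question=example["question"])
    return {"prompt": prompt, "response": example["answer"]}


def downsample(instances: list, target_size: int, seed: int = 0) -> list:
    """Randomly keep at most target_size instances (without replacement)."""
    if len(instances) <= target_size:
        return list(instances)
    return random.Random(seed).sample(instances, target_size)


if __name__ == "__main__":
    legal = [{"question": f"legal q{i}", "answer": f"a{i}"} for i in range(5)]
    medical = [{"question": f"medical q{i}", "answer": f"a{i}"} for i in range(20)]

    legal_set = [to_instruction(ex, "legal") for ex in legal]
    # Match the larger medical set to the size of the smaller legal set.
    medical_set = downsample([to_instruction(ex, "medical") for ex in medical], len(legal_set))
    print(len(legal_set), len(medical_set))
```

Fixing the seed keeps the downsampled subset reproducible, which matters when several models are compared on the same instruction set.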

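Table 1: Overview of the collected instruction-tuning datasets (columns: Domain, Dataset, Size, License).

Table 2: Overview of the high-risk domain task datasets (columns: Domain, Dataset, Task, Size, License).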
Evaluation Tasks. We focus on two high-risk domains (legal and medical), aligned with EU AI Act domain categorization (see Figure 1), and evaluate 6 datasets across QA and summarization (SUM) tasks. The tasks include multiple-choice QA (Zheng et al., 2021), free-form QA (Li et al., 2022; Yunxiang et al., 2023), reasoning QA (Jin et al., 2019), and long document summarization (Kornilova and Eidelman, 2019; Wallace et al., 2020). Table 2 displays an overview of the high-risk domain task datasets. We provide example excerpts and templates designed for each task in Appendix A.
Evaluation Metrics. In high-risk domains, where the implications of incorrect or harmful information are amplified, it becomes imperative to assess language models through the lens of their potential impact on users and society. The selection of factuality and safety as evaluation metrics is rooted in the following considerations: (1) Factuality is considered as the ability of LLMs to provide factual and precise responses. Factual inaccuracies could lead to misguided decisions or actions, and they can undermine the trustworthiness of generated content. By evaluating factuality, we seek to ensure that the responses of LLMs align with accurate information, which is of utmost importance in high-risk applications. Two metrics are considered, both of which have been shown to align with human judgments: QAFactEval (Fabbri et al., 2022), which measures fine-grained overlap of the generated text against the ground truth, and UniEval (Zhong et al., 2022), which computes scores over several dimensions, namely coherence, consistency, fluency, and relevance. (2) Safety is defined as the degree to which the generated content is responsible, safe, unbiased, and reliable. High-risk domains often involve sensitive topics, legal regulations, and ethical considerations; ensuring safety in the generated content thus mitigates the potential for unintended consequences, such as perpetuating harmful stereotypes or generating discriminatory content (Kaddour et al., 2023).
Evaluating safety involves assessing the model's propensity to avoid generating content that could be offensive, harmful, or inappropriate. We consider Detoxify (Hanu and Unitary team, 2020) and SafetyKit (Dinan et al., 2022), which measure a model's tendencies to agree to offensive content or give the user false impressions of its capabilities as well as other safety concerns. Although our primary focus is on ensuring factuality and safety, it is essential to underscore the significance of other critical factors, such as robustness (Zhu et al., 2023).
Table 3: [...] Hupkes et al. (2022). The taxonomy encompasses five distinct (nominal) axes along which generalization research varies. The dimensions include the primary motivation for the research (motivation), the specific type of generalization challenges addressed (generalization type), the point at which these shifts occur (shift locus), the nature of data shifts under consideration (shift type), and the origin of the data shifts (shift source). The coverage of generalizability in this study is marked (✓).

Table 4: Overview of the computational information for the domain-adaptive instruction-tuning, in comparison with GPT-3.5-turbo (OpenAI, 2022) (columns: Model, Base Model, # Params, Budget, Size, License). The number of parameters (# Params) indicates the trainable parameters when utilizing the QLoRA (Dettmers et al., 2023) approach, and the budget is given in GPU hours.
tuned on domain instructions and the ones without; and (4) shift source (natural shift): we only consider human-annotated data to mitigate plausibility concerns (see §3). We summarize the generalizability of our proposed methods in Table 3.
Pre-trained Large Language Models. Table 4 shows the model size, the license, and the computational information for the selected LLMs compared to the enormous GPT-3.5-turbo (i.e., ChatGPT (OpenAI, 2022)). GPT4ALL-* (Anand et al., 2023) is a set of robust LLMs instruction-tuned on a massive collection of instructions including code and dialogs. This means that it has been fine-tuned specifically to excel in a variety of tasks.
The fact that the base model demonstrates proficiency in these general-purpose language tasks provides a strong foundation for the instruction-tuned version to perform well in various scenarios. Besides, GPT4ALL-* comes with an open-source license that permits commercial use, providing the freedom to develop and deploy applications across a wide range of use cases without being encumbered by legal or legislative concerns.
Training and Optimization. All the experiments are performed on a single Nvidia Tesla V100 GPU with 32GB VRAM and run on a GPU cluster. During the training process, we train for 5 epochs in batches of 64 instances. The learning rate is set to 1e-5 and the maximum sequence length is set to 1024. These settings are applied to both selected general-purpose instruction-tuned models (GPT4ALL-J, GPT4ALL-MPT) (Anand et al., 2023). For evaluation, we set the maximum sequence length to 1024 for all compared models, and evaluate on two high-risk domains (legal, medical) with six tasks, including QA and summarization (see Table 2).
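The training settings above can be collected in a small configuration object. This is a sketch for orientation only: the class and method names are hypothetical, and the step count assumes one optimizer step per batch without gradient accumulation, which the text does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    epochs: int = 5
    batch_size: int = 64
    learning_rate: float = 1e-5
    max_seq_length: int = 1024

    def steps_per_epoch(self, num_instances: int) -> int:
        # The final, possibly smaller, batch still counts as one step.
        return math.ceil(num_instances / self.batch_size)

    def total_steps(self, num_instances: int) -> int:
        return self.epochs * self.steps_per_epoch(num_instances)


if __name__ == "__main__":
    cfg = TrainConfig()
    n = 13_000  # approximate number of instructions per domain (13K)
    print(cfg.steps_per_epoch(n), cfg.total_steps(n))
```

For exactly 13,000 instructions this gives 204 steps per epoch and 1,020 steps over 5 epochs.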

Evaluation Results
Factuality. Results for the factuality metrics can be found in Table 5. Overall, only some models on some datasets achieve a factuality score of over 90%. This reveals that LLMs in their current stage are not yet suitable for use in high-risk domains.
Comparing the models, results of the instruction-tuned models are better than those of the baselines, indicating that domain-adaptive instruction-tuning can lead to improvements in results generated for high-risk domains. However, factuality scores vary greatly across tasks in the same domain. For instance, GPT4ALL-J (tuned) in the legal domain obtains the highest QAFactEval score for CaseHold, but scores the lowest for the LawStackExchange (LSE) task. This shows that instruction-tuning is an interesting direction, but more work is required to raise factuality reliably.
Upon further analysis of randomly picked generated texts, we also find that some answers are in fact repetitions of the question or part of it. For example, when GPT4ALL-J answers "(Yes, No, Maybe)" to a prompt, this instance obtains a score of 0.5 from QAFactEval and 0.946 from UniEval. These results call into question whether these metrics accurately reflect the factuality of the generated text. Thus, there is an indication that the metrics themselves are not yet suitable to correctly assess LLMs in high-risk domains.
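The failure mode above can be reproduced with a much simpler overlap score. The sketch below uses a toy token-level F1, which is an assumption made for illustration and is neither QAFactEval nor UniEval; it only shows how an overlap-based score can reward a response that lists every answer option without committing to one.

```python
import re
from collections import Counter


def tokens(text: str) -> list:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = Counter(tokens(prediction)), Counter(tokens(reference))
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    hedge = "(Yes, No, Maybe)"  # repeats the answer options instead of answering
    for gold in ("yes", "no", "maybe"):
        print(gold, round(token_f1(hedge, gold), 3))
```

Under this toy score the non-answer receives the same partial credit whichever option is correct, because it always contains the gold token. This says nothing about how QAFactEval itself arrives at its score.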
Safety. Results for the safety metrics can be found in Table 6. Overall, we observe that both metrics return an exceedingly high score for all models (i.e., the score is higher than 0.94 across the board). To verify whether the metrics report such high scores reliably, we run a small manual analysis by randomly selecting 10 generated outputs each from GPT4ALL-MPT (tuned) on the legal (LSE) dataset and from GPT4ALL-MPT on the medical (iCliniq) dataset. Even though we only analyzed 10 outputs per dataset, we already found several issues. For the medical domain, 8 out of 10 answers are problematic. While only a small sub-sample, it still indicates a worrisome difference from the reported high safety score of 0.95. For example, the model output contains phrases such as "Based on the pictures you have provided", despite the model not having the capability to process images. In another example, the model suggests treating a dog bite by cleaning the wound, whereas the gold answer is to get an injection.
The legal domain fares better: here, we found 3 out of 10 answers problematic. In one example, the model output includes "it may not be necessary to obtain explicit consent from users" about the website cookies usage policy, but does not specify the scenarios in which this claim holds.
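A spot check of this kind could be set up as in the following sketch. The fixed seed, the function names, and the Wilson score interval are illustrative additions and not part of the reported analysis; the interval is included only to indicate how much uncertainty remains with 10 samples.

```python
import math
import random


def sample_for_audit(outputs: list, k: int = 10, seed: int = 0) -> list:
    """Draw k outputs uniformly at random (without replacement) for manual review."""
    return random.Random(seed).sample(outputs, k)


def wilson_interval(problematic: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    p = problematic / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    outputs = [f"generated answer {i}" for i in range(200)]  # placeholder outputs
    print(len(sample_for_audit(outputs)))
    for domain, bad in (("medical", 8), ("legal", 3)):
        low, high = wilson_interval(bad, 10)
        print(f"{domain}: {bad}/10 problematic, interval {low:.2f} to {high:.2f}")
```

The intervals are wide, as expected for such a small sample, so the counts are best read as a warning sign that motivates a larger expert-annotated audit rather than as an error-rate estimate.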
Overall, the metrics can give us a good first indication and might allow us to compare models.However, the qualitative analysis results highlight that more research needs to be conducted on how we can define reliable and domain-adjusted safety metrics before we can automatically assess the safety of LLMs in high-risk domains.

Implications
The need for factual and secure outputs of LLMs is crucial for their deployment in high-risk domains. This necessity arises from both the societal impact of their usage and the imperative to meet forthcoming AI regulations. Based on the outcomes of our empirical investigation, it is evident that LLMs are not yet ready for deployment in high-risk domains (Au Yeung et al., 2023; Tan et al., 2023). In light of this, we address three key implications that can guide us towards a more suitable course of action: (1) Models enhancement: the LLMs themselves must be improved to ensure they generate accurate and reliable responses; (2) Metrics refinement: metrics must be refined to assess LLMs properly in specific domain scenarios; and (3) Human-centric systems: development of LLMs should be prioritized to empower human users to manage and direct LLM interactions, especially in high-risk domain use cases.

Models Enhancement. A major vulnerability of LLMs lies in their tendency to generate coherent but erroneous statements that seem plausible at face value, often referred to as fluent hallucinations (Deutsch et al., 2022). We posit that as long as this issue persists, the deployment of LLMs in high-risk scenarios, particularly in the context of the upcoming EU AI Act, remains difficult. Therefore, it becomes paramount to devise more effective methods for assessing and verifying the factual correctness of generated text outputs. One potential avenue for improvement is to explore pre-training methods that yield more factually accurate outputs (Dong et al., 2022), involving the further development of advanced instruction-based fine-tuning methods and enhancing the safety of generated content. Furthermore, the integration of retrieval-augmented models (Guu et al., 2020; Borgeaud et al., 2022) offers a viable solution to enhance the factual integrity of outputs. These models facilitate a semantic comparison between LLM-generated text and retrieved source materials, reinforcing the credibility of the generated content.
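The comparison between generated text and retrieved sources mentioned above can be illustrated with a deliberately simple stand-in. The sketch below uses bag-of-words cosine similarity in place of a real semantic encoder, and the passages, the threshold of 0.5, and the function names are made up for illustration.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Bag-of-words vector: lowercase alphanumeric token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_support(claim: str, passages: list) -> tuple:
    """Return (similarity, passage) for the retrieved passage closest to the claim.

    Expects a non-empty list of passages.
    """
    return max((cosine(bow(claim), bow(p)), p) for p in passages)


if __name__ == "__main__":
    # Made-up passages standing in for retrieved source material.
    passages = [
        "Clean the wound and seek medical care promptly after an animal bite.",
        "Cookie consent rules depend on the purpose of the cookie.",
    ]
    score, source = best_support("Clean the wound after a dog bite.", passages)
    flag = "needs review" if score < 0.5 else "has a close source"  # hypothetical threshold
    print(f"{score:.2f} [{flag}] {source}")
```

Lexical overlap only shows that a source mentions similar words. It cannot confirm that the source supports the claim, which is why the closest passage is returned so that a human can inspect it.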
Metrics Refinement. The evaluation of factuality necessitates a multi-faceted approach (Jain et al., 2023), encompassing considerations of contextual understanding, source credibility, cross-referencing with reliable information, and critical analysis. Correspondingly, the creation of dependable test sets that faithfully represent real-world use cases is essential (Kaddour et al., 2023). These test sets must exhibit exceptional quality in terms of factuality, underscoring the vital need for collaboration with domain experts. Particularly in high-risk domains and highly specialized subjects, lay individuals may lack the expertise required to provide accurate annotations. Hence, the involvement of domain experts becomes indispensable to ensure the appropriateness and accuracy of assessments. Integrating these additional elements into the evaluation process is anticipated to achieve a more robust and nuanced appraisal of the factuality of a given statement or piece of information.
Regarding safety metrics, existing evaluation metrics are proficient at identifying toxic speech, but often fall short when it comes to detecting potentially harmful medical advice or fictional legal guidance. To improve the safety of LLMs, it is necessary to collaboratively establish, in consultation with stakeholders and domain experts, the specific safety checks necessary for particular high-risk domains. In light of this, we propose that the following two directions should be investigated simultaneously within the research community. First, the development of more reliable automatic metrics that carefully document (i) their underlying mechanisms (i.e., how they work), (ii) the implications of their scores, and (iii) their appropriate and intended use cases (similar to model cards (Mitchell et al., 2019) and dataset sheets (Gebru et al., 2021), but adapted for metrics). Second, we need to develop safety mechanisms aimed at mitigating the risk of jailbreaking models (Li et al., 2023). By addressing the above measures, LLMs can be guided toward enhanced safety and reliability, thereby ensuring their suitability for deployment in high-risk domains.
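The documentation proposed for metrics above, covering points (i) to (iii), could be captured in a small structured record. The sketch below is one possible shape, not an established format: the `MetricCard` class, its field names, and the example entry are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetricCard:
    name: str
    mechanism: str             # (i) how the metric works
    score_interpretation: str  # (ii) what a given score does and does not imply
    intended_use: list = field(default_factory=list)  # (iii) appropriate use cases
    out_of_scope: list = field(default_factory=list)  # known blind spots

    def covers(self, use_case: str) -> bool:
        """True only if the use case is explicitly listed as intended."""
        return use_case in self.intended_use and use_case not in self.out_of_scope


if __name__ == "__main__":
    card = MetricCard(
        name="toxicity-classifier (hypothetical)",
        mechanism="Classifier that scores text for toxic language.",
        score_interpretation="A high score means little toxic language was detected; "
                             "it does not mean the advice given is correct.",
        intended_use=["toxic speech detection"],
        out_of_scope=["harmful medical advice", "fictional legal guidance"],
    )
    print(card.covers("toxic speech detection"), card.covers("harmful medical advice"))
```

Making `covers` return true only for explicitly listed use cases is a conservative design choice: a metric is treated as unsuitable for a domain until someone documents that it has been validated there.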
Human-centric Systems. In addition to emphasizing the necessity of improvements in both models and evaluation metrics to enable the utilization of LLMs in high-risk domains, another vital inquiry emerges: considering the near impossibility of achieving absolute quality assurance, what actions can we take to ensure responsible usage?
One possible direction is the development of human-centric systems. This direction aligns with the insights proposed by Shneiderman (2020), emphasizing that the choice between low and high automation when integrating LLMs into high-risk domains is not binary. Rather, it entails a two-dimensional approach where high automation coexists with a high degree of human control (for a graphical representation, see Figure 3). Without LLMs, humans maintain full control over text generation in all (high-risk) domains. On the opposite end of the spectrum, we encounter scenarios where LLMs generate text that humans blindly trust, potentially introducing safety and factual accuracy risks that cannot be entirely eliminated at present.
To mitigate this inherent risk, we propose to adopt the framework proposed by Shneiderman (2020), enabling both high automation and human control. For LLMs, we envision a two-step approach: (1) Human interpretability: we ensure that the text generated by an LLM is supported by human-understandable evidence. This can be achieved, as discussed earlier, through a retrieval-based system that provides the source text used by the LLM. (2) Human verification: we build systems around the LLM, e.g., user-friendly interfaces, enabling human users to verify the content. Users can either approve the content directly, make modifications if necessary, or submit update requests to the LLM. The resulting human-centric system allows for responsible usage even when the output may not be flawless. To realize this vision, we advocate that researchers look beyond the scope of generalizability: if we cannot guarantee perfect generalizability, what additional aspects should we explore and provide in order to build LLMs that are suitable in high-risk domains? In pursuit of this goal, researchers should actively engage in interdisciplinary collaboration and involve domain-specific stakeholders, such as medical professionals in the medical domain, at the earliest stages of research. This collaboration is especially vital in the evolving post-LLM era, where NLP applications have moved much closer to practical use than ever before.
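The two-step approach described above (evidence first, then human verification with approve, modify, or update request) can be sketched as a small review gate. All names below are hypothetical, and the rule that a draft without evidence is never released is an assumption added for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    MODIFY = "modify"
    REQUEST_UPDATE = "request_update"


@dataclass
class Draft:
    text: str
    evidence: list  # source passages shown to the reviewer (step 1)


def review(draft: Draft, decision: Decision, edited_text: Optional[str] = None) -> Optional[str]:
    """Return the text to release, or None if the draft goes back to the LLM."""
    if not draft.evidence:
        # Assumption: without supporting evidence the reviewer cannot verify the draft.
        return None
    if decision is Decision.APPROVE:
        return draft.text
    if decision is Decision.MODIFY:
        if edited_text is None:
            raise ValueError("MODIFY requires the reviewer's edited text")
        return edited_text
    return None  # REQUEST_UPDATE: regenerate before anything is released


if __name__ == "__main__":
    draft = Draft(text="Generated answer.", evidence=["retrieved source passage"])
    print(review(draft, Decision.APPROVE))
    print(review(draft, Decision.MODIFY, edited_text="Corrected answer."))
    print(review(draft, Decision.REQUEST_UPDATE))
```

In this sketch nothing reaches the end user unless a human has taken an explicit decision, which keeps the final control with the user while the LLM still does the drafting.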

Related Work
LLMs in High-risk Domains. Recent work has demonstrated the efficacy of leveraging LLMs in high-risk domains, which has been achieved either by training the model using a substantial volume of domain-specific data (Luo et al., 2022; Wu et al., 2023), or by employing instruction-tuning techniques to harness the benefits of fine-tuning LLMs with relatively smaller sets of in-domain instructions from diverse tasks (Sanh et al., 2022; Karn et al., 2023).
The domain-adaptive instruction-tuning approach has proven effective in high-risk domains, such as finance (Xie et al., 2023), medicine (Guo et al., 2023), and law (Cui et al., 2023). Singhal et al. (2023) proposed the Med-PaLM2 model and evaluated it on several medical domain benchmarks, but it has been demonstrated that even with extremely large LLMs, the model remains inferior to the expertise of clinicians. Similar findings are also reported in the legal domain (Nay et al., 2023), where LLMs have yet to attain the proficiency levels of experienced tax lawyers. Clients rely on lawyers to obtain contextual advice, ethical counsel, and nuanced judgment, which is not a capability that current LLMs can consistently offer. These findings highlight the crucial need for the development of robust evaluation frameworks and advanced methods to create reliable and beneficial LLMs, suitable for tackling more challenging applications in high-risk domains.
Assessing LLMs. The evaluation of LLMs traditionally centers on tackling two core aspects: (i) the selection of datasets for evaluation and (ii) the formulation of an evaluation methodology. The former focuses on identifying appropriate benchmarks for assessment, while the latter involves establishing evaluation metrics for both automated and human-centered evaluations (Chang et al., 2023). Nonetheless, within the high-risk domain context, the complexities and potential repercussions of LLM utilization underscore the necessity for a more comprehensive and critical evaluation process. Specific challenges arise when assessing LLMs within particular domains (Kaddour et al., 2023). For instance, domains like law demand continuous updates in information to remain relevant (Henderson et al., 2022). In the healthcare field, the safety-sensitive nature of decisions significantly limits current use cases (i.e., the possibility of hallucinations could be detrimental to human health) (Reddy, 2023).
To mitigate risks in high-risk domains, enhancing the model's factual grounding and level of certainty is essential (Nori et al., 2023). Recent research has emphasized a shift toward human-centered evaluation (Chen et al., 2023). Although recent efforts claim that performance improvements stem from encoded high-risk domain knowledge, rendering them applicable in practical real-world scenarios, certain unexplored directions in evaluation persist. These include (i) a clear definition of evaluation metrics in specific domain usage, and (ii) comprehensive investigations involving domain experts to assess the factual accuracy of model outputs and address safety concerns. These gaps highlight the necessity for deeper investigation and are opportunities for upcoming studies to contribute to the advancement of evaluating LLMs in high-risk domains.

Conclusion
As LLMs have taken the world by storm, the concern of benchmarking generalization in NLP gains significance. Our investigation delved into how well current LLMs perform in high-risk domain tasks of QA and summarization in legal and medical domains. The results exposed a significant gap in the suitability of LLMs for high-risk domain tasks, indicating that employing LLMs in their present state is not yet practical. Our study highlighted the urgent need for substantial improvements in both LLMs themselves and the evaluation metrics used to gauge their factuality and safety in high-risk contexts. Additionally, we advocated the necessity of expanding our perspective beyond the scope of the LLM itself and considering the environment in which such systems are deployed: a thoughtful, human-centric design allows us to keep the human user in control and is imperative to enable the reliable and trustworthy usage of LLMs in high-risk domains.
Overall, our findings and discussions accentuate the importance of close collaboration with stakeholders in order to jointly address open critical concerns. This collaborative approach will allow the community to build a stronger foundation for a human-centric approach to benchmarking generalization in NLP for high-risk domains.

Limitations
We investigated how some current LLMs perform on some NLP tasks in the high-risk domains: legal and medical, with regard to two metrics each to measure factuality and safety.This initial exploration serves as a foundation to gain deeper insights into the capabilities of current LLMs in tackling high-risk domain-specific NLP tasks and identifying existing limitations that require attention and resolution.
The current setup has a series of shortcomings that should be reduced in future work, namely: (1) the collected datasets currently only focus on English; (2) the instruction templates are designed manually and might lead to variable outcomes; (3) other instruction-tuned models trained on general-purpose instructions might offer different capabilities, depending on the specific context of domains and tasks; (4) other metrics should be explored and considered, such as robustness (Zhu et al., 2023) and explainability (Zhao et al., 2023); and (5) users should be aware that the metrics used are automatic and therefore might themselves make mistakes and misrepresent model performance (i.e., the metrics require separate benchmarking themselves). We do not claim in any way that the presented testing strategy would fulfill the EU AI Act requirements (this is due to points 1-3 as well as the fact that the Act is not yet finalized).
Despite the limitations of our contributions, the significance of this topic warrants attention. We hope that our work will serve as a catalyst to raise awareness and steer the community toward the development of secure, reliable, and rigorously evaluated LLMs, particularly in high-risk domains. Concretely, we should explore (1) how we can make LLMs more reliable, for example by improving factuality via a retrieval step, and (2) how we can ensure that quality metrics themselves are good enough to accurately measure LLM abilities, particularly for high-risk domains.

Ethics Statement
Our work investigates the performance of LLMs for high-risk domains with regard to factuality and safety. We ran our empirical evaluation using existing datasets, metrics, and LLMs for the domains of legal and medical. At this stage, we did not involve any other stakeholders. We acknowledge that this is an important next step, for example, to seek advice from medical or legal experts, in order to investigate the performance of LLMs for particular domains. As our empirical tests find, the work is far from done on this topic and we ask readers to carefully consider the limitations listed above.

Figure 1: The EU AI Act categorizes AI applications based on their associated risk levels.Although the Act is not yet finalized, it is expected that LLMs will fall into the high-risk category in specific domains, such as medical and legal. 1

Figure 2: Overview of the evaluation framework for evaluating LLMs in high-risk domains. We evaluate how well LLMs with and without instruction-tuning perform in high-risk domains: legal and medical. The quality of the outputs is assessed using existing metrics to measure factuality and safety.

Table 5: Results on factuality, considering two evaluation metrics: QAFactEval (Fabbri et al., 2022) and UniEval (Zhong et al., 2022), on two high-risk domains: legal and medical. The best model varies, with instruction-tuned models generally demonstrating better performance. Overall results may initially appear favorable, but a closer examination reveals a set of underlying issues. For instance, one of the issues identified is that the response "Yes, No, Maybe" achieves a high score, primarily because it includes a partially correct answer.

Table 6: Results on safety, considering two evaluation metrics: SafetyKit (Dinan et al., 2022) and Detoxify (Hanu and Unitary team, 2020), on two high-risk domains: legal and medical. Scores on these metrics are very high. However, a closer investigation shows a clear mismatch between what would be considered a safe response in a legal or medical setting and what the currently existing safety metrics are capable of measuring.
