Systematic Assessment of Factual Knowledge in Large Language Models

Previous studies have relied on existing question-answering benchmarks to evaluate the knowledge stored in large language models (LLMs). However, this approach has limitations regarding factual knowledge coverage, as it mostly focuses on generic domains that may overlap with the pretraining data. This paper proposes a framework to systematically assess the factual knowledge of LLMs by leveraging knowledge graphs (KGs). Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions. We systematically evaluate state-of-the-art LLMs with KGs in generic and specific domains. The experiments show that ChatGPT is consistently the top performer across all domains. We also find that LLM performance depends on instruction finetuning, domain, and question complexity, and is prone to adversarial context.


Introduction
The rise of Large Language Models (LLMs) has greatly improved the capabilities of natural language processing (NLP). However, one primary concern with these models is the potential for extrinsic hallucinations, where LLMs generate statements that cannot be verified from the source (Levy et al., 2021; Ji et al., 2023). This issue severely impairs the trustworthiness of LLMs and is particularly concerning when relying on LLMs for decision-making. Rigorous evaluation is necessary before deploying them in critical applications.
One evaluation approach is to use question-answering datasets to assess the language and knowledge capabilities of LLMs. Recent research has mainly focused on evaluation using existing benchmarks (Bommasani et al., 2023; Bang et al., 2023; Guo et al., 2023). While these benchmarks are valuable for comparison and measuring progress in LLM research, they may not provide sufficient assessment for production. Benchmarks constructed from public datasets can pose information-leakage problems due to overlap with pretraining data. Furthermore, constructing domain-specific benchmarks is costly, requiring domain expertise and adequate knowledge coverage.
This paper proposes a systematic approach to assess factual knowledge in LLMs by generating a comprehensive assessment suite from knowledge graphs (KGs) and evaluating the correctness of LLMs' responses. The question generation process is carefully designed to ensure coverage of facts, as well as diversity and validity of the questions (Figure 1). Using this framework, we evaluate multiple models from three LLM families on factual questions derived from four KGs, covering both generic and specialized domains. Specifically, our contributions are:
• We propose a novel framework to evaluate factual knowledge in LLMs by systematically generating valid and diverse questions from KGs while also ensuring knowledge coverage.
• We observe that LLMs may abstain from answering certain questions, prioritizing precision by avoiding the provision of inaccurate or hallucinated answers. We propose to use the F1 metric to take abstention into account and ensure fair comparison across models.
• We show that LLM performance depends on several factors such as instruction finetuning, domain, and question complexity. Despite sharing the same parametric knowledge base, models finetuned with different instruction datasets show varying performance levels. LLMs achieve their highest scores on general-domain KGs, but their performance declines in specialized domains and is worse on questions with a wide range of potential answers.
• We assess the robustness of LLMs to the prompting context and find that they are highly sensitive to irrelevant information and are susceptible to being misled by antifactual contexts.

Systematic Assessment Framework
This section describes the question generation component of our proposed assessment framework, followed by the answer prompting strategy used to collect LLMs' responses and the evaluation metric.

Question Generation
Our framework leverages the facts stored in a KG, organized into triplets, i.e., (subject, relation label, object), to automatically generate a set of knowledge-based questions and answers satisfying three requirements: (i) validity: questions should have unique or verifiable answers; (ii) coverage: questions should cover all explicit facts; and (iii) diversity: questions should vary in format and difficulty.
In this paper, we assume a complete KG and generate valid questions by taking the object of a given triplet as the reference answer and generating the question from the subject and relation label.
To ensure question coverage and diversity, we utilize all available triplets and employ two question generation methods: from a predefined template (Petroni et al., 2019) or using ChatGPT (OpenAI, 2023). We consider three types of questions: true-false question (TFQ), multiple choice question (MCQ), and short-answer question (SAQ). In addition, each question type can be represented in different formats: true/false question, fill-in-the-blank (FiB) question, and Wh-question (Figure 3 in Appendix).
True-false question (TFQ) Given a triplet, we create factual questions that ask the LLM to determine whether a given statement is true or false. For example, given the triplet (Barack Obama, born in, Hawaii), we can generate the true statement "The birth place of Barack Obama is Hawaii.". For a false statement, we randomly replace the object with a wrong entity.
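As an illustration, TFQ construction can be sketched as below; the template string and entity pool are placeholders for this sketch, not the paper's actual implementation:

```python
import random

def make_tfq(triplet, entity_pool, template, p_true=0.5):
    """Turn a (subject, relation, object) triplet into a true/false statement.

    A true statement fills the template with the gold object; a false one
    swaps in a randomly chosen wrong entity from the pool.
    """
    subj, _relation, obj = triplet
    if random.random() < p_true:
        return template.format(subj=subj, obj=obj), "TRUE"
    wrong = random.choice([e for e in entity_pool if e != obj])
    return template.format(subj=subj, obj=wrong), "FALSE"
```

With `p_true=0.5`, this matches the 1:1 true/false ratio used in our experiments.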

Multiple choice questions (MCQ)
The LLM is presented with a list of answer candidates (choices) and is required to select the correct one. The candidates consist of the object along with randomly selected incorrect entities. We consider two formats for MCQ: fill-in-the-blank (FiB), obtained by replacing the reference object in the true statement with a [MASK] token, and Wh-question (Aigo et al., 2021).
Short-answer questions (SAQ) Instead of providing answer candidates as in MCQ, we ask the LLM to predict the correct answer directly in SAQ. For many-to-many relations, we consider all possible objects as potential correct answers and request the LLMs to list all possible answers.
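A minimal sketch of the MCQ candidate construction described above (the FiB template here is illustrative; in practice each relation has its own template):

```python
import random

def make_mcq(triplet, entity_pool, template, n_distractors=3):
    """Build a fill-in-the-blank MCQ from a triplet: the object is masked
    and offered among randomly drawn incorrect entities."""
    subj, _relation, obj = triplet
    question = template.format(subj=subj, obj="[MASK]")
    distractors = random.sample([e for e in entity_pool if e != obj], n_distractors)
    choices = distractors + [obj]
    random.shuffle(choices)  # avoid positional bias toward the gold answer
    return question, choices, obj
```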

Evaluation
Answer Prompting We carefully design prompts to describe the task and instruct the LLMs to provide concise answers. We also verify the robustness and consistency of LLMs by injecting different types of knowledge into the question, including (i) relevant knowledge, (ii) irrelevant knowledge, which is correct but not related to the question, and (iii) antifactual knowledge, which provides false or erroneous information. The injected knowledge can come from the relation description or extra evidence information, which are available in several knowledge graphs.
Metric Although we prompt LLMs to provide brief and concise answers, evaluating the correctness of a generated answer is not trivial. A small percentage of generated answers are long and contain explanations. Hence, the standard exact match metric used in question-answering tasks is not suitable. Instead, we use a fuzzy match metric that checks whether the generated answer appears in the reference answers and vice versa.
Many LLMs employ guardrails to avoid providing inaccurate or hallucinated answers, returning an abstained answer instead (e.g., "I am unable to answer the questions without more knowledge."). We define precision (P) as the accuracy over non-abstained answers and recall (R) as the accuracy over all questions. The F1 score, F1 = 2 × P × R / (P + R), is the main evaluation metric to compare the performance of LLMs.
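The fuzzy match and abstention-aware F1 described above can be sketched as follows; this is a simplified reading of the metric, not the exact evaluation code:

```python
def fuzzy_match(prediction, references):
    """Loose correctness check: the prediction contains a reference
    answer, or a reference answer contains the prediction."""
    p = prediction.strip().lower()
    return any(p in r.lower() or r.lower() in p for r in references)

def abstention_aware_f1(results):
    """results: list of (prediction, references, abstained) tuples.

    Precision = accuracy over non-abstained answers;
    recall    = accuracy over all questions;
    F1        = 2 * P * R / (P + R).
    """
    correct = sum(fuzzy_match(p, refs) for p, refs, abstained in results if not abstained)
    answered = sum(1 for _, _, abstained in results if not abstained)
    precision = correct / answered if answered else 0.0
    recall = correct / len(results) if results else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

An abstaining model keeps its precision intact but pays for every abstention in recall, which is exactly the trade-off the F1 score balances.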

Setup
Datasets We use four KGs from the LAMA (Petroni et al., 2019) and BioLAMA (Sung et al., 2021) benchmarks to generate factual questions, including two general-domain KGs, Google-RE (Petroni et al., 2019) and T-REx (Elsahar et al., 2018), and two domain-specific KGs, WikiBio (Sung et al., 2021) in the biology domain and UMLS (Bodenreider, 2004) in the medical domain. Each relation in the KGs is associated with a predefined template to construct a natural sentence from a given triplet. Detailed descriptions of the datasets and the predefined templates are reported in Appendix A.1.

Large Language Models
We evaluate the knowledge captured in several LLMs from three backbones: (i) ChatGPT (OpenAI, 2023), (ii) the LLaMA family (LLaMA-7B, Alpaca, and Vicuna), and (iii) the T5 family (T5-XL and FLAN-T5-XL). We did not assess GPT-4 due to its high cost.

Experiment Settings
We employ two question generation methods: (i) template-based (TPL), where the subject is plugged into the provided template and the object is the ground-truth answer; and (ii) LLM-based, where we use GPT-3.5-turbo to generate the questions. The question generation prompt can be found in Appendix C. Given a triplet, we generate the TFQ with the ratio of true and false questions set to 1:1. For MCQ, we randomly select three incorrect entities and combine them with the correct entities as the choices.

Results
Precision We report the precision of LLMs on questions generated from Google-RE in Table 1. As expected, LLMs perform best on TFQ and worst on SAQ due to the increasing difficulty level. Surprisingly, almost all LLMs struggle with FiB questions, often returning abstentions or the [MASK] token without making any predictions. While FiB questions are commonly used in masked language model evaluation, we find that Wh-questions, which are more natural and occur more frequently in the instruction set, are more suitable for evaluating conversational LLMs. Moreover, we observe comparable performance between template-based and GPT-3.5-based questions.
Overall, ChatGPT achieves the best average precision. However, it also has a high percentage of abstained answers across all question types. Both the LLaMA-7B and T5-XL models perform worse than random guessing on TFQ and MCQ, indicating a failure to follow instructions due to the lack of training on instruction finetuning datasets. Although sharing the same parametric knowledge base (LLaMA-7B), Alpaca consistently outperforms Vicuna. On the other hand, further instruction finetuning of FLAN-T5-XL does not improve precision.
F1 Measure Table 2 shows the average F1 score across all question types for each KG. The detailed breakdown of F1 scores for each question type can be found in Appendix B. Overall, ChatGPT outperforms other LLMs, and models from the T5 family generally perform better than those from the LLaMA family. Among models from the same family, those finetuned on the Alpaca instruction set perform better. This contrasts with the above observation, where FLAN-T5-XL is the top performer in terms of precision within the T5 family. Given its high abstention rate, FLAN-T5-XL tends to abstain from uncertain questions to achieve higher precision, which comes at the expense of losing recall on correct answers. As shown in Table 2, the F1 scores on T-REx (general domain) are higher than those on the specific domains (WikiBio and UMLS). Additionally, the relatively stronger performance on WikiBio over UMLS can be attributed to pretraining data overlap, as WikiBio is derived from Wikipedia. Interestingly, all LLMs perform poorly on the Google-RE dataset, despite it also being extracted from the general domain (Wikipedia). We speculate that this discrepancy may be attributed to the complexity of the answer range of the Google-RE questions, such as date of birth, birth place, and death place, which have a wide answer range.

Robustness to Adversarial Context
We inject different contexts into the questions of the Google-RE evaluation set and report the results in Figure 2.
Our observations reveal that the responses of LLMs are highly sensitive to the contexts. Incorporating relevant context leads to significant performance improvement across all LLMs. Conversely, LLMs are prone to being misled by antifactual context, despite being explicitly instructed to base their answers on real-world facts. LLM performance also decreases when conditioned on irrelevant contexts. These findings highlight the lack of robustness of LLMs against adversarial examples. Ideally, a robust LLM should perform comparably in the absence of context or with irrelevant context. This poses a challenge in deploying LLMs to production, as they may inadvertently reinforce misinformation provided by users.

Related Works
LLM Evaluation Evaluation of Large Language Models (LLMs) has gained increasing interest among researchers (Bommasani et al., 2023; Bang et al., 2023; Guo et al., 2023). For instance, Bang et al. (2023) conduct a multitask, multilingual, and multimodal evaluation of ChatGPT. Holistic Evaluation of Language Models (HELM) (Bommasani et al., 2023) selects a broad range of datasets and benchmarks to evaluate the abilities of LLMs. However, previous works mostly focus on human evaluation and on using existing datasets and benchmarks (Guo et al., 2023). This requires substantial human effort and cannot guarantee the knowledge coverage needed to assess knowledge in LLMs comprehensively.
Factual Knowledge Evaluation for LLMs Evaluating the factual knowledge of LLMs helps ensure the model provides reliable and trustworthy information to users. Knowledge Graphs (KGs), which capture vast amounts of facts, offer a reliable source of factual knowledge for evaluation (Pan et al., 2023; Luo et al., 2023). LAMA (Petroni et al., 2019) adopts predefined templates to convert the facts in KGs into cloze questions, then uses LLMs to predict the answers. The prediction results are used to evaluate the knowledge stored in LLMs. Similarly, BioLAMA (Sung et al., 2021) and MedLAMA (Meng et al., 2021) assess the factual knowledge of LLMs in the medical domain by using medical knowledge graphs. Mallen et al. (2022) select unpopular facts from the Wikidata knowledge graph, identified by low-frequency clicked entities, to investigate the ability of LLMs to retain less popular factual knowledge. By enumerating all available factual triplets in KGs, we can ensure the evaluation coverage of factual knowledge. Nevertheless, existing methods lack a systematic framework containing question generation and evaluation modules. They often use predefined templates for question generation, which cannot provide diverse questions to evaluate the knowledge of instruction-tuned LLMs (Sun et al., 2023).

Automatically Question Generation from KGs
To assess knowledge in instruction-tuned LLMs, we need to evaluate whether they have such knowledge and whether they can accurately express it, i.e., their instruction-following ability and robustness. Therefore, given the same factual knowledge, we need to generate diverse questions at different levels of difficulty. Early works that generate questions from KGs use either sequence-to-sequence models or graph neural networks to convert a triplet into a natural language question (Seyler et al., 2017; Kumar et al., 2019; Indurthi et al., 2017; Chen et al., 2023). Recently, many methods harness the ability of LLMs to generate questions from KGs (Guo et al., 2022; Axelsson and Skantze, 2023). In this way, they can generate questions with varying diversity and complexity. Although previous works have generated questions from knowledge graphs, to the best of our knowledge, none of them adopt the generated questions for evaluating the factual knowledge in LLMs.

Conclusion
We propose a systematic framework to evaluate the factual knowledge of LLMs with diverse, high-coverage questions generated from KGs. The experiments reveal several factors affecting LLMs' performance and highlight their vulnerability to adversarial context. Our findings contribute to understanding LLMs' capabilities and limitations in handling factual knowledge.

Limitations
The limitations of our work include:
• Assuming a complete knowledge graph. In our work, we assess the knowledge of LLMs by using the facts in knowledge graphs. However, knowledge graphs are often incomplete and may contain many implicit facts. Thus, it could be inadequate to evaluate LLMs with the existing KGs. In the future, we plan to incorporate knowledge graph completion methods and present a more comprehensive assessment framework.
• Focusing only on triplet-based facts. We only assess the knowledge of LLMs using questions generated from single triplets, which ignores the complex knowledge represented by combinations of triplets. To assess such knowledge completely, we need to design a framework that considers the reasoning ability of LLMs on knowledge graphs.
• Evaluating the correctness of multiple-answer questions. For N-M relations, a question has multiple answers. However, the LLMs might not return all of them. How to evaluate partially answered questions remains an open question for assessing the knowledge of LLMs.

Ethics Statement
Our work aims to design a framework that can automatically assess the factual knowledge stored in large language models. In this research, we conducted experiments on publicly available datasets and implemented our approaches using commonly accepted techniques, giving utmost consideration to fairness and avoiding potential biases. We acknowledge the significance of transparency and have furnished comprehensive elucidations regarding our methodology and decision-making process.
To conclude, our research adheres to ethical guidelines and poses no potential risks.

A.1 Datasets
• T-REx (Elsahar et al., 2018) is a general-domain KG with a predefined template (Table 20) for each relation, which can be used to generate cloze sentences.
• Google-RE (Petroni et al., 2019) is a knowledge graph containing three relations: place of birth, date of birth, and place of death.
The fact triplets associated with each relation are extracted from Wikipedia and aligned with a short piece of supporting text. Table 19 shows the predefined template for each relation in Google-RE.
• Wikipedia Biography Dataset (WikiBio) (Sung et al., 2021) is a biology knowledge graph constructed by extracting biology-related facts from Wikidata. Table 21 shows the template for each relation in WikiBio.
• Unified Medical Language System (UMLS) (Bodenreider, 2004) is a medical knowledge graph constructed by domain experts. It contains information about various medical concepts and their relationships. Table 22 shows the template for each relation in UMLS.

(Figure 3 illustrates template-based and ChatGPT-based question examples for the triplet (Barack Obama, born in, Hawaii), e.g., the TFQ statements "The birth place of Barack Obama is Hawaii." (true) and "The birth place of Barack Obama is Miami." (false), the fill-in-blank question "The birth place of Barack Obama is [MASK].", and the Wh-question "Where was Barack Obama born?".)

Question Generation Given a triplet, we generate the TFQ with the ratio of true and false questions set to 1:1. For MCQ, we randomly select three incorrect entities and combine them with the correct entities as the choices. Table 5 shows the number of generated questions for each KG. We also illustrate examples of template-based and LLM-based questions in Figure 3.
Abstained Answer Detection Assessing the accuracy of answers generated by LLMs in free-text format presents a challenge in determining both the correctness of the answer and whether the model chooses to abstain from answering. For TFQ, instead of treating it as a binary classification problem, we instruct the model to respond with "UNKNOWN" when uncertain, effectively transforming it into a 3-class text classification task.
For MCQ and SAQ, we compile a curated list of phrases that indicate abstention, such as "cannot predict" or "I am sorry," and check whether any of these phrases appear in the output. If such phrases are detected, we consider the answer to be abstained.
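The phrase-list check can be sketched as follows; the phrases shown are illustrative examples, not the full curated list:

```python
# Illustrative refusal phrases; the curated list used in practice is larger.
ABSTAIN_PHRASES = ("cannot predict", "i am sorry", "unable to answer", "i don't know")

def is_abstained(answer: str) -> bool:
    """Flag an MCQ/SAQ answer as abstained if it contains a refusal phrase."""
    text = answer.lower()
    return any(phrase in text for phrase in ABSTAIN_PHRASES)
```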
Answer Prompting Each relation in the Google-RE KG comes with the corresponding paragraphs from which it is extracted. We treat this paragraph as the relevant context for a given triplet. As the irrelevant context, we sample the paragraph of an unrelated triplet, i.e., one not sharing subjects or objects. For the antifactual context, we replace the correct answer with a randomly selected entity from the KG.
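The three context conditions can be sketched as follows; the paragraph lookup and the "Context:/Question:" prompt wording are assumptions for illustration:

```python
import random

def inject_context(question, triplet, paragraphs, mode, entity_pool):
    """Prepend relevant, irrelevant, or antifactual context to a question.

    paragraphs: dict mapping each triplet to its source paragraph
    (as provided by Google-RE).
    """
    subj, _relation, obj = triplet
    if mode == "relevant":
        context = paragraphs[triplet]
    elif mode == "irrelevant":
        # a paragraph from a triplet sharing neither subject nor object
        unrelated = [t for t in paragraphs if subj not in t and obj not in t]
        context = paragraphs[random.choice(unrelated)]
    elif mode == "antifactual":
        # corrupt the relevant paragraph by swapping in a wrong entity
        wrong = random.choice([e for e in entity_pool if e != obj])
        context = paragraphs[triplet].replace(obj, wrong)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"Context: {context}\nQuestion: {question}"
```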

B Additional Results
Question Analysis We first evaluate the validity of the LLM-based questions by calculating their similarity to the template-based questions (Table 6). Then, we report the diversity of LLM-based questions in Table 7. Since the templates are written by humans, higher similarity indicates higher validity of the LLM-based questions. From the results in Table 6, we find that LLM-based questions are highly similar to template-based questions across all datasets w.r.t. two question formats: true/false question (TFQ) and fill-in-blank (FiB). This verifies the good quality of the LLM-based questions, which can thus be used to assess the factual knowledge of LLMs.
Although templates are reliable, the task of defining diverse templates can be quite burdensome. Due to the lack of templates for Wh-questions, we evaluate the diversity of Wh-questions generated by ChatGPT using Self-BLEU scores (Zhu et al., 2018); the lower the scores, the more diverse the questions. From the results in Table 7, we can see that, compared to the TFQ and FiB questions generated from templates, the Wh-questions generated by ChatGPT achieve higher diversity, which provides more natural and clearer instructions to assess the knowledge.
F1 score F1 scores on different question types are shown in Tables 8 to 11.
Precision Precision scores on different KGs are shown in Tables 12 to 14. Similar to Google-RE, ChatGPT is the top performer across all KGs, followed by FLAN-T5-XL.
Recall Recall scores on different KGs are shown in Tables 15 to 18.

Adversarial Context
The F1 score of different LLMs under different types of context is shown in Figure 4.

C Example Prompts
Question Generation Prompt The question generation prompts for TFQ, FiB, and Wh-questions can be found in Tables 23, 24, and 25, respectively.

TRUE-FALSE QUESTION
I have a triplet extracted from a knowledge graph. The triplet is organized as (Subject, Relation, Object), which describes the relation between the subject and the object. Can you help me generate a natural language sentence to describe this triplet as accurately as possible?

Figure 1 :
Figure 1: Our proposed assessment framework generates a diverse set of questions to evaluate factual knowledge in LLMs.

Figure 2 :
Figure 2: F1 scores of LLMs on Google-RE with different context prompts: none, relevant, irrelevant, and antifactual context (best seen in color). FLAN-A and FLAN-B denote FLAN-Alpaca and FLAN-Vicuna, respectively.

Figure 3 :
Figure 3: Our question generation process iterates through all fact triplets and creates multiple question types for each triplet.

Figure 4 :
Figure 4: F1 scores of LLMs on Google-RE with different context prompts: no context (none), relevant, irrelevant, and antifactual context.

Table 2 :
Average F1 score of LLMs across question types on different KGs. The best score is in bold.

Table 9 :
F1 on different question types generated from the T-REx KG.

Table 26 provides the prompt templates for different LLM families. LLaMA-7B and models finetuned on the Alpaca dataset are prompted with the same instruction format. On the other hand, Vicuna and FLAN-T5-XL employ different templates. The instructions also vary across question types and formats, as shown in Table 27.

Table 10 :
F1 on different question types generated from the WikiBio KG.

Table 11 :
F1 on different question types generated from UMLS KG.

Table 19 :
Examples of question generation template for Google-RE, where [X] denotes the subject and [Y] denotes the object.

Table 20 :
Examples of question generation template for T-REx, where [X] denotes the subject and [Y] denotes the object.

Table 21 :
Examples of question generation template for WikiBio, where [X] denotes the subject, and [Y] denotes the object.

Table 22 :
Examples of question generation template for UMLS, where [X] denotes the subject, and [Y] denotes the object.

Table 23 :
Question generation prompt for the true-false question format.

FILL-IN-BLANK QUESTION
I have a triplet extracted from a knowledge graph. The triplet is organized as (Subject, Relation, Object), which describes the relation between the subject and the object. Can you help me generate a natural language sentence to describe this triplet as accurately as possible and replace the Object with [MASK]?

Table 24 :
Question generation prompt for fill-in-blank question format.

Table 27 :
Instructions for different question types.