ALCUNA: Large Language Models Meet New Knowledge

With the rapid development of NLP, large-scale language models (LLMs) now excel in various tasks across multiple domains. However, existing benchmarks may not adequately measure these models' capabilities, especially when faced with new knowledge. In this paper, we address the lack of benchmarks for evaluating LLMs' ability to handle new knowledge, an important and challenging aspect in a rapidly evolving world. We propose an approach called KnowGen that generates new knowledge by altering existing entity attributes and relationships, resulting in artificial entities that are distinct from real-world entities. With KnowGen, we introduce a benchmark named ALCUNA to assess LLMs' abilities in knowledge understanding, differentiation, and association. Benchmarking several LLMs reveals that their performance in the face of new knowledge is not satisfactory, particularly in reasoning between new and internal knowledge. We also explore the impact of entity similarity on a model's understanding of entity knowledge and the influence of contextual entities. We urge caution when using LLMs in new scenarios or with new knowledge, and hope that our benchmark can help drive the development of LLMs in the face of new knowledge.


Introduction
Large-scale language models (LLMs) have made impressive progress in the last few years (Brown et al., 2020; Ouyang et al., 2022; Touvron et al., 2023; OpenAI, 2023), performing surprisingly well on various tasks across various domains, to the extent that many traditional benchmarks (Thorne et al., 2018; Wang et al., 2018) are no longer sufficient to measure their capabilities. Therefore, new benchmarks have been proposed to evaluate the ability of models to solve more complex tasks such as college entrance exams, law school admission tests, math competitions and so on (Hendrycks et al., 2021; Guo et al., 2023; Zhong et al., 2023). LLMs also achieve promising results on these benchmarks.

* These authors contributed equally to this work.
However, it is surprising that there is not yet a benchmark for evaluating the ability of large models in the face of new knowledge, which is very important and challenging. Why is this evaluation important? Firstly, we are in a changing world, where models encounter new knowledge frequently in practice. Moreover, some work (Peng et al., 2023) is exploring retrieval methods to augment large models, which will also cause models to meet new knowledge frequently. We naturally expect models to perform well in such situations, because re-training a model every time is very expensive and unrealistic. Secondly, as Elangovan et al. (2021) mentioned, overlap between training and test data can lead to misestimating a model's memorization ability as generalization ability, especially nowadays when LLMs are trained on an enormous amount of data. In contrast, evaluation on new knowledge need not worry about such data leakage, as new knowledge usually leads to new data and thus a more reliable and valuable assessment of the model's ability.
While such evaluation is important, constructing the benchmark is challenging. One reason is that it is difficult to ensure that the knowledge contained in the benchmark is new to LLMs, since the training data for some models are large and not publicly available. Furthermore, it is also difficult to ensure that the knowledge used for benchmarking will not become outdated, as many LLMs may soon include data from the benchmark in their training. In summary, such a benchmark for new knowledge needs to exhibit three basic characteristics: it contains enough new knowledge for sufficient evaluation (sufficient), the knowledge is new to all models (model-agnostic), and the knowledge can remain new for a long time (long-lasting).
There are several possible solutions to the above challenges. One option is to always use the most up-to-date data, such as the news of the day (temporal knowledge). However, this is labor-intensive, racing against LLM training, and the lifetime of the proposed data is unclear. Another option is to keep the benchmark closed-source, with an authoritative committee managing the data and users calling an API for evaluation, thus preventing the data from being used to train LLMs. Reaching this point, however, requires community coordination.
To better address these challenges, we propose an approach to GENerate new KNOWledge conveniently (KnowGen) based on the structured representation of existing entity knowledge, by making reasonable changes to entity attributes and relationships. The resulting artificial entities both differ from and relate to existing entities. In particular, we apply KnowGen to structured biological taxonomic data to rationally create a group of organisms that do not exist in the world. To test the model's ability in the face of new knowledge, we construct a variety of questions about these artificial entities that examine the model's ability to understand new knowledge (Knowledge Understanding), distinguish between the model's internal knowledge and new knowledge (Knowledge Differentiation), and perform multi-hop reasoning by linking the model's internal and new knowledge (Knowledge Association). We use this ArtificialLy ConstrUcted kNowledge to Assess LLMs as a benchmark (ALCUNA).
We evaluate and analyze several popular large models on ALCUNA, including ChatGPT, Alpaca, Vicuna, and ChatGLM, with vanilla, CoT (Chain-of-Thought), zero-shot and few-shot settings (Brown et al., 2020; Kojima et al., 2023; Wei et al., 2023). We find that neither ChatGPT nor the other models perform very well in the face of new knowledge. ChatGPT does a good job of understanding and differentiating new knowledge, but almost all models fail to reason between new knowledge and internal knowledge. This reminds us to remain cautious when large models encounter new scenarios and knowledge. In addition, we explore the impact of entity similarity on a model's understanding of entity knowledge, the impact of contextual entities, etc.
The contributions of our work are as follows: 1) we propose a new method, KnowGen, to generate new knowledge for simulating real scenarios; 2) we apply our method to produce an artificial biological entity knowledge dataset, ALCUNA, as a benchmark for evaluating the performance of models in the face of new knowledge; 3) we evaluate and analyze several popular large models and obtain insightful observations and conclusions.
Our benchmark has been released to the community to facilitate future research.
Algorithm 1: Knowledge Generation (fragment)
// Get the triplet sets for heredity, variation, dropout and extension
// Variation: replace the object with an entity from the same class
for (e_p, r, e′) in T_R^v do
    E′^sb ← sib(e′)
    e′^sb ← RandomSelect(E′^sb)
    T_R(ẽ) ← T_R(ẽ) ∪ {(ẽ, r, e′^sb)}
// Variation: add Gaussian noise to the value, or copy from a sibling of e_p
for (e_p, a, v) in T_A^v do ...

KnowGen: Knowledge Generation

In this section, we begin by presenting our inspiration, then formally introduce our knowledge generation method KnowGen.

Inspiration
According to the ontological form (Sowa, 1995; Noy et al., 2001), we can represent most knowledge as entities, the classes to which they belong, their attributes, and the relations between them. We are also inspired by organisms, which can naturally produce offspring with new properties through heredity and variation. Can knowledge also be "inherited" and "varied" in a broader context? Different entities of the same class have different properties as well as commonalities. Generally speaking, entities of the same class are similar to some extent while having some distinct properties. By analogy with biological terminology, we refer to this characteristic of knowledge of similar entities as "hybridizability". This inspires us to fuse the properties of different entities of the same class, simulating the process of biological inheritance and variation, to generate new entity knowledge.
In the following, we formally define and describe how knowledge is "inherited" and "varied" in our approach.

Knowledge Formalization
Based on the design of the ontology, we represent knowledge from the viewpoint of entities. An entity e ∈ E is a distinct and identifiable object or individual in the world. Each entity can possess various attributes a ∈ A, which describe the properties of the entity with a value v ∈ V. At the same time, each entity can participate in relations r ∈ R with other entities. Both the attributes and relations of entity e can be represented as sets of property triplets: {(e, a, v)} = T_A(e) ⊂ T_A and {(e, r, e′)} = T_R(e) ⊂ T_R. Entities with similar characteristics may be aggregated into a class C ⊂ E. For convenience, we denote the properties shared across all entities in class C as T(C) = T_A(C) ∪ T_R(C). Unless otherwise specified, the triplets T(e) of entity e refer to its unique properties.
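This formalization can be sketched directly as a small data structure (the class and field names here are ours, for illustration only):

```python
from dataclasses import dataclass, field

# A property triplet: (entity, attribute, value) or (entity, relation, entity).
Triplet = tuple[str, str, str]

@dataclass
class Entity:
    name: str
    attributes: list = field(default_factory=list)  # T_A(e), attribute triplets
    relations: list = field(default_factory=list)   # T_R(e), relation triplets

    def triplets(self):
        """T(e) = T_A(e) ∪ T_R(e): all property triplets of the entity."""
        return self.attributes + self.relations

alpaca = Entity(
    name="Alpaca",
    attributes=[("Alpaca", "Body mass", "60kg")],
    relations=[("Alpaca", "Eaten by", "Cougar")],
)
```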
For example, Figure 1 shows an example of such structured knowledge, where both Alpaca and Vicuna are existing entities belonging to the class Camels, and Alcuna is an artificial entity generated by us. Alpaca has attributes such as "Body mass" and relations such as "Eaten by", which can be formed into triplets such as (Alpaca, Body mass, 60kg) and (Alpaca, Eaten by, Cougar).

Construction of New Entities
Focusing on knowledge in the form of an ontology, we aim to construct artificial entities reasonably, accelerating the process of new knowledge generation in the real world. When creating an artificial entity within a specific class, it must adhere to certain common properties shared by other entities in the same class. Furthermore, it is essential for the artificial entity to possess some unique properties that differentiate it from existing entities. To address these requirements for individuality and commonality, we propose a fast and effective method to construct an artificial entity that fuses attributes and relations of entities within the same class.
Initially, we select an existing entity e_p from a specific class C to serve as the parent entity of our artificial entity ẽ, and consider the other entities within class C as sibling entities of the parent, denoted by sib(e_p) = {e_1, ..., e_n}. Our goal is to construct an artificial entity that exhibits high similarity to the parent entity and conforms to the commonality of the class (heredity), while incorporating properties from the sibling entities or reasonably changing property values (variation). As an example, in Figure 1, the artificial entity Alcuna inherits "Diet" and other attributes from the parent Alpaca, while "Body mass" is varied.
Besides the above operations of heredity and variation, we construct new entities with additional extension and dropout of their properties, in order to mimic humans' progressive cognitive processes for entities. As the example in Figure 1 shows, we extend the attribute "First appearance" from Vicuna to Alcuna, and drop the "Life span" attribute from the parent entity Alpaca.
The whole method is shown in Algorithm 1. A detailed description of KnowGen in natural language and formulas is given in Appendix A. The entities constructed in this way are not only reasonable but also convenient for assessing the model's ability to associate and differentiate new knowledge with respect to existing knowledge.
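The heredity, variation, extension and dropout operations can be sketched roughly as follows. All names and probabilities here are illustrative assumptions, not the paper's actual settings, and the perturbation of attribute values is simplified to borrowing a sibling's value:

```python
import random

def know_gen(parent_triplets, sibling_triplets, p_vary=0.3, p_drop=0.1,
             p_extend=0.2, new_name="NewEntity", rng=None):
    """Build an artificial entity's property triplets from a parent entity and
    its siblings. Triplets are (subject, property, object) tuples."""
    rng = rng or random.Random(0)
    new_triplets = []
    # Heredity, variation and dropout over the parent's properties.
    for _, prop, obj in parent_triplets:
        r = rng.random()
        if r < p_drop:
            continue                                      # dropout: omit property
        if r < p_drop + p_vary:                           # variation: borrow a
            candidates = [o for sib in sibling_triplets   # sibling's value
                          for _, p, o in sib if p == prop and o != obj]
            if candidates:
                obj = rng.choice(candidates)
        new_triplets.append((new_name, prop, obj))        # heredity (maybe varied)
    # Extension: adopt some properties the parent does not have.
    parent_props = {p for _, p, _ in parent_triplets}
    for sib in sibling_triplets:
        for _, prop, obj in sib:
            if prop not in parent_props and rng.random() < p_extend:
                new_triplets.append((new_name, prop, obj))
    return new_triplets
```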

Question Answering as Evaluation Task
Based on the constructed artificial entities, a natural form of evaluation is to ask LLMs questions in the context of new knowledge. Specifically, we leverage an attribute triplet (ẽ, a, v) to generate a one-hop question q(ẽ, a, v) by asking about the object v, given the subject ẽ and attribute a. With a chain of relation triplets T_C = (ẽ, r, e_1) → (e_1, r_1, e_2) → ... → (e_{N−1}, r_{N−1}, e_N), we construct a multi-hop question q(T_C) asking about the tail object e_N, given the head subject ẽ and relations {r, r_1, ..., r_{N−1}}.
We propose that an LLM requires knowledge understanding, knowledge differentiation and knowledge association abilities when faced with new knowledge. To enable a more detailed assessment, we design diverse categories of questions to evaluate each ability, as shown on the right side of Figure 1. Specifically, we sample attribute triplets from the variation set T_A^v(ẽ) and the dropout set T_A^d(ẽ) to create the KD question set, which is ideally suited for evaluating the knowledge differentiation ability of a model: distinguishing between the parent entity e_p existing in its internal knowledge and the newly constructed artificial entity ẽ provided in the context.
The proficiency of LLMs in reasoning over a mastered knowledge graph has been demonstrated in previous studies (Hu et al., 2022; Choudhary and Reddy, 2023). However, it remains uncertain whether they can effectively establish connections between a newly encountered artificial entity and existing knowledge for multi-hop reasoning tasks.
To investigate this, we incorporate the relations of the artificial entity to construct a new graph encompassing both existing entities and the artificial entity. We then perform a breadth-first search on the relation graph to identify a chain of relation triplets T_C with the artificial entity as root (e.g., [(Alcuna, Eaten by, Jaguar), (Jaguar, Compete with, Maned Wolf)]), and utilize the chain to generate a multi-hop question q(T_C) (e.g., "What organism is the competitor of Alcuna's natural enemy?"). We group such questions into the KA question set.
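The chain extraction step can be sketched as a breadth-first search over a relation graph; this is a simplified illustration of the idea, not the paper's actual implementation:

```python
from collections import deque

def relation_chains(graph, root, max_hops=2):
    """Breadth-first search from the artificial entity `root` over a relation
    graph {subject: [(relation, object), ...]}, returning all triplet chains
    of up to `max_hops` hops rooted at `root`."""
    chains = []
    queue = deque([(root, [])])
    while queue:
        node, chain = queue.popleft()
        if len(chain) == max_hops:
            continue
        for rel, obj in graph.get(node, []):
            new_chain = chain + [(node, rel, obj)]
            chains.append(new_chain)
            queue.append((obj, new_chain))
    return chains

# The example chain from the text:
graph = {
    "Alcuna": [("Eaten by", "Jaguar")],
    "Jaguar": [("Compete with", "Maned Wolf")],
}
```

Each returned chain can then be handed to the question-template generator to produce a KA question.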
For the rest of the artificial entity's property triplets, we utilize them to evaluate the ability to remember and understand new knowledge by simply asking questions about the objects in the triplets. We group such questions into the KU question set.

ALCUNA: Our Benchmark
With our proposed method, one can quickly create a large number of new entities based on existing structured knowledge. A natural attempt is to construct new organisms based on already discovered ones, since biological taxonomy is perfectly suited to our proposed approach. Therefore, we utilize KnowGen to propose ALCUNA, a biological dataset for evaluating the ability of models in the face of new knowledge, as shown in Figure 1.

EOL Database
We utilize the structured data from the EOL (Encyclopedia of Life) database (Parr et al., 2014) to provide existing knowledge. EOL is an online, freely accessible database that aims to provide information about all known species on Earth. It organizes all biological taxa into a taxonomic tree, in which each entity belongs to a class. The most intriguing feature of EOL is that it constructs rich structured data in the form of key-value pairs for each biological entity, including taxonomic rank, attributes, relationships and information sources. In total, it contains 2,404,790 entities with 13,625,612 properties over 669 property types. The substantial volume of data, coupled with its well-organized format, renders EOL the most suitable data source for constructing ALCUNA.

Artificial Entity Construction Details
Each time, we select a class C from the taxonomic tree of EOL and consider its members as entities. We then divide them into one parent entity e_p and its siblings sib(e_p), from which we construct the artificial entity. Since a high-level taxon is usually not a specific organism and the properties it possesses may be too homogeneous, we filter out all taxa belonging to kingdom, phylum and domain.
In the real world, the name of a newly found creature usually incorporates a prefix or suffix shared with other creatures of the same species. To mimic this real-world scenario, and considering the tokenization algorithms used in LLMs, we first split the names of related existing entities (i.e., the parent entity and its sibling entities) into subwords using the GPT-2 tokenizer. Then we randomly select names of related entities, and for the i-th selected entity we choose its i-th subword. For example, "ALCUNA" is created from Alpaca and Vicuna.
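The naming scheme above can be sketched as follows. The splitter here is a toy stand-in for the GPT-2 tokenizer, and composing exactly two parts is an illustrative simplification:

```python
import random

def make_name(related_names, split, rng=None):
    """Pick related entity names at random and take the i-th subword from the
    i-th pick, then join the subwords into a new name."""
    rng = rng or random.Random(0)
    picks = [rng.choice(related_names) for _ in range(2)]
    parts = []
    for i, name in enumerate(picks):
        subwords = split(name)
        parts.append(subwords[min(i, len(subwords) - 1)])
    return "".join(parts).capitalize()

# A toy subword table standing in for a real tokenizer:
toy_split = {"Alpaca": ["Al", "paca"], "Vicuna": ["Vi", "cuna"]}
name = make_name(["Alpaca", "Vicuna"], lambda n: toy_split[n])
```

Depending on the random picks, this toy version yields names such as "Alcuna" or "Vipaca".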
We leverage ChatGPT in the question generation process to avoid expensive labor costs, following Petroni et al. (2019). Specifically, we use ChatGPT to generate a question template with a placeholder [T], given only the relevant properties, to avoid introducing too much of the model's knowledge of a specific entity. We then generate questions from the template by replacing [T] with the name of the head subject. We generate five question templates for each property group in the form of multiple-choice, fill-in-the-blank and Boolean questions. The details of the prompts used for question generation, with examples, are shown in Appendix B.2. To ensure the quality of this automatic question generation, we randomly sample 100 one-hop and 100 multi-hop questions for human checking. Of the generated one-hop questions, 98% are correct; of the multi-hop questions, 95% are correct. This shows that this way of constructing questions is acceptable.

Dataset Summary
With the previous steps, we construct a dataset, ALCUNA, for evaluating the ability of LLMs in the face of new knowledge. The ALCUNA dataset consists of a total of 84,351 questions about 3,554 artificial entities. We ensure that the constructed artificial entities contain rich and unique attributes and relationships by filtering out parent entities with fewer than three properties. Specifically, each artificial entity contains 11.75 property triplets and 25.39 siblings on average. The distribution of the number of property triplets is shown in Figure 2. We organize the dataset in terms of questions; for each question we collect the corresponding property triplets as evidence and the relevant artificial entities' information as new knowledge. We divide all questions into three subsets, KU, KD and KA, as mentioned in Section 2.4, to measure the corresponding capabilities of LLMs in a fine-grained manner. Specifically, KU, KD and KA contain 11,316, 27,186 and 15,353 questions respectively. Details about the question forms in the three subsets are given in Appendix B.1.

LLMs Selected for Evaluation
We select several popular LLMs for evaluation and analysis on our benchmark, including ChatGPT, Alpaca-7B, Vicuna-13B and ChatGLM-6B. A detailed description of the selected models can be found in Appendix C.

Evaluation Methods
To adapt evaluation to the era of large models and match practical application scenarios, we introduce two types of evaluation methods: zero-shot and few-shot. We implement both vanilla and Chain-of-Thought (CoT) reasoning forms for the zero-shot and few-shot settings.
For experiments in the zero-shot setting, our inputs are structured representations of new knowledge together with the questions to be answered by the model. For experiments in the few-shot setting, we include several examples of the same form together with their answers, which we hope will help the model understand the task. For zero-shot CoT, we append "Let's think step by step." at the end of questions, and for few-shot CoT, the reasoning process for the answers of the examples is also attached. Please refer to Appendix D for the detailed method description. An example of the prompt used in our experiments is shown in Table 12.
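The prompt assembly for these settings can be sketched as follows; this is an illustrative template of ours, not the exact prompt used in the experiments (see Table 12 for that):

```python
def build_prompt(knowledge_json, question, examples=(), cot=False):
    """Assemble an evaluation prompt: zero-shot when `examples` is empty,
    few-shot otherwise; `cot` appends the zero-shot CoT trigger."""
    parts = []
    # Few-shot demonstrations, each with its own knowledge, question, answer.
    for ex_knowledge, ex_question, ex_answer in examples:
        parts.append(f"Knowledge: {ex_knowledge}\n"
                     f"Question: {ex_question}\nAnswer: {ex_answer}")
    tail = f"Knowledge: {knowledge_json}\nQuestion: {question}\nAnswer:"
    if cot:
        tail += " Let's think step by step."
    parts.append(tail)
    return "\n\n".join(parts)
```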

Evaluation Metric
Since our questions are in the form of multiple-choice, fill-in-the-blank or Boolean questions, the gold answers are usually just one or a few words. Therefore, we determine the correctness of the model output by matching it against the answer (i.e., accuracy). Since there may be more than one possible answer to some questions, such as those asking about the geographical distribution of entities, we consider the output correct as long as it matches one of the correct answers. This is a less stringent measurement, but as seen in Section 5, most models still perform poorly. Using the proposed dataset, we evaluate each model's ability with each method for knowledge understanding, knowledge differentiation, and knowledge association, respectively. We report the average score on the entire benchmark in each setting.
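The lenient matching described here can be implemented along these lines; the case-insensitive substring match is our assumed normalization, not necessarily the paper's exact matching rule:

```python
def score(model_output, gold_answers):
    """Return 1 if any acceptable gold answer appears in the model output
    (case-insensitive substring match), else 0."""
    out = model_output.lower()
    return int(any(ans.lower() in out for ans in gold_answers))

def benchmark_accuracy(outputs, answer_sets):
    """Average the per-question scores over the benchmark."""
    scores = [score(o, a) for o, a in zip(outputs, answer_sets)]
    return sum(scores) / len(scores)
```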

Data Filtering
Since different models have different internal/existing knowledge due to differences in model size and training data, we further filter the samples in our dataset for each model before testing, based on previous work (Petroni et al., 2019). This avoids being influenced by these differences (which are not the theme of our paper) and allows comparing the models' performance in the face of new knowledge in a more focused and fair way. For our method of filtering questions, please refer to Appendix E. We experiment with and analyze the four models mentioned in Section 4.1 based on the filtered new knowledge, using the evaluation settings introduced in Section 4.2.

Overall Results
The performance of the LLMs on our benchmark under different settings is shown in Table 1. ChatGPT has the best performance in all settings, which is consistent with common belief. Vicuna has the second-best performance among all models. In terms of methods, the few-shot setting performs better than the zero-shot setting overall, and CoT performs better than the vanilla form in most cases.
In the face of new knowledge, as seen in Table 1, LLMs perform poorly, except for ChatGPT on the KU and KD experiments. Among all abilities, knowledge association is clearly the most difficult for LLMs: all of them have difficulty relating provided new knowledge to their internal knowledge, and thus in performing multi-hop reasoning correctly. Performance on knowledge understanding and knowledge differentiation is better than on knowledge association, but still not satisfactory for most LLMs.
In summary, current LLMs perform relatively poorly in the face of new knowledge: they do slightly better at knowledge understanding and knowledge differentiation, and have more difficulty reasoning across new and existing knowledge. For a clearer view of the models' output, please refer to Appendix F for an analysis.
Considering that it is expensive, slow and unstable to call ChatGPT's API, without loss of generality, all the following comparison experiments for analysis are conducted on the three other models. In addition, for convenience, the following analysis experiments are performed with the vanilla few-shot method and structured input of artificial entity knowledge, unless stated otherwise.

Impact of Entity Similarity
In this section, we explore the effect of the similarity between the artificial entity and its parent entity on model performance over the KD questions, which are designed to assess the model's ability to distinguish between new knowledge and existing knowledge. Specifically, we explore attribute similarity and name similarity.
The More Similar, the More Confusing (unless powerful enough) We define the proportion of overlapping properties between entities as their property similarity. As shown in Figure 3, we divide questions according to the property similarity between the parent entities and the artificial entities, and calculate the scores of models in different similarity ranges on the KD questions. We find that the ability of the robust ChatGPT to differentiate entities is almost independent of the similarity. But all other models are affected by entity similarity: the more similar the new entities are to the existing entities, the harder they are to differentiate, which illustrates a flaw of LLMs. Further analysis is shown in the appendix.
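The property similarity can be computed, for instance, as the overlap between the two entities' property sets; the exact definition of "proportion of overlap" is not specified here, so the Jaccard-style ratio below is our assumption:

```python
def property_similarity(entity_a, entity_b):
    """Proportion of overlapping (property, object) pairs between two
    entities, ignoring the subject slot so that a parent and an artificial
    entity can be compared directly."""
    props_a = {(p, o) for (_, p, o) in entity_a}
    props_b = {(p, o) for (_, p, o) in entity_b}
    if not props_a | props_b:
        return 0.0
    return len(props_a & props_b) / len(props_a | props_b)
```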
The Name also Plays a Role Since each artificial entity is identified by its name, we suspect that the name might affect the model's ability to distinguish new knowledge, so we conduct comparison experiments on the KD questions. We assign two different names to artificial entities with the same properties: one is a randomly generated sequence of characters (random), and the other is the name of the parent entity with one character randomly substituted (similar). The results on the KD questions are shown in Table 2. We find that names do influence model cognition, and similar names are more likely to confuse the model. However, the effect is not very large and both results are low, which shows that an LLM's modeling of new entities derives both from the names and from the properties.

Impact of Provided Knowledge
To further explore the performance of LLMs when new knowledge is provided in the context, we vary the knowledge content provided in the analysis experiments.
Parent Entity Aggravates Confusion According to Section 5.2, the models suffer from a certain degree of confusion when faced with new knowledge that overlaps in name or properties with internal knowledge. To further verify this, we explicitly introduce parent entities into the context. Specifically, we conduct two comparison experiments: 1) adding parent entity knowledge to the context; 2) adding a random existing entity's knowledge as a control. The results show that all models are distracted by additional entities in the context, which is consistent with previous work (Shi et al., 2023). More significantly, for the Vicuna and ChatGLM models, the parent entity brings a more substantial performance degradation than the irrelevant entity, again confirming the confusion problem of existing large models in the face of new knowledge.

Chain Entities are Key to Knowledge Association
To analyze more clearly why all models perform poorly on knowledge association, we conduct two additional sets of experiments on the KA questions: 1) adding knowledge about the entities involved in the reasoning chain to the context; 2) randomly sampling the same number of entities into the context for a fair comparison. The final results are shown in Table 4. The score of all models improves substantially after adding the information of the entities required for inference, and the performance of all models decreases after adding irrelevant entities. This shows that the main problem is that LLMs truly cannot associate new knowledge with existing knowledge well, rather than simply being unable to reason.
Structured Knowledge is Better Since our knowledge is represented structurally, the knowledge input in our experiments is also structured, as shown in Table 12. To explore the effect of the form of knowledge representation, we additionally run comparison experiments with knowledge in natural language form as input (NL). We use templates to transform each attribute into a language description, similar to the template-based question generation process in Section 3.3. As seen in Table 5, all models generally perform better with the structured input setting (JSON). The models' understanding of such structured text may come from code in their training data. This indicates that for this kind of high-density knowledge input, a clear and structured representation is more helpful for the model's understanding.
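The two input formats can be sketched as follows; the JSON layout and the single NL template are illustrative assumptions (the actual NL templates are generated per property):

```python
import json

def to_json_knowledge(name, triplets):
    """Structured (JSON) knowledge input: group property triplets by key."""
    props = {}
    for _, prop, obj in triplets:
        props.setdefault(prop, []).append(obj)
    return json.dumps({"name": name, "properties": props})

def to_nl_knowledge(name, triplets, template="The {prop} of {name} is {obj}."):
    """Natural-language variant built from a simple fill-in template."""
    return " ".join(template.format(name=name, prop=p, obj=o)
                    for _, p, o in triplets)
```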
Assess New Models with ALCUNA

In this section, we discuss how to apply the proposed ALCUNA benchmark to other models. There are two different application scenarios for our benchmark. First, if one wants to assess the knowledge understanding performance of different LLMs in the face of new knowledge, ALCUNA can be directly utilized for evaluation. On the other hand, if one wants to compare knowledge differentiation and association abilities, the different background knowledge inside different models may lead to an unfair comparison. Therefore, we need to conduct additional filtering on ALCUNA to ensure the existing entities are possessed by all models, which will shrink our benchmark. Nevertheless, the current benchmark has already been filtered on models such as Alpaca, and a reasonable assumption is that subsequent models will be increasingly powerful, so the resulting shrinkage should not be severe.
Related Work

Large language models have shown breakthroughs on a variety of tasks, and some have been applied as commercial products in daily work. Since the world is constantly changing, the ability of models to perform when faced with new knowledge is critical.
Existing Benchmarks Benchmarks such as SQuAD (Rajpurkar et al., 2016), SNLI (Bowman et al., 2015), GLUE (Wang et al., 2018), SuperGLUE (Wang et al., 2020) and LAMBADA (Paperno et al., 2016) are essential for setting evaluation criteria and tracking model performance. While some traditional benchmarks like SQuAD and SNLI assess single-task abilities, GLUE and SuperGLUE evaluate general language models across various NLP tasks. More recently, with the rapid development of LLMs, new benchmarks have been proposed to evaluate more complex tasks such as college entrance exams, law school admission tests and so on (Hendrycks et al., 2021; Guo et al., 2023; Zhong et al., 2023). However, these benchmarks are still based on existing knowledge, which is abundant in the training data of LLMs. Some knowledge benchmarks (Onoe et al., 2021; Mallen et al., 2023; Arodi et al., 2023) evaluate models' ability at knowledge integration, but they either only evaluate small models, which are very different from LLMs (in training data, ability, etc.), or simply use knowledge that already exists. Therefore, benchmarks that assess LLMs' ability with new knowledge rather than existing knowledge are urgently needed.
Source of New Knowledge Many works use temporal knowledge as a source of new knowledge to evaluate the new-knowledge behavior of LLMs (Lazaridou et al., 2021; Agarwal and Nenkova, 2022; Jang et al., 2023; Zaporojets et al., 2023). Other work uses entity or attribute substitution to create new knowledge (Longpre et al., 2022; Zhou et al., 2023). A discussion of the differences, strengths and weaknesses of our work versus prior work is given in Appendix I.

Conclusion
We propose a new approach, KnowGen, to construct new knowledge, and build an artificial biological entity benchmark, ALCUNA, for evaluating the ability of LLMs faced with new knowledge. We test and analyze several popular models with commonly used methods and draw some useful conclusions. Our proposed approach and benchmark can help the development of more powerful LLMs that can understand, differentiate and reason across new and existing knowledge. In the future, we expect that more LLMs will be evaluated on our benchmark, and more benchmarks in other disciplines can be constructed based on our method.

Limitations
Although the method we design can be used to construct any knowledge that satisfies the ontological representation, we have implemented it only on biological data. It is entirely possible to use this approach to create new knowledge in other domains for evaluating LLMs, because KnowGen only requires some defined classes, entities within those classes with their own attributes, and connections between the entities. Any knowledge base with such a structure can be used to create new knowledge with KnowGen. In the future, we hope that new knowledge datasets from more domains will become available.
We evaluate only a few powerful models, because some models are closed-source or have too many parameters. We expect that more models can be measured with our benchmark.

A KnowGen Method
Here, we describe KnowGen in natural language and mathematical formulas.
Following the symbolic representation in Section 2.2, we divide the property triplets T(e) = T_A(e) ∪ T_R(e) of the parent entity into three sets: remain, change and delete, denoted as T^*(e) = T_A^*(e) ∪ T_R^*(e), * ∈ {r, c, d}. We randomly sample some triplets from sibling entities as the add set T^a(ẽ) = T_A^a(ẽ) ∪ T_R^a(ẽ). Subsequently, we fuse the above four sets to form the properties of the artificial entity. We retain T^r(e) for high similarity between the parent and artificial entities.
We modify the objects of the triplets in T^c(e) with some perturbation to create uniqueness in the properties of the artificial entity, but without introducing overly unusual values:

T^c(ẽ) = {(ẽ, p, perturb(o)) | (e, p, o) ∈ T^c(e)},

in which perturb() represents a perturbation function that modifies values according to specific needs. For a numeric attribute value v, we add Gaussian noise to it. For a non-numeric attribute value v, we use the object of the same attribute from sibling entities. For an associated object entity e′, we randomly choose one of its siblings as the modified value.
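As a rough illustration, the fusion and perturbation steps described above could be sketched as follows. This is a minimal sketch, not the actual ALCUNA implementation; all function and variable names are hypothetical, and the set-size split is an assumed example.

```python
import random

def perturb(value, sibling_values, sigma=1.0):
    """Perturb a triplet object: Gaussian noise for numeric values,
    a sibling entity's value for non-numeric ones."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value + random.gauss(0, sigma)
    return random.choice(sibling_values)

def make_artificial_entity(parent_triplets, sibling_triplets, sibling_values):
    """Split the parent's (subject, property, object) triplets into
    remain / change / delete sets, perturb the 'change' objects, and
    add triplets sampled from siblings; the 'delete' set is dropped."""
    triplets = list(parent_triplets)
    random.shuffle(triplets)
    k = max(1, len(triplets) // 3)
    change, delete, remain = triplets[:k], triplets[k:2 * k], triplets[2 * k:]
    changed = [(s, p, perturb(o, sibling_values)) for (s, p, o) in change]
    added = random.sample(sibling_triplets, min(2, len(sibling_triplets)))
    return remain + changed + added
```

The key design point is that the remain set preserves similarity to the parent, while the changed and added triplets make the artificial entity distinct from any real-world entity.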
We also include T^a(ẽ) to ensure the commonality of entities within the specific class C.

B Question Generation

B.1 Question Forms

Our benchmark contains multiple-choice, fill-in-the-blank and Boolean questions. The specific distribution of question types is shown in Table 6. Note that we generate only multiple-choice questions for the KA question set, for two reasons: first, multi-hop questions are too hard for current LLMs, as shown in Table 1; second, one multi-hop question may have multiple answers, so to evaluate the results more accurately and avoid false negatives, we use choices to limit the answer space.

B.2 Natural Language Question Generation
We prompt ChatGPT to generate question templates given a property triplet in the one-hop setting or a chain of property triplets in the multi-hop setting. We provide the prompts for generating Boolean and fill-in-the-blank question templates under both settings in Figures 4, 5 and 6. We also show some generated question templates in Table 8. To generate multiple-choice questions, we simply append four choices to the corresponding fill-in-the-blank questions.
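The step of appending choices to a fill-in-the-blank question could look something like the sketch below (a hypothetical helper, not the paper's code; the option labels and layout are assumptions):

```python
import random

def to_multiple_choice(question, correct, distractors, rng=random):
    """Turn a fill-in-the-blank question into a multiple-choice one by
    appending four shuffled options (the answer plus three distractors)."""
    options = list(distractors[:3]) + [correct]
    rng.shuffle(options)
    labels = ["A", "B", "C", "D"]
    body = "\n".join(f"{l}. {o}" for l, o in zip(labels, options))
    answer_label = labels[options.index(correct)]
    return f"{question}\n{body}", answer_label
```

Returning the gold label alongside the question makes automatic scoring straightforward.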
To guarantee the accuracy of the questions generated in this way, we randomly sample 100 one-hop and 100 multi-hop questions for human checking. The results indicate that over 98% of the one-hop questions and over 95% of the multi-hop questions are accurate.

C Introduction to the Models We Select
We choose four representative and popular LLMs. ChatGPT is one of the most powerful models but is closed-source, so we can only call it through the API. The other three are all excellent open-source models.

• ChatGPT: ChatGPT is a sibling model to InstructGPT (Ouyang et al., 2022), which is trained on a vast instruction dataset and further tuned by reinforcement learning from human feedback (RLHF).
• Alpaca-7B: Alpaca is an open-source instruction-following LLM trained for academic purposes, which is fine-tuned from the LLaMA 7B model (Touvron et al., 2023) on 52K instruction-following demonstrations.
• Vicuna-13B: Vicuna is trained by fine-tuning LLaMA on 70K user-shared conversations with ChatGPT. A preliminary evaluation using GPT-4 as a judge shows that Vicuna-13B achieves more than 90% of the quality of ChatGPT.
• ChatGLM-6B: ChatGLM is also an opensource instruction-following LLM, which is based on General Language Model (Du et al., 2022) framework.
We download the three open-source models above from Huggingface, and thanks to the convenient design of the FastChat library, we unify the testing framework for all models and call them through the API.
We also evaluate several models of around 7B parameters to compare models of the same size, which may help analyze the impact of different training processes on the ability to handle new knowledge. Specifically, we select Llama2-Chat-7B, Alpaca-7B, Vicuna-7B and ChatGLM-6B. The results are presented in Table 9.

D Introduction to Our Experimental Methods

D.1 Zero/Few-shot Evaluation Setting
In the zero-shot setting, the model is given explicit instructions to complete the task directly. This scenario evaluates the ability of the original model to solve the problem autonomously, without training. In our benchmark, the zero-shot input is the new knowledge of the artificial entity and a question about it.
Compared to the zero-shot setting, in the few-shot setting the model is additionally given several examples from the same task as a reference. This evaluates the model's ability to learn the task quickly from a limited number of samples, and is also consistent with practical situations where supervised training is not convenient. Following Min et al. (2022), we provide 3 to 5 examples for each type of question, which we expect to be sufficient to demonstrate the label space for that type.
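The two settings differ only in whether exemplars precede the query. A minimal prompt-assembly sketch, assuming a simple "Knowledge / Question / Answer" layout (the exact field names and formatting are our assumptions, not taken from the paper):

```python
def build_prompt(knowledge, question, exemplars=()):
    """Assemble a zero-shot (no exemplars) or few-shot prompt:
    each exemplar pairs an entity's knowledge and a question with
    its gold answer; the query instance ends with an empty answer slot."""
    blocks = [
        f"Knowledge: {k}\nQuestion: {q}\nAnswer: {a}"
        for (k, q, a) in exemplars
    ]
    blocks.append(f"Knowledge: {knowledge}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)
```

With an empty exemplar list this degenerates to the zero-shot input; passing 3 to 5 demonstrations yields the few-shot variant.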

D.2 Chain-of-Thought (CoT) Form
For both the zero-shot and few-shot evaluation settings, we add a CoT variant. In the zero-shot setting, we append the words "Let's think step by step." to the question, expecting the model to output its thinking process, which can help the LLM reason about complex problems. In the few-shot setting, we add the thought process to the answer of each sample question shown, to inspire the model.
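Both CoT variants can be expressed in a few lines; the sketch below is illustrative only, and the exemplar format ("So the answer is …") is our assumption:

```python
COT_TRIGGER = " Let's think step by step."

def with_cot(question, exemplars=None):
    """Zero-shot CoT appends the trigger phrase to the question;
    few-shot CoT instead shows a reasoning trace in each exemplar."""
    if exemplars is None:
        return question.rstrip() + COT_TRIGGER
    demos = [
        f"Q: {q}\nA: {reasoning} So the answer is {answer}."
        for (q, reasoning, answer) in exemplars
    ]
    return "\n\n".join(demos + [f"Q: {question}\nA:"])
```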

E Details of Our Approach to Filtering Questions
Specifically, we retain only those artificial entities whose parent entities can be perfectly recalled by the model. In addition, since answering multi-hop questions requires the model to use each piece of single-hop knowledge, we also filter out any reasoning chain containing knowledge that the model cannot correctly recall. For both filtering steps, we construct question templates for the knowledge involved, including attributes and relationships, based on previous work (Petroni et al., 2019), and then query the model in the few-shot setting. We filter the samples in our benchmark for every evaluated model to ensure that our questions specifically measure the ability to handle new knowledge, and then take the intersection of the filtered questions for fair experimentation and analysis. The number of questions left per category is shown in Table 10.
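The two filtering steps share the same structure; a sketch under the assumption that a `recalls(fact)` callable wraps the template-based model query (the data layout below is hypothetical):

```python
def filter_benchmark(entities, chains, recalls):
    """Keep artificial entities whose parent facts the model fully
    recalls, and reasoning chains whose every hop is recalled.
    `recalls(fact)` stands in for querying the model with a
    template-based question and checking its answer."""
    kept_entities = [
        e for e in entities
        if all(recalls(f) for f in e["parent_facts"])
    ]
    kept_chains = [c for c in chains if all(recalls(hop) for hop in c)]
    return kept_entities, kept_chains
```

Running this per model and intersecting the kept question sets gives the model-agnostic subset used for the final comparison.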

F Analysis about Models' Output
An example of the output of the model is shown in Table 11.To better analyze the models' responses, as shown in Table 7, we divide all the models' error outputs into 3 categories, including rejecting responses, answering multiple options, and other incorrect responses.
We find that the percentage of answering multiple options is very small for all models, which indicates that all models understand and comply with our requirements very well. In addition, some questions are rejected by some models, probably because the model recognizes that it cannot answer the corresponding question and responds with "I don't know" or "I am sorry".

G Impact of Different Modifications
As shown in Section 5.2, models commonly struggle with knowledge differentiation when the artificial entity and the parent entity are similar. In this section, we further conduct an ablation study to investigate the specific impact of the different modifications (i.e., variation and dropout).
To ablate one type of modification, we reconstruct artificial entities in the KD dataset. For each question in the original KD dataset, we have a parent entity e_p and the corresponding queried attribute a. We then reconstruct several artificial entities by modifying one attribute of e_p (other than a) at a time. From this, we obtain several artificial entities whose varying similarity is caused by the same type of modification. For a fair comparison, we only experiment with parent entities that have 10 attributes. We randomly sample 1000 parent entities and create a total of 1000 × 10 = 10000 new artificial entities. Finally, we query the models about these entities with the same question about a.
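The reconstruction loop can be sketched as follows (a schematic, with entities simplified to attribute dictionaries and `modify` standing in for either variation or dropout):

```python
def ablation_variants(parent, queried_attr, modify):
    """Create one artificial entity per attribute of the parent
    (excluding the queried attribute), each differing from the
    parent by a single modification of that attribute."""
    variants = []
    for attr, value in parent.items():
        if attr == queried_attr:
            continue
        variant = dict(parent)
        variant[attr] = modify(attr, value)
        variants.append(variant)
    return variants
```

Because every variant differs from the parent by exactly one attribute, any performance difference across variants can be attributed to the modification type rather than to the amount of change.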
The experimental results are presented in Table 7. Different models exhibit almost the same trend under both modifications. It is clear that dropout yields a stable improvement as the similarity decreases, while the impact of variation is relatively weak and insignificant.

Figure 1 :
Figure 1: Demonstration of ALCUNA, including heredity, variation, extension and dropout operations in KnowGen, generated artificial entity named Alcuna and three types of questions related to it.

Figure 2 :
Figure 2: (a) The number of entities with different counts of attributes. (b) The number of entities with different counts of relations.

/* system */
System: You are a powerful question generation model with biological knowledge. Given a biological taxon's property name, you need to generate a question template with a placeholder [T] about the given property, so that the placeholder [T] can be replaced with any taxon's name. Try to create the question even if the property is not a biological property. Don't reply with any explanation or other information.

/* exemplars */
User: Property name: skeleton contains
Agent: Question template: Which organic compound is a component of a [T]'s skeleton?
User: Property name: body shape
Agent: Question template: What's the body shape of [T]?
User: Property name: litters per year
Agent: Question template: How many litters can [T] have per year?

/* query */
User: Property name: {$Property name}

Figure 6 :
Figure 6: Prompt for generating fill-in-the-blank question templates.

Table 2 :
Results on KD questions for entities with the same properties but different names.

Table 3 :
Results on KD questions with different knowledge in context.

Table 4 :
Results on KA questions with different knowledge in context.

Table 5 :
Results on ALCUNA with knowledge in JSON and natural language format (NL).

Table 7 :
The percentage of categories of incorrect responses from large models.

Table 9 :
Performance of LLMs of 7B parameters at our benchmark under different settings.

Table 10 :
Number of different forms of KU, KD and KA questions after filtering.