EpiK-Eval: Evaluation for Language Models as Epistemic Models

In the age of artificial intelligence, the role of large language models (LLMs) is becoming increasingly central. Despite their growing prevalence, their capacity to consolidate knowledge from different training documents - a crucial ability in numerous applications - remains unexplored. This paper presents the first study examining the capability of LLMs to effectively combine such information within their parameter space. We introduce EpiK-Eval, a novel question-answering benchmark tailored to evaluate LLMs' proficiency in formulating a coherent and consistent knowledge representation from segmented narratives. Evaluations across various LLMs reveal significant weaknesses in this domain. We contend that these shortcomings stem from the intrinsic nature of prevailing training objectives. Consequently, we advocate for refining the approach towards knowledge consolidation, as it harbors the potential to dramatically improve their overall effectiveness and performance. The findings from this study offer insights for developing more robust and reliable LLMs. Our code and benchmark are available at https://github.com/chandar-lab/EpiK-Eval


Introduction
Developing systems that can reason through language understanding has been a cornerstone of natural language processing research. Recent progress (Devlin et al., 2019; Brown et al., 2020; Touvron et al., 2023) has showcased notable advancements in a variety of reasoning tasks (Hwang et al., 2021; Cobbe et al., 2021; Yang et al., 2022; Han et al., 2022; Zelikman et al., 2022; Lampinen et al., 2022). Arguably, the ability of LMs to act as knowledge bases (Izacard et al., 2022) has been a large factor in these successes. However, observed errors (Kim et al., 2021; Zhang et al., 2023) on tasks which entail learning dependencies among multiple facts can potentially be linked to this knowledge being diffused, a state where the known information remains independent (AlKhamissi et al., 2022).

Figure 1: When training on samples (red), Type I systems process each sequence independently, unable to discern their interrelations. Presented with a question (gray), they are unable to consolidate their knowledge and instead assign a probability to each fact when answering (green). In contrast, Type II systems can learn these relationships and possess a unified knowledge state, allowing them to answer accurately. [Figure example: facts "Tom ate an apple.", "Tom ate a pear.", "Bob ate an orange."; question "What did Tom eat?"; Type II answer "Tom ate an apple and a pear."]
Meanwhile, humans maintain a consistent internal representation of the world which they actively use for reasoning (Nader, 2009; Johnson-Laird, 2010). This motivates equipping and evaluating language models for knowledge consistency (Moghaddam and Honey, 2023; Hao et al., 2023), as the lack of consistency and consolidation in parametric knowledge could result in poor reasoning (Madsen et al., 2022; Valmeekam et al., 2022; Zheng et al., 2023). Extrapolating from AlKhamissi et al. (2022), we focus on the behaviour of LMs as epistemic models (Rendsvig and Symons, 2019; Osband et al., 2023) with a consolidated and consistent retention of multiple learned facts in their parameters, a knowledge state.
When the facts are concatenated into a long context, the knowledge state can be constructed solely from this context. The success of in-context learning, where a LM infers over a specific prompt describing the task and a few examples (Brown et al., 2020; Lu et al., 2021; Wu et al., 2022), primarily relies on the information in the input being correct (Liu et al., 2021). However, real-world scenarios rarely adhere to this setting. For instance, a LM might have to recall information stored in its parameter space, where the information can originate from multiple sources encountered during training. Consequently, to maintain a consolidated knowledge state, LMs must serve as epistemic models, effectively modeling knowledge dependencies. As LMs continue to establish themselves as fundamental tools in machine learning research (Ahn et al., 2022; Huang et al., 2022; Hao et al., 2023), understanding their knowledge structure becomes imperative. The central question emerging from this exploration is whether the knowledge within these models exists as dispersed, standalone elements, or whether they possess the capacity to sustain an interconnected and consistent knowledge state.
Thus far, assessing parametric knowledge representations has garnered interest on two ends of a spectrum. On one side, the paradigm of LMs as knowledge bases hypothesizes that LMs store and retrieve knowledge when prompted, with improved efficiency possible by storing ever-increasing amounts of knowledge (Petroni et al., 2019; Wang et al., 2020; Heinzerling and Inui, 2020; Sung et al., 2021; Dhingra et al., 2022). Others (Gu et al., 2023; Sap et al., 2022; Ruis et al., 2022; Zhang et al., 2023; Moghaddam and Honey, 2023) evaluate theory-of-mind (Premack and Woodruff, 1978), the ability to impute mental states to oneself and others, in LMs and show they fall short of having a consistent world belief state. Although theory-of-mind abilities would enhance LMs' reasoning and applications, evaluating and equipping LMs with a first-order knowledge state is a necessary next step beyond LMs merely being knowledge bases.
To this end, we propose the novel Epistemic Knowledge Evaluation (EpiK-Eval) benchmark, to evaluate this ability to leverage such a consolidated knowledge state. EpiK-Eval trains LMs on stories segmented throughout the training corpus, analogous to news articles covering certain topics through time in large web corpora. These LMs are evaluated on their ability to consolidate the knowledge of the segmented narratives. Specifically, we test 7 different categories of reasoning involving complex yet explicit relations over the presented information. Although EpiK-Eval tasks require reasoning beyond explicit factual knowledge, they do not need modeling of other agents' belief states. As such, EpiK-Eval is positioned an order of complexity above vanilla knowledge extraction tasks and an order below complex theory-of-mind tasks. We assess where LMs lie on the spectrum between Type I and Type II systems, based on their inferred knowledge state evaluated through aggregate performance on EpiK-Eval. Type I systems maintain information independently across different observations, whereas Type II systems are characterized by their ability to consolidate information from across those observations (example in Figure 1). Overall, our findings indicate that LMs exhibit characteristics of Type I rather than Type II systems. Indeed, we observe a significant performance gap between LMs trained on these segmented narratives versus unsegmented ones (Figure 2). Specifically, these models struggle to recall and consolidate the proper information and hallucinate facts and events at a higher rate than those trained on unsegmented stories. This pronounced disparity highlights an intrinsic shortcoming in existing LMs. We posit that this can be attributed to their training objective, suggesting a need for the development of novel training objectives.

Figure 2 (partial caption): In comparison, they perform much better if the same information can be found within a single document (blue). [The figure contrasts a training story ("Bob arrived at the restaurant at 6:00 PM. 2 minutes after arriving, Bob ordered a drink. 10 minutes after ordering a drink, Bob ordered a hamburger. 5 minutes after ordering a hamburger, Bob asked for the bill.") with a model answer that recalls it and reasons "2 + 10 + 5 = 17. The answer is 6:17 PM."]

Epistemology & Language Models
Epistemic frameworks (Wang, 2015; Rendsvig and Symons, 2019) are formal systems used to represent knowledge, belief, and the uncertainty that entails what a reasoning system knows and/or believes. This is enabled through organizing the knowledge observed by the system. The rules of the abstract framework govern how new information is combined with the current set of information, when new information should be ignored, and how current beliefs are used to anticipate related events. While LMs behave as knowledge bases that store known relations, epistemic logic provides us with the inspiration to describe how these models organize and update their knowledge.
Consider the example from Figure 1, where we have the knowledge x_1: "Tom ate an apple.", x_2: "Tom ate a pear." and x_3: "Bob ate an orange.". Prompted with the question "What did Tom eat?", the model must recall knowledge from within its parameter space. It has to connect x_1 and x_2 while also ignoring x_3. To answer the query, a system is expected to consolidate the information and retain a knowledge state over the information it has seen until then. However, an inability to draw the connections would leave the facts disconnected. We describe a model that struggles to consolidate as Type I, and one that is better at it and can infer over a consolidated knowledge state as Type II.
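As a toy illustration (ours, not from the benchmark), the two behaviours on the Figure 1 example can be caricatured as follows: a Type I system can only spread probability mass over independently stored facts, while a Type II system consolidates the facts about the queried entity before answering.

```python
# Toy caricature (not from the paper) of Type I vs Type II behaviour
# over the three facts from Figure 1.

facts = ["Tom ate an apple.", "Tom ate a pear.", "Bob ate an orange."]

def type_i_answer(subject):
    # Type I: each fact is an independent candidate; the system can only
    # assign probability mass to individual facts, never combine them.
    candidates = [f for f in facts if f.startswith(subject)]
    return {f: 1.0 / len(candidates) for f in candidates}

def type_ii_answer(subject):
    # Type II: facts about the same subject are consolidated into a
    # single, unified knowledge state before answering.
    objects = [f.split(" ate ")[1].rstrip(".")
               for f in facts if f.startswith(subject)]
    return f"{subject} ate " + " and ".join(objects) + "."

print(type_i_answer("Tom"))   # probability split across separate facts
print(type_ii_answer("Tom"))  # a single consolidated answer
```

Here `type_ii_answer("Tom")` yields "Tom ate an apple and a pear.", the unified answer from Figure 1, whereas the Type I sketch can only return the disconnected facts with probabilities attached.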
With LMs being used in real-world scenarios where information is frequently presented as a periodic flow, it is necessary that they use such information appropriately during inference. While techniques like self-prompting and generation over self-retrieval are gaining popularity, their performance relies on the quality of the prompt, which adds to robustness concerns about the performance of LMs on varying reasoning tasks. Inspired by epistemology, we design EpiK-Eval to diagnose whether LMs maintain a first-order knowledge state, a consolidated summary of a sequence of observed facts that can be used during inference.

EpiK-Eval
The EpiK-Eval benchmark presents a suite of novel, narrative-based diagnostic tasks, meticulously designed to evaluate a LM's capacity to construct a comprehensive, unified knowledge state.
Dataset: Our benchmark comprises 18 tasks, which are questions about relations between facts and events in stories, e.g., "Does x happen before/after y?". Table 2 provides the full list of tasks. For each task, we generate 100 stories following a per-task template. Task 2, for instance, uses the following template:

[Task 2] {name}'s Vacation
{name} went {activity} on {day}. ...

where the first line is the story title, {name} is randomly sampled such that it is unique to each story, and the {activity} and {day} in a sentence are randomly sampled from the lists ["fishing", "hiking"] and ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"] respectively. A story can have a random number of sentences, with the range pre-determined for each task, e.g., Task 2 stories can have between 3 and 5 sentences. An example story for Task 2 is

[Task 2] Tom's Vacation
Tom went fishing on Monday.
Tom went hiking on Wednesday.
Tom went fishing on Saturday.
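As a rough sketch, instantiating the Task 2 template could look as follows. The lists follow the description above; the function name is ours, and details such as whether days may repeat within a story may differ from the released benchmark code.

```python
import random

# Lists follow the Task 2 description; names are ours, not from the
# released benchmark code.
ACTIVITIES = ["fishing", "hiking"]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
        "Saturday", "Sunday"]

def generate_task2_story(name: str, rng: random.Random,
                         min_len: int = 3, max_len: int = 5) -> str:
    """Generate one Task 2 story: a title line followed by 3-5
    template sentences with randomly sampled activities and days."""
    lines = [f"[Task 2] {name}'s Vacation"]
    for _ in range(rng.randint(min_len, max_len)):
        lines.append(f"{name} went {rng.choice(ACTIVITIES)} "
                     f"on {rng.choice(DAYS)}.")
    return "\n".join(lines)

story = generate_task2_story("Tom", random.Random(0))
print(story)
```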

Table 2 (Category / Description / Tasks):

Counting: Tests proficiency in quantifying occurrences and quantities.
• How many times does x happen?

Listing: Tests ability to identify and enumerate items within a given set or list.
• List the different x.
• Is x the y'th on the list?
• Among the list of x, is there y?

Ranking: Tests understanding of relative amounts, frequency, and ranking.
• Does x happen more/less often than y?
• Is x the same as y?

Temporal: Tests if the model has learned temporal dependencies in the data, such as what events follow each other.
• Does x happen before/after y?
• When x happens, does y happen?
• Between x and y, does z happen?
• How much time has passed between x and y?
• At what time does x happen based on y?
• After how many x does y happen?
• What is the state of x when y happens?

Causal: Tests understanding of cause-effect.
• If x had/hadn't happened, would y have happened?

Uniqueness: Tests understanding of exclusivity or uniqueness in the data.
• Is x the only time that y happens?
• The x'th time that y happens, what is a unique detail about y compared to the other x times?
• Among the list of x, is there only y?

Consistency: Tests ability to recognize consistency in patterns or states.
• Every time x happens, is y always the same?

Thus, with 100 stories generated for each of the 18 tasks, there is a total of 1800 stories, which we refer to as our dataset of unsegmented stories D_U = {x_1, x_2, ..., x_1800}. After generating these stories, we also generate a second dataset, consisting of the segmented versions of these stories.
For each given story, we segment it into individual sentences and add a part number to the title. For example, given the previous story about Tom, we would get the following three text sequences:

[Task 2] Tom's Vacation, Part 1/3
Tom went fishing on Monday.

[Task 2] Tom's Vacation, Part 2/3
Tom went hiking on Wednesday.

[Task 2] Tom's Vacation, Part 3/3
Tom went fishing on Saturday.

We do this for all 1800 stories and get 6800 story segments, which form our dataset of story segments D_S = {s_1, s_2, ..., s_6800}.
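The segmentation step above can be sketched as follows (the function name is ours; the released code may differ in details):

```python
def segment_story(story: str) -> list[str]:
    """Split a story into one text sequence per sentence, appending a
    'Part i/n' marker to the title, as described above."""
    title, *sentences = story.strip().split("\n")
    n = len(sentences)
    return [f"{title}, Part {i}/{n}\n{sentence}"
            for i, sentence in enumerate(sentences, start=1)]

story = ("[Task 2] Tom's Vacation\n"
         "Tom went fishing on Monday.\n"
         "Tom went hiking on Wednesday.\n"
         "Tom went fishing on Saturday.")
for seg in segment_story(story):
    print(seg)
    print()
```

Applied to the story about Tom, this produces the three segments shown above, e.g. "[Task 2] Tom's Vacation, Part 1/3" followed by "Tom went fishing on Monday."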
For each story, we also generate one question-answer pair. Questions are re-phrasings of the task. For example, for Task 2, "How many times does x happen?" becomes "How many times did {name} go fishing?". The question-answer pairs are also generated following a template. The template always consists of a question followed by the answer, which itself has three parts: a recall of the entire story, an optional reasoning part depending on the task, and the final answer. For example, question-answer pairs in Task 2 use the following template:

[Task 2] How many times did {name} go fishing? {story} The answer is {count}.

with an example of a generated question-answer pair being

[Task 2] How many times did Tom go fishing? Tom went fishing on Monday. Tom went hiking on Wednesday. Tom went fishing on Saturday. The answer is 2.
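Question-answer generation for Task 2 can be sketched from the template above (the helper name is ours; Task 2 has no separate reasoning part in this template):

```python
def make_task2_qa(name: str, story_sentences: list[str]) -> str:
    """Build a Task 2 question-answer pair: the question, a full
    recall of the story, then the final count."""
    count = sum(1 for s in story_sentences if "fishing" in s)
    question = f"[Task 2] How many times did {name} go fishing?"
    recall = " ".join(story_sentences)
    return f"{question} {recall} The answer is {count}."

sentences = ["Tom went fishing on Monday.",
             "Tom went hiking on Wednesday.",
             "Tom went fishing on Saturday."]
print(make_task2_qa("Tom", sentences))
```

For the story about Tom, this reproduces the example pair shown above, ending in "The answer is 2."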
A description of each task, its templates and examples are provided in Appendix B. A few examples are also provided in Table 1.
Having generated one question per story, we have a total of 1800 question-answer pairs, split randomly into two sets: the validation set and the test set. For the models to learn the answer format, we also add question-answer examples to the training set. We thus generate an additional 1800 stories and question-answer pairs, discard the stories, and add these 1800 question-answer pairs to the training set, such that there is no overlap between questions in the training, validation, and test sets.
Evaluation Process: To evaluate pre-trained LMs for their ability to consolidate knowledge, given a pre-trained language model we make two copies of it: M_U and M_S. We fine-tune M_U on the unsegmented stories and M_S on the segmented stories. The former setting ensures that all necessary information for answering a given question can be found in a single text sequence, without requiring the model to learn dependencies across multiple text sequences. The latter requires consolidating information from the narrative segments. Having both allows us to measure the effect of information being spread across separate text sequences, and the LMs' ability to consolidate this knowledge at inference, by measuring the gap in performance between the two models.
M_U and M_S are fine-tuned on their respective datasets, D_U and D_S, as well as the training set of question-answer examples. Thus, one epoch for M_U consists of 3600 samples (1800 stories + 1800 q/a examples) and one epoch for M_S of 8600 samples (6800 segments + 1800 q/a examples). Samples are shuffled such that a batch may contain a mix of stories and question-answer examples in the case of M_U, or story segments and question-answer examples in the case of M_S. Models are fine-tuned with their respective pre-training objective. Specifically, in the case of encoder-decoder style models, the story's title (the first line of the text sequence) is fed to the encoder and the decoder is expected to output the rest of the story in the case of M_U, or the story segment in the case of M_S. As for question-answer pairs, the question is fed to the encoder and the model is expected to output the answer. Causal language models are simply expected to predict the next token in the given sequence, as is standard procedure. Precisely, for M_U, a text sequence is either an entire story or a question concatenated with its answer, while for M_S, a text sequence is either a story segment or a question concatenated with its answer.
During fine-tuning, both models are also periodically evaluated on the validation set. Models are run in inference mode as described in the papers that introduced them. We prompt models with questions from the validation set and compare model answers to the target answers. For an answer to be deemed correct, it must match the target answer exactly. This is to capture potential recall and reasoning errors as well as verify the final answer. It is important for evaluating M_S's ability to consolidate the separate story segments, which is why we require the model to recall the entire story when answering a question. Here, M_U serves as an upper bound on performance, and any potential gap in performance between it and M_S showcases the added difficulty of consolidating knowledge from the story segments. The number of correct responses over the total number of questions is referred to as the accuracy.

We also measure an additional metric, which we refer to as the hallucination rate. Given an answer, consider only the recall part of the answer and disregard the reasoning part and the final answer. The hallucination rate is the number of recalled sentences that contain an error (i.e., do not match the actual sentence in the narrative) over the total number of recalled sentences. This provides a more fine-grained examination of the recall and knowledge consolidation capabilities of the model. We want to evaluate whether the model is more likely to hallucinate facts, events or segments when recalling these from multiple training sequences (segmented setup) versus a single training sequence (unsegmented setup).
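Our reading of the two metrics can be sketched as follows; the exact string-matching details may differ from the released evaluation code.

```python
def accuracy(predictions: list[str], targets: list[str]) -> float:
    """Exact-match accuracy: an answer counts as correct only if it
    matches the target answer in full (recall, reasoning, final answer)."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def hallucination_rate(recalled: list[str], narrative: list[str]) -> float:
    """Fraction of recalled sentences that contain an error, i.e. do not
    match any actual sentence of the narrative. Only the recall part of
    an answer is considered; reasoning and the final answer are ignored."""
    errors = sum(1 for s in recalled if s not in narrative)
    return errors / len(recalled)

narrative = ["Tom went fishing on Monday.",
             "Tom went hiking on Wednesday.",
             "Tom went fishing on Saturday."]
recall = ["Tom went fishing on Monday.",
          "Tom went hiking on Friday.",   # hallucinated day
          "Tom went fishing on Saturday."]
print(hallucination_rate(recall, narrative))  # 1 of 3 sentences is wrong
```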
Once both models have been fine-tuned, we take the best performing checkpoint of each model on the validation set and evaluate these on the test set.This is done in the same manner as the validation, except that the questions are from the test set.

Experiments
We experiment with three different LLMs: T5 (Raffel et al., 2020), its instruction-tuned variant Flan-T5 (Chung et al., 2022), and OPT (Zhang et al., 2022). For the T5 and Flan-T5 models, we benchmark sizes from Small to XL. For the OPT model, we benchmark sizes of 125M, 350M, 1.3B and 2.7B parameters. Unless otherwise stated, the reported performance is on the test set. Performance scores presented in this section are always averaged over the 18 tasks of our benchmark. Individual task performance can be found in Appendix B, and training details are provided in Appendix A.

Are LMs Type I or Type II Systems?
Answering this question relies on 1) the model performing well in the unsegmented setting and 2) the model achieving equal performance in the segmented setup.
Performance on our benchmark is shown in Figure 2. In the unsegmented setting, Flan-T5 surpasses T5. OPT, on the other hand, starts behind both but matches T5's performance at its largest variant. Interestingly, in the segmented scenario, all models exhibit comparable performance.
Performance generally improves as LMs are scaled, in both the segmented and unsegmented setups. The only exception is T5 when trained on unsegmented stories.

In-Depth Answer Analysis
In order to better understand the models' behaviour, we take a closer look at their answers. We break these down into three parts: the recall of the story, the reasoning, and the final answer.
Recall: We initially examine the models' recall capabilities. The left plot of Figure 3 presents the percentage of correct recalls. We observe:
• In a trend consistent with Figure 2, models trained on unsegmented stories greatly outperform those trained on segmented ones.
• Within the unsegmented setting, OPT lags slightly behind T5, while T5 and Flan-T5 show comparable recall capabilities. Scaling effects are more pronounced for OPT, while T5 and Flan-T5 show marginal improvements.
• Models trained on segmented stories all demonstrate similar performance, with notable improvements as they scale.

Analysis of model recall lengths compared to the target distribution revealed similar patterns, indicating that segmentation doesn't impact the recall span in terms of the number of sentences. See Appendix D.
Reasoning: Narrowing down to answers with correct recall, we analyze reasoning capabilities, as depicted in the center plot of Figure 3. Noteworthy observations include:
• Models trained on segmented stories perform slightly better than their unsegmented counterparts, although this may be due to variance from the much smaller subset size for segmented stories, rather than better reasoning capabilities.
• Among unsegmented models, T5 trails both Flan-T5 and OPT. While it's expected for Flan-T5 to outperform T5 due to its instruction tuning, OPT outperforming T5 is intriguing.
• For segmented models, performance is generally uniform across all models. However, at the largest variants, both T5 and Flan-T5 experience a significant drop in performance.
• In both the segmented and unsegmented settings, scaling doesn't enhance reasoning skills.
Final Answers: Focusing on answers with both correct recall and reasoning, or just correct recall for tasks without a reasoning component, we assess the correctness of the final answers (right plot of Figure 3). We observe that:
• Segmented models show superior performance, but the variance argument remains relevant.
• The performance of a given model seems to follow a similar trend in both settings.
• As models scale, Flan-T5 and OPT both show improved performance in each setting. However, T5's performance declines at its largest variant.

Given these results, the drop in performance of T5-XL in Figure 2 can be explained by its poor reasoning and final-answer performance rather than issues with recall.
Overall, this reveals that while unsegmented-trained models may falter in recall, reasoning, or providing the correct final answer, segmented-trained models predominantly grapple with recall errors. This shows that their main challenge is effectively consolidating knowledge in order to solve the problem.

Hallucinations
Our next analysis of model behaviour looks at the tendency of these models to "hallucinate". Examples of such hallucinations can be found in Appendix C. Figure 4 showcases the hallucination rate, defined as the percentage of sentences in the recall part of an answer that aren't present in the target. This rate is presented for both the training and test sets.
For the training set, the hallucination rate remains nearly 0% for both segmented and unsegmented stories, with the highest observed rate being 0.2%. However, a distinct difference emerges on the test set. Models trained on segmented stories display a significant gap in hallucination rates compared to those trained on unsegmented stories. This suggests that models recalling and consolidating information from multiple training documents are more susceptible to hallucinations, which highlights one potential reason why hallucinations happen in LLMs.
Upon examining the unsegmented-trained models, the hallucination rate of T5, Flan-T5 and OPT decreases as model size increases. Notably, these models exhibit a slightly higher hallucination rate on the test set than on the training set. This could be attributed to the change in context, where the model is prompted with a question instead of a story title. Interestingly, OPT models in the unsegmented setting hallucinate more than T5 and Flan-T5 models on the test set, but not on the training set. This behavior might stem from OPT models overfitting training samples with positional embeddings, affecting their performance when prompted with questions, which differ in length from titles.
Conversely, for the segmented-trained models, hallucination rates among different models are more similar and also decrease with scale. However, whether this decline continues as models increase in size is uncertain. To elucidate this, experiments with larger models are essential.

Effect of Scale
Both key metrics we use to study knowledge consolidation, recall performance and hallucination rate, seem to improve as model size increases. However, given the improvement in performance in both the unsegmented and segmented settings, this is not conclusive evidence of knowledge consolidation emerging with scale. To support the emergent-behavior hypothesis (Wei et al., 2022b), the improvement rate in the segmented setting would have to significantly outpace that in the unsegmented one. Additionally, it remains uncertain whether performance in the segmented scenario will eventually plateau, perhaps before reaching the performance levels of models trained on unsegmented stories. To truly gauge the impact of scale on knowledge consolidation, experiments with larger models are needed, but we unfortunately lack the compute to run them.

Related Work
Knowledge Representation: Results from probing neural language models have shown models not only encoding facts (Petroni et al., 2019) and linguistic knowledge (Tenney et al., 2019) in their parameters, but also using them in downstream tasks (Peters et al., 2018; Goldberg, 2019; Kazemnejad et al., 2023). The amount of knowledge a model retains in its parameters (Dhingra et al., 2022; AlKhamissi et al., 2022; Roberts et al., 2020) is perceived as a reflection of the model's success in downstream tasks (Moghaddam and Honey, 2023). However, relying on parameters for knowledge has shown that language models can hallucinate (Ji et al., 2023) and struggle to model less frequent data (Kandpal et al., 2022). Going further than existing work, with the proposed EpiK-Eval framework we attempt to understand LMs' behavior towards knowledge representation of segmented text chunks describing a set of relation categories.
Multi-Hop QA: In multi-hop question answering (QA) benchmarks (Welbl et al., 2017; Yang et al., 2018; Ho et al., 2020; Mavi et al., 2022), models are tasked with answering questions by navigating through multiple documents and extracting relevant information from them. This process is pure inference, with the model relying on external knowledge sourced directly from the documents.
Conversely, we focus on investigating how well these models can recall and consolidate the knowledge already embedded within their parameter space, i.e., knowledge acquired during training (referred to as "internal knowledge"). This contrasts with merely assessing the model's ability to conduct document-based searches.

Artifacts of Reasoning in LMs:
To utilize the stored knowledge, approaches such as prompting and in-context learning (Wei et al., 2022a,b,c; Liu et al., 2023) have gained popularity for tasks involving reasoning over a given context. While LMs have shown strong reasoning skills when information is fully available in the context (Han et al., 2022; Zelikman et al., 2022), inconsistent results appear when that is not the case (Gu et al., 2023). While Li et al. (2021) demonstrate that LMs maintain state information, the authors probe for factual information that does not require consolidation. Unlike existing works, using EpiK-Eval, we focus on studying the effect of information spread during a LM's training on the model's ability to recall and consolidate that knowledge at inference.

Discussion
Consolidating Knowledge in Language Models: Our study delineates the limitations of language models in consolidating knowledge across different text sequences, compared to a noticeably stronger performance when working within a single text sequence. We attribute this disparity primarily to the core objective of such models: to enhance word prediction within given sequences, while also using knowledge from previously processed text sequences, encoded in the model's parameters.
Current pre-training objectives such as masked and causal language modeling (Devlin et al., 2019; Brown et al., 2020) potentially prioritize learning dependencies within text sequences over those spanning multiple ones. For instance, a cause-and-effect relationship could exist between two sequences. However, if the content of the first does not explicitly help in predicting the second's content, the model might not learn this relation. Consequently, numerous inter-sequence dependencies in the training corpus, which may hold significant importance in downstream tasks, may be ignored owing to their perceived irrelevance to the next-word prediction task. In contrast, the model can readily establish correlational dependencies within individual sequences, which can even lead to the direct memorization of text, a frequent occurrence in LLMs (Carlini et al., 2020; McCoy et al., 2021; Tirumala et al., 2022; Carlini et al., 2023).
In light of these arguments and results, we assert the need to revisit the training objectives of language models. To utilize these models effectively, we should prioritize devising training methods that capture and consolidate the numerous information dependencies within the training corpus. A potential avenue to explore could be to guide these models in consolidating their knowledge via methods such as RLHF (Bai et al., 2022) or self-taught approaches (Zelikman et al., 2022).
Exploiting Longer Context vs Knowledge Consolidation: In response to the knowledge consolidation challenge faced by LMs, it could be argued that supplying a comprehensive context through prompts could be an effective alternative to having the LM remember the necessary context autonomously. This proposition is emboldened by recent successes in extending the context window size (Xiong et al., 2022; Ratner et al., 2023; Anthropic, 2023) as well as the sequence length (Dai et al., 2019; Gu et al., 2022; Poli et al., 2023; Bertsch et al., 2023). Such additional information can be supplied by either a user or an auxiliary system (Nakano et al., 2022; Schick et al., 2023; Patil et al., 2023; Paranjape et al., 2023).
Expecting humans to provide comprehensive context may, however, be impractical. Given the diverse range of specialist knowledge needed for various tasks, it's possible for a user to lack the necessary expertise. On the other hand, integrating auxiliary systems to provide these contexts presents a challenge analogous to that faced by LMs. To be useful, such an auxiliary system must understand and retain all relevant interdependencies within the training corpus related to problem-solving. Unfortunately, current auxiliary systems, such as search engines or retrieval tools (Karpukhin et al., 2020; Guu et al., 2020; Lewis et al., 2020), fall short of this holistic understanding and recall of context.
Another strategy leveraging longer context windows is to train LMs on concatenated text sequences with inherent relevance (Shi et al., 2023). This approach, however, presents its own complexities. The innumerable ways texts can interrelate complicates the process of determining and training on all possible combinations. Hence, current approaches do not comprehensively address this issue.

Knowledge Consolidation at Scale: Our study underscores a substantial discrepancy in performance between models trained on segmented stories and those trained on unsegmented stories. If we assume that the recall performance of models in the segmented setting continues to improve without plateauing prematurely, our estimates (Caballero et al., 2022) suggest that a model with 172B parameters, trained on our benchmark's segmented stories, would be required to match the performance of an 80M parameter model trained on the unsegmented stories.
Although consolidating knowledge from fragmented text sequences arguably poses a greater challenge than from a singular cohesive text, the margin for enhancement in this domain is possibly significant. As we venture into the realm of real-world applications (OpenAI, 2023; Anil et al., 2023; Touvron et al., 2023), there exists a wide array of settings that necessitate a LLM to recall and integrate data from multiple text sequences. Accordingly, enhancing this ability can potentially elevate the efficiency, robustness and performance of such models, thereby redefining the landscape of complex language tasks.
One challenge with studying this problem at scale is distinguishing whether LLMs demonstrate an improved ability to model dependencies within their training corpus (emergent behavior) or whether dataset diversity simply allows most dependencies of interest to be extracted from single text sequences in the corpus. To probe for knowledge consolidation at scale, we propose the use of self-contained narratives such as short stories or books. These documents can be segmented and dispersed within the training corpus of LLMs (Touvron et al., 2023; Computer, 2023), and evaluation can be performed in a similar fashion as EpiK-Eval, with questions assessing the understanding of the overall narrative and the various relations in the story. With complex enough narratives, this methodology should provide a robust framework for examining the knowledge consolidation capabilities of LLMs.
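As a concrete illustration of this setup, the sketch below splits a self-contained narrative into fixed-size segments, tags each segment with its source title so segments remain attributable, and shuffles the segments into a larger corpus. The segmentation granularity and the tagging scheme are illustrative assumptions, not the exact procedure used to build EpiK-Eval.

```python
import random

def segment_story(title, sentences, seg_len=2):
    """Split a story into consecutive segments of seg_len sentences,
    each carrying the story title so segments remain attributable."""
    return [
        (title, " ".join(sentences[i:i + seg_len]))
        for i in range(0, len(sentences), seg_len)
    ]

def build_corpus(stories, seed=0):
    """Segment every story and disperse the segments throughout the corpus."""
    segments = [
        seg for title, sents in stories for seg in segment_story(title, sents)
    ]
    rng = random.Random(seed)
    rng.shuffle(segments)
    return segments

stories = [
    ("Tom's Week", ["Tom worked from home on Monday.",
                    "Tom worked from home on Friday.",
                    "Tom went fishing on Saturday."]),
    ("Alice's Week", ["Alice was in Paris on Monday.",
                      "Alice was in Rome on Tuesday."]),
]
corpus = build_corpus(stories)
```

Answering a question about "Tom's Week" then requires the model to consolidate segments that never co-occur in a single training document, which is exactly the ability being probed.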

Conclusion
In this paper, we presented the EpiK-Eval benchmark, a tool designed specifically to evaluate the proficiency of LMs in consolidating their knowledge for problem-solving tasks. Our findings underscore the limitations of current LMs, which appear to mostly maintain a disjoint knowledge state of training observations. Further, we note a significant performance gap and an increased rate of hallucinations for models trained on segmented narratives compared to those trained on unsegmented ones. We attribute these discrepancies to the training objectives of the models, which underscores the need to more effectively model the dependencies within the training corpus. By highlighting current limitations and opportunities for improving LMs, these results delineate paths for future research, hopefully enabling the growth of language models beyond simple knowledge bases.

Limitations
Ensuring that EpiK-Eval's data doesn't leak into the pre-training set of LLMs is a challenge. This inclusion could skew the benchmark's results. One straightforward solution is to check if the data exists within the pre-training set, though this method is computationally intensive. Another practical approach is to generate and release a new version of the dataset periodically, for instance, annually. To further safeguard against potential leaks, we've encrypted the data in the public release of the benchmark. Users are required to decrypt it locally before use.
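One cheaper variant of the leakage check mentioned above is an n-gram hash lookup: hash every n-gram of the benchmark once, then scan the pre-training corpus for collisions. The sketch below is a minimal, hypothetical version using exact 8-gram matches on whitespace tokens; real contamination checks typically add normalization (casing, punctuation) and stream the corpus rather than holding it in memory.

```python
import hashlib

def ngrams(text, n=8):
    """All n-grams of whitespace-delimited, lowercased tokens."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def hash_set(grams):
    return {hashlib.sha256(g.encode()).hexdigest() for g in grams}

def contamination_rate(benchmark_doc, corpus_docs, n=8):
    """Fraction of the benchmark document's n-grams that appear verbatim
    (by hash) anywhere in the pre-training corpus."""
    bench = hash_set(ngrams(benchmark_doc, n))
    if not bench:
        return 0.0
    corpus = set()
    for doc in corpus_docs:
        corpus |= hash_set(ngrams(doc, n))
    return len(bench & corpus) / len(bench)
```

A nonzero rate would flag documents to inspect before trusting benchmark numbers for a given model.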

Ethics Statement
This study employs machine learning algorithms, specifically large language models, which are trained on vast amounts of text data. While these models have shown remarkable predictive capabilities, it is important to underscore the ethical concern that arises from their training process. These models often learn from data that is intrinsically embedded with human biases, which can subsequently be reflected in their outputs. Therefore, it is paramount to approach any output produced by these models with critical consideration of this potential for embedded bias.

A Training Details
T5 & Flan-T5: All models are fine-tuned for 360,000 steps with a batch size of 50. We use the Adam optimizer, setting a base learning rate of 1 × 10⁻⁴. The learning rate undergoes a linear warmup for the initial 1% of training steps, after which it remains constant. No weight decay or gradient clipping is applied.
OPT: Except for the learning rate, we use the same hyperparameters as with T5 and Flan-T5. The base learning rates for the different OPT model sizes are:

B Per Task Description & Results
We provide a detailed description of each task in Tables 3-20, along with the per task results in Figures 5-22.

C Hallucination Examples
In Table 21 and Table 22, we present examples of hallucinations observed in models trained on segmented stories. Our analysis revealed no significant differences in the patterns of hallucinations across the various models. It is also worth noting that models trained on unsegmented stories exhibited similar hallucination patterns, albeit at a reduced frequency (as shown in Figure 4).
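A simple way to surface hallucinations of the kind shown in these tables is to compare each sentence of the model's recall against the sentences of the target story. The sketch below flags any recalled sentence with no exact match in the target; this is an illustrative heuristic, not the paper's evaluation procedure, and exact matching works here only because stories are template-generated.

```python
def split_sentences(text):
    """Naive sentence split on periods; adequate for templated stories."""
    return [s.strip() for s in text.split(".") if s.strip()]

def hallucinated_sentences(model_recall, target_story):
    """Return recalled sentences that do not appear verbatim in the target."""
    target = set(split_sentences(target_story))
    return [s for s in split_sentences(model_recall) if s not in target]

target = "Tom ate an apple. Tom ate a pear. Tom ate an orange."
recall = "Tom ate an apple. Tom ate a banana. Tom ate an orange."
print(hallucinated_sentences(recall, target))  # flags the invented sentence
```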

D Recall Length Distribution
We analyzed the length of story recalls in relation to the target distribution to determine the impact of training on segmented versus unsegmented stories. Figure 23 displays the distribution of recall length, measured in number of sentences, for both the model and the target. For brevity, we present results only for the largest variant of each model, noting that similar patterns were observed across all model sizes. Our analysis revealed no significant differences between these distributions, leading us to conclude that training on segmented stories does not influence the recall length of the models' outputs.
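This comparison reduces to counting sentences per recall and comparing the resulting frequency distributions. The sketch below uses total variation distance as the distribution distance; that choice is an assumption for illustration, since the paper's analysis compares the distributions visually.

```python
from collections import Counter

def length_distribution(recalls):
    """Map each recall to its sentence count, normalized to a distribution."""
    counts = Counter(
        len([s for s in r.split(".") if s.strip()]) for r in recalls
    )
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

model_recalls = ["A. B. C.", "A. B.", "A. B. C."]
target_recalls = ["A. B. C.", "A. B. C.", "A. B."]
dist = total_variation(
    length_distribution(model_recalls), length_distribution(target_recalls)
)
print(dist)  # 0.0: identical length distributions
```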
Task 1: "List the different x." Category: Listing. Description: The objective of this task is to identify and list the days on which a person worked from home. Tom worked from home on Friday.

Template

The answer is Monday and Friday.
• The story can span one to three sentences, excluding the title. Sentences are ordered chronologically based on {day}.

Table 3: Templates for generating Task 1 stories, questions, and answers, with an example provided.

Task 2: "How many times does x happen?" Category: Counting. Description: The task aims to count the number of times the fishing activity occurred within the story.
• {day}: Randomly sampled without replacement from the seven days of the week.
• {answer}: A numeric value representing the count.
• The story comprises 3 to 5 sentences, excluding the title. Sentences are ordered chronologically by {day}.
1:00 PM - Tom has a meeting with co-worker A. 3:00 PM - Tom has a meeting with co-worker B. 4:00 PM - Tom fills up some forms. 5:00 PM - Tom has a meeting with co-worker A.
Question:
Template: [Task 3] Does {name} have more meetings with co-worker A or B?
Example: [Task 3] Does Tom have more meetings with co-worker A or B?
Answer:
Template: {story} The answer is {answer}.
Example: 1:00 PM - Tom has a meeting with co-worker A. 3:00 PM - Tom has a meeting with co-worker B. 4:00 PM - Tom fills up some forms. 5:00 PM - Tom has a meeting with co-worker A. The answer is A.
• The story consists of 3 to 5 sentences, excluding the title. Sentences are chronologically ordered by {time}.
Table 5: Templates for generating Task 3 stories, questions, and answers, with an example provided.

Task 4: "Does x happen before/after y?" Category: Temporal. Description: The task is designed to ascertain whether a specific event happened before or after another event. Additionally, a reasoning based on the order of months is provided to justify the answer.
Tom goes on a vacation in June. The answer is {answer}.
Tom gets married in October.
March is not after October.
The answer is no.
• The story consists of 2 to 3 sentences, excluding the title. Sentences are chronologically ordered by {month}.
Table 6: Templates for generating Task 4 stories, questions, and answers, with an example provided.

Tom was in Paris on Monday....
Tom was in New York on Tuesday. {name_a} was in {location} on {day}.
Alice was in Los Angeles on Monday. {name_b} was in {location} on {day}.
[Task 5] When Tom is in Paris, is Alice in Rome?

Answer:
Answer: {story} Tom was in Paris on Monday. Those are {reasoning} days.
Tom was in New York on Tuesday. The answer is {answer}.
Alice was in Los Angeles on Monday.
Alice was in Rome on Tuesday.
Those are different days.
The answer is no.
• {location_a} and {location_b} are randomly drawn from the sampled {location} for person A and B respectively.
• {reasoning}: Specifies whether the days of the events in question are "the same" or "different".
• {answer}: A simple "yes" or "no".
• Each person's events are ordered by {day}; person A's events are listed first, followed by person B's events. There can be between 2 and 3 sentences per person.
Table 7: Templates for generating Task 5 stories, questions, and answers, with an example provided.

The answer is yes.
• The story contains 4 to 5 sentences, excluding the title. Sentences are ordered by {day}.
• {activity_c} is sampled from the list of activities but cannot be the same as {activity_a} or {activity_b}.
• The story contains 3 to 4 sentences, excluding the title, with sentences ordered chronologically by {time}.
Table 9: Templates for generating Task 7 stories, questions, and answers, with an example provided.
• {event} is selected without replacement from ["wrote a letter", "sent an email", "made a phone call", "started a video chat"].
• {event_a} and {event_b} are randomly chosen among the sampled {event}, with {event_b} always occurring after {event_a}.
• {reasoning} describes the subtraction of the time corresponding to {event_a} from that of {event_b}, representing the duration in hours.
• {answer} indicates the number of hours.
• The story contains 3 to 4 sentences, excluding the title, arranged chronologically by {time}.

Details
• {day} can be any day of the week.
• {activity} in each statement can be either "canoeing" or "hiking". However, "hunting" must be picked at least twice but no more than three times.
• {friend} is randomly sampled from a list of names.
• {q_activity} can be either "canoeing" or "hiking".
• {x} is a number between 1 and the number of times {q_activity} occurs.
• {answer} is the name of the person who was with {name} during the {x}'th occurrence of {q_activity}.
• The story comprises 4 or 5 sentences, not including the title. Sentences are arranged by {day}.
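The sampling rules above can be combined into a small generator. The sketch below is a hypothetical rendering of a Task 10-style story under those constraints; the name pool and sentence wording are placeholders drawn from the examples, not the benchmark's exact templates.

```python
import random

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
NAMES = ["Emelita", "Taifa", "Maibelle", "Ebere", "Amyty"]  # placeholder pool

def generate_story(name, rng):
    """Sample a Task 10-style story: 4-5 dated activity sentences with a
    friend, with 'hunting' occurring two or three times as per the rules."""
    n = rng.randint(4, 5)
    hunts = rng.randint(2, 3)
    activities = ["hunting"] * hunts + [
        rng.choice(["canoeing", "hiking"]) for _ in range(n - hunts)
    ]
    rng.shuffle(activities)
    days = sorted(rng.sample(range(7), n))  # chronological order by day
    return [
        f"{DAYS[d]}, {name} went {act} with {rng.choice(NAMES)}."
        for d, act in zip(days, activities)
    ]

story = generate_story("Demontre", random.Random(0))
```

The question and answer for a generated story then follow directly from which friend accompanied the {x}'th occurrence of {q_activity}.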

Figure 2 :
Figure 2: Performance on EpiK-Eval, measuring accuracy as the percentage of correct answers. Models struggle to answer questions that require consolidating knowledge from multiple training documents (orange). In comparison, they perform much better if the same information can be found within a single document (blue).

Figure 3 :
Figure 3: Breakdown of model answers into three parts: story recall, reasoning, and final answer. (Left) percentage of correct recalls. (Center) percentage of correct reasonings when recall is correct. (Right) percentage of correct final answers when recall and reasoning are correct, or when recall is correct and the task has no reasoning part. Recall performance is worse when models need to recollect information from multiple training documents (orange) versus from single documents (blue), but reasoning and final answer capabilities seem unaffected.
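The three conditional percentages in this breakdown can be computed as follows. The sketch assumes each model answer has already been parsed into boolean correctness flags for recall, reasoning (None when the task has no reasoning part), and final answer; that pre-parsed representation is an assumption for illustration.

```python
def breakdown(samples):
    """samples: list of (recall_ok, reasoning_ok_or_None, answer_ok) tuples.
    Returns (recall %, reasoning % given correct recall,
    final answer % given correct recall and, where present, reasoning)."""
    def pct(num, den):
        return 100.0 * num / den if den else 0.0

    recall_ok = [s for s in samples if s[0]]
    reasoned = [s for s in recall_ok if s[1] is not None]
    # Final answer is conditioned on a correct recall, plus a correct
    # reasoning step when the task has one.
    eligible = [s for s in recall_ok if s[1] is None or s[1]]
    return (
        pct(len(recall_ok), len(samples)),
        pct(sum(1 for s in reasoned if s[1]), len(reasoned)),
        pct(sum(1 for s in eligible if s[2]), len(eligible)),
    )
```

Conditioning each stage on the previous ones is what isolates recall as the failing component in the segmented setting.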

Figure 4 :
Figure 4: Model hallucination rate on the training set (left) and the test set (right). Models which need to recall information from multiple documents seen during training (orange) are more prone to hallucinations during testing than models which only need to recall information from a single training document (blue).

Figure 5 :
Figure 5: Task 1 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left) and final answers (right).

Figure 6 :
Figure 6: Task 2 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left) and final answers (right).

Task 3 :
"Does x happen more/less often than y?" Category: Ranking. Description: The objective of this task is to determine whether the person has more meetings with Person A or Person B.

Figure 7 :
Figure 7: Task 3 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left) and final answers (right).

Figure 8 :
Figure 8: Task 4 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left), reasoning (center), and final answers (right).

Task 5 :
"When x happens, does y happen?" Category: Temporal. Description: This task aims to determine whether, on days when person A is in one specific location, person B is in another specific location.
{name_a} and {name_b}'s Travel Log
[Task 5] Tom and Alice's Travel Log
{name_a} was in {location} on {day}.

Figure 9 :
Figure 9: Task 5 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left), reasoning (center), and final answers (right).

Figure 10 :
Figure 10: Task 6 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left) and final answers (right).

Figure 11 :
Figure 11: Task 7 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left) and final answers (right).

Figure 12 :
Figure 12: Task 8 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left), reasoning (center), and final answers (right).

Figure 14 :
Figure 14: Task 10 results. Top left: percentage of correct answers. Top right: hallucination rate for both train and test sets. Bottom: percentage of correct recalls (left) and final answers (right).

Figure 23 :
Figure 23: Comparison of the number of sentences in the recall part of answers from three models: T5-XL (left), Flan-T5-XL (center), and OPT-2.7B (right). This compares the target distribution with models trained on unsegmented and segmented stories. Similar patterns were observed for other model sizes. There is no significant difference between these distributions, suggesting that training on segmented stories does not affect recall length.
Alice goes for a walk. Noon, Alice makes a phone call. Afternoon, Alice makes tea. Evening, Alice reads a book. The answer is no.

Table 1 :
Sample stories, questions and answers from our dataset. Additional examples can be found in Appendix B.

methods aimed towards improvements in knowledge consolidation. By investigating how LMs consolidate and reason from segmented knowledge, we aim to catalyze further research in the pursuit of more sophisticated, reliable, and knowledge-consistent machine learning systems.

Table 2 :
The 18 tasks in the EpiK-Eval benchmark, categorized by type. Tasks aim to encompass a wide range of fact and event relationships.
Task 6: "Is x the only time that y happens?" Category: Uniqueness. Description: Determine whether a person engaged in a specific activity only once during the week.

Table 8 :
Templates for generating Task 6 stories, questions, and answers, with an example provided.
Task 7: "Between x and y, does z happen?" Category: Temporal. Description: Determine if a person performs a specific activity between two other distinct activities during the day.
Task 8: "How much time has passed between x and y?" Determine the duration in hours between two activities a person engaged in.

Table 10 :
Templates for generating Task 8 stories, questions, and answers, with an example provided.
Task 10: "The x'th time that y happens, what is a unique detail about y compared to the other x times?" Category: Uniqueness. Description: Determine who accompanied the person the x'th time they engaged in a specific activity.

Table 21 :
Hallucination examples from models trained on segmented stories for Tasks 1 to 9. The Target Answer is provided for comparison, with hallucinations highlighted in bold. For hallucination examples corresponding to Tasks 10 to 18, see Table 22.

Demontre went canoeing with Emelita. Tuesday, Demontre went hunting with Taifa. Thursday, Demontre went canoeing with Maibelle. Friday, Demontre went hunting with Ebere. Sunday, Demontre went hunting with Amyty. The answer is Maibelle.
Tuesday, Demontre went canoeing with Maibelle. Thursday, Demontre went canoeing with Xian. Friday, Demontre went hunting with Neria. Sunday, Demontre went hunting with Sidoney.
Monday, Lamya drives to pharmacy in a SUV. Tuesday, Lamya drives to pharmacy in a minivan. Wednesday, Lamya drives to grocery store in a SUV. Thursday, Lamya drives to pharmacy in a SUV. The answer is no.
Marayna meets Person B in the morning. Marayna meets Person C at noon. Marayna meets Person D in the afternoon. Person D is the third. The answer is yes.
Marayna meets Person B in the morning. Marayna meets Person A at noon. Marayna meets Person C in the afternoon. Person D is the first. The answer is no.
[Task 14] Among the snacks that Kornelis ate, is there an orange? Kornelis ate a banana at 8am. Kornelis ate an apple at 2pm. The answer is no.
Kornelis ate a banana at 8am. Kornelis ate an apple at 10am. Kornelis ate a cherry at 12pm. The answer is no.
Corrine got an A in English. Corrine got a B in Spanish. Corrine got an A in French. Corrine got an A in Biology. Corrine got an A in Physics. Corrine got an A in Chemistry. The answer is no.
Corrine got an A in English. Corrine got a B in Spanish. Corrine got an A in French. Corrine got an A in Biology. Corrine got a B in Physics. The answer is no.
[Task 16] Did Trella go to the beach as many days as to the cinema? Monday, Trella went to the beach. Tuesday, Trella went to the cinema. Wednesday, Trella went to the beach. Thursday, Trella went to the park. Friday, Trella went to the cinema. The answer is yes.
Monday, Trella went to the beach. Tuesday, Trella went to the cinema. Wednesday, Trella went to the park. Thursday, Trella went to the park. Friday, Trella went to the park. The answer is no.

Table 22 :
Hallucination examples from models trained on segmented stories for Tasks 10 to 18. The Target Answer is provided for comparison, with hallucinations highlighted in bold. For hallucination examples corresponding to Tasks 1 to 9, see Table 21.