Memory Injections: Correcting Multi-Hop Reasoning Failures During Inference in Transformer-Based Language Models

Answering multi-hop reasoning questions requires retrieving and synthesizing information from diverse sources. Large Language Models (LLMs) struggle to perform such reasoning consistently. Here we propose an approach to pinpoint and rectify multi-hop reasoning failures through targeted memory injections on LLM attention heads. First, we analyze the per-layer activations of GPT-2 models in response to single- and multi-hop prompts. We then propose a mechanism that allows users to inject pertinent prompt-specific information, which we refer to as "memories," at critical LLM locations during inference. By thus enabling the LLM to incorporate additional relevant information during inference, we enhance the quality of multi-hop prompt completions. We show empirically that a simple, efficient, and targeted memory injection into a key attention layer can often increase the probability of the desired next token in multi-hop tasks by up to 424%.


Introduction
Transformer-based Large Language Models (LLMs) (Vaswani et al., 2017; Brown et al., 2020) have shown exceptional promise for basic knowledge retrieval and language generation; however, they often lack the ability to perform basic reasoning tasks (Arkoudas, 2023; Guo et al., 2023; Blair-Stanek et al., 2023). In this work, we focus on the simple task of answering multi-hop prompts (i.e., prompts in which the subject is not stated explicitly), which humans handle easily but with which LLMs often struggle (see Fig. 1).
Researchers have attempted to rectify multi-hop reasoning failures by using various prompting methods such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Graph-of-Thought (GoT) reasoning (Wei et al., 2022; Wang et al., 2023; Long, 2023; Xie et al., 2023; Yao et al., 2023; Besta et al., 2023). However, these approaches often put the burden on users to know how to elicit desired responses and, in the hands of non-expert users, can lead to unreliable prompt completions. Researchers have also proposed model editing (Meng et al., 2022a,b; Zhong et al., 2023; Li et al., 2023) approaches that may hard-code distant relationships directly into model weights, rather than enhancing the model's ability to recall and then link simpler relationships. These approaches can be computationally expensive and have unintended effects on other knowledge originally embedded in the model's weights (Cohen et al., 2023).

* Correspondence to sakarvadia@uchicago.edu

Figure 1: A multi-hop prompt vs. two analogous single-hop prompts. The outputs are from GPT2-Small. (a) Multi-hop prompt: "The largest coral reef in the world is located off the coast of" → "the Philippines." (b) The multi-hop prompt broken into 2 single-hop prompts: "The name of the largest coral reef is" → "the Great Barrier Reef"; "The Great Barrier Reef is located off the coast of" → "Australia."
Our approach to this problem is based on the hypothesis that LLMs often fail to recall relevant memories when attempting to answer a prompt that requires multiple "hops" of reasoning, rather than lacking knowledge of the memories altogether. For example, when attempting to complete the multi-hop prompt "The largest coral reef system in the world is located off the coast of. . .," we hypothesize that the model does not correctly recall that "the largest coral reef system in the world" is "the Great Barrier Reef" before predicting the next token in the sequence. Yet the model can accurately complete the corresponding single-hop prompt "The Great Barrier Reef is located off the coast of. . ." and, when prompted, correctly identify "the largest coral reef" as "the Great Barrier Reef." Clearly, this information was encoded in the model during training but is not incorporated when answering questions that reference the prompt's subject indirectly. In this case, therefore, we define the missing memory to be "the Great Barrier Reef." To study our hypothesis, we first attempt to reverse engineer a key mechanism by which transformer-based LLMs conduct reasoning. Specifically, we find that in transformer-based models it is attention heads, rather than multi-layer perceptrons, that are responsible for retrieving memories critical to successful model predictions; our finding is further substantiated by similar findings by Li et al. (2023); Geva et al. (2023); Dar et al. (2022). We then study instances in which this mechanism fails in multi-hop reasoning tasks and find that this mechanism is likely the source of incorrect, insufficient, or irrelevant memory retrievals (Contribution 1); for an example, see Fig. 2.
We then propose a lightweight memory injection method that can be employed to correct a multi-hop reasoning failure during inference (Contribution 2). As an example: by employing our method to inject the memory of "The Great Barrier Reef" into the multi-hop prompt "The largest coral reef system in the world is located off the coast of. . ." during inference, we increase the probability of the next token "Australia" by 189%; refer to Fig. 3 for details.
For our analyses, we hand-crafted a dataset for interpretability purposes (Contribution 3) and make use of a larger programmatically generated dataset; see Table 1 for more information.
Finally, we conduct additional experiments (Contribution 4) to:

1. Identify the ideal layer and magnitude for the memory injection.
2. Demonstrate the significance of curating prompt-specific memories for injection.
3. Analyze whether memories drawn from different parts of speech (namely nouns, adjectives, adverbs, conjunctions, and verbs) behave differently during memory injection.

Background & Notation
We define single- vs. multi-hop prompts and provide a formal definition of the transformer model.

Multi-hop vs. single-hop prompts
We refer to a prompt as single-hop if the subject of the relation is stated explicitly in the prompt, and multi-hop otherwise. Multi-hop prompts refer to their subject in a way that requires an additional "hop," or inference step. For example, consider the single-hop prompt "George Washington fought in the. . ." with a correct answer being "Revolutionary War." In the analogous multi-hop prompt, "The first president of the United States fought in the. . .," a preliminary inference step is needed to identify the first US president before predicting the next token.
For additional examples of single- and multi-hop prompts, see Table 3.

Transformer Architecture
We introduce a common notation for the components of the transformer-based language model architectures that are the focus of our analyses. Specifically, we focus on auto-regressive, decoder-only models. We adopt much of our notation from Elhage et al. (2021) and Geva et al. (2023).

Embedding Inputs
An input text is parsed into N distinct tokens t_0, · · · , t_N. Each token t_i is then embedded as x_i^0 = t_i W_E ∈ R^d via an embedding matrix W_E ∈ R^{|V|×d}, where V is the vocabulary and d is the hidden dimension.

Residual Stream
Following the embedding layer, all tokenized embeddings x_i^0 are passed through a series of residual blocks. The outputs of each residual block are added back into the model's residual stream, denoted R^ℓ (∀ℓ ∈ {1, · · · , L}), where L is the number of layers in the LLM.

We define the residual stream at layer ℓ as R^ℓ = [x_0^ℓ, · · · , x_N^ℓ], where x_i^ℓ is the representation of token i at layer ℓ. The residual stream is updated by its respective residual block r^ℓ:

R^{ℓ+1} = R^ℓ + r^ℓ(R^ℓ),

and the output of a residual block r^ℓ is:

r^ℓ = a^ℓ + m^ℓ,

where a^ℓ is the output of the Multi-Headed Self-Attention (MHSA) layer and m^ℓ is the output of the Multi-Layer Perceptron (MLP). We define MHSA and MLP in the following sections.
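The residual-stream bookkeeping above can be sketched in a few lines of plain Python. This is a toy illustration only: `attn` and `mlp` are stand-in functions, not the real sublayers, and vectors are plain lists.

```python
# Toy sketch of the residual stream: each layer's attention and MLP
# outputs are *added* back into the running token representation.
# `attn` and `mlp` are hypothetical stand-ins for the real sublayers.

def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def residual_block(x, attn, mlp):
    """One residual block: the stream is updated with r^l = a^l + m^l."""
    a = attn(x)          # MHSA output a^l
    m = mlp(x)           # MLP output m^l
    return add(x, add(a, m))

# Demo with trivial stand-in sublayers:
attn = lambda x: [0.1 for _ in x]   # pretend attention output
mlp  = lambda x: [0.2 for _ in x]   # pretend MLP output

x0 = [1.0, 2.0, 3.0]                # embedded token x_i^0
x1 = residual_block(x0, attn, mlp)  # x_i^1 = x_i^0 + a^1 + m^1
```

Note that in GPT-2 itself the MLP reads the stream after the attention output has already been added; the sketch follows the simplified r^ℓ = a^ℓ + m^ℓ notation used here.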

Multi-Headed Self Attention (MHSA)
Each MHSA layer ℓ is defined via four parameter matrices W_Q^ℓ, W_K^ℓ, W_V^ℓ, W_O^ℓ ∈ R^{d×d} and the hyperparameter H denoting the number of attention heads. Following Elhage et al. (2021) and Geva et al. (2023), we can further dissect these parameter matrices to better observe the relationship between unique sets of parameters and individual attention heads: W_Q^{ℓ,j}, W_K^{ℓ,j}, W_V^{ℓ,j} ∈ R^{d×d/H} and W_O^{ℓ,j} ∈ R^{d/H×d} for j ∈ {1, · · · , H}. Now we can define the output of each MHSA, a^ℓ, as the sum of all attention head outputs, a^ℓ = Σ_{j=1}^{H} h^{ℓ,j}, where h^{ℓ,j} is the output of the j-th head in layer ℓ:

h^{ℓ,j} = (softmax((R^ℓ W_Q^{ℓ,j})(R^ℓ W_K^{ℓ,j})^T / √(d/H)) ⊙ M) (R^ℓ W_V^{ℓ,j}) W_O^{ℓ,j},   (5)

where softmax(·) is performed as a row-wise operation, ⊙ is the Hadamard product, and M ∈ {0, 1}^{N×N} is an auto-regressive attention mask where masked token positions are set to 0.
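To make the per-head computation in Eq. (5) concrete, here is a minimal pure-Python sketch of one causally masked attention head. For simplicity it assumes the per-head projections (W_Q^{ℓ,j}, W_K^{ℓ,j}, W_V^{ℓ,j}) have already been applied to produce q, k, v, and it omits the output projection W_O^{ℓ,j}.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def causal_head(q, k, v):
    """One attention head with an auto-regressive mask.

    q, k, v: N x d_head matrices (lists of lists). Position i may only
    attend to positions 0..i, mirroring the binary mask M in Eq. (5).
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Scaled dot-product scores against keys 0..i; the mask simply
        # excludes later positions from the softmax.
        scores = [sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]
        w = softmax(scores)
        out.append([sum(w[j] * v[j][t] for j in range(i + 1))
                    for t in range(d)])
    return out

# Tiny demo: N = 2 positions, d_head = 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[5.0, 0.0], [0.0, 5.0]]
out = causal_head(q, k, v)
# out[0] equals v[0]: the first position can only attend to itself.
```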

Multi-Layer Perceptron (MLP)
Each MLP is defined via two parameter matrices W_F^ℓ, W_I^ℓ ∈ R^{d×d_p} with inner dimension d_p and a nonlinear activation function σ:

m^ℓ = σ(R^ℓ W_I^ℓ) (W_F^ℓ)^T.

Unembedding Predictions into Logits
After the final residual block, all token representations x_i^L are projected back into the vocabulary domain via the unembedding matrix W_U ∈ R^{d×|V|}. The output at the last token position is the next-token prediction of the model.

Experimental Overview
Our central aim is to better understand how the outputs of the attention heads affect model performance with respect to predicting the correct next token in prompts requiring single-hop reasoning versus in prompts requiring multi-hop reasoning.

Dataset Descriptions
We employ three datasets in this work. Two, used to assess model prompt completion accuracy, are our own high-quality manually curated dataset of single- and multi-hop pairs and a programmatically generated dataset of prompt pairs. The third comprises lists of words from common parts of speech, which we use to study how the effectiveness of our intervention varies with the part of speech of injected tokens.

Programmatically Generated Dataset
The 2WikiMultiHop dataset (Ho et al., 2020) contains pairs of knowledge triples {(s_1, r_1, s_2), (s_2, r_2, s_3)}, each with two subjects s and a relationship r. We used these knowledge triples, plus a set of predefined templates, to generate a set of pairs of single- and multiple-hop questions, 2WMH; see Tables 1 and 3.
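The construction can be sketched as follows. The relation templates and triples below are invented for illustration; the actual 2WMH templates differ.

```python
# Illustrative sketch of turning a pair of knowledge triples
# {(s1, r1, s2), (s2, r2, s3)} into a (single-hop, multi-hop) prompt
# pair. The templates and triples here are hypothetical.

RELATION_TEMPLATES = {
    "director_of": "the director of {}",
    "country_of_citizenship": "{} is a citizen of",
}

def make_prompt_pair(triple_1, triple_2):
    """triple_1 = (s1, r1, s2), triple_2 = (s2, r2, s3); answer is s3."""
    s1, r1, s2 = triple_1
    s2_check, r2, s3 = triple_2
    assert s2 == s2_check, "triples must share the bridging entity s2"
    # Single-hop: state the explicit subject s2 directly.
    single_hop = RELATION_TEMPLATES[r2].format(s2)
    # Multi-hop: replace s2 with its one-hop description built from s1.
    multi_hop = RELATION_TEMPLATES[r2].format(
        RELATION_TEMPLATES[r1].format(s1))
    return single_hop, multi_hop, s3

single, multi, answer = make_prompt_pair(
    ("Titanic", "director_of", "James Cameron"),
    ("James Cameron", "country_of_citizenship", "Canada"),
)
# single -> "James Cameron is a citizen of"
# multi  -> "the director of Titanic is a citizen of"
# answer -> "Canada"
```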

Human-Generated Dataset
As evidenced by the example presented above, the 2WMH dataset, while scalable, contains many grammatical flaws. Therefore, we construct an additional dataset for multi-hop reasoning with a focus on grammatical and factual correctness. We hand-crafted 106 (single-hop, multiple-hop) prompt pairs, each in the same form as those in 2WMH: e.g., single-hop: "St. Peter's Basilica is in the city of. . . [Rome]" and multi-hop: "The biggest church in the world is in the city of. . . [Rome]". Each prompt pair was also evaluated by two external reviewers for factual and grammatical accuracy. We hereafter refer to this dataset as Hand; see Tables 1 and 3.

Part of Speech Dataset
We used a subset of the Corpus of Contemporary American English (Davies, 2011) word frequencies (Davies, 2010) to generate lists of (i) the most common words from various parts of speech: 824 adjectives, 331 adverbs, 40 conjunctions, 2635 nouns, and 969 verbs; and (ii) the 5050 most common words overall ("top 5050").
We study two models, GPT2-Small and GPT2-Large; both have a vocabulary of ∼50K tokens.

Tools & System Setup
We use the Transformer Lens Python package (Nanda and Bloom, 2022) to cache, inspect, and construct interventions on model inference passes.
We ran experiments on a single A100 GPU with 40 GB RAM. Experimental code, dependency information, and datasets are available on GitHub.

Proposed Methods
Recent work suggests that attention heads are knowledge retrievers during a model's inference pass (Geva et al., 2023; Li et al., 2023). Extending this result to multi-hop prompts, we hypothesize that attention layers play an important role in retrieving memories relevant to the "hop" in a given prompt. Therefore we define two algorithms below: one for analyzing attention head outputs in embedding space, and the other for injecting a targeted memory into a model's hidden activations in order to correct faulty or incomplete reasoning.

Interpreting Attention Heads
We want to further understand the outputs of individual heads, and more specifically assess if any individual attention heads are exercised differently by single-hop vs. multi-hop prompts.
Inspired by Logit Lens (nostalgebraist, 2021), we leverage the model's unembedding matrix to study the internal mechanism of each attention head. For attention head j in layer ℓ, h^{ℓ,j}, we apply the model's unembedding matrix W_U followed by a softmax(·) operation, and interpret the last token position (out of N total tokens) as a set of probabilities over tokens in the vocabulary space:

softmax(h_N^{ℓ,j} W_U) ∈ R^{|V|}.   (8)

See in Fig. 2 an example of discrepancy in attention head behavior, when using Eq. (8), for analogous single- vs. multi-hop prompts. See additional examples in Table 5.
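This projection (Eq. (8)) amounts to multiplying a head's last-token output by W_U and softmaxing. A toy sketch follows; the 2-dimensional hidden state, 3-word vocabulary, and W_U values are invented for illustration, whereas a real implementation would use the model's actual unembedding matrix.

```python
import math

def softmax(v):
    mx = max(v)
    e = [math.exp(x - mx) for x in v]
    s = sum(e)
    return [x / s for x in e]

def inspect_head(h_last, W_U, vocab):
    """Project a head's last-token output into vocabulary space (Eq. 8).

    h_last: length-d head output at the final token position.
    W_U: d x |V| unembedding matrix (list of rows).
    Returns (token, probability) pairs sorted by descending probability.
    """
    logits = [sum(h_last[i] * W_U[i][j] for i in range(len(h_last)))
              for j in range(len(vocab))]
    probs = softmax(logits)
    return sorted(zip(vocab, probs), key=lambda t: -t[1])

# Hypothetical 2-d hidden state and 3-token vocabulary:
vocab = ["Reef", "Australia", "the"]
W_U = [[2.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]
ranked = inspect_head([1.0, 0.5], W_U, vocab)
# The top-ranked token here is "Reef" (logit 2.0 vs. 0.5 and 1.5).
```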
A potential limitation of this approach is that it may portray attention head behavior inaccurately due to representational drift between model layers and, like (nostalgebraist, 2021), may not generalize to other models. Nevertheless, we find it to be an effective preliminary tool for studying the function of attention heads in updating the output distribution. We leave the development of an interpretability tool that considers these drawbacks to future work.

Memory Injections to Correct Failures
Fig. 2 shows how Eq. (8) can reveal discrepancies between attention head behaviors for single- vs. multi-hop prompts. We hypothesize that such discrepancies arise because the model, when updating the output distribution in each layer, fails to incorporate information about the implicit entity in the multi-hop prompt. This seems reasonable: to retrieve information about an implicit entity, one likely must first relate that entity to some explicit subject and then retrieve relevant information (hence our notion that processing prompts with implicit subjects requires an extra hop compared to those with explicit subjects).

Thus we design a method (see Fig. 3) for injecting a missing hop directly into the output hidden states of an attention layer before those outputs are added back into the transformer's residual stream:

1. Let m be a memory (a phrase, for example: "The Great Barrier Reef") and let τ be the magnitude of the memory injection.

2. Tokenize the memory m into t_0, · · · , t_q, where q is the number of tokens. Encode each token t_i as a one-hot vector b_i ∈ {0, 1}^{|V|} and sum all resulting one-hot vectors b_i into a binary vector B ≜ Σ_i b_i.

3. Embed the binary vector B back into the model's latent space by applying the transpose of the unembedding matrix: m_emb = B W_U^T ∈ R^d.

4. To inject the memory at the attention layer of layer ℓ, add the embedded memory into the outputs of the attention heads during the inference pass: a^ℓ ← a^ℓ + τ · m_emb.

See additional examples of memory injections in Table 4.
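The injection steps above can be sketched in plain Python. The toy vocabulary, unembedding matrix, and dimensions below are invented for illustration; a real implementation would operate on the model's actual tokenizer, W_U, and hidden states during the forward pass.

```python
def embed_memory(memory_tokens, W_U, vocab):
    """Steps 2-3: build the multi-hot vector B over the vocabulary,
    then map it into latent space as B @ W_U^T, yielding a d-vector."""
    B = [0.0] * len(vocab)
    for tok in memory_tokens:          # sum of one-hot vectors b_i
        B[vocab.index(tok)] = 1.0
    d = len(W_U)                       # W_U is d x |V|
    return [sum(B[j] * W_U[i][j] for j in range(len(vocab)))
            for i in range(d)]

def inject(attn_output, memory_vec, tau):
    """Step 4: a^l <- a^l + tau * embedded memory, at each position."""
    return [[a + tau * m for a, m in zip(pos, memory_vec)]
            for pos in attn_output]

# Toy setup: 4-token vocabulary, hidden dimension d = 2.
vocab = ["The", "Great", "Barrier", "Reef"]
W_U = [[1.0, 0.0, 0.5, 0.0],
       [0.0, 1.0, 0.0, 0.5]]
mem = embed_memory(["Great", "Barrier", "Reef"], W_U, vocab)
# mem -> [0.5, 1.5]
a_l = [[0.0, 0.0], [1.0, 1.0]]        # pretend attention-layer output, N = 2
a_l = inject(a_l, mem, tau=4.0)
```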

Results and Discussion
We report, in turn, on our curated memory, random memory, and part-of-speech injection experiments.

Curated Memory Injections
We hypothesize that a model's poor performance on multi-hop prompts is due to its inability to resolve the implicit subject (e.g., "The largest coral reef system in the world") to an explicit subject (e.g., "The Great Barrier Reef"). This failure limits the later layers' ability to retrieve relevant information about this subject before predicting the next token. Therefore, in this experiment, we curate sets of tokens to inject into our model's residual stream such that it can resolve the explicit subject more easily. We further study the effect that the injection magnitude τ has on its success.

Experimental design: For every multi-hop prompt in our datasets, we extract the explicitly stated subject from the corresponding single-hop prompt and inject those tokens as memories into each attention layer as described in Section 4.2. For example, given the single-hop prompt "The Great Barrier Reef is located off the coast of. . ." and the multi-hop prompt "The largest coral reef system in the world is located off the coast of. . .," the memory is "The Great Barrier Reef."
Figure 4: Curated memory injections. From left to right: GPT2-Small + Hand, GPT2-Large + Hand, GPT2-Small + 2WMH, GPT2-Large + 2WMH. Each cell in each heatmap is the average percent difference between the pre- and post-injection next token predictions for multi-hop prompts. Green cells denote a positive percent difference (i.e., correct prediction is more likely), while red cells denote a negative percent difference (i.e., correct prediction is less likely). When computing the averages for each (ℓ, τ) pair we exclude outliers not within ±2 standard deviations from the mean.

We assess the effects of these two parameters, injection layer ℓ and magnitude τ, for both GPT2-Small and GPT2-Large. We measure the success of a memory injection by calculating the percent increase between the model's predicted probability for the expected next token from the multi-hop prompt with and without the injection. A greater positive difference indicates a more successful injection.

Discussion: Results are in Fig. 4. We observe that each model/dataset combination has an optimal layer ℓ and magnitude τ for memory injections: the darkest green areas, which signify the highest average percent increase in probability of the expected next token for the respective dataset. The best (ℓ, τ) pair injection results are in Table 2. Additional examples of memory injections are in Table 4.
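The success metric described above can be sketched directly; the probability values in the example are illustrative, not taken from our results tables.

```python
def percent_change(p_before, p_after):
    """Percent difference between pre- and post-injection probabilities
    of the expected next token; positive means the injection helped."""
    return 100.0 * (p_after - p_before) / p_before

# E.g., if the probability of the expected token rose from 0.09 to 0.26
# (hypothetical values), the injection yields roughly a +189% change.
gain = percent_change(0.09, 0.26)
```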

Random Memory Injections
In Section 5.1, we identify ideal (ℓ, τ) pairs for each model and dataset for a curated memory injection. We now demonstrate that the results we observe are not spurious: i.e., the information that we inject at each head should be related to the explicit subject. We demonstrate the need for our particular injection routine by assessing the effects on model accuracy of randomly injecting tokens from various parts of speech.
Experimental design: We conduct targeted injections for the high-scoring (ℓ, τ) pairs identified via the experiment in Section 5.1 (Table 2). Instead of injecting curated subject tokens, we select as candidate injections the 40 most common words from each of the adjectives, adverbs, conjunctions, nouns, verbs, and top 5050 subsets of our Part of Speech dataset. We then apply each word as an individual injection for every prompt in our multi-hop dataset at the ideal (ℓ, τ) pair. We term these injections "random," as they were not curated to be relevant to our prompts.
Discussion: The results are in the right half of Table 2. We observe that a random injection led, on average, to a degradation in predictive performance across most parts of speech considered, as indicated by a negative percent difference (decrease in correct answer probability) between the pre- and post-injection expected next token probabilities for multi-hop prompt completions. Additionally, no random injection result exceeded the performance of a curated injection. These findings suggest that the choice of injected tokens is critical for improving multi-hop prompt completion success.

Memory Injections for Parts of Speech
We have tested curated vs. random memory injections at ideal (ℓ, τ) pairs. Now we assess whether memory injections from specific parts of speech more broadly have positive impacts on prompt completions, not just at the ideal locations for curated memories, but also at other (ℓ, τ) pairs. Our hypothesis is that if a transformer-based LLM has learned a division of labor regarding which attention layers are responsible for retrieving specific concepts (e.g., parts of speech), then this experiment might highlight those learned roles.
Experimental design: This experiment is identical to that of Section 5.1, except that: (i) for each part of speech pos ∈ [adjectives, adverbs, conjunctions, nouns, verbs, top 5050], we use a randomly selected word: e.g., "apple" from "nouns"; and (ii) when searching for the ideal (ℓ, τ) pair for a given part of speech and multi-hop prompt, we use a new random word for each injection.

Table 2: (ℓ, τ) pairs for the best token injections, along with the average percent difference (excluding outliers > ±2 standard deviations from the mean) between pre- and post-injection expected next token predictions for multi-hop prompts. Each random injection column indicates 40 random injections from [Adjectives, Adverbs, Conjunctions, Nouns, Verbs, Top 5050] at the ideal (ℓ, τ).

Discussion: The results are in Fig. 5. We note that for no part of speech considered here does the average performance of the studied memory injections exceed that of the curated memory injections presented in Table 2. Additionally, memory injections from adjectives, adverbs, nouns, verbs, and top 5050 seemed to exhibit similar behavior. Memory injections from conjunctions, however, typically outperformed all other parts of speech. We hypothesize that this is because conjunctions often play a neutral role in prompt completions. Thus, while a random noun (e.g., "apple") might distort prompt completion, a random conjunction (e.g., "and," "for") is less likely to do so.
We note also that for each part of speech, performance averaged over all injections for most (ℓ, τ) pairs was reduced (< 0) for Hand (see Fig. 5: subplots c, d, g, h), but was sometimes improved (> 0) for 2WMH (see Fig. 5: subplots a, b, e, f). We attribute this result to the relative difficulties of the two datasets. Hand has, on average, lower surprisals than does 2WMH, as seen in Table 1, suggesting that there is additional information that the model could use successfully for 2WMH, but not for Hand.
These results (see also the Appendix, Figs. 6-9) suggest that while curated memories are ideal for correcting multi-hop reasoning failures, language models can also benefit from injections of words from different parts of speech. This result suggests that different parts of a language model (namely, early layers) serve specialized roles, with some dealing with processing related to specific parts of speech.
In future work we will curate relevant memories from various parts of speech for each prompt, to better understand the effects of curated memories.

Related Work
Much recent work has focused on the inner workings of Transformers (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020; Radford et al., 2019). Nanda et al. (2023) explore how the emergent properties of LLMs form during training. Recent interpretability research has focused on the mechanisms by which linear layers in LLMs retrieve information, characterizing them as key-value stores of information (Geva et al., 2021; Dai et al., 2022a,b) and showing that tokens can be characterized by their distribution in the output vocabulary (Geva et al., 2022).
Others have also examined the intermediate activations of LLMs in order to uncover underlying reasoning mechanisms. nostalgebraist (2021) applied GPT-2's unembedding matrix to intermediate layers to interpret how the model arrives at its final answer. Belrose et al. (2023) employed a learned transformation to mitigate the effect of any bias introduced by using the unembedding matrix.
There has been much recent interest in whether LLMs are reliable stores of information, with attempts both to identify where knowledge exists and to edit stored factual knowledge effectively (Mitchell et al., 2022a,b; Elazar et al., 2021; Hase et al., 2023). Recent approaches to knowledge editing make use of learned hyper-models to edit weights, additional trained parameters, or direct interventions on model weights (De Cao et al., 2021; Huang et al., 2023; Dhingra et al., 2022). However, these approaches raise another issue: dealing with knowledge retention and preventing catastrophic forgetting (Jang et al., 2022; Hase et al., 2021; Zhong et al., 2023). Additionally, it is not clear that the mechanisms by which model predictions are constructed are fully understood, limiting our ability to improve model performance (Turpin et al., 2023). Some approaches propose to use external knowledge stores such as knowledge graphs to augment the factual capabilities of LLMs (Jiang et al., 2023; Sun et al., 2018; Zhang et al., 2022).

Conclusions and Future Directions
We demonstrate that a key reason LLMs perform poorly on multi-hop prompts is that they fail to recall intermediary information relevant to a hop. We find that attention heads play an important role in this factual recall process, and that in the case of multi-hop reasoning, certain attention layers fail to recall relevant information. To rectify this shortcoming, we establish an algorithm for injecting "memories" directly into the model's hidden activations during inference. Through experimentation, we find that injecting relevant memories into the hidden activations of the attention heads during inference is an efficient way to boost model performance on multi-hop prompts.
We anticipate that our memory injection scheme can extend a model's longevity by enabling less frequent retraining/fine-tuning.We also hope in future work to demonstrate the use of memory injections to correct stale or incorrect information, remove private or harmful information, and combat bias during LLM inference.
There is also a tremendous opportunity to scale online memory injections to enhance the quality of thousands or millions of model inferences, if we can automate the process of memory selection via unsupervised algorithms, for instance by connecting LLMs with knowledge bases.

Limitations
Internal biases of the question writers, as well as the rigid structure that had to be imposed on the prompts, mean that our human-generated dataset is representative of only a small fraction of the many types of multi-hop questions. Furthermore, our hand-generated dataset is relatively small compared to our programmatically generated dataset. Additionally, our analyses were limited to GPT2-Small and GPT2-Large; further work is needed to determine whether, as we expect, other language models sharing a transformer-based architecture and a similar unsupervised causal language modeling training objective display similar behavior. Lastly, we rely on the model's unembedding matrix W_U to interpret model hidden states and embed memories for injection. While our results indicate that this transformation was sufficient, we acknowledge that the unembedding matrix is not tuned to interpret intermediate layers; we aim to address this shortcoming in future work by instead using layer-specific learned projections to transform between hidden states and vocabulary.

Ethics
Our attention head inspection mechanism uncovered several sources of bias (such as racism); see Table 5 for examples. We expect a more detailed study of the attention heads of GPT2-Small and GPT2-Large, as well as other LLMs, to reveal additional undesirable behaviors. We aim in future work to use our inspection method to uncover (and hopefully address) these biases.
Broader Impacts: Memory injections can extend model longevity by allowing users to apply lightweight, non-gradient-based edits directly to the model's inference path. Thus, they can reduce the need for costly model fine-tuning/re-training to meet standards for factual correctness or to incorporate new information into an existing model. Additionally, memory injection can further augment the abilities of smaller LLMs, which have a reduced capacity to store information relative to their larger counterparts. In this situation, memory injections, if applied correctly, may enhance the performance of AI in resource-constrained settings. As more robust and scalable methods for selecting memories are discovered in future work, memory injection can be adopted into existing inference workflows as a means of augmenting LLMs with large knowledge stores.

Figure 2: Diagram of language model reasoning. Highest ranked attention outputs of GPT2-Small at layer ℓ = 9, head h = 8 when projected into vocabulary space (via the GPT2-Small embedding matrix) for a single-hop prompt (green) and its multi-hop counterpart (red).

Figure 5: Part of speech memory injections. This figure shows the average effect of memory injections from various parts of speech as a function of layer ℓ (top row) and magnitude τ (bottom row). The standard deviation, scaled by 10%, is pictured across magnitudes (top row) and layers (bottom row).

Figure 6: GPT2-Large, 2WMH dataset. Heatmap shows average percent difference between pre- and post-injection answer probabilities for multi-hop prompts, excluding outliers not within ±2 standard deviations from the mean, across various parts of speech.

Figure 7: GPT2-Large, Hand dataset. Heatmap shows average percent difference between pre- and post-injection answer probabilities for multi-hop prompts, excluding outliers not within ±2 standard deviations from the mean, across various parts of speech.

Figure 8: GPT2-Small, 2WMH dataset. Heatmap shows average percent difference between pre- and post-injection answer probabilities for multi-hop prompts, excluding outliers not within ±2 standard deviations from the mean, across various parts of speech.

Figure 9: GPT2-Small, Hand dataset. Heatmap shows average percent difference between pre- and post-injection answer probabilities for multi-hop prompts, excluding outliers not within ±2 standard deviations from the mean, across various parts of speech.

Table 1: Properties of the datasets used in our work. Size: number of prompts. Answer prob.: average model probability for the expected next token. Surprisal: average model surprisal value for the expected next token (surprisal ≜ − log(p), where p is a probability). Prompt len.: average tokenized length of prompt.