A Unified Encoder-Decoder Framework with Entity Memory

Entities, as important carriers of real-world knowledge, play a key role in many NLP tasks. We focus on incorporating entity knowledge into an encoder-decoder framework for informative text generation. Existing approaches index, retrieve, and read external documents as evidence, but they suffer from a large computational overhead. In this work, we propose an encoder-decoder framework with an entity memory, namely EDMem. Entity knowledge is stored in the memory as latent representations, and the memory is pre-trained on Wikipedia along with the encoder-decoder parameters. To precisely generate entity names, we design three decoding methods that constrain entity generation by linking entities in the memory. EDMem is a unified framework that can be used on various entity-intensive question answering and generation tasks. Extensive experimental results show that EDMem outperforms both memory-based auto-encoder models and non-memory encoder-decoder models.


Introduction
A large amount of real-world knowledge is related to entities, e.g., persons, nations, and events. Entity knowledge is the information describing facts and attributes related to entities. Many entity-intensive NLP tasks require models to obtain entity knowledge to generate informative outputs, such as answering factual questions (Kwiatkowski et al., 2019), explaining claims (Onoe et al., 2021), or making informative conversations (Dinan et al., 2019). Pre-trained encoder-decoder models can be directly applied to such entity-intensive tasks (Ye et al., 2020; Roberts et al., 2020), but their ability to store and use knowledge is still questionable (Lewis et al., 2021; Wang et al., 2021). A popular approach to incorporate knowledge into the generation process is retrieving evidence documents from external sources (Lewis et al., 2020b; Izacard and Grave, 2021; Oguz et al., 2020; Yu et al., 2022c). However, such methods suffer from significant computational overheads in indexing, retrieving, and reading a large number of extra documents (Lee et al., 2021; de Jong et al., 2022). Therefore, it is important to give encoder-decoder models access to entity knowledge without sacrificing too much efficiency.
Recently, it has been proposed to use an in-model memory to augment auto-encoder models with entity knowledge on entity linking tasks (Févry et al., 2020; Verga et al., 2021; Sun et al., 2021). The entity memory stores entity knowledge as dense vectors that can be directly incorporated into the hidden states of Transformer models (Vaswani et al., 2017), with no need to encode extra text. However, the auto-encoder framework in previous approaches can only select entities from a pre-defined entity vocabulary. Hence, they can neither predict an entity outside the vocabulary nor generate answers or text beyond a single entity.
In this paper, we propose a novel Encoder-Decoder framework with an entity Memory (EDMem), as shown in Figure 1. EDMem is a unified framework for various entity-intensive QA and generation tasks, in which we train an entity memory for efficient knowledge incorporation. First, EDMem is pre-trained on Wikipedia documents, where it learns entity embeddings in the memory along with an encoder-decoder model. EDMem learns to select relevant entities from the memory via an entity linking objective, and learns to generate answers using entity knowledge via a language modeling objective. Second, to precisely generate entity names, we design three decoding methods that utilize the entity linking ability of EDMem in its generation process when we fine-tune it on downstream tasks. These include (1) free-form generation: left-to-right generation with entity identifiers; (2) static entity linking: first select entities by entity linking, build prefix trees for the selected entities, and then perform constrained entity generation using the trees; (3) dynamic entity linking: select entities on-the-fly for constrained entity generation.
We conduct experiments on two popular testbeds of entity knowledge: open-domain QA and entity-intensive generation. With the incorporation of entity knowledge, EDMem outperforms non-memory encoder-decoder models on both tasks, and it retains the efficiency advantage of closed-book (i.e., non-retrieval) models. Compared to memory-based auto-encoders, EDMem achieves both higher overall accuracy (+9%) and better entity precision (+8%) on open-domain QA datasets, and it generates high-quality text from the memory-supported decoder on generation datasets where auto-encoders fail to do so. To summarize, EDMem is the first knowledge-augmented closed-book framework to perform both tasks in a unified manner.

Related Work
Closed-Book Models Closed-book models are pre-trained models that store knowledge in their own parameters. For example, COMET (Bosselut et al., 2019) fine-tuned GPT-2 (Radford et al., 2018) to construct knowledge graphs by generating commonsense triples. Recently, fine-tuned BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020) models have proved competitive on open-domain QA (Ye et al., 2020; Roberts et al., 2020). Therefore, closed-book models are able to memorize some entity knowledge after being pre-trained on massive data. However, studies showed that closed-book models just recall similar inputs and answers seen in their pre-training corpus (Wang et al., 2021), and their performance lags behind open-book models.
Open-Book Models Open-book models first retrieve evidence documents from external corpora and then read these documents to predict an answer (Chen et al., 2017). REALM (Guu et al., 2020) proposed a self-supervised approach to pre-train a retriever-reader model. DPR (Karpukhin et al., 2020) devised a contrastive objective to train a dense bi-encoder retriever on open-domain QA. Subsequent approaches combined DPR with a generative objective to build large, powerful models on open-domain QA and generation tasks (Lewis et al., 2020b; Izacard and Grave, 2021; Sachan et al., 2021; Yu et al., 2022a). However, open-book models have to process the raw text of all retrieved documents, which leads to extremely long inference time. Besides, loading the document index and retrieving evidence documents for each example bring additional overhead.
Entity Memory EaE (Févry et al., 2020) was the first to pre-train an entity memory with an auto-encoder framework to perform entity prediction on open-domain QA. FILM (Verga et al., 2021) followed EaE and added a fact memory containing representations of Wikidata triples. To better encode relational knowledge, OPQL (Sun et al., 2021) learned latent relational representations for arbitrary entity pairs. Recent work focused on learning a huge mention-level memory (~150M entries) with extensive pre-training (de Jong et al., 2022) or leveraging the entity memory in domain-adaptive training (Kang et al., 2022). These models are all based on an auto-encoder framework. Thus, they can predict entity IDs but fail to generate any non-entity answers or sentences. A preprint paper contemporaneous to our work trained a memory with an encoder-decoder model (Chen et al., 2022). However, it used QA pairs as memory entries instead of entities, limiting its application to QA tasks. Besides, their memory is much heavier (60M entries) than ours (1M).

Proposed Framework
Suppose we have a pre-defined vocabulary of N entities E = {e_1, ..., e_N}. A mention is the actual tokens in context that refer to an entity. The set of all mentions in the corpus is denoted as M. Thus, there is a global alias table T : E → 2^M, where each entity is mapped to all its mentions. The input of EDMem is a sequence of tokens x of length S, and the target output is another sequence y = [y_1, ..., y_T] of length T. Both sequences contain a pre-labeled set of mentions, each of which refers to an entity in E. We add two special tokens [E_s] and [E_e] to represent the "entity start" and "entity end" boundaries of a mention, e.g., "[E_s] Brett Hart [E_e] is the president of the [E_s] United Airlines [E_e]". These special tokens come from either Wikipedia hyperlinks (in pre-training, §3.3) or an entity linking model (in fine-tuning, §3.4).
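The mention markup above can be sketched as follows. This is a toy illustration of the input format, not the paper's pre-processing code; the helper name and the (start, end) span convention are our own assumptions.

```python
# Sketch: wrap pre-labeled mention spans in [E_s] / [E_e] boundary tokens.
# `mentions` is a list of (start, end) token-index pairs (end exclusive),
# assumed non-overlapping and sorted.

def mark_mentions(tokens, mentions):
    out = []
    starts = {s for s, _ in mentions}
    ends = {e for _, e in mentions}
    for i, tok in enumerate(tokens):
        if i in starts:
            out.append("[E_s]")      # entity-start boundary
        out.append(tok)
        if i + 1 in ends:
            out.append("[E_e]")      # entity-end boundary
    return out

tokens = "Brett Hart is the president of the United Airlines".split()
mentions = [(0, 2), (7, 9)]  # "Brett Hart", "United Airlines"
print(" ".join(mark_mentions(tokens, mentions)))
# [E_s] Brett Hart [E_e] is the president of the [E_s] United Airlines [E_e]
```

The [E_s] token is what later triggers a memory access in both the encoder and the decoder.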

Architecture
An overview of EDMem is presented in Figure 1. The framework has a Transformer encoder, a Transformer decoder, an entity memory, and two prediction heads. Both the encoder and decoder have two parts: L_1 lower layers and L_2 upper layers. Transformer layers in EDMem have the same architecture as BART (Lewis et al., 2020a). At the end of the lower layers, EDMem is allowed to use the hidden states as a query to access the entity memory. The knowledge representation obtained by each memory access is summed and normalized with the hidden states before further reasoning in the upper layers. Two prediction heads use the final hidden states of the decoder for prediction: an LM head for token prediction and an entity linking head for entity prediction (details in §3.3). In practice, we follow EaE (Févry et al., 2020) to set L_1 = 4 and L_2 = 8.

Entity Memory
The entity memory contains a large embedding table, which stores the embeddings of entities in E. Intuitively, an entity embedding encodes the contextual information around all mentions of the entity in Wikipedia documents. During encoding and decoding, EDMem queries the entity memory whenever it encounters a mention. It recognizes mentions by identifying the [E_s] token. EDMem takes the hidden state of the [E_s] token as a query to retrieve relevant knowledge from the entity memory by attending to the entity embedding table (bias terms are omitted):

α_i = softmax_i((W_in h^low_s)ᵀ e_i),    h^ent_s = W_out Σ_i α_i e_i,

where e_i is the embedding of entity e_i, h^low_s denotes the hidden state of the [E_s] token (from the lower encoder/decoder layers), and h^ent_s is the aggregated entity representation, which is summed and normalized with h^low_s before being passed into the upper layers. W_in and W_out are linear projection layers for dimension matching. Following EaE, during inference we aggregate the entity representations of only the top 100 entities (sorted by α_i) instead of attending to all N entities.
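A minimal numpy sketch of one memory access, following the attention described above. The dimensions (256-dim entity embeddings, top-100 truncation) are from the paper; the toy vocabulary size, the 768 model dimension, and the random weights are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_ent, d_model = 1000, 256, 768        # toy sizes; the paper uses N = 1M
E = rng.standard_normal((N, d_ent))       # entity embedding table
W_in = rng.standard_normal((d_model, d_ent))
W_out = rng.standard_normal((d_ent, d_model))

def access_memory(h_low_s, top_k=100):
    """Query the memory with the lower-layer hidden state of an [E_s] token."""
    q = h_low_s @ W_in                    # project query into entity space
    scores = E @ q                        # dot product with every entity
    top = np.argsort(scores)[-top_k:]     # inference-time top-k truncation
    alpha = np.exp(scores[top] - scores[top].max())
    alpha /= alpha.sum()                  # softmax over the top-k entities
    h_ent = (alpha @ E[top]) @ W_out      # aggregate and project back
    return h_ent, alpha

h_ent, alpha = access_memory(rng.standard_normal(d_model))
print(h_ent.shape)                        # (768,)
```

In the real model, h_ent would then be summed and layer-normalized with h^low_s before entering the upper layers.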

Pre-Training Corpus
We pre-train EDMem on the whole Wikipedia corpus. All documents are split into 128-token passages. In addition, we set a 10-token sliding window between passages to avoid an entity being split between two adjacent chunks. Such a setting yields a total of 39M passages, of which we hold out 0.5% as the validation set during pre-training. We leverage Wikipedia hyperlinks as gold annotations of 249M mentions and their linked entities. Since hyperlinks do not cover all mentions in text, we heuristically label missing mentions to create more training signals for the entity memory. We use the alias table T to label all mentions in a Wikipedia page if they match either (1) a linked entity in the same page or (2) the title entity of this page. This leads to a total of 468M mentions in the pre-training corpus. We collect the 1M most frequently linked entities to form the entity vocabulary E. More details can be found in Appendix A.
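The chunking above can be sketched as follows: 128-token windows with a 10-token overlap, so a mention near a chunk boundary appears intact in at least one passage. Tokenization here is naive placeholder tokens; the rounding-up of the last passage (Appendix A.1) is omitted for brevity.

```python
# Sketch: split a document into fixed-size passages with a sliding-window
# overlap between consecutive passages.

def split_passages(tokens, size=128, overlap=10):
    passages, start = [], 0
    step = size - overlap                 # advance by 118 tokens each time
    while start < len(tokens):
        passages.append(tokens[start:start + size])
        start += step
    return passages

doc = [f"tok{i}" for i in range(300)]
chunks = split_passages(doc)
print([len(c) for c in chunks])           # [128, 128, 64]
```

Note that the last 10 tokens of each chunk reappear as the first 10 tokens of the next, which is the sliding window the paper describes.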

Pre-Training Objective
Our pre-training objective is a combination of language modeling and entity linking. For the language modeling objective, we randomly corrupt parts of the input sequence and train EDMem to reconstruct the original sequence. We adopt two kinds of sequence corruption: random token masking and salient span masking. In random token masking, each token has a probability of P_rtm to be replaced by a [MASK] token. Salient span masking is adapted from (Guu et al., 2020), where each mention has a probability of P_ssm that all tokens within the mention are replaced by [MASK]. Such explicit masking of whole mention names encourages EDMem to rely on the entity memory in predicting mentions, which facilitates the learning of entity embeddings. The LM head performs token prediction through a linear-softmax layer, and the LM loss is the negative log-likelihood of the target sequence: L_LM = −Σ_{j=1}^{T} log P(y_j | x, y_{1:j−1}). EDMem applies direct supervision signals to the entity memory for entity representation learning. The entity linking loss is applied each time the model queries the entity memory. Besides in the middle of the encoder and decoder, EDMem queries the memory in the entity linking head, as shown in Figure 1. The entity linking head predicts the corresponding entity using the hidden states of each mention, the same as Equation (2). We use a cross-entropy loss to maximize the attention weights of the labeled entities: L_EL = −Σ_m log α_i, where m is a mention in the input or output sequence that is linked to the i-th entity in E. The final loss function is L_LM + λ_EL L_EL, where the coefficient λ_EL is a hyper-parameter.
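The two corruption schemes can be sketched as below. This is a toy illustration under our own conventions (mention spans as (start, end) index pairs, a fixed seed), not the paper's actual data pipeline.

```python
import random

# Sketch: random token masking with rate P_rtm, plus salient span masking
# with rate P_ssm where an entire mention span is masked at once.

def corrupt(tokens, mentions, p_rtm=0.3, p_ssm=0.5, seed=0):
    rng = random.Random(seed)
    out = list(tokens)
    in_mention = set()
    for s, e in mentions:
        in_mention.update(range(s, e))
        if rng.random() < p_ssm:          # salient span masking: all-or-nothing
            for i in range(s, e):
                out[i] = "[MASK]"
    for i in range(len(out)):
        if i not in in_mention and rng.random() < p_rtm:
            out[i] = "[MASK]"             # independent random token masking
    return out

tokens = "Brett Hart is the president of United Airlines".split()
corrupted = corrupt(tokens, [(0, 2), (6, 8)])
```

By construction, a mention is either fully masked or fully visible, which is what pushes the model toward the entity memory when predicting masked mentions.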

Fine-Tuning
EDMem is fine-tuned on downstream tasks via an LM objective and an entity linking objective. The LM objective maximizes the probability of the task-specific output. The entity linking objective links mentions to entities in the memory, the same as in pre-training. Mention boundaries are pre-labeled using a state-of-the-art entity linking model (Li et al., 2020). In entity-intensive downstream tasks, the entity memory assists sequence generation by not only providing entity knowledge but also generating entity names. Thus, we design three decoding settings to let the entity linking objective assist sequence generation. A sketch of the different settings is given in Figure 2.
Free-Form Generation In this setting, the model generates the output sequence entirely based on the probability given by the LM head. This includes the special tokens [E_s] and [E_e], which indicate an access to the memory. There is no constraint on what tokens to generate between [E_s] and [E_e], i.e., the subsequence [E_s], y_i, ..., y_j, [E_e] may not be a valid entity name in the entity vocabulary. One advantage is that the model processes the entity knowledge in a latent manner, which does not explicitly affect the probability distribution of the language model. However, this may hurt the model's performance in tasks where exact entity names are strictly required, e.g., open-domain QA tasks evaluated by exact match.
Static Entity Linking Static entity linking explicitly restricts the model to generate entity names for QA. Here, the decoding process is divided into two steps: entity linking and constrained generation. First, given a question, the model selects one or multiple entities as references. As shown in Figure 2(b), the question with an appended [E_s] token as a placeholder is passed into the decoder, and the entity linking head is trained to predict the entity ID of the gold answer. We then take the selected top-k entities for each test question and restrict the generation space to these entities whenever the model tries to generate an entity name. To achieve this, inspired by (Cao et al., 2021), we build a prefix tree over the k entities for each test example. The prefix tree tells the model which tokens are allowed given a prefix (i.e., previously generated tokens). When the model generates an [E_s] token, we restrict the following generated tokens to form one of the k entity names (i.e., one of the paths in the prefix tree). In this way, the model can either generate an entity answer (by generating [E_s] and traversing the pre-built prefix tree) or a non-entity answer (if no [E_s] token is generated). Readers can refer to (Cao et al., 2021) for more implementation details.
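The constrained-generation step can be sketched as a trie over the tokenized top-k entity names: after [E_s], only continuations that follow a path in the trie are allowed. Whitespace tokenization and the closing [E_e] marker on each path are illustrative simplifications, not the model's subword vocabulary.

```python
# Sketch: prefix tree (trie) for constrained entity-name generation.

def build_trie(entity_names):
    trie = {}
    for name in entity_names:
        node = trie
        for tok in name.split() + ["[E_e]"]:   # terminate each path with [E_e]
            node = node.setdefault(tok, {})
    return trie

def allowed_next(trie, prefix):
    """Tokens permitted after a partial entity name `prefix`."""
    node = trie
    for tok in prefix:
        node = node.get(tok, {})
    return set(node)

trie = build_trie(["Chicago Symphony Orchestra", "Chicago Bulls"])
print(allowed_next(trie, []))                   # {'Chicago'}
print(allowed_next(trie, ["Chicago"]))          # {'Symphony', 'Bulls'} (order may vary)
```

During decoding, the LM's distribution at each step would be masked so that only `allowed_next` tokens keep non-zero probability until [E_e] is emitted.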
Dynamic Entity Linking Static entity linking is applicable only when the downstream task can be converted into an entity linking objective. Another way to generate entities is to predict them on-the-fly. Each time the model generates an [E_s] token, the entity linking head predicts the top-k entities using the hidden state of [E_s], conditioned on previously generated tokens, as shown in Figure 2(c). This differs from static entity linking, where the model makes entity predictions solely dependent on the input sequence. A prefix tree over the names of the top-k entities is also built on-the-fly for constrained entity generation.

Experiments
We test our EDMem framework on two testbeds of entity knowledge: open-domain QA and entity-intensive generation tasks.

Data
Open-domain QA is a task where models are required to answer questions without any provided evidence. Questions are usually related to real-world facts and entities. We test EDMem on three popular datasets: Natural Questions (NQ) (Kwiatkowski et al., 2019), TriviaQA (TQA) (Joshi et al., 2017), and WebQuestions (WQ) (Berant et al., 2013). We follow the in-house splits introduced by (Lee et al., 2019). We also report results on the dev set of the official TQA split to compare with EaE (Févry et al., 2020). We report exact match (EM) scores on these datasets.
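The paper reports EM but does not spell out the normalization; the sketch below assumes the normalization conventionally used for these datasets (lowercasing, stripping punctuation and articles), which is an assumption on our part.

```python
import re
import string

# Sketch: exact-match (EM) scoring in the standard open-domain QA style.

def normalize(s):
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)   # drop English articles
    return " ".join(s.split())              # collapse whitespace

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

print(exact_match("The Beatles", ["Beatles", "The Beatles band"]))  # 1.0
```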
We mainly compare with previous closed-book models (i.e., models without evidence retrieval), including traditional encoder-decoder models like BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020), and memory-based auto-encoder models like RELIC (Ling et al., 2020), EaE, and FILM (Verga et al., 2021). Besides, we pre-train two ablations of EDMem: EncMem is composed of an encoder and an entity memory and is trained via the same objectives as EDMem; EncDec removes the entity memory from EDMem and is trained via the same LM objectives. We also list the performance of state-of-the-art open-book models (i.e., models with evidence retrieval to assist prediction) for reference, such as REALM (Guu et al., 2020), RAG (Lewis et al., 2020b), and FiD (Izacard and Grave, 2021). We test three variants of EDMem, i.e., free-form generation (-free), static entity linking (-stat.), and dynamic entity linking (-dyn.).

Results
Experimental results on open-domain QA datasets are listed in Table 1. With the same architecture, EncDec outperforms BART due to the additional salient span masking pre-training. Memory-based auto-encoder models like EaE and EncMem perform entity linking to provide answers. They outperform traditional encoder-decoder models by a large margin on TQA and WQ. Open-book models, in contrast, need to (1) load the document index into RAM, (2) retrieve evidence documents from the index, and (3) read all evidence documents to generate an answer. In addition to the overhead caused by accessing the index, the model needs to encode the raw text of all evidence documents (i.e., 100 documents for FiD) before generating an answer with the decoder, while EDMem and BART only need to encode the question itself. Thus, EDMem achieves a significant improvement over traditional encoder-decoder models while retaining the efficiency advantage of closed-book models.

Size of Entity Memory
We compare the performance of EDMem and its auto-encoder variant EncMem with different sizes of the entity memory. We randomly mask out entities from the original 1M vocabulary and re-train the model. Embeddings of masked entities do not participate in computing attention when accessing the memory. According to the curves in Figure 3, thanks to EDMem's ability of closed-book generation, it is less sensitive to the size of the entity memory, resulting in a smaller slope when fewer entities are visible. In particular, EDMem is still able to generate many correct answers even when we remove the whole memory. In contrast, EncMem can only predict random entities when the entire memory is masked, which leads to a score close to zero. These results show the advantage of encoder-decoder models over auto-encoder models when jointly trained with an entity memory, especially in low-resource scenarios.
In addition, we also illustrate the performance trend of EDMem on entity answers and non-entity answers in TQA. When all entities are masked, the model deteriorates to its non-memory variant EncDec. As more entity knowledge becomes available, EDMem performs better in predicting entity answers, while its generation performance remains consistent. These results show that the advantage of memory-based EDMem over traditional encoder-decoder models on entity-intensive tasks comes from the incorporation of entity knowledge from the entity memory.

On generation datasets, we report ROUGE (Lin, 2004) and unigram F1 scores, as well as BERTScore (Zhang et al., 2020a) for semantic-based evaluation. We also include metrics for evaluating entity generation. Given the entities in the ground-truth as reference, we calculate the coverage ratio of reference entities in the model-generated output. We also count the mentions of these entities as correct matches, according to the alias table T. To avoid cases where entities in the output can be directly copied from the input, we also report the coverage ratio of unseen entities, i.e., entities in the ground-truth output that do not exist in the input.
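The coverage metrics above can be sketched as follows. The alias table, the case-insensitive substring matching, and the handling of the empty-reference case are simplifying assumptions for illustration; the paper's exact matching procedure may differ.

```python
# Sketch: coverage ratio of reference entities in a generated output, with
# alias matching and an optional "unseen entities" restriction.

def coverage(generated, reference_entities, alias_table, input_text=None):
    def mentioned(entity):
        names = [entity] + alias_table.get(entity, [])
        return any(n.lower() in generated.lower() for n in names)

    refs = reference_entities
    if input_text is not None:    # unseen-entity variant: drop input entities
        refs = [e for e in refs if e.lower() not in input_text.lower()]
    if not refs:
        return 0.0
    return sum(mentioned(e) for e in refs) / len(refs)

aliases = {"United States": ["USA", "U.S."]}
gen = "The capital of the USA is Washington."
print(coverage(gen, ["United States", "Washington"], aliases))  # 1.0
```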

Results
Auto-encoder models like EaE are not applicable to these datasets, so we compare EDMem to the traditional encoder-decoder model BART. As shown in Table 4, the free-form EDMem outperforms BART on both reference-based metrics (ROUGE, F1, BERTScore) and entity coverage scores. This indicates that the entity knowledge in the entity memory helps generate sentences with desired entities and correct entity-related information. Since these datasets cannot be directly converted to an entity linking setting, EDMem-static is not applicable here. The dynamic entity linking variant outperforms the free-form variant and BART in entity coverage scores on all datasets, while not sacrificing much language fluency on reference-based metrics. We find that both EDMem variants outscore BART on entity coverage by a large margin (up to 56% on overall and up to 93% on unseen), which indicates a much stronger ability of EDMem models in entity generation. (On copying from the input: for example, in the CREAK dataset, the topic entity of the input claim usually appears in the explanation as well.)

Human Evaluation
To test whether the model generations are reasonable to humans, we leverage Amazon's MTurk platform to conduct human evaluation. For each dataset, we sample 50 examples with generations from BART and EDMem. We ask three annotators to evaluate each example on fluency and two knowledge-related metrics. For CREAK and MSMARCO, the knowledge-related metrics are topic relevance and factual correctness, given the ground-truth as reference. For ELI5 and WoW, since the ground-truth is not the only possible answer to the context, we use informativeness and reasonability as knowledge-related metrics, and we also evaluate the human-written answers. Detailed descriptions of these metrics are in Appendix F. When evaluating informativeness and reasonability, annotators are asked to rank the generations from #1 to #4, thus lower rankings indicate better results.
As shown in Table 5, EDMem generates more informative and factually correct sentences compared to BART, which lacks knowledge incorporation from the entity memory. Besides, such knowledge incorporation does not harm the fluency of model generations. On 3 out of 4 datasets, EDMem-dynamic achieves the best results on knowledge-based metrics. This indicates that integrating entity linking with text generation is beneficial for generating informative sentences with rich entity knowledge. Interestingly, annotators even prefer EDMem's generations over human answers on ELI5. One possible reason is that human answers are usually longer than model-generated ones, so not all clauses are closely related to the question. Also, the quality of some Reddit responses (i.e., the source of ELI5 data) may not be reliable.
Claim: "Chicago Symphony Orchestra started in Indiana. This is false because _____"
Ground truth: "Chicago is in Illinois so it did not start in Indiana."

Case Study
In Table 6, we show an example from CREAK with generations of different models. Without knowledge augmentation, BART fails to generate an informative explanation of why the starting place of the orchestra is not Indiana. Although EDMem-free steps closer to the correct explanation, it falsely predicts that the orchestra started in Utah. However, "Utah" does not exist in the top-5 linked entities during memory access. After we constrain the generation space of EDMem-dynamic to the top-5 predicted entities, "Utah" is no longer valid to generate, and the model finds "Illinois" as the correct location. Examples from other datasets can be found in Appendix G.

Impact of Entity Richness on Generation Improvement
In Table 7, we show detailed ROUGE-L scores grouped by the number of entity mentions in the ground-truth. Examples with more mentions require more entity knowledge to generate. We list scores for the CREAK dataset, where the outputs are short factual claims, and the ELI5 dataset, where the outputs are long and diverse answers. In both datasets, the improvement of EDMem over BART occurs on entity-rich generations. For examples that do not need entity knowledge (those with 0 mentions), there is not much difference between the two models. This further demonstrates the effectiveness of incorporating knowledge from the entity memory on entity-intensive generation tasks.

Conclusions
In this work, we proposed EDMem, an encoder-decoder framework with an entity memory. The entity memory was pre-trained on Wikipedia to provide entity knowledge for the encoder-decoder model. EDMem also performed entity linking with the memory to assist entity generation in downstream tasks. As a unified framework, EDMem outperformed previous closed-book models on various entity-intensive QA and generation tasks, while retaining the efficiency advantage over open-book models. Further analysis showed that EDMem benefits both from entity linking with the entity memory and from generation with the encoder-decoder framework.

Limitations
First, when applying EDMem to other datasets, its performance may correlate with the density of entity mentions in the data. EDMem may not acquire sufficient entity knowledge from the memory if there are few mentions in the specific task. Another limitation of our work is that the pre-trained entity memory may not generalize to special domains, e.g., biomedical text. Much domain-specific terminology is not included in our pre-trained entity memory, which may require additional training on domain-specific corpora.

A Pre-Training
A.1 Pre-Training Data We pre-train our model on the Wikipedia corpus of over 5 million documents. All documents are split into 128-token passages. The last passage is rounded up to 128 tokens by appending tokens from the beginning of the same document, so there are no cross-document passages. In addition, we set a 10-token sliding window between passages to avoid an entity being split between two adjacent chunks. Such a setting yields a total of 39 million passages, of which we hold out 0.5% as the validation set during the pre-training process. For supervision signals on the entity memory, we leverage Wikipedia hyperlinks as gold annotations. Each hyperlink provides the boundaries of a mention and the corresponding entity that the mention is linked to. However, the average density of Wikipedia hyperlinks is only one in 21 words, which means 6 mentions per passage. This is because in a specific page, (1) only the first mention of an entity is linked and (2) the title entity is not linked, since a hyperlink always redirects to a different page. To provide more supervision signals for entity embedding learning, we label the missing mentions using heuristic rules: we use the alias table T to label all mentions in a Wikipedia page if they match either (1) a linked entity in the same page or (2) the title entity of the page. After such heuristic labeling, the hyperlink density increases to one in 11 words, with a passage having 12 mentions on average. We manually checked 50 passages and found the precision of this heuristic labeling to be 92%, an acceptable rate.
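The heuristic labeling above can be sketched as alias-table string matching over one page. This is a naive character-level illustration (the real pipeline works on tokens and would need to resolve overlapping matches); the page text and alias table are invented examples.

```python
import re

# Sketch: label every occurrence of a page's candidate entities (already
# hyperlinked entities plus the title entity), including their aliases.

def label_mentions(text, candidate_entities, alias_table):
    """Return (start, end, entity) character spans for heuristic mentions."""
    spans = []
    for entity in candidate_entities:
        for name in [entity] + alias_table.get(entity, []):
            for m in re.finditer(re.escape(name), text):
                spans.append((m.start(), m.end(), entity))
    return sorted(spans)

aliases = {"Chicago Symphony Orchestra": ["CSO"]}
page = "The CSO performs in Chicago. The Chicago Symphony Orchestra was founded in 1891."
spans = label_mentions(page, ["Chicago Symphony Orchestra"], aliases)
print([(page[s:e], ent) for s, e, ent in spans])
```

Here the unlinked "CSO" and the full title name both get labeled, which is how the density rises from roughly 6 to 12 mentions per passage.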

A.2 Pre-Training Settings
We pre-train our model on the Wikipedia corpus containing 39 million passages for 1 million steps with a batch size of 2048. The AdamW (Loshchilov and Hutter, 2019) optimizer is used with a maximal learning rate of 1 × 10^-4 and a weight decay coefficient of 0.01. The learning rate is warmed up for 10% of the training steps and then linearly decays. The mask rate for random token masking is P_rtm = 0.3, and the mask rate for salient span masking is P_ssm = 0.5 (ablations in Appendix E.1). The maximum input sequence length is set to 128. The coefficient of the entity linking objective is set to λ_EL = 1.0, and the dropout rate is 0.1. The whole model is trained from scratch. We tried to initialize the encoder-decoder model with BART (Lewis et al., 2020a) and derive entity embeddings from BART embeddings, but the model turned out to be unstable in further training. We use mixed-precision floating point arithmetic (Micikevicius et al., 2018) to speed up training. The full setting of EDMem takes about two weeks to train on 16×A100 GPUs.
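The schedule above (linear warm-up over the first 10% of steps to the 1e-4 peak, then linear decay) can be written out directly; the closed-form helper below is a sketch, not the training code.

```python
# Sketch: learning rate at a given step under linear warm-up + linear decay.

def lr_at(step, total_steps=1_000_000, peak=1e-4, warmup_frac=0.1):
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak * step / warmup                         # linear warm-up
    return peak * (total_steps - step) / (total_steps - warmup)  # linear decay

print(lr_at(50_000))    # 5e-05 (halfway through warm-up)
print(lr_at(100_000))   # 0.0001 (peak)
```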

B Fine-Tuning
Different from pre-training on Wikipedia, in open-domain QA and generation tasks there are no gold annotations of mention boundaries in the input and output. Therefore, we annotate mention boundaries as well as the linked entities using a state-of-the-art neural entity linker, ELQ (Li et al., 2020). For generation datasets, we pass the source sequence and the target sequence into the ELQ model separately and obtain their mention annotations. For open-domain QA datasets, since the answers are usually short, we concatenate the question and the answer as input to the ELQ model. During fine-tuning, we tune the hyperparameters within the following ranges: learning rate ∈ {5e-6, 1e-5, 2e-5, 3e-5}, λ_EL ∈ {0.5, 1.0, 2.0}, dropout rate ∈ {0.1, 0.2, 0.3}, beam size ∈ {1, 3, 5}, #candidate entities (in static/dynamic entity linking) ∈ {1, 3, 5}. They are tuned based on the main evaluation metric of the specific task (open-domain QA: EM; WoW: F1; other generation datasets: ROUGE-L) on the dev set. Batch size is fixed to 256 unless it exceeds GPU memory or the dataset is too small (e.g., CREAK and WQ). Early stopping is used with 20 waiting steps on the dev set. Fine-tuning EDMem usually costs a few hours (e.g., ~3 hours on TQA) on 8×V100 GPUs.

C Entity Memory Settings
We collect the 1 million most frequent entities in Wikipedia documents as our entity vocabulary E. The frequency of an entity is calculated from how many hyperlinks are linked to the Wikipedia page of that entity. The dimension of entity embeddings learned in the memory is set to 256. The model attends to all 1 million entities during training. During inference, the top-100 entities are selected according to dot-product similarity, and we only integrate the embeddings of these 100 entities when performing attention.

D Datasets
D.1 Open-Domain QA Datasets Statistics of the open-domain QA datasets are listed in Table 8. In TQA, most previous works used the in-house split provided by (Lee et al., 2019), while we also test EDMem on the official dev set to compare with the scores reported by EaE.

D.2 Generation Datasets
Here we provide detailed descriptions of the generation datasets used in our experiments. Statistics of these datasets are listed in Table 9.
MSMARCO MSMARCO (Nguyen et al., 2016) was originally collected for the abstractive QA task. We use the NLGen split, where answers are sentences carefully written by human workers.

E Additional Experiments E.1 Pre-Training Mask Rates
We test the performance of EDMem with different mask rates during pre-training and list the results in Table 10. In pre-training, we adopt two masked language modeling objectives: random token masking (RTM) and salient span masking (SSM). With smaller mask rates, more contextual information is visible to the model. Therefore, when evaluating the model on the validation set during pre-training, smaller mask rates lead to lower language model perplexity and better entity linking performance. However, larger mask rates ultimately lead to better performance on the downstream task: with larger mask rates, more training signals are applied, so the model is trained more sufficiently. Specifically, in SSM, the model is encouraged to leverage the entity knowledge from the entity memory to predict the masked mention. Therefore, a larger SSM rate leads to more sufficient learning of the entity memory, where the contextual information of masked mentions is integrated into the corresponding entity embeddings.

E.2 Pre-Training Steps
We fine-tune EDMem on TQA using pre-trained checkpoints taken at different numbers of training steps.
As shown in Figure 5, longer pre-training leads to better performance on the downstream task. Although there is no sign of overfitting, as the learning rate gradually decays and the model converges, there is little improvement in model performance after 500K steps.

F Human Evaluation Details
Here we provide the actual questions that we asked Amazon MTurk annotators in the human evaluation, along with their rubrics. For the fluency, relevance, and correctness metrics, annotators are asked to give scores on a 1-3 scale. For informativeness and reasonability, a ranking-based evaluation is applied. This is because in the ELI5 and WoW datasets the human-written answer is not the only possible response to the context, so we do not compare model generations to the ground truth. Instead, we let annotators evaluate the human-written answer alongside the model-generated ones. Since informativeness and reasonability are hard to define with clear rubrics when no ground truth is given, we adopt a ranking-based evaluation: the annotator is asked to rank all sequences (three model generations and the human-written answer) from #1 to #4, with lower ranks indicating better results.
• Fluency: How is the fluency of the machine-generated explanation? (Do not consider its correctness.)
3 - Fluent English
2 - Readable, with grammar errors or typos
1 - Not fluent at all
• Relevance: Does the machine-generated explanation contain the same concepts as the human-written reference? (Synonyms are allowed; do not consider factual correctness.)
3 - Contains the same concepts as the reference
2 - Misses some concepts in the reference, or contains redundant concepts
1 - Does not contain any concept in the reference
• Correctness: Does the machine-generated explanation express similar meanings to the human-written reference? (Paraphrases are allowed.)
3 - Expresses similar meanings to the reference
2 - Expresses partial meanings of the reference
1 - Expresses totally different meanings from the reference
• Informativeness: Rank these answers based on the amount of information they contain. #1: most informative, #4: least informative. Ties are allowed (e.g., 1/1/3/4 or 1/2/2/2). You do not need to consider whether the information is relevant to the question.
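A simple way to aggregate such rankings (ties allowed, lower is better) is to average each system's rank across annotations. This sketch assumes each annotation is a dict mapping a system name to its rank; it is an illustration of the aggregation, not the paper's evaluation script:

```python
from statistics import mean

def mean_ranks(annotations):
    """Average the rank assigned to each system across annotations.
    annotations: list of dicts, each mapping system name -> rank
    (ties allowed, lower is better)."""
    systems = annotations[0].keys()
    return {s: mean(a[s] for a in annotations) for s in systems}
```

For example, two annotations ranking EDMem 1 and 2 yield a mean rank of 1.5 for EDMem.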

G Case Study
We present an example from TQA with a non-entity answer in Table 11. We use our auto-encoder variant EncMem to represent memory-based auto-encoder models. We also provide additional examples on the generation datasets in Tables 12-14.

Figure 1: An overview of the EDMem framework. H denotes the final hidden states of the encoder.

Figure 2: Three decoding methods in downstream tasks.

Figure 3: TQA performance of EDMem and EncMem on different memory sizes.

Figure 4: TQA performance of EDMem on entity and non-entity answers, trained with different memory sizes.

Figure 5: TQA performance of EDMem with different pre-training steps.

Table 1: Exact match scores on open-domain QA datasets. Bold scores and underlined scores are the best and second-best results among closed-book models. (*traditional encoder-decoder models, †memory-based auto-encoder models)

…answers are mainly entities on both datasets, while on NQ, where there are fewer entity answers, the performance of EncMem is similar to BART-Large. …over previous closed-book models, we calculate EM scores on two subsets divided w.r.t. the answer type (i.e., entity answers and non-entity answers).

Table 2: Exact match scores on entity answers ("Ent.") and non-entity answers ("Non-Ent.") in open-domain QA. If an answer can be directly linked to a Wikipedia entity according to Google's SLING (Ringgaard et al., 2017) phrase table, it is counted as an entity answer; otherwise, it is counted as a non-entity answer.

As shown in Table 2, as an entity linking model, EaE cannot predict non-entity answers; as an encoder-decoder model, BART is able to generate a portion of the non-entity answers, but its accuracy on entity answers is much lower than EaE's due to its lack of entity knowledge. EDMem incorporates entity knowledge into an encoder-decoder model, making it competitive on both entity answers and non-entity answers. However, the free-form generation variant is not as accurate on entity answers as EaE, because it may generate answers of any form, while EaE always predicts a valid entity name.
…down the inference time from 10s to 28s. However, such a time cost is much smaller than the gap between closed-book models and the open-book FiD (85 min). In the open-book setting, the model needs to (1) load the pre-computed index from disk to …

Table 3: Inference time on the TriviaQA test set. T_ind is the time for loading the index, which is a fixed cost; T_ret and T_pred denote the time for document retrieval and answer prediction, which increase linearly with the number of test examples. The inference time of EDMem-stat. is the sum of the entity linking step and the constrained generation step.

Table 4: Results on entity-intensive generation datasets. Bold scores are the best results among closed-book models.

Table 7: ROUGE-L scores based on different numbers of mentions in the ground-truth reference.

Table 8: Statistics of open-domain QA datasets.

Table 9: Statistics of generation datasets.
To keep a reasonable density of entities, we filter a subset where the input question and the output response are both no longer than 75 tokens. We also remove examples that have no entity mentions in the output. This results in a total of 85K data examples.
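The filtering described above can be sketched as follows, assuming whitespace tokenization and pre-extracted mention lists as simplifications of the actual pipeline:

```python
def filter_examples(examples, max_len=75):
    """Keep examples whose question and response are both at most
    `max_len` tokens and whose response has at least one entity
    mention. Tokenization here is a whitespace-split stand-in."""
    kept = []
    for ex in examples:
        q_len = len(ex["question"].split())
        r_len = len(ex["response"].split())
        if q_len <= max_len and r_len <= max_len and ex["mentions"]:
            kept.append(ex)
    return kept
```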
ELI5 ELI5 (Fan et al., 2019) is a dataset for generating long-form responses to factual questions.

Table 10: Performance of EDMem with different pre-training mask rates. PPL: perplexity of the masked tokens on the validation set (lower is better); ACC: entity linking accuracy on masked mention spans on the validation set; EM: overall exact match scores; Entity: exact match scores on entity answers.