Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond

Is the output softmax layer, which is adopted by most language models (LMs), always the best way to compute the next word probability? Given the many attention layers in a modern transformer-based LM, are pointer networks redundant nowadays? In this study, we discover that the answer to both questions is no. This is because the softmax bottleneck sometimes prevents the LMs from predicting the desired distribution, and pointer networks can be used to break the bottleneck efficiently. Based on this finding, we propose several softmax alternatives by simplifying pointer networks and accelerating word-by-word rerankers. On GPT-2, our proposals are significantly better and more efficient than mixture of softmax, a state-of-the-art softmax alternative. In summarization experiments, without significantly decreasing its training/testing speed, our best method based on T5-Small improves factCC scores by 2 points on the CNN/DM and XSUM datasets, and improves MAUVE scores by 30% on the BookSum paragraph-level dataset.


Introduction
When recurrent neural networks such as LSTMs (Hochreiter and Schmidhuber, 1997) were the mainstream language model (LM) architecture, pointer networks, or so-called copy mechanisms (Gu et al., 2016), were shown to improve the state-of-the-art LMs for next word prediction (Merity et al., 2017) and summarization (See et al., 2017) by a large margin. However, after the transformer (Vaswani et al., 2017) became the dominant LM architecture, pointer networks are rarely used in state-of-the-art pretrained LMs. One major reason is that the attention mechanism in every transformer layer can learn to copy words from the context, so it seems redundant to add a copy mechanism on top of the transformer.

* indicates equal contribution
† The work was done while the author was at UMass

Figure 1: Illustration of the softmax bottleneck and pointer network using an example from Chang and McCallum (2022), with the input context c_t: "After debating whether to bow to the king or the woman first, the jester decided on the". GPT-2 cannot output both king and woman as the possible next word due to the parallelogram structure in the output word embedding space, while the pointer network could solve this by directly copying words from the context. The standard softmax estimates the probabilities of outputting king and woman by the dot products between the hidden state h_{c_t,V} and their global word embeddings. By contrast, the pointer network computes the dot products between the projected current hidden state h_{c_t,S} and the projected hidden states h_{e,·} of king and woman to estimate their probabilities.
In this paper, we demonstrate that architectures like pointer networks can still substantially improve state-of-the-art transformer LM architectures such as GPT-2 (Radford et al., 2019) and T5 (Raffel et al., 2020), mainly by breaking the bottleneck of their final softmax layer (Yang et al., 2018; Chang and McCallum, 2022).
In Figure 1, we illustrate a simple example from Chang and McCallum (2022) to explain the softmax bottleneck and why pointer networks could alleviate the problem. When predicting the next word, most LMs try to output a hidden state h_{c_t,V} that is close to all the likely next words. For example, when the next word should be either king or woman with similar probabilities, the ideal hidden state is the average of the global output word embeddings of king and woman. However, there might be interfering words (queen and man in this case) between the ideal next word candidates, which force the LM to output the wrong distribution.
To solve this problem, we can let the LM predict the probability of copying the words in the context separately by paying attention to the previous hidden states (Gu et al., 2016); we call this kind of architecture a pointer network in this paper. That is, to estimate the probabilities of copying king and woman from the context, we can compute the dot products with their hidden states h_{e,k} and h_{e,w} rather than with their global output word embeddings. Our experiments show that pointer networks consistently improve the performance of GPT-2 on next word prediction and the quality of summarization from T5 and BART.
Contrary to the mainstream explanation in the previous pointer network literature, we discover that most of the improvements in our experiments do not come from the attention mechanism. To study these improvements, we propose a very simple pointer network variant that does not use any previous hidden states, and we show that the proposed method can achieve similar improvements.
As shown in Figure 2, we simply project the last hidden state into two embeddings. One embedding, h_{c_t,S}, computes the dot product with the context words, and h_{c_t,V} handles the dot products with the other words. Then, GPT-2 can output the hidden state for context words h_{c_t,S} as the average embedding of king and woman without interference from man and queen, which are handled by h_{c_t,V}. We call this method context partition. In addition to the words in the context, we can also use another embedding for the top-k likely next words. This can be viewed as a very simple and efficient alternative to a reranker, so we call it reranker partition.
In our experiments, we show that the context partition performs similarly to pointer networks, while combining a pointer network, context partition, and reranker partition significantly outperforms each individual method. Compared to state-of-the-art solutions for alleviating the softmax bottleneck such as mixture of softmax (Yang et al., 2018; Chang and McCallum, 2022), our proposed method is more efficient while achieving lower perplexity on GPT-2. Furthermore, we find that adding a very expensive word-by-word reranker only improves our method slightly, which suggests the difficulty of further improving the final softmax layer over the proposed alternatives.
In the text completion task using GPT-2, we find that the proposed softmax alternatives reduce hallucination by copying more proper nouns from the context, even though we did not provide any part-of-speech information during training. In summarization, our methods and pointer networks output more specific summaries, increase factuality, and consistently improve nine metrics, especially in the smaller language models. Finally, we show in the limitation section that the softmax bottleneck problem is not completely solved in GPT-3.5.

Main Contributions
• We propose a series of efficient softmax alternatives that unify the ideas of pointer networks, rerankers, multiple embeddings, and vocabulary partitioning.
• We evaluate the proposed softmax alternatives on text completion and summarization tasks using various metrics to identify where our methods improve the most.
• Our experiments indicate that pointer networks and our proposed alternatives can still improve modern transformer-based LMs. By breaking the softmax bottleneck, our methods learn to sometimes copy the context words to reduce generation hallucination and sometimes exclude the context words to reduce repetition. Besides, we find that the softmax bottleneck problem is not completely solved by the huge size of GPT-3.5.

Background
Before introducing our method, we first briefly review the problem we are solving and its state-of-the-art solutions.

Softmax Bottleneck Problem
Most LMs use a softmax layer to compute the final probability of predicting the word x:

P(x | c_t) = softmax_x( Logit(x, c_t) ),   (1)

where c_t is the context. Typically, the logit is Logit(x, c_t) = (h^M_{c_t})^T w_x, where h^M_{c_t} is the Mth-layer hidden state given the input context c_t and w_x is the output word embedding of x.
One problem is that the output word embeddings w_x are global and independent of the context. After pretraining, similar words have similar output word embeddings. However, the similarity structure in the word embedding space might prevent LMs from outputting the desired distribution. The parallelogram structure among the embeddings of king, queen, woman, and man is a simple example. Chang and McCallum (2022) generalize this observation and show that some words in a small subspace would create multi-mode distributions that an LM cannot output using a single hidden state h_{c_t} in the softmax layer.
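The parallelogram example can be made concrete with a small numeric sketch (our own illustration using hypothetical 2-d embeddings, not the paper's code). The four embeddings form a parallelogram (king - man = queen - woman), and the best single hidden state for a bimodal king/woman distribution fails:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical 2-d output embeddings with the parallelogram structure
# king - man = queen - woman.
W = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([0.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 0.0]),
}
words = list(W)

# The ideal single hidden state for "king or woman" is the average of
# their embeddings...
h = (W["king"] + W["woman"]) / 2
p = softmax(np.array([h @ W[w] for w in words]))
probs = dict(zip(words, p))

# ...but the interfering words queen and man sit between them, so the
# softmax ranks both of them above woman.
assert probs["queen"] > probs["woman"]
assert probs["man"] > probs["woman"]
```

No matter how the single hidden state is placed, any state that scores king highly also scores the interfering words above woman, so the desired bimodal distribution is unreachable.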

Mixture of Softmax Method
To overcome the bottleneck, one natural solution is to have multiple hidden states, where each hidden state corresponds to a group of possible words (Yang et al., 2018). For example, we can have one hidden state for king and another hidden state for woman.
One major concern with this mixture of softmax (MoS) approach is the computational overhead. MoS needs to compute the final softmax multiple times and merge the resulting distributions. That is, we need to compute the dot products between every hidden state and all the words in the vocabulary, which is expensive, especially when the vocabulary size is large.
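A minimal NumPy sketch of MoS (dimensions and random weights are placeholders for learned parameters) makes the overhead visible: the K mixture components each require a full pass over the vocabulary, i.e., K * V dot products instead of V:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, V, K = 8, 1000, 4                # hidden size, vocab size, number of mixtures

h = rng.normal(size=d)              # last hidden state
W_out = rng.normal(size=(V, d))     # global output word embeddings
proj = rng.normal(size=(K, d, d))   # projections producing the K hidden states
pi = softmax(rng.normal(size=K))    # mixture weights (also predicted from h)

# MoS: K full softmaxes over the vocabulary, merged by the mixture weights.
p = sum(pi[k] * softmax(W_out @ (proj[k] @ h)) for k in range(K))

assert p.shape == (V,)
assert abs(p.sum() - 1.0) < 1e-6    # a valid distribution over the vocabulary
```

Because each component runs a full vocabulary softmax, the cost grows linearly in K, which is exactly the overhead the partition-based alternatives below avoid.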

Multiple Input State Enhancement
In MoS, the multiple hidden states come from linear projections of the last hidden state. Chang and McCallum (2022) point out that the total degrees of freedom among the multiple hidden states are limited by the dimensionality of the hidden state.
To allow LMs to move multiple hidden states more freely, Chang and McCallum (2022) propose to concatenate the last hidden state h^M_{c_t} with the projection of a block of hidden states so as to increase its dimensionality:

q_{c_t} = Concat( h^M_{c_t}, GELU( L_h( H_{c_t} ) ) ),   (2)

where GELU is the non-linear transformation used in GPT-2, L_h is a linear transformation that allows us to consider more hidden states without significantly increasing the model size, and H_{c_t} is the concatenation of a block of hidden states. We set the block size to 3x3 in our GPT-2 experiments and 1x3 in our summarization experiments (i.e., considering the last 3 hidden states in the last layer, as shown in Figure 3).
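This multiple input state enhancement (Mi) can be sketched in a few lines, assuming a 1x3 block of hidden states and random placeholder weights for the learned L_h:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
d = 8                                # hidden size
block = rng.normal(size=(3, d))      # 1x3 block: last 3 hidden states of the last layer
h_last = block[-1]                   # h^M_{c_t}

L_h = rng.normal(size=(d, 3 * d))    # linear map keeping the added parameters small
q = np.concatenate([h_last, gelu(L_h @ block.reshape(-1))])

# q has twice the dimensionality of a single hidden state, so the
# downstream projections f_{c_t,*} can move more freely.
assert q.shape == (2 * d,)
```

The design choice here is that only a small linear layer is added, yet the concatenated q doubles the degrees of freedom available to the multiple projected hidden states.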

Methods
To break the softmax bottleneck more efficiently than MoS, our overall strategy is simple.
If we can identify a small partition of words that are very likely to become the next word, we can compute the dot products between a hidden state and the embeddings of only these likely words, instead of all the words as in MoS. For example, if we can identify that king and woman are much more likely to appear than queen and man, we can compute the dot product between a hidden state and only the embeddings of king and woman, without interference from other words. Specifically, when we compute the next word probability in Equation 1, the logit of the word x given the context c_t becomes

Logit(x, c_t) = f_{c_t,S}^T e_x   if x ∈ S;   f_{c_t,V}^T w_x   otherwise,   (3)

Figure 3: Architecture of our method for T5/BART that computes Logit_CEPR in Equation 6. In GPT-2, we use the same architecture except that we take the 3x3 input hidden state block rather than the 1x3 block and there are no encoder-related components, which are marked by dotted lines.
where f_{c_t,S} = L_{f_S}(q_{c_t}) and f_{c_t,V} = L_{f_V}(q_{c_t}) are linear projections of the hidden state concatenation q_{c_t} in Equation 2. As shown in Table 1, different softmax alternatives construct the set S differently and use different word embeddings e_x.
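The partitioned logit can be sketched as follows (a simplified illustration; the set S, the projected states, and the embeddings are placeholders). The key efficiency point is that, unlike MoS, only |S| extra dot products are added on top of the ordinary softmax pass over the vocabulary:

```python
import numpy as np

def partitioned_logits(f_S, f_V, S, W_out, E=None):
    """Logit(x) = f_S . e_x for x in the partition S, f_V . w_x otherwise.

    W_out: global output embeddings (V x d).
    E: optional per-word local embeddings for words in S
       (defaults to the global embeddings, as in the context partition).
    """
    logits = W_out @ f_V                 # one pass over the vocabulary, as in softmax
    for x in S:                          # plus only |S| extra dot products
        e_x = W_out[x] if E is None else E[x]
        logits[x] = f_S @ e_x
    return logits

rng = np.random.default_rng(0)
d, V = 8, 100
W_out = rng.normal(size=(V, d))
f_S, f_V = rng.normal(size=d), rng.normal(size=d)

logits = partitioned_logits(f_S, f_V, S={3, 7}, W_out=W_out)
assert logits.shape == (V,)
```

The different softmax alternatives below are instances of this one function: they only change how S is built and where e_x comes from.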
To simplify our explanation, we will focus on the decoder-only LM (i.e., GPT-2) first and then extend our method to encoder-decoder LMs (i.e., T5 and BART).

GPT-2
We explain each softmax alternative individually, along with its connections to previous work such as pointer networks and rerankers.

Pointer Network (P) as Local Word Embedding
Similar to Pointer Sentinel (PS) (Merity et al., 2017), we treat the words in the context differently (S = {x | x ∈ c_t}) and let their word embeddings e_x come from the previous hidden states:

e_{c_t^i} = L_{f_LD}( h^M_{c_t^i} ),   (4)

where c_t^i is the ith input word in the context c_t and L_{f_LD} is a linear layer. As a result, we can use the GPT-2 model not only to predict the hidden states f_{c_t,S} = f_{c_t,PD} = L_{f_PD}(q_{c_t}) and f_{c_t,V} but also to predict the word embeddings e_x of the context words. Unlike the global word embedding w_x, the local word embedding e_x is context-dependent, so the LM can break the softmax bottleneck by adjusting the similarity of words based on the context. For example, GPT-2 could increase the similarity between e_king and e_woman to easily output high probabilities for both words.
We call this version of the pointer network local decoder (LD) embedding. It has some minor differences compared to PS (Merity et al., 2017) and other variants. For example, we merge their logits while PS merges their probabilities, and PS does not normalize when computing e_x. In our experiments, we show that these pointer network variants all yield very similar improvements in modern LMs.
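A rough sketch of the local decoder embedding logits (our simplification, not the paper's code; in particular, summing the embeddings of repeated context words is an assumption here, and the random weights stand in for learned layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5
H = rng.normal(size=(T, d))          # last-layer hidden states of the T context words
context = [11, 42, 7, 42, 19]        # token ids c_t^1..c_t^T (42 appears twice)

L_e = rng.normal(size=(d, d))        # linear layer predicting local embeddings e_x
f_PD = rng.normal(size=d)            # projected hidden state for the pointer partition

# Local, context-dependent embedding for each context word; here we simply
# sum the projections of repeated occurrences (an illustrative choice).
e = {}
for x, h in zip(context, H):
    e[x] = e.get(x, 0) + L_e @ h

# Pointer logits replace the global-embedding logits for words in S = context.
pointer_logits = {x: f_PD @ v for x, v in e.items()}
assert set(pointer_logits) == {11, 42, 7, 19}
```

Because e_x is predicted from the hidden states of this particular context, the model can place e_king and e_woman close together here even though their global embeddings are far apart.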

Context Partition (C)
To understand the source of the improvements from pointer networks, we simplify their architecture by setting the word embedding e_x = w_x, while the partition S remains the set of context words. Although much simpler, an LM with this context partition method can still break the softmax bottleneck by properly coordinating the hidden state dedicated to the context words, f_{c_t,S} = f_{c_t,C} = L_{f_C}(q_{c_t}), and the hidden state for the other words, f_{c_t,V}. Compared to the pointer network, one advantage of the context partition is that the LM can still leverage the learned global word similarity when estimating the probabilities of context words.
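Revisiting the toy parallelogram example (our own illustration with hypothetical 2-d embeddings and hand-picked hidden states), a context partition with two states can produce the bimodal distribution that a single state could not:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Same hypothetical parallelogram embeddings as before.
W = {"king": np.array([1.0, 1.0]), "queen": np.array([0.0, 1.0]),
     "man":  np.array([1.0, 0.0]), "woman": np.array([0.0, 0.0])}
words = list(W)
context = {"king", "woman"}          # S = the words appearing in c_t

# Two separately projected hidden states (hand-picked for illustration):
f_S = np.array([5.0, -5.0])          # scores king and woman equally
f_V = np.array([-5.0, -5.0])         # pushes every other word down

logits = np.array([(f_S if w in context else f_V) @ W[w] for w in words])
p = dict(zip(words, softmax(logits)))

# Both context words now get high, equal probability, and the interfering
# words queen and man are suppressed.
assert abs(p["king"] - p["woman"]) < 1e-9
assert p["king"] > 0.45 and p["queen"] < 0.01
```

Because the context words are scored by their own hidden state, the interfering words between them no longer constrain the achievable distribution.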

Reranker Partition (R)
In some cases, the possible next words might not be mentioned in the context. For example, given the context My favorite actor is Ryan [MASK], the next word could be Reynolds, Gosling, or the last name of another Ryan. Hence, using only the context partition does not completely solve the multimodal distribution problem.
Inspired by the idea of the reranker, we set S to be the top k words with the highest logits f_{c_t,V}^T w_x. In practice, finding an ideal k could be difficult. When k is small, the reranker partition might not include the very likely next words. When k is large, the reranker partition might not be able to separate the output candidates from the interfering words. To alleviate the problem, we can have multiple reranker partitions and use different hidden state embeddings (e.g., f_{c_t,R1} and f_{c_t,R2}) for the different partitions.
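A sketch of a single reranker partition (random placeholder weights; `np.argpartition` plays the role of the cheap top-k selection):

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, k = 8, 100, 20
W_out = rng.normal(size=(V, d))      # global output word embeddings
f_V = rng.normal(size=d)             # hidden state for the full vocabulary
f_R = rng.normal(size=d)             # hidden state dedicated to the top-k partition

base = W_out @ f_V                   # the cheap "first pass" over all words
topk = np.argpartition(base, -k)[-k:]  # S = top-k words under f_V

logits = base.copy()
logits[topk] = W_out[topk] @ f_R     # re-score only the k candidates

mask = np.ones(V, dtype=bool)
mask[topk] = False
assert np.allclose(logits[mask], base[mask])  # words outside S keep their f_V logits
```

Unlike a word-by-word reranker that re-runs the LM per candidate, this re-scores the k candidates with k extra dot products, which is why the paper describes it as a very cheap reranker alternative.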

Hybrid Approach (CPR)
Local embeddings in the pointer networks and global embeddings in the context partition are complementary: local embeddings are representationally powerful, while global embeddings can leverage the global similarity of words. Hence, we can combine the two methods by summing their dot products.
For the methods that use different S, we can simply determine an order of computing the dot products and let the later dot products overwrite the existing values. In our experiments, we always use the order illustrated in Figure 3. That is, we compute the logits Logit_CPR(x, c_t) by

Logit_CPR(x, c_t) = f_{c_t,C}^T w_x + f_{c_t,PD}^T e_x   if x ∈ c_t;
                    f_{c_t,R1}^T w_x   else if x ∈ W(k_1);
                    f_{c_t,R2}^T w_x   else if x ∈ W(k_2);
                    f_{c_t,V}^T w_x    otherwise,   (5)

where W(k_2) is the set of top k_2 words with the highest f_{c_t,V}^T w_x and W(k_1) is the set of top k_1 words with the highest max(f_{c_t,V}^T w_x, f_{c_t,R2}^T w_x).
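The overwrite order for the combined CPR logits can be sketched as a sequence of in-place updates (our paraphrase with random placeholder weights and hypothetical token ids):

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, k1, k2 = 8, 100, 20, 40
W_out = rng.normal(size=(V, d))
f_V, f_R1, f_R2, f_C = (rng.normal(size=d) for _ in range(4))
context_ids = [3, 17, 17, 55]                       # hypothetical decoder context tokens
E_local = {x: rng.normal(size=d) for x in set(context_ids)}  # local decoder embeddings
f_PD = rng.normal(size=d)

# Later partitions overwrite earlier ones:
logits = W_out @ f_V                                # 1) all words under f_V
Wk2 = np.argpartition(logits, -k2)[-k2:]            # 2) top-k2 under f_V
logits[Wk2] = W_out[Wk2] @ f_R2
Wk1 = np.argpartition(np.maximum(W_out @ f_V, W_out @ f_R2), -k1)[-k1:]
logits[Wk1] = W_out[Wk1] @ f_R1                     # 3) top-k1 under max(f_V, f_R2)
for x in set(context_ids):                          # 4) context: global + local dot products
    logits[x] = f_C @ W_out[x] + f_PD @ E_local[x]

assert logits.shape == (V,)
```

The context words are written last, so copying (or excluding) them always takes precedence over the reranker partitions.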

T5 and BART
In encoder-decoder architectures, our local decoder embedding, context partition, and reranker partitions are still applicable. Besides, we can leverage the words in the encoder input to further improve the performance.

Encoder Partition (E) and Local Encoder Embedding (P)

Similar to the context partition, the encoder partition handles the words in the encoder input I differently by setting S = {x | x ∈ I} and using the global word embeddings e_x = w_x.
As in Equation 4, we can also let the hidden states in the last layer pass through another linear layer, L_{f_LE}, to predict the embeddings of the words in the encoder input. We call this method local encoder (LE) embedding.

Hybrid Approach (CEPR)
Similar to GPT-2, we combine the local encoder embedding and encoder partition to compute the probabilities of the words that are in the encoder context but not in the decoder context. As shown in Figure 3, we compute Logit_CEPR(x, c_t) by

Logit_CEPR(x, c_t) = f_{c_t,C}^T w_x + f_{c_t,PD}^T e_x   if x ∈ c_t;
                     f_{c_t,E}^T w_x + f_{c_t,PE}^T e_x   else if x ∈ I;
                     f_{c_t,R1}^T w_x   else if x ∈ W(k_1);
                     f_{c_t,V}^T w_x    otherwise,   (6)

which is the same as Equation 5 except that we add the encoder partition and local encoder embedding and remove the second reranker partition.

Experiments
The pointer network was a popular technique in language modeling (Merity et al., 2017) and summarization (See et al., 2017). Thus, we also focus on these two fundamental applications.

Perplexity Comparison
In Table 2, we first compare the predicted next word distributions using test-set perplexity, a standard metric in LM architecture studies. Our methods achieve lower perplexity and faster inference speed than the mixture of softmax (MoS) (Yang et al., 2018; Chang and McCallum, 2022). The inference speed is measured with our pure PyTorch implementation, which we believe could be further accelerated by implementing some new PyTorch operations in CUDA code.
If only one method is used, the context partition (Softmax + C + Mi) is better than the reranker partitions (Softmax + R:20,100 + Mi) while performing similarly to local decoder word embedding (Softmax + P + Mi), Pointer Generator (PG + Mi) (See et al., 2017), and Pointer Sentinel (PS + Mi) (Merity et al., 2017). Their similar performance indicates that the improvement of pointer networks comes from breaking the softmax bottleneck. The significantly better performance of PS + Mi compared to PS further supports this finding.
To know how well our method breaks the softmax bottleneck, we implement a word-by-word reranker model on GPT-2, which appends the most likely 100 words to the context when predicting each next word (see Appendix C.3 for more details). Table 3 shows that this expensive reranker improves our method only slightly, which suggests the challenge of further improving LMs by breaking the softmax bottleneck.

Table 4: ROUGE-1 F1 (%) of different methods on GPT-2. We compare the scores between the generated text and the reference (i.e., continuation), and between the generation and the context. More methods and metrics are reported in Table 8.

Generated Text Comparison
Next, we would like to understand how the distribution improvement affects text generation. We sample some contexts from the test set of Wikipedia 2021 and compare the quality of the text generated by the different models given these contexts. The quality is measured by the ROUGE-1 F1 score between the generated text and the actual continuation. To know how much the different models copy from the context, we also report the ROUGE-1 scores between the generation and the context.
The results in Table 4 show that the different methods have very similar overall ROUGE-1 scores. Nevertheless, compared to Softmax + Mi, Softmax + CPR:20,100 + Mi is 21% more likely to copy proper nouns (i.e., entity names) from the context and 9% more likely to generate proper nouns that appear in the actual continuation. This suggests that our method could alleviate the common entity incoherence problem in generated text (Shuster et al., 2022; Papalampidi et al., 2022; Zhang et al., 2022; Guan et al., 2022; Goyal et al., 2022b). In Table 8, we compare the methods using more metrics to further support this conclusion.

Qualitative Analysis
In Table 5, we visualize some distributions to explain our improvements. The softmax layer of GPT-2 is unable to properly learn to copy or exclude words from the input context. For example, Softmax + Mi and MoS + Mi might output "There are plates, keys, scissors, toys, and balloons in front of me, and I pick up the phone", which causes a hallucination problem, while Softmax + CPR:20,100 + Mi and Pointer Sentinel (PS) + Mi can output the mentioned options with similar probabilities by copying the words in the context. In addition, GPT-2, MoS, and PS + Mi are very likely to output "I like tennis, baseball, golf, basketball, and tennis". This repetition problem happens because the next word should be a word similar to the listed sports names, except for the sports that have already been mentioned, and the softmax layer has difficulty outputting such a donut-shaped next word distribution in the embedding space. In contrast, Softmax + CPR:20,100 + Mi can learn to exclude the listed sports by putting very negative logits on the context words, which yields the desired donut-shaped distribution.
In the main paper, we evaluate the quality of summaries using four metrics. ROUGE-1 F1 (Lin, 2004) measures the unigram overlap between the generated summary and the ground truth summary; CIDEr (Vedantam et al., 2015) adds tf-idf weighting to the n-gram overlap score to emphasize correct prediction of rare phrases; factCC (Kryscinski et al., 2020) evaluates the factuality of the summary; MAUVE (Pillutla et al., 2021) compares the word distributions of the summary and the ground truth in a quantized embedding space. To further support our conclusions, we also compare the quality measured by several other metrics, along with the model sizes, in Table 9 and Table 10.
The results are reported in Table 6. Similar to the GPT-2 experiments, the results generally improve as we combine more partitions and local embedding approaches. This demonstrates that we can directly fine-tune the LMs with our softmax alternatives without expensive pretraining.
Unlike the GPT-2 experiments, the multiple input hidden state enhancement (Mi) is not very effective here, so we mainly compare the methods without Mi (i.e., q_{c_t} = h^M_{c_t}, unlike Equation 2). We hypothesize that one possible reason is that we have not pretrained T5 and BART with our softmax alternatives.
Our improvements are larger in smaller models. This is probably because, in a smaller word embedding space, interfering words are more likely to lie between the desired next word possibilities. Compared to our methods, the pointer networks perform well on BART-base but usually perform worse on the other LMs. We leave the investigation of the reasons to future work.
Compared to the ROUGE-1 score, the improvement percentage of CIDEr is overall higher. One major problem of summarization LMs is that the generated summaries contain too many commonly used phrases (King et al., 2022), and our considerably higher CIDEr scores indicate an alleviation of this problem. Our improvement on factCC is also significant (Cao and Wang, 2021). Finally, our MAUVE improvement percentage on the BookSum Paragraph dataset reaches around 30% with T5-Small. We hypothesize this is because news articles often mention global entity names (e.g., Obama), while the meanings of names in stories (e.g., John) are often defined by the context.

Related Work
Repetition and hallucination are two common problems in language generation tasks. One common solution for repetition is to avoid outputting the words in the context, which is often called unlikelihood training (Welleck et al., 2020; Jiang et al., 2022b; Su et al., 2022). However, when the LM should mention some names in the context, this might exacerbate the hallucination problem. In contrast, our method can learn to copy and exclude the words in the context, as shown in Table 5.
Our analyses demonstrate that parts of the hallucination and repetition problems come from the softmax bottleneck. The findings provide an explanation for the effectiveness of prior studies such as the above reranker approaches and pointer networks (Li et al., 2021; Zhong et al., 2022; Ma et al., 2023). Another example is encouraging the word embeddings to be isotropic (Wang et al., 2020; Su et al., 2022). Their improvement might also come from reducing the linear dependency of the candidate word embeddings. Nevertheless, the side effect of breaking the similarity structure in the word embedding space might hurt the generation quality in some cases. Concurrently with our work, Wan et al. (2023) also use the softmax bottleneck theory (Chang and McCallum, 2022) to explain the improvement of a pointer network. Their empirical results also support our conclusion that the softmax bottleneck is a major cause of the factuality problems of LMs.
Our work is motivated and inspired by Chang and McCallum (2022). They also propose to use different hidden states for different vocabulary partitions, but their partitioning is global and needs to be combined with the mixture of softmax (MoS) approach, which adds a significant overhead compared to the standard softmax layer. Our dynamic partitioning methods not only perform better but also greatly reduce the overhead by removing the reliance on MoS.

Conclusion
Since the transformer became the mainstream encoder and decoder for LMs, the output softmax layer has seemed to be the only reasonable option for computing the word probability distribution. Although simple and efficient, the softmax layer is inherently limited, while the existing solutions are relatively slow (Chang and McCallum, 2022). This work proposes a series of softmax alternatives that can improve text generation models without significantly increasing the computational costs. Our experiments suggest that the main improvement of the pointer network on top of a transformer comes from breaking the softmax bottleneck. Our results also indicate that the alternatives could alleviate some problems of hallucination, repetition, and overly generic generation. Furthermore, all of the proposed alternatives can be applied to LMs that have already been pretrained with softmax, without requiring retraining from scratch. For practitioners, we recommend using all the partitioning methods together to get the best performance, or using only the simple context partition to keep the architecture simple while getting the majority of the gain.

Acknowledgement
We thank Nadar Akoury and the anonymous reviewers for their constructive feedback. This work was supported in part by the Center for Data Science and the Center for Intelligent Information Retrieval, in part by the Chan Zuckerberg Initiative under the project Scientific Knowledge Base Construction, in part by IBM Research AI through the AI Horizons Network, in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative, and in part by the National Science Foundation (NSF) grant numbers IIS-1922090 and IIS-1763618. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

Limitations
In our experiments, we find that the improvements of our methods tend to be larger in relatively smaller language models. Due to our limited access to computational resources, we are not able to try our methods on larger LMs. To know whether a larger LM still suffers from the softmax bottleneck problem, we input the examples we used in Table 5 to GPT-3.5 and report the results in Figure 4.
We find that although GPT-3.5 greatly reduces the chance of hallucination compared to GPT-2, the next word distribution is still not ideal. For example, in Figure 4a, although the incorrect answer queen receives only a small probability, GPT-3.5 puts around 67% probability on woman. Similarly, even though GPT-3.5 is unlikely to hallucinate the sentence There are plates, keys, scissors, toys, and balloons in front of me, and I pick up the phone as GPT-2 does, Figure 4b and Figure 4d show that the output distribution is still heavily biased toward one of the options, and the most likely next word could change if the order of the options in the context changes. These results suggest that increasing model size indeed alleviates the softmax bottleneck problem, but the problem is not completely solved even with a huge hidden state size (12k) and model size (175B) (Brown et al., 2020). We expect that adding our methods to large LMs could rectify the biased distributions, as shown in our experiments on smaller LMs (Table 5). Therefore, although improving smaller LMs already has wide applications in practice, trying our methods on a larger LM is a promising next step, which we have not been able to do.
The current implementation of our methods also has some room for improvement. Our code currently contains some unnecessary computation to circumvent restrictions of the PyTorch library, so we should be able to further accelerate it by writing CUDA code. Furthermore, our code does not yet support the pretraining of BART or T5. We expect that completing this future work could make our method faster and better.
Since the focus of this paper is improving the architecture of the general transformer decoder, our evaluation of each application is not as comprehensive as studies for a particular application. For example, although we test our methods using many metrics and the metrics show a consistent trend, there are many other factuality metrics we have not tried (Li et al., 2022). We also have not conducted a human evaluation to further verify our conclusions because conducting human evaluation properly is challenging (Karpinska et al., 2021) and time-consuming. In addition, if we include more words in a context partition, the performance might be better at the cost of extra computational overhead. We leave the analysis of this tradeoff as future work.

Ethics Statement
In our experiments, we find that our methods usually copy more words from the context or encoder input. This tendency might have some potential issues. For example, our improvements might be smaller for languages with richer morphology. Furthermore, in some summarization applications, increasing the factuality by increasing the extractiveness might not be ideal (Ladhak et al., 2022; Goyal et al., 2022a).
As described in Section 2.1, one major limitation of the popular softmax layer is its global word embeddings. The problem becomes more serious when there are more tokens whose meanings are locally defined (e.g., names in the BookSum dataset). Our methods would be more useful in those circumstances and might alleviate some of the biases described in Shwartz et al. (2020) and Ladhak et al. (2023). Moreover, the meanings of tokens are also locally defined in many other applications, such as variables in code or math problems, new terminology in a scientific paper, or the products in a sequential recommendation problem. We believe that our methods could become an efficient alternative to rerankers (Cobbe et al., 2021; Welleck et al., 2022) and create impact in those areas.
Finally, our results show that when there is some uncertainty about the next word (e.g., it could be king or woman), existing LMs could have difficulty copying the words from the context, and our methods alleviate this problem. Thus, our methods should also be able to improve lexically controllable language generation models that put the desired keywords into the context, such as Goldfarb-Tarrant et al. (2019) and Lu et al. (2021).

A Appendix Overview
In the appendix, we first analyze our methods using more metrics in Appendix B and describe what we learn from the results. Next, we provide some details of our methods and baselines in Appendix C. Finally, we specify some experimental setups and hyperparameters in Appendix D.

B More Results and Analysis
In this section, we report more results and provide more detailed analyses to investigate the advantages of the different methods.

B.1 GPT-2 Experiments
Kaplan et al. (2020) and Henighan et al. (2020) demonstrate that the loss decreases linearly as the log of the model size increases. Therefore, a new architecture needs to perform better than the old architecture at a similar model size to verify that the improvement does not come from memorizing more information through the extra parameters. From the loss versus log(model size) curves in Figure 5, we can see that our proposed methods are significantly better than MoS and slightly better than a pointer network baseline as the model becomes larger.
We use the following metrics to measure the text generated by GPT-2.
• ROUGE-1 Context (R1C): The unigram prediction F1 computed against the context.
• ROUGE-1 Proper (R1P): The same as ROUGE-1 except that only proper nouns are considered. We measure this metric because the correctness of entity name prediction is critical to the factuality of the generation.
• ROUGE-1 Proper Context (R1PC): The same as ROUGE-1 Context (R1C) except that only proper nouns are considered.
• Proper Noun Ratio (P Ratio): The average number of proper nouns in the generation divided by the average number of proper nouns in the actual continuation. LMs usually generate fewer proper nouns than the actual continuation (See et al., 2019), so the values are usually lower than 1. A P Ratio closer to 1 is better.
• CIDEr (Vedantam et al., 2015): A metric for measuring the quality and specificity of the generation.
• NIST (Doddington, 2002): Similar to CIDEr, except that CIDEr uses tf-idf to weigh the n-grams while NIST measures the information gain.
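To make the overlap-based metrics above concrete, the following is a minimal pure-Python sketch of an R1-style F1 and the P Ratio. Tokenization and proper-noun tagging are assumed to happen beforehand (e.g., with a POS tagger), and the function names are ours, not from our released code.

```python
from collections import Counter

def unigram_f1(prediction_tokens, reference_tokens):
    # ROUGE-1 style F1: unigram overlap between two token lists.
    overlap = sum((Counter(prediction_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

def proper_noun_ratio(proper_counts_generated, proper_counts_reference):
    # Average number of proper nouns in the generations divided by the
    # average in the actual continuations; closer to 1 is better.
    avg_gen = sum(proper_counts_generated) / len(proper_counts_generated)
    avg_ref = sum(proper_counts_reference) / len(proper_counts_reference)
    return avg_gen / avg_ref
```

For R1C, the reference tokens are simply the context tokens; for R1P and R1PC, both token lists are first filtered down to proper nouns.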
The results are reported in Table 8. In terms of R1, R2, CIDEr, and NIST, our proposed methods such as Softmax + C + Mi and Softmax + CPR:20,100 + Mi are significantly better than the pointer network baselines PS + Mi and PG + Mi. Compared with Softmax + CPR:20,100 + Mi, PS + Mi has a significantly higher P Ratio and R1PC but a similar R1P. This indicates that PS + Mi copies more proper nouns from the context while a similar number of its proper nouns appear in the actual continuation, so Softmax + CPR:20,100 + Mi actually has a higher accuracy on proper noun prediction.
In text corpora such as Wikipedia, we do not know the ground-truth next word distribution or which contexts lead to multiple probable next words, so we cannot quantitatively analyze the improvement on ambiguous contexts. To alleviate this concern, we test our methods on the synthetic dataset constructed by Chang and McCallum (2022). The dataset is built using templates and the Google analogy dataset (Mikolov et al., 2013), so we know the ground-truth next word distribution. The dataset consists of ambiguous contexts such as I went to Paris and Germany before, and I love one of the places more, which is, where the next word is either one of the diagonal words of the parallelogram, such as Paris and Germany, or one of the edge words, such as Paris and France. For the details of the experimental setup, please refer to Chang and McCallum (2022).
In Table 7, we can see that Softmax + CPR:20,100 + Mi achieves the lowest perplexity in all subsets and outperforms the Softmax + Mi baseline by a large margin, especially in the diagonal subset, where the ground-truth word embedding distribution has multiple modes. Notice that the performance of MoS + Mi is worse than what is reported in Chang and McCallum (2022), probably because we share the input and output word embeddings.

B.2 Summarization
Compared to Figure 5, Figure 6 shows that our methods improve the loss of T5 on CNN/DM more than that of GPT-2 on Wikipedia.
In Table 9 and Table 10, we compare the different summarization models by their model size, evaluation losses, inference time, and the other metrics we use in subsection B.1. The pointer network baselines and our methods significantly improve most metrics over the softmax baseline, which is used ubiquitously in nearly all LMs. Although our method generally improves less on the T5-Base model, the percentages of additional parameters and inference time overhead are also much smaller.

C.1 Proposed Methods
To allow us to start from existing LMs that are pretrained using softmax, we keep the modified softmax layer initially working almost the same as the original softmax layer. We initialize the linear transformation weights of L_{f_PD}(), L_{f_LD}(), L_{f_PE}(), and L_{f_LE}() as 10^{-10} · I. The other linear weights L_{f_.}() are initialized as the identity matrix I.
In the local decoder embedding method Softmax + P + Mi, this initialization would give a logit of 0 to all context words. To solve the issue, we revise Equation 3 slightly when computing Logit_P(x, c_t): we initially rely on the original softmax layer to compute all the logits and let the term f^T_{ct,PD} f_{x,ct,LD} gradually influence the logits of the context words.
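A minimal sketch of this initialization scheme, using plain Python lists as stand-ins for the weight matrices (the helper names are ours):

```python
EPS = 1e-10  # near-zero scale for the pointer-related projections

def near_zero_identity(dim, eps=EPS):
    # L_{f_PD}, L_{f_LD}, L_{f_PE}, L_{f_LE}: eps * I, so the pointer
    # terms contribute (almost) nothing at the start of fine-tuning.
    return [[eps if i == j else 0.0 for j in range(dim)] for i in range(dim)]

def exact_identity(dim):
    # All other projections start as the identity matrix I: the
    # pretrained hidden state initially passes through unchanged,
    # so the modified layer behaves like the original softmax layer.
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
```

In a PyTorch implementation, the same effect is achieved by filling the `nn.Linear` weight with `eps * torch.eye(dim)` or `torch.eye(dim)` before training.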
In MoS + CPR:20,100 + Mi, our proposed method only revises the logits in one of the softmaxes.

C.2 Pointer Network Baselines
The pointer networks were originally designed for RNNs, so we are unable to use exactly the same formulas proposed in the papers. Nevertheless, we try our best to adapt the pointer networks for the transformer encoder while keeping the gist of the formulas. In all methods, to make the results more comparable to our methods, we use f_{ct,PE} and L_{f_LE} to determine the probability of copying the words from the context, and use f^T_{ct,V} w_x to determine the probability of generating all the words in the vocabulary.
In CopyNet (Gu et al., 2016), we compute the probability of outputting the word x by adapting its copy formulation. Notice that CopyNet needs to sum up the exponentials of dot products, which often causes overflow problems in GPT-2. We can set b to a large negative value initially to solve the problem, but the resulting perplexity is much worse than those of the other two pointer network variants. Thus, we choose to skip CopyNet in the GPT-2 experiments.
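The overflow arises from computing sums like Σ_j exp(d_j) over raw dot products directly. A standard alternative to the negative-b workaround (shown here for illustration, not as the method we used) is the max-shift log-sum-exp trick:

```python
import math

def stable_log_sum_exp(dot_products):
    # log(sum(exp(d))) computed with the max-shift trick: subtracting
    # the maximum keeps every exponent <= 0, so no term can overflow.
    m = max(dot_products)
    return m + math.log(sum(math.exp(d - m) for d in dot_products))
```

With dot products around 1000, the naive `sum(math.exp(d) for d in dots)` overflows, while the shifted version returns a finite log-normalizer.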
In Pointer Generator (See et al., 2017), we compute the probability of x as a mixture of the generation distribution and the copy distribution weighted by p_gen. We skip the coverage mechanism in the pointer generator paper to make it more comparable to other methods. In T5 experiments, its training loss is sometimes very large, so we initially set b_ptr to 3 to keep p_gen close to 1 (i.e., turn the pointer part off initially). In other experiments, we set b_ptr = 0.
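As an illustration of the mixture, here is a simplified sketch with explicit probability dictionaries; the real model computes p_gen from hidden states, and the names here are ours:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pointer_generator_mix(p_vocab, p_copy, gen_logit, b_ptr=3.0):
    # p(x) = p_gen * p_vocab(x) + (1 - p_gen) * p_copy(x), where
    # p_gen = sigmoid(gen_logit + b_ptr). With b_ptr = 3, p_gen starts
    # close to 1, i.e., the pointer part is almost off initially.
    p_gen = sigmoid(gen_logit + b_ptr)
    return {x: p_gen * p_vocab.get(x, 0.0) + (1 - p_gen) * p_copy.get(x, 0.0)
            for x in set(p_vocab) | set(p_copy)}
```

Setting `b_ptr=0.0` recovers the unbiased gate used in the non-T5 experiments.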
In Pointer Sentinel (Merity et al., 2017), the probability of x is computed using the sentinel gating. In our experiments, we find that the pointer network variants usually have similar performance (except that PG sometimes performs much worse in summarization due to some training stability issues). This suggests that the differences among the pointer network variants often do not influence the performance significantly, which justifies our simplification of the formulas in the original papers and supports our conclusion that the improvement comes from breaking the softmax bottleneck.
Notice that in the above pointer network variants, the pointer part can only increase the probability of the context words over that given by the generator part. As a result, it cannot alleviate the repetition problem in the last example of Table 5.

C.3 Word-by-word Reranker Baseline
We illustrate our word-by-word reranker (wbwR) in Figure 7. The method has two stages. In the first stage, we compute the logits using the projected hidden state f_{ct,V} and retrieve the top k words. At the second stage, we append the top k words to the input context along with the hidden state f_{ct,R} for reranking the context words. We use the same positional embeddings for all candidates to encourage the model to change the ranking of the words. Next, we use the hidden states corresponding to the candidates to compute their local word embeddings as f_{x,ct,LD}. Finally, we re-estimate the probabilities of the top k words using these local word embeddings. To improve the quality of our top k candidates, the final loss is the sum of the wbwR loss at the second stage and the loss of the original softmax layer that only uses the logits from f^T_{ct,V} w_x at the first stage. When we combine the wbwR with Softmax + CPR:20,100 + Mi, we simply use Softmax + CPR:20,100 + Mi at the first stage and use the wbwR to overwrite its logits at the second stage.
Using this method, we can update the embeddings of the words that are not in the context and allow the candidates to interact with the input context to determine their probabilities, as in a classic two-stage reranker, while keeping the model size roughly unchanged. Nevertheless, the method can only change the probabilities of the top k words, and its computational overhead and memory requirement prevent us from using a very large k. Unlike the standard GPT-2, we cannot get the probabilities of all positions in one forward pass because the input contexts differ when computing the probability at each position, and the input of the second-stage reranker depends on the results of the previous forward pass at the first stage. To speed up, we reuse the computed hidden states and batchify the forward passes.
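The overall control flow of the wbwR can be sketched as follows, abstracting the second-stage forward pass into a `rerank_fn` that returns new logits for the candidates (the names are ours):

```python
def word_by_word_rerank(first_stage_logits, rerank_fn, k):
    # Two-stage sketch: keep the first-stage logits for all words, but
    # let the second-stage reranker overwrite the logits of only the
    # top-k candidates; all other words keep their original logits.
    logits = dict(first_stage_logits)
    top_k = sorted(logits, key=logits.get, reverse=True)[:k]
    new_logits = rerank_fn(top_k)  # e.g., from hidden states of appended candidates
    for w in top_k:
        logits[w] = new_logits[w]
    return logits
```

This makes explicit why the method cannot touch the probabilities of words outside the top k.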

In our implementation, we first get the top k candidates corresponding to all tokens in stage 1 (the original GPT-2) as the input of the stage-2 reranker. To avoid recalculating the hidden states of the context at stage 2, we store the hidden states using the past-key-value feature in Hugging Face and only compute the hidden states corresponding to the top k candidate tokens at stage 2.
We divide the computation of the whole input sequence into several blocks, as shown in Figure 8. In each block, we input a batch containing the last few tokens and the top k candidates into GPT-2 while reusing the hidden states of their common contexts from stage 1. In this way, we can increase parallelism by increasing the block size if the GPU memory allows it.
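A schematic of how the stage-2 batches are built block by block (a data-layout sketch only; the actual implementation feeds these batches to GPT-2 and reuses the cached prefix via Hugging Face past-key-values, and the field names are ours):

```python
def second_stage_batches(tokens, top_k_candidates, block_size):
    # For each block of positions, the stage-2 batch contains the block's
    # tokens plus the top-k candidate tokens of every position in the
    # block. The hidden states of the common prefix (positions before the
    # block start) are assumed cached from stage 1 and are not recomputed.
    batches = []
    for start in range(0, len(tokens), block_size):
        block = range(start, min(start + block_size, len(tokens)))
        batches.append({
            "prefix_len": start,  # reuse cached hidden states up to here
            "block_tokens": [tokens[i] for i in block],
            "candidates": [top_k_candidates[i] for i in block],
        })
    return batches
```

Larger `block_size` packs more positions into one forward pass, trading GPU memory for parallelism.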
Even though we spent substantial effort on optimizing the wbwR, the method is still too slow to be practically useful. Even when we use four RTX 8000 GPUs (a faster GPU with larger memory), our wbwR implementation is still around 10 times slower than our proposed Softmax + CPR:20,100 + Mi running on only one RTX 2080.

D Experiment Details
For reproducibility, we provide some experimental configurations in this section.

D.1 GPT-2 Experimental Details
We mostly follow the experimental setup of Chang and McCallum (2022) except that we share the input and output word embeddings as in the standard GPT-2 models. As in Chang and McCallum (2022), we use the last 2% of the corpus as the test set and the 2% before that as the validation set. For the factCC metric, we use the author-provided checkpoint to evaluate CNN/DM results since factCC is originally trained on CNN/DM. For the other three summarization datasets, we follow the authors' code to construct positive and negative data and continue training the CNN/DM factCC model on each dataset for one epoch. Then we evaluate the different summarization tasks with the corresponding factCC checkpoints.
For GPT-2 Medium, T5-Base, and BART Large, we use NVIDIA RTX 8000 GPUs to train the models, and for the other smaller models, we use NVIDIA GeForce RTX 2080 GPUs. Most of the experiments can be done within one week. In all the inference time experiments, we use an NVIDIA GeForce GTX TITAN X with batch size 4 for GPT-2 and batch size 8 for BART and T5.

Figure 2 :
Figure 2: We simplify the pointer network / reranker by using another embedding h_{ct,S} for the words in the context / the top-k likely words.
(a) An example where the next word should be either woman or king (or a synonym such as former or latter). (b) An example where the next words plates, keys, scissors, toys, and balloons should receive similar probabilities. (c) An example where the next words John, Alex, Mary, Kathryn, and Jack should receive similar probabilities. (d) Same as (c) except that the order of the objects in the context is different.

Figure 5 :
Figure 5: Model size versus model loss on the Wikipedia test data after training for 0.4 epochs. The left-side points are the results from GPT-2 Small and the right-side points come from GPT-2 Medium. Lower curves are better.

Figure 6 :
Figure 6: Model size versus model loss on the CNN/DM test set. The left-side points are the results from T5-Small and the right-side points come from T5-Base. Lower curves are better.

Figure 8 :
Figure 8: Our efficient implementation of the word-by-word reranker. T_i denotes the tokens and k_{i.j} the top-k candidate tokens for T_i.

Table 2 :
Comparison of different methods on top of GPT-2. Wiki and OWT refer to the testing perplexity of Wikipedia 2021 and OpenWebText, respectively. Lower perplexity is better. Time is the inference time of a batch; Mi is the multiple input hidden state enhancement, which is proposed to break the softmax bottleneck more effectively (please see the details in Section 2.3 and Chang and McCallum (2022)); C is the context partition; R:20,100 is the reranker partition with k_1 = 20 and k_2 = 100; P is the pointer network (i.e., local decoder embedding). Please see Equation 5 for the details of CPR. As we can see, Softmax + CPR:20,100 + Mi, which combines all the efficient approaches (i.e., context partition, reranker partition, and local decoder embedding), results in better performance. The best scores are highlighted.

Table 3 :
Comparison between our method and the word-by-word reranker for the most likely 100 words (wbwR:100). The numbers are the validation perplexities on Wikipedia 2021 after training for 0.15 epochs.

Table 5 :
Prediction visualization of three input contexts. We show the top five words with the highest prediction probabilities for each model. The reasonable next word predictions are boldfaced.

Table 6 :
The performance on the test sets of four summarization datasets. R1 is ROUGE-1 F1 (%). E refers to the encoder partition; C is the context partition; R:20 is the reranker partition with k_1 = 20; the P in CEPR means using the pointer networks for both the encoder (LE) and the decoder (LD); Mi is the multiple input hidden state enhancement; PS means Pointer Sentinel and PG means Pointer Generator. CEPR is described in Equation 6. The model size, inference time, and more metrics are reported in Table 9 and Table 10.

Table 8 :
Comparison of the continuations generated by GPT-2 Small on the Wikipedia test data. Table 4 is a short summary of this table. The meanings of the metrics are described in Appendix B.1. Higher R1C and R1PC mean copying more words from the context. A higher P Ratio means generating more proper nouns. All ROUGE scores are percentages.

Table 10 :
Comparison of the summaries generated by different models on the test sets of the BookSum and SAMSum datasets. We also report the inference time for one sample. The meanings of the metrics are described in Appendix B.1. R2 (ROUGE-2 F1) scores are percentages. Within each section, we highlight the smallest loss, the P Ratio that is closest to 1, and the highest numbers in the other metrics.