Enriching and Controlling Global Semantics for Text Summarization

Recently, Transformer-based models have proven effective for abstractive summarization, producing fluent and informative summaries. Nevertheless, these models still suffer from the short-range dependency problem, causing them to produce summaries that miss the key points of the document. In this paper, we attempt to address this issue by introducing a neural topic model empowered with normalizing flow to capture the global semantics of the document, which are then integrated into the summarization model. In addition, to avoid the global semantics overwhelming the contextualized representation, we introduce a mechanism to control the amount of global semantics supplied to the text generation module. Our method outperforms state-of-the-art summarization models on five common text summarization datasets, namely CNN/DailyMail, XSum, Reddit TIFU, arXiv, and PubMed.


Introduction
Automatic text summarization involves both text understanding and text generation. In general, there are two main approaches to this task. Extractive systems (Liu, 2019; Narayan et al., 2020; Jia et al., 2020) highlight salient words or sentences in the source text and form the final summary by concatenating them. Abstractive methods (See et al., 2017; Zou et al., 2020), on the other hand, switch among generating new words, choosing phrases from the source document, and rephrasing them. Abstractive summarization, the focus of this paper, is usually more advanced and closer to human-like interpretation.
Recently, abstractive summarization studies (Lewis et al., 2019; Chen and Yang, 2020) are dominated by the Transformer-based architecture (Vaswani et al., 2017). Despite good performance on large-scale datasets, Transformer-based summarization models have been shown to favor encoding short-range dependencies: whenever a word from the input is generated in the summary, the model tends to continue generating nearby words due to their high attention scores with respect to the previously generated word. As such, if the main content of the document is out of reach of the generated word, the final summary can miss that key information. For example, in Table 1, PEGASUS, a state-of-the-art Transformer-based model, fails to capture a key piece of information in the document, i.e., "84-year-old cricket commentator Richie Benaud has passed away after a battle with skin cancer". To understand this phenomenon, we visualize the attention scores in the model during the generation process. As shown in Figure 1, when the model generates "commentary", the main subject of the blue sentence, it tends to point to and generate nearby words such as "his", "understated", "funny", etc. due to their high attention scores, while words further away such as "Richie", "Benaud", "pass", and "away" receive little weight. Consequently, although PEGASUS generates a grammatically correct summary, the summary lacks the key content describing the death of "Richie Benaud".

* Work done during internship at VinAI Research. † Corresponding Author.

Table 1: An example document with its gold summary and the outputs of PEGASUS and our model.

DOCUMENT: While Richie Benaud rose from the suburbs to captain Australia, he will be remembered forever for his mastery of commentating. The champion leg spinner turned cricket commentating into an art form, earning him the title of 'the Voice of Cricket.' His commentary was understated, measured and often extremely funny, and were perfectly timed. Scroll down for video. 84-year-old cricket commentator Richie Benaud has passed away after a battle with skin cancer . His sayings from the hundreds of Test and One Day cricket matches he commentated on across the world were often what fans remembered from important moments. His signature one liners soon dropped to a simple word. 'Marvellous...' will forever be linked to the cricket legend. On commentating, Benaud said: 'My mantra is -put your brain into gear and if you can add to what's on the screen then do it, otherwise shut up.' He once described the scene on the field: 'From our broadcast box you can't see any grass at all, it is simply a carpet of humanity.' On captaincy, and he was one of the best Test captains Australia ever had, Benaud was modest: 'The hallmark of a great captain is the ability to win the toss, at the right time.' The former leg spinner turned cricket commentating into an art form, giving him the title 'the Voice of Cricket'. But he cautioned that description with: 'Captaincy is 90 per cent luck and 10 per cent skill. But don't try it without that 10 per cent.' [...]

GOLD SUMMARY: Cricket commentator Richie Benaud has passed away after cancer battle . The 84-year-old will be remembered for his mastery of commentating . The former leg spinner earned himself the title of the 'Voice of Cricket'. His trademark line was 'Marvellous'.

PEGASUS: The champion leg spinner turned cricket commentating into an art form, earning him the title of 'the Voice of Cricket'. His commentary was understated, measured and often extremely funny, and were perfectly timed.

Our model: 84-year-old cricket commentator Richie Benaud has passed away after a battle with skin cancer. The champion leg spinner earned the title of 'the Voice of Cricket'. His commentary was understated, measured and often extremely funny. His trademark word, 'Marvellous...' will forever be linked to the cricket legend.
To avoid missing key points when summarizing, one solution is to furnish the models with global semantics from probabilistic topic models such as LDA (Narayan et al., 2018) or Poisson factor analysis, or from inner hidden states (Liu et al., 2019). Nevertheless, traditional topic models have been shown to scale poorly to large datasets (Hoffman et al., 2013; Rezende and Mohamed, 2015) and to have a limited capability of describing documents (Ding et al., 2018).
To overcome the above problems, we propose a novel method that integrates a neural topic model into the summarization architecture. Specifically, we utilize the posterior distribution learned by the neural topic model as an approximation of the global semantics of the document and, from it, provide a signal that helps the summarization model better understand the overall document. However, one critical question arises: how can we bring the neural topic model's approximate posterior closer to the true posterior? Integrating a flow mechanism to better approximate the true posterior has been shown to improve variational inference (Rezende and Mohamed, 2015) as well as downstream tasks such as image synthesis (Kingma et al., 2016). To this end, we adapt normalizing flow in the neural topic model to obtain a better approximation of the true distribution and integrate it into the summarization model. To the best of our knowledge, no prior study investigates the benefit of the flow mechanism for the abstractive summarization task.
On the other hand, even though rich global semantics is beneficial, recent studies show that a redundant amount of global semantics may harm the hidden representations since it introduces detrimental noise into the model (Tenney et al., 2019; Li et al., 2020). Therefore, we propose a novel contextualized gating mechanism to control the flow of global semantics and maintain the important information of the hidden states in the main summarization model.
The contributions of our paper can be summarized as follows:

• We propose a novel architecture which takes global semantics into consideration when performing abstractive summarization.

• To this end, we introduce a neural topic model empowered with normalizing flow to enrich the global semantics, and a contextualized gating mechanism to better control the effect of global semantics on the hidden representations.

• We conduct extensive experiments and outperform other state-of-the-art summarization models on five benchmark datasets, i.e., CNN/DailyMail, XSum, Reddit TIFU, PubMed, and arXiv, while generating summaries that are favored by human judges and producing human-interpretable topics.
Related Work

Transformer-based Text Summarization
Transformer (Vaswani et al., 2017) and its variants have demonstrated high efficiency in text summarization. Liu and Lapata (2019) were among the first to apply it to extractive summarization. Zhong et al. (2020) propose using a Siamese BERT to score summaries extracted from the source document, exploring the rich semantic space onto which those summaries are projected. Narayan et al. (2020) combine HiBERT and structured transformers to extract from the document incrementally to form the final summary. For abstractive approaches, PEGASUS develops a pretraining scheme well-suited for abstractive summarization. Other frameworks uniting language understanding and text generation, such as BART (Lewis et al., 2019), have also proven effective.

Topic-aware Summarization Models
Various works integrate the global semantics of a topic model into the sequential information. One method is to attend topic vectors over the hidden states, choosing only entries with high document-level representations (Zheng et al., 2020). Another approach designs three modules to incorporate topic information into attentive heads, provide topical embeddings, and form document-related representations. Other works integrate topical information into convolutional models (Narayan et al., 2018; Wang et al., 2018). Ailem et al. (2019) condition their pointer-generator on both the input document and the latent vector. Fu et al. (2020) study how to effectively assist deep-learning summarization frameworks with external global information. Arguing that each paragraph in the document possesses a separate subtopic, they propose to merge topic information hierarchically with the dense word embeddings.
Unfortunately, limited effort has been devoted to controlling the effect of global semantics on the contextualized representations and to enriching the global semantics for better summarization performance.

Methodology
The overall architecture of our approach is given in Figure 2. It comprises a topic-oriented encoder, a topic-oriented decoder, and a flow-based neural topic model. Formally, given a document as input, we process it into a sequence of tokens X = {x_i} and a bag-of-words (BoW) representation x_bow. X is the input for the text summarization module, while x_bow serves as the input for the neural topic model.

Flow-based Neural Topic Model
The architecture of the neural topic model (NTM) takes inspiration from (Miao et al., 2017) and is based on the variational autoencoder (Kingma and Welling, 2013). In this work, we adapt normalizing flow to the neural topic model to better grasp the global semantic patterns of the document.

BoW Encoder. The input x_bow is first encoded into a latent variable z by the topic encoder. Each input is passed through the encoder to obtain the prior mean µ and prior standard deviation σ:

π = f_MLP(x_bow), µ = f_1(π), σ = f_2(π)   (1)

where f_MLP is a non-linear transformation with a tanh activation function; f_1 and f_2 are two linear transformations with bias. To obtain the topic distribution, we draw the latent variable z ∼ N(µ, σ²).

Flow. Different from the conventional neural topic model, a flow is applied to map the latent vector to a more complicated distribution. Formally, the flow is a chain of transformations f_1, f_2, ..., f_K which are all invertible and have Jacobians that are easy to compute:

z_K = f_K ∘ ... ∘ f_2 ∘ f_1(z)

BoW Decoder. Given the new topic vector z_K, the BoW decoder reconstructs the original input x_bow by generating x̂_bow. We take the following procedure to simulate the reconstruction of x_bow:

θ = softmax(f_θ(z_K)), x̂_bow = softmax(f_φ(θ))

where f_*(·) is a ReLU-activated non-linear transformation. The weight matrix of f_φ is chosen as the topic-word distribution (φ_1, φ_2, ..., φ_K). We proceed to employ the topic mixture θ to guide the text summarization process.
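To make the flow step concrete, the sketch below implements one possible choice of flow, the planar flow of (Rezende and Mohamed, 2015), together with the change-of-variables bookkeeping for the latent's log-density. The text above does not commit to a specific flow family, so the planar form, the dimensions, and the softmax mapping to the topic mixture are illustrative assumptions, not our exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def planar_flow(z, u, w, b):
    """One planar-flow step f(z) = z + u * tanh(w.z + b) with its log|det J|."""
    a = z @ w + b                                  # (batch,)
    f_z = z + np.outer(np.tanh(a), u)              # (batch, dim)
    psi = np.outer(1.0 - np.tanh(a) ** 2, w)       # h'(a) * w, shape (batch, dim)
    log_det = np.log(np.abs(1.0 + psi @ u))        # change-of-variables term
    return f_z, log_det

dim, batch, K = 4, 8, 3
mu, log_sigma = np.zeros(dim), np.zeros(dim)

# Reparameterized draw z0 ~ N(mu, sigma^2), with its log-density log q0(z0).
eps = rng.standard_normal((batch, dim))
z = mu + np.exp(log_sigma) * eps
log_q = -0.5 * np.sum(eps ** 2 + np.log(2 * np.pi) + 2 * log_sigma, axis=1)

# Chain of K flow steps: log qK(zK) = log q0(z0) - sum_k log|det J_k|.
for _ in range(K):
    u, w, b = 0.1 * rng.normal(size=dim), 0.1 * rng.normal(size=dim), 0.0
    z, log_det = planar_flow(z, u, w, b)
    log_q -= log_det

# Topic mixture from the transformed latent (softmax as an illustrative f_theta).
theta = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
```

In training, `log_q` enters the ELBO as the approximate posterior term and `theta` is the mixture passed on to the summarizer.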

Neural Topic Modeling for Transformer
The text summarization model is given a source document X = {x_i}^N_{i=1} and its task is to predict the target summary Y = {y_j}^M_{j=1}. In this setting, the document X has N tokens and the summary Y has M tokens (M < N).

Figure 2: Our overall architecture
Our model inherits the Transformer-based architecture. Particularly, it consists of an encoder and a decoder. The encoder learns the context of the source text, and the decoder then predicts the target summary by learning the context of the generated tokens and attending over the encoder hidden states. In our case, we condition both the encoder and the decoder on the latent topic yielded by the neural topic model.

Topic-oriented Encoder. We add the special token "CLS" to the beginning of the input. At each iteration, the encoder outputs localized representations

{h_i}^N_{i=1} = Encoder({x_i}^N_{i=1})   (3)

which explore the relationships among the input tokens, i.e., the context each token stays in. We relate the context of each word to the main topic of the document by modulating the i-th hidden state h_i:

h̃_i = g(h_i, θ)

where g is a function used to introduce the global semantics to the hidden representations, which we discuss as the contextualized gating mechanism in Section 3.3.

Topic-oriented Decoder. We also make "CLS" the first input of the decoder. The decoder bridges the summary Y and the document X, creating target hidden states S = {s_j}^M_{j=1} aligned with the source text. Because of the uni-directionality of the text summarization task, the decoder works in a left-to-right fashion:

s_j = Decoder({y_t}_{t<j}, {h̃_i}^N_{i=1})

Similar to the encoder, we inject the semantics of the topic model into the output hidden states.

Contextualized Gating Mechanism
Because a certain amount of semantic meaning, whether local or global, has already been embedded in the contextualized representations, it is reasonable to append only as much information to the computed hidden states as is needed to maximize the benefit of the topical information. We adapt the gating mechanism of (Cho et al., 2014) to achieve this goal. In our contextualized gating mechanism, we estimate the necessary amount of global semantics based on the obtained hidden states.

Encoder Gating. For the encoder, we take the hidden representation h_CLS of the "CLS" token to control the amount of additional global information:

p_E = σ(W_E h_CLS)

where W_E ∈ R^{d×d}, and d is the dimension of the hidden representation. We form the topic-aware hidden state by merging each hidden state with the topic mixture and mapping it onto a topical space:

h^topic_i = f_enc_topic([h_i; θ])

where f_enc_topic is a non-linear transformation. The topic-oriented encoder hidden state of every token is the fusion of the topic-aware and the original representation:

h̃_i = p_E ⊙ h^topic_i + (1 − p_E) ⊙ h_i
Decoder Gating. The amount of topic mixture used for the decoder is controlled by both the encoder and decoder hidden states:

p_D = σ(W_D [h_CLS; s_CLS])

This switching probability is used to modulate the decoder hidden states, following the same computation as the encoder gating.
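A minimal sketch of the encoder-side gating follows, assuming a linear-plus-tanh form for f_enc_topic and a sigmoid gate computed from the "CLS" state; the parameter matrices here are hypothetical stand-ins for the learned layers:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_tokens, n_topics = 16, 10, 5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical stand-ins for the learned parameters.
W_E = rng.normal(scale=0.1, size=(d, d))              # encoder gate weights
W_t = rng.normal(scale=0.1, size=(d, d + n_topics))   # linear part of f_enc_topic

H = rng.normal(size=(n_tokens, d))          # encoder hidden states h_1..h_N
h_cls = H[0]                                # representation of the "CLS" token
theta = np.full(n_topics, 1.0 / n_topics)   # topic mixture from the NTM

# Gate: per-dimension amount of global (topic) information to let through.
p_E = sigmoid(W_E @ h_cls)                                   # shape (d,)
# Topic-aware states: fuse each hidden state with the topic mixture.
topic_in = np.concatenate([H, np.tile(theta, (n_tokens, 1))], axis=1)
H_topic = np.tanh(topic_in @ W_t.T)                          # (n_tokens, d)
# Fusion of topic-aware and original representations.
H_tilde = p_E * H_topic + (1.0 - p_E) * H
```

The decoder gating follows the same pattern, with the gate computed from both the encoder and decoder "CLS" states.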

Training Objective
Our framework favors end-to-end learning of neural topic modeling and text summarization. In this section, we formally define the objective functions for the two modules.
For our neural topic model, the objective function is derived from the evidence lower bound (Blei et al., 2017). We adapt the change-of-variables formula of normalizing flow, which determines the distribution of the variable at the end of the flow, to the loss of the neural topic model:

L_NTM = E_{q(z|x)} [log p(x|z_K) + log p(z_K) − log q_K(z_K)]

where p(z_K) denotes the prior distribution constructed by the flow; log p(x|z_K) stands for the log-likelihood of the document; log q_K(z_K) denotes the approximate posterior distribution. The detailed derivation is available in the Appendix.
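As a toy illustration, the topic-model loss is later combined with the summarization cross-entropy as L = L_sum + λ·L_NTM (Equation 17 below); in the sketch, the token probabilities and the L_NTM value are placeholders rather than model outputs:

```python
import numpy as np

def summarization_loss(probs, targets):
    """Cross-entropy over the summary tokens: -sum_j log p(y_j | y_<j, X)."""
    return -np.sum(np.log(probs[np.arange(len(targets)), targets]))

rng = np.random.default_rng(0)
# Toy decoder outputs: a distribution over a 6-word vocabulary for 4 tokens.
logits = rng.normal(size=(4, 6))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
targets = np.array([1, 3, 0, 2])

L_sum = summarization_loss(probs, targets)
L_ntm = 2.5   # placeholder for the topic-model loss term
lam = 0.75    # the balancing weight lambda used in our experiments
L = L_sum + lam * L_ntm
```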
For text summarization, we minimize the cross-entropy loss

L_sum = − Σ^M_{j=1} log p(y_j | y_1, ..., y_{j−1}, X)

where N and M are the lengths of the document X and summary Y, respectively. The entire framework is trained with a linear combination of the two loss functions L_sum and L_NTM:

L = L_sum + λ L_NTM   (17)

where λ is a hyperparameter balancing the effect of the neural topic model on the training process.

Datasets

Reddit TIFU includes 120K informal posts from the online discussion forum Reddit, strictly following the rule of constructing an expressive "TL;DR" summary. In this work, the long subset of the dataset is used for performance evaluation.
arXiv, PubMed are two long-document datasets of scientific publications. For each document, the abstract is chosen to be the summary.
We present the statistics of datasets in Table 2.

Comparisons
As baselines, we compare our proposed architecture against a wide variety of previous studies.

Automatic Evaluation
We use the automatic ROUGE metrics (Lin, 2004). In Tables 3, 4, 5, 6, and 7, we report unigram overlap (ROUGE-1) and bigram overlap (ROUGE-2) to assess informativeness, and the longest common subsequence (ROUGE-L) to assess the fluency of the generated summary. Our model outperforms prior work on the five standard datasets. For CNN/DailyMail, we achieve an absolute improvement of 0.35 in ROUGE-1, 0.48 in ROUGE-2, and 0.28 in ROUGE-L over PEGASUS. Furthermore, our model outperforms the previous topic-aware model BART + TA by 0.6 points in ROUGE-2. This shows that our method can generate summaries that include the important content of the document.
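For reference, the n-gram overlap underlying ROUGE-N recall can be sketched as follows; this is a simplified version of the metric, whereas reported numbers come from the standard ROUGE implementation with its usual stemming and preprocessing:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall: overlapping n-grams over reference n-grams."""
    def ngrams(text, n):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

r1 = rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=1)  # 5/6
r2 = rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2)  # 3/5
```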
On the XSum dataset, which is more abstractive than CNN/DailyMail (Bommasani and Cardie, 2020), our gain is even more pronounced. Compared with BART + TA, we achieve 3.8 absolute improvement in ROUGE-1, 2.4 in ROUGE-2, and 3.8 in ROUGE-L.
For Reddit TIFU, in which most of the source texts and the target summaries are informal, our model outperforms PEGASUS by 1.3 in ROUGE-1, 0.4 in ROUGE-2, and 1.5 in ROUGE-L. These results show that global semantics is capable of helping the model generate better target summaries.
For the arXiv and PubMed datasets, we also achieve improvements over the baseline PEGASUS, which is designed specifically for abstractive text summarization. In particular, for the arXiv dataset, we gain an increase of 0.71 in ROUGE-1, 2.48 in ROUGE-2, and 1.46 in ROUGE-L. For the PubMed dataset, the increase is 1.7 in ROUGE-1, 1.3 in ROUGE-2, and 0.83 in ROUGE-L. Since the automatic metrics do not fully reveal the true quality of the model, we conduct a human evaluation for further assessment. To that end, we design two tests to elicit human judgments in two ways.

Human Evaluation
In the first experiment, we presented summaries from PEGASUS, BART (Lewis et al., 2019), and our model, together with the gold summary, and asked four professional English speakers to rank the summaries from worst to best in terms of informativeness, faithfulness, topic coherence, and fluency. We randomly sampled 100 summaries from 100 documents of the CNN/DailyMail test set. The score of a system is the percentage of times it was selected as the best minus the percentage of times it was chosen as the worst.
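This best-worst scoring rule can be illustrated with made-up counts (the numbers below are purely illustrative, not from our study):

```python
def best_worst_score(times_best, times_worst, total_judgments):
    """System score: % of times chosen best minus % of times chosen worst."""
    return 100.0 * (times_best - times_worst) / total_judgments

# Hypothetical tallies for one system over 100 ranked summaries.
score = best_worst_score(times_best=48, times_worst=12, total_judgments=100)
```

Scores therefore range from -100 (always worst) to +100 (always best).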
In the second experiment, we applied the question-answering (QA) paradigm. For each document, we create two independent questions which emphasize the key information of the text. Participants read the summaries and answer those questions as best as they can. The score of a system is the percentage of questions the participants answer correctly.
Ten professional English speakers were asked to participate in the two assessments. The results in Table 9 show that our generated summaries are favored by human judges and are more likely to maintain the important content of the original text than other systems' summaries.
The Fleiss' Kappa scores with overall agreement percentages of the first and second human evaluation experiments are reported in Table 9.

To study the effect of our topic-oriented module on other abstractive Transformer-based models, we integrate our flow-based neural topic model and contextualized gating into BART (Lewis et al., 2019). In particular, we continue to finetune on the CNN/DailyMail, XSum, and arXiv datasets, given the pretrained checkpoint. As can be seen in Tables 10, 11, and 12, our topic-oriented module improves the performance, showing general effectiveness on another Transformer-based architecture.

Analysis on Neural Topic Model and Traditional Topic Model
To substantiate our hypothesis that the neural topic model enhances summarization performance on large-scale datasets, we conducted experiments combining the Transformer-based summarization module with traditional topic models, i.e., Latent Dirichlet Allocation (LDA) and Poisson Factor Analysis (PFA), on CNN/DailyMail and PubMed. We report the results in Table 13 and Table 14. As can be seen, neural topic models, particularly our proposed model, significantly outperform traditional topic models on abstractive summarization.

As shown in Section 5.1, the latent vector is useful for text summarization. Here we study whether jointly training with the summarization module helps the topic model produce human-interpretable topics.
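The basic operations behind such a topic-quality comparison, reading each topic's top words off the topic-word matrix and scoring how often they co-occur, can be sketched as follows. The UMass-style score here is a simpler stand-in for the C_V measure reported below, and the corpus, vocabulary, and topic-word matrix are toy data:

```python
import numpy as np
from itertools import combinations

def top_words(phi, vocab, n=10):
    """Top-n words per topic from the topic-word matrix phi of shape (K, V)."""
    return [[vocab[i] for i in np.argsort(-row)[:n]] for row in phi]

def umass_coherence(topic_words, docs, eps=1.0):
    """UMass-style coherence: log smoothed co-document frequency of word pairs."""
    doc_sets = [set(d.split()) for d in docs]
    score = 0.0
    for w1, w2 in combinations(topic_words, 2):
        d1 = sum(w1 in d for d in doc_sets)                # docs containing w1
        d12 = sum(w1 in d and w2 in d for d in doc_sets)   # docs containing both
        if d1:
            score += np.log((d12 + eps) / d1)
    return score

# Toy corpus and a single toy topic over a 6-word vocabulary.
docs = ["arsenal tottenham winning prize",
        "liverpool chelsea balls prize",
        "stocks market shares prices"]
vocab = ["arsenal", "tottenham", "liverpool", "chelsea", "prize", "stocks"]
phi = np.array([[0.30, 0.25, 0.20, 0.15, 0.08, 0.02]])

words = top_words(phi, vocab, n=3)
coherence = umass_coherence(words[0], docs)
```

Higher (less negative) coherence means the topic's top words tend to appear in the same documents.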
Coherence Score Comparison. We evaluate the topic models with the automatic C_V measure. Following (Zeng et al., 2018), we pick the top 10 words from each topic and average the C_V scores of all topics. The results are reported on two summarization datasets, CNN/DailyMail and XSum. For comparison, we take LDA and LSA as probabilistic baselines, as they are notable and well-known for human interpretability. For both baselines, we run 1000 iterations to ensure convergence. As Table 16 shows, our model outperforms the traditional topic models, which implies that jointly training the neural topic model and text summarization creates human-understandable topics.

Sample Topics. To further assess the quality of the topics learned by our system, we extract some sample words (Table 6) indicating the context around "liverpool chelsea" discovered by the model trained on the CNN/DailyMail dataset. As can be seen, the topics from probabilistic topic models such as LSA and LDA contain some mixed topic words. Conversely, our neural topic model trained with the text summarization module produces a topic which looks more coherent. In particular, our words refer to the context of the teams competing in the football championship of England, such as "arsenal" and "tottenham", and related factors, for instance, "balls", "prize", and "winning".

Ablation Study

In this section, we study the impact that (1) the integration of normalizing flow and (2) the contextualized gating mechanism have on the text summarization performance.

Impact of the contextualized gating mechanism. Plainly incorporating the global semantics into the model makes the performance improvement drop strongly. As shown in Table 17, the ROUGE-1 score decreases by more than 2 points compared with the models where we apply contextualized gating.
We hypothesize that in numerous cases, the effect of global semantics overwhelms the benefits of the contextualized representations.

Impact of integrating normalizing flow. In this ablation, we remove the normalizing flow from the neural topic model. As shown in Table 17, without the normalizing flow, the improvement brought by the latent vector is downgraded, by nearly 0.4 ROUGE-1 when using contextualized gating and 0.53 ROUGE-1 in the non-gating case. We hypothesize that the plain neural topic model does not provide global semantics as expressive as those of the neural topic model with normalizing flow.

Table 1 shows a case study on the summarization results of PEGASUS and our model. While PEGASUS misses the key information related to the death of "Richie Benaud", our model successfully includes it in the final summary. This shows the effectiveness of our model in capturing key information in the document, thanks to the contribution of the neural topic model and the gating mechanism. Remarkably, our model is also able to rephrase "signature one liners" as "trademark word" when describing Richie Benaud's famous quote, rather than just copying the words from the original document. More case studies can be found in the Appendix.

Conclusion
In this paper, we propose a method to utilize global semantics for the text summarization task. In particular, we aim to fit the global semantics to expressively describe the documents. Moreover, we find that maintaining the information in the original contextualized representations also benefits summarization performance. We outperform other state-of-the-art models on five benchmark datasets.

A Summary examples
We present some summary examples in this section.

DOCUMENT: New York (CNN) New York state authorities have issued a health alert following a dramatic spike in hospital visits for synthetic marijuana-related emergencies. Gov. Andrew Cuomo said Friday that more than 160 patients in nine days have been rushed to hospitals across the state for adverse reactions to synthetic cannabinoid, known as "spice" or "K2." "Spice" and other similar synthetic drugs are often marketed as legal plant material coated with chemicals that are supposed to mimic the effects of marijuana, according to a statement from the governor's office. "Since the exact compounds contained in synthetic cannabinoid products change so frequently, it's often impossible for users to know exactly what they are putting in their body," acting New York State Health Commissioner Dr. Howard Zucker said. Symptoms after use have a wide range of severity, from confusion, drowsiness and headaches to increased heart rate, seizures and loss of consciousness, according to the New York State Department of Health. Synthetic marijuana is popular among teens because it is marketed as incense or natural products to "mask its true purpose", the health department statement said. "Young people may be fooled into thinking that these substances are safe because they are sold over the counter or are in colorful packaging, but they are not made for human consumption," New York Alcohol and Substance Abuse Services Commissioner Arlene Gonzalez Sanchez said. "They are dangerous and can have significant, long-term effects on the brain." The recent surge is not isolated in New York; other states across the country have noticed similar trends. Alabama Department of Public Health issued a statement last week acknowledging a rise of synthetic marijuana usage and said there had been 98 overdoses suspected to be linked with "spice" in the previous month.
Mobile County alone has seen seven cases in 2015 so far, more than the entire previous year, the statement said. Mississippi health officials are also concerned that synthetic marijuana is on the rise. [...]

GOLD SUMMARY: New York reports 160 hospitalizations related to synthetic marijuana . Gov. Andrew Cuomo issued a health alert .

PEGASUS: Synthetic marijuana is popular among teens because it is marketed as incense or natural products to "mask its true purpose". "Spice" and other similar synthetic drugs are often marketed as legal plant material coated with chemicals that are supposed to mimic the effects of marijuana. Symptoms after use have a wide range of severity, from confusion, drowsiness and headaches to increased heart rate, seizures and loss of consciousness.

BART: "Spice" and other similar synthetic drugs are often marketed as legal plant material. Symptoms after use have a wide range of severity, from confusion, drowsiness and headaches to increased heart rate, seizures and loss of consciousness. "Young people may be fooled into thinking that these substances are safe because they are sold over the counter or are in colorful packaging," official says.
Our model: New York Gov. Andrew Cuomo says 160 people have been hospitalized in nine days. Symptoms after use include confusion, drowsiness, headaches, increased heart rate, seizures and loss of consciousness. Health officials are concerned that synthetic marijuana is on the rise.

B Loss of Flow-based Neural Topic Model
We have the loss of the neural topic model, i.e., the evidence lower bound:

L_NTM = E_{q(z|x)} [log p(x, z) − log q(z|x)]   (18)

Let f_1, f_2, ..., f_K be a chain of invertible transformations with easy-to-compute Jacobians. The change-of-variables formula for transforming z_i to z_{i+1} is

log q_{i+1}(z_{i+1}) = log q_i(z_i) − log |det (∂f_{i+1}/∂z_i)|   (19)

Sequentially applying the K transformations, we have

log q_K(z_K) = log q_0(z_0) − Σ^K_{i=1} log |det (∂f_i/∂z_{i−1})|   (20)

or equivalently,

log q_0(z_0) = log q_K(z_K) + Σ^K_{i=1} log |det (∂f_i/∂z_{i−1})|   (21)

Plugging formula (21) into equation (18), with q(z|x) = q_0(z_0) and the flow output z_K in place of z, we obtain

L_NTM = E_{q_0(z_0|x)} [log p(x, z_K) − log q_0(z_0) + Σ^K_{i=1} log |det (∂f_i/∂z_{i−1})|] = E_{q_K(z_K|x)} [log p(x, z_K) − log q_K(z_K)]

We reach the neural topic model component of our training objective.
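The change-of-variables identity above can be checked numerically for a one-dimensional affine flow f(z) = a·z + b, whose pushforward of N(0, 1) is N(b, a²); the affine choice is purely illustrative:

```python
import numpy as np

def log_normal(z, mu=0.0, sigma=1.0):
    """Log-density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

a, b = 2.0, 1.0          # one affine "flow" step f(z) = a*z + b
z0 = 0.3                 # sample from the base distribution q0 = N(0, 1)
z1 = a * z0 + b          # pushed through the flow

# Change of variables: log q1(z1) = log q0(z0) - log|a|, since log|det df/dz| = log|a|.
log_q1 = log_normal(z0) - np.log(abs(a))
# The exact pushforward density is N(b, a^2); the two must agree.
log_q1_exact = log_normal(z1, mu=b, sigma=abs(a))
```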

C Implementation Details
Flow-based Neural Topic Model. Following prior work, we preprocess the BoW representations by removing stopwords. We experiment with different numbers of topics {50, 100, 150, 200} and different numbers of invertible transformations in the flow-based neural topic model (flow lengths) on the CNN/DailyMail dataset. The results (in the format of R1/R2/RL scores) are shown in Table 20. It can be seen that a simple posterior distribution in the neural topic model is not sufficient to describe the large-scale dataset, while a highly complex one can slightly hurt the performance. Similarly, it is necessary to set an adequate number of topics. We proceed with a flow length of 4 and a topic number of 100 for the other datasets. We pretrain our versions of the flow-based neural topic model on the five downstream datasets CNN/DailyMail, XSum, Reddit TIFU, arXiv, and PubMed with batch sizes {256, 256, 256, 320, 320}, respectively. All versions are trained with the Adadelta optimizer with a learning rate of 0.01.
Topic-oriented Transformer-based summarization model. We do not change any settings from the original papers of PEGASUS and BART (Lewis et al., 2019). In particular, we finetune all models on 16 Nvidia A100 GPUs with batch size 256 and the Adam optimizer with a learning rate of 1e-4. For the objective function in Equation 17, we experimented with λ ∈ {0.5, 0.6, 0.75, 0.9} and found that λ = 0.75 gives the best performance for all datasets. Models are evaluated and checkpointed every epoch. During training, we keep track of the three best checkpoints in terms of loss on the validation set. Finally, for decoding, we run beam search with a beam size of 8 and report the best result among the three validated checkpoints.