Boosting Summarization with Normalizing Flows and Aggressive Training

This paper presents FlowSUM, a normalizing flows-based variational encoder-decoder framework for Transformer-based summarization. Our approach tackles two primary challenges in variational summarization: insufficient semantic information in latent representations and posterior collapse during training. To address these challenges, we employ normalizing flows to enable flexible latent posterior modeling, and we propose a controlled alternate aggressive training (CAAT) strategy with an improved gate mechanism. Experimental results show that FlowSUM significantly enhances the quality of generated summaries and unleashes the potential for knowledge distillation with minimal impact on inference time. Furthermore, we investigate the issue of posterior collapse in normalizing flows and analyze how the summary quality is affected by the training strategy, gate initialization, and the type and number of normalizing flows used, offering valuable insights for future research.

Variational models have gained increasing research interest (Zhang et al., 2016; Su et al., 2018; Wang et al., 2019; Fu et al., 2020) as they address these issues by introducing uncertainty in predictions through learning a probability distribution over latent variables. A variational model enables diverse text generation (Du et al., 2022), smoother output spaces, and semantically meaningful latent codes (Wang et al., 2019) that guide the generation of coherent and informative summaries.
Nonetheless, existing variational models have not fully achieved the aforementioned desirable properties due to two main challenges. Firstly, the semantic information in the source text may possess a complex structure. However, since introducing latent variables complicates parameter estimation, many current models (Fu et al., 2020; Zheng et al., 2020) represent latent codes using a Gaussian distribution, which is insufficient for capturing the intricacies of the latent space and could potentially reduce model performance. To enrich latent distributions, researchers suggest replacing the highly restricted isotropic Gaussian with normalizing flows (Rezende and Mohamed, 2015). Normalizing flows can generate complex distributions while preserving density in an analytical form, and they have been integrated into variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) and variational encoder-decoder (VED) (Serban et al., 2017; Zhou and Neubig, 2017) frameworks to better approximate the latent posterior. This approach has found application in various domains, including text generation (Wang et al., 2019), neural machine translation (Setiawan et al., 2020), and dialogue generation (Luo and Chien, 2021). Despite this progress, the operating characteristics of normalizing flows on summarization tasks have yet to be investigated.
Secondly, as reported by previous studies (Bowman et al., 2016; Kingma et al., 2016; Chen et al., 2017), variational models tend to experience posterior collapse during training, which occurs when the KL term vanishes to zero, indicating that the model fails to learn meaningful latent codes. This problem becomes more severe when modeling discrete data with a strong auto-regressive decoder (He et al., 2019), which is the case for Transformer-based summarization models. To resolve this issue, several solutions have been proposed, such as employing a less auto-regressive decoder network (Yang et al., 2017; Semeniuta et al., 2017; Shen et al., 2018a), modifying the training objective (Zhao et al., 2017; Tolstikhin et al., 2018; Prokhorov et al., 2019), and proposing new training strategies (Kim et al., 2018; He et al., 2019). However, most existing work focuses on the VAE framework with a Gaussian latent distribution, and limited work considers the VED framework with normalizing flows. In particular, two questions remain unclear: (1) when the latent distribution is modeled by normalizing flows, does the posterior collapse problem still exist? (2) when posterior collapse exists, what are the appropriate strategies to achieve good summarization quality within the VED framework?
This paper introduces FlowSUM, a normalizing flows-based VED framework for Transformer-based summarization, along with a controlled alternate aggressive training (CAAT) strategy and a refined gate mechanism to resolve the two challenging issues. Our contributions include: 1. We employ normalizing flows to enrich the latent posterior distribution and integrate the latent code into Transformer-based models in a plug-and-play manner, demonstrating its effectiveness through extensive experiments.
2. We propose a controlled alternate aggressive training strategy and a refined gate mechanism to mitigate the posterior collapse problem and improve training efficacy.
3. Our findings suggest that FlowSUM facilitates knowledge distillation while having a negligible effect on inference time, implying normalizing flows' potential for transferring knowledge from advanced large language models.
4. We investigate the posterior collapse problem for different normalizing flows and examine how the quality of a summary is impacted by the training strategy, gate initialization, and the type and depth of normalizing flows.
This article consists of five sections. Section 2 provides an overview of normalizing flows, VED, and a summary of related studies. Section 3 describes the proposed model architecture and the training strategies employed. Section 4 presents the experimental setup and results, and Section 5 concludes the paper with a discussion.

Normalizing Flows
Normalizing flows (NF) (Rezende and Mohamed, 2015) are a type of generative model that has gained popularity in recent years. The fundamental idea involves mapping a simple probability density (e.g., Gaussian) to a more complex one through a series of invertible transformations. One of the key advantages of NF is that it allows for exact likelihood evaluations, which is crucial for many applications such as density estimation (Papamakarios et al., 2017), data generation (Tran et al., 2019), and variational inference (Kingma et al., 2016). A flow-based model consists of two components: a base distribution p_u(u) and a transformation f(·) : R^D → R^D, where f must be invertible and both f and f^{-1} must be differentiable. Let x = f(u) where u ∼ p_u(u); then the density of x can be obtained via a change of variables (Bogachev, 2007):

p_x(x) = p_u(u) |det J_f(u)|^{-1}, where u = f^{-1}(x).  (1)

In this paper, we examine several NFs, including planar flows (Rezende and Mohamed, 2015), radial flows (Rezende and Mohamed, 2015), Sylvester flows (van den Berg et al., 2018), real-valued non-volume preserving (RealNVP) transformations (Dinh et al., 2017), inverse autoregressive flow (IAF) (Kingma et al., 2016), rational-quadratic neural spline flows (RQNSF) (Durkan et al., 2019), and rational-linear neural spline flows (RLNSF) (Dolatabadi et al., 2020). We delegate the detailed discussion of transformation and invertibility to Appendix J. Throughout the paper, for each type, we compose K layers of transformation, which remains invertible and differentiable.
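To make the change-of-variables rule in Eq. 1 concrete, here is a minimal NumPy sketch of one planar-flow layer; the parameter values are illustrative choices of ours, picked to keep the layer invertible, and are not taken from the paper.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """One planar-flow layer f(z) = z + u * tanh(w.z + b), returning the
    transformed point and log|det J_f(z)| via the matrix determinant lemma."""
    a = np.dot(w, z) + b                      # scalar pre-activation
    psi = (1.0 - np.tanh(a) ** 2) * w         # h'(a) * w
    log_det = np.log(np.abs(1.0 + np.dot(u, psi)))
    return z + u * np.tanh(a), log_det

# toy parameters (ours, chosen so that w.u > -1 and the layer is invertible)
u, w, b = np.array([0.3, -0.2]), np.array([1.0, 0.5]), 0.1
z = np.array([0.5, -1.0])                     # a sample from the base N(0, I)
x, log_det = planar_flow(z, u, w, b)

# change of variables (Eq. 1): log p_x(x) = log p_u(z) - log|det J_f(z)|
D = z.size
log_pu = -0.5 * (np.sum(z ** 2) + D * np.log(2 * np.pi))
log_px = log_pu - log_det
```

Composing K such layers, as the paper does, simply sums the per-layer log-determinants.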

Variational Encoder-Decoders
Variational encoder-decoders (VEDs) (Zhang et al., 2016; Serban et al., 2017; Zhou and Neubig, 2017; Shen et al., 2018b), which can be seen as an extension of variational autoencoders (VAEs) (Kingma and Welling, 2014; Rezende et al., 2014), have been widely used to understand the conditional data generation process. Given an input x, the framework posits the existence of a latent variable z ∼ p(z | x; ϕ), and the generation of y relies on p(y | x, z; θ). With this premise, the conditional data generation can be formulated as in Eq. 2.

p(y | x; ϕ, θ) = ∫ p(z | x; ϕ) p(y | x, z; θ) dz  (2)

Since the marginal p(y | x; ϕ, θ) is intractable, we employ variational inference to estimate the parameters. This involves maximizing the evidence lower bound (ELBO), a surrogate of the log-likelihood, as defined in Eq. 3. The underlying idea is to propose a parameterized distribution q(z | x, y; ψ), known as the variational posterior, to approximate the true posterior distribution p(z | x, y; ϕ, θ). The greater the flexibility in q(z | x, y; ψ), the better the approximation, and the more effective the surrogate ELBO becomes. See more details in Appendix B.
ELBO = E_{q(z | x, y; ψ)}[log p(y | x, z; θ)] − KL(q(z | x, y; ψ) ∥ p(z | x; ϕ))  (3)

For summarization, we parameterize p(y | x, z; θ) as an encoder-decoder model that generates summaries conditioned on the input text and latent code.
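As a sketch of how Eq. 3 is estimated in practice, the snippet below computes a one-sample ELBO estimate with a diagonal-Gaussian posterior and a standard normal prior; the toy log-likelihood function stands in for the decoder term log p(y | x, z; θ), and all names and values are our illustrative assumptions.

```python
import numpy as np

def elbo_estimate(mu, log_sigma, log_lik, rng):
    """One-sample Monte Carlo estimate of the ELBO in Eq. 3, assuming a
    Gaussian variational posterior q(z|x,y) = N(mu, diag(sigma^2)) and a
    standard normal prior p(z|x) = N(0, I), so the KL term is analytic."""
    sigma = np.exp(log_sigma)
    z = mu + sigma * rng.normal(size=mu.shape)    # reparameterization trick
    kl = 0.5 * np.sum(sigma ** 2 + mu ** 2 - 1.0 - 2.0 * log_sigma)
    return log_lik(z) - kl

rng = np.random.default_rng(0)
mu, log_sigma = np.zeros(4), np.zeros(4)          # q equals the prior -> KL = 0
toy_log_lik = lambda z: -0.5 * np.sum(z ** 2)     # stand-in for log p(y|x,z)
elbo = elbo_estimate(mu, log_sigma, toy_log_lik, rng)
```

With a flexible (e.g., flow-based) posterior, the KL term loses this closed form, which is exactly the situation Section 3 addresses.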

Transformer-based Summarization Models
Transformer-based models equipped with pretraining and fine-tuning techniques have enjoyed significant success in many NLP tasks, including text summarization. Liu and Lapata (2019) proposed BertSUM for extractive and abstractive tasks, utilizing the pre-trained BERT encoder (Devlin et al., 2019). To better align the pre-trained encoder for document understanding with the decoder trained from scratch for text generation, Rothe et al. (2020) demonstrated the effectiveness of leveraging pre-trained BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and RoBERTa (Liu et al., 2019) checkpoints to build sequence-to-sequence (S2S) models for tasks including summarization. Another approach is to address both document understanding and generation in a unified framework by first pre-training general-purpose S2S models and then fine-tuning on downstream tasks, for instance, BART (Lewis et al., 2020), MASS (Song et al., 2019), UniLM (Dong et al., 2019), ProphetNet (Qi et al., 2020), and T5 (Raffel et al., 2020). In addition, Zhang et al. (2020a) proposed PEGASUS with a pre-training objective tailored for abstractive summarization, achieving significant improvements across multiple datasets.

Variational Summarization
Variational summarization models come in two flavors: unsupervised and supervised. In the unsupervised domain, researchers commonly utilize variational autoencoders in conjunction with specific control mechanisms for summary generation, as exemplified by Schumann (2018); Chu and Liu (2019); Brazinskas et al. (2020). In the supervised realm, there are generally two primary approaches. The first approach models the conditional probability of the target sentences p(y | x) as in Eq. 2, whereas the second models the joint probability of the source and target sentences p(x, y) as ∫ p(z) p(x | z) p(y | z, x) dz. Our model belongs to the first category, akin to prior studies such as Setiawan et al. (2020); Fu et al. (2020). In contrast, other works, including Zheng et al. (2020); Nguyen et al. (2021); Zou et al. (2021), adopt the second type by jointly modeling topics and sequence-to-sequence generation. Most of them assume a simple Gaussian latent prior, except for Nguyen et al. (2021), who employ normalizing flows to model neural topic models and enrich global semantics. However, they did not specify the choice of normalizing flows or how they addressed posterior collapse. To the best of our knowledge, there remains limited research on the application of normalizing flows in variational summarization models and their operating characteristics.
Normalizing Flows Enhanced Summarization Model

FlowSUM Model Architecture
As illustrated in Fig. 1, FlowSUM adopts a VED architecture similar to that in Setiawan et al. (2020). Throughout this section, let e be the embedding size, m and n be the lengths of the input source and target summary respectively, ℓ be the latent dimension of the NF latent module, d be the dimension of the decoder's hidden states, {x_i}_{i=1}^m be the input source text, {y_j}_{j=1}^n be the target summary text, and x̄ ∈ R^e be the average embedding of the untruncated input source text. (Let V be the vocabulary size, {E_v}_{v=1}^V be the input embeddings, and {b_v}_{v=1}^V be the Bag-of-Words (BoW) counts of the input source text; then x̄ = (Σ_{v=1}^V b_v E_v) / (Σ_{v=1}^V b_v) ∈ R^e. When we do not truncate the input text, Σ_v b_v = m holds. However, if we truncate the input due to encoder constraints, then Σ_v b_v > m, and the BoW vector retains information that would otherwise have been lost.)

NF Latent Module. To model the variational posterior q(z | x, y; ψ), we follow Zhou and Neubig (2017) and assume all the information in y is contained in x (see the detailed discussion in Appendix C). Therefore, we have q(z | x, y; ψ) = q(z | x; ψ), which allows us to parameterize q(z | x; ψ) with neural networks (NNs) and normalizing flows using the amortization and reparameterization tricks (Kingma and Welling, 2014). The NF latent module comprises an inference network q_0(·) and a normalizing flows model. The inference network takes x̄ as input and produces two output vectors, μ_0 ∈ R^ℓ and log(σ_0) ∈ R^ℓ. Using the reparameterization trick, a random sample z_0 ∈ R^ℓ is drawn from N(μ_0, diag(σ_0^2)). Afterward, the normalizing flows model applies a sequence of K invertible transformations to z_0 to obtain the latent code z_K. Note that when K = 0, the model reverts to the traditional VED framework, and we refer to this degenerated version as VEDSUM.

Gated Transformer-based Encoder-Decoder.
Our model adopts the Transformer-based encoder-decoder. The encoder processes the input text and learns a sequence of hidden representations, and the decoder generates a summary based on the encoder's hidden states and the previously generated tokens. We incorporate the latent information into the decoder with a gate mechanism, which mixes the latent vector z_K with the decoder's last layer of hidden states {h_j}_{j=1}^n. As pointed out in Gu et al. (2020), the saturation property of traditional gating mechanisms hinders gradient-based optimization. Therefore, following their proposal, we use a refined gate mechanism designed to allow for better gradient flow. Let σ(·) be the sigmoid function. We generate the gated fused hidden states {h′_j}_{j=1}^n as in Eq. 4.
Afterward, the fused hidden states are passed to a language model (LM) head layer, where they are transformed into vectors modeling the probabilities of each word in the vocabulary.
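The gated fusion above can be sketched as follows. A plain sigmoid gate is shown for clarity; the paper's refined gate (Gu et al., 2020) changes how the gate score g_j is computed to avoid saturation, which we do not reproduce here, and all sizes and parameter values are toy assumptions of ours.

```python
import numpy as np

def gate_fuse(H, z_K, W_g, W_z):
    """Fuse the latent code z_K into the decoder's last-layer hidden states:
    h'_j = g_j * h_j + (1 - g_j) * (W_z z_K). Plain sigmoid gate; the refined
    gate of Gu et al. (2020) would replace how g is computed."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    g = sigmoid(H @ W_g)               # one gate score per position, shape (n, 1)
    z_proj = W_z @ z_K                 # project z_K to the hidden size d
    return g * H + (1.0 - g) * z_proj  # broadcast z_proj over positions

rng = np.random.default_rng(0)
n, d, ell = 5, 8, 4                    # toy sequence length, hidden, latent dims
H = rng.normal(size=(n, d))            # decoder's last-layer hidden states
W_g, W_z = rng.normal(size=(d, 1)), rng.normal(size=(d, ell))
H_fused = gate_fuse(H, rng.normal(size=ell), W_g, W_z)
```

The fused states H_fused would then feed the LM head in place of H.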

Training Objective
Traditional VEDs usually assume q(z | x; ψ) to be a Gaussian, allowing analytical computation of the KL term in the ELBO. However, in our normalizing flows-based VED, the variational posterior q(z | x) = q_K(z_K | x) can be complex, and hence the KL term in Eq. 3 lacks an analytical form. Therefore, we rewrite the ELBO via a change of variables to enable analytical evaluation:

ELBO = E_{q_0}[log p(y | x, z_K; θ)] − E_{q_0}[log q_0(z_0 | x) − Σ_{k=1}^{K} log |det J_{f_k}(z_{k−1})| − log p(z_K | x)]  (5)

where q_0 is z_0's probability density function, a Gaussian distribution modeled by NNs, and det J_{f_k}(·) is the determinant of f_k's Jacobian. Let L_CE denote the cross-entropy loss and L_VI denote the loss introduced by the variational latent module. Applying the idea of Monte Carlo to Eq. 5, we obtain the training objective L = L_CE + L_VI. Note that L_VI is a Monte Carlo estimate of the KL divergence between the variational posterior q_K and the conditional prior distribution p(z_K | x).
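The Monte Carlo estimate L_VI amounts to the computation sketched below; the single scaling flow and standard normal prior are toy stand-ins of ours for the paper's NF layers and conditional prior p(z_K | x).

```python
import numpy as np

def variational_loss(z0, mu0, log_sigma0, flows, log_prior):
    """One-sample estimate of L_VI = KL(q_K || p) via the change of variables:
    log q_K(z_K) = log q_0(z_0) - sum_k log|det J_{f_k}|, so
    L_VI = log q_0(z_0) - sum_k log|det J_{f_k}| - log p(z_K | x)."""
    s = np.exp(log_sigma0)
    log_q0 = -0.5 * np.sum(((z0 - mu0) / s) ** 2
                           + np.log(2 * np.pi) + 2 * log_sigma0)
    z, log_det_sum = z0, 0.0
    for f in flows:                    # each flow returns (z_next, log|det J|)
        z, ld = f(z)
        log_det_sum += ld
    return log_q0 - log_det_sum - log_prior(z)

# toy setup: one volume-doubling scaling flow, standard normal prior
scale_flow = lambda z: (2.0 * z, z.size * np.log(2.0))
std_normal_logpdf = lambda z: -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))
z0 = np.zeros(3)
L_vi = variational_loss(z0, np.zeros(3), np.zeros(3),
                        [scale_flow], std_normal_logpdf)
# total objective: L = L_CE + L_vi, with L_CE the usual cross-entropy loss
```

A single-sample estimate like this can be negative even though the true KL is nonnegative; averaging over samples reduces that variance.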

Mitigating Posterior Collapse
To remedy posterior collapse, we consider two strategies, aiming to preserve the expressiveness of the latent variable and improve the overall summary quality. The first approach, called β_C-VAE (Prokhorov et al., 2019), replaces the KL term with β|KL − C|, where β is a scaling factor and C ≥ 0 is a threshold that regulates the magnitude of the KL term. When C > 0, the KL term is discouraged from getting close to 0.
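The β_C-VAE surrogate is simple enough to state directly; the snippet below uses the paper's settings β = 1 and C = 0.1 as defaults.

```python
def beta_c_kl_term(kl, beta=1.0, C=0.1):
    """beta_C-VAE surrogate for the KL term: beta * |KL - C|. With C > 0,
    a collapsed posterior (KL = 0) still pays a penalty of beta * C, so the
    optimizer is discouraged from driving the KL term to zero."""
    return beta * abs(kl - C)

collapsed = beta_c_kl_term(0.0)   # collapse is penalized: 0.1
on_target = beta_c_kl_term(0.1)   # no penalty exactly at the threshold: 0.0
```

Note that the penalty is symmetric: KL values above C are pulled down toward C as well.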
We propose the second approach, Controlled Alternate Aggressive Training (CAAT), inspired by the lagging inference strategy (He et al., 2019). This strategy uses the observation that the inference network cannot accurately approximate the true posterior in the initial stages of training. As outlined in Alg. 1 in Appendix A, CAAT comprises two stages. In the first stage, we alternately update the variational parameters and the entire parameters for a specified number of steps. In the second stage, we train all parameters jointly, as in basic VAE training, for the remainder of the training.
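The two-stage schedule can be sketched as below. This is our reading of the description above (the exact alternation pattern and step accounting follow Alg. 1 in Appendix A, which may differ in detail); the update functions are placeholders for optimizer steps.

```python
def caat_train(n_agg, n_alt, total_steps, update_variational, update_all):
    """Sketch of Controlled Alternate Aggressive Training (CAAT).
    Stage 1: for n_agg rounds, take n_alt variational-only updates, then
    one update of all parameters. Stage 2: joint training for the rest."""
    step = 0
    while step < n_agg:                   # stage 1: aggressive phase
        for _ in range(n_alt):
            update_variational()          # only the variational parameters move
        update_all()                      # then one step on all parameters
        step += 1
    while step < total_steps:             # stage 2: basic joint training
        update_all()
        step += 1

# trace the schedule with logging stand-ins for the two optimizer steps
log = []
caat_train(2, 3, 5, lambda: log.append("vi"), lambda: log.append("all"))
```

After the call, the log shows three "vi" steps per "all" step during stage 1, followed by "all"-only steps in stage 2.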

NF-enhanced Knowledge Distillation
Normalizing flows can learn complex and multimodal distributions (Papamakarios et al., 2017), which makes them a promising approach for knowledge distillation tasks that involve integrating information from multiple sources (Hinton et al., 2015). To investigate the impact of normalizing flows on knowledge distillation, we adopt two knowledge distillation methods by Shleifer and Rush (2020): Shrink and Fine-Tune (SFT) and Pseudo-labels (PL). SFT shrinks the teacher model and re-finetunes the shrunk model. In contrast, the PL method initializes the student model with the compressed version produced by SFT and then fine-tunes it using the pseudo-labeled data generated by the teacher model. In this study, we fine-tune the model on the augmented data with both original and pseudo-labeled data, enabling it to more effectively switch between generated summaries and ground truth, thereby mitigating exposure bias.
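The two distillation recipes can be sketched as helper functions. The maximally-even layer-spacing rule is one common choice for SFT-style shrinking and may differ from the exact layers copied in the paper; the PL helper just builds the augmented training set described above.

```python
def shrink_layer_indices(n_teacher, n_student):
    """Which teacher layers to copy when shrinking a stack (SFT-style).
    Maximal even spacing is a common recipe, not necessarily the paper's."""
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]

def make_pl_training_set(pairs, teacher_summarize):
    """Pseudo-labels (PL) with augmentation: keep the original (document,
    summary) pairs and append (document, teacher summary) pairs."""
    return pairs + [(doc, teacher_summarize(doc)) for doc, _ in pairs]

# e.g. shrinking a 12-layer decoder to 3 layers copies layers 0, 6, 11
layers = shrink_layer_indices(12, 3)
augmented = make_pl_training_set([("doc1", "ref1")],
                                 lambda d: "pseudo " + d)
```

Fine-tuning on `augmented` exposes the student to both ground-truth and teacher-generated targets, which is the exposure-bias mitigation noted above.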

Implementation Details
We configure the inference net q_0(z_0 | x) to be a feed-forward neural network and set the latent dimension ℓ to 300 and the number of NF layers K ∈ {2, 4, 6, 8}. For models that use β_C-VAE, we set β = 1 and C = 0.1, and for those using CAAT, we conduct one epoch of aggressive training with n_alt = 15 and two epochs of non-aggressive training. See more details in Appendix G.
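Putting the pieces together, the latent-module configuration can be sketched as below. The latent dimension ℓ = 300 follows the paper; the vocabulary size, embedding size, hidden width, and weight scales are toy assumptions of ours, and the K flow layers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, e = 50, 16                     # toy vocab and embedding sizes (ours)
ell = 300                         # latent dimension, as in the paper

E = rng.normal(size=(V, e))                      # input embeddings
b = rng.integers(1, 4, size=V).astype(float)     # BoW counts of the source text
x_bar = (b @ E) / b.sum()                        # average source embedding

# feed-forward inference net q0(z0 | x): one hidden layer, two output heads
W1 = rng.normal(size=(e, 64)) * 0.1
W_mu = rng.normal(size=(64, ell)) * 0.1
W_ls = rng.normal(size=(64, ell)) * 0.01
h = np.tanh(x_bar @ W1)
mu0, log_sigma0 = h @ W_mu, h @ W_ls

# reparameterized sample z0 ~ N(mu0, diag(sigma0^2)); the K NF layers
# (omitted here) would then transform z0 into the latent code z_K
z0 = mu0 + np.exp(log_sigma0) * rng.normal(size=ell)
```

In the full model, z_K is what the gate mechanism mixes into the decoder's hidden states.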

Automatic Evaluation
We evaluate the generated summary quality using ROUGE scores (Lin, 2004) and BERTScore (Zhang et al., 2020b). Specifically, we utilize the overlap of unigrams and bigrams (ROUGE-1 and ROUGE-2) to evaluate informativeness, and the longest common subsequence (ROUGE-L) for fluency. Moreover, we report BERTScore, which gauges semantic similarity based on contextual embeddings. Furthermore, we present rep-w (Fu et al., 2021) and the average length of summaries to gain a better understanding of the quality.
We compare the proposed model against baseline models in ROUGE scores in Tables 2 and 3. On CNN/DM, FlowSUM (BERT2BERT) greatly outperforms BERT2BERT, whereas VEDSUM adds noise to the model and leads to a decrease in performance. With the BART backbone, FlowSUM achieves an absolute improvement over the BART model with +0.48, +0.08, and +0.75 in R-1, 2, and L scores, respectively. However, on XSum, where the gold summaries involve only one sentence, the variational models do not perform well. VEDSUM leads to a significant decrease in performance, whereas with FlowSUM, the decrease in ROUGE scores is less severe, yielding +0.12, -0.15, and -0.25 in R-1, 2, and L scores, respectively.
Table 4 uses BART as the backbone and compares BART, VEDSUM, and FlowSUM across all datasets. Overall, variational models produce summaries of superior quality for datasets with long summaries, such as CNN/DM, Multi-News, arXiv, and PubMed, and FlowSUM further enhances the performance beyond VEDSUM. However, when it comes to datasets featuring short summaries such as XSum and SAMSum, the variational component markedly diminishes the model performance.

On NF-enhanced Knowledge Distillation
We use PEGASUS as the teacher model to generate pseudo-labels on the CNN/DM training set. In this study, we explore the effects of knowledge distillation on BART and DistilBART, a shrunken version of BART. We examine two variations of DistilBART: dBART-6-6, which replicates 6 layers of the BART encoder and decoder, and dBART-12-3, which duplicates all layers of the BART encoder and 3 layers of the decoder. Fine-tuning the BART model on augmented data worsens the performance compared to training on the original data. In contrast, VEDSUM-PLKD achieves improvements in all three ROUGE scores, and FlowSUM-PLKD with RQNSF achieves the highest R-2 score, albeit with some sacrifice in R-1 and R-L. However, planar flows appear to be unsuitable for knowledge distillation via PL. To better understand FlowSUM-PLKD, we visualize the latent distribution (see Appendix I) and demonstrate how the NF's ability to capture multi-modality could account for its impressive performance. Table 6 investigates the two DistilBART variants with RQNSF. With FlowSUM, both variants achieve improvements, suggesting that NF is beneficial for the SFT approach. Previous experiments by Shleifer and Rush (2020) showed that PL performed worse than SFT on CNN/DM. However, our experiments reveal that the NF latent module unleashes the potential of PL. When trained on augmented data, FlowSUM-PLKD (dBART-6-6) achieves R-1/2/L improvements of 0.92/0.47/1.01 over dBART-6-6, and FlowSUM-PLKD (dBART-12-3) achieves improvements of 0.66/0.49/0.63 over dBART-12-3, much more than the SFT approach. Furthermore, FlowSUM does not introduce additional computational burden at inference, and the time cost is primarily related to the length of the generated summaries.

Analysis on NF Types and Depth
We investigate the effect of NF types and the number of NF layers on the Multi-News dataset. Table 7 explores the effect of NF types. Simple flows like planar and radial flows yield inferior performance compared to the VAE counterpart, whereas more complex flows tend to achieve greater improvements. Overall, IAF and RQNSF emerge as the best-performing NF types. Table 8 delves further into IAF and RQNSF, investigating the effect of NF depth. The findings indicate that adding more layers does not always lead to improved performance. We hypothesize that when the encoder-decoder model is well-trained, the increased complexity of the NF module may introduce more noise, outweighing the benefits of better latent modeling and subsequently worsening the summary quality. CAAT, in turn, is only effective for models whose KL divergence is not close to zero. Nonetheless, when applicable, CAAT enhances the quality of summaries, particularly when utilized with the top-performing NFs, namely IAF and RQNSF.
In addition, we explore the impact of gate score initialization. The standard method initializes gating weights with small deviations from zero, resulting in an initial gate score close to 0.5. In contrast, the near-zero initialization method initializes gating weights such that the resulting gate score is approximately zero.

FlowSUM illustrates the advantages of incorporating flexible latent modeling. Considering the remarkable achievements of Latent Diffusion Models (LDMs) in generating images (Rombach et al., 2022), adopting LDMs to capture latent representations may produce comparable or even superior outcomes in text summarization. In this scenario, the gating mechanism may not be an appropriate choice; a direct correlation between the latent vector and the target text may be more suitable for executing the diffusion process. Enhancing the architecture to leverage diffusion models could be a potential avenue for future research.

Limitations
FlowSUM has demonstrated excellent results on datasets with long summaries. However, its performance on short-summary datasets like XSum and SAMSum has been unsatisfactory. The underlying cause could be attributed to suboptimal hyperparameter tuning or the incompatibility of FlowSUM with short summaries. Additional investigations are needed to identify the root cause. Furthermore, we did not fine-tune the hyperparameters of the normalizing flows model, such as the latent dimension, the number of bins in spline coupling layers, and the neural network in IAF, RealNVP, RLNSF, and RQNSF. Moreover, we opted for a small batch size due to memory limitations. Adjusting these hyperparameters could potentially enhance the model's performance.
Due to limited computational resources, we utilized BART and BERT2BERT as the backbone models instead of newer architectures. Further research may focus on verifying the effectiveness of FlowSUM on more advanced structures.

Ethics Statement
Our research entailed developing a new text summarization framework. Although no private data were utilized, we acknowledge the potential societal impacts of our work. Therefore, we adhered to pertinent ethical guidelines and implemented rigorous procedures to guarantee the accuracy of our results.
One might expect that allowing the model more freedom to learn, even if the NF latent module is not helpful, will not harm performance. However, our experiments suggest that this assumption does not hold, particularly for short-summary datasets, where the model does not learn on its own to avoid hurting the original performance. The CAAT strategy allows us to effectively freeze the encoder-decoder parameters by setting n_agg and n_alt to large values, ensuring that when the NF module is unhelpful, it will not significantly harm performance.

B Deeper Dive into the Evidence Lower Bound (ELBO)
Within the VED framework, the conditional data generation process can be expressed as follows:

p(y | x; ϕ, θ) = ∫ p(z | x; ϕ) p(y | x, z; θ) dz

The subsequent challenge revolves around parameter estimation. Typically, the conditional latent prior is assumed to be p(z | x; ϕ) = N(0, I) for simplification (hence eliminating the ϕ parameter). Despite this, the likelihood p(y | x; θ) remains computationally intractable to evaluate. Variational inference tackles this issue by introducing a variational distribution q(z | x, y; ψ) from a specific parametric family, aiming to approximate the actual posterior p(z | x, y). Here, θ denotes the model parameters, and ψ refers to the variational parameters. Instead of attempting to estimate θ solely through maximizing the intractable log-likelihood, the approach involves joint estimation of both θ and ψ by optimizing the ELBO.
Examining Eqs. 7 and 8, it is evident that the ELBO represents a lower bound of the log-likelihood. Moreover, a smaller value of KL(q(z | x, y) ∥ p(z | x, y)) indicates a closer alignment between the variational posterior and the true posterior, thereby bringing the ELBO closer to the log-likelihood. This insight propels the adoption of normalizing flows to model a flexible family of variational posteriors.
where q_0 and q_K are the probability density functions of z_0 and z_K, respectively.
C Discussion on q(z | x, y) = q(z | x)

We choose to assume q(z | x, y) = q(z | x) for the following reasons. Firstly, this assumption is grounded in the nature of summarization: y can be viewed as a condensed form of x, so it is sensible to assume all the information in y is contained in x. Secondly, as evidenced by Zhang et al. (2016), it is plausible to condition the posterior on both x and y. However, their approach suffers from difficulties during prediction: the target text y is not accessible, making it hard to sample from q(z | x, y). Zhang et al. (2016) suggest taking the prior's mean as the latent code, but in our paper, the prior is a Gaussian whereas the posterior is a complex distribution modeled by normalizing flows, and taking such a strategy would diminish the benefit of using normalizing flows. Thirdly, it has been shown empirically by Eikema and Aziz (2019) that restricting the conditioning of the posterior to x alone achieves higher accuracy. Therefore, we adopt q(z | x, y) = q(z | x) as our modeling strategy.

D Repetition Measures
Let s represent a sentence in a result set D, |s| be the number of tokens in s, s_t be the t-th token, and s_{i:j} be the sub-sequence of s from the i-th token to the j-th token. The rep-w measure (Fu et al., 2021) is then defined by Eq. 10.

E Datasets

Multi-News (Fabbri et al., 2019) is a multi-document dataset comprising 56k pairs of news articles and multi-sentence summaries.

arXiv and PubMed (Cohan et al., 2018) are two scientific paper datasets from arXiv.org (113k) and PubMed (215k). Each pair consists of a scientific article's body document and its abstract.
SAMSum (Gliwa et al., 2019) includes 16k conversations annotated with summaries by linguists. Unlike structured texts, the information in dialogues is scattered across different speakers' utterances, increasing the summarization difficulty.
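The rep-w measure defined in Appendix D can be sketched as the fraction of tokens that also occur within the preceding w tokens; this follows the spirit of Fu et al. (2021), though the exact normalization in Eq. 10 may differ slightly.

```python
def rep_w(tokens, w=4):
    """Fraction of tokens in a sentence that also appear among the
    preceding w tokens (a sketch of rep-w, Fu et al., 2021)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in range(len(tokens))
               if tokens[t] in tokens[max(0, t - w):t])
    return hits / len(tokens)

# "the" at position 2 repeats within the window of 4 -> 1 of 4 tokens
score = rep_w(["the", "cat", "the", "dog"])   # 0.25
```

Lower rep-w indicates less local repetition in the generated summaries.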

F Baseline Models
PG+Cov (See et al., 2017) is a pointer-generator (PG) network supplemented with a coverage mechanism; the pointer addresses the out-of-vocabulary problem and the coverage mechanism minimizes word repetition.
BERT2BERT (Rothe et al., 2020) initializes both the encoder and the decoder with the pre-trained BERT checkpoints and adds cross-attention layers.
BERTSUM (Liu and Lapata, 2019) builds on top of BERT and applies a fine-tuning scheduler to better align the encoder and the decoder.
BART (Lewis et al., 2020) is a pretrained denoising autoencoder with the standard sequence-to-sequence architecture.

Furthermore, we terminate training early when the perplexity fails to improve for eight or sixteen consecutive evaluation calls.

G.3 Model Hyper Parameters
Table 11 provides the hyper-parameters for the models discussed in Tables 4-7, for the sake of reproducibility. To ensure fair comparisons, unless otherwise specified, the VEDSUM models employ the same set of hyper-parameters as their FlowSUM counterparts, except with standard training and no NF layers applied. Additionally, the models in Table 8 have the same hyper-parameters as those in Table 7, except for the number of NF layers used. Lastly, in Table 9, all FlowSUM models use 4 NF layers and the same set of hyper-parameters as those in Table 7 but vary in their training strategies.

H Experiments on Training Strategies and Gate Initialization
The training curves for the methods in Table 10 are illustrated in Figure 2. The plot demonstrates that the gate score decreases gradually yet remains high during aggressive training when CAAT is combined with standard initialization. This combination compels the model to utilize the latent code information effectively. Moreover, as presented in Figure 2c, even though CAAT combined with standard initialization starts with a high perplexity, it achieves a lower perplexity level than other approaches by the end. By examining the training procedure in detail, Figure 3 further indicates that CAAT contributes to greater training stability than standard training.

I Visualization of Latent Distribution

J Normalizing Flows
Planar flow Proposed by Rezende and Mohamed (2015), the planar flow can be expressed as in Eq. 11. It applies contractions or expansions in the direction perpendicular to the hyperplane w^⊤z + b = 0. Its Jacobian determinant can be computed in O(D) time as in Eq. 12, using the matrix determinant lemma. In addition, we note that this flow is not invertible for all values of u and w. When the derivative of the activation function h′(·) is positive and bounded from above, w^⊤u > −1 / sup_x h′(x) is sufficient to ensure invertibility.
where u, w ∈ R^D and b ∈ R are free parameters, and h(·) is a smooth element-wise non-linear activation function with derivative h′(·).
Radial flow The radial flow (Tabak and Turner, 2013; Rezende and Mohamed, 2015) takes the form of Eq. 13. It applies radial contractions and expansions around a reference point. Similar to the planar flow, we can apply the matrix determinant lemma to calculate the Jacobian determinant in O(D) time, as in Eq. 14. To guarantee invertibility, we usually require β > −α.
where z_0 ∈ R^D is the reference point, β ∈ R and α ∈ R_+ are free parameters, r = ∥z − z_0∥ is the norm of z − z_0, and h(α, r) = 1/(α + r).

Sylvester flow The Sylvester flows (van den Berg et al., 2018) generalize the planar flows to have M hidden units, as in Eq. 15. To achieve better computational efficiency, van den Berg et al. (2018) propose the parameterization in Eq. 16, with which the Jacobian determinant reduces to Eq. 17 and can be computed in O(M). Similar to the planar flows, when h′(·) is positive and bounded from above, R̃_{ii} R_{ii} > −1 / sup_x h′(x) for all i ∈ {1, . . . , M} is sufficient to ensure invertibility, where h(·) is an element-wise activation function.
where R and R̃ are upper triangular M × M matrices, and Q = (q_1 . . . q_M) consists of an orthonormal set of vectors.
Autoregressive Flows The masked autoregressive flow (MAF) (Papamakarios et al., 2017) was motivated by MADE (Germain et al., 2015), an autoregressive model for density estimation. MAF generalizes the conditional distribution to be Gaussian and generates data recursively as in Eq. 18. Given a data point x, the inverse transformation can be performed in parallel as in Eq. 19. The Jacobian of the inverse transformation is lower-triangular by design due to the autoregressive structure, hence its absolute determinant can be expressed as in Eq. 20. The functions {f_{μ_i}, f_{α_i}} are autoregressive neural networks following the approach in MADE, where μ_i = f_{μ_i}(x_{1:i−1}), α_i = f_{α_i}(x_{1:i−1}), and u_i ∼ N(0, 1).
Likewise, the inverse autoregressive flow (IAF) (Kingma et al., 2016) uses MADE with Gaussian conditionals and generates data as in Eq. 21. Its Jacobian determinant has a simple form, as in Eq. 22.
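A matching sketch of the IAF direction, again with hypothetical masked linear maps: because µ and α depend on the noise u rather than on x, every dimension of x is produced in one parallel pass, and the lower-triangular Jacobian gives log|det| = Σ_i α_i:

```python
import numpy as np

def iaf_sample(u, W_mu, W_alpha):
    """IAF-style generation x = u * exp(alpha(u)) + mu(u), fully parallel.

    W_mu and W_alpha must be strictly lower-triangular so that mu_i and
    alpha_i depend only on u_{1:i-1}; then dx_i/du_i = exp(alpha_i) and
    log|det dx/du| = sum_i alpha_i.
    """
    mu, alpha = W_mu @ u, W_alpha @ u
    return u * np.exp(alpha) + mu, alpha.sum()
```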
The main difference between IAF and MAF lies in the history variables: MAF uses the previous data variables x_{1:i−1} to compute µ_i and α_i, whereas IAF uses the previous noise variables u_{1:i−1}. In terms of sampling and density evaluation, IAF can sample in parallel but must evaluate densities sequentially, whereas MAF must sample sequentially but can evaluate in parallel. Since sampling efficiency matters more in variational inference, we choose IAF in this paper.

Affine Coupling The affine coupling layer, proposed in NICE (Dinh et al., 2015) and later generalized in RealNVP (Dinh et al., 2017), takes the form of Eq. 23. Its Jacobian determinant can be efficiently computed as det J = exp(Σ_j s(x_{1:d})_j). Since the computation does not involve the Jacobian of s or t, these two functions can be made arbitrarily complex, and we can use neural networks to model them. Coupling layers are usually composed with permutation layers to ensure that every component gets modified; since the Jacobian determinant of a permutation is 1, the overall Jacobian determinant remains tractable.
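The coupling mechanics can be sketched as follows; the split point d and the toy s and t below are hypothetical stand-ins for the neural networks:

```python
import numpy as np

def coupling_forward(x, d, s, t):
    """Affine coupling (NICE/RealNVP style): x[:d] passes through unchanged;
    x[d:] is scaled and shifted by functions of x[:d] only.
    log|det J| = sum_j s(x[:d])_j, no matter how complex s and t are."""
    y = x.copy()
    log_scale = s(x[:d])
    y[d:] = x[d:] * np.exp(log_scale) + t(x[:d])
    return y, log_scale.sum()

def coupling_inverse(y, d, s, t):
    """Exact single-pass inverse: s and t are re-evaluated on y[:d] = x[:d]."""
    x = y.copy()
    x[d:] = (y[d:] - t(y[:d])) * np.exp(-s(y[:d]))
    return x
```

Stacking several such layers with permutations in between lets every coordinate be transformed while keeping the determinant a simple sum.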
Spline Coupling Neural spline flows (Durkan et al., 2019; Dolatabadi et al., 2020) use monotonic rational-quadratic splines or monotonic rational-linear splines as the coupling transformation to achieve more flexibility while remaining differentiable and invertible. The monotonic rational-quadratic spline uses K + 1 monotonically increasing knots {(x^(k), y^(k))}_{k=0}^{K} to set up K bins, each defined by a monotonically increasing rational-quadratic function.^17 It maps [−B, B] to [−B, B] and defines the transformation outside this range to be the identity. Let s_k = (y^(k+1) − y^(k)) / (x^(k+1) − x^(k)) and ξ(x) = (x − x^(k)) / (x^(k+1) − x^(k)); the rational-quadratic function in the k-th bin then takes the form of Eq. 24, and the Jacobian determinant of the rational-quadratic neural spline flows (RQNSF) can be written as in Eq. 25.
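For intuition, a single monotonic rational-quadratic bin can be evaluated and differentiated in closed form, following the parameterization of Durkan et al. (2019); the knot values and positive knot derivatives used below are hypothetical:

```python
import numpy as np

def rq_bin(x, xk, xk1, yk, yk1, dk, dk1):
    """One rational-quadratic bin mapping [xk, xk1] -> [yk, yk1].

    dk, dk1 > 0 are the spline derivatives at the two knots. Returns the
    mapped value and the analytic derivative dy/dx (both monotone by
    construction when the derivatives are positive).
    """
    sk = (yk1 - yk) / (xk1 - xk)   # bin slope s_k
    xi = (x - xk) / (xk1 - xk)     # relative position xi(x) in the bin
    denom = sk + (dk1 + dk - 2.0 * sk) * xi * (1.0 - xi)
    y = yk + (yk1 - yk) * (sk * xi**2 + dk * xi * (1.0 - xi)) / denom
    dydx = sk**2 * (dk1 * xi**2 + 2.0 * sk * xi * (1.0 - xi)
                    + dk * (1.0 - xi)**2) / denom**2
    return y, dydx
```

The derivative at each knot equals the supplied d_k, which is how adjacent bins are glued together differentiably.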
The rational-linear neural spline flows (RLNSF) work similarly, except with monotonically increasing rational-linear functions in each bin. Neural splines combine the best of autoregressive flows and coupling layers (such as NICE and RealNVP) in that they have both an analytic single-pass inverse and sufficient flexibility, as demonstrated in Durkan et al. (2019).

^17 A rational-quadratic function is defined as the quotient of two quadratic polynomial functions.

K Example Analysis
In this section, we analyze several instances from CNN/Daily Mail and XSum, showcasing diverse outcomes generated by different summarization models.

Original Text (truncated): It looks like an ordinary forest, with moss climbing up the walls and brown leaves covering the floor. But if you look closely, you will see that this picture is not all it seems. For the peaceful scene actually features a carefully painted female model. The amazing illusion is the work of German body-painting artist Joerg Duesterwald, who spent hours painting his model so she would blend in with her surroundings. The stunning set of pictures was taken in a forest in Langenfeld, Germany, yesterday. Mr Duesterwald has been painting for more than 20 years.

Gold Summary:
The illusion is the work of German body-painting artist Joerg Duesterwald, who spent hours painting his model. Stunning set of pictures was taken in front of a rockface in a forest in Langenfeld, Germany, yesterday.
BART: Stunning set of images was taken in a forest near Langenfeld, Germany, yesterday by body-painting artist Joerg Duesterwald. It looks like an ordinary forest, with moss climbing up the walls and brown leaves covering the floor. But, if you look closely, you will see that this picture is not all it seems. For the peaceful scene actually features a carefully painted female model.

VEDSUM:
The stunning set of pictures was taken in a forest in Langenfeld, Germany, yesterday. It looks like an ordinary forest, with moss climbing up the walls and brown leaves covering the floor. But, if you look closely, you will see that this picture is not all it seems. For the peaceful scene actually features a carefully painted female model.
FlowSUM: Amazing illusion is the work of German body-painting artist Joerg Duesterwald. He spent hours painting his model so she would blend in with surroundings. Stunning set of pictures was taken in a forest in Langenfeld, Germany, yesterday.

Original Text (truncated): UFC light heavyweight champion Jon Jones ran from a crash that hospitalised a pregnant woman - but quickly came back to grab 'a large handful of cash' from the car, witnesses told police. According to police, the accident occurred in southeastern Albuquerque just before noon on Sunday local time when the driver of a rented SUV jumped a red light. The driver, whom an off-duty officer identified as Jones, ran from the scene but then returned for the cash before fleeing again, police said. 'Witnesses stated he shoved the cash into his pants and ran north jumping the fence,' the report said. Officers found a pipe with marijuana in the vehicle as well as MMA and rental car documents in Jones' name, according to the police report. Police were searching for UFC champion Jon Jones in connection with a hit-and-run accident. Albuquerque police were seeking an arrest warrant for Jones on Monday. They said he would likely face a felony charge of leaving the scene of an accident since the woman broke her arm in the crash. Police said in a news release they'd been unable to reach Jones or his lawyer. However, Jones handed himself in later the same day, with TMZ reporting he was being held at Bernalillo County Metropolitan Detention Center. According to the warrant, the pregnant woman told police she was driving when she was hit by a silver Buick SUV. • • • Although he is widely considered the world's best pound-for-pound mixed martial artist, Jones has endured legal problems and questionable behaviour as champion.
Gold Summary: UFC light heavyweight champion Jon Jones ran from a crash that hospitalised a pregnant woman, witnesses told police. According to police, the accident occurred in Albuquerque just before noon on Sunday when the driver of a rented SUV jumped a red light. The driver, whom an off-duty officer identified as Jones, ran from the scene but then returned for the cash before fleeing again, police said. Jones is widely considered the best pound-for-pound mixed martial artist.
BART: Albuquerque police were seeking an arrest warrant for Jones on Monday. They said he would likely face a felony charge of leaving the scene of an accident since the woman broke her arm in the crash. However, Jones handed himself in later the same day, with TMZ reporting he was being held at Bernalillo County Metropolitan Detention Center.
VEDSUM: UFC light heavyweight champion Jon Jones ran from a crash that hospitalised a pregnant woman. Witnesses said he returned for 'a large handful of cash' from the car. Albuquerque police were seeking an arrest warrant for Jones on Monday. They said he would likely face a felony charge of leaving the scene of an accident since the woman broke her arm in the crash. Jones handed himself in later the same day.
FlowSUM: UFC light heavyweight champion Jon Jones ran from a crash that hospitalised a pregnant woman. Witnesses said he came back to grab 'a large handful of cash' from the car, witnesses told police. The driver, whom an off-duty officer identified as Jones, ran from the scene but then returned for the cash before fleeing again, police said. Officers found a pipe with marijuana in the vehicle as well as MMA and rental car documents in Jones' name, according to the police report.

Original Text (truncated): The Office for National Statistics said industrial output fell 0.7% compared with January, when it dropped 0.3%. Unexpectedly warm weather drove the change, because it led to a fall in electricity and gas demand, the ONS said. Construction output fell by 1.7% in February, down from a revised January reading of zero growth. The construction figure, the biggest drop in nearly a year, was mainly the result of a 2.6% fall in the housebuilding sector. Meanwhile, the UK's deficit in goods and services widened to £3.7bn in February, from a revised figure of £3bn in January. According to the ONS, the deficit was fuelled by what it called "erratic items", such as imports of gold and aircraft. "The overall trade deficit worsened, but excluding erratic items, the picture improved, as imports fell more than exports," said ONS senior statistician Kate Davies. Howard Archer, chief UK and European economist at IHS Markit, called the figures "a disappointing package of data for the UK economy which fuels suspicion that GDP growth slowed markedly, largely due to consumers becoming more cautious". He added: "We suspect UK GDP growth in the first quarter of 2017 slowed to 0.4% quarter-on-quarter from 0.7% quarter-on-quarter in the fourth quarter of 2016 - this would be the weakest growth rate since the first quarter of 2016."

Gold Summary: Activity in the UK's industrial and construction sectors shrank in February, new figures show.
BART: UK industrial output fell for the second month in a row in February, official figures have shown.
VEDSUM: Industrial output in the UK fell for the second month in a row in February, official figures have shown.
FlowSUM: Activity in the UK's industrial and construction sectors shrank in February, according to official figures.

Original Text (truncated): In December, the government announced finalised plans for a cull, initially in pilot areas, as a way to curb the spread of tuberculosis in cattle. In applying for judicial review, the Badger Trust says culling will not stop TB and may in fact help spread it. Other campaign groups are considering action under the Bern Convention, which protects European wildlife. The government's plans are likely to result in farmers funding contractors to shoot badgers in a number of areas of England, with two initial pilots in west Gloucestershire and west Somerset taking place later this year. "We have identified some serious flaws in the way by which the Secretary of State [Caroline Spelman] reached her decision to cull badgers," said Gwendolen Morgan of Bindmans solicitors, lawyer for the Badger Trust. "Given that Defra's proposals come at an enormous cost to farmers, and threaten to prompt rather than prevent the spread of disease, we hope that this ill-conceived decision will be struck down by the court." She pointed to government projections that culling would reduce TB incidence by 12-16% over nine years.

Gold Summary:
The Badger Trust has launched a new legal challenge to the government's plans to cull badgers in England.

BART:
The Badger Trust has launched a legal challenge to the government's plans to cull badgers in England.

VEDSUM:
The Badger Trust is taking legal action against the Department for Environment, Food and Rural Affairs (Defra) over plans to cull badgers in England.

FlowSUM:
The Badger Trust has launched a legal challenge to the UK government's plans to cull badgers in England and Wales.

Original Text (truncated): The response from many in that time has been: "Let's get on with it." That view was shared by the First Minister Carwyn Jones until recently when he altered his opinion and said that we should only start the official Brexit negotiations in the early part of next year. My sense is that the public will be flexible on the timing up to a point, as long as they are given a clear sense of direction. The majority of the political establishment have had to come to terms with the fact that most people ignored their advice to remain. So much for being in touch with the electorate. In conversations with politicians on the remain side since, I have come across a mix of bewilderment, frustration and sadness. And while people like me spend a lot of time talking and writing about a Welsh political dynamic, on this subject at least, Wales was a carbon copy of England. In stark contrast, those that supported leaving feel vindicated by their campaign, and now believe they are the ones in touch with vast swathes of the population. The referendum result was a devastating indictment of the effectiveness of the billions of pounds of EU funds spent trying to regenerate economically deprived communities. The brutal reality is that those who were most likely to vote to leave lived in communities where most EU money had been spent. It is an extraordinary paradox that raised eyebrows far further afield than Wales.

Gold Summary: It has been a month since Wales voted to leave the European Union.
BART: It has been more than a year since the UK voted to leave the European Union.
VEDSUM: It has been a year since the EU referendum result, and in that time I have spent a great deal of time talking to politicians on both sides of the political spectrum about what they think about Brexit.
FlowSUM: Since the referendum result on 23 June, I have spent a lot of time talking about the implications for Wales and the Welsh political establishment.
To gain a better understanding of how normalizing flows contribute to knowledge distillation, we selected several examples from the CNN/Daily Mail and XSum datasets and visualized the resulting latent distributions generated by the FlowSUM-PLKD model, as shown in Figures 4 and 5. In both cases, the transformed latent code z_K exhibited a highly flexible distribution. Notably, in the CNN/Daily Mail example, the first dimension of the second example demonstrated a clear bi-modal distribution, indicating the model's ability to capture information from multiple sources. Similarly, in the XSum examples, we observed distinct multi-modal patterns.

Figure 2: Comparison of training strategies and gate initialization.
Figure 3: A closer look at the training process: CAAT vs. Standard Training.

Figure 4: Visualization of the first two dimensions of z_0, z_K, and N(0, I) by FlowSUM-PLKD on CNN/DM. The right sub-figure demonstrates a clear bi-modality.

Figure 5: Visualization of the first two dimensions of z_0, z_K, and N(0, I) by FlowSUM-PLKD on XSum. Both sub-figures demonstrate distinct multi-modal patterns.
y_{1:d} = x_{1:d}, y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}) (23)

where s : R^d → R^{D−d} and t : R^d → R^{D−d} are the scale and translation functions, respectively, and ⊙ is the element-wise product.

Table 1: Statistics of Summarization Datasets. Refer to Appendix E for more details.

Table 2: Comparison with baselines on CNN/DM.

Table 4: Comparison of BART, VEDSUM (BART), and FlowSUM (BART) on all six benchmarks.

Table 5 presents the impact of the PL approach on the original BART model. Training the BART

Table 7: Effect of NF Types on Multi-News.

Table 8: Effect of Number of NF Layers on Multi-News.

4.4.4 Analysis on Training Strategies

We implement standard VAE training, β_C-VAE, and CAAT on the VEDSUM and FlowSUM models, and we evaluate their effectiveness with different types of normalizing flows. Table 9 shows that VEDSUM and FlowSUM models with residual flows, including planar, radial, and Sylvester flows, suffer from posterior collapse, whereas those with more complex flows do not. Moreover, applying β_C-VAE to VEDSUM and FlowSUM models with residual flows does not effectively mitigate posterior collapse and may even exacerbate the issue. Furthermore, for models with planar, RealNVP, and IAF flows, training with β_C-VAE worsens ROUGE scores, while for radial and Sylvester flows, it improves performance. Notably, the two neural spline flows are not affected by β_C-VAE training. Concerning CAAT, we note that applying it to treat severe posterior collapse, such as in VEDSUM and FlowSUM with residual flows, can cause training instability and produce NaN values.

Table 9: Effect of Training Strategies.