Semantic-Preserving Abstractive Text Summarization with Siamese Generative Adversarial Net


2 Anhui Province Key Laboratory of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China.

Abstract
We propose a novel siamese generative adversarial net for abstractive text summarization (SSPGAN), which can preserve the main semantics of the source text. Different from previous generative adversarial net based methods, SSPGAN is equipped with a siamese semantic-preserving discriminator, which is trained not only to discriminate machine-generated summaries from human-written ones, but also to ensure the semantic consistency between the source text and the target summary. As a consequence of the min-max game between the generator and the siamese semantic-preserving discriminator, the generator produces summaries that convey the key content of the source text more accurately. Extensive experiments on several text summarization benchmarks in different languages demonstrate the effectiveness of the proposed method.

Introduction

Source: The software and information technology service industry in Chengdu has maintained the momentum of rapid development in recent years, ranking first among the cities in the central and western regions, and has become the "Silicon Valley" in the west of our country. "The 2013 Chengdu Software and Information Technology Service Industry Development Report" was released a few days ago... For details, please see: @Chengdu Daily @Chengdu post
Reference: Chengdu strives to build the Western "Silicon Valley"
Generated: Chengdu releases software and information technology service industry development report

Abstractive text summarization endeavors to produce a concise and fluent summary for a given text, while maintaining the key content and overall meaning. Previous attempts tackle this problem with either rule-based or statistical methods. Recently, following the successes obtained on the machine translation task (Sutskever et al., 2014; Sheng et al., 2020), the neural sequence-to-sequence framework has also been applied to abstractive text summarization. Specifically, the sequence-to-sequence architecture consists of an encoder responsible for transforming the source sequence x = {x_1, x_2, ..., x_{T_x}} into an intermediate representation, and a decoder that generates a target sequence y = {y_1, y_2, ..., y_{T_y}} from that intermediate representation. Furthermore, to dynamically generate a context vector for the target word being generated, the attention mechanism (Bahdanau et al., 2014; Luong et al., 2015) was proposed to strengthen sequence-to-sequence models; it enables the model to focus on the relevant parts of the source-side sequence. Based on the encoder-decoder framework, many variants of model structures have been proposed, such as those built on convolutional neural networks (CNN) and recurrent neural networks (RNN) (Bahdanau et al., 2014; Gehring et al., 2017). With the emergence of Transformer (Vaswani et al., 2017), which is based entirely on the attention mechanism, state-of-the-art performance has been achieved on many sequence-to-sequence tasks.

Nevertheless, for abstractive text summarization, one of the dominant challenges is to maintain saliency, which requires the generated summary to convey the important information accurately. As shown in Figure 1, the key content of the source text, "Chengdu become the 'Silicon Valley' in the west of our country", is accurately summarized in the reference, while the generated summary expresses the unimportant content "Chengdu releases software and information technology service industry development report".
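To make the notion of a dynamically generated context vector concrete, the following is a minimal sketch of one attention step. It assumes plain dot-product scoring (in the spirit of Luong et al., 2015) rather than any specific scoring function used in this paper, and the function names `softmax` and `attention_context` are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(query, encoder_states):
    """Return (weights, context) for one decoder step.

    Assumption: the relevance of each source position is the dot product
    between the decoder query and the encoder state at that position.
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of the encoder states.
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context
```

A summary that lacks saliency corresponds, in this picture, to weights that concentrate on source positions carrying unimportant content.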
Intuitively, the lack of saliency in summarization is usually caused by attending to the wrong parts of the source text, which has inspired many methods that optimize the attention mechanism for higher accuracy. Among them, Lin et al. (2018) propose a global encoding framework, which controls the attention information flow from the encoder to the decoder based on the global information of the source context. Gui et al. (2019) propose an effective method to regularize the attention weights from both global and local aspects. Duan et al. (2019) introduce a novel attention mechanism in which the attention weights on relevant parts of the source side are encouraged, while the attention weights on less relevant or irrelevant parts are discouraged, with a softmax and a softmin function respectively. However, these methods generally overlook the underlying nature of saliency, which is the sentence-level semantic consistency between the source text and the generated summary.
To explicitly maintain the semantic consistency, we propose a novel Siamese Semantic-Preserving Generative Adversarial Net (SSPGAN) for abstractive text summarization. Different from conventional adversarial training (Goodfellow et al., 2014), which mainly focuses on generating more realistic data, SSPGAN introduces a training paradigm that generates summaries which are more semantically consistent with the source text. Specifically, the proposed model consists of two adversarial modules which play a min-max game:
• A conventional neural encoder-decoder based generator, which aims to generate the summary sequence based on the input text.
• A siamese semantic-preserving discriminator. Different from the conventional discriminator in a generative adversarial net (GAN), in addition to distinguishing the real summary from the generated summary, it is also required to capture the semantic consistency between the source text and the target summary. We adopt a pseudo siamese net to achieve that. Specifically, we aim to maximize the semantic similarity for a real sentence pair (text, real summary), while minimizing it for a generated sentence pair (text, generated summary).
During the training process, with respect to both authenticity and semantic consistency with the input source text, the generator aims to fool the discriminator into believing that its output is a human-generated summary, while the discriminator tries not to be fooled by improving its ability to distinguish the machine-generated summary from the human-generated one. This kind of adversarial training achieves a win-win situation when the generator and the discriminator reach a Nash Equilibrium (Zhao et al., 2016; Arora et al., 2017; Guimaraes et al., 2017).
Different from conventional GANs, which assume the existence of a generator in a continuous space, the text summarization model in our framework is not a typical generative model, but a probabilistic transformation that maps a source text to a target summary, both in a discrete space. To this end, we turn to a policy gradient method named REINFORCE (Williams, 1992), which allows both sub-models to be effectively optimized in an adversarial manner. In addition to the conventional reward, which is the estimated probability of the generated summary being classified as real, we also adopt the semantic similarity between the source text and the generated summary as a supplementary reward signal. Besides, we employ Transformer (Vaswani et al., 2017) as the basis of our discriminator to capture both the global and local features of the sentence.
The contributions of this work are three-fold:
• We propose a siamese net based discriminator to ensure the semantic consistency between the generated summary and the source text.
• We propose a generative adversarial net based entirely on Transformer. As far as we know, this work is the first attempt to apply such a framework to the text summarization task.
• Experimental results on both English and Chinese text summarization datasets show that the proposed model outperforms conventional GAN-based methods. We also demonstrate from multiple perspectives that the proposed method can maintain semantic consistency.

Related Work
Automatic text summarization can be broadly divided into extractive and abstractive summarization. Extractive methods simply extract important parts of the source text and reorganize them in a certain order (Jing and McKeown, 2000; Knight and Marcu, 2000; Neto et al., 2002). In comparison, abstractive text summarization is closer in principle to the process of manual summarization, which extracts the essential information of the source text and describes it in a shorter version as the abstractive summary. In this paper, we focus on abstractive text summarization.
Previous works on abstractive text summarization are mainly based on statistical and rule-based methods (Banko et al., 2000; Dorr et al., 2003; Zajic et al., 2004; Cohn and Lapata, 2008). Recently, the sequence-to-sequence neural framework has become predominant in abstractive text summarization (Chopra et al., 2016; Nallapati et al., 2016; Li et al., 2017b). Later on, with the advent of Transformer (Vaswani et al., 2017), more and more works have chosen it as the base model in their frameworks.
For the abstractive text summarization task, out-of-vocabulary (OOV) words, repetitions and lack of saliency are three dominant challenges. To tackle the OOV problem, some works introduce the pointer network and copy mechanism (Nallapati et al., 2016; See et al., 2017; Gu et al., 2016; Paulus et al., 2017). On the issue of repetitions, See et al. (2017) adopt a coverage mechanism, which is inspired by the coverage vector from neural machine translation (Tu et al., 2016). Regarding saliency, some works (Duan et al., 2019; Gui et al., 2019) focus on how to optimize the attention mechanism, while Zhu et al. (2021) try to enhance the factual consistency with a fact corrector. Meanwhile, Narayan et al. (2021) adopt content planning to improve the performance of abstractive summarization models. However, the essence of saliency, which is the sentence-level semantic consistency between the source text and the generated summary, is intuitive yet usually overlooked.
The proposed training principle is based on adversarial learning (Goodfellow et al., 2014). In conventional adversarial training, a generator and a discriminator compete with each other, forcing the generator to produce high-quality samples that can fool the discriminator. Adversarial training typically excels in image generation (Goodfellow et al., 2014), with fewer applications in natural language processing tasks (Yu et al., 2017; Li et al., 2017a), mainly due to the difficulty of propagating the signals from the discriminator to the generator through the discretely generated tokens. Yu et al. (2017) address this issue with a reinforcement learning approach for sequence generation. Thus, the adversarial training paradigm can improve the model at the sentence level instead of the vanilla token level (e.g., maximum likelihood estimation).
To address the semantic inconsistency problem mentioned above, we introduce the paradigm of the siamese net into GANs. A siamese net is a class of neural network architectures that contain two or more sub-networks, which are identical or different depending on whether the inputs are similar or not. A siamese net is generally used to measure the similarity between the inputs by comparing their corresponding output feature vectors, and can be broadly divided into two types: the true siamese net and the pseudo siamese net. The true siamese net contains identical sub-networks which share the same architecture and network parameters, while the pseudo siamese net contains sub-networks which have different parameters and even different architectures. Among the existing works, Kenter et al. (2016) are the first to adopt the siamese net for unsupervised sentence embedding learning. Mueller and Thyagarajan (2016) propose MaLSTM to learn sentence similarities with the Manhattan distance. Neculoiu et al. (2016) treat similarity matching of a sentence pair as a binary classification task and replace the Manhattan distance with cosine similarity. Recently, Reimers and Gurevych (2019) introduce the principle of the siamese net to fine-tune BERT (Devlin et al., 2019) for better sentence embeddings.
Different from the previous GAN-based abstractive text summarization model of Liu et al. (2018), by incorporating the siamese net into GANs, our generator can produce summaries which are more semantically consistent with the source texts. As far as we know, this work is the first attempt to apply the siamese net to the GAN-based sequence-to-sequence generation task.

Siamese Semantic-Preserving Generative Adversarial Net
In this section, we introduce the architecture of the proposed Siamese Semantic-Preserving Generative Adversarial Net (SSPGAN) in detail. The model consists of two main components. The first component is a standard Transformer-based summary generator G (Figure 2). During adversarial training, the generator G is treated as an agent taking sequential actions (i.e., generating words) and is trained using policy gradient given the reward of each generated word. The second component is a siamese network based discriminator D, which is also implemented based on the Transformer. On the one hand, the discriminator D is required to distinguish the generated summary from the real one. On the other hand, it aims to capture the semantic similarity between the source text and the target summary. Specifically, it is expected to maximize the semantic similarity for the real sentence pair (text, real summary), while minimizing the semantic similarity of the generated pair (text, generated summary). From these two perspectives, we compute a composite reward for each generated summary. The generator G and the discriminator D are trained iteratively. Figure 3 shows the overview of the adversarial training framework.

Figure 2: The summary generator, taking a conventional Transformer based encoder-decoder architecture (Vaswani et al., 2017), where the predicted word from the previous step serves as the input of the current step during inference. We omit some layers for brevity.
In the following, we describe the generator G and the siamese semantic-preserving discriminator D in detail.

Generator
At time step t, the generator G takes an action (i.e., a word y_t) according to a stochastic policy π_θ(y_t | x, y_{t−1}), where x is the input source text, y_{t−1} = [y_1, ..., y_{t−1}] is the previously generated partial summary, and θ is the parameter of the policy. We utilize the conventional Transformer based encoder-decoder framework (Vaswani et al., 2017) as the model of the policy. By sequentially generating each word y_t using the policy π_θ(·) until the end, a complete sentence y is generated. In conventional sequence-to-sequence learning, the model is trained to minimize the cross-entropy loss (1), accumulated over all text-summary pairs and all positions of each ground-truth summary, where N is the number of text-summary pairs, T_n is the length of the ground-truth summary ŷ^n, Loss is the cross-entropy loss, and ŷ^n_{t−1} and ŷ^n_t are the ground-truth partial summary and word, respectively. Nevertheless, in adversarial training, there is no explicit supervised information for computing the cross-entropy loss. Hence, we adopt our discriminator D to assess the quality of the generated complete summary y^n. Specifically, the discriminator D is responsible for calculating a reward using the generated summary y^n and the source text x^n (see Section 3.3 for details).
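As an illustration of the supervised objective described above, the sketch below computes a teacher-forced cross-entropy loss from per-step output distributions. It is a minimal example under stated assumptions: the distributions are given as plain lists of probabilities, the loss is averaged over the N pairs (the normalization is an assumption), and the names `sequence_cross_entropy` and `corpus_loss` are hypothetical.

```python
import math


def sequence_cross_entropy(step_distributions, reference_ids):
    """Cross-entropy of one summary: sum of -log p(reference word) over its positions.

    step_distributions[t] is the policy's distribution over the vocabulary at
    step t, computed while conditioning on the ground-truth prefix.
    """
    return -sum(
        math.log(dist[token]) for dist, token in zip(step_distributions, reference_ids)
    )


def corpus_loss(batch):
    """Average the per-summary loss over N (distributions, reference) pairs.

    Averaging rather than summing over pairs is an assumption of this sketch.
    """
    return sum(sequence_cross_entropy(d, r) for d, r in batch) / len(batch)
```

During adversarial training no reference word is available for a sampled summary, which is why the reward from the discriminator replaces this loss.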

Siamese Semantic-Preserving Discriminator
Our discriminator D aims not only to distinguish the real summary from the generated one, but also to capture the semantic similarity between the source text and the target summary. The discriminator D is implemented based on the Transformer, as Transformer is capable of capturing both local and global sentence features. To capture the semantic similarity, the whole framework of the discriminator D is designed based on the siamese net (right panel in Figure 3). We are given the source text x = {x_1, x_2, ..., x_{T_x}} and the target summary y = {y_1, y_2, ..., y_{T_y}} (here y represents both the real and the generated summary for simplicity), where x_t and y_t are the t-th words in the corresponding sequences. The source text sequence x is taken as the input of the Transformer encoder. After the processing of the Transformer blocks, a hidden state sequence h_x = {h_{x_1}, h_{x_2}, ..., h_{x_{T_x}}} is produced, where h_{x_t} is the hidden state corresponding to x_t in the input sequence. Thus, h_{x_t} contains not only the positional information, but also the global and local correlation information. To get the final feature representation f_x for the input sequence, a mean-pooling operation is applied over the output hidden state sequence h_x. Since the textual structures of the source text and the target summary differ, for the target summary we adopt the same encoder framework as the one for the source text, but share no parameters (i.e., a pseudo siamese net). The corresponding final feature representation f_y is also obtained using the mean-pooling operation. Finally, given both the source text and the target summary, the probability that the target summary is classified as real is calculated from the concatenation of f_x and f_y, where V is the weight matrix that transforms the concatenation of f_x and f_y into a 2-dimensional embedding and σ is the logistic function. The training objective for discriminating the real summary from the generated one is then formulated as a supervised classification objective, where N is the number of text-summary pairs, ϕ denotes the model parameters of the discriminator D, and l_n is the corresponding label (i.e., 0 for the generated summary and 1 for the real summary).

Figure 3: Overview of the model. Left panel: our generator G produces a summary conditioned on the source text. At each time step, the expected reward of a newly generated word ("earthquake" in the presented example) is computed from the siamese semantic-preserving discriminator D using Monte Carlo rollout. We use policy gradient to update the generator G toward generating summaries with higher rewards. Right panel: the discriminator D observes the generated summary and aims at distinguishing it from the real one. Besides, the discriminator D is responsible for capturing the semantic consistency between the source text and the target summary. During adversarial training, both the generator G (left) and the discriminator D (right) are iteratively updated to improve.
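The classification branch of the discriminator can be sketched as follows. This is not the paper's implementation: the encoder outputs are passed in as plain lists, the two rows of `V` are assumed to correspond to the "generated" and "real" classes, and reducing the 2-dimensional output to a single probability through the logistic of the logit difference is an assumption of this example. All function names are hypothetical.

```python
import math


def mean_pool(hidden_states):
    """Average a sequence of hidden-state vectors into one feature vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[d] for h in hidden_states) / n for d in range(dim)]


def real_probability(h_x, h_y, V):
    """Probability that the summary is real, from pooled source/summary features.

    V has two rows (assumed order: generated, real), each as long as the
    concatenation of f_x and f_y.
    """
    features = mean_pool(h_x) + mean_pool(h_y)  # list concatenation [f_x; f_y]
    logits = [sum(w * f for w, f in zip(row, features)) for row in V]
    # Logistic of the logit difference equals the softmax probability of "real".
    return 1.0 / (1.0 + math.exp(-(logits[1] - logits[0])))


def classification_loss(probs, labels):
    """Binary cross-entropy over N pairs; label 1 = real, 0 = generated."""
    total = 0.0
    for p, l in zip(probs, labels):
        total -= l * math.log(p) + (1 - l) * math.log(1.0 - p)
    return total / len(probs)
```

In a pseudo siamese setup, `h_x` and `h_y` would come from two encoders with the same architecture but separate parameters.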
To capture the semantic similarity between the source text and the target summary, we further utilize the final features of the pseudo siamese net. Specifically, we aim to maximize the semantic similarity between the source text and the real summary, while minimizing it for the pair of the source text and the generated summary. To this end, we adopt the cosine function to evaluate the similarity of the sentence pair, S_cos(x, y) = (f_x · f_y) / (‖f_x‖ ‖f_y‖), whose value ranges from −1 to 1. Next, we obtain the contrastive loss L_sim for siamese semantic similarity learning, where N is the number of text-summary pairs, l_n is the corresponding summary label (i.e., 1 for the real sentence pair and 0 for the generated sentence pair), and L_+ and L_− are the corresponding loss functions for the real and generated sentence pairs, respectively. Combining L_sim with the classification objective gives the final objective (8) of the siamese semantic-preserving discriminator D, where η is a hyper-parameter that balances the two sub training objectives.
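The similarity branch can be illustrated with a small sketch. The cosine similarity is standard; everything else here is an assumption made for illustration: the positive-pair loss `1 - s`, the negative-pair loss `max(0, s)`, and the convex combination `eta * L_cls + (1 - eta) * L_sim`. The last choice is at least consistent with the ablation settings reported later, where η = 1.0 corresponds to the basic GAN and η = 0 to the similarity objective alone.

```python
import math


def cosine_similarity(f_x, f_y):
    """Cosine of the angle between two feature vectors; lies in [-1, 1]."""
    dot = sum(a * b for a, b in zip(f_x, f_y))
    norm = math.sqrt(sum(a * a for a in f_x)) * math.sqrt(sum(b * b for b in f_y))
    return dot / norm


def similarity_loss(pairs):
    """Contrastive loss over (f_x, f_y, label) triples; label 1 = real pair.

    Assumed sub-losses: 1 - s pulls real pairs together, max(0, s) pushes
    generated pairs apart.
    """
    total = 0.0
    for f_x, f_y, label in pairs:
        s = cosine_similarity(f_x, f_y)
        loss_pos = 1.0 - s
        loss_neg = max(0.0, s)
        total += label * loss_pos + (1 - label) * loss_neg
    return total / len(pairs)


def discriminator_objective(l_cls, l_sim, eta=0.7):
    """Assumed weighting of the classification and similarity objectives."""
    return eta * l_cls + (1.0 - eta) * l_sim
```

With this weighting, `eta` close to 1 makes the discriminator behave like a conventional real/generated classifier, and smaller values shift its capacity toward modelling text-summary similarity.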

Policy Gradient Training
Following Yu et al. (2017), during adversarial training the goal of the generator G is to generate a summary sequence from the start state that maximizes its expected overall reward. Formally, this defines the objective function J_adv(θ), where θ denotes the parameters of G, y_{1:T} = {y_1, ..., y_T} is the generated target summary, and x is the source text. Here we denote T_y as T for simplicity. R^{G_θ}_{D_ϕ} is the action-value function of the generated summary given the source text x (i.e., the expected accumulative reward starting from the state (y_{1:T−1}, x), taking action y_T, and following the policy G_θ). To estimate the action-value function, we combine the probability of being classified as real by the discriminator D with the cosine similarity as the total reward (10), where b(x, y_{1:T}) denotes the baseline value used to reduce the variance of the reward (in practice, we set it to 0.5 during training) and λ is a hyper-parameter that balances the two terms. It is worth noting that (10) only defines a reward value for a completely generated summary. If y_{1:T} is only partially generated, the values of D_ϕ(x, y_{1:T}) and S_cos(x, y_{1:T}) are meaningless. To evaluate the action-value for an intermediate state, we apply Monte Carlo (MC) tree search under the policy G_θ to sample the remaining unknown tokens. Each search lasts until the end-of-summary token is sampled or the sampled summary reaches the maximum length. For a more stable reward and lower variance, we conduct a K-time roll-out, where T_i denotes the length of the summary sampled by the i-th Monte Carlo search, (y_{1:t}, x) is the current state, and y^i_{t+1:T_i} is sampled based on the policy G_θ. Accordingly, the discriminator provides K rewards for the K sampled summaries. The final reward for the intermediate state is computed as the average of the K rewards. Thus, for a generated summary of length T, the final reward for each y_t is computed at the sentence level.

Using the discriminator D as a reward function can further improve the generator iteratively by dynamically updating D. Once we have a set of more realistic generated summaries, we re-train the discriminator by minimizing (8). Each time a new discriminator is obtained, we re-train the generator. The gradient of the objective J_adv(θ) w.r.t. the generator's parameters θ is obtained with the policy gradient method.
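The reward computation described above can be sketched in a few lines. This is an illustrative example, not the paper's code: the form `lam * (p_real - baseline) + (1 - lam) * cos_sim` is an assumed reading of the total reward (consistent with λ = 1.0 being the basic GAN setting in the ablation), and `sample_continuation` and `score` are hypothetical callbacks standing in for the generator's sampler and the discriminator.

```python
def total_reward(p_real, cos_sim, lam=0.7, baseline=0.5):
    """Assumed composite reward for a complete summary."""
    return lam * (p_real - baseline) + (1.0 - lam) * cos_sim


def rollout_reward(prefix, sample_continuation, score, k=20):
    """Average reward of k sampled completions of a partial summary.

    sample_continuation(prefix) returns the sampled remaining tokens;
    score(summary) returns (p_real, cos_sim) for a complete summary.
    """
    rewards = []
    for _ in range(k):
        full_summary = prefix + sample_continuation(prefix)
        p_real, cos_sim = score(full_summary)
        rewards.append(total_reward(p_real, cos_sim))
    return sum(rewards) / k


def reinforce_surrogate(log_probs, rewards):
    """Sum of log-probability times reward over the generated words.

    Differentiating this quantity w.r.t. the policy parameters (with the
    rewards held constant) yields the REINFORCE gradient estimate.
    """
    return sum(lp * r for lp, r in zip(log_probs, rewards))
```

Averaging over `k` rollouts trades extra sampling cost for a lower-variance estimate of the reward of each intermediate word.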

Adversarial Training
The overall training flow of SSPGAN is shown in Figure 3. The generator G and the siamese semantic-preserving discriminator D learn together by pursuing competing goals. Given x, the generator G generates a summary y. It prefers summaries with larger rewards, which implies larger values of s_reality and s_similarity. In contrast, the discriminator D encourages smaller values of s_reality and s_similarity. Thus, the generator G and the siamese semantic-preserving discriminator D play a min-max game (see Algorithm 1 in Appendix A.2 for more details).

Datasets
We conduct extensive experiments on both Chinese and English text summarization datasets. The Chinese dataset is LCSTS, a large corpus of Chinese short text summarization (Hu et al., 2015), which is collected from Sina Weibo, a famous Chinese social media website. Following the data split of previous works, we obtain around 2.4M text-summary pairs for training, 10K pairs for validation and 725 pairs with an annotation score of no less than 3 for testing. For English text summarization, we use the Gigaword dataset based on Annotated Gigaword (Napoles et al., 2012), and preprocess it identically to Rush et al. (2015), which results in 3.8M sentence pairs for training, 190K for validation and around 1.9K for testing.

Evaluation Metrics
For a fair comparison with previous works, we adopt ROUGE (Lin, 2004) as the automatic evaluation metric. ROUGE measures the degree of overlap between the generated summary and the reference in terms of n-grams. We report ROUGE-1 (unigram), ROUGE-2 (bigram) and ROUGE-L (longest common subsequence, LCS) on the testing set for our quantitative experiments. Since the official ROUGE evaluation package is only available for English summarization, to evaluate the models on the Chinese summarization task we follow Hu et al. (2015) and map all characters, including punctuation and numbers, to numerical IDs, and then conduct the evaluation on them. In the experiments, we denote ROUGE as RG for simplicity.
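For readers unfamiliar with the metric, the following is a simplified sketch of F1-based ROUGE-N and ROUGE-L together with the character-to-ID mapping used for Chinese text. It is not the official ROUGE package: it handles a single reference, uses a plain F1 (no β weighting), and applies no stemming or stop-word handling. The helper names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """F1 over clipped n-gram matches."""
    c, r = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((c & r).values())  # Counter intersection clips the counts
    return f1(overlap, sum(c.values()), sum(r.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """F1 based on the longest common subsequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


def to_ids(text, vocab):
    """Map every character (including punctuation and digits) to an integer ID."""
    return [vocab.setdefault(ch, len(vocab)) for ch in text]
```

For Chinese, both the generated summary and the reference would be passed through `to_ids` with a shared `vocab` dictionary before calling `rouge_n` or `rouge_l`.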

Compared Models
Baselines for the Chinese text summarization task include the following. RNN and RNN-context are two RNN-based models adopted in Hu et al. (2015), without and with the attention mechanism respectively. CopyNet leverages the copy mechanism to alleviate the OOV problem (Gu et al., 2016). RNN-MRT (Shen et al., 2016) and Actor-Critic (Li et al., 2018) are two sentence-level training methods that address the problems of teacher forcing with maximum likelihood estimation. DRGD (Li et al., 2017b) uses a recurrent latent random model to strengthen the abstractive text summarization model. GlobalEncoding (Lin et al., 2018) controls the information flow from the encoder to the decoder based on the source-side global information.
As for the English dataset, besides DRGD and Actor-Critic, we choose the following baselines. ABS and ABS+ are two pioneering methods using neural networks for abstractive text summarization (Rush et al., 2015). Concept-pointer+DS encourages abstractive summarization models to generate new conceptual words (Wang et al., 2019).
Our model is implemented based on Tensor2Tensor. For all experiments, SSPGAN is run with 5 random seeds on 2 NVIDIA V100 GPUs, and the final automatic results are reported as means (see Appendix A.1 for more details).

English Results
Table 1 shows the results on the English dataset Gigaword. The results of the baselines are reported in the upper rows, while the bottom row summarizes the results of the proposed SSPGAN. Introducing the SSPGAN framework to Transformer significantly improves the performance, which supports the effectiveness of our method.

Chinese Results
The experimental results on the Chinese dataset LCSTS are presented in Table 2. As can be observed from the comparison between the baselines in the upper rows and SSPGAN in the bottom row, the proposed method achieves the best performance. In addition, the proposed SSPGAN brings significant improvements to the classical baseline Transformer. Specifically, Transformer is improved in ROUGE-1/2/L with gains of +1.88/+1.09/+1.20.

Analysis
In this section, we analyze the effectiveness of the proposed method from multiple perspectives.All the experiments are conducted on Gigaword.

Ablation Study
As shown in Table 3, we analyze the contributions of the different sub training objectives proposed in (8) and (10). On the Transformer model, the basic GAN (i.e., the second row with η=1.0 and λ=1.0) achieves gains of +0.43/+0.51/+0.49 in ROUGE scores. We also test the results when Transformer is only guided by the semantic similarity objective (i.e., the fourth row with η=0 and λ=0), resulting in gains of +0.21/+0.29/+0.30. With the proposed SSPGAN (i.e., the third row with η=0.7 and λ=0.7), the performance improves further, with gains of +0.74/+0.99/+0.91 in ROUGE scores.

Human Evaluation
To further evaluate the quality of the generated summaries, we randomly select 50 test examples from the Gigaword testing set for human evaluation. For each example, we show the source text, the ground-truth summary and the summaries generated by different models. The human evaluators do not know which summary comes from which model or which one is the ground truth. Two scores from 1 to 10 are assigned to each summary (1 and 10 indicate the worst and the best respectively), one for readability (how well-written the summary is) and one for consistency (how well the summary conveys the key content of the source text). Each summary is rated by 10 invited human evaluators who read English proficiently. The results are averaged across all selected examples and evaluators. As shown in Table 4, the two settings equipped with the basic GAN objective (i.e., the second row with η=1.0 and λ=1.0 and the third row with η=0.7 and λ=0.7) improve the readability significantly and to a comparable degree. As for the consistency, our proposed model (i.e., the third row with η=0.7 and λ=0.7) achieves the highest score, which supports the claim that the proposed method preserves the key content of the source text more accurately. It is worth noting that the improvements of the fourth row, which is only equipped with the siamese similarity objective, are limited. Without the basic GAN objective, the improvement in readability is limited, which leads to incomplete expression of the sentence semantics and in turn hurts the improvement in consistency.

System                R     C
Transformer           6.39  6.65
+SSPGAN (η, λ=1.0)    7.09  6.82
+SSPGAN (η, λ=0.7)    7.06  7.33
+SSPGAN (η, λ=0)      6.64  6.78

Table 4: Comparison of human evaluation on a random subset of the Gigaword testing set. We denote the readability and consistency as R and C, respectively. The best results are bold.

Source: malaysia's national car maker proton expects to export its cars to russia by early next year to boost its overseas sales, a company official said tuesday
Reference: malaysian carmaker proton seeks inroads into russia by early next year
GAN: malaysia's car maker to boost overseas sales
SSPGAN: malaysia's proton to export cars to russia

Source: chinese vice-premier wu yi said tuesday that the country should step up efforts to develop its service trade in a bid to alter the growth pattern of foreign trade and increase employment and domestic consumption.

Case Study
Figure 4 shows some examples of the generated summaries on the English dataset, in which both the basic GAN and the proposed SSPGAN produce readable results. However, as shown in the highlights of the SSPGAN examples, the proposed method conveys the key content of the source text more accurately, resulting in more salient summaries as expected. Specifically, in the upper example, the key content "expects to export its cars to russia" in the source text is only expressed by SSPGAN, while the basic GAN generates "boost overseas sales", ignoring the most relevant information. Similar behavior can also be observed in the bottom example.

Conclusion
This paper presents a novel siamese generative adversarial net (SSPGAN) which can preserve the semantic consistency between the source text and the target summary for abstractive text summarization. In SSPGAN, a novel semantic similarity based reward is introduced to augment GAN-based abstractive text summarization, so as to preserve the semantic consistency and convey the key content of the source text. It is worth noting that SSPGAN addresses the problem of saliency in text summarization from the different perspective of semantic consistency. It is therefore orthogonal to state-of-the-art methods which focus on the attention mechanism, and can be applied to them for further improvements.
Inverse square root learning rate decay is applied, with an initial warm-up and annealing of 4000 steps. For the discriminator, we adopt the RMSProp optimizer with a learning rate of 0.0005 and η = 0.7. The dropout rate is set to 0.3 for both models. During adversarial training, the learning rate is set to 0.00001 for both models without changing the optimizer. K in the Monte Carlo rollout is set to 20 and λ is 0.7.
In the proposed architecture, there are 2 hyper-parameters, η and λ, that need to be jointly tuned during training. We conduct a grid search to find a proper combination of these hyper-parameters. For both η and λ, the value is selected from the set [0.1, 0.3, 0.5, 0.7, 0.9], and we experimentally find that η = 0.7 and λ = 0.7 give the best results on the validation sets.
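The joint search over the two hyper-parameters amounts to evaluating 25 combinations. A minimal sketch is shown below; `validate` is a hypothetical callback that would train a model with the given `(eta, lam)` pair and return its validation score (for example a ROUGE value), which is not specified in the text.

```python
from itertools import product

# Candidate values for both eta and lambda, as listed above.
GRID = [0.1, 0.3, 0.5, 0.7, 0.9]


def grid_search(validate, grid=GRID):
    """Return the (eta, lam) pair with the highest validation score."""
    return max(product(grid, grid), key=lambda pair: validate(*pair))
```

Since every combination requires a full adversarial training run, the cost of this search grows quadratically with the number of candidate values.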

A.2 Pseudo Code
Algorithm 1 Siamese Semantic-Preserving GAN
Require: generator G_θ, siamese semantic-preserving discriminator D_ϕ, a text summarization dataset S = (x, ŷ)
1: Initialize G_θ, D_ϕ with random weights θ, ϕ
2: Pre-train G_θ using (1) on S
3: Generate negative summaries y with G_θ for training D
4: Pre-train D_ϕ using (8) on the combination of (x, y) and S
5: while G_θ not converged do
6:   for g-steps do
     ...
     end for
17: end while
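The alternating structure of the outer loop can be sketched as follows. Because the inner steps of the algorithm are not fully available here, this is only a skeleton under assumptions: a block of discriminator updates after the generator updates is assumed (as is common in SeqGAN-style training), and `generator_step`, `discriminator_step` and `converged` are hypothetical callbacks.

```python
def adversarial_training(generator_step, discriminator_step, converged,
                         g_steps=1, d_steps=1, max_rounds=100):
    """Alternate generator and discriminator updates until convergence.

    generator_step: one policy-gradient update of the generator.
    discriminator_step: one update of the discriminator on real and
        freshly generated summaries.
    converged: returns True when training should stop.
    Returns the number of completed rounds.
    """
    rounds = 0
    while not converged() and rounds < max_rounds:
        for _ in range(g_steps):
            generator_step()
        for _ in range(d_steps):
            discriminator_step()
        rounds += 1
    return rounds
```

Pre-training of both models (steps 1 to 4 of the algorithm) would happen before this loop is entered.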

Figure 1: The case of lacking saliency in abstractive text summarization. Bold text represents the key content, while the underlined parts represent the unimportant content.

Figure 4: Comparison of the summaries generated by the basic GAN and the proposed SSPGAN. Bold text represents that the correct contents are extracted, while the underlined parts correspond to the wrong ones.

Table 1: The full-length F-1 based ROUGE scores on the testing set of the English benchmark Gigaword. Here we bold the best results.

Table 2: The full-length F-1 based ROUGE scores on the testing set of the Chinese benchmark LCSTS. Here we bold the best results.

Table 3: Ablation study regarding the sub training objectives proposed in (8) and (10). We bold the best results.