EnsLM: Ensemble Language Model for Data Diversity by Semantic Clustering

Natural language processing (NLP) often faces the problem of data diversity: samples may differ in domain, theme, style, and so on. A single language model (LM) is therefore insufficient to learn all the knowledge in such diverse samples. To address this problem, we first propose an autoencoding topic model with a mixture prior (mATM) to cluster the data, where the clusters defined in semantic space describe the data diversity. Having obtained a clustering assignment for each sample, we develop the ensemble LM (EnsLM) using weight modulation. Specifically, EnsLM contains a backbone that is adjusted by a few modulated weights to fit different sample clusters. As a result, the backbone learns the knowledge shared among all clusters, while the modulated weights extract cluster-specific features. EnsLM can be trained jointly with mATM with a flexible LM backbone. We evaluate the effectiveness of both mATM and EnsLM on various tasks.


Introduction
It is common knowledge in modern natural language processing (NLP) that natural language varies greatly across domains, themes, styles, genres, and many other linguistic dimensions (van der Wees et al., 2015; van der Wees, 2017; Niu et al., 2017). We refer to this property of language as data diversity. Many existing works (Cai and Wan, 2019; Hu et al., 2019) have shown that data diversity degrades the performance of LMs if a single LM is trained over the entire dataset, even when fine-tuning an LM such as BERT (Devlin et al., 2019), which has been pretrained on a very large corpus, on the current task (Aharoni and Goldberg, 2020).

* Equal contribution. † Corresponding author.

Domain diversity is a very common type of data diversity. When a well-defined domain label is available for each sample, some works (Jiang et al., 2020; Du et al., 2020; Wright and Augenstein, 2020) exploit the multi-domain property of the data in developing LMs. However, such pre-defined domain labels are not always accurate or even available (Aharoni and Goldberg, 2020), especially for wild datasets in which data come from different sources, such as internet news, product reviews, and daily conversation. We therefore aim to develop an LM that can explore the diversity of the data automatically.
Data selection is a commonly used strategy to handle diversity in data (Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013; Silva et al., 2018; Aharoni and Goldberg, 2020). This kind of method rests on the assumption that samples belonging to the same cluster share similar characteristics. According to the clustering assignment, a model can select suitable data and train a separate LM for each cluster. Although data selection alleviates the problem of data diversity to some extent, it has two disadvantages. First, the data selection process is independent of LM learning: the gradient signal from the LM's training loss cannot affect the data selection. Second, data selection yields only hard cluster assignments, ignoring the fact that some samples may belong to more than one cluster with soft (weighted) assignments.
Inspired by these works and aiming to move beyond them, we find in this paper that the semantics learned by topic modeling (Blei et al., 2003; Srivastava and Sutton, 2017) can infer sample clusters to a certain extent via K-means, but not well enough, as shown in Fig. 1a. To consider clustering and topic modeling jointly for better clustering (as shown in Fig. 1b) and for joint training with the downstream LM, we first introduce an autoencoding topic model with a mixture prior (mATM). For each sample in the corpus, mATM infers a soft clustering assignment. To learn mATM jointly with various LMs, we employ weight modulation (Cong et al., 2020; Wen et al., 2020). Specifically, as shown in Fig. 3, given an LM as backbone, we introduce a few modulation parameters for each (convolutional or fully-connected) layer. Guided by the clustering assignments inferred by mATM, these parameters modulate the single backbone LM into multiple LMs, one per cluster. Our proposed model can therefore be seen as a type of ensemble learning, and hence we call it the ensemble language model (EnsLM).
Our proposed mATM and EnsLM enjoy the following distinguished properties: • The mATM learns the mixture-prior latent semantic space to define a soft clustering assignment for each sample.
• Guided by clustering assignments that describe the data diversity, EnsLM learns both shared and cluster-specific knowledge by weight modulations.
• Joint training of mATM and EnsLM improves the performance of both on many NLP tasks.

Related work
For NLP, topic models (TMs) (Blei et al., 2003; Zhou et al., 2012) and LMs are two common regimes, each with its own advantages. TMs can discover interpretable global semantics, namely topics, while LMs, pretrained on large corpora, recently achieve state-of-the-art performance on many NLP tasks with more focus on local dependencies. Therefore, some works combine them to obtain the benefits of both. Dieng et al. (2016) and follow-up works incorporate a TM into RNN-based models to capture long-range dependencies. To move beyond single-layer TMs for RNNs, Guo et al. (2020) propose a recurrent hierarchical topic-guided RNN with the help of multi-layer TMs (Zhou et al., 2015; Zhang et al., 2018). To extract explicit document semantics for summarization, prior work proposes three different modules to plug knowledge from a TM into Transformer-based LMs (Vaswani et al., 2017; Devlin et al., 2018). Our work can be seen as a parallel effort to combine their advantages, but it focuses on dealing with data diversity in NLP without ground-truth information such as domain labels. Meanwhile, our method can be applied to different LMs, including CNNs, RNNs, and Transformer-based models.

Autoencoding topic model with mixture prior
We first describe one of the most popular topic models, latent Dirichlet allocation (LDA) (Blei et al., 2003), and its autoencoding inference (Srivastava and Sutton, 2017). Inspired by them, in order to jointly consider topic learning and sample clustering, we propose the autoencoding topic model with a mixture prior (mATM).

LDA with autoencoding inference
For a document containing D words, w = {w_d}_{d=1}^D, given K topics Φ = [φ_1, · · · , φ_K], where φ_k is a probability distribution over the vocabulary, LDA defines the generative process of w in Algorithm 1, where θ ∈ R_+^K is the topic proportion with α as the prior parameter.

[Algorithm 1: Generative process of LDA]

After collapsing the per-word topic indicator i_d, given θ and Φ, the conditional likelihood of w_d can be represented as

p(w_d | θ, Φ) = Categorical(Φθ).    (1)

Given Φ, a popular approximation for efficient inference of LDA is mean-field variational inference, which maximizes the evidence lower bound (ELBO) of the marginal data log-likelihood,

L = E_{q(θ)}[ Σ_{d=1}^D log p(w_d | θ, Φ) ] − KL( q(θ) || p(θ | α) ),    (2)

where q(θ) is the variational posterior. In particular, Srivastava and Sutton (2017) propose autoencoding variational Bayes (AEVB) (Kingma and Welling, 2013) for LDA by using a Laplace approximation (Hennig et al., 2012) to the Dirichlet prior and building a logistic-normal (LN) encoding posterior.
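The logistic-normal reparameterization at the heart of AEVB for LDA can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the encoder networks that would produce the mean and diagonal covariance are replaced by fixed values.

```python
import numpy as np

def sample_theta(mu, sigma_diag, rng):
    """Reparameterized draw from a logistic-normal: theta = softmax(h),
    with h ~ N(mu, diag(sigma_diag)). Gradients can flow through mu and
    sigma_diag, which is exactly what AEVB needs for backpropagation."""
    h = mu + np.sqrt(sigma_diag) * rng.standard_normal(mu.size)
    e = np.exp(h - h.max())  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
K = 50  # number of topics (illustrative value)
theta = sample_theta(np.zeros(K), np.ones(K), rng)
```

The returned `theta` lies on the probability simplex, so it can serve directly as a topic proportion.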
As shown in Fig. 1, we find that running a clustering method such as K-means on the semantic space θ cannot achieve satisfactory results. To jointly consider topic learning and sample clustering, we propose mATM.

Generative process of mATM
Suppose the number of clusters is C and the clustering prior parameter is π = [π_1, · · · , π_C] with Σ_{c=1}^C π_c = 1. As shown in Fig. 2a, mATM defines the generative process of w in Algorithm 2.

[Algorithm 2: Generative process of mATM]

Compared with LDA, mATM has a mixture Dirichlet prior with parameters {α_c}_{c=1}^C. In other words, mATM assumes that the θ of different documents may come from different clusters, which is the basic idea behind discovering data diversity from a corpus automatically.

Variational encoder of mATM
In order to infer the parameters of mATM and to further develop EnsLM from mATM, we introduce AEVB for mATM, whose detailed structure is shown in Fig. 2b.

Laplace approximation for mixture Dirichlet prior
Although the Dirichlet prior on θ is important for learning interpretable topics (Wallach et al., 2009), it is difficult to handle within AEVB, since AEVB needs an effective reparameterization (RT) function for its distributions. Inspired by the success of the Laplace approximation for the Dirichlet distribution, we propose the mixture LN (mLN) distribution as an approximation of the mixture Dirichlet distribution. Specifically, Srivastava and Sutton (2017) have shown that a Dirichlet distribution p(θ | α) can be well approximated by an LN distribution,

p(θ | α) ≈ LN(θ | μ, Σ),    (3)

where the elements of the mean vector μ and the diagonal covariance matrix Σ are

μ_k = log α_k − (1/K) Σ_{i=1}^K log α_i,  Σ_kk = (1/α_k)(1 − 2/K) + (1/K^2) Σ_{i=1}^K 1/α_i.    (4)

Going further, for inference of mATM, we construct the mLN distribution

mLN(θ | {μ_c, Σ_c, π_c}_{c=1}^C) = Σ_{c=1}^C π_c LN(θ | μ_c, Σ_c),    (5)

which is used to approximate the mixture Dirichlet prior p(θ | {α_c, π_c}_{c=1}^C) in mATM. Therefore, for each document with cluster indicator z = [z_1, · · · , z_C], the prior of θ can be written as Π_{c=1}^C LN(μ_c, Σ_c)^{z_c}. In practice, we build μ_c and Σ_c from α_c via the Laplace approximation in (4). Next, we build variational posteriors with easy RT functions for the latent variables.
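The Laplace approximation in (4) and the per-document prior Π_c LN(μ_c, Σ_c)^{z_c} can be sketched in NumPy as follows. This is a minimal sketch under one assumption we make explicit: the exponent-weighted product of Gaussians is combined in closed form (precision-weighted averaging), which reduces to simply selecting cluster c's parameters when z is one-hot.

```python
import numpy as np

def laplace_approx(alpha):
    """LN approximation to Dirichlet(alpha), following Srivastava & Sutton
    (2017): returns the mean vector and the diagonal of the covariance."""
    alpha = np.asarray(alpha, dtype=float)
    K = alpha.size
    mu = np.log(alpha) - np.log(alpha).mean()
    var = (1.0 / alpha) * (1.0 - 2.0 / K) + (1.0 / K ** 2) * (1.0 / alpha).sum()
    return mu, var

def mixture_prior_params(alphas, z):
    """Prior parameters of theta for one document under the mLN prior.

    alphas : (C, K) Dirichlet parameters, one row per cluster.
    z      : (C,) assignment vector (one-hot or soft, sums to 1).
    For prod_c N(mu_c, var_c)^{z_c} (in the pre-softmax space), the result
    is Gaussian with precision sum_c z_c / var_c (assumption: closed-form
    product of powered Gaussians, up to normalization).
    """
    params = [laplace_approx(a) for a in alphas]
    prec = sum(zc / v for zc, (_, v) in zip(z, params))
    mu = sum(zc * m / v for zc, (m, v) in zip(z, params)) / prec
    return mu, 1.0 / prec
```

With a one-hot z this returns exactly (μ_c, Σ_c) of the selected cluster; with a soft z the clusters are blended.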

Variational encoding posterior
After collapsing {i_d}_{d=1}^D in mATM, as in (1) for LDA, given topics Φ, there are two latent variables to infer for document w: θ and z.
LN posterior for θ. We build the variational posterior as q(θ) = LN(μ', Σ'), where μ' = f_{W_μ}(x) and Σ' = f_{W_Σ}(x) are two encoding networks and x is a representation of document w, such as its original words or its bag-of-words (BoW) vector. Moreover, the LN distribution has an easy RT function, like the Normal distribution.

Gumbel-softmax (GS) posterior for z. As a categorical variable, z is difficult to equip with an accurate RT function under AEVB. Instead, we employ the GS distribution (Jang et al., 2016) as the variational posterior of z for efficient gradient propagation. Specifically, suppose the posterior of z is Categorical(π'). After obtaining C i.i.d. samples {g_1, · · · , g_C} drawn from Gumbel(0, 1), z can be sampled as

z = one_hot( argmax_c (g_c + log π'_c) ).    (7)

To build the encoder for π', we let π' = f_{W_π}(θ, w). For efficient gradient propagation, rather than sampling z from the argmax in (7), we obtain the variational posterior of the soft assignment vector z = [z_1, · · · , z_C] as q(z):

z_c = exp( (log π'_c + g_c) / τ ) / Σ_{j=1}^C exp( (log π'_j + g_j) / τ ),    (8)

where τ is the temperature parameter. Besides efficient gradient backpropagation, the soft assignment in (8) provides cluster-membership weights. In the following EnsLM, this property is useful for ambiguous samples that may belong to several clusters.
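The soft assignment in (8) can be sketched directly in NumPy (a minimal illustration with fixed probabilities standing in for the encoder output π'):

```python
import numpy as np

def gumbel_softmax(log_pi, tau, rng):
    """Soft cluster assignment z via the Gumbel-softmax trick (Jang et al., 2016).

    log_pi : (C,) log probabilities from the encoder.
    tau    : temperature; as tau -> 0 samples approach one-hot vectors,
             while larger tau gives smoother (softer) assignments.
    """
    u = rng.uniform(size=log_pi.size)
    g = -np.log(-np.log(u + 1e-12) + 1e-12)  # Gumbel(0, 1) noise
    y = (log_pi + g) / tau
    e = np.exp(y - y.max())  # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
z_soft = gumbel_softmax(np.log(np.array([0.7, 0.2, 0.1])), tau=1.0, rng=rng)
z_hard = gumbel_softmax(np.log(np.array([0.7, 0.2, 0.1])), tau=0.05, rng=rng)
```

Both draws lie on the simplex; the low-temperature draw is typically close to one-hot, matching the role of (7) versus (8).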

ELBO of mATM
We obtain the ELBO of mATM as

L = E_{q(θ)q(z)}[ Σ_{d=1}^D log p(w_d | θ, Φ) ] − E_{q(z)}[ KL( q(θ) || p(θ | z) ) ] − KL( q(z) || p(z | π) ).    (9)

Similarly to Srivastava and Sutton (2017), instead of sampling Φ from a Dirichlet posterior as in LDA, we parameterize it as Φ = softmax(W_t), where W_t = [w_1, · · · , w_K] and the softmax is applied to each topic {w_k}_{k=1}^K to keep it on the probability simplex. Therefore, as shown in Fig. 2, the parameters of mATM are Θ_mATM = {W_t, W_μ, W_Σ, W_π} together with the prior parameters {μ_c, Σ_c, π_c}_{c=1}^C.
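The individual terms of the mATM ELBO can be evaluated in closed form for a single Monte-Carlo sample of θ. The NumPy sketch below makes two simplifications we state explicitly: one sample of θ is used for the reconstruction term, and the KL to the θ-prior is taken against a single diagonal Gaussian (the cluster selected or blended by z beforehand).

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ), used for q(theta)."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def kl_categorical(q, p):
    """KL( q(z) || p(z) ) for the cluster assignment (q, p strictly positive)."""
    return np.sum(q * (np.log(q) - np.log(p)))

def reconstruction(bow, theta, Phi):
    """sum_d log p(w_d | theta, Phi) for a BoW count vector.
    Phi is (V, K) with each column a topic on the vocabulary simplex."""
    probs = Phi @ theta  # (V,) per-word distribution
    return np.sum(bow * np.log(probs + 1e-10))

def elbo_single_sample(bow, theta, mu_q, var_q, mu_p, var_p, q_z, p_z, Phi):
    """One-sample Monte-Carlo estimate of the mATM ELBO (simplified)."""
    return (reconstruction(bow, theta, Phi)
            - kl_diag_gauss(mu_q, var_q, mu_p, var_p)
            - kl_categorical(q_z, p_z))
```

Both KL terms vanish when posterior and prior coincide, which is a quick sanity check for an implementation.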

Ensemble language model
Recently, various advanced LMs for language understanding and generation have been introduced, most of which do not consider the data diversity in the corpus. In this paper, having obtained the clustering assignment vector z from mATM, and given a single LM as backbone, we propose the ensemble LM (EnsLM) via z-guided weight modulation. In other words, EnsLM can modulate the single backbone LM to fit different clusters.

Efficient weight modulation
Although LMs come in many different types, they basically build on convolutional (as in CNNs (Johnson and Zhang, 2015)) or fully-connected (as in Transformers (Vaswani et al., 2017)) operations (ignoring the bias):

H_2 = W ∗ H_1 (convolution) or H_2 = W^T H_1 (fully-connected),    (10)

where H_1 ∈ R^{I_x × I_y × C_in} or H_1 ∈ R^{C_in} is the input feature, and W ∈ R^{k_x × k_y × C_in × C_out} or W ∈ R^{C_in × C_out} is the convolutional kernel or fully-connected weight matrix, respectively¹. Suppose the number of clusters (domains) in mATM is C. Given an LM as backbone, we introduce a few modulation parameters to modulate the original weights W for different clusters. Specifically, as shown in Fig. 3, for a convolutional or fully-connected layer as in (10), we introduce two dictionaries of modulation parameters,

A = {α_c}_{c=1}^C with α_c ∈ R^{C_in},  B = {β_c}_{c=1}^C with β_c ∈ R^{C_out}.    (11)

For a document w whose feature at the current layer is H_1, after obtaining its cluster assignment z ∈ R^{C×1} from (8), we feed H_1 into the modulated layer as

H_2 = (W ⊙ Γ) ∗ H_1 (convolution) or H_2 = (W ⊙ Γ)^T H_1 (fully-connected),    (12)

where Γ = αβ^T with α = Σ_{c=1}^C z_c α_c and β = Σ_{c=1}^C z_c β_c, and ⊙ denotes the matrix element-wise product (with broadcasting for convolution).
Explanation of (12). Intuitively, W acts as the backbone parameters of the original single LM, while Γ contains the modulation parameters, which move the backbone to fit different domains. If z is drawn from (7), i.e., z is a one-hot vector, then α and β are selected from the dictionaries A and B. If z is drawn from (8), i.e., z is a soft assignment vector, then α and β are weighted sums of all elements in A and B. In practice, we use the soft assignment vector since i) it allows efficient gradient propagation during joint training of mATM and EnsLM, and ii) it accounts for domain-ambiguous samples in the dataset. Interestingly, although EnsLM is designed for the case where ground-truth information about data diversity (such as domain labels) is unavailable, it can also be used when such priors are known. In that scenario, rather than inferring the clustering assignment z from mATM via (8), we directly set z to the true one-hot assignment vector, as illustrated in the experiments in Sec. 5.2.
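The fully-connected case of the modulation above can be sketched in NumPy as follows (a minimal illustration with made-up shapes; the convolutional case would broadcast Γ over the spatial kernel dimensions):

```python
import numpy as np

def modulated_linear(H, W, A, B, z):
    """Cluster-modulated fully-connected layer, as a sketch of Eq. (12).

    H : (C_in,)        input feature
    W : (C_in, C_out)  shared backbone weights
    A : (C, C_in)      dictionary of input-side modulation vectors alpha_c
    B : (C, C_out)     dictionary of output-side modulation vectors beta_c
    z : (C,)           cluster assignment (one-hot or soft, sums to 1)
    """
    alpha = z @ A                  # (C_in,)  weighted sum of alpha_c
    beta = z @ B                   # (C_out,) weighted sum of beta_c
    Gamma = np.outer(alpha, beta)  # (C_in, C_out) rank-1 modulation matrix
    return (W * Gamma).T @ H       # element-wise modulated weights, then matmul

rng = np.random.default_rng(0)
C, C_in, C_out = 4, 8, 6
H = rng.standard_normal(C_in)
W = rng.standard_normal((C_in, C_out))
A = np.ones((C, C_in))   # all-ones modulation recovers the plain backbone
B = np.ones((C, C_out))
z = np.array([1.0, 0.0, 0.0, 0.0])
out = modulated_linear(H, W, A, B, z)
```

Because Γ is rank-1, each cluster adds only C_in + C_out parameters per layer, which is the efficiency argument behind the design.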

Joint training of mATM and EnsLM
Unlike strategies such as data selection, which separate the computation of assignments from the training of the LM, our proposed mATM and EnsLM can be jointly trained in one framework.
Specifically, given a training set of N samples {w_n}_{n=1}^N, suppose each sample has a label y_n. Note that the labels {y_n}_{n=1}^N differ across tasks: class labels for document classification, gold summaries for abstractive summarization, or the document itself for generation. The loss for joint training of mATM and EnsLM can then be written as

L = Σ_{n=1}^N [ L_LM(y_n, w_n) − ELBO_mATM(w_n) ],

where, without loss of generality, L_LM denotes the task loss of the LM and ELBO_mATM is given in (9). The learnable parameters are i) the parameters of mATM, Θ_mATM, and ii) the parameters of the LM, Θ_LM. They can be jointly trained by stochastic gradient descent with low-variance gradient estimates, since the LN and GS distributions have easy RT functions.

Experiments
In this section, we evaluate the effectiveness and efficiency of our proposed mATM and EnsLM on different NLP tasks, including document clustering, multi-domain sentiment classification, language generation, and abstractive document summarization. Our code is available at https://github.com/BoChenGroup/EnsLM

Document clustering
The basic idea of mATM and EnsLM is that mATM can automatically discover sample clusters that describe the data diversity. Therefore, we first evaluate the document clustering performance of mATM.
Datasets 20News has 20 classes and consists of 18,846 documents with a vocabulary size of 61,188, partitioned into a training set of 11,314 documents and a test set of 7,532 documents. R8 is a subset of the Reuters-21578 dataset; it has 8 classes and is split into 5,485 training and 2,189 test documents. For these two datasets, we remove stop words and use the 2,000 most frequent terms as the vocabulary. For all methods, we set the number of clusters to the number of classes.
Comparison models and implementation details To verify the effectiveness of mATM for clustering, we compare three types of document clustering models: i) Raw+kmeans performs K-means on raw BoW vectors, and PCA+kmeans uses PCA to extract low-dimensional features and then applies K-means; ii) methods that train a topic model and then perform K-means on the topic proportions, including LDA+kmeans (Blei et al., 2003), AVITM+kmeans (Srivastava and Sutton, 2017), and PFA+kmeans (Zhou et al., 2012); iii) deep-neural-network-based clustering methods, including Deep clustering (Xie et al., 2016) and DCN (Yang et al., 2017), which jointly consider feature extraction and clustering. Except for Raw+kmeans, which clusters the original inputs, all methods cluster in a latent feature space (for topic models, the feature is the topic proportion). Following Xie et al. (2016) and Yang et al. (2017), the dimension of the feature space equals the number of clusters.
Results Following Yang et al. (2017), since the ground-truth labels are known and the number of clusters is set to the number of classes, we measure clustering performance by accuracy (AC) and normalized mutual information (NMI), both of which are the higher the better. The results are shown in Table 1. Compared with Raw+kmeans, PCA+kmeans performs better since it extracts effective principal components. Benefiting from learning document semantics, the second group of three topic models outperforms PCA. Compared with the first two groups, the third group jointly considers feature learning and clustering, thus achieving higher AC and NMI. Combining the advantages of topic modeling in extracting effective document features with joint learning of the feature extractor and the clustering, mATM achieves the best performance for document clustering on these two datasets.
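For reference, the two metrics used above can be computed as follows. This is a sketch, not the paper's evaluation script: AC uses SciPy's Hungarian solver to find the best cluster-to-class mapping, and NMI is normalized by sqrt(H(x)H(y)), one common choice among several.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """AC: accuracy under the best one-to-one mapping of clusters to classes."""
    n = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1
    row, col = linear_sum_assignment(-count)  # negate to maximize matches
    return count[row, col].sum() / y_true.size

def nmi(y_true, y_pred):
    """Normalized mutual information between cluster and class labels."""
    n = y_true.size
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    contingency = np.array([[np.sum((y_true == c) & (y_pred == k))
                             for c in classes] for k in clusters], dtype=float)
    pxy = contingency / n
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / np.sqrt(hx * hy)
```

A permuted-but-perfect clustering gives AC = 1 and NMI = 1, which is a useful sanity check for both functions.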
The clustering results support our motivation of using mATM to discover the data diversity. In the following experiments, we evaluate the performance of both mATM and EnsLM on different language understanding and generation tasks.

Multi-domain sentiment classification
Sentiment classification (positive or negative) for different products is a fundamental language understanding task in NLP. For this task, the data diversity mainly arises from different domains (products) (Blitzer et al., 2007), which brings the problem that data from different domains may have different distributions.
Datasets To evaluate the performance of mATM and EnsLM in capturing the multi-domain property for sentiment classification, following Cai and Wan (2019), we perform experiments on the dataset released in prior work, which consists of product and movie reviews in 16 different domains. The data in each domain are randomly split into training, development, and test sets in proportions of 70%, 10%, and 20%; statistics of the 16 datasets are listed in Appendix A.1.
Comparison models and implementation details Following Cai and Wan (2019), we first consider three base models, BiLSTM (Adhikari et al., 2019), TextCNN (Kim, 2014), and BERT (Devlin et al., 2019), which perform classification on each domain separately. Second, combining data from all domains, we train the above three models, denoted BiLSTM-mix, TextCNN-mix, and DocBERT-mix. Given ground-truth domain labels, previous works regard the multi-domain problem as multi-task learning (MTL), including DA-MTL (Zheng et al., 2018), ASP-MTL, and MDAE (Cai and Wan, 2019); all of these are developed from the BiLSTM model. For our proposed EnsLM, we use TextCNN, BiLSTM, and DocBERT as backbones. We perform experiments with two variants of EnsLM: i) with ground-truth (GT) domain labels, we directly set z to the one-hot assignment vector (without inferring z from mATM), denoted BiLSTM-EnsLM-GT, TextCNN-EnsLM-GT, and BERT-EnsLM-GT; ii) without GT domain labels, we use mATM to infer z, denoted BiLSTM-EnsLM-mATM, TextCNN-EnsLM-mATM, and BERT-EnsLM-mATM. For models using mATM, we set the number of topics to 16. More detailed settings and implementation details can be found in Appendix B.1.
Results With GT domain labels, the three models equipped with our proposed EnsLM outperform their basic counterparts by a large margin. When GT domain labels are assumed unavailable, we use mATM to infer the clustering assignment to guide the learning of EnsLM, which achieves the best performance with all three base models, even better than the models using GT domain labels. We attribute this to the fact that, compared with hard GT domain labels, mATM infers soft clustering assignments, which not only reflect the domain characteristics of samples but also capture samples with ambiguous domain characteristics; for example, samples from DVD may be similar to those from Electronics.

Language generation
Datasets In order to verify the effectiveness of our model on datasets of different lengths, we consider four publicly available corpora: APNEWS, IMDB, BNC, and COCO. Following Lau et al.
(2017), we tokenize words and sentences using Stanford CoreNLP (Klein and Manning, 2003), lowercase all word tokens, and filter out word tokens that occur less than 10 times. For the topic model, we additionally exclude stopwords. All these corpora are partitioned into training, validation, and testing sets, whose summary statistics are provided in Appendix A.2.
Comparison models and implementation details We consider the following baseline models: LSTM, a standard LSTM language model (Hochreiter and Schmidhuber, 1997); Transformer-XL (Dai et al., 2019), which enables learning dependencies beyond a fixed length by introducing a recurrence mechanism and a novel positional encoding scheme into the Transformer architecture; TGVAE (Wang et al., 2019), which combines a variational-autoencoder-based sequence model with a neural topic model; rGBN-RNN (Guo et al., 2020), which extracts a recurrent hierarchical semantic structure via a dynamic deep topic model to guide natural language generation; and GPT-2 (Radford et al., 2019), a generatively pretrained Transformer-based LM trained on a diverse set of unlabeled text. Our proposed GPT-2-EnsLM-mATM first uses mATM to infer a semantic cluster assignment for each sample, and then introduces this diversity information into the pretrained GPT-2 via efficient weight modulation. In the experiments, we use the Adam optimizer (Kingma and Ba, 2014) with learning rate 10^−6. The length of an input sample is limited to 1024. We set the mini-batch size to 8 and the number of training epochs to 5. The clustering number of mATM is set to 64 for the first three datasets and 80 for COCO. More detailed settings and implementation details can be found in Appendix B.2.

[Fig. 5: for each semantic cluster, the top topic words (e.g., 'child', 'people', 'person', 'young'; 'beach', 'water', 'outside', 'near', 'park'; 'cake', 'slice', 'piece', 'chocolate', 'cream'), original sentences assigned to the cluster, and sentences generated by GPT-2-EnsLM-mATM conditioned on that cluster.]

Results For a fair comparison, we use standard language model perplexity as the evaluation metric. The results of all models on the four datasets are given in Table 3, where the results of existing models are taken from Guo et al. (2020).
In the first group, Transformer-XL achieves better results, showing that Transformer-based models have stronger modeling capabilities. By capturing global document semantics, the second group improves performance significantly, which indicates that topic models are effective at capturing global document information.
Pretrained on massive data, GPT-2 obtains better results than the above models. Although GPT-2 already performs well, GPT-2-EnsLM-mATM improves performance significantly by capturing data diversity. This illustrates that even for an LM pretrained on a large corpus, EnsLM can further improve performance by exploring data diversity. A similar phenomenon appears in the experiments of Gururangan et al. (2020).
Sentence generation of EnsLM Given the learned GPT-2-EnsLM-mATM, we can sample sentences conditioned on semantic clusters. As shown in Fig. 5, we select the top-3 topics to represent each cluster and select original sentences according to the clustering results. Most of the generated sentences conditioned on a semantic cluster are highly related to the given topics in their semantic meanings, though not necessarily in key words, indicating that the LM is successfully guided by the cluster assignment. These observations suggest that GPT-2-EnsLM-mATM captures syntax and global semantics simultaneously for natural language generation. Similar to Fig. 5, we provide generated sentences for other semantic clusters in Appendix C.

Abstractive summarization
Datasets We evaluate the effectiveness and efficiency of the proposed model on two benchmark datasets, CNN/DailyMail (CNN/DM) (Hermann et al., 2015) and XSum (Narayan et al., 2018). The summary styles of these datasets vary from highlights composed of several sentences to very brief single sentences. More detailed descriptions are in Appendix A.3. We perform data pre-processing following Liu and Lapata (2019).
Comparison models and implementation details We consider several baseline models: the LSTM-based models PTGEN and PTGEN+Cov (See et al., 2017); the Transformer-based models Transformer and BertSUM (Liu and Lapata, 2019); and BertSUM+TA, which combines a pretrained model with a topic model. We combine EnsLM with BertSUM for the abstractive summarization task. The clustering number of mATM is set to 64 for all datasets. Given the BertSUM checkpoints on CNN/DM and XSum provided by Liu and Lapata (2019), we further fine-tune BertSUM+EnsLM; otherwise, we adopt the settings of BertSUM. Following Liu and Lapata (2019), at test time we use beam search with size 5, select the top-3 checkpoints based on their evaluation loss on the validation set, and report results averaged over the test set. More detailed settings and implementation details can be found in Appendix B.3.
Results ROUGE scores on CNN/DM and XSum are shown in Table 4. Among the models without pretraining in the first group, Transformer achieves better performance than the LSTM-based models, owing to its stronger sequence modeling capabilities. Further, the outperformance of BertSUM illustrates that the combination of a pretrained BERT encoder and a Transformer decoder is a better choice of sequence-to-sequence structure. Despite sharing the structure of BertSUM, BertSUM+TA employs a topic model to capture global document segment diversity and achieves higher scores. Different from BertSUM+TA, which introduces document semantic diversity by adding topic information, our model combines BertSUM with EnsLM, resulting in better performance. The improvement of our model over BertSUM+TA is modest because BertSUM+TA already incorporates topical information, which captures segment diversity and contextual information, into BertSUM. Note, however, that our model improves significantly over BertSUM, which demonstrates its effectiveness.

B.1 Multi-domain Sentiment Classification Models

To reduce the number of new parameters, we only introduce segment diversity information into the query layer. For optimization, the Adam optimizer (Kingma and Ba, 2014) is utilized with a learning rate of 0.00001. To avoid overfitting, we use dropout with rate 0.3. We set the mini-batch size to 16 in all experiments.

B.2 Language Generation Models
For language generation, we propose GPT-2-EnsLM-mATM, which combines mATM with the pretrained GPT-2, introducing segment diversity information into the query, key, and value projections of each layer. We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 10^−6. The length of an input sample is limited to 1024. We set the mini-batch size to 8 and the number of training epochs to 5. The clustering number of mATM is set to 64 for the first three datasets and 80 for COCO.

B.3 Abstractive Summarization Models
For abstractive summarization, we combine BertSUM, which includes a pretrained encoder and a Transformer decoder, with mATM. Specifically, we introduce segment diversity information into the query, key, and value projections of each layer. We set the hyperparameters following the original papers and their public code, where BertSUM refers to Liu and Lapata (2019). We fine-tune all models on four Nvidia GeForce RTX 2080 Ti GPUs. The experiments use mini-batches of 200 summary tokens with gradient accumulation every six iterations. Model checkpoints are saved and evaluated on the validation set every 1,000 updates; in total, we update the model 250,000 times. Following Liu and Lapata (2019), we select the top-3 checkpoints based on their evaluation loss on the validation set and report results averaged over the test set. During decoding we use beam search with size 5 and tune the length-penalty α between 0.6 and 1 on the validation set. It is worth noting that our decoder applies neither a copy nor a coverage mechanism, despite their popularity in abstractive summarization.

C More Generation Examples
As shown in Fig. 5, we provide additional sentences generated by GPT-2-EnsLM-mATM conditioned on semantic clusters on the COCO corpus.