Sentence-Permuted Paragraph Generation

Generating paragraphs with diverse content is important in many applications. Existing generation models produce similar content from homogenized contexts due to the fixed left-to-right sentence order. Our idea is to permute the sentence order to improve the content diversity of multi-sentence paragraphs. We propose a novel framework, PermGen, whose objective is to maximize the expected log-likelihood of output paragraph distributions with respect to all possible sentence orders. PermGen uses hierarchical positional embeddings and introduces new procedures for training and decoding in sentence-permuted generation. Experiments on three paragraph generation benchmarks demonstrate that PermGen generates more diverse and higher-quality outputs than existing models.


Introduction
Paragraph generation is an important yet challenging task. It requires a model to generate informative and coherent long text that consists of multiple sentences from free-format sources such as a topic statement or some keywords. Typical paragraph generation tasks include story generation (Fan et al., 2018), news generation (Leppänen et al., 2017), and scientific paper generation (Koncel-Kedziorski et al., 2019). Recent advances in natural language generation models such as Transformer (Vaswani et al., 2017) and BART (Lewis et al., 2020) have demonstrated attractive performance in generating text paragraphs.
An important desired property of model-generated paragraphs is diversity: given the same source, an intelligent model is expected to create a variety of paragraphs in terms of content, semantic style, and word variability (Li et al., 2016; Ippolito et al., 2019). For example, a story generation model should narrate a plot with different storylines (Clark et al., 2018); a scientific paper generation model should suggest diverse contents to spark new ideas. In order to create diversity, controllable methods (Zhao et al., 2017; Cho et al., 2019; Yu et al., 2020) used additional inputs (e.g., aspects, styles), and sampling-based decoding algorithms (Radford et al., 2019; Holtzman et al., 2020) searched next tokens widely from the vocabulary. However, existing models struggled to produce multi-sentence paragraphs of diverse contents, because they relied on the homogeneity of contexts (e.g., similar story beginnings) caused by the conventional autoregressive framework with its fixed left-to-right sentence order (i.e., S1→S2→S3). As an example, Figure 1 evaluates the diversity of each generated sentence at different positions of the story in ROCStories (Mostafazadeh et al., 2016) by different models. As shown, BART (dashed line) tends to generate stories with very similar beginning and middle parts and only produces diverse text near the end of a story. This phenomenon stems from the fact that left-to-right generation leads to homogeneity of the context to the left, reducing the diversity of the generated paragraph.
§ Our code and output files are available at https://github.com/wyu97/permgen.
Figure 1: Left: Diversity of each generated story sentence at different positions (1 to 5) in the ROCStories test set, measured by averaged 1−Self-BLEU (Zhu et al., 2018). Our PermGen produces content of higher diversity at all positions, while BART (dashed line) produces diverse outputs only at the end of the story. With p-value < 0.01, PermGen has higher diversity than the grey line. Right: PermGen outperforms BART in the accuracy of generated stories.
Our idea is permuting the sentence orders in paragraph generation, while sticking with the left-to-right scheme to generate tokens within each sentence. It has two advantages. First, it provides an output sentence with a variety of contexts (and possibilities) from different orders. For example, creating the story ending first can produce a completely different story from generating the beginning first. Second, it retains the benefit of the autoregressive model that originates from the word-by-word nature of human language production, so the coherence within sentences can be maintained, avoiding the harm of incomplete semantics from token-level permutation (Shen et al., 2020).
In this work, we propose a sentence-permuted paragraph generation framework called PermGen. Instead of using the fixed forward order, PermGen maximizes the expected log-likelihood of the output paragraph distribution with respect to all possible sentence orders. The optimization is based on π-SGD (Murphy et al., 2019), which has a guaranteed convergence property. Furthermore, PermGen employs a novel hierarchical position encoding scheme to represent the positions of tokens in permuted sentences. PermGen can be initialized with any Transformer-based model and works with any decoding algorithm, such as beam search and nucleus sampling (Holtzman et al., 2020).
We conduct experiments on three paragraph generation tasks: story generation, news generation, and paper abstract generation. Results show that PermGen significantly improves the diversity of generated text while achieving higher accuracy. In particular, as shown in Figure 1, PermGen improves diversity for sentences at all positions while also improving accuracy. Moreover, we observe consistent improvements in both accuracy and diversity when PermGen is coupled with various pre-trained models and decoding algorithms.

Related Work
Paragraph Generation. The source of paragraph generation can be either structured or unstructured, such as database records (Puduppully et al., 2019), knowledge graphs, images (Ippolito et al., 2019), and keywords (Yao et al., 2019). The expected outputs are typically stories (Guan et al., 2019; Yao et al., 2019), essays (Yang et al., 2019), news articles (Dong et al., 2021), or scientific papers (Hua and Wang, 2019; Koncel-Kedziorski et al., 2019). This task poses unique challenges because it aims at generating coherent and diverse long-form text. Our framework can use various forms of input, such as a story title, keywords, or keyphrases, and can be generalized to broad domains.
Diverse Text Generation. Generating diverse sequences is of crucial importance in many text generation applications that exhibit semantically one-to-many relationships between the source and target sequences, such as machine translation (Shen et al., 2019; Lachaux et al., 2020), summarization (Cho et al., 2019), question generation, and paraphrase generation (Qian et al., 2019). Methods for improving diversity in text generation have been explored from different perspectives in recent years. Sampling-based decoding is one of the effective solutions to improve diversity (Fan et al., 2018; Holtzman et al., 2020); e.g., nucleus sampling (Holtzman et al., 2020) samples next tokens from the dynamic nucleus of tokens containing the vast majority of the probability mass, instead of decoding text by maximizing the likelihood. Another line of work focuses on introducing random noise (Gupta et al., 2018) or changing latent variables (Lachaux et al., 2020) to produce uncertainty; e.g., Gupta et al. (2018) employ a variational auto-encoder framework to generate diverse paraphrases according to the input noise. In addition, Shen et al. (2019) adopt a deep mixture of experts (MoE) to diversify machine translation, where a minimum-loss predictor is assigned to each source input; Shi et al. (2018) employ inverse reinforcement learning for unconditional diverse text generation.
Dynamic Order Generation. These methods fall into two categories. First, non-autoregressive generation is an emerging topic commonly used in machine translation (Gu et al., 2018; Ren et al., 2020). These models generate all the tokens of a sequence in parallel, resulting in faster generation speed. However, they perform poorly for long sentences due to limited target-side conditional information (Guo et al., 2019). Second, insertion-based generation is a partially autoregressive approach that maximizes the entropy over all valid insertions of tokens (Stern et al., 2019). POINTER inherits the advantages of the insertion operation to generate text in a progressive coarse-to-fine manner. The blank language model (BLM) (Shen et al., 2020) provides a formulation for generative modeling that accommodates insertions of variable length. Different from the above methods, our PermGen permutes the sentence orders when generating a paragraph, and it follows the left-to-right manner when producing each sentence.

Preliminaries
Problem Definition. Given an input X that can be a topic statement, some keywords, or a paper's title, the goal is to produce a paragraph Y consisting of multiple sentences, such as a story, a news article, or a paper's abstract. Suppose Y has T sentences, denoted by Y = [Y_1, · · · , Y_T], where Y_t is the t-th sentence. During training, T is easily obtained from the data to create sentence indices. During testing, models are expected to predict the sentence indices themselves, up to a maximum T (i.e., 10).

Sentence-Level Transformer
Transformer (Vaswani et al., 2017) follows the encoder-decoder architecture (Sutskever et al., 2014) and uses stacked multi-head self-attention and fully connected layers for both the encoder and decoder. For simplicity, we represent the Transformer framework at the sentence level by using a recurrent notation that generates a probability distribution for sentence prediction by attending to both input X and previously decoded sentences Y_{<t}:

p(Y_t | Y_{<t}, X) = Transformer(Y_{<t}, X),    (1)

where Y_t and Y_{<t} are the t-th sentence and the sentences before the t-th sentence under the left-to-right order in the target output. Transformer eschews recurrence and instead relies on the self-attention mechanism to draw global dependencies between the input and output. During the decoding phase, Transformer can predict each token based on both the input and previously predicted tokens via attention masks to improve efficiency. The objective of Transformer is to maximize the likelihood under the forward autoregressive factorization:

p(Y | X) = ∏_{t=1}^{T} p(Y_t | Y_{<t}, X).    (2)

Proposed Method: PermGen
In a left-to-right generation scheme such as the canonical Seq2Seq design, each generated token is conditioned on left-side tokens only (Sutskever et al., 2014). It ignores contextual dependencies from the right side and leads to limited diversity of the generated text (as shown in Figure 1). To solve this problem, our PermGen, a novel sentence-permuted paragraph generation model, produces sentences that are not confined to the left-to-right order. Instead, PermGen attempts different sentence orders and selects the best-ranked output candidate. As shown in Figure 2, PermGen uses the Transformer encoder but changes the sentence orders during the decoding phase. It should be noted that PermGen follows the left-to-right manner when generating tokens within each sentence. Thus, we represent the Transformer decoder as:

p(Y_{π_t} | Y_{π_{<t}}, X) = Transformer(Y_{π_{<t}}, X),    (3)

where Y_{π_t} and Y_{π_{<t}} are the t-th sentence and the sentences before the t-th sentence under the permutation order π in the target output. Taking the first permuted order in Figure 2 as an example, we have π = [2, 1, 3], π_1 = 2, π_3 = 3, and π_{<3} = [2, 1]. We note that PermGen is based on the encoder-decoder Transformer architecture, which can be initialized either randomly or from a pre-trained Transformer model with the same structure. Therefore, in the experiments, we evaluate PermGen that is i) trained from scratch, and ii) initialized with BART (Lewis et al., 2020). Next, we introduce the three modules of PermGen: (1) hierarchical positional embedding, (2) sentence-permuted learning, and (3) sentence-based decoding.
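To make the permuted conditioning concrete, here is a minimal plain-Python sketch (variable names are illustrative, not from the paper) of how the decoder context Y_{π_{<t}} is assembled for the example order π = [2, 1, 3]:

# Illustrative only: assemble the decoder context under a sampled sentence order.
sentences = ["Y1 ...", "Y2 ...", "Y3 ..."]   # target paragraph Y = [Y1, Y2, Y3]
pi = [2, 1, 3]                               # one sentence order (permutation)

for step, idx in enumerate(pi):
    context = [sentences[i - 1] for i in pi[:step]]   # Y_{pi_<t}: sentences generated so far
    target = sentences[idx - 1]                       # Y_{pi_t}: sentence to generate next
    print(f"step {step + 1}: generate Y{idx} conditioned on {context}")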

Hierarchical Positional Embedding
In Transformer, positional embeddings are added to every token's embedding. Traditionally, the positional embedding encodes the absolute position from 1 to the sequence length to model how a token at one position attends to tokens at other positions (Vaswani et al., 2017;Lewis et al., 2020).
We propose the hierarchical positional embedding that consists of a global position and a local position. Given a token, the global position is the position (index) of the sentence that contains this token; the local position is the position of the token within the sentence (see the two lines of position numbers in Figure 2). Given a paragraph Y, its embedding matrix, whose rows are its tokens and whose columns are embedding dimensions, is given by

Y_emb = Y_token + Y_global_position + Y_local_position,    (4)

where Y_token is the token embedding, and Y_global_position and Y_local_position are the global and local positional embeddings, respectively.
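A minimal PyTorch sketch of this embedding scheme is shown below; the module and argument names are our own, and the sizes are illustrative rather than the authors' exact configuration.

import torch
import torch.nn as nn

class HierarchicalPositionalEmbedding(nn.Module):
    """Sketch: token + global (sentence-index) + local (within-sentence) embeddings."""
    def __init__(self, vocab_size, d_model, max_sentences=10, max_sent_len=128):
        super().__init__()
        self.token = nn.Embedding(vocab_size, d_model)
        self.global_pos = nn.Embedding(max_sentences, d_model)   # which sentence the token is in
        self.local_pos = nn.Embedding(max_sent_len, d_model)     # position inside that sentence

    def forward(self, token_ids, sent_ids, local_ids):
        # All inputs are LongTensors of shape (batch, seq_len).
        return self.token(token_ids) + self.global_pos(sent_ids) + self.local_pos(local_ids)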
Compared to the absolute positional embedding, the hierarchical positional embedding has two advantages. First, the embedding of two-level positions is more informative about the paragraph structure than that of the absolute position. Second, when we permute the sentence orders in paragraph generation, the absolute positions of tokens might not be available. For example, if the second sentence is generated earlier than the first sentence, the absolute positions of its tokens cannot be determined because the length of the first sentence is unknown. In comparison, hierarchical position does not have this issue.
In addition, for the t-th sentence in Y , we add two special tokens (i.e., <B-t> and <E-t>) to indicate the beginning and end of the sentence. Thus, the decoder can determine the sentence index based on the predicted special tokens. We also append a special token <EOP> to the paragraph to indicate the end of the generation process.
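For illustration, a permuted training target under the order π = [2, 1, 3] might be serialized as follows; the exact serialization format and the story text are assumptions on our part.

# Hypothetical serialization of a permuted 3-sentence target (order pi = [2, 1, 3]).
target = (
    "<B-2> He had missed the bus . <E-2> "
    "<B-1> Tom woke up late . <E-1> "
    "<B-3> He decided to bike to work . <E-3> <EOP>"
)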

Sentence-permuted Learning
This module learns by varying sentence orders in paragraph generation and acts as the key component in PermGen. For example, given a sentence order π = [2, 4, 1, 5, 3], PermGen first generates the second sentence from the leftmost token to the rightmost, then generates the fourth sentence, and so on. Formally, we denote by Z_T the set of all possible sentence orders, i.e., the permutations of sentence indices of length T. It follows that |Z_T| = T!. Given input X and target output paragraph Y of T sentences, PermGen maximizes the following likelihood:

L(θ) = log [ (1/|Z_T|) Σ_{π∈Z_T} ∏_{t=1}^{T} p(Y_{π_t} | Y_{π_{<t}}, X; θ) ].    (5)

However, computing the negative log-likelihood in Eq. (5) is prohibitive because the back-propagation computational graph branches out for every permutation in the sum. Therefore, we apply Jensen's inequality to lower-bound the log-likelihood:

L(θ) ≥ (1/|Z_T|) Σ_{π∈Z_T} Σ_{t=1}^{T} log p(Y_{π_t} | Y_{π_{<t}}, X; θ).    (6)

By maximizing the lower bound, we do not favor any particular sentence order, but encourage the model to generate Y equally well in all orders. Note that maximizing this lower bound is equivalent to minimizing the following expectation:

E_{π∼U(Z_T)} [ − Σ_{t=1}^{T} log p(Y_{π_t} | Y_{π_{<t}}, X; θ) ].    (7)

Since computing this expectation is still intractable, we apply π-SGD (Murphy et al., 2019) stochastic optimization, which randomly samples a permutation for the gradient computation at each step.
We note that π-SGD is a Robbins-Monro stochastic approximation of gradient descent (Robbins and Monro, 1951). When it is applied to permutation sampling, the optimization almost surely converges to the optimal θ.
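The following PyTorch-style sketch shows one π-SGD update under our reading of the procedure: a single sentence order is sampled uniformly per example and the ordinary NLL gradient is taken under that order. The serialize helper (which would permute the sentences and insert the <B-t>/<E-t>/<EOP> markers) and the seq2seq model interface are hypothetical.

import random

def permgen_training_step(model, optimizer, src_ids, tgt_sentences):
    """One pi-SGD step (sketch): sample one sentence order and minimize the NLL under it."""
    T = len(tgt_sentences)
    pi = random.sample(range(1, T + 1), T)         # uniform sample from Z_T
    labels = serialize(tgt_sentences, pi)          # hypothetical helper: permute + add special tokens
    loss = model(input_ids=src_ids, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()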

Sentence-based Decoding
In decoding, PermGen adopts the following steps:
• Step 1: Initialize the set of generated sentence indices I ← ∅;
• Step 2: Predict a special token from {<B-t>}_{t=1}^{T} ∪ {<EOP>}. If the token is <EOP>, end; otherwise, append <B-t> to the generated text;
• Step 3: Generate tokens from V ∪ {<E-t>} for the t-th sentence in an autoregressive way, where V is the set of normal text tokens. Stop when <E-t> is generated;
• Step 4: I ← I ∪ {t}, then go back to Step 2.
As stated in Step 2, when <EOP> is generated, the whole generation ends. Then, the sentences in the generated paragraph are reordered according to the sentence indices I and the special tokens. Note that in Step 3, since PermGen adopts autoregressive generation, it can employ any decoding strategy such as beam search or a sampling algorithm (e.g., truncated sampling (Fan et al., 2018), nucleus sampling (Holtzman et al., 2020)). For example, truncated sampling samples the next word from the top-k probable choices, instead of decoding text by maximizing the likelihood.
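The procedure above can be summarized with the following sketch; the helper functions are hypothetical placeholders for the special-token prediction and within-sentence generation described in Steps 2 and 3.

def permgen_decode(model, src_ids, max_sentences=10):
    """Sketch of sentence-based decoding (Steps 1-4); helper calls are hypothetical."""
    generated = {}                                   # Step 1: sentence index -> generated text
    while True:
        # Step 2: predict <B-t> for an unused index t, or <EOP> to stop.
        tok = predict_special_token(model, src_ids, generated)
        if tok == "<EOP>":
            break
        t = int(tok.strip("<>").split("-")[1])       # e.g. "<B-3>" -> 3
        # Step 3: generate tokens autoregressively until <E-t> is produced.
        generated[t] = generate_sentence(model, src_ids, generated, t)
        # Step 4: t is now marked as used (it is a key of `generated`); continue.
    # Reorder the sentences by index to form the output paragraph.
    return " ".join(generated[t] for t in sorted(generated))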
1 When generating multiple candidates, we sample the first special token without replacement. For example, if we need to generate 3 candidates each with 5 sentences, their beginning tokens can be <B-1>, <B-3>, and <B-4>, respectively.
Rank with log-probability. We compute the log-likelihood of each candidate in the same way as in beam search (Vijayakumar et al., 2016) and sampling methods (Holtzman et al., 2020):

log p(Y) = Σ_{l=1}^{L} log p(y_l | y_1, · · · , y_{l−1}),    (8)

where L is the total number of tokens in Y and y_l is the l-th token in the generated paragraph Y.
Complexity reduction. Since the number of possible sentence orders grows as n! for an n-sentence paragraph, exact inference is extremely time-consuming. To reduce the complexity during inference, we employ approximate inference by taking advantage of the special token prediction mentioned in Step 2. The special token prediction happens when an end-of-sentence token (i.e., <E-t>) is generated. Instead of traversing each remaining possible sentence index, the model only chooses the most likely sentence index through the special token prediction. It should be noted that we reuse the classifier in the decoder by simply masking tokens not in {<B-t>}_{t=1}^{T}, without training any new classifiers. Therefore, the decoding time is roughly linear in the number of candidates to be generated.
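As a small sketch of the candidate ranking in Eq. (8), each candidate paragraph is scored by the sum of its token log-probabilities and candidates are sorted by this score; the interface below is illustrative.

def rank_candidates(candidates):
    """candidates: list of (paragraph_text, token_logprobs); return the best-scored first.
    The score is the sum of token log-probabilities, as in Eq. (8)."""
    return sorted(candidates, key=lambda c: sum(c[1]), reverse=True)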

Experiments
We conduct experiments on three text generation tasks: story generation, news generation, and paper abstract generation. For all tasks, we compare PermGen with multiple baseline models on diversity and accuracy of their generated texts. We also perform human evaluation on story generation.

Tasks and Benchmarks
Task 1: Story generation. In this task, models learn to generate story paragraphs from a title and multiple keywords. We use the ROCStories dataset (Mostafazadeh et al., 2016) and follow the same data preparation as Yao et al. (2019). ROCStories has 98,162 / 9,817 / 9,803 paragraphs for the training / development / test sets, respectively. The stories in the corpus capture causal and temporal commonsense relations between daily events.
Task 2: Paper abstract generation. In this task, models need to generate paper abstracts from a paper title and a list of keywords. We use the AGENDA dataset (Koncel-Kedziorski et al., 2019), which consists of 40,720 paper titles and abstracts from the Semantic Scholar Corpus, taken from the proceedings of 12 AI conferences. Each abstract is paired with several keywords. We follow the settings in Koncel-Kedziorski et al. (2019) to generate paper abstracts directly from the keywords, and use the same data partition of 38,720 / 1,000 / 1,000 examples for the training / development / test sets, respectively.
Task 3: News generation. In this task, models are trained to generate news articles from a list of keyphrases. We use the DailyMail dataset (See et al., 2017), a corpus of online news articles. We randomly sample 53,102 news articles and extract keyphrases from each sentence using RAKE (Rose et al., 2010). The dataset contains 49,102 / 2,000 / 2,000 news articles for the training / development / test sets.

Baseline Methods
BLM (Shen et al., 2020). Blank Language Model (BLM) generates sequences by dynamically creating and filling in blanks. The blanks control which part of the sequence to fill out, making it ideal for word-to-sequence expansion tasks.
POINTER. POINTER operates by progressively inserting new tokens between existing tokens in a parallel manner. This procedure is recursively applied until a sequence is completed. This coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
For each task, we also evaluate PermGen with different decoding methods, including beam search, truncated sampling (Fan et al., 2018), and nucleus sampling (Holtzman et al., 2020).
Truncated Sampling (Fan et al., 2018). It randomly samples the next word from the top-k candidates of the distribution at each decoding step.
Nucleus Sampling (Holtzman et al., 2020). It avoids text degeneration by truncating the unreliable tail of the probability distribution and sampling from the dynamic nucleus of tokens containing the vast majority of the probability mass.
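For reference, a simplified PyTorch sketch of both sampling strategies over a single next-token distribution is given below; it is an illustration under our own simplifications, not the exact implementation used in the experiments.

import torch

def sample_next_token(logits, top_k=0, top_p=0.0):
    """Simplified sketch of truncated (top-k) and nucleus (top-p) sampling over 1-D logits."""
    probs = torch.softmax(logits, dim=-1)
    if top_k > 0:
        # Truncated sampling: keep only the k most probable tokens, renormalize, sample.
        topk_probs, topk_idx = probs.topk(top_k)
        topk_probs = topk_probs / topk_probs.sum()
        return topk_idx[torch.multinomial(topk_probs, 1)]
    if top_p > 0.0:
        # Nucleus sampling: keep the smallest prefix of sorted tokens whose mass stays within top_p.
        sorted_probs, sorted_idx = probs.sort(descending=True)
        keep = sorted_probs.cumsum(dim=-1) <= top_p
        keep[0] = True                               # always keep the most probable token
        kept_probs = sorted_probs[keep] / sorted_probs[keep].sum()
        return sorted_idx[keep][torch.multinomial(kept_probs, 1)]
    return probs.argmax().unsqueeze(0)               # greedy fallback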

Implementation Details
We use pre-trained parameters from BART-base (Lewis et al., 2020) to initialize our model, which takes a maximum input length of 512 tokens and consists of a 6-layer Transformer encoder and a 6-layer Transformer decoder (Vaswani et al., 2017) with 12 attention heads and 768-dimensional hidden states. For fine-tuning, we use Adam with a learning rate of 3e-5, β_1 = 0.9, β_2 = 0.999, L2 weight decay of 0.01, learning rate warm-up over the first 10,000 steps, and linear decay of the learning rate. Our models are trained on four 32GB Tesla V100 GPUs and implemented with Huggingface's Transformers (Wolf et al., 2020).
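A sketch of this fine-tuning setup using the Huggingface API is shown below; the hyperparameters are those listed above, while AdamW is used here as a stand-in for Adam with L2 weight decay, and the total number of training steps is a placeholder assumption.

import torch
from transformers import BartForConditionalGeneration, get_linear_schedule_with_warmup

# Sketch of the fine-tuning setup described above; the wiring is illustrative.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-5, betas=(0.9, 0.999), weight_decay=0.01
)
num_training_steps = 100_000                      # assumption; depends on dataset size and epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=num_training_steps
)
# scheduler.step() is called after each optimizer.step() during training.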

Evaluation Metrics
We use metrics introduced in previous work (Ott et al., 2018;Vijayakumar et al., 2018;Zhu et al., 2018) to evaluate accuracy and diversity.

Diversity metrics
Corpus diversity (⇑). Distinct-k (Li et al., 2016) measures the total number of unique k-grams normalized by the total number of generated k-gram tokens, to avoid favoring long sentences. Entropy-k reflects how even the empirical k-gram distribution is for a given sentence when word frequency is taken into account (i.e., low weights for high-frequency words).
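For concreteness, a simple Python implementation of Distinct-k and an entropy-style variant over whitespace-tokenized outputs could look like the following; this is our own sketch, not the evaluation script used in the paper.

from collections import Counter
import math

def _ngram_counts(texts, k):
    counts = Counter()
    for s in texts:
        toks = s.split()
        counts.update(tuple(toks[i:i + k]) for i in range(len(toks) - k + 1))
    return counts

def distinct_k(texts, k):
    """Distinct-k: number of unique k-grams divided by the total number of k-grams."""
    counts = _ngram_counts(texts, k)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

def entropy_k(texts, k):
    """Entropy-k: entropy of the empirical k-gram distribution (frequent k-grams weigh less)."""
    counts = _ngram_counts(texts, k)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values()) if total else 0.0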

PermGen vs. Transformers
As shown in Table 2, PermGen improves both the diversity and the accuracy of generated text when initialized with either a non-pretrained (Transformer) or a pre-trained (BART) Transformer. For example, compared with BART, which has the best performance among the baselines, PermGen reduces Self-BLEU-4 by 43.2% and improves BLEU-4 by +1.5% on AGENDA. We observe similar improvements on all other paragraph generation tasks. More evaluation results are in Table ?? in the Appendix. POINTER achieves the lowest performance in the paragraph generation tasks. This is because its insertion operation ignores dependencies between generated words, so it cannot capture inter-sentence coherence well during long-text generation.
It should be noted that since BART performed the best among all baseline methods, we apply PermGen on top of BART in the following evaluations.

Ablation Study
As mentioned above, the absolute positions of tokens in Transformer (Vaswani et al., 2017) might not be available when we permute the sentence orders in paragraph generation, so we propose the hierarchical positional embedding that consists of a global position and a local position. In this section, we conduct an ablation study to show that merely adding the hierarchical positional embedding to BART (denoted Hi-BART) does not improve diversity compared to the original BART model. Hi-BART even underperforms the original BART (see Table 3). This is mainly because the newly added hierarchical positional embeddings are randomly initialized, without any pre-training on large corpora.

PermGen vs. Decoding Methods
We investigate the quality of text generated by PermGen (built on BART) when coupled with beam search, truncated sampling, and nucleus sampling. Figure 4 shows that, on average, PermGen significantly boosts diversity by 5.81% in Self-BLEU-3 and 6.83% in Self-BLEU-4, and improves accuracy by +1.2% and +1.5% in terms of Top1-BLEU-4 and Oracle-BLEU-4, respectively. As the diversity of generated text depends on the number of produced candidates, we compare the diversity of generation between BART and PermGen with varying numbers of output candidates K. Figure 5 shows that as K increases, PermGen consistently generates more diverse content, measured by the ratio of distinct 2-grams, Distinct-2 (dashed line). Meanwhile, measured by Entropy-4 (solid line), the proportion of novel words in the candidates generated by PermGen rises as K increases, while BART shows a flat or even falling trend.

Human Evaluations
We sample 100 inputs from the ROCStories test set, and each evaluated method generates its top-3 stories. Every story is assigned to five annotators with an NLP background. For diversity, the annotators are given two sets of top-3 stories from two methods each time and are instructed to pick the set that is more diverse; the choices are "win," "lose," or "tie." Then, the annotators give an accuracy score from 1 to 5 measuring the semantic similarity between the top-1 generated story and the ground-truth story. Finally, the annotators give a fluency and coherency score from 1 to 5 for each generated story.
* "Reordered from [2, 1, 3, 5, 4]" means that PermGen first generates the 2nd sentence, then the 1st sentence, and so on; finally, we reorder the generated story according to the ascending order of sentence indices, as shown in Figure 3.
Tables 5 and 6 show that PermGen outperforms beam search in both accuracy and fluency, while significantly improving generation diversity compared with other diversity-promoting methods. Table 4 shows stories generated by different diversity-promoting methods, including beam search, nucleus sampling, and our PermGen. Overall, we observe that PermGen generates more diverse stories than the other two methods. We notice that stories generated by beam search often differ only by punctuation and minor morphological variations, and typically only the last sentence (or the last several words) differs from the others. Nucleus sampling achieves better diversity than beam search, but the stories still follow similar storylines. In comparison, PermGen generates semantically richer and more diverse content.

Conclusions
In this paper, we proposed PermGen, a novel sentence-permuted paragraph generation model. PermGen maximizes the expected log-likelihood of the output paragraph with respect to all possible sentence orders. Experiments on three paragraph generation tasks demonstrated that PermGen outperforms the original Transformer by generating more accurate and diverse text. The results are consistent across various Transformer models and decoding methods.