Long Text Generation by Modeling Sentence-Level and Discourse-Level Coherence

Generating long and coherent text is an important but challenging task, particularly for open-ended language generation tasks such as story generation. Despite the success in modeling intra-sentence coherence, existing generation models (e.g., BART) still struggle to maintain a coherent event sequence throughout the generated text. We conjecture that this is because of the difficulty for the decoder to capture the high-level semantics and discourse structures in the context beyond token-level co-occurrence. In this paper, we propose a long text generation model, which can represent the prefix sentences at sentence level and discourse level in the decoding process. To this end, we propose two pretraining objectives to learn the representations by predicting inter-sentence semantic similarity and distinguishing between normal and shuffled sentence orders. Extensive experiments show that our model can generate more coherent texts than state-of-the-art baselines.


Introduction
The ability to generate coherent long texts plays an important role in many natural language generation (NLG) applications, particularly for open-ended language generation tasks such as story generation, namely generating a reasonable story from a prompt or a leading context. While existing generation models (Fan et al., 2018; Radford et al., 2019) can generate texts with good intra-sentence coherence, it is still difficult to plan a coherent plot throughout the text, even when using powerful pretrained models, as illustrated in Figure 1.
Pretrained generation models have shown state-of-the-art performance on various NLG tasks such as summarization and translation (Radford et al., 2019; Lewis et al., 2020). However, such tasks provide sufficient source information in the input for generating the desired texts, while open-ended generation tasks require expanding reasonable plots from very limited input information (Guan et al., 2020). As exemplified in Figure 1, we observe severe incoherence issues when applying BART to story generation. Although BART performs reasonably well at generating some concepts related to the context (e.g., "basketball", "player"), they are used incoherently in the generated texts, which manifests in repetitive plots (e.g., sentences B and C), unrelated events (e.g., "played baseball better") and conflicting logic (e.g., "not good at basketball" but "in the basketball team"). These issues are also commonly observed in other NLG models (Holtzman et al., 2020; Guan and Huang, 2020). We argue that existing models are rarely trained beyond token-level co-occurrence, and therefore they can easily generate related concepts but fail to arrange them reasonably. In contrast, human writers always first fully understand the semantics (e.g., key events such as "try out" and "not make the cut") and the discourse relations (e.g., temporal orders) among the already written sentences before deciding on the following content. In this way, writers can produce coherent stories even with few related concepts, as shown in Figure 1. Therefore, it is important for subsequent generation to capture high-level features in the context.

Figure 1: An example from ROCStories (Mostafazadeh et al., 2016). The story generated by BART suffers from severe incoherence despite containing some concepts related to the context (in bold). In comparison, the human writer produces a coherent story by fully considering the context semantics and the discourse relations (e.g., the temporal order) among the sentences.
In this paper, we propose HINT, a generation model equipped with HIgh-level representations for loNg Text generation. Typical generative models train a left-to-right decoder by next-word prediction based on the attention over all the prefix words. To encourage the model to capture high-level features, we extend the decoder to represent the prefix information at sentence level and discourse level, respectively, with special tokens inserted at the end of each sentence. To effectively learn the representations, we propose two pretraining objectives: (a) semantic similarity prediction, which requires predicting the inter-sentence similarity using the sentence-level representation, with the powerful sentence understanding model SentenceBERT (Reimers and Gurevych, 2019) as the teacher model; and (b) sentence order discrimination, which requires distinguishing between normal and shuffled sentence orders using the discourse-level representation. The objectives are designed to help the decoder capture the semantics and discourse structure of the prefix, which benefits modeling long-range coherence when generating long texts. We summarize our contributions in two folds:
I. We propose a generation model named HINT for long text generation. HINT derives high-level representations for each decoded sentence to model long-range coherence. We adopt two pretraining objectives, similarity prediction and order discrimination, to learn the representations at sentence level and discourse level, respectively.
II. We conduct extensive experiments on commonsense story and fiction generation tasks. Results show that HINT can learn meaningful high-level representations and generate more coherent long texts than baselines.

Related Works

Long Text Generation Recent studies tackle the incoherence problem in long text generation from the following perspectives. Li et al. (2015) adopted a hierarchical RNN-based decoder to learn sentence representations, but without any external supervision. Shao et al. (2017) proposed a self-attention mechanism that attends over the prefix by appending it to the RNN-based encoder, an idea similar to the vanilla Transformer (Vaswani et al., 2017). However, the token-level self-attention mechanism still struggles to model high-level dependencies in the context. Recent works proposed several multi-step generation models (Fan et al., 2018; Yao et al., 2019; Shao et al., 2019; Tan et al., 2020; Goldfarb-Tarrant et al., 2020), which first plan high-level sketches and then generate texts from the sketches. However, the lack of exposure to degenerate sketches may impair generation performance, since the models are trained only on sketches constructed from golden truth texts (Tan et al., 2020). Another line of work incorporates external knowledge into generation, especially for commonsense story generation (Guan et al., 2020; Xu et al., 2020). However, such methods may not always be effective for other types of generation tasks. Guan et al. (2020) also required the decoder to distinguish true texts from negative samples to alleviate potential issues such as repetition, but the classification objective does not provide explicit guidance for generation at each step. Therefore, the coherence of language generation is still an open problem.
High-Level Language Representation Significant advances have been witnessed in many NLP tasks with pretrained contextualized representations (Peters et al., 2018; Devlin et al., 2019). However, most models are limited to token-level representation learning, which is not enough for capturing the hierarchical structure of natural language texts (Ribeiro et al., 2020). Several works have tried to learn high-level representations. Skip-Thought vectors (Kiros et al., 2015) learned to encode a sentence by reconstructing its neighboring sentences. HLSTM (Yang et al., 2016) adopted a hierarchical LSTM-based encoder to learn contextualized sentence representations through downstream classification. HIBERT (Zhang et al., 2019) incorporated the hierarchical architecture into BERT (Devlin et al., 2019) and learned sentence representations by recovering masked sentences. SentenceBERT (Reimers and Gurevych, 2019) derived sentence representations by fine-tuning BERT on natural language inference. CONPONO (Iter et al., 2020) and SLM (Lee et al., 2020) further trained BERT to understand relations among sentences at the discourse level by distance prediction and sentence unshuffling, respectively. However, all these models focused on enhancing encoder representations for language understanding, while improving decoders with high-level representations for long text generation is yet to be well investigated.

Figure 2: Model overview of HINT, which is pretrained to predict the next token (Task 1), predict inter-sentence semantic similarity with the sentence-level representations (Task 2), and distinguish between normal and shuffled sentence orders with the discourse-level representations (Task 3), based on human-written texts and automatically constructed negative samples.

Task Definition and Model Overview
Our task can be defined as follows: given an input X = (x_1, x_2, ..., x_m) (e.g., a beginning or a prompt), the model should generate a multi-sentence text Y = (y_1, y_2, ..., y_n) with a coherent plot (each x_i or y_i is a token). To tackle the problem, conventional generation models such as BART commonly employ a bidirectional encoder and a left-to-right decoder to minimize the negative log-likelihood L_LM of human-written texts:

L_LM = -\sum_{t=1}^{n} \log P(y_t | y_{<t}, X),
P(y_t | y_{<t}, X) = \mathrm{softmax}(W H_t + b),

where H_t is the decoder's hidden state at the t-th position, computed from the context (i.e., the prefix y_{<t} and the input X), S_i is the contextualized representation of x_i acquired from the encoder, and W and b are trainable parameters.
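To make the objective concrete, the token-level loss can be sketched as follows (a minimal NumPy sketch; the logits stand in for W H_t + b, and the function name is ours, not from any released code):

```python
import numpy as np

def lm_loss(logits, targets):
    """Token-level negative log-likelihood, a sketch of L_LM.
    logits: (n, V) decoder outputs (one row per position);
    targets: (n,) gold token ids."""
    z = logits - logits.max(axis=-1, keepdims=True)       # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # average negative log-probability of the gold token at each position
    return -log_probs[np.arange(len(targets)), targets].mean()
```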
However, as aforementioned, these models often generate incoherent texts due to the decoder's inability to capture high-level features of the prefix sentences. Therefore, we extend the decoder with high-level representations to gather the prefix information. Specifically, we split the human-written texts into sequential sentences and add special tokens at the end of each sentence, which are used to aggregate the sentences' respective semantics and their discourse relations with one another during decoding. To this end, we devise two pretraining tasks besides the standard language modeling objective, namely similarity prediction and order discrimination, to learn the sentence-level and discourse-level representations, respectively, as Figure 2 shows. Although we only consider sentences as segments in this work, our method can easily be extended to other syntactic levels such as phrases or paragraphs.
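As an illustration of this preprocessing step, inserting the special tokens after each sentence might look like the following sketch (a naive regex splitter stands in for NLTK's sentence tokenizer, and the token strings [sen]/[dis] are placeholders for the actual vocabulary entries):

```python
import re

def add_special_tokens(text, sen_tok="[sen]", dis_tok="[dis]"):
    """Append a sentence token and a discourse token after every sentence.

    The regex split on sentence-final punctuation is a simple stand-in
    for a proper sentence tokenizer; the token strings are illustrative.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return " ".join(f"{s} {sen_tok} {dis_tok}" for s in sentences)
```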

Sentence-Level Representation
Assume that the target text Y consists of K sentences, denoted Y_1 to Y_K (e.g., AB and CD in Figure 2). We insert a special sentence token, [sen], at the end of every sentence in Y, which is designed to aggregate the semantics of each sentence. Let H^s_k (1 ≤ k ≤ K) denote the decoder's hidden state at the position where the k-th sentence token is the golden truth for next-token prediction. We expect H^s_k to be a meaningful sentence representation for Y_k, meaning that semantically similar sentences have close representations in the vector space. Since sentence representation has been well studied for language understanding with many powerful models such as SentenceBERT (Reimers and Gurevych, 2019), we propose to directly transfer their semantic knowledge for our sentence representation learning. Specifically, we require the HINT decoder to predict the similarity of any two sentences Y_i and Y_j using only the corresponding sentence representations H^s_i and H^s_j, with the SentenceBERT similarity as the golden truth. We do not directly learn the SentenceBERT representation of each sentence but only the similarity score, to avoid the discrepancy between the biases of the two models. Furthermore, to alleviate the innate bias of SentenceBERT, we do not enforce HINT to exactly fit the golden similarity. Instead, it is enough that the difference between the predicted score and the golden similarity is less than a margin ∆ ∈ [0, 1]. Formally, the loss function L_Sen for the similarity prediction task is derived as follows:

s_{ij} = (W_s H^s_i)^T (W_s H^s_j),
p_{ij} = \sigma(s_{ij}),
L_Sen = \sum_{1 \leq i < j \leq K} \max(|p_{ij} - t_{ij}| - ∆, 0),

where t_{ij} is the golden similarity, p_{ij} is the predicted similarity score, s_{ij} is an intermediate variable that guarantees p_{ij} is symmetric with respect to i and j, and W_s is a trainable parameter that transforms the representation space of HINT to that of SentenceBERT.
The task explicitly exerts external supervision to learn the sentence-level representation, enhancing the ability of the HINT decoder to fully understand the semantics of prefix sentences.
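A sketch of this objective, assuming a symmetric dot product of projected representations followed by a sigmoid as the predicted score (the exact parameterization and dimensions are illustrative):

```python
import numpy as np

def similarity_loss(h_sen, target_sim, w_s, delta=0.1):
    """Margin-based similarity prediction loss, a sketch of L_Sen.

    h_sen:      (K, d) hidden states at the [sen] positions
    target_sim: (K, K) golden SentenceBERT similarities t_ij
    w_s:        (d, d') projection into the teacher's representation space
    A predicted score within `delta` of the golden similarity incurs no loss.
    """
    z = h_sen @ w_s
    s = z @ z.T                                   # s_ij = s_ji, so p_ij is symmetric
    p = 1.0 / (1.0 + np.exp(-s))                  # predicted scores p_ij in (0, 1)
    hinge = np.maximum(np.abs(p - target_sim) - delta, 0.0)
    iu = np.triu_indices(len(h_sen), k=1)         # each pair i < j counted once
    return hinge[iu].mean()
```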

Discourse-Level Representation
In analogy to the sentence-level representation learning, we also insert a special discourse token, [dis], after every sentence and its corresponding sentence token, to gather the discourse information between different sentences. Let H^d_k (1 ≤ k ≤ K) denote the decoder's hidden state at the position where the k-th discourse token is the golden truth to be predicted. H^d_k should be a meaningful representation from which discourse relations with other sentences can be derived (e.g., that the k-th sentence precedes another one in terms of the temporal order). Previous work has shown that reconstructing the correct order from shuffled sentences helps models understand discourse relations (Lee et al., 2020). However, the unshuffling task is not directly applicable to NLG, since the decoder should learn to dynamically model the discourse structure during decoding rather than wait until the whole text has been decoded. Therefore, we propose to learn the discourse-level representation in a pair-wise manner by discriminating whether the order of two sentences is correct. Formally, we minimize the cross-entropy loss L_Dis as follows:

q_{ij} = \sigma((H^d_i)^T W_d H^d_j),
L_Dis = -\sum_{i \neq j} [ o_{ij} \log q_{ij} + (1 - o_{ij}) \log(1 - q_{ij}) ],

where o_{ij} is the golden label (1 if Y_i should precede Y_j, 0 otherwise), q_{ij} is the predicted discrimination score, and W_d is a trainable parameter. Compared with the sentence-level representation H^s_k, which aggregates the semantics of a single sentence, the discourse-level representation H^d_k focuses more on the relationships with other sentences, thereby improving HINT's ability to capture high-level features in both content and order.
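The pairwise order discrimination objective can be sketched analogously (the bilinear scoring function is an illustrative choice):

```python
import numpy as np

def order_loss(h_dis, order_labels, w_d, eps=1e-9):
    """Pairwise sentence-order discrimination loss, a sketch of L_Dis.

    h_dis:        (K, d) hidden states at the [dis] positions
    order_labels: (K, K) with order_labels[i, j] = 1 if sentence i should
                  precede sentence j, else 0
    w_d:          (d, d) bilinear parameter
    Binary cross-entropy over all ordered pairs i != j.
    """
    s = h_dis @ w_d @ h_dis.T
    q = 1.0 / (1.0 + np.exp(-s))                   # predicted scores q_ij
    mask = ~np.eye(len(h_dis), dtype=bool)         # exclude the diagonal i == j
    o, q = order_labels[mask], q[mask]
    return -(o * np.log(q + eps) + (1 - o) * np.log(1 - q + eps)).mean()
```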

Pretraining and Fine-tuning
To learn the high-level representations more effectively, we propose to augment the training corpus by automatically constructing negative samples from the human-written texts for pretraining. Specifically, for the order discrimination task, we randomly shuffle the sentences in human-written texts as negative samples. For the similarity prediction task, besides the negative samples with shuffled sentences, we also randomly repeat a sentence, or substitute a sentence with one from another text, as negative samples. We expect the negative samples to enhance the generalization ability of HINT during fine-tuning and inference. In summary, the overall loss function L_Pre for pretraining is computed as follows:

L_Pre = L_LM + \lambda_1 L_Sen + \lambda_2 L_Dis,

where we optimize the language modeling objective L_LM only on the human-written texts, L_Dis on the human-written texts and the negative samples with shuffled sentences, and L_Sen on all the human-written texts and negative samples. λ_1 and λ_2 are adjustable scale factors. By pretraining with the proposed two objectives, the decoder can better capture the semantics and discourse structures in the context. During fine-tuning, we train HINT only with the language modeling objective.
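The negative sample construction described above can be sketched as follows (the placeholder sentence stands in for sampling a sentence from another text, and the helper name is ours):

```python
import random

def make_negative(sentences, rng=random):
    """Build one negative sample from a human-written text (a sketch).

    With equal probability: shuffle the sentence order, repeat one
    sentence in place of its successor, or substitute one sentence with
    a stand-in drawn from another text (here a fixed placeholder).
    """
    sents = list(sentences)
    kind = rng.choice(["shuffle", "repeat", "substitute"])
    if kind == "shuffle":
        while sents == list(sentences):        # ensure the order actually changes
            rng.shuffle(sents)
    elif kind == "repeat":
        i = rng.randrange(len(sents) - 1)
        sents[i + 1] = sents[i]                # duplicate a sentence
    else:
        i = rng.randrange(len(sents))
        sents[i] = "<sentence drawn from another text>"
    return sents, kind
```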

Implementation and Pretraining Dataset
Since our approach can be adapted to any generation model with an auto-regressive decoder (e.g., GPT-2 (Radford et al., 2019), UniLM (Dong et al., 2019), etc.), we use BART as the base framework of HINT, which has been shown to have strong performance for long text generation (Goldfarb-Tarrant et al., 2020). We also provide the performance of GPT-2, which is widely used in the literature. Due to limited computational resources, we follow BART_BASE's hyper-parameters and utilize the public pretrained checkpoint to initialize HINT. The batch size is set to 10 and the maximum sequence length is set to 512 for both the encoder and the decoder. The margin ∆ in the similarity prediction loss is set to 0.1, and we present results with other settings of ∆ in the appendix. Both scale factors λ_1 and λ_2 in the overall pretraining loss are set to 0.1. We adopt BookCorpus (Zhu et al., 2015) as our pretraining dataset and split each text into sentences using NLTK (Bird and Loper, 2004). We create the training texts by taking a sentence as the input and the following ten sentences as the target output. Besides, we construct the same number of negative samples as human-written texts, where it is equally probable for a negative sample to be repeated, substituted, or shuffled. We pretrain HINT on BookCorpus for 0.1M steps.

Fine-tuning Setting
We evaluate HINT on ROCStories (ROC for short) (Mostafazadeh et al., 2016) and WritingPrompts (WP for short) (Fan et al., 2018). ROC contains 98,162 five-sentence commonsense stories. We follow Guan et al. (2020) in delexicalizing the stories in ROC by masking all names with special placeholders to achieve better generalization. WP originally contains 303,358 stories paired with writing prompts, which are usually unconstrained in writing topics. Considering that using too many examples for fine-tuning may weaken the influence of post-training, we randomly selected stories from the original validation set and test set of WP for the subsequent experiments. We regard the first sentence and the prompt as the input for generating a text for ROC and WP, respectively, and we only retain the first ten sentences (split using NLTK) of the texts in WP for fine-tuning. We present more details in Table 1. The batch size is set to 10/4 for ROC/WP, respectively; other hyper-parameters are the same as in the pretraining phase.

Baselines
We compare HINT with the following baselines.
Seq2Seq: It generates a text conditioned on the input. For better performance, we implement this baseline by training BART from scratch on the downstream datasets without pretraining.
Plan&Write: It first plans a keyword sequence conditioned on the input and then generates a text based on the keywords (Yao et al., 2019). We implement the model based on the code provided by the original paper.
GPT-2 and BART: They are fine-tuned on the downstream datasets with the language modeling objective.
BART-Post: It is first post-trained on the pretraining dataset with the original pretraining objectives of BART (text infilling and sentence permutation) for the same number of steps with HINT; and then fine-tuned on the downstream datasets with the language modeling objective.
BART-MTL: The model is trained by fine-tuning BART on the downstream datasets with multi-task learning (MTL), including the language modeling objective and an auxiliary multi-label classification objective (Guan et al., 2020), which requires distinguishing human-written texts from auto-constructed negative samples.
Furthermore, we conduct ablation tests by removing the proposed components respectively to investigate the influence of each component. Besides, we demonstrate the adaptation of our approach to general language generation models by directly fine-tuning BART and HINT on the downstream datasets with the proposed two objectives as auxiliary tasks. For a fair comparison, we set all pretrained models to the base version, and we also insert the sentence token and discourse token into each training text for all the baselines.
We generate texts using nucleus sampling (Holtzman et al., 2020) with p = 0.9 and a softmax temperature of 0.7 (Goodfellow et al., 2016) to balance the trade-off between diversity and fluency. We set the probability of generating [dis] to 1 whenever the last token is [sen], ensuring that HINT obtains the high-level representations for each sentence. During evaluation, we remove the special tokens from the generated texts. We apply these settings to all the baselines.
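A single decoding step with these settings can be sketched as follows (token ids and the constraint handling are illustrative; a real implementation would operate on batched model logits):

```python
import numpy as np

def sample_next(logits, prev_token, p=0.9, temperature=0.7,
                sen_id=1, dis_id=2, rng=None):
    """One decoding step: nucleus sampling with the constraint that the
    discourse token must follow the sentence token. Ids are placeholders."""
    if prev_token == sen_id:                  # force [dis] right after [sen]
        return dis_id
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))   # stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]           # tokens by descending probability
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, p) + 1]   # smallest set with mass >= p
    kept = probs[keep] / probs[keep].sum()        # renormalize the nucleus
    return int(rng.choice(keep, p=kept))
```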

Automatic Evaluation
Evaluation Metrics We adopt the following automatic metrics to evaluate the performance on the test sets: (1) Perplexity (PPL): Smaller perplexity scores indicate better fluency in general. We do not count the probability values at the positions where the sentence or discourse token is the golden truth.
(2) BLEU (B-n): We use n = 1, 2 to evaluate n-gram overlap between generated texts and human-written texts (Papineni et al., 2002). (3) Lexical Repetition (LR-n): The metric computes the percentage of generated texts that repeat a 4-gram at least n times (Shao et al., 2019). We set n = 2 for ROC and n = 5 for WP. (4) Semantic Repetition (SR-n): The metric first computes the average top-n SentenceBERT similarity between any two sentences in each generated text, and then averages the results as the final score. We set n = 1 for ROC and n = 10 for WP. (5) Distinct (D-4): The proportion of distinct 4-grams in the generated texts, measuring diversity. (6) Relatedness: We train a classifier to distinguish human-written texts from negative samples constructed by randomly substituting words, phrases and sentences of human-written texts. We then use the average classifier score over all the generated texts to measure context relatedness. (7) Sentence Orders: In analogy to the relatedness measurement, we train another classifier to distinguish human-written texts from negative samples in which the sentences are randomly shuffled, and use the average classifier score to measure sentence orders. We train the classifiers for the last two metrics on the training sets of the downstream datasets.
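As an illustration, the Lexical Repetition metric can be sketched as follows (whitespace tokenization stands in for the real tokenizer):

```python
def lexical_repetition(texts, n=2):
    """LR-n sketch: the percentage of texts that repeat any 4-gram at
    least n times. Whitespace splitting is an illustrative tokenizer."""
    def repeats(text):
        toks = text.split()
        counts = {}
        for i in range(len(toks) - 3):
            g = tuple(toks[i:i + 4])          # sliding 4-gram window
            counts[g] = counts.get(g, 0) + 1
        return any(c >= n for c in counts.values())
    return 100.0 * sum(repeats(t) for t in texts) / len(texts)
```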
Results on ROC We show the results on ROC in Table 2. We do not provide perplexity scores for Plan&Write and GPT-2 since they do not tokenize texts with the same vocabulary as BART. HINT outperforms all baselines in terms of perplexity, indicating a better ability to model the texts in the test set, and generates more word overlap with the reference texts, as shown by the better BLEU scores. Consistent with a previous observation (Xu et al., 2020), Plan&Write shows less lexical repetition than the pretrained models, possibly because small models are better at learning short-term statistics (e.g., n-grams) but not long-term dependencies. However, HINT improves the situation compared with GPT-2 and BART, and has less semantic repetition than all baselines, indicating its better ability to capture semantic features. Besides, our approach does no harm to generation diversity. HINT also outperforms the baseline models in generating related events and arranging a proper order, as shown by the higher relatedness and order scores. Furthermore, fine-tuning with the proposed objectives as auxiliary tasks can further reduce the lexical and semantic repetition and improve the relatedness and order scores for both BART and HINT, suggesting the general benefit of modeling long-range coherence at sentence level and discourse level.
Besides, the ablation test shows that the sentence-level and discourse-level representations are relatively more important for enhancing the ability to generate texts with related events and reasonable orders, respectively, and both contribute to reducing semantic redundancy. When post-training only with the language modeling objective, almost all the metrics drop substantially, indicating the importance of modeling high-level coherence.
Furthermore, we notice that some models achieve even higher relatedness scores than the golden texts. We summarize the possible reasons as follows: (a) it is still difficult for the learned classifier to judge the implicit relatedness in some golden texts, which may require strong reasoning ability; (b) some noisy texts with poor relatedness exist among the golden texts; and (c) the systems tend to generate a limited set of texts (as demonstrated by their much lower distinct-4 than the golden texts) with generic plots (Guan et al., 2020), which may easily obtain high relatedness scores. However, we believe the learnable metric is still meaningful for comparing the context relatedness of different models with similar diversity.
Results on WP We present the results on WP in Table 3. We use a larger n to compute the lexical/semantic repetition since we find that all models tend to repeat similar texts easily when generating texts with hundreds of words. We do not provide relatedness and order scores because it is difficult to train classifiers that distinguish human-written texts from negative samples satisfactorily. Table 3 shows that HINT outperforms the baselines except on lexical repetition, which accords with the results on ROC. Therefore, the high-level representations are effective for generating long texts of different lengths and domains.

Manual Evaluation
For manual evaluation, we conduct pair-wise comparisons with two strong baseline models (BART and BART-Post) and three ablated models of HINT. We randomly sample 200 texts from the test set of ROC and obtain 1,200 texts from the six models. (We do not conduct manual evaluation on WP since it would be hard to obtain acceptable annotation agreement for such long texts.) For each pair of texts (one by our model and the other by a baseline, along with the input), three annotators are hired to give a preference (win, lose, or tie) in terms of fluency and coherence, respectively. We adopt majority voting to make final decisions among the three annotators. We resort to Amazon Mechanical Turk (AMT) for annotation. We follow Xu et al. (2020) in defining fluency as a measure of intra-sentence linguistic quality and grammatical correctness, and coherence as inter-sentence relatedness and causal and temporal dependencies. The two aspects are evaluated independently. Besides, we control the annotation quality by filtering out those annotations where the annotator cannot make reasonable judgments when comparing a human-written text with a negative sample. Furthermore, we also ask workers to annotate the specific errors in the generated texts. We show the annotation instructions and the error analysis of the different models in the appendix. Table 4 shows the manual evaluation results. All the results show moderate inter-annotator agreement (0.4 ≤ κ < 0.6) or substantial agreement (0.6 ≤ κ < 0.8). We can see that HINT performs significantly better than the baselines in coherence by capturing the high-level features, and has comparable fluency.

Language Modeling
It is still necessary to investigate whether the learned representations help HINT capture high-level coherence better. Therefore, we propose to evaluate the models using individual language modeling tests for different aspects (Ribeiro et al., 2020). To this end, we construct coherent and incoherent examples based on the test set of ROC, and compute perplexity on the examples for each aspect. Specifically, we focus on the following aspects: semantic repetition, relatedness, negation, and causal and temporal relationships. We select human-written texts as coherent examples and construct incoherent examples by perturbing human-written texts. For example, for the temporal relationship aspect we select those texts with time-related words (e.g., "then") as coherent examples, and we exchange two sequential events connected by "then" in a human-written text, or substitute "before" with "after", to obtain incoherent examples. We show more details in the appendix.
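As an illustration, the "before"/"after" substitution used to build incoherent temporal examples can be sketched as:

```python
import re

def perturb_temporal(text):
    """Sketch of building an incoherent temporal example: swap "before"
    and "after" in a human-written sentence, leaving other words intact."""
    def flip(m):
        return "after" if m.group(0).lower() == "before" else "before"
    return re.sub(r"\b(before|after)\b", flip, text, flags=re.IGNORECASE)
```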
We present the results in Table 5. HINT models context coherence in the above aspects better than the baseline models (lower perplexity on the coherent examples) and recognizes incoherence errors more effectively (higher perplexity on the incoherent examples). By contrast, both BART-Post and HINT (w/o Sen&Dis) achieve an overall drop in perplexity compared with BART, even on the negative examples, indicating that they may still focus on capturing token-level features. As for the ablation study, we can see that the sentence-level representation enhances the ability of HINT to capture relatedness, negation and semantic repetition, while the discourse-level representation works mainly for causal and temporal relationships. However, we also notice the insignificant improvement of HINT over BART in recognizing unreasonable causal and temporal relationships, which may require injecting explicit inferential knowledge beyond learning sentence orders.

Case Study
We present several cases in the appendix to demonstrate that HINT can derive meaningful sentence-level and discourse-level representations and, with the help of these representations, generate texts with better coherence than the baselines.

Conclusion
We present HINT, a generation model for long text generation, which can represent the prefix information at sentence level and discourse level in the decoding process. We propose two pretraining objectives, inter-sentence similarity prediction and sentence order discrimination, to learn the sentence-level and discourse-level representations, respectively. Extensive experiments demonstrate that HINT can generate more coherent texts, with related context and proper sentence orders, than strong baselines. Further analysis shows that HINT has a better language modeling ability thanks to its modeling of high-level coherence.

Acknowledgments

We would also like to thank the anonymous reviewers for their invaluable suggestions and feedback.

Ethics Statement
We conduct the experiments based on two existing public datasets ROCStories and WritingPrompts, which are widely used for commonsense story generation and fiction generation tasks, respectively. Automatic and manual evaluation show that our model outperforms existing state-of-the-art models on both datasets, suggesting the generalization of our model to different domains. Besides, our approach can be easily extended to different syntactic levels (e.g., phrase-level, paragraph-level), different model architectures (e.g., GPT, UniLM) and different generation tasks (e.g., dialog generation, essay generation).
We resorted to Amazon Mechanical Turk (AMT) for manual evaluation. We did not ask about personal privacy or collect personal information from annotators during the annotation process. We hired three annotators and paid each annotator $0.05 for comparing each pair of stories. The payment is reasonable considering that it takes about 30 seconds on average for an annotator to finish a comparison.

A Implementation Details
We implement our model based on BART_BASE and use the public checkpoint and code of HuggingFace's Transformers. Both the encoder and the decoder contain 6 hidden layers with 12 attention heads. The vocabulary consists of 50,625 tokens with Byte-Pair Encoding (Radford et al., 2019), and we repurpose <mask> and </s> in the original vocabulary as the sentence token [sen] and the discourse token [dis], respectively. The learning rate for both post-training and fine-tuning is 3e-5, with Adam as the optimizer and an Adam epsilon of 1e-6. Post-training HINT on BookCorpus took about 32 hours, and fine-tuning took 7 hours/8 hours on ROC/WP, respectively. The results are based on 1 NVIDIA TITAN X GPU.

B Results on the Validation Set
Besides the performance on the test set which has been reported in the main paper, we also provide the performance on the validation set of ROC in Table 6 for HINT and strong baselines.

C ∆ for Sentence-Level Representation Learning
We tune ∆ to investigate the influence of the margin between the predicted similarity score of HINT and that of SentenceBERT. We present some automatic evaluation results with different ∆ in Table 7. Note that we use ∆ = 0.1 for the experiments in the main paper. We can see that a smaller ∆ (e.g., 0.01) leads to less lexical and semantic repetition but worse fluency (indicated by higher perplexity) and context relatedness, which may be caused by over-fitting to the bias of the teacher model. On the other hand, a larger ∆ (e.g., 0.5) results in worse performance on almost all metrics, even compared with ∆ = 1.0 (i.e., without the similarity prediction task). This indicates that a large ∆ prevents the model from learning effectively from the teacher model and impairs the model's own representations. By contrast, ∆ = 0.1 brings better overall performance.

D Manual Evaluation Annotation Instruction
We show the manual annotation interface in Figure 3. In each HIT (human intelligence task) on AMT, we show workers an input along with two text pairs: (a) a pair of generated texts (one by HINT and the other by a baseline), and (b) a pair consisting of a human-written text and a negative sample constructed by randomly perturbing a text sampled from the data (e.g., with repetition or substitution). Note that the two pairs are presented in random order. We then ask workers to select the better text in each pair in terms of fluency and coherence, respectively. Besides, we also require workers to annotate the errors in each text, including repetition (repeating the same or similar words), unrelatedness (entities or events unrelated to the input or to the text's own context), wrong temporal orders, and others. We reject a HIT when the worker does not judge the human-written text to be more coherent than the negative sample, or does not annotate any errors for the negative sample. In this way, we reject 21.09% of the HITs in total. Finally, we ensure that there are three valid and independent comparison results for each pair of generated texts.
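The quality-control rule above can be sketched as a small check (the function name and argument names are our own, for illustration only):

```python
def hit_is_valid(prefers_human_over_negative, negative_sample_errors):
    """Quality control for one AMT HIT, as described above.

    A HIT is rejected when the worker either does not judge the
    human-written text as more coherent than the perturbed negative
    sample, or annotates no errors for the negative sample.
    """
    if not prefers_human_over_negative:
        return False
    if len(negative_sample_errors) == 0:
        return False
    return True

# A worker who prefers the human text and flags "repetition" passes:
assert hit_is_valid(True, ["repetition"])
# A worker who flags nothing for the negative sample is rejected:
assert not hit_is_valid(True, [])
```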

E Error Analysis
Based on the manual annotation of errors in the generated texts, Table 8 summarizes the percentage of annotated texts (200 for each model) that contain each type of error. We consider a text to contain an error when at least two of the three annotators annotate that error for it. Note that each text of HINT is annotated five times (three annotators each time), since HINT is compared with the other five models; we therefore take the average of the five annotation results. We can see that HINT exhibits less repetition, better context relatedness, and better temporal orders than the baselines. However, the results also show that generating coherent long texts is still challenging.
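The aggregation described above (majority vote over three annotators, averaged over five comparison rounds for HINT) can be sketched as follows; the vote values are hypothetical examples, not real annotation data:

```python
def has_error(votes):
    """A text is counted as containing an error type when at least
    two of its three annotators marked that error (1 = marked)."""
    return sum(votes) >= 2

def error_rate(rounds):
    """Average the per-round majority decisions; each HINT text is
    annotated in five separate comparisons (three annotators each)."""
    decisions = [has_error(v) for v in rounds]
    return sum(decisions) / len(decisions)

# Hypothetical votes for one HINT text across its five comparisons:
rounds = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 1], [1, 0, 0]]
print(error_rate(rounds))  # → 0.6 (3 of 5 rounds reach a majority)
```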

F Case Study
Sentence-Level Representation Table 10 presents cases from the test set of ROC to demonstrate the effectiveness of the learned sentence-level representation of HINT. We compute BLEU-1, BART similarity, and HINT similarity for different sentence pairs, where BART/HINT similarity means the cosine similarity between the BART/HINT representations of the two sentences. To obtain the BART representation of a sentence, we feed it into the BART decoder (along with its context) and apply mean-pooling over the hidden states at the last layer. The HINT representation refers to the corresponding sentence-level representation after decoding the sentence. We normalize all the results into the standard Gaussian distribution. We can see that HINT derives meaningful sentence-level representations, giving high scores for semantically similar sentence pairs (the first two pairs) but low scores for dissimilar pairs (the last two pairs). By contrast, BART focuses more on token-level similarity and thus derives similarity scores consistent with BLEU.
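The three operations used above (mean-pooling hidden states into a sentence vector, cosine similarity, and normalization to a standard Gaussian over sampled pairs) can be sketched in plain Python; this is a minimal illustration of the computations described, not the paper's code:

```python
import math

def mean_pool(hidden_states):
    """Average a list of per-token hidden-state vectors into one
    sentence vector (how the BART representation is obtained above)."""
    dim = len(hidden_states[0])
    return [sum(h[i] for h in hidden_states) / len(hidden_states)
            for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def z_normalize(scores):
    """Normalize raw similarity scores to zero mean and unit variance,
    as done over 2,000 sampled sentence pairs in the paper."""
    mu = sum(scores) / len(scores)
    sd = math.sqrt(sum((s - mu) ** 2 for s in scores) / len(scores))
    return [(s - mu) / sd for s in scores]

print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # → [2.0, 3.0]
```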

Negation
Texts with negated words (e.g., "not", "unable"). Case: The man turned it on. It did not respond. The man unplugged it. He took it apart. He could never get ...
Perturbation: Inserting or deleting negated words for 20% of the sentences. Case: The man turned it on. It {did not respond}delete {responded}insert. The man unplugged it. He took it apart. He could never get that thing to work.

Causal Relationship
Texts with causality-related words (e.g., "so", "because"). Case: Mike had a very stressful job. He needed a vacation. So he took one. He headed to the sunny beaches of Mexico. Mike had a great time on his vacation.
Perturbation: Reversing the cause and effect (two individual sentences or clauses connected by a causality-related conjunction such as "so"); substituting the causality-related words with their antonyms (e.g., "reason" vs. "result"). Case: Mike had a very stressful job. {He took one.}reverse So {he needed a vacation.}reverse He headed to the sunny beaches of Mexico ...

Temporal Relationship
Texts with time-related words (e.g., "then"). Case: Karen got stung by a bee. Her arm swelled up immediately. It turned out she was allergic to bees! She had to go to the hospital for medication. Then she felt much better!
Perturbation: Reversing two sequential events (two individual sentences or two clauses) connected by a time-related conjunction; substituting the time-related words with their antonyms (e.g., "after" vs. "before"). Case: ... Her arm swelled up immediately. It turned out she was allergic to bees! {She felt much better!}reverse Then {she had to go to the hospital for medication.}reverse

Discourse-Level Representation
We also present a case in Table 11 to demonstrate the effectiveness of the learned discourse-level representation of HINT. We consider a segment in the text of Table 11, which consists of two adjacent sentences (e.g., the segment S3S4 in S2S3S4S5). Then, we derive the segment representation by concatenating the contextualized representations of the two sentences. Besides, if we reverse the two sentences (from S3S4 to S4S3, with the other sentences in the text unchanged), we can also derive the segment representation in the same way. Note that in this case we still concatenate the two sentence representations in the normal order (i.e., first the representation of S3 and then that of S4). If the sentence representations contain discourse-level information, we expect the segment representations before and after the reversion to be distant in the vector space; otherwise, the segment representations would be similar, since the segments contain the same tokens before and after the reversion. For BART, we derive the sentence representation by feeding the whole text into BART and mean-pooling the hidden states at the positions of the tokens in the sentence. For HINT, we regard the corresponding discourse-level representation of each sentence as the sentence representation. For reference, we also show the results of using the hidden state of BART at the position of the discourse token as the sentence representation, i.e., B (D). Table 11 shows the similarity between the segment representations before and after the sentence reversion. All the results are normalized into the standard Gaussian distribution.

Input: S1 Kate was at her garbage can on a dark night.
Human-written Text: S2 And a raccoon was standing near the can. S3 It started to come towards her. S4 Kate turned and ran to the house hoping it wasn't behind her. S5 Once inside she was relieved to see it hadn't followed her.

Table 11: A human-written text sampled from the test set of ROC with five sentences from S1 to S5. We consider two adjacent sentences as a segment (underlined) and compute the similarity of the segment representations (derived by BART or HINT) before and after reversing the two sentences. B (M) and B (D) mean using BART to derive the sentence representation by mean-pooling and taking the hidden state at the position corresponding to the discourse token, respectively.

Footnotes: (2) The paraphrases are generated based on the public checkpoint of the back translation augmentation system of UDA (Xie et al., 2020). (6) We compute the mean and standard deviation within 2,000 sentence pairs randomly sampled from the test set.
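The reversal probe can be illustrated with a toy example. The two "encoders" below are hypothetical stand-ins (not the actual BART or HINT models): one depends only on the sentence itself, the other also encodes the sentence's position in the text. Only the context-sensitive one yields different segment representations after reversal:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def segment_rep(sent_reps):
    """Concatenate the two sentence vectors in the fixed (normal)
    order, as described above."""
    return sent_reps[0] + sent_reps[1]

# Hypothetical toy encoders: a context-insensitive one, and a
# context-sensitive one that also encodes the sentence's position.
def encode_no_context(sentence, position):
    return [float(len(sentence))]          # ignores position entirely

def encode_with_context(sentence, position):
    return [float(len(sentence)), float(position)]

s3 = "It started to come towards her."
s4 = "Kate turned and ran."

for enc in (encode_no_context, encode_with_context):
    # Normal order: s3 at position 0, s4 at position 1.
    before = segment_rep([enc(s3, 0), enc(s4, 1)])
    # Reversed order: s4 now precedes s3 in the text, but we still
    # concatenate the representation of s3 first, then that of s4.
    after = segment_rep([enc(s3, 1), enc(s4, 0)])
    print(enc.__name__, round(cosine(before, after), 3))
```

A context-insensitive encoder gives identical segment vectors before and after reversal (cosine 1.0), mirroring the argument that the segments contain the same tokens; only a representation that captures inter-sentence context can tell the two orders apart.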