Is Everything in Order? A Simple Way to Order Sentences

The task of organizing a shuffled set of sentences into a coherent text has been used to evaluate a machine's understanding of causal and temporal relations. We formulate the sentence ordering task as a conditional text-to-marker generation problem. We present Reorder-BART (RE-BART), which leverages a pre-trained Transformer-based model to identify a coherent order for a given set of shuffled sentences. The model takes a set of shuffled sentences with sentence-specific markers as input and generates a sequence of position markers corresponding to the ordered text. RE-BART achieves state-of-the-art performance across 7 datasets in Perfect Match Ratio (PMR) and Kendall's tau ($\tau$). We also perform evaluations in a zero-shot setting, showing that our model generalizes well to other datasets. We additionally conduct several experiments to understand the functioning and limitations of our framework.


Introduction
Constructing coherent text requires an understanding of entities, events, and their relationships. Automatically understanding such relationships among nearby sentences in a multi-sentence text has been a longstanding challenge in NLP.
The sentence ordering task was proposed to test the ability of automatic models to reconstruct a coherent text from a given set of shuffled sentences. Coherence modeling has wide applications in natural language generation, such as extraction-based multi-document summarization (Barzilay and Elhadad, 2002; Galanis et al., 2012; Nallapati et al., 2017), retrieval-dependent QA (Yu et al., 2018; Liu et al., 2018), and concept-to-text generation (Schwartz et al., 2017).
Earlier studies on coherence modeling and sentence ordering focused on exploiting different categories of features such as coreference clues (Elsner and Charniak, 2011).

* Authors contributed equally.
[Figure 1: An example of the sentence ordering task. A shuffled input ("I packed my raincoat.", "The forecast called for rainy.", "It never rained.", "The weather is never predictable.", "Instead it started to snow.") is reordered into a coherent story ("The forecast called for rainy.", "I packed my raincoat.", "It never rained.", "Instead it started to snow.", ...).]
In this paper, we present RE-BART (for Reorder-BART), which solves sentence ordering as conditional text-to-marker generation: the input is a shuffled set of sentences, and the output is a sequence of position markers encoding the coherent sentence order.
Sentence ordering can be viewed as the task of reconstructing the correct text from a noisy input. For this reason, we use BART (Lewis et al., 2020) as the underlying generation module for RE-BART. BART is pre-trained as a denoising autoencoder, where one of the objectives involves generating coherent text from corrupted input sequences. Prior works encode sentences individually or in a pairwise manner and then compute the position of each sentence in the paragraph. We instead encode the entire shuffled sequence at once, which yields token representations informed by the full input context. This helps the model capture interactions among all sentences and identify the relative order among them. Our simple framework outperforms the previous state-of-the-art by a large margin on all benchmark datasets. Specifically, we achieve 11.3%-36.2% relative improvements in Perfect Match Ratio (PMR) and 3.6%-13.4% relative improvements in Kendall's tau ($\tau$) across all benchmarks.

Our main contributions are:
• We formulate sentence ordering as a conditional text generation problem and present a simple method to solve it.
• We empirically show that our model significantly outperforms existing approaches by a large margin and achieves state-of-the-art performances across all benchmark datasets.
• We conduct zero-shot evaluations showing our model trained on Movie Plots outperforms the previous in-domain trained state-of-the-art.
• We present a thorough analysis to evaluate sensitivity of our model to different input properties.

Related Work
The problem of sentence ordering can be formulated as finding the order with maximum coherence. Earlier works focused on modeling local coherence using linguistic features (Elsner and Charniak, 2011; Guinaudeau and Strube, 2013). A line of work has leveraged neural networks to encode sentences and retrieve the final order using a pointer network (Vinyals et al., 2015) by comparing them in a pairwise manner (Gong et al., 2016; Logeswaran et al., 2018a; Cui et al., 2018; Yin et al., 2019, 2020). HAN (Wang and Wan, 2019) and TGCM (Oh et al., 2019) used attention-based pointer networks for decoding. B-TSort (Prabhumoye et al., 2020) uses topological sorting to retrieve the final order from sentence pairs. Zhu et al. (2021) encode sentence-level relationships as constraint graphs to enrich sentence representations. The state-of-the-art approach (Cui et al., 2020) introduced a novel pointer decoder with a deep relational module.
Other works reframed the task as a ranking problem: one line of work relies on a ranking framework to retrieve the order of sentence pairs, while Kumar et al. (2020) utilized a BERT (Devlin et al., 2019) encoder to generate a score for each sentence, which is then used to sort the sentences into the correct order.
Different from these approaches, we formulate sentence ordering as a conditional text generation task. We use a sequence-to-sequence model in our framework where the decoder encapsulates the functioning of a pointer network while generating output sentence positions. Our code is available at: https://github.com/fabrahman/ReBART.

RE-BART
Given a sequence of shuffled sentences $S' = \{s'_1, s'_2, \ldots, s'_{N_S}\}$, where $s'_i$ denotes the $i$-th shuffled sentence and $N_S$ denotes the number of input sentences, the task is to generate the ordered output sequence $S^* = \{s_1, s_2, \ldots, s_{N_S}\}$.
We solve the sentence ordering task using a text-to-marker framework, shown in Figure 2. Specifically, taking a shuffled sequence of sentences $S'$ as input, we generate a sequence of position markers $Y = \{y_1, y_2, \ldots, y_{N_S}\}$ as output, where $y_i$ denotes the position in the shuffled input of the $i$-th sentence ($s_i$) of the corresponding ordered sequence. The ordered output sequence can then be reconstructed as $\hat{S} = \{S'_{y_1}, S'_{y_2}, \ldots, S'_{y_{N_S}}\}$. Our goal is to train a probabilistic model $P_\theta(Y \mid S')$ by optimizing the autoregressive objective:

$$\max_\theta \sum_{i=1}^{N_S} \log P_\theta(y_i \mid y_{<i}, S')$$

RE-BART consists of a sequence-to-sequence model with an encoder that receives a shuffled set of sentences and a decoder that generates position markers (2, 1, 3, etc.), which are then used to retrieve the final ordered sequence. We use BART (Lewis et al., 2020) as the underlying sequence-to-sequence model, since our task can benefit from its sentence-permutation pre-training objective. Additionally, to provide the model with a supervision signal for generating position markers, we tag each sentence in the shuffled input with a sentence marker (<S1>, <S2>, etc.).^1 Sentence markers are added as special tokens to the tokenizer. RE-BART learns to attend to these markers while generating the final order $Y$.
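To make the text-to-marker format concrete, here is a minimal sketch of how a shuffled instance could be serialized into an encoder input and a target marker sequence. This is our own illustration of the format described above, not the authors' released code; the exact whitespace and serialization details are assumptions.

```python
# Hypothetical serialization of one training instance for the
# text-to-marker framework. Markers <S1>, <S2>, ... follow the paper.

def make_example(shuffled, ordered):
    """Build the encoder input string and the target marker sequence."""
    # Tag each shuffled sentence with its sentence marker.
    src = " ".join(f"<S{i + 1}> {s}" for i, s in enumerate(shuffled))
    # Target: for each sentence of the *ordered* text, emit the marker
    # of its position in the shuffled input (1-indexed).
    tgt = " ".join(f"<S{shuffled.index(s) + 1}>" for s in ordered)
    return src, tgt

shuffled = ["I packed my raincoat.", "The forecast called for rain."]
ordered = ["The forecast called for rain.", "I packed my raincoat."]
src, tgt = make_example(shuffled, ordered)
# tgt == "<S2> <S1>"
```

The decoder thus only ever produces marker tokens, which is what keeps the output space small.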
The proposed text-to-marker framework has two advantages over an alternative text-to-text framework, in which the model directly generates the entire text sequence instead of marker outputs. First, the model performs better because the output space is much smaller. This also makes it less susceptible to neural text degeneration (Holtzman et al., 2020), since significantly fewer output tokens are generated. Second, when generating the entire text sequence in the text-to-text framework, we observe that the model often generates text that is not part of the input, rendering the output invalid for the task.

Datasets
We run our experiments on 7 publicly available English datasets from two domains: scientific paper abstracts and narrative texts. We randomly split ROCStories into train/test/validation sets in an 80:10:10 ratio. For the other datasets, we use the same train, test, and validation sets as previous works. Dataset statistics are reported in Table 1.

^1 We experimented with various combinations of sentence markers and position markers, and found that the text-to-marker framework performs the best.

Implementation Details
We use the Huggingface library (Wolf et al., 2019) for our experiments. During inference, we decode the output positions greedily by choosing the token with the highest probability at each step. The experiments are conducted in the PyTorch framework on a Quadro RTX 6000 GPU. The hyper-parameters for each dataset are provided in Table 2 in the Appendix.
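The greedy marker decoding described above can be sketched in a few lines. The `step_logits` below stand in for the decoder's per-step scores over the marker vocabulary; masking already-emitted markers is our own addition for illustration, not something the paper specifies.

```python
# Minimal sketch of greedy decoding over marker scores (our own
# illustration, not the authors' inference code).

def greedy_decode(step_logits, mask_used=True):
    """Pick the highest-scoring marker at each step.

    step_logits: one list of scores per output step, indexed by marker.
    mask_used: optionally forbid re-emitting a marker (an assumption).
    """
    used, order = set(), []
    for logits in step_logits:
        best = max(
            (i for i in range(len(logits)) if not (mask_used and i in used)),
            key=lambda i: logits[i],
        )
        used.add(best)
        order.append(best)
    return order

# Toy scores for a 3-sentence input.
logits = [[0.1, 0.7, 0.2], [0.6, 0.8, 0.1], [0.3, 0.2, 0.9]]
greedy_decode(logits)  # -> [1, 0, 2]
```

With `mask_used=False` this reduces to the plain per-step argmax the paper describes.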

Evaluation Metrics
Following previous works (Cui et al., 2020; Kumar et al., 2020; Wang and Wan, 2019), we use the following metrics to evaluate our approach.

Accuracy (Acc): the fraction of output sentence positions predicted correctly, averaged over all test instances. It is defined as:

$$\text{Acc} = \frac{1}{|D|} \sum_{S' \in D} \frac{1}{N_S} \sum_{i=1}^{N_S} \mathbb{1}\left[y_i = y_i^*\right]$$

where $S'$ is a shuffled input from the dataset $D$, $y_i$ and $y_i^*$ are the predicted and gold sentence markers at position $i$, and $N_S$ is the number of sentences in the input.

Perfect Match Ratio (PMR): the fraction of sentence orders exactly matching the correct order across all input instances:

$$\text{PMR} = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}\left[Y_j = Y_j^*\right]$$

where $Y_j$ and $Y_j^*$ are the predicted and gold position marker sequences, respectively, and $N$ is the number of instances in the dataset.

Kendall's Tau ($\tau$): a metric for the correlation between two sequences:

$$\tau = 1 - \frac{2 \cdot \#\{\text{discordant pairs between } Y \text{ and } Y^*\}}{\binom{N_S}{2}}$$

In our setup, we evaluate $\tau$ between the predicted position marker sequence $Y$ and the gold position marker sequence $Y^*$. A higher score indicates better performance for all metrics.
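The three metrics above can be reimplemented in a few lines. The sketch below is our own code (not the authors' evaluation script), operating directly on predicted and gold marker sequences.

```python
# Reference implementations of Acc, PMR, and Kendall's tau for marker
# sequences (our own sketch, not the paper's released code).
from itertools import combinations

def accuracy(pred, gold):
    """Fraction of positions predicted correctly for one instance."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def pmr(preds, golds):
    """Fraction of instances whose full order matches exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def kendall_tau(pred, gold):
    """1 - 2 * (#discordant pairs) / (n choose 2)."""
    rank = {m: i for i, m in enumerate(gold)}  # gold rank of each marker
    n = len(pred)
    discordant = sum(
        rank[pred[i]] > rank[pred[j]] for i, j in combinations(range(n), 2)
    )
    return 1 - 2 * discordant / (n * (n - 1) / 2)

kendall_tau([1, 0, 2], [0, 1, 2])  # one discordant pair of three -> 1/3
```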

Baselines
We compare RE-BART with 11 previous sentence ordering frameworks, including the current state-of-the-art BERSON (Cui et al., 2020), the model of Gong et al. (2016), and the Pairwise model.

[Table 4: Model performance using text-to-text and text-to-marker frameworks on ROCStories. A significant gain is observed using the text-to-marker framework.]
Apart from these baselines, we also include a text-to-text variant of our model, in which we fine-tune a pre-trained BART model to generate the text sequences of the sentences instead of their markers. We call this variant BART (fine-tuned).

Results
In this section, we evaluate the performance of RE-BART on several benchmark sentence ordering datasets. We also conduct a series of experiments to better understand the workings of our model and investigate its generalization capability. Table 3 reports the experimental results on all benchmark datasets.^5 RE-BART improves over all baselines by a significant margin and achieves new state-of-the-art results in the PMR and $\tau$ metrics on all datasets. In particular, RE-BART improves the previous state-of-the-art PMR by a relative margin of 18.8% on NeurIPS, 22.9% on AAN, 28.9% on NSF, 11.3% on arXiv, 36.2% on SIND, and 20% on ROCStories. We observe similar relative gains in $\tau$: 4.7% on NeurIPS, 7.1% on AAN, 13.4% on NSF, 3.6% on arXiv, 10.8% on SIND, and 6.8% on ROCStories.
We observe that RE-BART's performance on Wikipedia Movie Plots is relatively poor compared to other datasets. This could be because this dataset has relatively longer input sequences (Table 1), making the task more challenging for the model.

Comparison with the text-to-text framework: Table 3 also shows that RE-BART outperforms BART (fine-tuned), our text-to-text baseline, on all datasets. BART (fine-tuned) performs reasonably well on the NeurIPS, AAN, SIND, and ROCStories datasets, where the average number of sentences (Table 1) is low. It struggles on NSF, arXiv, and Movie Plots, where input sequences are longer. Upon manual inspection, we found that the BART (fine-tuned) model suffers from neural text degeneration (Holtzman et al., 2020) and produces output tokens that are not present in the input.

We hypothesize that training in our proposed text-to-marker framework yields a performance gain over the text-to-text framework irrespective of the underlying sequence-to-sequence model. To verify this hypothesis, we compare two settings of our framework that use BART and T5 as the underlying sequence-to-sequence model. In Table 4, we observe significant gains for both BART and T5 using our text-to-marker framework. This shows that the text-to-marker framework outperforms the text-to-text baseline irrespective of the generation module.

^5 Prior results are compiled from (Cui et al., 2020).
From the results in Table 3, we observe that our simple framework is effective and outperforms more complex baseline architectures. One explanation for RE-BART's success could be the use of sentence markers: RE-BART is able to encapsulate the context of individual sentences (as observed in the generated attention maps in §6.5) and produce markers at the correct output positions. Additionally, our text-to-marker framework is better at leveraging the causal and temporal cues implicitly captured by BART during pre-training.

BART vs. T5
We want to study the effect of BART's pre-training objective on its performance in the sentence ordering task. BART is pre-trained on multiple tasks, including the rearrangement of permuted sentences, which is closely related to our task. To investigate whether this pre-training objective gives BART an edge, we conduct the following experiment on the ROCStories dataset. We visualize the UMAP (McInnes et al., 2018) projections of sentence representations obtained from pre-trained BART-large and T5-large models, and color-code them according to their position in the ordered text $S^*$; for example, red represents the first sentence of every instance. We compare with T5, which has a similar architecture but is not pre-trained with the sentence permutation objective. In the case of BART, sentence embeddings belonging to the same output position ($s_i$) are better clustered in space, making them easier to identify, as shown in Figure 3. In the case of T5, the overlap among embeddings at different sentence positions is higher. To quantify the overlap, we measure cluster purity following Ravfogel et al. (2020). We perform k-means clustering on the UMAP projections of sentence embeddings from the pre-trained BART and T5 models (k = 5, since ROCStories has 5 sentences per instance). We measure the average purity of each cluster by computing the relative proportion of the most common sentence position. The mean cluster purity is 35.9% for BART and 23.6% for T5. This indicates that since pre-trained BART is already able to segregate sentences based on their original position, it finds it easier to reorder a shuffled set.
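The mean cluster purity used above can be computed as follows. This is our own sketch of the general recipe (most common position per cluster, averaged over clusters), not the authors' or Ravfogel et al.'s exact code.

```python
# Mean cluster purity: average, over clusters, of the share of the most
# common sentence position within each cluster (our own sketch).
from collections import Counter

def mean_cluster_purity(cluster_ids, positions):
    """cluster_ids[i] is the k-means cluster of embedding i;
    positions[i] is that sentence's position in the ordered text."""
    by_cluster = {}
    for c, p in zip(cluster_ids, positions):
        by_cluster.setdefault(c, []).append(p)
    purities = [
        Counter(ps).most_common(1)[0][1] / len(ps)
        for ps in by_cluster.values()
    ]
    return sum(purities) / len(purities)

# Two clusters: one pure, one 2/3 pure -> mean purity (1.0 + 2/3) / 2
mean_cluster_purity([0, 0, 1, 1, 1], [3, 3, 1, 1, 2])
```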
The impact on downstream performance is shown in Table 4, where BART outperforms T5 in both setups. We posit that sentence permutation denoising during pre-training gives BART an advantage in the sentence ordering task.

Ablations
We perform a series of ablation experiments with different setups to better understand the workings of our model. All experiments in this section are performed on the ROCStories dataset.
In the first ablation test, we want to verify whether the model is able to capture coherence among sentences or is just over-fitting on the data.
To this end, we train our model using an arbitrarily shuffled order as output instead of the ground-truth order. We observe near-random prediction performance, as shown in the second row of Table 5.

Next, we examine whether the sentence markers provide strong supervision to the model during training. Our initial assumption was that the model can use these markers to learn sentence ordering. To validate this assumption, we remove the sentence markers from the input (so the input is simply a sequence of shuffled sentences) and evaluate whether the model can implicitly figure out the sentence positions. We observe a significant drop in $\tau$ (-14.97%) and PMR (-6.19%) when comparing the third and last rows in Table 5. This result shows that sentence markers are indeed helpful.
Finally, we investigate whether the sequential nature of the sentence markers has an impact on performance. We tag every sentence in an input with a random sentence marker between 0 and 100 (e.g., <S47>, <S78>, etc.). We observe that the model's performance is quite close to that of the final setup (fourth row in Table 5). There is a slight drop in performance, which can be attributed to the inconsistent assignment of sentence markers across instances. This shows that the model can still effectively exploit the sentence markers and that their sequential nature has little impact on the final performance.

Zero-shot Performance
We investigate how well our model is able to generalize across different datasets. To this end, we evaluate the zero-shot performance of our model on different datasets.
In our experiment, we train the RE-BART model on a single dataset and test it on all others in a zero-shot setup. From the results in Table 6, we observe that in most zero-shot setups RE-BART performs well across different domains. In particular, RE-BART fine-tuned on Wikipedia Movie Plots generalizes well to unseen datasets. Surprisingly, it even outperforms the previous state-of-the-art BERSON, which was fine-tuned on in-domain data, in PMR on all datasets except NSF abstracts (see the last row for comparison). We posit that the presence of longer sentences with more complex language in the Movie Plots dataset helps the model generalize to other datasets.

[Table 6: Performance of our model when trained on one dataset and evaluated on another in a zero-shot setup. The best and second-best performances for any metric are in bold and underlined, respectively. *We include the performance of BERSON, evaluated on the same dataset it is fine-tuned on, for comparison (from Table 3).]

RE-BART trained on ROCStories performs the worst across all datasets. Its poor performance can be attributed to the fact that ROCStories consists of fixed-length stories with short sentences and simpler language, which makes transfer to other, more complex datasets harder. However, it performs reasonably well on SIND, where the data is from a similar domain and most instances are five sentences long.
From the results in Table 6, we also observe that RE-BART transfers well in both directions (narrative → abstract and abstract → narrative). The model trained on Wikipedia Movie Plots (narrative domain) achieves the best zero-shot performance on AAN and NSF abstracts (abstract domain). We also observe good performance in the (abstract → narrative) direction, when RE-BART trained on AAN and NSF is tested on ROCStories. These experiments show that our model is able to generalize across domains and is not restricted to the domain of the dataset it is trained on.

Analysis
In this section, we perform experiments to explore how RE-BART's behavior varies with different properties of the input.

Effect of Shuffling
We analyze whether RE-BART's performance is sensitive to the degree of shuffling in the input. To this end, we define the degree of shuffling $d(S', S^*)$ as the minimum number of swaps required to reconstruct the ordered sequence $S^*$ from $S'$. A lower $d(S', S^*)$ indicates that the input $S'$ is more similar to the ordered output sequence $S^*$. To effectively compare performance across all datasets, we compute the normalized degree of shuffling as:

$$\bar{d}(S', S^*) = \frac{d(S', S^*)}{N_S}$$

In Figure 4, we observe a gradual decline in performance across all metrics as the normalized degree of shuffling $\bar{d}(S', S^*)$ increases. Overall, the results show that RE-BART's performance is higher when $\bar{d}(S', S^*)$ is lower. This could be because a lower degree of shuffling means a more coherent and meaningful input, resulting in an easier task for the model.
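The degree of shuffling $d(S', S^*)$ defined above, i.e. the minimum number of swaps needed to sort a permutation, can be computed via the standard cycle-decomposition result (a cycle of length $L$ costs $L - 1$ swaps). The sketch below is our own implementation of this textbook algorithm.

```python
# Minimum number of swaps to sort a permutation = n - (number of cycles),
# computed here by walking each cycle once.

def degree_of_shuffling(perm):
    """perm[i] = original (ordered) position of the sentence now at slot i."""
    seen, swaps = [False] * len(perm), 0
    for i in range(len(perm)):
        if seen[i]:
            continue
        # Walk one cycle; a cycle of length L requires L - 1 swaps.
        length, j = 0, i
        while not seen[j]:
            seen[j] = True
            j = perm[j]
            length += 1
        swaps += length - 1
    return swaps

degree_of_shuffling([2, 0, 1, 3])  # one 3-cycle -> 2 swaps
```

The same routine serves to compute the prediction displacement $d(Y, Y^*)$ used later in the analysis.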

Effect of Input Length
In this experiment, we analyze how RE-BART's performance varies with the number of sentences in the input. Figure 5 shows RE-BART's performance for inputs with different numbers of sentences, $N_S$. We observe a generally declining trend in performance with increasing input length across datasets, which shows that the model finds it difficult to tackle longer input instances. The drop in performance is more pronounced for NSF and arXiv, which have instances with more sentences than the other datasets. For all datasets, we observe that the rate of decline in $\tau$ is much lower than for Accuracy and PMR. From this observation, we infer that even when the predicted positions of individual sentences are incorrect, our model produces sentence orders that are correlated with the original order.

Position-wise Performance
Here, we explore how the performance of RE-BART varies when predicting sentences at different positions in the ordered output. To investigate this uniformly across all datasets, we measure performance using a relative output position defined as $y_{rel} = \frac{y_i}{|S|}$. We round $y_{rel}$ to one decimal place and compute the prediction accuracy for each value of $y_{rel}$. The position-wise prediction accuracy for all datasets is shown in Figure 6. We observe that prediction accuracy is highest for the first sentence, then declines steadily before rising again towards the end of the output sequence.
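The bucketing described above can be sketched as follows. This is our own illustration of the analysis, and the choice of indexing the output slot as $i / N_S$ is an assumption about the exact definition.

```python
# Bucket each output slot by its relative position (rounded to one
# decimal place, as in the paper) and average correctness per bucket.
from collections import defaultdict

def positionwise_accuracy(preds, golds):
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, gold in zip(preds, golds):
        n = len(gold)
        for i, (p, g) in enumerate(zip(pred, gold)):
            y_rel = round(i / n, 1)  # relative output position (assumed i / N_S)
            totals[y_rel] += 1
            hits[y_rel] += int(p == g)
    return {r: hits[r] / totals[r] for r in sorted(totals)}
```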
We conjecture that RE-BART picks up on shallow stylistic cues that are often present in the first and last sentences, enabling higher prediction accuracy at these positions. For example, in ROCStories all first sentences contain a proper noun and introduce the protagonist of the story. In the abstracts, many papers start with similar phrases like "In this paper," and "We present", and end with "Our contributions are", "We achieve", etc. For Movie Plots, the last-sentence accuracy is significantly lower than for the other datasets because we consider only the first 20 sentences. Following previous works (Gong et al., 2016; Cui et al., 2018), we report the prediction accuracy for the head and tail (first and last) sentences on arXiv and SIND in Table 7. RE-BART outperforms all baselines by a large margin on both datasets.

Prediction Displacement
For instances where the model's prediction was wrong ($Y \neq Y^*$), we investigate how far the prediction $Y$ was from the gold label $Y^*$. To evaluate this, we compute $d(Y, Y^*)$, the minimum number of swaps required to retrieve $Y^*$ from $Y$. We experiment on the Wikipedia Movie Plots dataset, where RE-BART's performance was not as good as on the other datasets. From Figure 8, we observe that most incorrectly predicted samples have a low $d(Y, Y^*)$, with 70% of the incorrect predictions having $d(Y, Y^*) \leq 6$. This shows that even when the model makes a wrong prediction, it mostly misses a few positions and does not get the entire order wrong.

[Figure 7: Norm-based cross-attention map between decoded output tokens (y-axis) and encoder input tokens (x-axis); colorized cells denote the cross-attention between tokens at position (x, y). Lighter color indicates higher attention values. The model learns to attend around sentence markers and other special tokens.]

Attention Visualization
We visualize the norm-based cross-attention map (Kobayashi et al., 2020) between the decoded output and the encoder input for one of the attention heads in Figure 7. Lighter color indicates higher attention values. We wrap every input instance with the special tokens [shuffled] and [orig] at the beginning and end, respectively, along with sentence markers at the start of each sentence. In Figure 7, we observe that the model attends to tokens near these special tokens. This suggests that during decoding the model finds only the tokens next to the sentence markers useful. We hypothesize that this is because these tokens encapsulate the context of the corresponding sentence. We observe similar maps across different attention heads.

Effect of Sentence Displacement
We investigate whether performance varies when a sentence is placed far from its original position in the shuffled input. We compute the relative distance of sentence $s_i$ from its original position as:

$$\delta_{rel}(s_i) = \frac{\left| pos_{S'}(s_i) - pos_{S^*}(s_i) \right|}{N_S}$$

where $pos_{S'}(s_i)$ and $pos_{S^*}(s_i)$ denote the position of $s_i$ in the shuffled input and the ordered text, respectively. Figure 9 shows how performance varies with respect to $\delta_{rel}(s_i)$. We observe that accuracy does not change much with relative displacement. We infer that local sentence-level relative displacement does not dictate performance as much as global input-level factors like the degree of shuffling and input length.

Conclusion
In this work, we addressed the task of sentence ordering by formulating it as a conditional text generation problem. We observed that simply generating the output text from shuffled input sequences is difficult due to neural text degeneration. We solved this problem by proposing RE-BART, a text-to-marker generation framework. RE-BART achieves state-of-the-art performance on 7 benchmark datasets and generalizes well across different domains in a zero-shot setup. We investigated the limitations of our model and found that RE-BART is sensitive to factors such as the number of input sentences and the degree of shuffling. Future work can focus on developing models that are robust to such factors.