STaCK: Sentence Ordering with Temporal Commonsense Knowledge

Sentence order prediction is the task of finding the correct order of sentences in a randomly ordered document. Correctly ordering the sentences requires an understanding of coherence with respect to the chronological sequence of events described in the text. Document-level contextual understanding and commonsense knowledge centered around these events are often essential in uncovering this coherence and predicting the exact chronological order. In this paper, we introduce STaCK — a framework based on graph neural networks and temporal commonsense knowledge to model global information and predict the relative order of sentences. Our graph network accumulates temporal evidence using knowledge of ‘past’ and ‘future’ and formulates sentence ordering as a constrained edge classification problem. We report results on five different datasets, and empirically show that the proposed method is naturally suitable for order prediction. The implementation of this work is available at: https://github.com/declare-lab/sentence-ordering.


Introduction
Coherence is an essential quality of any natural language text (Halliday and Hasan, 1976). Correct ordering of sentences is a necessary attribute of coherence. As such, there has been much research on correct sentence order detection due to its application to various downstream tasks, such as question answering (Yu et al., 2018), multi-document summarization (Barzilay et al., 2002), automated content addition to text, text generation, and others. It also has potential applications in evaluating the quality of machine-generated documents. Existing approaches to sentence order prediction can be broadly classified into two categories: (1) sequence generation methods, and (2) pair-wise methods (Zhu et al., 2021; Prabhumoye et al., 2020). While the former considers tagging the entire sequence, the latter takes one sentence pair at a time and predicts their relative order. Pair-wise methods ignore the importance of document-level global information, i.e., while predicting the relative order of two sentences (s_i, s_j), other sentences s_k from the same document do not play any role.

[Figure 1: Position of two sentences differs based on the dissimilar contextual utterances; the ordering is also inferred using commonsense knowledge in document A. Document A: "Jennifer has her final exam tomorrow. She got so stressed, she pulled an all-nighter. She went into class the next day, weary. Her teacher postponed the test for next week. Jennifer felt bittersweet about it." Document B: "Jennifer will miss her friends after college. She has a great job lined up after graduation. That's the end of her college term. Jennifer felt bittersweet about it. Jennifer has her final exam tomorrow." Placements of the highlighted sentences differ with document-level information.]

Global document information is especially important while predicting the relative order of sentences that are further apart, as it can provide essential contextual cues. As an example, consider the two highlighted sentences in the two sample documents shown in Figure 1. Although the sentences describe seemingly identical events, they have a different relative order in the two documents because of their different contexts. We recognize this fundamental limitation in existing methods and hypothesize that global information is essential for predicting the relative order of a sentence pair. It encompasses not only the semantic information of the discourse, but also commonsense knowledge centered around all the sentences of the document.
In this paper, we propose a graph-based framework to represent sentences in a document and their relations. Using a two-layer relational graph convolutional network (RGCN) applied on this graph, we build a classifier that is able to learn the relative order of sentences in a document by accounting for the global document information encoded in the graph. We further infuse temporal commonsense knowledge (CSK) into this graph to improve model performance. The key motivation is that temporal commonsense knowledge can bring important information about events that may occur before or after an event described in a sentence.
Our paper makes two important contributions. Firstly, we show how to construct a document graph that captures global context information about the document. We employ an RGCN to encode the information in this graph and build an edge classifier that predicts the relative order of sentence pairs. Unlike previous work attempting to predict the relative order of sentence pairs, our approach explicitly accounts for global document-level information. Secondly, we infuse temporal CSK into our graph convolutional network to further improve its performance. To the best of our knowledge, there is no prior work that has attempted the use of CSK for sentence order prediction. Our results suggest that the graph representation encoding global document information and the temporal CSK are both effective in determining the order of sentences.

Background
Coherence and the problem of sentence order prediction have been extensively studied in literature due to their applicability in various downstream problems. Early work in this direction mainly used domain knowledge and handcrafted linguistic features to model the relation between sentences in a document (Lapata, 2003;Barzilay and Lee, 2004;Barzilay and Lapata, 2008). Sentence ordering methods in recent literature are primarily based on neural network architectures, and can be broadly categorized into two main families -i) Sequence generation methods and ii) Pair-wise methods.
Sequence generation methods use the entire sequence of the randomly ordered document to model local and global information. This information is then used to predict the correct order. The sentences and documents are typically encoded using a recurrent or transformer-based network (Gong et al., 2016; Yin et al., 2020; Kumar et al., 2020). Hierarchical encoding schemes are also common (Wang and Wan, 2019). Prediction is then generally performed with a pointer network decoder (Vinyals et al., 2015) based on beam search. Alternatively, ranking losses (Xia et al., 2008) have also been explored to circumvent the expensive beam search algorithm (Kumar et al., 2020). Such models predict a score for each sentence, which is then sorted to obtain the final order. Pair-wise methods are motivated by a different principle of sentence ordering. These models aim to predict the relative order of each pair of sentences in the document. The final order is then constrained on all of the predicted relative orders. The constraint solving problem is generally tackled with topological sorting (Prabhumoye et al., 2020), or with more sophisticated neural network models (Zhu et al., 2021).
Our proposed STaCK framework falls under this family of pair-wise models. In STaCK, temporal commonsense is modelled using the Commonsense Transformers (COMET) model (Hwang et al., 2020). The COMET model uses a BART (Lewis et al., 2020) sequence-to-sequence encoder-decoder framework and is pretrained on the ATOMIC-2020 commonsense knowledge graph (Hwang et al., 2020). The pretraining objective is to take a triplet {s, r, o} from the knowledge graph and generate the object phrase o from the concatenated subject and relation phrases s and r. The set of relations R includes the temporal relations 'is before' and 'is after'. COMET is pretrained on approximately 50,000 such temporal triplets, along with other commonsense relations from ATOMIC-2020. The pretraining on commonsense knowledge ensures that COMET is capable of distinguishing cause-effect (causality), past-future (temporality), and other event-centered commonsense knowledge.
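As a minimal sketch of the seq2seq pretraining objective described above, a triplet {s, r, o} can be serialized into an encoder input and a decoder target. The exact token strings and the example triplet below are illustrative assumptions, not the actual COMET preprocessing code; the relation tokens isAfter/isBefore with a [GEN] marker follow the convention mentioned in this paper.

```python
# Hypothetical serialization of an ATOMIC-style temporal triplet {s, r, o}
# for a seq2seq model: the encoder reads the subject and relation, and the
# decoder is trained to generate the object phrase.
def make_comet_example(subject, relation, obj):
    source = f"{subject} {relation} [GEN]"  # encoder input
    target = obj                            # decoder target
    return source, target

src, tgt = make_comet_example(
    "PersonX passes the exam", "isBefore", "PersonX celebrates")
```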
Recently, integer sequence generation methods have achieved impressive performance (Chowdhury et al., 2021). However, there exist some key differences that make the generative approaches fundamentally different from, and incomparable to, the family of pair-wise (sentence pair classification-based) approaches: 1. The generative models for sentence ordering take sequences as input that contain temporal information in the form of learned positional embeddings. One could argue that this temporal information is noisy and thus would not provide any useful information to the model. However, shuffling does not remove all the temporal information from the input that could assist the model; e.g., a shuffled order 5, 1, 3, 4, 2 still contains valid temporal information, i.e., sentence 1 precedes sentences 2, 3, and 4, and sentence 3 precedes sentence 4. Hence, a sequence generation model that accepts positional encodings of the sentences can still get a confounding temporal signal despite the shuffling.
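The point above can be made concrete with a few lines of code: even a shuffled position sequence leaks valid pairwise precedence information.

```python
# Enumerate the (earlier, later) sentence pairs whose relative order in a
# shuffled sequence still agrees with the gold order.
def valid_precedence_pairs(shuffled):
    pairs = set()
    for a in range(len(shuffled)):
        for b in range(a + 1, len(shuffled)):
            if shuffled[a] < shuffled[b]:  # gold order preserved
                pairs.add((shuffled[a], shuffled[b]))
    return pairs

pairs = valid_precedence_pairs([5, 1, 3, 4, 2])
# e.g. sentence 1 still precedes sentences 2, 3, and 4; 3 precedes 4
```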
2. The integer generative models for sentence ordering (Chowdhury et al., 2021) tend not to work when the sentence count during inference exceeds the highest sentence count observed in training. For instance, it has been observed that generative models trained on samples with five sentences would only generate tokens 1-5 during inference, even if the test document has more sentences. This raises serious questions about such models' reasoning ability in zero-shot situations. One future direction to tackle such issues would be passing the input sequence length as a prompt or input to the generative model. The use of a sequence length embedding could also be a possible solution. Contrary to this, pair-wise methods are robust at handling any number of sentences.

Methodology
An overall illustration of the proposed STaCK framework is shown in Fig. 2. We represent document D consisting of n sentences as a directed graph G = (V, E, R), with nodes v i ∈ V and directed labeled edges (v i , r ij , v j ) ∈ E, where r ij ∈ R is the relation type of the edge between v i and v j . Initial node embeddings are denoted as g i .

Graph Construction
The graph is constructed from the given document D as follows.

Nodes and Node Embeddings: We consider three different types of nodes in V.

Sentence nodes. Each sentence s_i in D is a sentence node in the graph. We pass the sentence through a DeBERTa (He et al., 2020) model and use the final layer vector corresponding to the starting token <s> as the node embedding.

CSK nodes. For each sentence s_i, we have a past node p_i and a future node f_i. The CSK node embeddings are initialized from the BART encoder of the COMET model. Following COMET, we append the temporal relation specific tokens isAfter [GEN] and isBefore [GEN] to the sentence s_i. The concatenated text sequence is passed through the BART encoder, and the final layer vector corresponding to <s> is used as the node embedding for p_i and f_i.

Global node. The entire document D is considered as an additional global node g in G. We pass the document through a non-positional (position embeddings removed) RoBERTa model (Liu et al., 2019), and use the final layer vector corresponding to the starting token <s> as the global node embedding. This non-positional model is insensitive to the sequence of tokens passed into it. The usage of a non-positional model is essential, as the model must not have any information about the relative order of the sentences.

For a document D consisting of n sentences, we have 3n + 1 total nodes in G.
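The node inventory above can be sketched as follows. A placeholder random-vector encoder stands in for the actual DeBERTa, COMET-BART, and non-positional RoBERTa encoders (the function names and dimension are assumptions for illustration); the structural point is that n sentences yield 3n + 1 nodes.

```python
import numpy as np

# Stand-in for a transformer's <s> vector: a deterministic pseudo-random
# embedding keyed on the text. Purely illustrative.
def encode(text, dim=8):
    rng = np.random.default_rng(sum(map(ord, text)) % (2**32))
    return rng.standard_normal(dim)

def build_nodes(sentences):
    nodes = {}
    for i, s in enumerate(sentences):
        nodes[f"s{i}"] = encode(s)                      # sentence node
        nodes[f"p{i}"] = encode(s + " isAfter [GEN]")   # past CSK node
        nodes[f"f{i}"] = encode(s + " isBefore [GEN]")  # future CSK node
    # Sorting makes the global encoding order-insensitive, mimicking the
    # non-positional encoder's insensitivity to token order.
    nodes["g"] = encode(" ".join(sorted(sentences)))    # global node
    return nodes

nodes = build_nodes(["A b.", "C d.", "E f."])  # n = 3 -> 3n + 1 = 10 nodes
```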

Edges and Relations:
We construct edges with different relations based on the constituent nodes.

Sentence edges. Every pair of sentence nodes is connected with an edge of relation r_s. Our formulation leads to bidirectional edges between each sentence pair, i.e., both (s_i, r_s, s_j) and (s_j, r_s, s_i) ∈ E.

CSK edges. Each CSK node p_i and f_i has an edge with the corresponding sentence node. The relation is set differently for past and future CSK nodes. The direction of the edge is from the CSK node to the sentence node.

Global edges. The global node g has an edge with every sentence node: (g, r_g, s_i), for i = 1, 2, ..., n. As indicated, the direction of the edge is from the global node to the sentence node.
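A minimal sketch of the edge construction follows. The relation labels r_p and r_f for the past and future CSK edges are illustrative names (the text only says the two relations differ); the edge counts are implied by the construction: n(n-1) bidirectional sentence edges, 2n CSK edges, and n global edges.

```python
# Build the typed edge list (source, relation, destination) for a document
# of n sentences, following the construction described above.
def build_edges(n):
    edges = []
    for i in range(n):
        for j in range(n):
            if i != j:
                edges.append((f"s{i}", "r_s", f"s{j}"))  # sentence <-> sentence
        edges.append((f"p{i}", "r_p", f"s{i}"))          # past CSK -> sentence
        edges.append((f"f{i}", "r_f", f"s{i}"))          # future CSK -> sentence
        edges.append(("g", "r_g", f"s{i}"))              # global -> sentence
    return edges

edges = build_edges(3)
# n(n-1) + 2n + n = 6 + 6 + 3 = 15 edges for n = 3
```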

Graph Encoder: RGCN
We use a two-layer Relational Graph Convolutional Network (RGCN) (Schlichtkrull et al., 2018) as our graph encoding model. The RGCN model is able to accumulate relational evidence in multiple inference steps from the neighborhood around a given node. It is a natural choice of encoding algorithm, as it enables the modelling of different relations across our graph. In RGCN, the transformation of node embeddings is performed as:

h_i^{(1)} = ReLU( Σ_{r∈R} Σ_{j∈N_i^r} (1/c_{i,r}) W_r^{(1)} g_j + W_0^{(1)} g_i ),
h_i = Σ_{r∈R} Σ_{j∈N_i^r} (1/c_{i,r}) W_r^{(2)} h_j^{(1)} + W_0^{(2)} h_i^{(1)},

where N_i^r indicates the neighbouring nodes of v_i under relation r ∈ R; c_{i,r} is a normalization constant which can either be learned in a gradient-based learning setup or set in advance, such that c_{i,r} = |N_i^r|; and W_r^{(1/2)}, W_0^{(1/2)} are learnable parameters of the transformation. The self-dependent connection with weight W_0^{(1/2)} ensures direct information flow between the same node in consecutive layers of the graph framework. For node v_i, we start with the initial node embedding g_i (§3.1) and transform it to h_i following the two-layer RGCN transformation process.
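The per-layer RGCN transformation can be sketched in plain numpy. This is a minimal reference implementation of the standard relation-wise message passing with c_{i,r} = |N_i^r| and a self-connection, not the actual training code; shapes and initialisation are illustrative.

```python
import numpy as np

def rgcn_layer(h, edges_by_rel, W_r, W0):
    """One RGCN layer.

    h: (num_nodes, d_in) node features;
    edges_by_rel: {relation: [(src, dst), ...]} directed edges;
    W_r: {relation: (d_in, d_out)} per-relation weights;
    W0: (d_in, d_out) self-connection weight.
    """
    out = h @ W0  # self-dependent connection W_0 h_i
    for rel, edges in edges_by_rel.items():
        msg = np.zeros_like(out)
        count = np.zeros(len(h))
        for src, dst in edges:
            msg[dst] += h[src] @ W_r[rel]  # message from neighbour under rel
            count[dst] += 1
        mask = count > 0
        msg[mask] /= count[mask, None]     # normalize by c_{i,r} = |N_i^r|
        out += msg
    return np.maximum(out, 0.0)            # ReLU
```

Stacking two such layers (with per-layer weights) reproduces the two-step accumulation of relational evidence described above.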

Graph Decoder: Pairwise Edge Classifier
The final module in our graph network is built upon the principle of pairwise edge classification. This module predicts the relative order between any two sentences in D by using the initial input embeddings g and the output activations h from the RGCN encoder. For example, let us take two sentences s_i and s_j, where i < j, i.e., s_i appears earlier in D and s_j appears later. In this formulation, we first consider the bidirectional edges between s_i and s_j in E: (s_i, r_s, s_j) and (s_j, r_s, s_i). The classification objective is then to classify the first edge (s_i, r_s, s_j) as 1 and the second edge (s_j, r_s, s_i) as 0. In other words, if the originating sentence of the directed edge appears earlier than the destination sentence in the original document, then we predict the class of the edge as 1, otherwise 0.
To achieve this, we use a function f, which takes the concatenated feature vectors m_i = [g_i, h_i] and m_j = [g_j, h_j] and outputs a single scalar score. We compute f(m_i, m_j) and f(m_j, m_i) and normalize them with a softmax activation to output two probabilities p_ij, p_ji for the two edges (s_i, r_s, s_j) and (s_j, r_s, s_i). The softmax operation ensures that p_ij + p_ji = 1. During training, the probabilities are pushed towards 1 and 0 for the paired edges. During inference, for sentences s_x and s_y, if p_xy > p_yx then we predict that s_x appears earlier than s_y (s_x → s_y), or vice versa (s_x ← s_y).
Naturally, the function f has to be sensitive to the order of its inputs in this formulation; the more the two output scores differ, the more confident the prediction of the relative order. We define f as:

f(m_i, m_j) = w^T sin(m_i − m_j),

where m_i, m_j ∈ R^d and the sine operation is performed element-wise. w ∈ R^d is the learnable parameter of the function. The sine operation is the anti-symmetric component in our function, as sin(−x) = −sin(x) implies f(m_i, m_j) = −f(m_j, m_i). Other functions, such as the outer product, performed worse than the sine function in our experiments.
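An order-sensitive scoring function of this kind can be sketched as below: an element-wise sine of the feature difference dotted with a learned vector w (the example vectors are arbitrary). Because sin is odd, the score is anti-symmetric, so the softmax over the two edge directions always yields complementary probabilities.

```python
import numpy as np

def score(w, m_i, m_j):
    # Anti-symmetric score: score(w, m_i, m_j) == -score(w, m_j, m_i)
    return float(w @ np.sin(m_i - m_j))

def pair_probs(w, m_i, m_j):
    s_ij, s_ji = score(w, m_i, m_j), score(w, m_j, m_i)
    e = np.exp([s_ij, s_ji])
    p = e / e.sum()  # softmax over the two edge directions
    return p[0], p[1]

w = np.array([1.0, -0.5, 2.0])
m1, m2 = np.array([0.3, 0.1, 0.0]), np.array([0.0, 0.2, 0.4])
p12, p21 = pair_probs(w, m1, m2)  # p12 + p21 == 1 by construction
```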

Topological Sorting
The topological sorting method (Prabhumoye et al., 2020) is used to obtain the final ordered sequence of sentences from all the pairwise classifications. If the pairwise classifier predicts that p_xy > p_yx, i.e., s_x → s_y, then the sorting method ensures that s_x comes before s_y in the final ordering ô.
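The sorting step can be sketched as a standard Kahn-style topological sort over a precedence graph induced by the pairwise probabilities (a simplified stand-in for the method of Prabhumoye et al., 2020; with mutually consistent predictions it recovers a total order).

```python
from collections import defaultdict

def topo_order(n, p):
    """n sentences; p[(x, y)] for x < y is the probability that x precedes y."""
    indeg = [0] * n
    succ = defaultdict(list)
    for x in range(n):
        for y in range(x + 1, n):
            if p[(x, y)] > 0.5:          # predicted x -> y
                succ[x].append(y); indeg[y] += 1
            else:                         # predicted y -> x
                succ[y].append(x); indeg[x] += 1
    order = []
    frontier = [i for i in range(n) if indeg[i] == 0]
    while frontier:
        v = frontier.pop()
        order.append(v)
        for u in succ[v]:
            indeg[u] -= 1
            if indeg[u] == 0:
                frontier.append(u)
    return order

# Predictions imply 2 -> 0 -> 1
p = {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.1}
```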
Experimental Setup

Datasets
We benchmark the proposed STaCK framework on five different datasets. NeurIPS, AAN, NSF Abstracts. These three datasets contain abstracts from NeurIPS papers, AAN papers, and the NSF Research Award abstracts, respectively (Logeswaran et al., 2018). The number of sentences in each document varies in these datasets. SIND, ROCStory. SIND is a sequential vision-to-language (visual storytelling) dataset, and ROCStory is a dataset of commonsense stories; every document in both datasets contains exactly five sentences. For these two datasets, we use the splits of Zhu et al. (2021). For the other datasets, we use the splits following the original papers. Some statistics about the datasets are shown in Table 1.

Evaluation Metrics
Kendall's τ is an automatic metric widely used for evaluating text coherence. It measures the distance between the predicted order and the correct order of sentences in terms of the number of inversions. It is calculated as τ = 1 − 2I/C(n, 2), where I is the number of sentence pairs predicted with incorrect relative order, n is the number of sentences in the paragraph, and C(n, 2) = n(n − 1)/2 is the total number of sentence pairs. The score ranges from -1 to 1, with 1 indicating perfect prediction order. PMR (Perfect Match Ratio) measures the percentage of instances for which the entire order of the sequence is correctly predicted. It is a stricter metric, as it only gives credit to sequences that are fully correct, and none to partially correct sequences.
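Both metrics follow directly from their definitions above and can be sketched in a few lines:

```python
from itertools import combinations

def kendall_tau(pred, gold):
    # tau = 1 - 2I / C(n, 2), where I counts gold pairs whose relative
    # order is inverted in the prediction.
    pos = {s: i for i, s in enumerate(pred)}
    n = len(gold)
    inversions = sum(1 for a, b in combinations(range(n), 2)
                     if pos[gold[a]] > pos[gold[b]])
    return 1 - 2 * inversions / (n * (n - 1) / 2)

def pmr(preds, golds):
    # Fraction of documents whose full order is predicted exactly.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)
```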

Training Setup
Training is performed by optimizing the binary cross-entropy loss function for pairwise edge classification. We use the AdamW optimizer with a learning rate of 1e-6 for the parameters of the transformer models used in extracting node embeddings. For the parameters of the RGCN encoder and edge classifier, we use the Adam optimizer with a learning rate of 1e-4. We train our models for 10 epochs with a batch size of 8 documents. Test results are reported corresponding to the best validation τ.

[Table 3: Performance of models with and without commonsense knowledge. We report the accuracy of predicting the first, last, and absolute (Abs) position of sentences correctly. The longest common subsequence (LCS) ratio and the displacement within window 1 (D-Win=1) metric are also reported in percentage.]

Baselines and State-of-the-art Methods
We compare STaCK against the following methods. LSTM Pointer Net (Gong et al., 2016): a sequence generation baseline that encodes the document with an LSTM and decodes the order with a pointer network. BT-Sort (Prabhumoye et al., 2020): a pairwise model in which the input to the BERT encoder is <CLS> s1 <SEP> s2 <SEP>, and the classification is performed from the <CLS> token vector. Topological sort is then used to obtain the final prediction order from all relative orders. Constraint Graphs (Zhu et al., 2021): another pairwise model which also applies the BERT-base encoder on concatenated sentence pairs to predict the relative order of the sentences. Constraints from the relative orders are then represented as constraint graphs and integrated into the sentence representations by using Graph Isomorphism Networks (GINs). All sentence representations from the GINs are fused together to predict the final score of each sentence. The ListMLE objective function (Xia et al., 2008) is then used on these scores to predict the final order. Among the above methods, BT-Sort and Constraint Graphs are of main interest for our comparative study; Constraint Graphs is the current state-of-the-art for sentence order prediction.

Results

Table 2 shows the results across all the datasets for the baseline methods and our proposed model. STaCK achieves improved scores over the previous state-of-the-art across almost all the datasets on both evaluation metrics. Interestingly, we observe that the improvement in the τ metric is more significant in NSF and SIND, whereas for NeurIPS, AAN, and ROCStory the improvement is more prominent in the PMR metric. A modification of the model which does not use any commonsense knowledge (STaCK: w/o CSK Nodes, Edges) also surpasses previous state-of-the-art results in most cases. We expand upon the obtained results and report a number of analysis studies next.

Novelty of the Proposed Graph-based Model with CSK Nodes and Edges
State-of-the-art models BT-Sort and Constraint Graphs use the sentence-pair concatenation method to perform relative order prediction between a pair of sentences. This method is widely used in GLUE-style classification tasks. As illustrated before, this method does not consider any document-level information for the relative order prediction. We also compare our proposed graph model without any CSK nodes and edges to the state-of-the-art methods. We report the results for this model in Table 2; it surpasses previous state-of-the-art scores in all datasets except the PMR metric in NSF. BT-Sort and our proposed model use the same topological sort method to infer the final order of sentences. The significant improvement of STaCK: w/o CSK Nodes, Edges over BT-Sort can thus be directly attributed to the integration of document-level information in our graph. Furthermore, even though Constraint Graphs uses a parametric neural network model to infer the final order of the sentences (compared to our non-parametric topological sort), it records an overall poorer performance across most metrics. From the empirical results, we conclude that document-level information is indeed crucial for the task of sentence order prediction. In the future, the topological sorting employed in our work can be replaced with a more complex neural network-based sorting approach, as used in the Constraint Graphs of Zhu et al. (2021). A natural question might arise: what if we use a different transformer encoder for the state-of-the-art models? We find that this change does not improve the results of the state-of-the-art models due to a mismatch between their pretraining objective functions and the GLUE-style classification setting. Other choices of encoders, such as RoBERTa, ALBERT, or DeBERTa, perform poorly compared to BERT for both BT-Sort and Constraint Graphs (Zhu et al., 2021). These encoders are not pretrained with the next sentence prediction (NSP) objective used in BERT.
The NSP objective is similar to the concatenated sentence pair classification strategy used in BT-Sort and Constraint Graphs, enabling BERT to obtain the best possible performance.

Effect of Commonsense Knowledge
To compare the effect of commonsense knowledge, we propose another model without the CSK components. The CSK nodes and edges are discarded, and the resulting model contains only sentence nodes, the global node, sentence edges, and global edges. We call this model STaCK w/o CSK Nodes, Edges. Note that this model surpasses previously reported state-of-the-art results in most datasets. To better understand how CSK helps, we compare this model with STaCK across several metrics in Table 3. We use the following metrics for this evaluation. First, Last, Absolute Accuracy: the accuracy of correctly predicting the first sentence, the last sentence, and the absolute position of any sentence in the document. Longest Common Subsequence (LCS): the ratio of the longest common subsequence between the predicted order and the actual order (Gong et al., 2016); consecutiveness is not required. The ratio is measured as a percentage, and higher ratios are better. Displacement: the percentage of sentences for which the predicted location is within distance 1 of the original location; the displacement can occur in either direction (left or right). A higher percentage indicates less displacement. We denote this metric as Displacement-Window=1 or D-Win=1. Comparing the models with and without CSK in Table 3, we conclude the following: i) For both the CSK and w/o CSK models, predicting the correct first sentence is relatively straightforward. This is followed by correctly predicting the last sentence, and then the sentences in between. ii) Incorporation of CSK always helps, except for one particular case in NSF. iii) CSK is most helpful in NeurIPS, followed by AAN; CSK is the least helpful in NSF. iv) The improvement brought by CSK varies in degree across the evaluation metrics. In NeurIPS and AAN, the last sentence prediction accuracy and the absolute accuracy are improved the most after integrating CSK.
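Under the definitions just given, the two auxiliary metrics can be sketched as follows (a reference implementation under the stated definitions, not the paper's evaluation code):

```python
def lcs_ratio(pred, gold):
    # Longest common subsequence between predicted and gold order,
    # as a fraction of n (consecutiveness is not required).
    n, m = len(pred), len(gold)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == gold[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[n][m] / len(gold)

def d_win1(pred, gold):
    # Fraction of sentences whose predicted position is within distance 1
    # of the gold position, in either direction.
    pos = {s: i for i, s in enumerate(pred)}
    return sum(abs(pos[s] - i) <= 1 for i, s in enumerate(gold)) / len(gold)
```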

Ablation Study
Extending the commonsense-specific analysis above (§5.4), we further perform an ablation study on the CSK-specific components of our proposed model. The results are reported in Table 2. For the first ablation setting, we consider the edges with past (p_i) and future (f_i) nodes to have the same relation, i.e., r_p = r_f. The resulting performance is slightly worse in most cases, apart from the NSF dataset. The most significant drop is observed in the PMR metric of AAN, where the result is almost 2% poorer. For NSF, this ablation setting results in improved performance, suggesting that the distinction of temporal directionality (past and future) is not essential for this dataset.
The other ablation setting corresponds to the model without the CSK components, which is the same as STaCK w/o CSK Nodes, Edges in §5.4. For this setting, we observe a sharp drop in performance across most of the datasets. The decrease in performance is most significant in the PMR metric of NeurIPS and AAN. Considerable reductions in performance are also observed across various metrics in SIND and ROCStory. The ablation study with respect to CSK components, coupled with the more detailed analysis in Table 3, indicates that temporal CSK is indeed beneficial and helps the sentence order prediction task to varying degrees across different datasets and metrics.
We experimented with different sentence encoders and found that the embeddings created by DeBERTa perform the best, followed by RoBERTa and BERT. ALBERT, on the other hand, performs the worst, with around a 2% drop in τ and a 4% drop in PMR. We also experimented with removing the global node from the STaCK graph, resulting in a performance drop of around 1%-2% across the datasets.

Prediction of First and Last Sentence
The correct prediction of the first sentence and the last sentence is often given more importance due to their crucial positions in a paragraph (Kumar et al., 2020; Zhu et al., 2021). We compare STaCK against BT-Sort and Constraint Graphs in Table 4 for the task of predicting the first and last sentence correctly. Results are reported for BT-Sort and Constraint Graphs wherever available. First of all, we observe a common trend present in all three methods: the accuracy of correctly predicting the first sentence is significantly better than that of correctly predicting the last sentence. This is an interesting aspect which has been observed by previous works as well (Kumar et al., 2020; Yin et al., 2019, 2020). Next, we compare the results across the three methods and find that our proposed model is significantly better than BT-Sort in predicting both the first and last sentences accurately. The difference in performance ranges from 2.8% to 7.9% across different datasets. We also obtain improved results over Constraint Graphs in AAN and SIND, with margins between 1.7% and 5.9%.

Manifold of sentence embeddings

In Figure 3, the plot on the left shows the initial sentence embeddings (g_i) for a non-finetuned DeBERTa model, and the plot on the right shows the final node embeddings (h_i) in the trained graph model. Visually, it is evident that the initial input embeddings do not carry much order information, whereas the updated representations are grouped together much more clearly by their position. Interestingly, sentences corresponding to positions 1 (first) and 5 (last) are the most separable after the UMAP transformation, while sentences at positions 2-4 did not separate quite so cleanly. The results indicate that sentences appearing at the beginning and the end of a document are much easier to identify than the ones in the middle. The same conclusion can be drawn from the reported results in Table 3.

Manifold of temporal knowledge
We visualize the manifold of temporal commonsense embeddings in Fig. 4. It shows the UMAP transformation of 'past' and 'future' node embeddings for the test sentences in ROCStory. Interestingly, embeddings corresponding to commonsense knowledge of the first sentences are grouped together more cleanly as compared to the other sentences. This pattern further substantiates the hypothesis, drawn from Table 3, that the first sentences are the easiest to identify. In contrast, the embeddings corresponding to the other sentences are noisier and cannot be distinguished clearly.

Order Prediction in Longer Documents
We report order prediction results for documents having more than ten sentences in Table 5. The Constraint Graphs paper does not report this result, and thus we compare STaCK with the BT-Sort method. We report results only on NeurIPS, AAN, and NSF, as SIND and ROCStory have exactly five sentences in all documents. From the results, we conclude that STaCK is significantly better than BT-Sort for long documents in NeurIPS and AAN. The perfect match ratio and absolute accuracy are several percentage points higher for STaCK compared to BT-Sort. For NSF, both models perform very poorly in the PMR metric, with scores less than 1%. However, BT-Sort has superior performance compared to STaCK across the other metrics. Note that the overall result of BT-Sort was worse than that of STaCK (Table 2). The results from Table 5 and Table 2 suggest that, in NSF, BT-Sort is better for longer documents and STaCK is better for shorter documents.

Case Studies
We report several case studies in Table 6. The gold-standard order of three documents from the ROCStory and SIND datasets is shown on the left. The columns on the right depict the order predicted by our framework with and without CSK, and by the Constraint Graphs (CG) model. STaCK predicts the sentence order most accurately, whereas STaCK w/o CSK often swaps absolute positions or shifts consecutive sentences. CG predicts the first sentence correctly in all cases, but suffers from contextual discrepancies. For instance, He couldn't wait to cook in it! is predicted after The first time, he turned it on and smoke billowed out. In the third example, temporal commonsense around the event I called 911 to report the accident is aligned to the event The police soon arrived through the relation isBefore in COMET. Such commonsense knowledge helps in predicting the entire order correctly. We note that the CG predictions for this example are displaced within window 1, with I called 911 to report the accident and The police soon arrived having the wrong relative order. Such instances of the importance of document information and CSK are prevalent throughout the dataset.

Effect of COMET
We experimented with adding other temporal and causal commonsense relations from COMET, such as causes, as-a-result, desires, and requires, as nodes in the proposed graph in STaCK. However, they did not result in any significant performance improvements. We posit this could be because there is a large overlap between the generated output of isBefore, isAfter and the four relations mentioned above. Nonetheless, we think all types of CSK relations available in COMET can be used for the task. The graph structure to accommodate this additional CSK is left as future work.

Choice of Transformer Encoder
The choice of the transformer encoder plays a crucial role in the sentence order prediction task. Several transformer-based models are available, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), and DeBERTa (He et al., 2020). Different objective functions are used to pre-train these models, which directly affects how they perform on the downstream sentence ordering task. In particular, BERT is pre-trained with the masked language modelling (MLM) and next sentence prediction (NSP) objectives. The RoBERTa and DeBERTa models are pre-trained only with the MLM objective. The ALBERT model is pre-trained with the masked language modelling (MLM) and a sentence order prediction (SOP) objective.
BT-Sort and Constraint Graphs are the current state-of-the-art models which use a BERT encoder. Both models are of a pair-wise nature (§2): they first predict the relative order between each pair of sentences in the document, and the final order is then inferred from all the relative orders. The relative order prediction is performed by concatenating the sentence pairs with a <SEP> token and then passing them through the BERT encoder. This setting directly aligns with the NSP objective of BERT and is capable of achieving state-of-the-art results. However, as reported in Zhu et al. (2021), replacing the BERT encoder with a RoBERTa encoder results in poorer performance because of the absence of the NSP objective. Interestingly, using the ALBERT encoder also results in a performance drop, even though ALBERT was pre-trained with a sentence order prediction objective (albeit a rather different one). Furthermore, we also experimented with replacing the BERT encoder of BT-Sort with DeBERTa and found that the performance does not surpass the reported results with BERT. Our proposed model is also a pair-wise model. However, our graph-based encoding technique is different from the commonly used sentence-pair concatenation method. We found that for our graph model, sentence embeddings created by DeBERTa perform the best, followed by RoBERTa and BERT. Sentence embeddings produced by ALBERT perform the worst.

Conclusion
In this work, we presented STaCK, a framework that uses a Relational Graph Convolutional Network to model document-level contextual information and temporal commonsense knowledge for sentence order prediction. In the graph network, an edge classification objective was applied for pairwise relative order prediction of sentence pairs, followed by topological sorting for the final order prediction of the sentences. STaCK achieves state-of-the-art results on several benchmark datasets.