Improving Long Document Topic Segmentation Models With Enhanced Coherence Modeling

Topic segmentation is critical for obtaining structured documents and improving downstream tasks such as information retrieval. Due to their ability to automatically explore clues of topic shift from abundant labeled data, recent supervised neural models have greatly advanced long document topic segmentation, but they leave the deeper relationship between coherence and topic segmentation underexplored. This paper therefore enhances the ability of supervised models to capture coherence from both the logical structure and the semantic similarity perspectives, proposing Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL) to further improve topic segmentation performance. Specifically, the TSSP task forces the model to comprehend structural information by learning the original relations between adjacent sentences in a disarrayed document, which is constructed by jointly disrupting the original document at the topic and sentence levels. Moreover, we utilize inter- and intra-topic information to construct contrastive samples and design the CSSL objective to ensure that sentence representations in the same topic have higher similarity while those in different topics are less similar. Extensive experiments show that Longformer with our approach significantly outperforms previous state-of-the-art (SOTA) methods. On WIKI-727K, our approach improves the $F_1$ of the previous SOTA by 3.42 (73.74 → 77.16) and reduces $P_k$ by 1.11 points (15.0 → 13.89); on WikiSection, it achieves an average relative reduction of 4.3% in $P_k$. The average relative $P_k$ drop of 8.38% on two out-of-domain datasets also demonstrates the robustness of our approach.


Introduction
Topic segmentation aims to automatically segment text into non-overlapping, topically coherent parts (Hearst, 1994). Topic segmentation makes documents easier to read and understand, and also plays a key role in many downstream tasks such as information extraction (Prince and Labadié, 2007; Shtekh et al., 2018) and document summarization (Xiao and Carenini, 2019; Liu et al., 2022). Topic segmentation methods can be categorized into linear segmentation (Hearst, 1997), which yields a linear sequence of topic segments, and hierarchical segmentation (Bayomi and Lawless, 2018; Hazem et al., 2020), which produces a hierarchical structure with top-level segments divided into subsegments. We focus on linear topic segmentation in this work, especially for long documents.
Based on the definition of topics, each sentence in a topic relates to the central idea of the topic, and topics should be discriminative. Hence, two adjacent sentences from the same topic are more similar than those from different topics. Exploring this idea, prior unsupervised models mainly infer topic boundaries by computing text similarity (Riedl and Biemann, 2012b; Glavaš et al., 2016) or exploring topic representations of text (Misra et al., 2009; Du et al., 2013). Different from the carefully designed shallow features used by unsupervised methods, supervised neural models can model deeper semantic information and explore clues of topic shift from labeled data (Badjatiya et al., 2018; Koshorek et al., 2018). Supervised models have achieved large gains on topic segmentation through pre-trained language models (PLMs) (e.g., BERT) fine-tuned on large-scale supervised datasets (Kenton and Toutanova, 2019; Lukasik et al., 2020; Zhang et al., 2021; Inan et al., 2022). Recently, several works (Arnold et al., 2019; Xing et al., 2020; Somasundaran et al., 2020; Lo et al., 2021) have improved topic segmentation performance by explicitly modeling text coherence. However, these approaches either neglect context modeling beyond adjacent sentences (Wang et al., 2017), require additional label information (Arnold et al., 2019; Barrow et al., 2020; Lo et al., 2021; Inan et al., 2022), or impede learning sentence-pair coherence by not considering both coherent and incoherent pairs (Xing et al., 2020). Moreover, compared to short documents, topic segmentation is more critical for understanding long documents, and coherence modeling for long document topic segmentation is correspondingly more crucial.
Coherence plays a key role in understanding both logical structure and text semantics. Consequently, to enhance coherence modeling in supervised topic segmentation methods, we propose two auxiliary coherence-related tasks, namely Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL). We create disordered, incoherent documents; the TSSP task then uses these documents to enhance learning of sentence-pair structure information. The CSSL task regularizes sentence representations so that sentences in the same topic have higher semantic similarity while sentences in different topics are less similar. Experimental results demonstrate that both TSSP and CSSL improve topic segmentation performance and that their combination achieves further gains. Moreover, performance gains on out-of-domain data from the proposed approaches demonstrate that they also significantly improve the generalizability of the model.
Large Language Models such as ChatGPT have achieved impressive performance on a wide variety of NLP tasks. We adopt the prompts proposed by Fan and Jiang (2023) and evaluate ChatGPT on the WIKI-50 dataset (Koshorek et al., 2018). We find that ChatGPT performs considerably worse than fine-tuned BERT-sized PLMs on long document topic segmentation (as shown in Appendix A).
Our contributions can be summarized as follows.
• We investigate supervised topic segmentation on long documents and confirm the necessity of exploiting longer context information.
• We propose two novel auxiliary tasks, TSSP and CSSL, for coherence modeling from the perspectives of both logical structure and semantic similarity, thereby improving the performance of topic segmentation.
• Our proposed approaches set new state-of-the-art (SOTA) performance on topic segmentation benchmarks, including long documents. Ablation studies show that both new tasks effectively improve topic segmentation performance and also improve the generalizability of the model.

Topic Segmentation Models
Both unsupervised and supervised approaches have been proposed to solve topic segmentation. Unsupervised methods typically design features based on the assumption that segments in the same topic are more coherent than those belonging to different topics, such as lexical cohesion (Hearst, 1997; Choi, 2000; Riedl and Biemann, 2012b), topic models (Misra et al., 2009; Riedl and Biemann, 2012a; Jameel and Lam, 2013; Du et al., 2013) and semantic embeddings (Glavaš et al., 2016; Solbiati et al., 2021; Xing and Carenini, 2021). In contrast, supervised models can achieve more precise predictions by automatically mining clues of topic shift from large amounts of labeled data, either by classification over pairs of sentences or chunks (Wang et al., 2017; Lukasik et al., 2020) or by sequence labeling over the whole input sequence (Koshorek et al., 2018; Badjatiya et al., 2018; Xing et al., 2020; Zhang et al., 2021). However, the memory consumption and efficiency of neural models such as BERT (Kenton and Toutanova, 2019) can be limiting factors for modeling long documents as their length increases. Some approaches (Arnold et al., 2019; Lukasik et al., 2020; Lo et al., 2021; Somasundaran et al., 2020) use hierarchical modeling from tokens to sentences, while others (Somasundaran et al., 2020; Zhang et al., 2021) use sliding windows to reduce resource consumption. However, both lines of methods may fail to capture the full context of long documents, which is critical for accurate topic segmentation.

Coherence Modeling
The NLP community has developed models for comprehending text coherence and tasks to measure their effectiveness, such as predicting the coherence score of documents (Barzilay and Lapata, 2008), predicting the position where a removed sentence was originally located (Elsner and Charniak, 2011), and restoring out-of-order sentences (Logeswaran et al., 2018; Chowdhury et al., 2021). Some researchers have aimed to improve topic segmentation models by explicitly modeling text coherence. However, all prior works consider coherence modeling for topic segmentation only from a single perspective. For example, Wang et al. (2017) ranked sentence pairs based on their semantic coherence to segment documents within the Learning-to-Rank framework, but they did not consider contextual information beyond two sentences. CATS (Somasundaran et al., 2020) created corrupted text by randomly shuffling or replacing sentences to force the model to produce a higher coherence score for the correct document than for its corrupt counterpart. However, the fluency of the constructed documents is so low that the semantic information is largely lost. Xing et al. (2020) proposed adding the Consecutive Sentence-pair Coherence (CSC) task, which computes cosine similarity as a coherence score. But CSC considers no incoherent sentence pairs beyond those located at segment boundaries. Other methods (Arnold et al., 2019; Barrow et al., 2020; Lo et al., 2021; Inan et al., 2022) have used topic labels to constrain sentence representations within the same topic, but they require additional topic label information. In contrast to these works, ours is the first to consider topical coherence as both text semantic similarity and the logical structure (flow) of sentences.

Methodology
In this section, we first describe our baseline model for topic segmentation (Section 3.1), then introduce our proposed Topic-aware Sentence Structure Prediction (TSSP) module (Section 3.2) and Contrastive Semantic Similarity Learning (CSSL) module (Section 3.3). Figure 1 illustrates the overall architecture of our topic segmentation model.

Baseline Model for Topic Segmentation
Our supervised baseline model formulates topic segmentation as a sentence-level sequence labeling task (Zhang et al., 2021). Given a document represented as a sequence of sentences [s_1, s_2, s_3, ..., s_n] (where n is the number of sentences), the model predicts binary labels [y_1, y_2, ..., y_{n−1}] for every sentence except the last, where y_i = 1, i ∈ {1, ..., n−1}, means s_i is the last sentence of a topic and 0 means it is not. Following prior works (Somasundaran et al., 2020; Zhang et al., 2021), we prepend a special token BOS before each sentence; the updated sentence is shown in Eq. 1, where t_{i,1} is BOS and |s_i| is the number of tokens in s_i. The token sequence for the document is embedded through the embedding layer and then fed into the encoder to obtain its contextual representations. We take the representation of each BOS, h_i, as the sentence representation, as shown in Eq. 4. Then we apply a softmax binary classifier g, as in Eq. 3, on top of h_i to compute the topic segmentation probability p of each sentence. We use the standard binary cross-entropy loss function, as in Eq. 2, to train the model.

Topic-aware Sentence Structure Prediction

Prior inter-sentence objectives include the Binary Sentence Ordering (BSO) task, which samples pairs of adjacent sentences from the same document but 50% of the time their order is reversed. The ternary Sentence Structural Objective (SSO) in StructBERT (Wang et al., 2019) further increases the task difficulty of BSO by adding a class of sentence pairs from different documents. None of these tasks for learning inter-sentence coherence explore topic structures. Different from them, we propose a Topic-aware Sentence Structure Prediction (TSSP) task to help the model learn sentence representations with structural information that exploits topic structures and is hence more suitable for topic segmentation.
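Before turning to the augmentation details, the baseline's input construction (Eq. 1) can be illustrated with a minimal sketch. This is not the authors' code: the BOS token string and the whitespace "tokenizer" are placeholders standing in for the real PLM vocabulary and tokenizer.

```python
# Sketch of the baseline input construction: a BOS token is prepended to each
# sentence, and the encoder representation at each BOS position is later taken
# as that sentence's representation h_i. Tokenization is faked with a
# whitespace split for illustration only.
def build_input(sentences, bos="<bos>"):
    tokens, bos_positions = [], []
    for sent in sentences:
        bos_positions.append(len(tokens))  # index where this sentence's BOS lands
        tokens.append(bos)
        tokens.extend(sent.split())
    return tokens, bos_positions

doc = ["the cat sat", "it slept", "a new topic starts"]
tokens, positions = build_input(doc)
# positions picks out the h_i slots; a binary classifier over those
# representations yields the segmentation probabilities of Eq. 3.
```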

Data Augmentation
We tailor data augmentation techniques for topic segmentation. As depicted in the right half of Figure 1, we create an augmented document d′ from the original document d and feed d′ into the shared encoder after the embedding layer to enhance inter-sentence coherence modeling. Different from the auxiliary coherence modeling approach of Somasundaran et al. (2020), which merely forces the model to predict a lower coherence score for the corrupted document than for the original one, we simultaneously perturb d at both the topic and sentence levels, constructing the augmented document so as to force the model to learn topic-aware inter-sentence structure information. Hence, our task is more challenging and the learned sentence representations are more suitable for topic segmentation. Figure 2 illustrates the process of constructing an augmented document. We first shuffle topics within the document, then randomly replace some topics with topics from other documents to increase diversity. Specifically, for a randomly selected subset of p_1 percent of the documents, we replace each topic with a topic snippet from another document with probability p_2 and keep it unchanged with probability 1 − p_2. The default values of p_1 and p_2 are both 0.5. Finally, we shuffle the sentences in each topic to further increase the difficulty of the TSSP task.
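The perturbation steps above can be sketched as follows. This is our reading of the procedure, not the authors' implementation: a document is assumed to be a list of topics, a topic a list of sentences, and `other_topics` a pool drawn from other documents; the p_1 document-level selection is assumed to happen outside this function.

```python
import random

# Sketch of the document-perturbation procedure for TSSP (our reading of the
# description above). p2 is the per-topic replacement probability; replacement
# topics come from a pool taken from other documents.
def augment(doc_topics, other_topics, p2=0.5, rng=random):
    topics = list(doc_topics)
    rng.shuffle(topics)                         # 1) shuffle topics within the document
    out = []
    for topic in topics:
        if other_topics and rng.random() < p2:  # 2) replace a topic with prob. p2
            topic = rng.choice(other_topics)
        sents = list(topic)
        rng.shuffle(sents)                      # 3) shuffle sentences inside each topic
        out.append(sents)
    return out
```

Passing an explicit `random.Random(seed)` as `rng` keeps the augmentation reproducible across training runs.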
Sentence-pair Relations After constructing the augmented document d′, we define the TSSP task as an auxiliary objective that helps the model capture inter-sentence coherence by learning the original structural relation between each adjacent sentence pair a and b in the incoherent document d′. We define three types of sentence-pair relations. The first type (label 0) is when a and b belong to different topics, indicating a topic shift. The second type (label 1) is when a and b are in the same topic and b is the next sentence after a. The third type (label 2) is when a and b belong to the same topic but b is not the next sentence after a. For example, the sequence of sentences in Figure 2 is assigned the TSSP labels [1, 0, 2, 2, 0, 1, 2, 2]. We use ỹ = [ỹ_0, ỹ_1, ỹ_2] to represent the one-hot encoding of the TSSP labels, where ỹ_j = 1, j ∈ {0, 1, 2}, if the sentence pair belongs to the j-th category, and ỹ_j = 0 otherwise. For the TSSP task, we use the cross-entropy loss function defined in Eq. 5, where ỹ_{i,j} denotes the label of the i-th sentence pair and p_{i,j} denotes the probability of the i-th sentence pair belonging to the j-th category.
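The three-way labeling rule can be made concrete with a small sketch. We assume (this is an illustration, not the paper's code) that each sentence in the augmented document is tagged with its topic id and its position in the original topic.

```python
# Sketch of TSSP label assignment. Each sentence is a (topic_id, original
# position) pair; labels are computed for every adjacent sentence pair.
def tssp_labels(sents):
    labels = []
    for (ta, pa), (tb, pb) in zip(sents, sents[1:]):
        if ta != tb:
            labels.append(0)      # different topics: topic shift
        elif pb == pa + 1:
            labels.append(1)      # same topic, b directly follows a
        else:
            labels.append(2)      # same topic, original order disrupted
    return labels

# hypothetical augmented document with two topics, sentences of "B" shuffled
doc = [("A", 0), ("A", 1), ("B", 2), ("B", 0), ("B", 1)]
print(tssp_labels(doc))  # → [1, 0, 2, 1]
```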

Contrastive Semantic Similarity Learning
We assume that two sentences or segments from the same topic are inherently more coherent than those from different topics. Recently, Gao et al. (2023) proposed a contrastive learning method for unsupervised topic segmentation. However, in their unsupervised method, similar and dissimilar sample pairs can be noisy due to the lack of ground-truth topic labels, which does not occur in our supervised setting.
Loss Function As illustrated in Figure 1, we utilize the following loss function to train our model and learn contrastive semantic representations of inter-topic and intra-topic sentences. k_1 and k_2 are hyperparameters that determine the number of sentences used to form positive and negative pairs, respectively. For each sentence representation h_i, h^+_{i,j} denotes the j-th similar sentence in the same topic as sentence i, while h^−_{i,j} denotes the j-th dissimilar sentence in a different topic from sentence i. We select sentences to form sentence pairs based on their distance to the anchor sentence, from closest to farthest. The objective of our loss function is to bring semantically similar neighbors closer and push negative sentence pairs apart, as in Eq. 6. τ in Eq. 8 is a temperature hyper-parameter that scales the cosine similarity of two vectors, with a default value of 0.1. In future work, to avoid pushing apart sentence pairs that are in different topics but cover similar topical semantics, we plan to refine the loss, for example by assigning loss weights based on semantic similarity.
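Since Eq. 6 is not reproduced in this excerpt, the following is a minimal sketch of one standard InfoNCE-style instantiation of such an objective for a single anchor sentence, under the assumption that the loss contrasts cosine similarities of positives and negatives scaled by the temperature τ; the exact form in the paper may differ.

```python
import math

def cos(u, v):
    # cosine similarity of two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Sketch of a CSSL-style contrastive loss for one anchor representation h:
# positives are same-topic sentences, negatives come from other topics,
# tau scales the cosine similarities (default 0.1 as in the text).
def cssl_loss(h, positives, negatives, tau=0.1):
    pos_terms = [math.exp(cos(h, hp) / tau) for hp in positives]
    neg_sum = sum(math.exp(cos(h, hn) / tau) for hn in negatives)
    # average the per-positive log-ratios; lower loss means positives are
    # closer to the anchor than negatives are
    return -sum(math.log(p / (p + neg_sum)) for p in pos_terms) / len(pos_terms)
```

With an aligned positive and an opposed negative the loss is near zero; swapping their roles makes it large, which is the gradient signal that pulls same-topic representations together.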
Combining Eq. 2, 5 and 6, we form the final loss function of our topic segmentation model as Eq. 9, where α_1 and α_2 are hyper-parameters used to adjust the loss weights.

Datasets We conduct experiments on WIKI-727K (Koshorek et al., 2018) and English WikiSection (Arnold et al., 2019), which are widely used benchmarks for evaluating the text segmentation performance of models. WIKI-727K is a large corpus with segmentation annotations, created by leveraging the manual structures of about 727K Wikipedia pages and automatically organizing them into sections. WikiSection consists of 38K English and German Wikipedia articles from the disease and city domains, with a topic label for each section of the text. We use en_city and en_disease throughout the paper to denote the English subsets of the city and disease domains, and use WikiSection to denote the collection of these two subsets. Each section is divided into sentences using the PUNKT tokenizer of the NLTK library. Additionally, we utilize the newline information in WikiSection and only predict whether sentences followed by line breaks are topic boundaries.

Evaluation Metrics We use positive F_1, P_k (Beeferman et al., 1999), and WindowDiff (WD) (Pevzner and Hearst, 2002); we use https://segeval.readthedocs.io/ to compute P_k and WD. To simplify notation, we use F_1 and WD throughout the paper to denote positive F_1 and WindowDiff. F_1 is calculated from the precision and recall of correctly predicted topic segmentation boundaries. The P_k metric is introduced to address some limitations of positive F_1, such as the inherent trade-off between precision and recall as well as its insensitivity to near-misses. WD is proposed by Pevzner and Hearst (2002) as a supplement to P_k that avoids sensitivity to variations in segment size distribution and over-penalizing near-misses. By default, the window size for both P_k and WD is equal to half the average length of the actual segments. Lower P_k and WD scores indicate better performance.

Baseline Models Although Transformer (Vaswani et al., 2017) has become the SOTA architecture for
sequence modeling on a wide variety of NLP tasks and transformer-based PLMs such as BERT (Devlin et al., 2019) have become dominant in NLP, the core self-attention mechanism has time and memory complexity quadratic in the input sequence length (Vaswani et al., 2017), limiting the maximum sequence length during pre-training (e.g., 512 for BERT) to balance performance and memory usage. As shown in Table 1, the average number of tokens per document in each dataset exceeds 512, hence these datasets contain long documents. We tailor the backbone model selection for long documents. Prior models using BERT-like PLMs for topic segmentation either truncate long documents to the maximum sequence length or use a sliding window. Both approaches may degrade performance due to the loss of contextual information. Consequently, we first evaluate BERT-Base (Devlin et al., 2019) and several competitive efficient transformers on the WikiSection dataset, including BigBird-Base (Zaheer et al., 2020) and Longformer-Base (Beltagy et al., 2020). As shown in Table 2, Longformer-Base achieves 82.19 and 72.29 F_1, greatly outperforming BERT-Base (78.99 and 67.34 F_1) by (+3.2, +4.95) F_1 as well as BigBird-Base (80.49 and 70.61 F_1). Hence we select Longformer-Base as the encoder for the main experiments. To compare with our coherence-related auxiliary tasks, we evaluate Longformer-Base on WikiSection with the prior auxiliary CATS or CSC task described in Section 2.2. In addition, following Inan et al.
(2022), we evaluate the pre-trained setting in which we first pre-train Longformer on WIKI-727K and then fine-tune it on WikiSection. Under the domain transfer setting, we cite the results reported in (Xing et al., 2020). Note that all the baselines we include for comparison are well established and exhibit top performance on these benchmarks. Implementation Details To investigate the efficacy of longer context for topic segmentation, we conduct additional evaluations on WikiSection using maximum sequence lengths of 512, 1024, and 4096, alongside the default of 2048. For documents longer than the maximum sequence length, we use a sliding window that takes the last sentence of the previous sample as the start sentence of the next sample. We run the baseline Longformer-Base and our model (i.e., Longformer-Base+TSSP+CSSL) three times with different random seeds and report the means and standard deviations of the metrics. Details of hyperparameters are given in Appendix B.
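The P_k and WindowDiff metrics described above can be sketched directly from their standard definitions. The paper computes them with the segeval library; the following illustration assumes segmentations are given as binary lists where 1 marks a segment boundary after a sentence.

```python
def seg_ids(boundaries):
    # map a binary boundary list to per-sentence segment ids
    ids, cur = [], 0
    for b in boundaries:
        ids.append(cur)
        if b:
            cur += 1
    return ids

def pk(ref, hyp, k=None):
    # P_k: slide a window of size k; count positions where ref and hyp
    # disagree on whether the two window ends fall in the same segment.
    n = len(ref)
    if k is None:
        k = max(1, round(n / (sum(ref) + 1) / 2))  # half the avg. ref segment length
    r, h = seg_ids(ref), seg_ids(hyp)
    errs = sum((r[i] == r[i + k]) != (h[i] == h[i + k]) for i in range(n - k))
    return errs / (n - k)

def windowdiff(ref, hyp, k=None):
    # WindowDiff: count windows where the number of boundaries differs.
    n = len(ref)
    if k is None:
        k = max(1, round(n / (sum(ref) + 1) / 2))
    errs = sum(sum(ref[i:i + k]) != sum(hyp[i:i + k]) for i in range(n - k))
    return errs / (n - k)
```

Identical segmentations score 0 on both metrics; a hypothesis that misses all boundaries scores strictly worse, matching the "lower is better" convention stated above.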

Main Results
Intra-domain Performance Table 2 and Table 3 show the performance of baselines and of our approaches on the WikiSection and WIKI-727K test sets, respectively. The results of Longformer-Base and LongformerSim in Table 2 show that using cosine similarity alone is insufficient to predict topic segmentation boundaries.

Out-of-domain Performance While Longformer-Base does not perform best on WIKI-50, incorporating our method achieves a +5.84 F_1 gain and a 3.06-point gain on P_k, setting a new SOTA on WIKI-50 and Elements for both unsupervised and supervised methods. Overall, the results demonstrate that our proposed method not only greatly improves the performance of a model under the intra-domain setting, but also remarkably improves the generalizability of the model.

Inference Speed Our proposed TSSP and CSSL add no computational cost at inference and do not change the inference speed. We randomly sample 1K documents from the WIKI-727K test set and measure the inference speed on a single Tesla V100 GPU with batch_size = 1. On average, BERT-Base with maximum sequence length 512 processes 19.5K tokens/sec while Longformer-Base with maximum sequence length 2048 processes 15.9K tokens/sec. This observation is consistent with the findings in (Tay et al., 2020) that Longformer does not show a speed advantage over BERT until the input length exceeds 3K tokens.

Analysis
Effect of Context Size To study the effect of the context size, we evaluate Longformer with maximum sequence lengths of 512, 1024, 2048 and 4096 on WikiSection. We evaluate the effectiveness of our proposed methods at each corresponding maximum sequence length to investigate whether they remain effective with different context sizes. As can be seen from Figure 3, topic segmentation F_1 gradually improves as the context length increases. The effect of increasing the input length from 512 to 1024 is the largest, with F_1 on en_city and en_disease improved by +2.54 and +4.41, respectively. Considering that the average document length of WikiSection is 1321, we infer that capturing more context information is more beneficial for topic segmentation on long documents. We also observe that, compared to Longformer-Base, Longformer+TSSP+CSSL yields consistent improvements across different input lengths on both the en_city and en_disease test sets. These results suggest that our methods are effective at enhancing topic segmentation across various context sizes and can be applied to a wide range of datasets.

Ablation Study of TSSP and CSSL Figure 4 shows the ablation study of TSSP and CSSL on the WikiSection dev set. Figure 4(a) demonstrates the effectiveness of the three classification tasks in the TSSP task (Section 3.2). Compared to SSO and CATS, TSSP helps the model learn better inter-sentence relations, and both intra- and inter-topic structure labels are needed to improve performance. Figure 4(b) illustrates the impact of varying numbers of negative sample pairs for CSSL on Longformer. We find that adding similarity-related auxiliary tasks improves performance. Compared with CSC, CSSL focuses on sentences of the same and different topics when learning sentence representations. As the number of negative samples increases, the model performance improves and peaks at k_2 = 3. The gain from TSSP is slightly larger than that from CSSL, indicating that comprehending structural
information contributes more to coherence modeling. We speculate that encoding the entire topic segment into a semantic space to learn contrastive representations may help detect topic boundaries, which we plan to explore in future work.

Similarity of Sentence-pair Representations
To investigate the impact of the coherence-related auxiliary tasks on sentence representation learning, we calculate the cosine similarity of adjacent sentence representations for predicting topic boundaries. We compute F_1 of the baselines and of our Longformer (i.e., Longformer-Base+TSSP+CSSL) on the en_city and en_disease dev sets. As shown in Figure 5, compared to Longformer-Base, our model achieves higher F_1, indicating that sentence representations learned with our methods are more relevant to topic segmentation and better distinguish sentences from different topics. We also explore combining probability and similarity to predict topic boundaries in Appendix D but find no further gain. This suggests that the model trained with TSSP+CSSL covers more features than similarity alone.
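The similarity-based probe used here (and behind the Figure 5 thresholds) can be sketched as follows; this is an illustration of the idea, not the evaluation code, and the vectors are hypothetical stand-ins for encoder sentence representations.

```python
import math

def cosine(u, v):
    # cosine similarity of two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Predict a topic boundary after sentence i when the cosine similarity of the
# representations of sentences i and i+1 falls below a threshold.
def boundaries_by_similarity(reps, threshold=0.5):
    return [1 if cosine(reps[i], reps[i + 1]) < threshold else 0
            for i in range(len(reps) - 1)]

reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy sentence representations
print(boundaries_by_similarity(reps))  # → [0, 1]: one boundary, before the last sentence
```

Sweeping the threshold and reporting F_1 at each value reproduces the kind of curve shown in Figure 5.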

Conclusion
Comprehending text coherence is crucial for topic segmentation, especially on long documents. We propose the Topic-aware Sentence Structure Prediction and Contrastive Semantic Similarity Learning auxiliary tasks to enhance coherence modeling. Experimental results show that Longformer trained with our methods significantly outperforms SOTA methods on two English long document benchmarks. Our methods also significantly improve the generalizability of the model. Future work includes extending our approach to spoken document topic segmentation and to other segmentation tasks at various levels of granularity and with modalities beyond text.

Limitations
Although our approach achieves SOTA results on long document topic segmentation, further research is required on how to model even longer context more efficiently. In addition, our method needs to construct augmented data for the TSSP task, which roughly doubles the training time.

A ChatGPT for Topic Segmentation on Long Documents
To investigate the performance of ChatGPT on long document topic segmentation, we adopt the prompts proposed by Fan and Jiang (2023) and evaluate ChatGPT on the WIKI-50 test set. The prompts are shown in Table 5. We set the temperature to 0 to ensure the consistency of ChatGPT's outputs. The post-processing strategy for ChatGPT remains unchanged to obtain formatted output. Different from Fan and Jiang (2023), we change the keyword dialogue to document in the prompt and also try a one-shot prompt to see if it improves performance.
Table 6 shows the results of ChatGPT with different prompts and of Longformer with supervised data. First, the results show that the generative prompt achieves higher performance than the discriminative prompt, which is consistent with the conclusion of Fan and Jiang (2023) that representing structure directly can better leverage the generation ability of ChatGPT. Additionally, incorporating a single example in the generative prompt further improves F_1 by 2 points, which indicates that one-shot prompting can better stimulate the in-context learning ability of Large Language Models. However, while ChatGPT-GP_one achieves an F_1 that is 4.6 points higher than Longformer_ood, its P_k and WD metrics are significantly worse due to the high false recall rate of topic boundaries. This suggests that how to fully apply the ability of Large Language Models to topic segmentation in long documents remains to be explored. Finally, compared with ChatGPT, the significant improvement of Longformer_iid shows the key role supervised data plays in the topic segmentation task.

B Training Details
Our experiments are implemented with the transformers package. The model parameters are initialized with the corresponding pre-trained parameters. The initial learning rate is 5e-5 and the dropout probability is 0.1. AdamW (Loshchilov and Hutter, 2017) is used for optimization. The batch sizes for WIKI-727K and WikiSection are 4 and 8, and the numbers of epochs are 3 and 5, respectively. We set the number of positive pairs in CSSL to k_1 = 1 and carry out a grid search over the loss weights α_1, α_2 ∈ [0.5, 1.0] and k_2 ∈ [1, 2, 3, 4] on the dev set. The final configuration on both benchmarks is k_2 = 3, α_1 = 0.5. α_2 performs best on WikiSection when set to 1.0, while on WIKI-727K α_2 = 0.5 performs best.
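The grid search described above can be enumerated directly; assuming the loss-weight grid is the two endpoints {0.5, 1.0} (an assumption, since the text writes the range as [0.5, 1.0]), the search space is small:

```python
from itertools import product

# Sketch of the hyper-parameter grid from the training details: each
# (alpha1, alpha2, k2) configuration would be trained and scored on the
# dev set, and the best one kept.
alphas = [0.5, 1.0]
k2_values = [1, 2, 3, 4]
grid = list(product(alphas, alphas, k2_values))  # (alpha1, alpha2, k2) triples

# 2 * 2 * 4 = 16 configurations; the reported best on WikiSection is
# alpha1 = 0.5, alpha2 = 1.0, k2 = 3.
print(len(grid))  # → 16
```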

Type
Prompts for Document Topic Segmentation

Discriminative
The following is a document. Give each utterance a binary label, where 1 indicates that the utterance starts a new topic. Please output the result of the sequence annotation as a Python list.
0: s_1
1: s_2
...
n−1: s_n

Generative
Please identify several topic boundaries for the following document, where each topic consists of several consecutive utterances. Please output in the form of {topic i: [], ..., topic j: []}, where the elements in each list are the indices of the consecutive utterances within the topic; output even if there is only one topic.
0: s_1
1: s_2
...
n−1: s_n

The results from the baseline Longformer and Longformer with our approach (+TSSP+CSSL) are shown in Table 7. Our approach significantly improves the baseline on both short and long documents. Notably, the gains from our approach are larger on long documents, suggesting that our coherence modeling benefits long document topic segmentation even more. This is consistent with our hypothesis.

D Ensemble Probability and Similarity
As shown in Formula 10, we combine the cosine similarity of neighboring sentence representations (Sim) and the model probability (Prob) to obtain the final score for inferring topic boundaries. Specifically, we get

Figure 2 :
Figure 2: The process of constructing the augmented document (the bottom line) from the original document (the top line). s_i denotes the i-th sentence of the document. Sentences with the same color are in the same topic. Sentences in light purple are topics from another document.

Figure 4 :
Figure 4: Ablation study of Longformer with our method on the WikiSection dev set with (a) TSSP and (b) CSSL separately. CATS and SSO in Figure (a) are previously proposed auxiliary tasks, where CATS denotes Coherence-Aware Text Segmentation (Section 2.2) and SSO denotes the Sentence Structural Objective (Section 3.2). In Figure (a), TSSP w/o inter-topic means without category 0 and w/o intra-topic means without categories 1 and 2 in Section 3.2. In Figure (b), 0 negative pairs represents fine-tuning with just the CSC task (Section 2.2).

Figure 5 :
Figure 5: F_1 from using the cosine similarity of two adjacent sentence representations to predict topic boundaries, with different thresholds, on the WikiSection dev set.
s_i = [t_{i,1}, t_{i,2}, ..., t_{i,|s_i|+1}]   (1)

Figure 1 :
Figure 1: The overall architecture of our model. s_i is the i-th sentence in document d. d′ is the augmented data we construct corresponding to document d (Section 3.2). TS denotes Topic Segmentation, TSSP denotes Topic-aware Sentence Structure Prediction (Section 3.2) and CSSL denotes Contrastive Semantic Similarity Learning (Section 3.3). L_ts, L_tssp and L_cssl denote the losses described in Section 3.
Taking each sentence in the document as the anchor sentence, we choose k_1 sentences in the same topic to constitute positive pairs and k_2 sentences from different topics to constitute negative pairs, based on their distance from the anchor sentence, from nearest to farthest.

Table 1 :
Statistics of the intra-domain and out-of-domain datasets. #X denotes the average number of X per document.

Table 2 :
Performance of baselines and w/ our methods on the en_city and en_disease test sets of WikiSection. LongformerSim denotes Longformer-Base that uses the cosine similarity of neighboring sentences as the predictor. Pretrained Longformer-Base denotes further pre-training Longformer-Base on the WIKI-727K training set and then fine-tuning on the WikiSection training set. † denotes training Longformer-Base with the corresponding auxiliary task described in Section 2.2. The maximum sequence length for BigBird-Base and Longformer-Base is 2048. x and y in x_y denote the mean and standard deviation from three runs with different random seeds. * indicates the gains from +TSSP+CSSL over Longformer-Base are statistically significant with p < 0.05.

Table 3 :
Performance of baselines and w/ our methods on the WIKI-727K test set. † denotes our reproduced results. x and y in x_y denote the mean and standard deviation from three runs with different random seeds. * indicates the gains from +TSSP+CSSL over Longformer-Base are statistically significant with p < 0.05.
on the out-of-domain WIKI-50 and Elements test sets. Longformer-Base already achieves a 5.51-point reduction in P_k on Elements over the prior best performance from supervised models, and our approach further improves P_k by 2.83 points.

Table 4 :
Performance of baselines and w/ our methods under the domain transfer setting. BERT and Longformer are Base size. The training set of WikiSection is used for fine-tuning. † denotes fine-tuning Longformer with the corresponding auxiliary task described in Section 2.2. x and y in x_y denote the mean and standard deviation from three runs with different random seeds. * indicates the gains from +TSSP+CSSL over Longformer-Base are statistically significant with p < 0.05.

Table 6 :
Comparison of ChatGPT and Longformer on the WIKI-50 test set. Longformer_ood denotes fine-tuning Longformer on WikiSection and Longformer_iid denotes fine-tuning Longformer on WIKI-727K. DP and GP are short for discriminative prompt and generative prompt, respectively. zero and one denote the zero-shot and one-shot prompting settings.

C Performance of the Proposed Approach on Short and Long Documents

It is important to note that our proposed TSSP and CSSL approaches are agnostic to document length and are applicable to models and datasets for short documents. In order to evaluate the performance of our proposed approach on various document lengths, we partition the WIKI-727K test set into a short document subset (18310 samples) and a long document subset (54922 samples) according to whether the number of tokens in a document is less than 512 or not.

Table 7 :
The performance of Longformer-Base and our approach on the short and long document subsets of the WIKI-727K test set. * indicates the gains from +TSSP+CSSL over Longformer-Base are statistically significant with p < 0.05.

Table 8 :
The results of combining the probability and cosine similarity to predict topic boundaries on en_city and en_disease. Prob Only denotes using only the probability. Prob and Sim denotes computing the score as in Formula 10.