Discourse-Based Sentence Splitting

Sentence splitting involves the segmentation of a sentence into two or more shorter sentences. It is a key component of sentence simplification, has been shown to help human comprehension, and is a useful preprocessing step for NLP tasks such as summarisation and relation extraction. While several methods and datasets have been proposed for developing sentence splitting models, little attention has been paid to how sentence splitting interacts with discourse structure. In this work, we focus on cases where the input text contains a discourse connective, which we refer to as discourse-based sentence splitting. We create synthetic and organic datasets for discourse-based splitting and explore different ways of combining these datasets using different model architectures. We show that pipeline models which use discourse structure to mediate sentence splitting outperform end-to-end models in learning the various ways of expressing a discourse relation but generate text that is less grammatical; that large-scale synthetic data provides a better basis for learning than smaller-scale organic data; and that training on discourse-focused, rather than general, sentence splitting data provides a better basis for discourse splitting.


Introduction
Sentence splitting segments a sentence into two or more shorter sentences. It is a key component of sentence simplification. It has also been shown to help human comprehension (Mason, 1978; Williams et al., 2003) and to be a useful preprocessing step for several NLP tasks, such as relation extraction (Niklaus et al., 2016) and machine translation (Chandrasekar et al., 1996; Mishra et al., 2014; Li and Nenkova, 2015).
So far, however, little attention has been paid to how sentence splitting interacts with discourse structure. As illustrated in Table 1, two main types of splitting can be distinguished depending on whether the split is licensed by a syntactic construct or by a discourse connective. Whereas syntax-based splitting is licensed by syntactic constructs such as relative clauses, VP or sentence coordinations, gerund or appositive constructions, discourse-based splitting is licensed by the presence of a discourse relation between two discourse units.
Importantly, in the case of discourse-based splitting, the discourse relation which holds in the input must be preserved in the split output. This is illustrated in Table 1, where the temporal relation marked by and after this in the input (C1) is made explicit in the split output (S1) by the adverbial Afterwards. In contrast, omitting this adverbial (S3) results in a semantic loss and makes the output more difficult to understand. As shown by the (S2) variant, a split can also use a discourse adverbial with an inverse meaning (Before this) which induces a corresponding inversion in the linear order of the text.
In this paper, we focus on discourse-based sentence splitting and make the following contributions:

1. We create synthetic and organic training data for discourse splitting and investigate various ways of leveraging this data for training discourse-based sentence splitting models.

2. We compare a discourse-agnostic, end-to-end approach with a pipeline model that uses discourse structure to mediate sentence splitting.

Table 1:
C1. The Masovians were caught by surprise, since virtually without any defense the capital, Płock, fell and after this Mindaugas crossed the Vistula river and captured the fortress of Jazdów.
S1. The Masovians were caught by surprise, since virtually without any defense the capital, Płock, fell. Afterwards, Mindaugas crossed the Vistula river and captured the fortress of Jazdów.
S2. Mindaugas crossed the Vistula river and captured the fortress of Jazdów. Before this, the Masovians were caught by surprise, since virtually without any defense the capital, Płock, fell.
S3. Mindaugas crossed the Vistula river and captured the fortress of Jazdów. The Masovians were caught by surprise, since virtually without any defense the capital, Płock, fell.
T. <DR> TEMPORAL:ASYNCHRONOUS <ARG1> The Masovians were caught by surprise, since virtually without any defense the capital, Płock, fell <ARG2> Mindaugas crossed the Vistula river and captured the fortress of Jazdów <EOS>
C2. He settled in London, devoting himself chiefly to practical teaching.
S4. He settled in London. He devoted himself chiefly to practical teaching.
C3. It was a time to go back to nature, and the plastic flamingo quickly became the prototype of bad taste and anti-nature.
S5. It was a time to go back to nature. The plastic flamingo quickly became the prototype of bad taste and anti-nature.

3. We show that training on discourse-focused rather than general sentence splitting data helps to improve performance.

4. To help spur research on discourse-based sentence splitting, we make our dataset and code publicly available.

Related Work
Together with deletion, reordering and substitution, sentence splitting is one of the main operations used in text simplification.
Early work on simplification used a rule-based approach to splitting (Siddharthan, 2006; Siddharthan and Mandya, 2014). For instance, Siddharthan (2006) defines 26 handcrafted rules for simplifying apposition and/or relative clauses in dependency structures and 85 rules to handle subordination and coordination.
Further work focused on learning statistical simplification models from parallel datasets of complex-simplified sentences derived from English Wikipedia and Simple English Wikipedia. Zhu et al. (2010) introduce a syntax-based machine translation model where splitting probabilities are learned from syntactic structure. Woodsend and Lapata (2011) induced a grammar from the parallel Wikipedia corpus annotated with syntactic trees and used an integer linear programming model for selecting the most appropriate simplification from the space of possible rewrites generated by the grammar; they report learning 438 rules for sentence splitting. Probabilistic models have also been proposed: Narayan and Gardent (2014) determine splitting points using a dedicated probabilistic module trained on the parallel Wikipedia corpus annotated with semantic structures, while Narayan and Gardent (2016) extend this approach to an unsupervised setting where splitting points are determined based on the maximum likelihood of sequences of thematic role sets present in the simplified version of English Wikipedia.
More recent work has directly addressed the sentence splitting task. Narayan et al. (2017) introduce a dataset for training sentence splitting models called WebSplit and report results for various neural models trained on this data, comparing a vanilla sequence-to-sequence model with a multi-source and a semantically informed model. Aharoni and Goldberg (2018) present an alternative train/dev/test partition for WebSplit which better supports generalisation and show that adding a copy mechanism helps improve results. One limitation of the WebSplit corpus is that it uses a small vocabulary. To remedy this shortcoming, Botha et al. (2018) create a new dataset called WikiSplit by mining Wikipedia's edit history; WikiSplit contains one million naturally occurring sentence splits. While these efforts are focused on syntax- or semantic-based sentence splitting, our work targets discourse-based sentence splitting.
Closest to our work, Niklaus et al. (2019b,a,c) define a set of 35 hand-crafted transformation rules to recursively decompose sentences into a hierarchical structure relating core sentences linked via rhetorical relations. They do not generate a well-formed text, and the proposed rule-based approach will not easily generalise to other languages. Furthermore, because they focus on producing sentences representing minimal semantic units, their system outputs contain a large number of very short sentences, which poses some readability issues. In contrast, we present a dataset for training discourse splitting models and Transformer-based, encoder-decoder models for generating discourse splits. The included examples exhibit a single split per sentence and do not rely on a deep hierarchical representation of the discourse structure, thereby preserving readability.
Tasks and Data

Tasks
We focus on cases of discourse-based splitting such as illustrated in the top tier of Table 1, where the input text C1 includes a discourse connective ("after this") denoting a discourse relation between two discourse units, and the split output includes a corresponding discourse adverbial ("Afterwards" in S1, "Before this" in S2). We refer to the discourse tree representing the discourse structure of both C and S as T.
We consider two approaches: an end-to-end approach where the model directly splits the input text C into two shorter sentences S; and a pipeline approach where we first map C to a discourse tree T and then map this tree to the split output S.

Data
We create (C, S) pairs using both synthetic and organic, parallel data. We then extend these pairs to (C, T, S) triples using rule-based and discourse parsing techniques to create the associated discourse tree T.

Creating C/S Pairs
Organic, Parallel Data. We create this data by extracting discourse-split instances from two existing datasets, WikiSplit and MUSS.
WikiSplit (Botha et al., 2018) is a sentence splitting dataset containing 1M single sentences alongside a two-sentence variant which preserves their original meaning. This data is extracted from Wikipedia edit history and therefore contains organic instances of C-to-S transformations.
The multilingual unsupervised sentence simplification dataset (MUSS) (Martin et al., 2020) contains 2.7M pairs of text sequences mined from Common Crawl web data which were estimated to be paraphrases of each other using L2 distance on LASER embeddings. Filtering out only those pairs that represent a splitting operation yields a subset of 157K examples. Like WikiSplit, this dataset is organically human-authored.
To create a discourse splitting dataset, we then extract from these two datasets all instances such that either the input contains a discourse connective or the output contains a discourse adverbial. We consider the discourse relations specified in the Penn Discourse Treebank (PDTB) and select a subset of these which we determined to be commonly represented via an adverbial connective between two sentences. We then compile a set of intra-sentential connective analogues for each. Table 2 shows the set of discourse connectives and adverbials used together with their corresponding discourse relations. They cover 7 of the 15 second-order relations occurring in the PDTB.
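The extraction step can be sketched as a simple lexical filter over (C, S) pairs. The connective and adverbial inventories below are small illustrative stand-ins for the full mapping of Table 2, not the actual lists used in this work.

```python
# Illustrative sketch of the discourse-split filter. The two inventories are
# stand-ins for the full connective/adverbial mapping of Table 2.
CONNECTIVES = {"after this", "because", "although"}    # intra-sentential (in C)
ADVERBIALS = {"afterwards", "as a result", "however"}  # inter-sentential (in S)

def is_discourse_split(complex_sent: str, split_sents: str) -> bool:
    """Keep a (C, S) pair if C contains a known discourse connective or
    S contains a known discourse adverbial."""
    c, s = complex_sent.lower(), split_sents.lower()
    return (any(conn in c for conn in CONNECTIVES)
            or any(adv in s for adv in ADVERBIALS))
```

A pair is kept as soon as either side matches, since a connective in C or an adverbial in S each signals an explicitly marked discourse relation.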
Synthetic Data. The Common Crawl News corpus (CC-News) (Nagel, 2016) is a large collection of news articles that have been scraped from the internet. We use the news-please (Hamborg et al., 2017) python library to mine (i) a set of 1 million sentence pairs (D-CC-News-S) whose second sentence contains an adverbial and (ii) a set of 800K sentences (D-CC-News-C) which contain a discourse connective. We then create the corresponding input text (C) and discourse tree (T) for each sentence pair in D-CC-News-S, and the corresponding discourse tree for each sentence in D-CC-News-C, using rules and a discourse parser, as explained in the following section.

Creating (C,T,S) Triplets
We use discourse trees (i) to derive (C, T, S) triplets from the parallel data and (ii) to create matching C texts for the S texts in D-CC-News-S.
Creating Discourse Trees. For a given S, we employ the following rule-based method to derive a linearized tree of the form shown in Table 1 (T). The adverbial is removed from the sentence pair and mapped to the corresponding PDTB discourse relation, while the two sentences are used as the tree's arguments and are rearranged into the linearized tree for the relation instantiated by the adverbial. The ordering of the arguments is determined according to a defined schema for each relation, as stipulated in the PDTB manual. Table 1 (T) shows an example of this for the Temporal.Asynchronous relation, where the arguments are ordered chronologically.
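The rule-based S-to-T derivation can be sketched as follows. The adverbial inventory and the swap flags are illustrative assumptions based on the Temporal.Asynchronous example, not the full PDTB schema.

```python
# Sketch of the rule-based S -> T derivation. The swap flag says whether the
# two sentences must be reordered to match the relation's argument schema.
ADVERBIAL_TO_RELATION = {
    "afterwards":  ("TEMPORAL:ASYNCHRONOUS", False),  # already chronological
    "before this": ("TEMPORAL:ASYNCHRONOUS", True),   # must be swapped
}

def linearize(sent1: str, sent2: str, adverbial: str) -> str:
    """Map an adverbial-stripped sentence pair and its discourse adverbial to
    a linearized tree of the form shown in Table 1 (T)."""
    relation, swap = ADVERBIAL_TO_RELATION[adverbial.lower()]
    arg1, arg2 = (sent2, sent1) if swap else (sent1, sent2)
    return f"<DR> {relation} <ARG1> {arg1} <ARG2> {arg2} <EOS>"
```

Note that the "Afterwards" and "Before this" variants of the same split map to the same canonical tree, since the arguments are reordered according to the relation's schema.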
To create a discourse tree for a complex sentence C occurring in D-CC-News-C, we use the end-to-end PDTB discourse parser of Lin et al. (2014). Although not the most recent discourse parser, we chose it because it is publicly available as a simple-to-use end-to-end system and specifically uses the PDTB schema. However, we noticed that the parser often fails to extract the arguments of the relation, so we fall back to a naive extraction strategy in such cases. This naive approach selects the content on either side of the connective as the relation arguments.
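The naive fallback amounts to splitting the sentence at the connective. A minimal sketch, where the `connective` argument and the punctuation-stripping behaviour are assumptions:

```python
def naive_args(sentence: str, connective: str):
    """Fallback argument extraction: when the parser fails, take the text on
    either side of the connective as the two relation arguments."""
    idx = sentence.lower().find(connective.lower())
    if idx == -1:
        return None  # connective not present; nothing to extract
    arg1 = sentence[:idx].rstrip(" ,;")
    arg2 = sentence[idx + len(connective):].lstrip(" ,;")
    return arg1, arg2
```

This only works for connectives occurring between their two arguments, which is precisely the case the parser-failure fallback needs to cover.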
In both cases (deriving a discourse tree from a complex sentence C or from a pair of sentences S), the created discourse tree is similar to a PDTB discourse tree in that it uses the PDTB inventory of discourse relations and orders their arguments according to the PDTB annotation guidelines.
Deriving Complex Sentences from Pairs of Simple Sentences. We derive a single-sentence variant C from a sentence pair S in D-CC-News-S using a simple rule-based method which fuses the pair while maintaining the appropriate discourse relation and instantiating different possible argument orderings and connective alternatives. This process first randomly selects a connective from the set of possibilities, given the adverbial in S, and then combines it with the two arguments to form a single sentence. These combinations are of the form "arg1 connective arg2" or "connective arg1, arg2", depending on the selected connective. This method only partially captures possible variations between C and S, because C is constructed from S using simple rules that do not account for the lexical variability (paraphrasing, etc.) found in organic examples. However, as shall be shown in Section 6, because it permits creating multiple discourse variants of the same discourse split S using different connectives and orderings, this synthetic data helps to train discourse splitting models that are better able to generalise, such that they can generate different constructions for the same relation.
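The fusion step might look like the following sketch. The connective alternatives and the two templates are illustrative, drawn from the patterns quoted above rather than from the actual rule set.

```python
import random

# Illustrative fusion rules: each relation maps to (connective, template)
# alternatives, where "infix" produces "arg1 connective arg2" and "prefix"
# produces "Connective arg1, arg2". The inventory is an assumption.
FUSIONS = {
    "TEMPORAL:ASYNCHRONOUS": [("and after this", "infix"), ("after", "prefix")],
}

def lower_first(s: str) -> str:
    return s[:1].lower() + s[1:]

def fuse(arg1: str, arg2: str, relation: str, rng=random) -> str:
    """Fuse the two arguments of a relation into a single complex sentence,
    randomly picking one connective/template alternative."""
    connective, template = rng.choice(FUSIONS[relation])
    a1, a2 = arg1.rstrip("."), lower_first(arg2.rstrip("."))
    if template == "infix":
        return f"{a1} {connective} {a2}."
    return f"{connective.capitalize()} {lower_first(a1)}, {a2}."
```

Calling `fuse` repeatedly on the same pair yields the different C variants of one split S, which is what gives the synthetic data its diversity.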
We do not attempt to automatically derive S from C for D-CC-News-C, as this is a more complex task requiring many more alterations to reliably produce coherent samples. For instance, when there is a connective at the beginning of the sentence, it is difficult to identify which parts of the remaining sentence constitute the individual relation arguments. Additionally, rewriting and coreference resolution regularly need to be performed.

Training and Test Data
Table 3 summarises the data used for training and development. For evaluation, we extracted a set of 352 (C, T, S) triples from the organic datasets (184 triples from WikiSplit and 168 from MUSS), making sure to maintain an approximately even distribution over the supported connectives. To ensure a high level of quality, we then manually corrected the contents of T, C, and S where necessary, i.e., when C and S connectives did not match or when the wrong parts of the text had been flagged as relation arguments in T.

Models
Given a complex sentence C with discourse tree T and split output S, we consider and compare two approaches: an end-to-end approach C2S where the split output S is directly generated from C; and a pipeline approach PL which uses C's discourse tree to mediate the split, i.e., first mapping C to its discourse tree T and second, mapping this tree to the split output S. We try both of these approaches in order to investigate how difficult it is for an end-to-end model to incorporate the discourse structure on its own and to what extent, if any, explicit mediation of this information aids the performance of discourse-based splitting.
For each of these two approaches, we explore different ways of combining the training data: using only the synthetic data (Synth), only the organic data (Organic), or both (Synth+Organic). We also investigate a pre-training and fine-tuning approach where we pre-train on the synthetic data and fine-tune on the organic data; and a multi-task learning approach where we multi-task on the intermediate mapping tasks (mapping C to T and mapping T to S) and on the end-to-end task (mapping C to S).

Experimental Setup
All of our generative models use the BART architecture (Lewis et al., 2020) and were trained on a computing grid using 4 Nvidia RTX 2080 Ti GPUs. Each experiment starts by fine-tuning the facebook/bart-base model hosted by HuggingFace, which has 6 layers in each of the encoder and decoder, a hidden size of 768, and was pre-trained to perform reconstruction of corrupted documents on a combination of books and Wikipedia data.
During training, we used a learning rate of 3e-5 and a batch size of 16, and applied dropout with a rate of 0.1 and early stopping as regularisation measures. For each experiment we set aside 5% of the training set for validation. During generation, we perform beam search with a beam size of 4.
We compare the following models:

Split Baseline (BL-Split). Pre-trained BART fine-tuned on a 1M-example dataset of both syntax- and discourse-based splittings (WikiSplit). This baseline allows us to compare training with very large heterogeneous training data (BL-Split) vs. learning from smaller, discourse-split data (BL-DSplit).
Discourse-Split Baseline (BL-DSplit). Pre-trained BART fine-tuned on a discourse-focused subset of WikiSplit (D-WikiSplit). This baseline is to be directly compared with BL-Split.
Parser Pipeline Baseline (PL-Parse). A pipeline of two models. The first uses the discourse parser process used to generate Ts from Cs in Section 3 (C2T), and the second is a pre-trained BART fine-tuned on (T, S) data (T2S). We experimented with training the T2S component on various datasets and found the best to be that trained purely on synthetic data. Thus, any pipeline mentioned in the remainder of this paper refers to a specific C2T component connected to this same T2S component. This baseline allows us to compare pipeline models whose C2T component is learned on the split data vs. one where the C2T component uses an existing discourse parser.
End-to-End Model (E2E). Pre-trained BART fine-tuned on discourse-split data. We report results for variants trained on D-CC-News-S (E2E-Synth), D-WikiSplit and D-MUSS (E2E-Organic), and all three combined (E2E-Both).
Pipeline Model (PL). A pipeline of two models. The first model is pre-trained BART fine-tuned on (C, T) data and the second is a pre-trained BART fine-tuned on (T, S) data from D-CC-News-S. We report results for pipelines with a C2T component trained on all D-CCNews data (PL-Synth), D-MUSS data (PL-Organic), and D-CCNews combined with D-WikiSplit and D-MUSS data (PL-Both).
Pre-training and Fine-tuning (PT+FT). Pre-trained BART fine-tuned on one dataset before being further fine-tuned on another. We try training first on either synthetic or standard WikiSplit data and then fine-tuning on D-WikiSplit and D-MUSS data. Using WikiSplit for the first step was found to be the best-performing configuration for the end-to-end system (E2E-PTFT), while using D-CCNews proved better for the pipeline (PL-PTFT).

Multi-Tasking (MTL)
We prefix the training data with a control token indicating whether a training instance maps a complex input to a discourse tree (c2t), a discourse tree to a split text (t2s), or a complex input to a split output (c2s), and train pre-trained BART on this data. We use training examples from D-CCNews, D-WikiSplit, and D-MUSS. At inference time, we prefix the input with the c2s control token for the end-to-end model, and with the c2t and t2s control tokens for the two components of the pipeline model.
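A minimal sketch of the control-token scheme. The exact token format (here `task: text`) is an assumption, as is the helper producing the three multi-task instances from one (C, T, S) triple.

```python
TASKS = {"c2t", "t2s", "c2s"}

def with_task_token(task: str, text: str) -> str:
    """Prefix an input with its task control token (token format assumed)."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    return f"{task}: {text}"

def mtl_examples(c: str, t: str, s: str):
    """One (C, T, S) triple yields three multi-task training instances:
    C -> T, T -> S, and the end-to-end C -> S."""
    return [(with_task_token("c2t", c), t),
            (with_task_token("t2s", t), s),
            (with_task_token("c2s", c), s)]
```

At inference time the same prefixing applies: the end-to-end model receives a c2s-prefixed input, while the pipeline chains a c2t-prefixed call with a t2s-prefixed one.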

Evaluation Metrics
As illustrated in Table 4, variants of a discourse split may differ in terms of sentence order, discourse connective, and rephrasing. To account for such variants while automatically assessing meaning preservation and discourse structure in the generated output, we use a combination of metrics.
Meaning Preservation. We measure meaning preservation using BLEU-4 and SAMSA. We calculate BLEU scores (Papineni et al., 2002) between the ground-truth reference and the generated text using the SacreBLEU library (Post, 2018). We use the EASSE python library (Alva-Manchego et al., 2019) to compute SAMSA scores. SAMSA (Sulem et al., 2018) puts more focus on the structural aspects of the text by leveraging a semantic parser; we believe it is sufficient here, as all outputs should contain the same number of sentences and would therefore receive the same non-split penalty.

We also introduce a custom binary metric (D-ACC) which classifies an output as positive if (i) the correct discourse relation is maintained, (ii) the sentences are correctly ordered, and (iii) there is sufficient semantic similarity between the generated text and the ground-truth. A text will have a D-ACC score of 1 if it has a high BLEU (BLEU > 0.5) and either a low sentence BLEU (S-BLEU < 0.1) with a discourse adverbial which reverses the order of the arguments (Table 4, Ex. 3) or a high sentence BLEU and a discourse adverbial which preserves the input discourse relation (Table 4, Ex. 1 and 2). Conversely, outputs with low BLEU, and outputs with high BLEU, low S-BLEU and the same discourse connective as the reference (Table 4, Ex. 4), will be assigned a score of 0. We treat SAMSA and D-ACC as our primary metrics for comparing performance between models as, together, they provide an evaluation of both the meaning preservation and coherence of the split and the preservation of the discourse structure. Table 4 shows several example outputs and their corresponding scores.
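The D-ACC decision rule described above can be written out directly. The `connective_kind` label (whether the output adverbial preserves or reverses the argument order relative to the reference) is assumed to be determined separately by matching against the connective inventory.

```python
def d_acc(bleu: float, s_bleu: float, connective_kind: str) -> int:
    """Binary D-ACC score. `connective_kind` is 'same' if the output adverbial
    preserves the reference's argument order (e.g. 'After this') and
    'reversed' if it inverts it (e.g. 'Before this'). Thresholds follow the
    text (BLEU > 0.5, S-BLEU < 0.1)."""
    if bleu <= 0.5:                   # insufficient semantic similarity
        return 0
    if s_bleu < 0.1:                  # sentence order differs from reference
        return 1 if connective_kind == "reversed" else 0
    return 1 if connective_kind == "same" else 0
```

In other words, a reordered output is only accepted when its adverbial signals the reordering, and an order-preserving output is only accepted when its adverbial preserves the relation.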

Human Evaluation
In addition to using automated metrics, we performed human evaluation to compare our highest-performing models and baseline systems using the MTurk platform. We considered a subset of 96 randomly selected examples from our test set (12 from each discourse relation type) and presented human annotators with the generated text for each example from our best-performing pipeline system (PL-Synth), asking them to compare it with (a) the result from BL-Split (trained on generic split data), (b) the ground-truth result with the adverbial removed, and (c) the result from our best-performing end-to-end model (E2E-Both). Each combination was presented to 10 different annotators, who were asked to compare the two texts in terms of their grammaticality as well as how similar in meaning they are to the C input. In total we collected 5,760 judgments: 960 judgments for each pair of models compared and each criterion (grammaticality vs. meaning preservation). Further details of this process are outlined in Appendix A.2.
We do not compute inter-annotator agreement scores due to some of the complexities in using the crowd-sourcing platform. Specifically, it would require having every annotator complete every comparison task, which is hard to manage at scale when posing each comparison as an individual task. To mitigate this issue, we opted to have a larger number of annotators complete each task, coupled with a larger number of unique tasks, in an attempt to smooth out individual differences.

Results and Discussion
Table 5 summarises the results.
Pipeline vs. End-to-End. While no single configuration outperforms all others, PL-Synth ranks high for meaning preservation (SAMSA and BLEU) and for discourse structure (D-ACC and D-Rel).
More generally, we see that PL models universally outperform their E2E variants in terms of discourse structure (Rel-ACC and D-ACC). Conversely, the E2E models tend to show better results in terms of meaning preservation (SAMSA and BLEU). This suggests that while the PL models are good at producing valid connectives and the correct sentence order (high D-ACC), their generative capacity needs improvement.
Synthetic vs. Organic Data. Another clear trend is that models trained with synthetic data have significantly higher D-ACC than those trained with organic data. This confirms our hypothesis that, because it includes multiple variants of the same discourse split using different connectives and orderings, the synthetic data helps to train discourse splitting models that are better able to generalise, i.e., are able to generate different connectives for the same relation. For both E2E and PL models, combining organic and synthetic data (E2E-Both and PL-Both) appears to reduce the performance trade-off of using one data type in isolation.
Alternative ways of combining organic and synthetic data, using either pre-training and fine-tuning or multi-tasking, did not yield improvements. For both regimes, we experimented with multiple hyperparameters and data combinations.
The details of these experiments are given in Appendix A.3.

Generic vs. Discourse-Split Data. In terms of meaning preservation (BLEU, SAMSA), BL-DSplit (trained on 371K instances) performs on par with BL-Split (1M instances), showing that discourse-focused models can compete with standard splitting models when trained on much smaller, dedicated datasets. Moreover, in terms of discourse structure (D-ACC) and generalisation (Rel-ACC, Conn-ACC), E2E-Organic has significantly higher generalisation capacity than BL-Split (p = 0.046). This improvement becomes more dramatic when also including the synthetic data (E2E-Both) (p = 8.72e-7).

Human Evaluation
The results from the human evaluation (Table 6) confirm those of the automatic evaluation.
Human annotators find the output of PL-Synth less grammatical and less meaning-preserving than either of the end-to-end models (E2E-Both and BL-Split). This corroborates the divergence seen between E2E and PL models in SAMSA and BLEU scores.
For meaning preservation, annotators more often selected PL-Synth over BL-Split than they did PL-Synth over E2E-Both (p = 0.138), strengthening the observation that discourse-focused models perform this task better than generic splitting models.
PL-Synth produces texts that are equally grammatical yet significantly more meaning-preserving (p = 0.017) than the adverbial-stripped ground-truths. This reinforces the importance of maintaining discourse coherence when performing sentence splitting.
Upon examination of the human evaluations, we found that annotators often marked the less grammatical text as being less meaning-preserving by default. When controlling for this and only considering cases where both texts were labelled as equally grammatical (bottom tier of Table 6), we see improved results for PL-Synth: in terms of meaning preservation, there is less difference between PL-Synth and the end-to-end models and an increased difference between PL-Synth and the ground-truth with the adverbial removed.
Qualitative Analysis. In addition to the automatic and human evaluations, we perform a qualitative analysis of common mistakes seen in system outputs. Table 7 in Appendix A.4 shows some examples of common errors for PL-Synth, E2E-Both, and BL-Split.
We can group these mistakes into four broad categories: connective, content, splitting, and hallucinations. Connective errors are those that use an incorrect connective or lack one entirely. Content errors are cases where the semantic content of the input is not maintained in the output. Splitting errors are cases where splitting has not been performed or has been done in the wrong place. We also occasionally see hallucinations, where the output includes out-of-context information.
The BL-Split model often fails to use a valid adverbial, instead merely splitting the sentence at the position of the connective. We believe this is because it does not fully learn to maintain the discourse relation. It has also been observed to include hallucinated terms in the output.
We commonly see splitting errors for both PL and E2E models. The PL model often splits at a position containing a known connective term where it is not actually acting as a connective in context. This is due to the intermediary task incorrectly segmenting the input, possibly as a result of parser mistakes in the training data. On the other hand, the E2E model sometimes performs no split at all, particularly where certain grammatical markers (e.g. semicolons) are present.

Conclusion
In this paper we introduced the task of Discourse-based Sentence Splitting together with a large-scale dataset of both organic and synthetic discourse splits. Experimental evaluation revealed that discourse-based, pipeline models have better discourse relation preservation capabilities than end-to-end models, and that synthetic data is critical for learning models that can generalise, i.e., that can generate multiple variants of the same discourse relation. In future work, we would like to create more document-aware models incorporating both syntax- and discourse-based sentence splitting at the document level.

A.1 Training Details
During this work we ran a range of experiments using different dataset combinations. For each of our primary model types (E2E, PL) we ran training experiments with solely organic data, solely synthetic data, both together, and various combinations therein. For instance, in the case of solely organic data, we ran separate experiments using D-WikiSplit, D-MUSS, and the two in combination, for both E2E and PL. The highest-performing of these was then selected as E2E-Organic and PL-Organic, respectively.
All of these used the BART architecture with the same fixed hyperparameters, as outlined in the paper. Training and convergence times varied depending on the task and data used, but E2E models were trained on average for ~48 hours, and PL models for ~24 hours.

A.2 Human Evaluation
We perform our human evaluation via the crowd-sourcing platform Amazon Mechanical Turk. We present a web form to evaluators which includes some example texts and questions they must answer. These forms are referred to as HITs. In our case, each HIT contains three pieces of text (A, B, and X): the output from PL-Synth for a given test example, the output from one of our three comparators (BL-Split, E2E-Both, and the ground-truth with no adverbial) for the same example, and the input C, respectively.
Evaluators are then asked to answer the following questions:

• Which text (A or B) has more grammatical/fluent/well-formed English?

• Which text (A or B) is most similar in meaning to X?

For each of the two questions they must answer with either A, B, or Equal. For each of the 96 selected test examples, we performed 3 model comparisons and sourced 10 separate evaluators for each, meaning we had 2,880 HITs completed. Each of these HITs gives us 2 judgements (one for grammaticality and one for meaning preservation), so we received 5,760 individual judgements. We paid $0.06 USD for each HIT, meaning we spent $172.80 USD in total.
An example of how one of these hits looks to evaluators can be seen in Figure 1.

A.3 Fine-tuning and Multi-Task Learning Experiments
In this work, we experimented with various fine-tuning and multi-task learning regimes in order to see if further performance gains could be obtained. In the case of multi-task learning, we trained a sequence-to-sequence model to simultaneously learn to perform the C2T, T2S, and C2S tasks. The motivation was that there could be useful shared features between the tasks that would help overall learning performance. We used C2T data from D-CCNews, D-MUSS, and D-WikiSplit, and both T2S and C2S data from D-CCNews-S, D-MUSS, and D-WikiSplit.
For our pre-training and fine-tuning experiments, we tried different dataset combinations and training strategies for both our end-to-end model and our pipeline system. Initially, we tried pre-training first on synthetic data and then fine-tuning on organic data, either as D-MUSS and D-WikiSplit in combination, or one after the other. We also experimented with pre-training on the standard WikiSplit dataset to see whether useful features could be learned from training on a generic splitting task. For each of these, we also experimented with freezing/unfreezing different layers in the network (embedding, encoder, and decoder).
As mentioned in the paper, we were unable to observe any improvements over our standard models in any of these experiments. For our pre-training and fine-tuning experiments, the best strategy for the end-to-end model was to simply train the BART architecture on standard WikiSplit and further fine-tune on D-MUSS and D-WikiSplit in combination. For the pipeline system, it was to train on the D-CCNews data, fine-tune on D-MUSS, then further fine-tune on D-WikiSplit. The performance of these models is reported in the paper but, as can be seen, they failed to outperform other experiments in their respective categories.

A.4 Generation Examples
Table 7 shows example outputs from several models (PL-Synth, E2E-Both, BL-Split) for a range of different example inputs. We try to showcase various ways each of the models can fail.

Figure 1: An example human evaluation HIT for a single test example.

Table 3: Discourse Split Training Data (# Instances: number of (C, T) pairs for D-CCNews-C, number of (C, T, S) triples otherwise).

Table 4: Examples illustrating how correct and incorrect variants of the reference impact the scores. O indicates that the order of the sentences has been reversed, C that the discourse adverbial differs from that used in the reference, and T that the text has changed. Only D-ACC distinguishes good from bad variants.
Ex. 1. The Masovians were caught by surprise, since virtually without any defense the capital, Płock, fell. After this, Mindaugas crossed the Vistula river and captured the fortress of Jazdów.
Ex. 2. The Masovians were caught by surprise, since virtually without any defense the capital, Płock, fell. Afterwards, Mindaugas crossed the Vistula river and captured the fortress of Jazdów.
Ex. 3. Mindaugas crossed the Vistula river and captured the fortress of Jazdów. Before this, the Masovians were caught by surprise, since virtually without any defense the capital, Płock, fell.
Ex. 4. Mindaugas crossed the Vistula river and captured the fortress of Jazdów. After this, the Masovians were caught by surprise, since virtually without any defense the capital, Płock, fell.

Table 5: A summary of results. Each row represents the results of the best E2E and PL model for the specified data category.

Table 6: Results for human evaluation. Cells show the proportion of cases where the pipeline was deemed better, equal, or worse than a particular baseline.