BiSECT: Learning to Split and Rephrase Sentences with Bitexts

An important task in NLP applications such as sentence simplification is the ability to take a long, complex sentence and split it into shorter sentences, rephrasing as necessary. We introduce a novel dataset and a new model for this ‘split and rephrase’ task. Our BiSECT training data consists of 1 million long English sentences paired with shorter, meaning-equivalent English sentences. We obtain these by extracting 1-2 sentence alignments in bilingual parallel corpora and then using machine translation to convert both sides of the corpus into the same language. BiSECT contains higher quality training examples than the previous Split and Rephrase corpora, with sentence splits that require more significant modifications. We categorize examples in our corpus and use these categories in a novel model that allows us to target specific regions of the input sentence to be split and edited. Moreover, we show that models trained on BiSECT can perform a wider variety of split operations and improve upon previous state-of-the-art approaches in automatic and human evaluations.


Introduction
Understanding long and complex sentences is challenging for both humans and NLP models. NLP tasks like machine translation (Pouget-Abadie et al., 2014; Koehn and Knowles, 2017) and dependency parsing (McDonald and Nivre, 2011) tend to perform poorly on long sentences. Text simplification (Zhu et al., 2010; Xu et al., 2015) is often formulated with a specific step to break longer sentences into shorter ones. This task is referred to as Split and Rephrase.
Several past efforts have created Split and Rephrase training sets, which consist of long, complex input sentences paired with multiple shorter sentences that preserve the meaning of the input sentence. Prior work introduced the WEBSPLIT corpus based on decomposing a long sentence into RDF triples (a form of semantic representation) and generating shorter sentences from subsets of these triples. However, the reliance on RDF triples and a limited vocabulary results in unnatural expressions (Botha et al., 2018) and repeated syntactic patterns (Zhang et al., 2020a).

(* Equal contribution. Our code and data are available at https://github.com/mounicam/BiSECT.)
More recently, the WIKISPLIT corpus (Botha et al., 2018) was introduced. It contains one million training examples of sentence splitting that were mined from the revision history of English Wikipedia. While this yields an impressive number of training examples, the data are often quite noisy, with around 25% of WIKISPLIT pairs containing significant errors (detailed in §3.2). This is because Wikipedia editors are not only trying to split a sentence, but also often simultaneously modifying the sentence for other purposes, which results in changes of the initial meaning.
In this paper, we introduce a novel methodology for creating Split and Rephrase corpora via bilingual pivoting (Wieting and Gimpel, 2018;Hu et al., 2019b). Figure 1 demonstrates the process. First, we extract all 1-2 and 2-1 sentence-level alignments (Gale and Church, 1993) from bilingual parallel corpora, where a single sentence in one language aligns to two sentences in the other language. We then machine translate the foreign sentences into English. The result is our BISECT corpus.
Split and Rephrase corpora, including BISECT, contain pairs with variable amounts of rephrasing. Some pairs only edit around the split location, while others require more involved changes to maintain fluency. In this work, we leverage this knowledge by introducing a classification task to predict the amount of rephrasing required, and a novel model that targets that amount of rephrasing.
The main contributions of this paper are:
• We introduce BISECT, the largest multilingual Split and Rephrase corpus. BISECT contains 938K English pairs, 494K French pairs, 290K Spanish pairs, and 186K German pairs.
• We show that BISECT is higher quality than WIKISPLIT, that it contains a wider variety of splitting operations, and that models trained with our resource produce better output for the Split and Rephrase task.
• We introduce a novel classification task to identify the types of sentence splitting outputs based on how much rephrasing is necessary.
• We develop a novel Split and Rephrase model that accounts for these classifications to control the amount of rephrasing.

Related Work
The idea of splitting a sentence into multiple shorter sentences was initially considered a sub-task of text simplification (Zhu et al., 2010; Narayan and Gardent, 2014). However, the structural paraphrasing required to split a sentence makes for an interesting problem in itself, with many downstream NLP applications. Prior work thus proposed the Split and Rephrase task and introduced the WEBSPLIT corpus, created by aligning sentences in WebNLG. WEBSPLIT contains duplicate instances and phrasal repetitions (Aharoni and Goldberg, 2018; Botha et al., 2018), and most of its splitting operations can be trivially classified (Zhang et al., 2020a), so subsequent Split and Rephrase corpora have been created to improve training (Botha et al., 2018) and evaluation (Sulem et al., 2018; Zhang et al., 2020a).

The main work we compare against is WIKISPLIT, a corpus created by extracting split sentences from Wikipedia edit histories (Botha et al., 2018). Concurrent work used a subset of WIKISPLIT to focus on sentence decomposition (Gao et al., 2021). While this approach is able both to extract many potential sentence splits and to transfer across languages, edited sentences do not necessarily retain the same meaning. In contrast, our corpus BISECT is created from aligned parallel documents. Bilingual corpora are generally leveraged for monolingual tasks with bilingual pivoting (Bannard and Callison-Burch, 2005), which assumes that two English phrases that translate to the same foreign phrase have similar meaning. This technique was used to create the Paraphrase Database (Ganitkevitch et al., 2013; Pavlick et al., 2015), a collection of over 100 million paraphrase pairs, and to improve neural approaches for sentential paraphrasing (Mallinson et al., 2017; Wieting and Gimpel, 2018; Hu et al., 2019a,b) and sentence compression (Mallinson et al., 2018).
The work introducing the Split and Rephrase task also reported the performance of several baseline models, of which the strongest was an LSTM-based model. Subsequent work improved performance using a copy-attention mechanism (Aharoni and Goldberg, 2018). We instead start with a BERT-initialized Transformer model (Rothe et al., 2020) and train it with an adaptive loss function to emphasize split-based edits. Concurrent work also introduced a neural graph-based approach for Split and Rephrase (Gao et al., 2021).

BISECT Corpus
To address the need for Split and Rephrase data that is both meaning-preserving and sufficiently large for training, we present the BISECT corpus.

Corpus Creation Procedure
The construction of the BISECT corpus relies on the sentence-level alignments from OPUS (Tiedemann and Nygaard, 2004), a publicly available collection of bilingual parallel corpora over many language pairs. While most of the translated sentences in OPUS are aligned 1-1, i.e., one sentence in Language A is mapped to one sentence in Language B, there are many aligned pairs consisting of multiple sentences from either A or B. This is a result of natural variation in the process of human translation. Sentence alignment algorithms (Gale and Church, 1993) match 1-1, 2-1, and 1-2 alignments in bitext. We extract all 1-2 and 2-1 sentence alignments from parallel corpora, where A is English and B is one of several foreign languages. Next, the foreign sentences are translated into English using Google Translate's Web API service (via the googletrans package: https://pypi.org/project/googletrans/) to obtain English sentence alignments between a single long sentence l and two corresponding split sentences s = (s_1, s_2). As the alignment information provided by OPUS is based on the presence of sentence-breaking punctuation, there are noisy alignments where l contains a pair of sentences instead of one complex sentence. These noisy alignments fall into two categories: two sentences pasted contiguously without any space around the sentence-breaking delimiter, and two independent sentences joined by a space without any punctuation. For the first case, we remove l and its corresponding splits when it contains a token with punctuation after the first two and before the last two alphabetic characters. For the second case, we generate a dependency tree for l (using spaCy) and discard l if it contains more than one unconnected component.
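The two filtering heuristics can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the fused-token regex and the head-index representation of the dependency tree are our own simplifications (the paper parses with spaCy).

```python
import re

# Heuristic 1: detect a sentence-breaking mark fused inside a token,
# e.g. "ramp.This" -- punctuation after the first two and before the
# last two alphabetic characters of a token.
FUSED = re.compile(r"^\W*[A-Za-z]{2,}[.!?][A-Za-z]{2,}\W*$")

def has_fused_sentences(sentence):
    return any(FUSED.match(tok) for tok in sentence.split())

# Heuristic 2: a dependency tree with more than one connected component
# signals two independent sentences joined by a space.  heads[i] is the
# index of token i's head; heads[i] == i marks a root, and each
# component contributes exactly one root.
def num_components(heads):
    return sum(1 for i, h in enumerate(heads) if h == i)

def keep_long_side(sentence, heads):
    """Keep the long side l only if it passes both noise checks."""
    return not has_fused_sentences(sentence) and num_components(heads) == 1
```

In practice the `heads` array would come from a spaCy parse of `l`; here it is passed in explicitly so the logic stays parser-agnostic.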
Moreover, we remove misalignment errors based on lexical and semantic overlap. We compute the lexical overlap ratio r as:

r = |L_l ∩ (L_{s_1} ∪ L_{s_2})| / |L_l ∪ L_{s_1} ∪ L_{s_2}|

where L_l, L_{s_1} and L_{s_2} denote the sets of lemmatized tokens in l, s_1 and s_2, respectively. We consider an aligned pair valid if r ≥ 0.25 and l, s_1 and s_2 all contain a verb; we discard invalid pairs. We also remove (l, s) pairs with low length-penalized semantic similarity scores (Zhang et al., 2020b; Maddela et al., 2021). We repeat this process over all available parallel corpora for each English-Foreign language pair, resulting in 938,102 filtered English-English pairs. An important characteristic of BISECT is that its size can be further increased as new parallel corpora are added to OPUS and processed in the manner described above. Table 1 breaks down the OPUS corpora and pivot languages used in creating the English version of BISECT. For the test set, a different set of corpora is used from the training set to prevent domain overlap. Moreover, the choice of corpus is based on the number of alignments extracted from each corpus: we choose relatively smaller corpora for development and testing to avoid shrinking the training set. To demonstrate that our approach can be extended to other languages, we also create BISECT corpora for French, Spanish, and German, using English as the pivot language. Corpus statistics for the non-English languages are given in Appendix G.
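The overlap filter can be sketched as below. The Jaccard-style normalization over lemmatized token sets is an assumption consistent with the r ≥ 0.25 threshold, and `overlap_ratio` / `is_valid_pair` are hypothetical helper names, not the authors' API.

```python
def overlap_ratio(lemmas_l, lemmas_s1, lemmas_s2):
    """Lexical overlap r between the long sentence l and the union of its
    two split sentences, computed over sets of lemmatized tokens."""
    long_side = set(lemmas_l)
    split_side = set(lemmas_s1) | set(lemmas_s2)
    return len(long_side & split_side) / len(long_side | split_side)

def is_valid_pair(lemmas_l, lemmas_s1, lemmas_s2, all_have_verb):
    # Keep the pair only if overlap is high enough and every sentence
    # contains a verb (verb presence precomputed from POS tags).
    return overlap_ratio(lemmas_l, lemmas_s1, lemmas_s2) >= 0.25 and all_have_verb
```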

Comparison to Existing Corpora
Corpus Statistics. Besides corpus size, we are interested in the amount of rephrasing (indicated by %new) and the syntactic complexity of sentences (approximated by length). In Table 2, we compare BISECT with previous Split and Rephrase corpora, including WIKISPLIT (Botha et al., 2018) and WEBSPLIT (Aharoni and Goldberg, 2018). We compute the number of aligned pairs (#pairs); the number of unique long sentences l (#unique); the percentage of new words added to s compared to l (%new); and the average token length of l and of the individual split sentences († marks crowdsourced corpora). BISECT is comparable in size while importantly containing longer aligned sentence pairs and a higher %new score, indicating that BISECT contains more complex pairs with significantly more rephrasing (see also examples in Tables 3 and 4).
Manual Quality Assessment. While BISECT does not suffer from meaning-altering edits the way WIKISPLIT does, a potential concern is the error introduced by translating foreign text into English. Thus, we perform a manual assessment of corpus quality by comparing 100 randomly selected pairs from each of the BISECT and WIKISPLIT corpora. We categorize each example (l, s) into two groups: (1) high-quality pairs, where both l and s are grammatical, l consists of exactly one sentence, and s contains exactly two sentences; and (2) significant errors, where the pair contains drastic errors impacting its usability. Table 3 shows the results of the manual inspection. When compared with WIKISPLIT, BISECT contains significantly more high-quality pairs and fewer pairs with significant errors. Pairs containing unsupported and deleted details are comparable across corpora, though WIKISPLIT skews more towards adding unsupported information, which is consistent with previous work (Zhang et al., 2020a). Moreover, we take 100 random samples from the German BISECT corpus and perform the same manual inspection. We chose German because translating to/from German is notoriously challenging for translation systems (Twain, 1880; Collins et al., 2005). As shown in Table 3, German BISECT still contains 77% high-quality pairs.

Categorization for Split and Rephrase
One aspect of the Split and Rephrase task that has received little attention, outside of Zhang et al. (2020a), is the amount of rephrasing that occurs in each instance, and more specifically the syntactic patterns involved in this rephrasing. Unlike more open-ended language generation tasks, the structural paraphrasing involved in Split and Rephrase is likely to be relatively consistent across domains; identifying these patterns is thus a critical step towards further improvement of neural approaches. In this work, we consider three major categories and break each of them down further into more specific syntactic patterns. The categories are derived from the entire dataset, spanning the domains of web, newswire, medical and legal text, and others.
The first group involves Direct Insertion, when a long sentence l contains two independent clauses and requires only minor changes in order to make a fluent and meaning-preserving split s. Within this category, we identify two sub-categories: Colon/Semicolon, which occurs when the clauses are connected by a colon or semicolon; and Conjunction with subject, where the clauses are connected by a conjunction and the second clause contains an explicit subject. The second group involves Changes near Split, when l contains one independent and one dependent clause, but modifications are restricted to the region where l is split. Within this category, we identify four sub-categories: instances containing a conjunction without subject, which involve two clauses connected by a conjunction where the second clause does not have an explicit subject; instances that contain a gerund, followed by an adjectival clause, adverbial clause, or prepositional phrase; instances that involve an explicit subordinate clause; and instances that contain a concluding relative clause. Finally, the third major group involves Changes across Sentences, where major changes are required throughout l in order to create a fluent split s. The main subcategory within this group involves a preceding relative clause, followed by a comma. Table 4 presents examples and the prevalence of each category in WIKISPLIT and BISECT, computed via a manual inspection of 100 random examples from each corpus. BISECT contains significantly more instances that require changes across the sentence to form a high-quality split. To assess the relative difficulty of these categories, we analyze the quality of sentence splits generated by DisSim, a system of hand-crafted rules based on a syntactic parse tree. DisSim produces disfluent sentence splits 34% of the time, and performs no splitting 9% of the time.
For the Changes near Split and Changes Across Sentence categories, the number of erroneous splits increases to 55% and 63%, respectively. Although rules correctly identify the location of sentence splits, they fail to effectively modify sentences requiring more expansive rephrasing.

Our Model
The BISECT corpus contains a significant amount of paraphrasing along with sentence splitting, and models trained on BISECT tend to alter the lexical choices made in the input sentence. Although this is desirable in some situations, such as sentence simplification, it can sometimes alter the meaning of the input sentence. We propose a novel model that allows finer-grained control over which parts of the sentence are changed. Our approach leverages the sentence split categories described in §3.3 to identify the split-based edits and incorporates them into a customized loss function as distantly supervised labels. This section describes the base model and a variant that adapts the high-rephrasing BISECT corpus to a sentence splitting task with minimal rephrasing.

Base Model
Our base model is a BERT-initialized Transformer encoder-decoder (Rothe et al., 2020); architecture and training details are given in Appendix A.

Table 4: Examples and prevalence (WIKISPLIT % / BISECT %) of sentence split categories, based on 100 random examples from each corpus. (Three subcategory labels are not recoverable from the extracted text.)

Changes Near Split (66% / 49%). Example (de→en): l: "The virus is carried and passed to others through blood or sexual contact and can cause liver inflammation, fibrosis, cirrhosis and cancer." s: "The virus is transmitted to other people through blood or sexual contact. It can cause liver inflammation, fibrosis, cirrhosis, and cancer."

Conjunction without subject (18% / 13%). Example (de→en): l: "An additional advantage is that a shorter ramp can be used, thereby reducing weight and improving the rear view of the driver." s: "Another advantage is that a shorter ramp can be used. This saves weight and improves the look of the rear of the vehicle."

(7% / 10%). Example: l: "For the fur edge I choose the smudge tool with a dissolved brush and paint in the mask along the black edge to get a smooth transition." s: "For the fur edge, I choose the tool with speckled brush tip and drag on the black edge in the mask. This creates a transition to the background."

(17% / 9%). Example (fr→en): l: "Over 3500 people visit the Centre every year where they are greeted by volunteers who show them around the study room and tell them about the collection." s: "Each year, more than 3,500 people visit the Center. They are greeted by volunteers who show them the study room and introduce them to the collection."

(24% / 17%).

Changes Across Sentence (1% / 11%). Example (fr→en): l: "Because these cities, settlements and regions were constructed for not hundred years, but for centuries." s: "All these towns, these localities were not built in a hundred years. They were created over the centuries."
Adaptive Loss using Distant Supervision
The base model treats all the sentence splitting categories (Table 4) similarly even though the edits necessary to split the sentence vary across the categories. We utilize heuristics and linguistic rules to categorize each source-target sentence pair and extract required edits based on the category. Finally, we train the base model on these classification and edit labels to guide the model to perform appropriate edits for each category.
Classification and Edit Labels. Given the source x = (x_1, x_2, ..., x_N) and target y = (y_1, y_2, ..., y_N), we assign a sentence category label l ∈ {"Direct Insertion", "Changes Near Split", "Changes Across Sentence"} to the training pair, and a binary label δ_i to each position indicating whether the word is modified from the input. Here, δ = (δ_1, δ_2, ..., δ_N) represents the edit labels, and δ_i = 1 marks a change required to split the sentence that cannot be copied from x. We ensure that x and y are of the same length using padding around the split. The split position for y corresponds to the position of the [SEP] token. For x, we extract the lexical differences between x and y using an edit distance algorithm and label the edit in x closest to the [SEP] token in y as the split position. Finally, we pad the sequences before and after the split positions so that they are of equal length. We provide an example in Appendix D.

We extract l for each pair using the following rules: (1) if the first level of the parse tree of x contains the pattern "S CC S", x contains a colon/semicolon, or the lexical differences between x and y (again extracted with an edit distance algorithm) contain only the split, we label the pair as Direct Insertion; (2) if the first level of the parse tree of x contains the pattern "S NP VP" or "SBAR NP VP", we label the pair as Changes Across Sentence; (3) if the first level of the parse tree contains "VP CC VP", or at least 5 words at the beginning and end of the sentence are copied from the source, we categorize the pair as Changes Near Split; (4) we label the rest as Changes Across Sentence. In case of multiple potential splits, we choose the split whose length is closest to that of the reference.
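The classification rules can be approximated in code. This sketch is our own rendering, not the released implementation: it operates on precomputed first-level constituent labels, and the 5-token prefix/suffix check is a simplification of the copy rule as stated.

```python
def classify_pair(top_labels, src_tokens, tgt_tokens,
                  has_colon_or_semicolon, only_edit_is_split):
    """Assign a split category to a training pair.

    top_labels is the sequence of constituent labels at the first level
    of the source parse tree, e.g. ["S", "CC", "S"].
    """
    seq = " ".join(top_labels)
    # Rule 1: two independent clauses, or the only edit is the split itself.
    if "S CC S" in seq or has_colon_or_semicolon or only_edit_is_split:
        return "Direct Insertion"
    # Rule 2: clause structure that forces edits across the sentence.
    if "S NP VP" in seq or "SBAR NP VP" in seq:
        return "Changes Across Sentence"
    # Rule 3: coordinated VPs, or long copied prefix and suffix.
    if "VP CC VP" in seq or (src_tokens[:5] == tgt_tokens[:5]
                             and src_tokens[-5:] == tgt_tokens[-5:]):
        return "Changes Near Split"
    # Rule 4: everything else.
    return "Changes Across Sentence"
```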
After extracting l, we construct δ using the lexical overlap between x and y. For Direct Insertion, we set the δ_i corresponding to the split position and its adjacent positions to 1, to capture the punctuation and capitalization changes. For Changes Near Split, we construct a variable-length window around the split position to accommodate the new words, and set the δ_i in the window to 1. To construct this window, we scan the sequence on each side of the split position until we reach a point where at least 3 consecutive positions are copied from x to y. Finally, we set δ to the all-ones vector for Changes Across Sentence, as the changes cannot be localized. Our manual inspection of 100 pairs from the BISECT training set showed that the rules correctly classified 83% of the pairs.

Distant Supervision. As l depends on the reference and cannot be used during inference, we introduce a multi-class classification task distantly supervised by l. We train our model in a multi-task learning setting to predict l and perform generation. The classifier predicts the probability that x belongs to a split category using the encoder representation of the [CLS] token prepended to the input by the BERT encoder. The classifier consists of a linear layer with a softmax activation function.
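The δ construction under these rules can be sketched as follows. The function signature, the boolean `copied` alignment, and the exact treatment of the copied run at the window boundary are our own simplifications of the procedure described above.

```python
def edit_labels(category, copied, split_pos, n):
    """Binary edit labels (delta) over n padded, aligned positions.

    copied[i] is True when position i of the target is copied verbatim
    from the source; split_pos is the index of the [SEP] split point.
    """
    if category == "Changes Across Sentence":
        return [1] * n                        # edits cannot be localized
    delta = [0] * n
    delta[split_pos] = 1
    if category == "Direct Insertion":
        # Only the split and its adjacent positions (punctuation, casing).
        for i in (split_pos - 1, split_pos + 1):
            if 0 <= i < n:
                delta[i] = 1
        return delta
    # Changes Near Split: grow a window outward from the split on each
    # side until at least 3 consecutive positions are copied.
    for step in (-1, 1):
        i, run = split_pos + step, 0
        while 0 <= i < n:
            run = run + 1 if copied[i] else 0
            if run >= 3:
                break
            delta[i] = 1
            i += step
    return delta
```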
While l represents the sentence category, δ captures split-related edits. To ensure our model learns only split-based edits, we combine x and y in our decoder generation loss L_seq using δ. Writing ỹ_i = y_i if δ_i = 1 and ỹ_i = x_i otherwise, the loss over m training examples is

L_seq = -(1/m) Σ_j Σ_i δ_i^(j) log P(y_i^(j) | ỹ_<i^(j), x^(j))

where ỹ_<i represents the mixture of x and y histories. In other words, our model only learns the edits where δ_i = 1 and copies from the source sentence at the remaining positions. Finally, we jointly train the classifier and the Transformer using the cross-entropy loss and our custom split-focused loss. We provide model and training details in Appendix A.

Experiments and Results
In this section, we compare different Split and Rephrase models trained on our new BISECT corpus. We also conduct a carefully designed human evaluation, as automatic metrics are not fully reliable. Our model trained on BISECT establishes a new state-of-the-art for the task.

Data and Baselines
We train the models on the BISECT and WIKISPLIT corpora. For evaluation, we select the BISECT and HSPLIT-WIKI (Sulem et al., 2018) test sets to represent splitting with a high degree and a minimal amount of rephrasing, respectively. HSPLIT-WIKI is a human-annotated dataset with 359 complex sentences and 4 references for each complex sentence. Following previous work (Botha et al., 2018; Zhang et al., 2020a), we do not use WIKISPLIT for evaluation, because this corpus was constructed explicitly to be used only as training data, as it contains inherent noise and biases. BISECT contains 928,440/9,079 train and dev pairs, while WIKISPLIT contains 989,944/5,000. Note that we constructed the BISECT test set by manually selecting 583 high-quality sentence splits from 1,000 random source-target pairs from the EMEA and JRC-ACQUIS corpora. We compare our approach with Copy512 (Aharoni and Goldberg, 2018), a state-of-the-art model consisting of an attention-based LSTM encoder-decoder with a copy mechanism (See et al., 2017). We also use our base model trained on WIKISPLIT (Rothe et al., 2020) as another state-of-the-art baseline.

Automatic Evaluation
Existing automatic metrics, such as BLEU (Papineni et al., 2002) and SAMSA (Sulem et al., 2018), are not optimal for the Split and Rephrase task, as they rely on lexical overlap between the output and the target (or source) and underestimate the splitting capability of models that rephrase often. We therefore focus on BERTScore (Zhang et al., 2020b) and SARI, which measures the correctness of inserted, kept and deleted n-grams when compared to both the source and the target. We use an extended version of SARI that considers lexical paraphrases of the reference: an n-gram from the output is considered correct if the given n-gram or its paraphrase from PPDB (Pavlick et al., 2015) occurs in the reference, using the PPDB-L version. Without this change, the original SARI also tends to underestimate rephrasing.

Table 5 shows that our models trained on BISECT outperform their equivalents trained on WIKISPLIT in terms of SARI and BERTScore. Note that the models trained on WIKISPLIT have an advantage on the HSPLIT-WIKI test set because they belong to the same domain; models trained on BISECT do not have a similar advantage on the BISECT test set, which belongs to a different domain than the training data. When compared to the base model (Transformer w/ BISECT), our model (Transformer control w/ BISECT) shows higher self-BLEU and a lower percentage of new words, indicating that it performs less rephrasing by focusing on split-based edits.
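The PPDB-extended matching rule can be sketched as below. Representing PPDB as a dict from an n-gram to a set of paraphrases is a hypothetical simplification for illustration; the actual PPDB-L resource is a large scored paraphrase collection.

```python
def ngram_matches(output_ngrams, reference_ngrams, ppdb):
    """Count output n-grams judged correct under the extended SARI rule:
    an n-gram counts if it, or any of its PPDB paraphrases, occurs in
    the reference.  ppdb maps an n-gram to a set of paraphrases."""
    ref = set(reference_ngrams)
    correct = 0
    for g in output_ngrams:
        if g in ref or any(p in ref for p in ppdb.get(g, ())):
            correct += 1
    return correct
```

Plain lexical matching would score the first example below as zero; the paraphrase lookup is what keeps SARI from penalizing legitimate rephrasing.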

Human Evaluation
We asked three annotators to rate the overall quality of the sentence splits generated by different models on a 0-100 scale, where 0 represents an erroneous split and 100 a perfect meaning-preserving split. Unlike previous work that measures meaning preservation and fluency separately, we collected only one rating, because it was difficult to distinguish between grammatical and meaning-changing errors. We modeled our evaluation after the WMT evaluation (Bojar et al., 2019), which uses a similar setting. We evaluated on 100 random sentences from each of the BISECT and HSPLIT-WIKI test sets. The annotators were university students trained using an instructional video and a qualification phase. To capture annotation quality, we included a control output generated by randomly selecting a system output and replacing 4 to 8 words with random words. Our annotators gave low ratings (<20) to the control outputs, indicating that the ratings are reliable. We provide the annotation interface design in Appendix F. Table 6 shows the results on the entire BISECT and HSPLIT-WIKI test sets. Figure 2 shows human ratings on 100 generated sentence splits from the BISECT test set, broken down by the sentence split categories described in Table 4.

Table 7: Examples of system outputs from the BISECT test set. The source sentence belongs to the category Changes Across Sentence; in the original table, blue marks the location of the required edits in the source, green indicates good edits, and red indicates errors.

Source: Having determined, after consulting the Advisory Committee that sufficient evidence existed for the initiation of a partial interim review, the Commission published a notice in the Official Journal of the European Communities and commenced an investigation.

Transformer w/ WIKI: Having determined, after consulting the Advisory Committee, that sufficient evidence existed for the initiation of a partial interim review. The Commission published a notice in the Official Journal of the European Communities and commenced an investigation.

Transformer w/ BISECT: After consulting the Advisory Committee, the Commission determined that there was sufficient evidence for the initiation of a partial interim review. The Commission issued a notice in the Official Journal of the European Communities and began an investigation.

Transformer control w/ BISECT: After consulting the Advisory Committee, there was sufficient evidence for the initiation of a partial interim review. The Commission published a notice in the Official Journal of the European Communities and initiated an investigation.
The sentence splits generated by models trained on BISECT are of better quality than the ones trained on WIKISPLIT. Our model with adaptive loss (Transformer control w/ BISECT) performs better than the base model (Transformer w/ BISECT) in four of the seven split categories. The difference in quality is most evident for the Preceding Relative Clause category, as it requires changes across sentences. We provide an example in Table 7, and several more in Appendix E.

Conclusion
In this work, we introduce BISECT, a new corpus for the Split and Rephrase task in several languages. We create this by making use of bilingual parallel corpora, and translating instances of aligned split sentences. We show that the sentence splitting models trained on our new corpus generate fewer errors than their counterparts trained on the existing datasets. To further improve meaning preservation and diversity, we propose a novel approach that identifies split-related edits in a training pair using linguistic rules and trains the model solely on splitbased edits. Our proposed approach trained on BISECT outperforms existing systems in terms of both automatic and human evaluations. We plan to investigate and create better automatic evaluation metrics for future work.

A Implementation and Training details
We implemented the BERT-initialized Transformer using the Fairseq toolkit. The encoder and decoder follow the BERT-base architecture. The encoder is initialized with the BERT-base checkpoint, and the decoder is randomly initialized. The sentence classifier is a feedforward network containing an input layer, one hidden layer with 1000 nodes, and an output layer with 3 nodes and a softmax activation. We used the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.0001, a linear learning rate warmup of 40k steps, and 100k training steps, with a batch size of 64. We used the BERT WordPiece tokenizer. During inference, we use beam search of width 10 and ensure that the beam search does not repeat trigrams. We used the hyperparameters of the BERT-initialized Transformer described in Rothe et al. (2020). The model takes 10 hours to train on 1 NVIDIA GeForce GPU.

B BiSECT Language Composition

[Table: BISECT pairs per dataset for each pivot language — French, German, Spanish, Arabic, Dutch, Italian, Portuguese, and Russian; the cell values are not recoverable from the extracted text.]

NEWSELA-AUTO
Source: About 160,000 Girl Scouts participated in the program over the past year and were credited with selling nearly 2.5 million boxes of cookies beyond those sold through traditional in-person methods.
Reference: About 160,000 Girl Scouts used Digital Cookie last year . They sold almost 2.5 million boxes of cookies online.
HSPLIT
Source: West Berlin had its own postal administration, separate from West Germany's, which issued its own postage stamps until 1990.
Reference: West Berlin had its own postal administration. It was separate from West Germany's. West Berlin issued its own postage stamps until 1990.
CONTRACT
Source: Except for Supplier's obligations and liability resulting from Section 10.0, Supplier Liability for Third Party Claims, Supplier's liability for any and all claims will be limited to the amount of $1,000,000 USD per occurrence, with an aggregated limit of $4,500,000 USD during the term of this Agreement.
Reference: The following applies, not including the Supplier's obligations and liability resulting from Section 10.0, Supplier Liability for Third Party Claims. Supplier's liability for any and all claims will be limited to the amount of $1,000,000 USD per occurrence. Additionally, there is an aggregated limit of $4,500,000 USD during the term of this Agreement.

D Our Model
Figure 3: Overview of our proposed approach, where the model is trained on only the split-related edits in y determined by edit δ and sentence category l labels. We also illustrate the padding and the construction of δ for a training pair belonging to Changes Near Split category.

Source
To that end, the annual contribution from each Member State shall be established on the basis of the gross national product (GNP) according to the scale used for determining the GNP element in own resources for financing the general budget of the European Union.

Transformer w/ WIKI: To that end, the annual contribution from each Member State shall be established on the basis of the gross national product (GNP). According to the scale used for determining the GNP element in own resources for financing the general budget of the European Union.

Transformer w/ BISECT: To this end, the annual contribution of each Member State shall be established on the basis of the gross national product (GNP). the scale used to determine the GNP element in own resources for the financing of the general budget of the European Union.

Transformer control w/ BISECT: To that end, the annual contribution from each Member State shall be established on the basis of the gross national product. This depends on the scale used to determine the GNP element in own resources for financing the general budget of the European Union.

Source:
Member States shall take appropriate and practicable measures to reduce the risk of transmission of avian influenza from birds living in the wild to susceptible birds kept in zoos, taking into account the criteria and risk factors set out in Annex I.
Transformer w/ WIKI:
Member States shall take appropriate and practicable measures to reduce the risk of transmission of avian influenza from birds living in the wild to susceptible birds kept in zoos. Taking into account the criteria and risk factors set out in Annex I.
Transformer w/ BISECT:
Member States shall take appropriate and practical measures to reduce the risk of transmission of avian influenza from birds living in the wild to susceptible birds kept in zoos. These measures will take into account the criteria and risk factors set out in Annex I.
Transformer control w/ BISECT:
Member States shall take appropriate and practicable measures to reduce the risk of transmission of avian influenza from birds living in the wild to susceptible birds kept in zoos. They must take into account the criteria and risk factors set out in Annex I.

Source:
There were two main studies involving a total of 529 patients with SLE, in which Riquent was compared with a placebo (a dummy treatment).
Transformer w/ WIKI:
There were two main studies involving a total of 529 patients with SLE. In which Riquent was compared with a placebo (a dummy treatment).
Transformer w/ BISECT:
There were two main studies involving a total of 529 patients with SLE. Riquent was compared with a placebo (a dummy treatment).
Transformer control w/ BISECT:
There were two main studies involving a total of 529 patients with SLE. Riquent was compared with a placebo (a dummy treatment).

Conjunction Without Subject
Source:
Undertakings required to maintain separate accounts are undertakings that enjoy special or exclusive rights granted by a Member State as referred to in Article 86 (1) of the Treaty or are entrusted with the operation of a service of general economic interest as referred to in Article 86 (2) of the Treaty and receive State aid in any form whatsoever in relation to such service and that carry on other activities.
Transformer w/ WIKI:
Undertakings required to maintain separate accounts are undertakings that enjoy special or exclusive rights granted by a Member State as referred to in Article 86 (1) of the Treaty. They may also be entrusted with the operation of a service of general economic interest as referred by the Treaty and receive State aid in any form whatsoever in relation to such service and that carry on other activities.
Transformer w/ BISECT:
Undertakings required to maintain separate accounts are undertakings that enjoy special or exclusive rights granted by a Member State under Article 86 (1) of the Treaty or are entrusted with the operation of a service of general economic interest referred to in Article 86 Para. 2) and receive State aid in any form in relation to this service and carry out other activities.
Transformer control w/ BISECT:
Undertakings required to maintain separate accounts are undertakings that enjoy special or exclusive rights granted by a Member State as referred to in Article 86 (1) of the Treaty or are entrusted with the operation of a service of general economic interest. They must also receive state aid in any form whatsoever in relation to such service and that carry on other activities.

Preceding Relative Clause
Source:
Because 1'-hydroxymidazolam is an active metabolite, the sedative effect of midazolam may be increased.
Transformer w/ WIKI:
Because 1'-hydroxymidazolam is an active metabolite. The sedative effect of midazolam may be increased.
Transformer w/ BISECT:
1'-hydroxymidazolam is an active metabolite. The sedative effect of midazolam can be increased.
Transformer control w/ BISECT:
1'-hydroxymidazolam is an active metabolite. The sedative effect of midazolam may therefore be increased.
Table 10: Examples of system outputs from the BISECT test set. Unless labeled otherwise, the source sentences belong to the category "Changes Near Split". Blue marks the location of the required edits in the source sentence. Green indicates good edits and red indicates errors.

F Human Evaluation
Figure 4: Annotation interface and guidelines for human evaluation. Each system output is followed by a slider ranging from 0 to 100, labeled "Very Low Quality" on the left and "Perfect Quality" on the right. Highlighted words indicate words newly added relative to the source sentence. Hovering the mouse over the red ticks displays words removed from the source sentence. Every HIT contains a control text in which 4 to 8 words are replaced with random words; workers are expected to give low scores to the control text. Furthermore, the system outputs are shuffled within every HIT to eliminate position bias.
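The control text described above can be generated mechanically from any system output. The helper below is a hypothetical sketch of this attention-check construction (the function name `make_control`, the filler vocabulary, and the seeding are illustrative assumptions, not the paper's tooling): it replaces 4 to 8 randomly chosen words with random vocabulary words, yielding an obviously corrupted sentence that attentive workers should score low.

```python
import random

def make_control(sentence, vocab, n_min=4, n_max=8, seed=0):
    """Corrupt a sentence for use as an attention-check control text by
    swapping n randomly chosen words (n_min <= n <= n_max) for random
    vocabulary words."""
    rng = random.Random(seed)  # seeded so every HIT sees the same control
    tokens = sentence.split()
    n = rng.randint(n_min, min(n_max, len(tokens)))
    for i in rng.sample(range(len(tokens)), n):  # distinct positions
        tokens[i] = rng.choice(vocab)
    return " ".join(tokens)

vocab = ["banana", "quantum", "umbrella", "walrus"]
sentence = "There were two main studies involving a total of 529 patients with SLE ."
control = make_control(sentence, vocab)
```

Because the filler words never occur in the source sentence, the number of positions that differ from the original equals the sampled n, which makes it easy to verify the corruption level when preparing HITs.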