ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences

Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves performance comparable to two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.


Introduction
Figure 1: Example of a complex sentence (Orig) rewritten as two simple sentences (SS1, SS2). Underlined words in the source are preserved in the same order in the two outputs, the conjunction and (red font) is dropped, and the subject Sokuhi (blue font) is copied to the second simple sentence.
Orig: Sokuhi was born in Fujian and was ordained at 17.
SS1: Sokuhi was born in Fujian.
SS2: Sokuhi was ordained at 17.

Atomic clauses are fundamental text units for understanding complex sentences. The ability to decompose complex sentences facilitates research that aims to identify, rank or relate distinct predications, such as content selection in summarization (Fang et al., 2016; Peyrard and Eckle-Kohler, 2017), labeling argumentative discourse units in argument mining (Jo et al., 2019) or elementary discourse units in discourse analysis (Mann and Thompson, 1986; Burstein et al., 1998; Demir et al., 2010), or extracting atomic propositions for question answering (Pyatkin et al., 2020). In this work,
we propose a new task to decompose complex sentences into a covering set of simple sentences, with one simple output sentence per tensed clause in the source sentence. We focus on tensed clauses rather than other constituents because they are syntactically and semantically more prominent, thus more essential in downstream tasks like argument mining, summarization, and question answering. The complex sentence decomposition task we address has some overlap with related NLP algorithms, but each falls short in one or more respects. Elementary discourse unit (EDU) segmentation segments source sentences into a sequence of non-overlapping spans (Carlson et al., 2003;Wang et al., 2018). The output EDUs, however, are not always complete clauses. Text simplification rewrites complex sentences using simpler vocabulary and syntax (Zhang and Lapata, 2017). The output, however, does not preserve every tensed clause in the original sentence. The split-and-rephrase (SPRP) task aims to rewrite complex sentences into sets of shorter sentences, where an output sentence can be derived from non-clausal constituents in the source (Narayan et al., 2017). In contrast to the preceding methods, we convert each tensed clause in a source sentence, including each conjunct in a conjoined VP, into an independent simple sentence. Unlike EDU segmentation, a belief verb and its that-complement do not lead to two output units. Unlike text simplification, no propositions in the source are omitted from the output. Unlike SPRP, a phrase that lacks a tensed verb in the source cannot lead to a distinct sentence in the output. Figure 1 shows an example complex sentence (Orig) with conjoined verb phrases and its rewrite into two simple sentences (SSs). 
Observe that besides producing two sentences from one, thus breaking the adjacency between words, words inside the verb phrases (underlined in the figure) remain in the same linear order in the output; the single subject Sokuhi in the source is copied to the more distant verb phrase. Finally, the connective and is dropped. We find that most rewrites of complex sentences into simple sentences that preserve the one-to-one mapping of source tensed predication with target simple sentence involve similar operations. Building on these observations, we propose a neural model that learns to Accept, Break, Copy or Drop elements of a special-purpose sentence graph that represents word adjacency and grammatical dependencies, so the model can learn based on both kinds of graph proximity. We also introduce DeSSE (Decomposed Sentences from Students Essays), a new annotated dataset to support our task.
The rest of the paper presents two evaluation datasets, our full pipeline, and our ABCD model. Experimental results show that ABCD achieves performance comparable to or better than the baselines.

Related Work
Related work falls largely into parsing-based methods, neural models that rewrite, and neural segmenters. Gao et al. (2019) propose a decomposition parser (DCP) that extracts VP constituents and clauses from complex sentences as part of a summarization evaluation tool. Niklaus et al. (2019a) present a system (DisSim) based on parsing to extract simple sentences from complex ones. Jo et al. (2020) propose seven rules to extract complete propositions from parses of complex questions and imperatives for argumentation mining. Though performance of these methods depends on parser quality, they often achieve very good performance. We include two whose code is available (DCP, DisSim) among our baselines.
SPRP models are based on encoder-decoder architectures, and their output depends heavily on the training corpus. Aharoni and Goldberg (2018) present a copy-augmented network (Copy 512) based on Gu et al. (2016) that encourages the model to copy most words from the original sentence to the output. As it achieves improvement over an earlier encoder-decoder SPRP model (Narayan et al., 2017), we include Copy 512 among our baselines.
Finally, recent neural EDU segmenters (Wang et al., 2018) achieve state-of-the-art performance on a discourse relation corpus, RST-DT (Carlson et al., 2003). As they do not output complete sentences, we do not include any among our baselines.
Our ABCD model leverages the detailed information captured by parsing methods, and the powerful representation learning of neural models. As part of a larger pipeline that converts input sentences to graphs, ABCD learns to predict graph edits for a post processor to execute.

Datasets
Here we present DeSSE, a corpus we collected for our task, and MinWiki, a modification of an existing SPRP corpus (MinWikiSplit (Niklaus et al., 2019b)) to support our aims. We also give a brief description of differences in their distributions. Neural models are heavily biased by the distributions in their training data (Niven and Kao, 2019), and we show that DeSSE has a more even balance of linguistic phenomena.

DeSSE
DeSSE was collected in an undergraduate social science class, where students watched video clips about race relations and wrote essays in a blog environment to share their opinions with the class. It was created to support analysis of student writing, so that different kinds of feedback mechanisms can be developed regarding sentence organization. Students have difficulty with revision to address lack of clarity in their writing (Kuhn et al., 2016), such as non-specific uses of connectives, run-on sentences, repetitive statements and the like. These properties make DeSSE different from corpora of expert-written text, such as Wikipedia and newspapers. The annotation process is unique in that it involves identifying where to split a source complex sentence into distinct clauses, and how to rephrase each resulting segment as a semantically complete simple sentence, omitting any discourse connectives. It differs from corpora that identify discourse units within sentences, such as RST-DT (Carlson et al., 2003) and PDTB (Prasad et al., 2008), because clauses are explicitly rewritten as simple sentences. It differs from split-and-rephrase corpora such as MinWikiSplit because DeSSE focuses on rephrased simple sentences that have a one-to-one correspondence to tensed clauses in the original complex sentence. DeSSE is also used for connective prediction tasks, as in Gao et al. (2021).
We performed the annotation on Amazon Mechanical Turk (AMT). In a series of pilot tasks on AMT, we iteratively designed annotation instructions and an annotation interface, while monitoring quality. Figure 2 illustrates two steps in the annotation: identification of n split points between tensed clauses, and rephrasing the source into n + 1 simple clauses, where any connectives are dropped.
The instructions ask annotators to focus on tensed clauses occurring in conjoined or subordinate structures, relative clauses, parentheticals, and conjoined verb phrases, and to exclude gerundive phrases, infinitival clauses, and clausal arguments of verbs. The final version of the instructions describes the two annotation steps, provides a list of connectives, and illustrates a positive and negative example. The training and test sets contain 12K and 790 examples, respectively.

MinWiki
MinWikiSplit (Niklaus et al., 2019b) is built from WikiSplit, a text simplification dataset derived from Wikipedia revision histories (Narayan et al., 2017), modified to focus on minimal propositions that cannot be further decomposed. It was designed for simplifying complex sentences into multiple simple sentences, where the simple sentences can correspond to a very wide range of structures from the source sentences, such as prepositional or adjectival phrases. To best utilize this corpus for our purposes, we selected a subsample where the number of tensed verb phrases in the source sentences matches the number of rephrased propositions. The resulting MinWiki corpus has an 18K/1,075 train/test split.
Table 1 presents the prevalence of syntactic patterns characterizing complex sentences in the two datasets. Four are positive examples of one-to-one correspondence of tensed clauses in the source with simple sentences in the rephrasings: discourse connectives (Disc. Conn.), VP-conjunction, clauses introduced by wh-subordinating conjunctions (e.g., when, whether, how) combined with non-restrictive relative clauses (wh- & Rel. Cl.), and restrictive relative clauses (Restric. Rel. Cl.). The sixth column (negative examples) covers clausal arguments, which are often that-complements of verbs that express belief, speaking, attitude, emotion, and so on. MinWiki has few of the latter, presumably due to the genre difference between opinion essays (DeSSE) and Wikipedia (MinWiki).

Problem Formulation
We formulate the problem of converting complex sentences into covering sets of simple sentences as a graph segmentation problem. Each sentence is represented as a Word Relation Graph (WRG), a directed graph constructed from each input sentence with its dependency parse. Every word token and its positional index becomes a WRG vertex. For every pair of words, one or more edges are added as follows: a neighbor edge that indicates that the pair of words are linearly adjacent; a dependency edge that shows every pair of words connected by a dependency relation, adding critical grammatical relations, such as subject. Figure 3 shows an example sentence and a simplified version of its WRG (edge directions are not shown, for readability). Vertices are labeled with word-index pairs in red font, and edges are labeled as ngbh for neighboring words, or with the tags corresponding to their dependency relations, such as nsubj between Sokuhi-1 and ordained-13. An edge can have both types of relation, e.g. neighbor and dependency for was-12 and ordained-13. The graph is stored as an Edge Triple Set, a set of triples with (source node, target node, label) representing each pair of words connected by an edge, as shown in Figure 3, bottom left.
Given a sentence and its WRG, our goal is to decompose the graph into n connected components (CC) where each CC is later rewritten as an output simple sentence. To perform the graph decomposition, decisions are made on every edge triple. We define four edit types:
• Accept: retain the triple in the output
• Break: break the edge between a pair of words
• Copy: copy a target word into a CC
• Drop: delete the word from the output CCs
A training example consists of an input sentence, and one or more output sentences. If the input sentence is complex, the ground truth output consists of multiple simple sentences. The next section presents the ABCD pipeline.
Two initial modules construct the WRG graphs for each input sentence, and the ABCD labels for the Edge Triple Sets based on the ground truth output. A neural model learns to assign ABCD labels to input WRG graphs, and a final graph segmenter generates simple sentences from the labeled WRG graphs. Details about the neural model are in the subsequent section.

System Overview
The full processing pipeline consists of five major components, as shown in Figure 4. Three preprocessing modules handle the WRG graph construction, conversion of graph triples to vectors, and creation of distant supervision labels for the graph. The fourth component is the ABCD neural model that learns to label a WRG graph, which is described in section 6. The last part of the pipeline is a post-processing module to segment WRG graphs based on the labels learned by the ABCD model, and to map each graph segment to a simple sentence.
Graph Constructor The first module in the system is a Graph Constructor that converts an input sentence and its dependency parse into a collection of vertices and edges. It is used during training and inference. It first extracts words and their indices from the input sentences of the training examples for the vertices of each WRG graph. A directed edge and ngbh label is assigned to all pairs of adjacent words. A directed edge and label is also assigned to every governing and dependent word pair (cf. Figure 3).
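As a rough illustration of the Graph Constructor, the following Python sketch builds an Edge Triple Set from a tokenized sentence and a list of dependency arcs. The function name build_wrg, the (governor, dependent, relation) input format, the governor-to-dependent edge direction, and the hand-written parse of the Figure 1 sentence are all assumptions made for this example; they are not specified by the description above.

```python
from typing import List, Set, Tuple

Vertex = Tuple[str, int]             # (word, 1-based position)
Triple = Tuple[Vertex, Vertex, str]  # (source, target, label)


def build_wrg(tokens: List[str], deps: List[Tuple[int, int, str]]) -> Set[Triple]:
    """Build an Edge Triple Set from tokens and dependency arcs.

    deps holds (governor_index, dependent_index, relation) with 1-based
    indices; this input format is an assumption of the sketch.
    """
    vertices = [(word, i) for i, word in enumerate(tokens, start=1)]
    triples: Set[Triple] = set()
    # Neighbor edges: one directed edge per pair of linearly adjacent words.
    for left, right in zip(vertices, vertices[1:]):
        triples.add((left, right, "ngbh"))
    # Dependency edges: one directed edge per governor/dependent pair.
    for gov, dep, rel in deps:
        triples.add((vertices[gov - 1], vertices[dep - 1], rel))
    return triples


tokens = "Sokuhi was born in Fujian and was ordained at 17".split()
# Hand-written, simplified parse (hypothetical, for illustration only).
deps = [
    (3, 1, "nsubj"), (3, 2, "aux"), (3, 5, "obl"), (5, 4, "case"),
    (3, 8, "conj"), (8, 6, "cc"), (8, 7, "aux"), (8, 10, "obl"),
    (10, 9, "case"), (8, 1, "nsubj"),
]
triples = build_wrg(tokens, deps)
```

Storing the label inside each triple lets one pair of words carry both a neighbor edge and a dependency edge, as the text describes.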
Edge Triples DB The Edge Triples DB, which is used during training and inference, creates vector representations for the input Edge Triples Sets for each training instance, using latent representations learned by an encoder component of the ABCD model. Using the word indices, a function maps the source and target words from every triple into its hidden representation learned by the encoder, and the triple's edge label is converted into a one-hot encoding with dimension d. For an edge triples set with m triples, the source and target word hidden states are each stacked into an m × h matrix, and the one-hot vectors for edge labels are stacked into an m × d matrix. These three source, target, edge matrices that represent an Edge Triple Set are then fed into an attention layer, as discussed in section 6.

Distant Supervision Label Creator
The expected supervision for our task is the choice of edit type for each triple, where the ground truth consists of pairs of an input sentence, and one or more output simple sentences. We use distant supervision where we automatically create edit labels for each triple based on the alignment between the original input sentence and the set of output simple sentences. In the Distant Supervision Label Creator module, for every triple, we check the following conditions: if the edge is a "neighbor" relation, and both source and target words are in the same output simple sentence, we mark this pair with edit type A; if the source and target words of a triple occur in different output simple sentences, the corresponding edit is B; if the source and target are in the same output simple sentence, and the only edge is a dependency label (meaning that they are not adjacent in the original sentence), we mark this pair as C; finally, if a word is not in any output simple sentence, we mark the corresponding type as D.
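The four conditions can be sketched as a small Python function. This is only an illustration under stated assumptions: the word-to-sentence alignment is taken as given in a hypothetical membership map (word index to the set of output sentences containing that word), D is assigned only when the target word is missing, and a triple whose source word is missing is treated as B. None of these choices is prescribed by the description above.

```python
from typing import Dict, Set, Tuple

Vertex = Tuple[str, int]             # (word, 1-based position)
Triple = Tuple[Vertex, Vertex, str]  # (source, target, label)


def label_triple(triple: Triple, membership: Dict[int, Set[int]]) -> str:
    """Assign a distant-supervision edit label (A, B, C or D) to one triple."""
    (_, i), (_, j), _ = triple
    src_sents = membership.get(i, set())
    tgt_sents = membership.get(j, set())
    if not tgt_sents:
        return "D"  # target word appears in no output sentence
    if not (src_sents & tgt_sents):
        return "B"  # the two words never share an output sentence
    # Same output sentence: adjacent words are accepted, distant ones copied.
    return "A" if abs(i - j) == 1 else "C"


# Figure 1 example: Sokuhi (1) is in both outputs, "and" (6) is in neither.
membership = {1: {0, 1}, 2: {0}, 3: {0}, 4: {0}, 5: {0},
              7: {1}, 8: {1}, 9: {1}, 10: {1}}
label = label_triple((("ordained", 8), ("Sokuhi", 1), "nsubj"), membership)
```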
Graph Segmenter This module segments the graph into connected components using predicted edits, and generates the output sentences, as part of the inference pipeline. There are four stages consisting of: graph segmentation, traversal, subject copying, and output rearranging. In the graph segmentation stage, the module first performs actions on every triple per the predicted edit: if the edit is A, no action is taken; if the edit is B, the edge between the pair of words is dropped; given C, the edge is dropped, and the edge triple is stored in a temporary list for later retrieval; if the edit is D, the target word is dropped from the output graphs. After carrying out the predicted edits, we run a graph traversal algorithm on modified edge triples to find all CCs, using a modified version of the Depth-First-Search algorithm with linear time proposed in (Tarjan, 1972;Nuutila and Soisalon-Soininen, 1994). For each CC, the vertices are kept and the edges are dropped. Then we enter the subject copying stage: for each source, target pair in the temporary list mentioned earlier, we copy the word to the CC containing the target. Finally for every CC, we arrange all words in their linear order by indices, and output a simple sentence.
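The four stages can be sketched in Python as follows. This is a minimal sketch, not the module itself: it uses a plain iterative depth-first search rather than the cited linear-time variant, represents each triple as a hypothetical (source index, target index, edit) tuple, and assumes that a Copy edit adds the target word to the component containing the source word.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

LabeledEdge = Tuple[int, int, str]  # (source index, target index, edit), 1-based


def segment(tokens: List[str], edits: List[LabeledEdge]) -> List[str]:
    """Turn per-edge edit labels (A/B/C/D) into simple sentences."""
    dropped = {tgt for _, tgt, e in edits if e == "D"}
    copies = [(src, tgt) for src, tgt, e in edits if e == "C"]
    alive = [i for i in range(1, len(tokens) + 1) if i not in dropped]

    # Segmentation: keep only accepted edges between surviving words.
    adjacency: Dict[int, Set[int]] = defaultdict(set)
    for src, tgt, e in edits:
        if e == "A" and src not in dropped and tgt not in dropped:
            adjacency[src].add(tgt)
            adjacency[tgt].add(src)

    # Traversal: plain depth-first search for connected components.
    component_of: Dict[int, int] = {}
    components: List[Set[int]] = []
    for start in alive:
        if start in component_of:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in component_of:
                continue
            component_of[node] = len(components)
            members.add(node)
            stack.extend(adjacency[node] - members)
        components.append(members)

    # Copying (assumed direction: the target joins the source's component).
    for src, tgt in copies:
        if src in component_of and tgt not in dropped:
            components[component_of[src]].add(tgt)

    # Rearranging: restore linear order inside each component.
    return [" ".join(tokens[i - 1] for i in sorted(c)) for c in components]


tokens = "Sokuhi was born in Fujian and was ordained at 17".split()
edits = [(1, 2, "A"), (2, 3, "A"), (3, 4, "A"), (4, 5, "A"), (5, 6, "D"),
         (6, 7, "B"), (7, 8, "A"), (8, 9, "A"), (9, 10, "A"),
         (3, 8, "B"), (8, 6, "D"), (3, 1, "C"), (8, 1, "C")]
sentences = segment(tokens, edits)
```

On this hand-labeled input the sketch reproduces the two simple sentences of Figure 1.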

Neural Model
The ABCD model consists of three neural modules depicted in Figure 5: a sentence encoder to learn a hidden representation for the input sentence, a self-attention layer to generate attention scores on every edge label, and a classifier that generates a predicted distribution over the four edit types, based on the word's hidden representation, the edge label representation, and the attention scores.

Sentence Representation
The sentence representation module has two components: a word embedding look-up layer based on GloVe (Pennington et al., 2014), and a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) (see Figure 5). Given an input sentence of length $l$ and hidden state dimension $M$, the output from this module is an $l \times M$ matrix. For a word with index $i$ in the input sentence, we generate its hidden representation $h_i$ such that it combines the hidden states from the forward and backward LSTMs, with $h_i \in \mathbb{R}^{M}$. A positional encoding function is added to the word embeddings. We found this particularly helpful in our task, presumably because the same word type at different positions might have different relations with other words, captured by distinct learned representations. Our experiments compare biLSTM training from scratch to use of BERT (Devlin et al., 2019), to see if pre-trained representations are helpful.
To utilize the learned word representations in the context of the relational information captured in the WRG graph, we send the sentence representation to the Edge Triples DB and extract representations $h_i$ and $h_j$ for the source and target words, based on indices $i$ and $j$. A one-hot vector with dimensionality $N$ encodes relations between pairs of source and target words; each edge triple is thus converted into three vectors: $h_{src}$, $h_{tgt}$, $d_{rel}$. We take the position-wise summation over all one-hot vectors if there is more than one label on an edge.
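The relation encoding can be illustrated with a short Python sketch. The label inventory LABELS and the function names are hypothetical; a real inventory would contain ngbh plus the full dependency tag set of the parser in use.

```python
from typing import Dict, List

# Hypothetical label inventory (N = 4) used only for this illustration.
LABELS: Dict[str, int] = {"ngbh": 0, "nsubj": 1, "aux": 2, "cc": 3}


def encode_relations(labels: List[str], index: Dict[str, int] = LABELS) -> List[int]:
    """Sum the one-hot vectors of every label carried by one edge."""
    vec = [0] * len(index)
    for label in labels:
        vec[index[label]] += 1
    return vec


def stack_relations(edge_labels: List[List[str]]) -> List[List[int]]:
    """Stack the per-edge vectors into an m x N relation matrix."""
    return [encode_relations(labels) for labels in edge_labels]


# Three edges; the first carries both a neighbor and a dependency label.
D_rel = stack_relations([["ngbh", "aux"], ["nsubj"], ["cc"]])
```

An edge with two labels therefore yields a vector with two non-zero positions instead of one.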

Edge Self-Attention
Attention has been useful for many NLP tasks. In our model, we adapt the multi-head self-attention mechanism (Vaswani et al., 2017) to learn importance weights on types of edit operations, as shown in the middle green block in Figure 5. Given $m$ edge triples, we first stack all source vectors $h_{src}$ into a matrix $H_{src}$, and operate the same way on $h_{tgt}$ and $d_{rel}$ to obtain $H_{tgt}$ and $D_{rel}$, such that $H_{src}, H_{tgt} \in \mathbb{R}^{m \times M}$ and $D_{rel} \in \mathbb{R}^{m \times N}$. These three matrices are the input to self-attention. For every head of the multi-head attention, we first obtain a feature representation with the three parameters $V$, $K$, $Q$ mapping to sources, targets and relations, respectively, then compute a coefficient $e \in \mathbb{R}^{m \times 1}$ with a learnable parameter $W_e$. Then we compute the attention scores of the head by taking a softmax over $e$:
$$\mathrm{head} = \mathrm{softmax}(e)$$
Finally, we concatenate all head attentions together, and pass them through a linear layer to learn the relations between heads, and generate the final attention scores:
$$\alpha = W\,\mathrm{concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots) \quad (3)$$
where $\alpha \in \mathbb{R}^{m \times 1}$. The attention scores are sent to the next module to help the classifier make its decision.

Edit Classification
The last component of our neural model is a classifier, as shown at the right of Figure 5. To aggregate the feature representation from the previous layer, we first concatenate the three matrices $H_{src}$, $H_{tgt}$, $D_{rel}$ into one representation, and multiply it by the attention scores $\alpha$ to obtain $H$. An MLP layer then takes $H$ as its input and generates the output distribution over the four edit types for each edge triple, $Out_M \in \mathbb{R}^{m \times 4}$. As an alternative to the MLP, we also investigated a bilinear classifier, which has proved efficient in capturing fine-grained differences in features for classification tasks (Dozat and Manning, 2017). The bilinear layer first takes $H_{src}$ and $H_{tgt}$ as input and generates transposed bilinear features, where $W_A$ and $b$ are learnable parameters. Then we sum the bilinear features with the MLP decisions and apply a softmax to the result to get the final distribution over the four edit labels, $Out_B \in \mathbb{R}^{m \times 4}$. We use cross entropy loss between predicted edit types and gold edit types created from distant supervision (see above).

Training
The class balance for our task is highly skewed: the frequency of class A is much higher than that of the other three classes, as shown in the top portion of Table 2. To mitigate the impact on training, we adopt the inverse class weighting for cross entropy loss introduced in Huang et al. (2016). With this weighting, loss is weighted heavily towards rare classes, which forces the model to learn more about the rare cases. Table 2 shows the weights for the four edit labels on both datasets. On MinWiki, A occurs the most and has the lowest weight, 0.0167, in sharp contrast to B, C and D. On DeSSE, both A and D occur frequently, while B and C have lower frequency and higher weights, 0.6266 and 0.2658 respectively. DeSSE has fewer B, and more C and D than MinWiki. From this perspective, MinWiki is "simpler" than DeSSE because fewer edits are needed to rewrite the sentences. This might be due to the different distributions of linguistic phenomena in the two datasets (see Table 1). In the next section, we show that ABCD shows stronger improvements on complicated edits. Training details are in the appendix.
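To make the weighting idea concrete, here is a small Python sketch. The exact formula behind Table 2 is not given in the text, so the normalization used here (weights proportional to inverse class frequency and summing to one) is an assumption, as are the function names and the toy label counts.

```python
import math
from collections import Counter
from typing import Dict, List


def inverse_class_weights(labels: List[str]) -> Dict[str, float]:
    """Weights proportional to 1 / class count, normalized to sum to 1."""
    counts = Counter(labels)
    inverse = {c: 1.0 / n for c, n in counts.items()}
    total = sum(inverse.values())
    return {c: v / total for c, v in inverse.items()}


def weighted_cross_entropy(probs: Dict[str, float], gold: str,
                           weights: Dict[str, float]) -> float:
    """Cross entropy for one edge triple, scaled by the gold class weight."""
    return -weights[gold] * math.log(probs[gold])


# Toy label distribution (hypothetical counts, skewed towards A).
toy_labels = ["A"] * 90 + ["B"] * 2 + ["C"] * 5 + ["D"] * 3
weights = inverse_class_weights(toy_labels)
```

With these toy counts the frequent class A receives the smallest weight and the rare class B the largest, so an error on a B triple costs more than an equally confident error on an A triple.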

Experiments
We carry out two intrinsic evaluations of ABCD performance on MinWiki and DeSSE. Section 7.1 presents an intrinsic evaluation of ABCD variants on edit prediction, with error analysis and ablation studies. Section 7.2 compares the best ABCD model with several baselines on the quality of output propositions. We discuss evaluation metrics in section 7.3. Results show that ABCD models perform consistently well relative to the baseline models on both datasets.

Intrinsic Evaluation on Edit Prediction
We report F1 scores on all four edit types from ABCD and its model variants. We compare the two classifiers described in the previous sections and investigate the difference between using biLSTM and BERT with fine-tuning, to see if pre-trained knowledge is useful for the task. Table 3 presents results on MinWiki and DeSSE from the four model settings. All models perform better on MinWiki than DeSSE, and biLSTM+bilinear shows the best performance on both, with F1 scores of 0.82 and 0.67 on MinWiki and DeSSE respectively. Presumably this reflects the greater linguistic diversity of DeSSE shown in Table 1. The lower performance of the BERT variants suggests that pre-trained knowledge is not helpful for this task. Among the four edit types, all models have high F1 scores on A across datasets, and high F1 on C for MinWiki but not for DeSSE. B and D show lower scores, and all four models report lower F1 on B than D on both datasets.
To examine the significant drop on B and D from MinWiki to DeSSE, Table 4 presents error analysis on pairs of gold labels and predictions for B and D, using predictions from biLSTM+mlp. The model does poorly on B in both datasets: on MinWiki, 36.1% of predictions are A, and on DeSSE, 27.42% are A and 15.18% are C. The model has high agreement on D from MinWiki, but predicts 42.63% A on DeSSE. We suspect that improved feature representation could raise performance; that is, pairs of words and their relations might be a weak supervision signal for B and D.
We conducted an ablation study on the inverse class weights mentioned in section 6 on MinWiki. After removing the weights, the model fails to learn other classes and only predicts A due to the highly imbalanced label distributions, which demonstrates the benefit of weighting the loss function. We also ablate positional encoding which leads to F1 scores of 0.90 for A, 0.51 for C, and 0 for both B and D, indicating the importance of positional encoding.

Intrinsic Evaluation of Output Sentences
For baselines, we use Copy 512 and DisSim, which both report performance on Wikisplit in previous work. We also include DCP, which relies on three rules applied to token-aligned dependency and constituency parses: DCP vp extracts clauses with tensed verb phrases; DCP sbar extracts SBAR subtrees from constituency trees; DCP recur recursively applies the preceding rules.
For evaluation, we use BLEU with four-grams (BL4) (Papineni et al., 2002) and BERTScore (BS) (Zhang et al., 2019). We also include descriptive measures specific to our task. To indicate whether a model retains roughly the same number of words as the source sentence in the target output, we report the average number of tokens per simple sentence (#T/SS). To capture the correspondence between the number of target simple sentences in the ground truth and model predictions, we use the percentage of samples where the model predicts the correct number of simple sentences (Match #SS). BL4 captures the 4-gram alignments between candidate and reference word strings, but fails to assess similarity of latent meaning. BS applies token-level matching through contextualized word embeddings, and therefore evaluates candidates on their word meanings. For each example, we first align each simple sentence in the ground truth with a prediction, compute the pairwise BL4 and BS scores, and take the average as the score for the example. A predicted output sentence with no correspondent in the ground truth, or a ground truth sentence with no correspondent in the predicted output, will add 0 to the numerator and 1 to the denominator of this average.
Table 5 presents results from the baselines and our best ABCD variant, biLSTM with two classifiers. None of the models surpasses all others on both datasets. All models show lower performance on DeSSE than MinWiki, again an indication that DeSSE is more challenging. On MinWiki, ABCD is competitive with Copy 512, the best performing model, with a narrow gap on Match #SS (0.65%) and BLEU4 (4.58). On DeSSE, ABCD BL4 and BS surpass all baselines. ABCD performance is 2.34% less than DCP recur on Match #SS, but biLSTM+mlp output sentences have an average length of 8.85, which is closer to the gold average length of 9.07, in contrast to much longer output from DCP recur of 14.16. To summarize, ABCD achieves competitive results on both datasets.
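The per-example averaging and Match #SS can be sketched in Python as below. Several pieces are assumptions made for illustration: the alignment procedure is not specified in the text, so a greedy best-match alignment is used; token_overlap is a simple stand-in similarity, not BLEU or BERTScore; and the function names are hypothetical.

```python
from typing import Callable, List


def token_overlap(a: str, b: str) -> float:
    """Stand-in similarity (Jaccard over lowercased tokens), not BLEU or BERTScore."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0


def example_score(gold: List[str], pred: List[str],
                  sim: Callable[[str, str], float] = token_overlap) -> float:
    """Average pairwise score; every unmatched sentence adds 0 over 1."""
    remaining = list(pred)
    total = 0.0
    for g in gold:
        if not remaining:
            break  # leftover gold sentences contribute 0 to the numerator
        best = max(remaining, key=lambda p: sim(g, p))
        total += sim(g, best)
        remaining.remove(best)
    # One denominator slot per matched pair plus one per unmatched sentence.
    denom = max(len(gold), len(pred))
    return total / denom if denom else 0.0


def match_ss(gold_sets: List[List[str]], pred_sets: List[List[str]]) -> float:
    """Percentage of examples with the correct number of simple sentences."""
    hits = sum(len(g) == len(p) for g, p in zip(gold_sets, pred_sets))
    return 100.0 * hits / len(gold_sets)


gold = ["He did not do anything wrong.", "He was targeted.",
        "His family was murdered."]
full = example_score(gold, gold)         # all three sentences matched
partial = example_score(gold, gold[:2])  # one gold sentence left unmatched
```

Under this scheme, a prediction that omits one of three gold sentences is capped at two thirds even when the two sentences it does produce are perfect.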

Error Analysis
While Table 4 presents error analysis on predictions of B that lead to an incorrect number of outputs, here we examine test sentences from both datasets where the prediction and ground truth have the same number of outputs (Table 6). Figure 6 shows three complex sentences from DeSSE with the annotated rewriting, and predicted propositions from Copy 512 and ABCD mlp. Copy 512 correctly decomposes only one of the examples and copies the original input on the other two samples. On the one example where Copy produces two simple sentences, it alters the sentence meaning by replacing the word "genetics" with the word "interesting". This exposes a drawback of encoder-decoder architectures on the proposition identification task: the decoder can introduce words that are not in the input sentence, and therefore fail to preserve the original meaning. In contrast, ABCD shows good performance on all three sentences by producing the same number of simple sentences as in the annotated rewriting. In the third sentence, which contains an embedded clause, "which has been the main mission since 9/11", the first proposition written by the annotator is not grammatically correct, and the subject of the second proposition is a pronoun it, referring to the semantic subject Our main mission. Nonetheless, ABCD generates two propositions, both of which are grammatically correct and meaning preserving.

Discussion
Table 5 caption (fragment): [...] #T/SS = 9.07). We report the number of tokens per proposition (#T/SS), the percentage of input sentences whose predictions match the ground truth in number of outputs (Match #SS%), BLEU with four-grams, and BERTScore.

Figure 6: Three complex sentences from DeSSE with the annotated rewriting (Human) and the predicted propositions from Copy 512 (Copy) and ABCD mlp (ABCD).
Orig: He did not do anything wrong, yet he was targeted and his family was murdered.
Human: He did not do anything wrong. || He was targeted. || His family was murdered.
Copy: He did not do anything wrong, he was targeted and his family was murdered.
ABCD: He did not do anything wrong. || he was targeted. || his family was murdered.
Orig: I guess I always knew it was genetics but I didnt know why our features are the way that they are.
Human: I guess I always knew it was genetics. || I didnt know why our features are the way that they are.
Copy: I guess I always knew it was interesting. || I didnt know why our features are the way that they are.
ABCD: I guess I always knew it was genetics. || I didnt know why our features are the way that they are.
Orig: Our main mission, which has been the main mission since 9/11 is to eliminate terrorism wherever it may exist.
Human: Our main mission, which has been the main mission since 9/11. || It is to eliminate terrorism wherever it may exist.
Copy: same as Orig
ABCD: Our main mission has been the main mission. || mission is to eliminate terrorism wherever it may exist.

In this section, we discuss limitations of ABCD to guide future work. The first limitation is the low performance of ABCD on B. We observe that in DeSSE, some annotators did not break the sentences appropriately. We randomly selected 50 samples, and found 13 out of 50 (26%) where annotators add breaks to rewrite NPs and infinitives as clauses. This introduces noise into the data. Another reason for the lower performance on B might be the current design of ABCD, which neglects sequential relations among all words. Among all edge triples where it fails to assign B, 67% and 27.42% are with ngbh relations on MinWiki and DeSSE, respectively. Two possibilities to investigate for improving performance are enhancements to the information in the WRG graph, and re-formulating the problem as sequence labeling of triples.
The second limitation pertains mainly to DeSSE. In the training data, 34.7% of sentences have OOV words. For example, we noticed that annotators sometimes introduced personal pronouns (e.g., he/she/they) in their rewrites of VP-conjunction, instead of copying the subjects, or they substituted a demonstrative pronoun (e.g., this/these) for clausal arguments. This could be addressed by expanding the edit types to include the ability to INSERT words from a restricted insertion vocabulary. Nevertheless, our model has a small performance gap with Copy 512 on MinWiki, and outperforms the baselines on DeSSE.
A third issue is whether ABCD would generalize to other languages. We expect ABCD would perform well on European languages with existing dependency and constituency parsers, and with an annotated dataset.

Conclusion
We presented a new task to decompose complex sentences into simple ones, along with DeSSE, a new dataset designed for this task. We proposed the neural ABCD model to predict four edit operations on sentence graphs, as part of a larger pipeline from our graph-edit problem formulation. ABCD performance comes close to or outperforms the parsing-based and encoder-decoder baselines. Our work selectively integrates modules to capitalize on the linguistic precision of parsing-based methods and the expressiveness of graphs for encoding different aspects of linguistic structure, while still exploiting the power of neural networks for representation learning.

A Annotation Instruction in DeSSE
Here we present the instructions for annotators, as shown in Figure 7. The instructions illustrate the two phases of annotation. The annotator first chooses whether to add one or more split points to an input sentence, where the word after a split point is the first word of a new segment. Once an annotator has identified the split points (first page of the AMT interface, shown in Figure 8), a second page of the interface appears. Figure 9 shows this second view, in which annotators rewrite the segments. Every span of words defined by split points (or the original sentence if there are no split points) appears in its own text entry box for the annotator to rewrite. Annotators cannot submit if they remove all the words from a text entry box. They are instructed to rewrite each text span as a complete sentence, and to leave out the discourse connectives.
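The split-point convention described above can be sketched as follows. This is an illustration only: we assume a split point is stored as the index of the first word of the new segment, and the function name `segment` is hypothetical, not part of the annotation interface.

```python
def segment(words, split_points):
    """Cut a word list into segments.

    Each split point is taken to be the index of the first word of a new
    segment (an assumed encoding). Indices at or outside the sentence
    boundaries are ignored, so no empty segment is produced.
    """
    inner = sorted({p for p in split_points if 0 < p < len(words)})
    bounds = [0] + inner + [len(words)]
    return [words[a:b] for a, b in zip(bounds, bounds[1:])]
```

With no split points the function returns the whole sentence as a single segment, matching the case where the original sentence appears alone in one text entry box.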
Several kinds of automatic checks and warnings are applied in the interface to ensure quality. If a rewrite contains a discourse connective, a warning box pops up asking whether the annotator should drop the discourse connective before submitting. A warning box also appears if annotators use vocabulary outside the original sentence. To prevent annotators from failing to rewrite, we monitored the output, checking for cases where they submitted the text spans with no rewriting. Annotators were prevented from submitting if the interface detected an empty rewrite box or if the total length of the rewrites was too short compared to the source sentence. We warned annotators by email that if they failed to produce complete sentences in the rewrite boxes, they would be blocked. Some annotators were blocked, but most responded positively to the warnings.
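The interface checks described above might be approximated as in the sketch below. The connective list, the length threshold, and all names are hypothetical placeholders, since the actual values used in the AMT interface are not given here; tokenization is a simple lowercased whitespace split.

```python
import string

# Hypothetical values: neither the connective list nor the length
# threshold used in the real interface is specified in the text.
CONNECTIVES = {"and", "but", "because", "although", "so", "yet"}
MIN_LENGTH_RATIO = 0.5


def tokens(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in stripped if w]


def check_rewrites(source, rewrites):
    """Return (errors, warnings); errors block submission, warnings do not."""
    errors, warnings = [], []
    src = tokens(source)
    src_vocab = set(src)
    total = 0
    for i, rewrite in enumerate(rewrites, start=1):
        toks = tokens(rewrite)
        total += len(toks)
        if not toks:
            errors.append(f"box {i}: empty rewrite")
            continue
        used = sorted(set(toks) & CONNECTIVES)
        if used:
            warnings.append(f"box {i}: discourse connective {used}")
        new = sorted(set(toks) - src_vocab)
        if new:
            warnings.append(f"box {i}: words not in source {new}")
    if src and total < MIN_LENGTH_RATIO * len(src):
        errors.append("rewrites are too short relative to the source")
    return errors, warnings
```

Separating blocking errors from non-blocking warnings mirrors the description: empty boxes and overly short rewrites stop submission, while connectives and new vocabulary only prompt the annotator to reconsider.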

B Quality control in DeSSE
To test the clarity of the instructions and interface, the initial 500 sentences were used to evaluate task quality, each labeled by three turkers (73 turkers overall), using three measures of consistency, all in [0,1]. Average pairwise boundary similarity (Fournier, 2013), a very conservative measure of whether annotators produce the same number of segments with boundaries at nearly the same locations, was 0.55. Percent agreement on the number of output substrings was 0.80. On annotations with the same number of segments, we measured the average Jaccard score (ratio of set intersection to set union) of words in segments from different annotators, which was 0.88, and of words from rephrasings, which was 0.73. With the other metrics close to 1, and boundary similarity above 0.5, we concluded quality was already high. During the actual data collection, quality was higher because we monitored quality on a daily basis and communicated with turkers who had questions.
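Two of the consistency measures above are simple enough to sketch. The code below assumes that agreement on the number of segments is averaged over annotator pairs and that segments from two annotators are compared in order; both are our assumptions, as is every function name. Boundary similarity (Fournier, 2013) is not reimplemented here.

```python
from itertools import combinations


def jaccard(a, b):
    """Ratio of set intersection to set union of two word collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def count_agreement(annotations):
    """Fraction of annotator pairs that produced the same number of segments.

    `annotations` holds one segmentation per annotator for a single
    sentence; each segmentation is a list of segments (lists of words).
    """
    pairs = list(combinations(annotations, 2))
    if not pairs:
        return 1.0
    return sum(len(x) == len(y) for x, y in pairs) / len(pairs)


def segment_jaccard(x, y):
    """Mean Jaccard score over position-aligned segments of two annotators."""
    if not x or len(x) != len(y):
        raise ValueError("segmentations must be non-empty and of equal length")
    return sum(jaccard(s, t) for s, t in zip(x, y)) / len(x)
```

Restricting `segment_jaccard` to segmentations of equal length follows the description above, where the Jaccard score is measured only on annotations with the same number of segments.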

C Experiment Settings
We trained our model on a Linux machine with four Nvidia RTX 2080 Ti GPUs. We conducted a grid search over the hyper-parameters, with learning rate in the range [1e-2, 1e-5] (step size 0.0005), weight decay in [0.90, 0.99], and hidden size in [200, 800] (step size 200). The final configuration uses the Adam optimizer with learning rate 1e-4, weight decay 0.99, embedding dropout 0.2, and a maximum of 100 epochs with early stopping. We use 100-dimensional GloVe vectors and a hidden size of 800. We set the number of self-attention heads to 4, corresponding to the four edit types. With batch size 64, it takes about 6 hours to train MinWiki and