The Fewer Splits are Better: Deconstructing Readability in Sentence Splitting

In this work, we focus on sentence splitting, a subfield of text simplification, motivated largely by an unproven idea that if you divide a sentence into pieces, it should become easier to understand. Our primary goal in this paper is to determine whether this is true. In particular, we ask: does it matter whether we break a sentence into two or into three? We report on our findings based on Amazon Mechanical Turk. More specifically, we introduce a Bayesian modeling framework to investigate to what degree a particular way of splitting a complex sentence affects readability, together with a number of other parameters adopted from diverse perspectives, including clinical and cognitive linguistics. The Bayesian modeling experiment provides clear evidence that bisecting a sentence enhances readability to a greater degree than simplifying it by trisection.


Introduction
In text simplification, one question people often fail to ask is whether the technology they are developing truly helps people better understand texts. This curious indifference may reflect a tacit recognition of the partiality of the datasets covered by past studies (Xu et al., 2015), or some murkiness that surrounds the goal of text simplification.
As a way to address the situation, we examine the role of simplification in text readability, with a particular focus on sentence splitting. The goal of sentence splitting is to break a sentence into small pieces in a way that collectively preserves the original meaning. A primary question we ask in this paper is: does splitting a text affect its readability? Given the large effort spent in the past on sentence splitting, it comes as a surprise that none of the studies put this question directly to people; in most cases, they ended up asking whether generated texts 'looked simpler' than the original unmodified versions (Zhang and Lapata, 2017), which of course does not say much about their readability. We are not even sure whether there was any agreement among people on what constituted simplification.
Another related question is: how many pieces should we break a sentence into? Two, three, or more? In this paper, we focus on a particular setting where we ask whether there is any difference in readability between two- and three-sentence splits. We also report on how good or bad the sentence splits generated by a fine-tuned language model are, compared to those produced by humans.
A general strategy we follow in the paper is to elicit judgments from people on whether simplification made a text more readable for them (Section 4), and to perform a Bayesian analysis of their responses to identify factors that may have influenced their decisions (Section 5).

Related Work
Historically, there have been extensive efforts in ESL (English as a Second Language) to explore the use of simplification as a way to improve the reading performance of L2 (second language) students. Crossley et al. (2014) presented an array of evidence showing that simplifying a text did lead to improved text comprehension by L2 learners, as measured by reading time and the accuracy of their responses to associated questions. They also noticed that simple texts had less lexical diversity, greater word overlap, and greater semantic similarity among sentences than more complicated texts. Crossley et al. (2011) argued for the importance of cohesiveness as a factor influencing readability. Meanwhile, elaborative modification of a text, which involves adding information to make the language less ambiguous and rhetorically more explicit, was found to play a role in enhancing readability. Ross et al. (1991) reported that, despite making a text longer, elaborative manipulation produced positive results, with L2 students scoring higher on comprehension questions for modified texts than for the original unmodified versions.
While there have been concerted efforts in the NLP community to develop metrics and corpora purported to serve studies in simplification (Zhang and Lapata, 2017; Sulem et al., 2018a; Narayan et al., 2017; Botha et al., 2018; Niklaus et al., 2019; Kim et al., 2021; Xu et al., 2015), they fell far short of addressing how the work contributes to improving text comprehensibility for readers. Part of our goal is to break away from a prevailing view that relegates readability to the sidelines.

Method
The data come from two sources, the Split and Rephrase Benchmark (v1.0) (SRB, henceforth) (Narayan et al., 2017) and WikiSplit (Botha et al., 2018). SRB consists of complex sentences aligned with sets of multi-sentence simplifications varying in size from two to four. WikiSplit follows a similar format, except that each complex sentence is accompanied only by a two-sentence simplification. We asked Amazon Mechanical Turk workers (Turkers, henceforth) to score simplifications on linguistic qualities, as well as to indicate whether they have any preference between two-sentence and three-sentence versions in terms of readability.
We randomly sampled a portion of SRB, creating test data (call it H) consisting of triplets of the form ⟨S_0, A_0, B_0⟩, …, ⟨S_i, A_i, B_i⟩, …, ⟨S_m, A_m, B_m⟩, where S_i is a complex sentence, A_i a corresponding two-sentence simplification, and B_i its three-sentence version. While A alternates between versions created by BART and by humans, B contains only manual simplifications. See Table 1 for a further explanation; the parenthetical numbers in Table 1 indicate the amounts of data that originate in WikiSplit (Botha et al., 2018).
Separately, we extracted from WikiSplit and SRB another dataset B consisting of complex sentences as a source and two-sentence simplifications as a target (Table 2), i.e., B = {⟨S_0, A_0⟩, …, ⟨S_n, A_n⟩}, which we used to fine-tune a language model (BART-large), with code available on GitHub. We used WikiSplit, together with part of SRB, exclusively to fine-tune BART into a single-split (bipartite) simplification model, and SRB to develop the test data administered to humans for linguistic assessment. SRB was derived from WebNLG (Gardent et al., 2017) by making use of the RDFs associated with textual snippets to assemble simplifications. HSplit (Sulem et al., 2018a) is another dataset (based on Zhang and Lapata (2017)) that gives multi-split simplifications; we did not adopt it here, as it came with only 359 sentences with limited variation in splitting.

The task (or HIT, in Amazon's parlance) we asked Turkers to do was a three-part language quiz. The initial problem section introduced a worker to three short texts, corresponding to a triplet ⟨S_i, A_i, B_i⟩; the second section asked about the linguistic qualities of A_i and B_i along three dimensions, meaning, grammar, and fluency; and in the third, we asked two comparison questions: (1) whether A_i and B_i are more readable than S_i, and (2) which of A_i and B_i is easier to understand.
Figure 1 gives a screen capture of the initial section of the task. Shown under Source is a complex sentence, or S_i for some i. Text A and Text B correspond to A_i and B_i, which were displayed in a random order.
In total, there were 221 HITs (Table 1), each administered to seven people. All of the participants were self-reported native speakers of English with a degree from college or above. Participation was limited to residents of the US, Canada, the UK, Australia, and New Zealand.

Preliminary Analysis
Table 3 summarizes the results of the comparison questions. A question labelled ⟪S, BART-A⟫|q asks a Turker which of Source and BART-A he or she finds easier to understand, where BART-A is a BART-generated two-sentence simplification. We had 791 (113×7) responses, out of which 32% said they preferred Source, 67% liked BART-A better, and 1% replied they were not sure. Another question, labelled ⟪S, HUM-A⟫|q, compares Source to HUM-A, a two-sentence split by a human. It got 756 responses (108×7). The result is generally parallel to ⟪S, BART-A⟫|q: the majority of people favored a two-sentence split over a complex sentence. The fact that three-sentence versions are also favored over complex sentences suggests that breaking up a complex sentence improves readability, regardless of how many pieces it ends up with.
Table 4 gives a tally of responses to the comparison questions on two- and three-sentence splits. More people voted for bipartite over tripartite simplifications. Tables 5 and 6 show scores on fluency, grammar, and meaning retention of simplifications, comparing BART-A and HUM-B, on one hand, and HUM-A and HUM-B, on the other, on a scale of 1 (poor) to 5 (excellent). In either case, we did not see much divergence between A and B in grammar and meaning, but they diverged the most in fluency. A t-test found the divergence statistically significant. Two-sentence simplifications generally scored higher on fluency (over 4.0) than their three-sentence counterparts (below 4.0). Table 7 gives an example showing what the generated texts looked like in BART-A and HUM-A/B. A question that remains is what factors influenced the decisions that Turkers made. We answer the question by way of building a Bayesian model based on predictors assembled from past literature on readability and related fields.
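As an illustration of the fluency comparison, here is a hedged sketch of the significance test; the ratings below are invented stand-ins, and the choice of Welch's t-test is our assumption, since the paper does not name the variant used.

```python
# A sketch of the t-test mentioned above; the ratings are made up,
# and the Welch variant is an assumption on our part.
from scipy import stats

fluency_two = [5, 4, 4, 5, 4, 5, 4]     # 1-5 ratings, two-sentence splits
fluency_three = [4, 3, 4, 3, 4, 3, 4]   # 1-5 ratings, three-sentence splits

t, p = stats.ttest_ind(fluency_two, fluency_three, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")      # p < 0.01 would mirror Tables 5-6
```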

Model
We consider a Bayesian logistic regression:

Y_j ∼ Ber(λ),  λ = logit⁻¹(β_0 + Σ_i β_i X_i)   (1)

where Ber(λ) is a Bernoulli distribution with parameter λ, and β_i represents a coefficient tied to a random variable (predictor) X_i, with β_0 an intercept. We assume that each β_i, including the intercept, follows a normal distribution with mean 0 and variance σ_i, where σ_i > 0. Y_j takes either 1 or 0: Y = 1 if a Turker finds a two-sentence simplification more readable, and Y = 0 if a three-sentence version is preferred.
Table 7: An example of a source sentence and its two- and three-sentence simplifications.

ORIGINAL: The Alderney Airport serves the island of Alderney and its 1st runway is surfaced with poaceae and has a 497 meters long runway.

BART-A: Alderney Airport serves the island of Alderney. The 1st runway at Aarney Airport is surfaced with poaceae and has 497 meters long.

HUM-A: The runway length of Alderney Airport is 497.0 and the 1st runway has a poaceae surface. The Alderney Airport serves Alderney.

HUM-B: The surface of the 1st runway at Alderney airport is poaceae. Alderney Airport has a runway length of 497.0. The Alderney Airport serves Alderney.

Among the predictors are complexity measures derived from parse trees (Roark et al., 2007; Boghrati et al., 2018). They reflect various approaches to measuring the cognitive complexity of a sentence. For example, yngve scoring defines the cognitive demand of a word as the number of non-terminals to its right in a derivation rule that are yet to be processed.

yngve
Consider Figure 2. yngve gives every edge in the parse a number reflecting its cognitive cost. NP gets '1' because it has a sister node VP to its right. The cognitive cost of a word is defined as the sum of the numbers on the path from the root to the word. In Figure 2, 'Vanya' would get 1 + 0 + 0 = 1, whereas 'home' gets 0. Averaging the words' costs gives us the Yngve complexity of a sentence (a code sketch follows below).

subset and subtree

subset and subtree are both measures based on the idea of the Tree Kernel (Collins and Duffy, 2002; Moschitti, 2006; Chen et al., 2022), a function K(T1, T2) that measures the similarity of two parse trees by counting the tree fragments they share. The former considers how many subgraphs two parses share, while the latter how many subtrees. Note that subtrees are those structures that end with terminal nodes.
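Below is a minimal sketch of the yngve scoring described above, written against NLTK-style constituency trees; the example parse and helper names are ours, since Figure 2 is not reproduced here.

```python
# A minimal sketch of yngve scoring over an NLTK-style parse; the example
# tree and function names are illustrative (Figure 2 is not shown here).
from nltk import Tree

def yngve_costs(tree):
    """Pair each word with the sum of edge numbers on its root-to-word
    path; an edge gets the count of sister nodes to its right."""
    costs = []

    def walk(node, cost):
        if isinstance(node, str):            # a terminal, i.e. a word
            costs.append((node, cost))
            return
        last = len(node) - 1
        for i, child in enumerate(node):
            walk(child, cost + (last - i))   # (last - i) sisters remain

    walk(tree, 0)
    return costs

def yngve_complexity(tree):
    costs = yngve_costs(tree)
    return sum(c for _, c in costs) / len(costs)

t = Tree.fromstring("(S (NP Vanya) (VP (V went) (NP home)))")
print(yngve_costs(t))        # [('Vanya', 1), ('went', 1), ('home', 0)]
print(yngve_complexity(t))   # (1 + 1 + 0) / 3 ≈ 0.67
```

And a companion sketch of the subtree measure: counting the complete subtrees two parses share. The subset variant, which also counts partial fragments as in the Collins-Duffy kernel, is omitted here for brevity.

```python
# Counting shared complete subtrees (each bottoms out in terminals,
# matching the definition of 'subtree' in the text above).
from collections import Counter
from nltk import Tree

def subtree_bag(tree):
    """Multiset of complete subtrees of a parse."""
    return Counter(str(s) for s in tree.subtrees())

def shared_subtrees(t1, t2):
    b1, b2 = subtree_bag(t1), subtree_bag(t2)
    return sum(min(b1[k], b2[k]) for k in b1.keys() & b2.keys())
```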

Classic readability features
We also included features that have long been established as standard in the readability literature, i.e., Dale-Chall Readability, Flesch Reading Ease, and Flesch-Kincaid Grade Level (Chall and Dale, 1995; Flesch, 1979; Kincaid et al., 1975).
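These metrics are easy to reproduce; the paper does not say which implementation backed them, so the sketch below uses the textstat package as a stand-in.

```python
# A hedged sketch of the classic readability features; textstat is our
# choice of implementation, not necessarily the paper's.
import textstat

text = "Alderney Airport serves the island of Alderney."
print(textstat.flesch_reading_ease(text))           # higher = easier
print(textstat.flesch_kincaid_grade(text))          # US school grade level
print(textstat.dale_chall_readability_score(text))  # lower = easier
```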

Perceptual features
Those found in the perception category are from the judgments Turkers made on the quality of the simplifications we asked them to evaluate. We did not provide any specific definition of, or instruction as to, what constitutes grammaticality, meaning, or fluency during the task. So it is most likely that their responses were spontaneous and perceptual.

split and samsa
Finally, we have split, which records whether or not a simplification is bipartite: it takes true if it is, and false if not. samsa is a recent addition to the battery of simplification metrics; it looks at how much of the propositional content in the source remains after a sentence is split (Sulem et al., 2018b) (the greater, the better). We standardized all of the features, except for bart and split, by turning them into z-scores, z = (x − x̄)/σ, where x̄ is the feature's mean and σ its standard deviation.
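The standardization step can be written in a few lines; the data frame and column names here are hypothetical stand-ins.

```python
# Turning features into z-scores, skipping the binary bart and split
# columns, as described above; names are hypothetical.
import pandas as pd

def standardize(df: pd.DataFrame, skip=("bart", "split")) -> pd.DataFrame:
    out = df.copy()
    for col in df.columns:
        if col not in skip:
            out[col] = (df[col] - df[col].mean()) / df[col].std()
    return out
```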

Evaluation
We trained the model (Eqn. 1) using BAMBI (Capretto et al., 2020), with a burn-in of 50,000, drawing 4,000 samples on 4 MCMC chains (Hamiltonian). As a way to isolate the effect (or importance) of each predictor, we did two things: one was to look at the posterior distribution of each factor, i.e., the coefficient β tied to a predictor, and see how far it is removed from 0; the other was to conduct an ablation study, where we looked at how the absence of a feature affected the model's performance, which we measured with a metric known as the Watanabe-Akaike Information Criterion (WAIC) (Watanabe, 2010; Vehtari et al., 2016), a Bayesian incarnation of AIC (Burnham and Anderson, 2003). Figure 4 shows what the posterior distributions of the parameters associated with the predictors looked like after 4,000 draw iterations with MCMC. None of the chains associated with the parameters exhibited divergence. We achieved R̂ between 1.0 and 1.02 for all β_i, a fairly solid stability (Gelman and Rubin, 1992), indicating that all the relevant parameters had successfully converged. (R̂ is the ratio of within- and between-chain variances, a standard tool to check for convergence (Lambert, 2018); the closer the ratio is to unity, the more likely the MCMC chains have converged.) At first glance, it is a bit challenging to decide what to make of Figure 4, but a generally accepted rule of thumb is to treat distributions that center around 0 as of less importance in explaining the observations than those that sit away from zero. If we go along with the rule, then the most likely candidates that affected readability are: ease, subset, fk grade, grammar, meaning, fluency, split, and overlap. What remains unclear is to what degree these predictors affected readability.
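For concreteness, here is a minimal sketch of such a fit in BAMBI; the file name, data frame, and predictor columns are hypothetical stand-ins for the paper's actual setup.

```python
# A minimal sketch of the fit described above, assuming BAMBI/ArviZ;
# the file and column names are hypothetical.
import arviz as az
import bambi as bmb
import pandas as pd

df = pd.read_csv("responses.csv")  # one row per Turker judgment (hypothetical)

# y = 1 if the two-sentence version was preferred, 0 otherwise (Eqn. 1)
model = bmb.Model("y ~ fluency + meaning + grammar + split + ease + fk_grade",
                  df, family="bernoulli")
idata = model.fit(draws=4000, tune=50000, chains=4,
                  idata_kwargs={"log_likelihood": True})  # NUTS (Hamiltonian)

print(az.rhat(idata))  # R-hat near 1.0 for every beta indicates convergence
print(az.waic(idata))  # WAIC, used below for the ablation comparison
```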
One good way to find out is to do an ablation study, a method to isolate the effect of an individual factor by examining how seriously its removal from a model degrades the model's performance. The result of the study is shown in Table 9. Each row represents the performance, in WAIC, of a model with a particular predictor removed. Thus, 'ted1' in Table 9 represents a model that includes all the predictors in Table 8 except for ted1. The row in blue represents the full model, which had none of the features disabled. Appearing above the base model means that the removal of a feature had a positive effect, i.e., the feature is redundant. Appearing below means that the removal had a negative effect, indicating that we should not forgo the feature. A feature becomes more relevant as we go down the table, and less relevant as we go up. Thus the most relevant is fluency, followed by meaning; the least relevant is subtree, followed by dale, and so forth. We can tell from Table 9 which predictors we need to keep to explain readability: grammar, split, fk grade, ease, meaning, and fluency (call them 'select features'). Note that bart is in the negative realm, meaning that from the perspective of readability, people did not care whether the simplification was done by a human or a machine. samsa was also found in the negative domain, implying that from the perspective of information, a two-sentence splitting carries just as much information as a three-way division of a sentence.
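The ablation loop can be sketched as follows, under the same hypothetical setup as the fitting sketch above; the predictor names and the use of WAIC through ArviZ are our assumptions.

```python
# A hedged sketch of the WAIC ablation described above: refit the model
# with one predictor dropped at a time and compare against the full model.
import arviz as az
import bambi as bmb

predictors = ["fluency", "meaning", "grammar", "split", "ease",
              "fk_grade", "dale", "yngve", "subset", "subtree"]  # stand-ins

def fit_waic(cols, df):
    model = bmb.Model(f"y ~ {' + '.join(cols)}", df, family="bernoulli")
    idata = model.fit(draws=4000, tune=50000, chains=4,
                      idata_kwargs={"log_likelihood": True})
    return az.waic(idata).elpd_waic

# scores = {"full": fit_waic(predictors, df)}
# for p in predictors:
#     scores[p] = fit_waic([c for c in predictors if c != p], df)
#     # scores[p] < scores["full"] suggests predictor p should be kept
```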
To further nail down to what extent they are important, we ran another ablation experiment involving the select features alone. The result is shown in Table 10. At the bottom is fluency, second from the bottom is split, followed by meaning, and so forth. As we go up the table, a feature becomes less and less important. The posterior distributions of these features are shown in Figure 5. Not surprisingly, they are found away from zero, with fluency furthest away. The result indicates that, contrary to the popular wisdom that classic readability metrics such as ease and fk grade are of little use, they had a large sway on the decisions people made when they were asked about readability.

Conclusions
In this work, we asked two questions: does cutting up a sentence help the reader better understand the text? And if so, does it matter how many pieces we break it into? We found that splitting does allow the reader to better interact with the text (Table 3), and moreover, two-sentence simplifications are clearly favored over three-sentence simplifications (Tables 3, 9, 10). Why two-sentence splits make a better simplification is something of a mystery. A possible answer may lie in a potential disruption splitting may have caused in the sentence-level discourse structure, whose integrity, Crossley et al. (2011, 2014) argued, constitutes a critical part of simplification, a topic that we believe is worth further exploration in the future.

Limitations
• We did not consider cases where a sentence is split into more than three. This is mainly due to our failure to find a dataset containing a large number of manual simplifications of length greater than three. While it is unlikely that our claim in this work fails to hold for cases beyond three, testing the hypothesis on cases that involve more than three sentences would be desirable.
• The cohort of people we solicited for the current work are generally well-educated adults who speak English as their first language. Therefore, the results we found in this work may not necessarily hold for L2 learners, minors, or those who do not have a college-level education.

Figure 1:
A screen capture of a HIT. This is what a Turker would be looking at when taking the test.

Figure 4:
Posterior distributions of the coefficients (β's) in the full model. The further a distribution moves away from 0, the more relevant it becomes to predicting the outcome.

Figure 5:
Posterior distributions of the coefficient parameters in the reduced model.

Table 2:
A training setup for BART. The data come from SRB.

Table 3:
Results from the Comparison Section. We show how many Turkers went with each available choice. S: source. BART-A: BART-generated two-sentence simplification. HUM-A: manual two-sentence simplification. HUM-B: manual three-sentence simplification. ⟪S, BART-A⟫|q asked Turkers which of S and BART-A they found easier to understand; 67% said they favored BART-A, and 32% S, with 1% not sure. ⟪S, HUM-B⟫|q compares S and HUM-B for readability. ⟪S, HUM-A⟫|q looks at S and HUM-A.

Table 4:
Comparison of two- vs. three-sentence simplifications. The majority went with two-sentence simplifications regardless of how they were generated.

Table 5:
Average scores and standard deviations for HUM-A and HUM-B. HUM-A is more fluent than HUM-B. Note: ** = p < 0.01.

Table 6:
Average scores and standard deviations for BART-A and the corresponding HUM-B. BART-A is significantly more fluent than HUM-B. '**' indicates the two groups are distinct at the 0.01 level.