Explainable Prediction of Text Complexity: The Missing Preliminaries for Text Simplification

Text simplification reduces the language complexity of professional content for accessibility purposes. End-to-end neural network models have been widely adopted to directly generate the simplified version of input text, usually functioning as a black box. We show that text simplification can be decomposed into a compact pipeline of tasks to ensure the transparency and explainability of the process. The first two steps in this pipeline are often neglected: 1) to predict whether a given piece of text needs to be simplified, and 2) if yes, to identify complex parts of the text. The two tasks can be solved separately using either lexical or deep learning methods, or solved jointly. By simply applying explainable complexity prediction as a preliminary step, the out-of-sample text simplification performance of state-of-the-art, black-box simplification models can be improved by a large margin.


Introduction
Text simplification aims to reduce the language complexity of highly specialized textual content so that it is accessible for readers who lack adequate literacy skills, such as children, people with low education, people who have reading disorders or dyslexia, and non-native speakers of the language.
Mismatch between language complexity and literacy skills is identified as a critical source of bias and inequality in the consumers of systems built upon processing and analyzing professional text content. Research has found that it requires on average 18 years of education for a reader to properly understand the clinical trial descriptions on ClinicalTrials.gov, and this introduces a potential self-selection bias to those trials (Wu et al., 2016).
Text simplification has considerable potential to improve the fairness and transparency of text information systems. Indeed, the Simple English Wikipedia (simple.wikipedia.org) has been constructed to disseminate Wikipedia articles to kids and English learners. In healthcare, consumer vocabularies are used to replace professional medical terms to better explain medical concepts to the public (Abrahamsson et al., 2014). In education, natural language processing and simplified text generation technologies are believed to have the potential to improve student outcomes and bring equal opportunities for learners of all levels in teaching, learning and assessment (Mayfield et al., 2019).
Ironically, the definition of "text simplification" in the literature has never been transparent. The term may refer to reducing the complexity of text at various linguistic levels, ranging from replacing individual words in the text to generating a simplified document entirely through a computer agent. In particular, lexical simplification (Devlin, 1999) is concerned with replacing complex words or phrases with simpler alternatives; syntactic simplification (Siddharthan, 2006) alters the syntactic structure of the sentence; semantic simplification (Kandula et al., 2010) paraphrases portions of the text into simpler and clearer variants. More recent approaches simplify texts in an end-to-end fashion, employing machine translation models in a monolingual setting regardless of the type of simplifications (Zhang and Lapata, 2017; Guo et al., 2018; Van den Bercken et al., 2019). Nevertheless, these models are limited on the one hand by the absence of large-scale parallel (complex → simple) monolingual training data, and on the other hand by the lack of interpretability of their black-box procedures (Alva-Manchego et al., 2017).
Given the ambiguity in problem definition, there is also a lack of consensus on how to measure the goodness of text simplification systems, and automatic evaluation measures are perceived as ineffective and sometimes detrimental to the specific procedure, in particular when they favor shorter but not necessarily simpler sentences (Napoles et al., 2011). While end-to-end simplification models demonstrate superior performance on benchmark datasets, their success is often compromised in out-of-sample, real-world scenarios (D'Amour et al., 2020).
Our work is motivated by the aspiration that increasing the transparency and explainability of a machine learning procedure may help its generalization into unseen scenarios (Doshi-Velez and Kim, 2018). We show that the general problem of text simplification can be formally decomposed into a compact and transparent pipeline of modular tasks. We present a systematic analysis of the first two steps in this pipeline, which are commonly overlooked: 1) to predict whether a given piece of text needs to be simplified at all, and 2) to identify which part of the text needs to be simplified. The second task can also be interpreted as an explanation of the first task: why a piece of text is considered complex. These two tasks can be solved separately, using either lexical or deep learning methods, or they can be solved jointly through an end-to-end, explainable predictor. Based on the formal definitions, we propose general evaluation metrics for both tasks and empirically compare a diverse portfolio of methods using multiple datasets from different domains, including news, Wikipedia, and scientific papers. We demonstrate that by simply applying explainable complexity prediction as a preliminary step, the out-of-sample text simplification performance of the state-of-the-art, black-box models can be improved by a large margin.
Our work presents a promising direction towards a transparent and explainable solution to text simplification in various domains.
Text simplification at word level has been done through 1) lexicon based approaches, which match words to lexicons of complex/simple words (Deléger and Zweigenbaum, 2009; Elhadad and Sutaria, 2007), 2) threshold based approaches, which apply a threshold over word lengths or certain statistics (Leroy et al., 2013), 3) human driven approaches, which solicit the user's input on which words need simplification (Rello et al., 2013), and 4) classification methods, which train machine learning models to distinguish complex words from simple words (Shardlow, 2013). Complex word identification is also the main topic of SemEval 2016 Task 11 (Paetzold and Specia, 2016), aiming to determine whether a non-native English speaker can understand the meaning of a word in a given sentence. Significant differences exist between simple and complex words, and the latter on average are shorter, less ambiguous, less frequent, and more technical in nature. Interestingly, the frequency of a word is identified as a reliable indicator of its simplicity (Leroy et al., 2013).
While the above techniques have been widely employed for complex word identification, the results reported in the literature are rather controversial and it is not clear to what extent one technique outperforms the other in the absence of standardized high quality parallel corpora for text simplification (Paetzold, 2015). Pre-constructed lexicons are often limited and do not generalize to different domains. It is intriguing that classification methods reported in the literature are not any better than a "simplify-all" baseline (Shardlow, 2014).

Readability assessment
Traditionally, measuring the level of reading difficulty is done through lexicon and rule-based metrics such as the age of acquisition lexicon (AoA) (Kuperman et al., 2012) and the Flesch-Kincaid Grade Level (Kincaid et al., 1975). A machine learning based approach in (Schumacher et al., 2016) extracts lexical, syntactic, and discourse features and trains logistic regression classifiers to predict the relative complexity of a single sentence in a pairwise setting. The most predictive features are simple representations based on AoA norms. The perceived difficulty of a sentence is highly influenced by properties of the surrounding passage. Similar methods are used for fine-grained classification of text readability (Aluisio et al., 2010) and complexity (Štajner and Hulpuş, 2020).
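As an illustration of the rule-based metrics mentioned above, the Flesch-Kincaid Grade Level combines average sentence length and average syllables per word. The sketch below is a minimal approximation: the formula constants are the standard published ones, while the sentence splitter and the vowel-group syllable counter (`count_syllables`) are crude heuristics assumed here for illustration and are not taken from this paper.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: number of consecutive-vowel groups (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level of an English text.

    FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Naive segmentation: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    print(flesch_kincaid_grade("The cat sat on the mat."))
    print(flesch_kincaid_grade(
        "Myocardial infarction necessitates immediate percutaneous coronary intervention."))
```

A production readability tool would use a pronunciation dictionary for syllable counts; the heuristic here only serves to show how the score depends on word and sentence length.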

Computer-assisted paraphrasing
Simplification rules are learnt by finding words from a complex sentence that correspond to different words in a simple sentence (Alva-Manchego et al., 2017). Identifying simplification operations such as copies, deletions, and substitutions for words from parallel complex vs. simple corpora helps understand how human experts simplify text (Alva-Manchego et al., 2017). Machine translation has been employed to learn phrase-level alignments for sentence simplification (Wubben et al., 2012). Lexical and phrasal paraphrase rules are extracted in . These methods are often evaluated by comparing their output to gold-standard, human-generated simplifications, using standard metrics (e.g., token-level precision, recall, F1), machine translation metrics (e.g., BLEU (Papineni et al., 2002)), text simplification metrics (e.g., SARI (Xu et al., 2016), which rewards copying words from the original sentence), and readability metrics (among which Flesch-Kincaid Grade Level (Kincaid et al., 1975) and Flesch Reading Ease (Kincaid et al., 1975) are most commonly used). It is desirable that the output of the computational models is ultimately validated by human judges (Shardlow, 2014).

End-to-end simplification
Neural encoder-decoder models are used to learn simplification rewrites from monolingual corpora of complex and simple sentences (Scarton and Specia, 2018; Van den Bercken et al., 2019; Zhang and Lapata, 2017; Guo et al., 2018). On one hand, these models often obtain superior performance on particular evaluation metrics, as the neural network directly optimizes these metrics in training. On the other hand, it is hard to interpret what exactly is learned in the hidden layers, and without this transparency it is difficult to adapt these models to new data, constraints, or domains. For example, these end-to-end simplification models tend not to distinguish whether the input text should or should not be simplified at all, making the whole process less transparent. When the input is already simple, the models tend to oversimplify it and deviate from its original meaning (see Section 5.3).

Explanatory Machine Learning
Various approaches are proposed in the literature to address the explainability and interpretability of machine learning agents. The task of providing explanations for black-box models has been tackled either at a local level by explaining individual predictions of a classifier (Ribeiro et al., 2016), or at a global level by providing explanations for the model behavior as a whole (Letham et al., 2015). More recently, differential explanations are proposed to describe how the logic of a model varies across different subspaces of interest (Lakkaraju et al., 2019). Layer-wise relevance propagation (Arras et al., 2017) is used to trace text classification decisions back to individual words, which are assigned scores to reflect their separate contribution to the overall prediction. LIME (Ribeiro et al., 2016) is a model-agnostic explanation technique which can approximate any machine learning model locally with another sparse linear interpretable model. SHAP (Lundberg and Lee, 2017) evaluates Shapley values as the average marginal contribution of a feature value across all possible coalitions by considering all possible combinations of inputs and all possible predictions for an instance. Explainable classification can also be solved simultaneously through a neural network, using hard attentions to select individual words into the "rationale" behind a classification decision (Lei et al., 2016). Extractive adversarial networks employ a three-player adversarial game which addresses high recall of the rationale (Carton et al., 2018). The model consists of a generator which extracts an attention mask for each token in the input text, a predictor that cooperates with the generator and makes predictions from the rationale (words attended to), and an adversarial predictor that makes predictions from the remaining words in the inverse rationale. The minimax game between the two predictors and the generator is designed to ensure all predictive signals are included into the rationale.
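To make the notion of a Shapley value as an average marginal contribution concrete, the following sketch computes exact Shapley values for a handful of features by enumerating every ordering in which features can be added. It is not the SHAP library and not the approximation used in practice; the scoring function `toy_complexity_score` and its weights are hypothetical stand-ins for a trained complexity classifier.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(features: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values by brute force over all feature orderings.

    `features` must be unique (e.g., token positions rather than repeated words).
    The cost is factorial in the number of features, so this is only for tiny inputs.
    """
    totals = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        present: FrozenSet[str] = frozenset()
        for f in order:
            with_f = present | {f}
            # Marginal contribution of f given the features added before it.
            totals[f] += value(with_f) - value(present)
            present = with_f
    return {f: total / len(orders) for f, total in totals.items()}


# Hypothetical scorer: per-word weights plus an interaction bonus when the
# two technical words co-occur. These numbers are made up for illustration.
WEIGHTS = {"myocardial": 0.5, "infarction": 0.3, "the": 0.0}


def toy_complexity_score(present: FrozenSet[str]) -> float:
    score = sum(WEIGHTS[w] for w in present)
    if {"myocardial", "infarction"} <= present:
        score += 0.2
    return score


if __name__ == "__main__":
    print(shapley_values(list(WEIGHTS), toy_complexity_score))
```

In this toy case the interaction bonus is split evenly between the two technical words, and the attributions sum to the score of the full input minus the score of the empty input, which is the efficiency property of Shapley values.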
No prior work has addressed the explainability of text complexity prediction. We fill in this gap.

An Explainable Pipeline for Text Simplification
We propose a unified view of text simplification which is decomposed into several carefully designed sub-problems. These sub-problems generalize over many approaches, and they are logically dependent on and integratable with one another so that they can be organized into a compact pipeline.
Figure 1: A text simplification pipeline. Explainable prediction of text complexity is the preliminary of any human-based, computer-assisted, or automated system.
The first conceptual block in the pipeline (Figure 1) is concerned with explainable prediction of the complexity of text. It consists of two sub-tasks: 1) prediction: classifying a given piece of text into two categories, needing simplification or not; and 2) explanation: highlighting the part of the text that needs to be simplified. The second conceptual block is concerned with simplification generation, the goal of which is to generate a new, simplified version of the text that needs to be simplified. This step could be achieved through completely manual effort, or a computer-assisted approach (e.g., by suggesting alternative words and expressions), or a completely automated method (e.g., by self-translating into a simplified version). The second building block is piped into a step of human judgment, where the generated simplification is tested, approved, and evaluated by human practitioners.
One could argue that for an automated simplification generation system the first block (complexity prediction) is not necessary. We show that this is not the case. Indeed, it is unlikely that every piece of text needs to be simplified in reality, and instead the system should first decide whether a sentence needs to be simplified or not. Unfortunately, such a step is often neglected by existing end-to-end simplifiers, thus their performance is often biased towards the complex sentences that are selected into their training datasets in the first place and does not generalize well to simple inputs. Empirically, when these models are applied to out-of-sample text which should not be simplified at all, they tend to oversimplify the input and deviate from its original meaning (see Section 5.3).
One could also argue that an explanation component (1B) is not mandatory in certain text simplification practices, in particular in an end-to-end neural generative model that does not explicitly identify the complex parts of the input sentence. In reality, however, it is often necessary to highlight the differences between the original sentence and the simplified sentence (which is essentially a variation of 1B) to facilitate the validation and evaluation of these black boxes. More generally, the explainability/interpretability of a machine learning model has been widely believed to be an indispensable factor to its fidelity and fairness when applied to the real world (Lakkaraju et al., 2019). Since the major motivation of text simplification is to improve the fairness and transparency of text information systems, it is critical to explain the rationale behind the simplification decisions, even if they are made through a black-box model.
Without loss of generality, we can formally define the sub-tasks 1A, 1B, and 2 in the pipeline:
Definition 3.1. (Complexity Prediction). Let text $d \in D$ be a sequence of tokens $w_1 w_2 \ldots w_n$. The task of complexity prediction is to find a function $f : D \to \{0, 1\}$ such that $f(d) = 1$ if $d$ needs to be simplified, and $f(d) = 0$ otherwise.
Definition 3.2. (Complexity Explanation). Let $d$ be a sequence of tokens $w_1 w_2 \ldots w_n$ and $f(d) = 1$. The task of complexity explanation/highlighting is to find a function $h : D \to \{0, 1\}^n$ such that $h(d) = c_1 c_2 \ldots c_n$, where $c_i = 1$ means $w_i$ will be highlighted as a complex portion of $d$ and $c_i = 0$ otherwise. We denote $d|h(d)$ as the highlighted part of $d$ and $d|\neg h(d)$ as the unhighlighted part of $d$.
Definition 3.3. (Simplification Generation). Let $d$ be a sequence of tokens $w_1 w_2 \ldots w_n$ and $f(d) = 1$. The task of simplification generation is to find a function $g : D \to D'$ such that $g(d) = d' = w'_1 w'_2 \ldots w'_m$ and $f(d') = 0$, subject to the constraint that $d'$ preserves the meaning of $d$.
In this paper, we focus on an empirical analysis of the first two sub-tasks of explainable prediction of text complexity (1A and 1B), which are the preliminaries of any reasonable text simplification practice. We leave aside the detailed analysis of simplification generation (2) for now, as there are many viable designs of $g(\cdot)$ in practice, spanning the spectrum between completely manual and completely automated. Since this step is not the focus of this paper, we intend to leave the definition of simplification generation highly general.
Note that the definitions of complexity prediction and complexity explanation can be naturally extended to a continuous output, where $f(\cdot)$ predicts the complexity level of $d$ and $h(\cdot)$ predicts the complexity weight of $w_i$. The continuous output would align the problem more closely to readability measures (Kincaid et al., 1975). In this paper, we stick to the binary output because a binary action (to simplify or not) is almost always necessary in reality even if a numerical score is available.
Note that the definition of complexity explanation is general enough for existing approaches. In lexical simplification where certain words in a complex vocabulary $V$ are identified to explain the complexity of a sentence, it is equivalent to highlighting every appearance of these words in $d$, or $c_i = 1$ for all $w_i \in V$. In automated simplification where there is a self-translation function $g(d) = d'$, $h(d)$ can be simply instantiated as a function that returns a sequence alignment of $d$ and $d'$. Such reformulation helps us define unified evaluation metrics for complexity explanation (see Section 4).
It is also important to note that the dependency between the components, especially complexity prediction and explanation, does not restrict them to be done in isolation. These sub-tasks can be done either separately, or jointly with an end-to-end approach as long as the outputs of $f$, $h$, $g$ are all obtained (so that transparency and explainability are preserved). In Section 4, we include both separate models and end-to-end models for explanatory complexity prediction in one shot.
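The two instantiations of the highlighting function described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in our experiments: the function names, the toy vocabulary, and the use of Python's `difflib.SequenceMatcher` as the sequence aligner are all choices made here for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Set


def highlight_by_lexicon(tokens: Sequence[str], complex_vocab: Set[str]) -> List[int]:
    """h(d) for lexical simplification: c_i = 1 iff w_i is in the complex vocabulary V."""
    return [1 if tok.lower() in complex_vocab else 0 for tok in tokens]


def highlight_by_alignment(tokens: Sequence[str], simplified: Sequence[str]) -> List[int]:
    """h(d) derived from a self-translation g(d) = d'.

    Tokens of d that align to identical tokens of d' are left unhighlighted;
    every other token is marked as complex (c_i = 1).
    """
    mask = [1] * len(tokens)
    matcher = SequenceMatcher(a=list(tokens), b=list(simplified), autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            mask[i] = 0
    return mask


if __name__ == "__main__":
    d = "the physician administered the medication".split()
    d_prime = "the doctor gave the medicine".split()
    vocab = {"physician", "administered", "medication"}  # hypothetical lexicon V
    print(highlight_by_lexicon(d, vocab))
    print(highlight_by_alignment(d, d_prime))
```

Both functions return a binary mask of the same length as the input, which is what allows lexicon-based, model-based, and alignment-based explanations to be scored with the same metrics.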

Empirical Analysis of Complexity Prediction and Explanation
With the pipeline formulation, we are able to compare a wide range of methods and metrics for the sub-tasks of text simplification. We aim to understand how difficult they are in real-world settings and which method performs the best for which task.

Candidate Models
We examine a wide portfolio of deep and shallow binary classifiers to distinguish complex sentences from simple ones. Among the shallow models we use Naive Bayes (NB), Logistic Regression (LR), Support Vector Machines (SVM) and Random Forests (RF) classifiers trained with unigrams, bigrams and trigrams as features. We also train the classifiers using the lexical and syntactic features proposed in (Schumacher et al., 2016) combined with the n-gram features (denoted as "enriched features"). We include neural network models such as word and char-level Long Short-Term Memory Network (LSTM) and Convolutional Neural Networks (CNN). We also employ a set of state-of-the-art pre-trained neural language models, fine-tuned for complexity prediction; we introduce them below.
ULMFiT (Howard and Ruder, 2018) pre-trains a language model on a large general corpus such as WikiText-103 and then fine-tunes it on the target task using slanted triangular learning rates and gradual unfreezing. We use the publicly available implementation of the model with two fine-tuning epochs for each dataset; the model quickly adapts to a new task.
BERT (Devlin et al., 2019) trains deep bidirectional language representations and has greatly advanced the state-of-the-art for many natural language processing tasks. The model is pre-trained on the English Wikipedia as well as the BooksCorpus. Due to computational constraints, we use the 12-layer BERT base pre-trained model and fine-tune it on our three datasets. We select the best hyperparameters based on each validation set.
XLNeT (Yang et al., 2019) overcomes the limitations of BERT (mainly the use of masks) with a permutation-based objective which considers bidirectional contextual information from all positions without data corruption. We use the 12-layer XLNeT base model, pre-trained on the English Wikipedia, the Books corpus (similar to BERT), Giga5, ClueWeb 2012-B, and Common Crawl.

Evaluation Metric
We evaluate the performance of complexity prediction models using classification accuracy on balanced training, validation, and testing datasets.

Candidate Models
We use LIME in combination with LR and LSTM classifiers, SHAP on top of LR, and the extractive adversarial networks which jointly conduct complexity prediction and explanation. We feed each complex test sentence as input to these explanatory models and compare their performance at identifying tokens (words and punctuation) that need to be removed or replaced in the input sentence.
We compare these explanatory models with three baseline methods: 1) Random highlighting: randomly draw the size and the positions of tokens to highlight; 2) Lexicon based highlighting: highlight words that appear in the Age-of-Acquisition (AoA) lexicon (Kuperman et al., 2012), which contains ratings for 30,121 English content words (nouns, verbs, and adjectives) indicating the age at which a word is acquired; and 3) Feature highlighting: highlight the most important features of the best performing LR models for complexity prediction.

Evaluation Metrics
Evaluation of explanatory machine learning is an open problem. In the context of complexity explanation, when the ground truth of highlighted tokens ($y_c(d) = c_1 c_2 \ldots c_n$, $c_i \in \{0, 1\}$) in each complex sentence $d$ is available, we can compare the output of complexity explanation $h(d)$ with $y_c(d)$. Such per-token annotations are usually not available at scale. To overcome this, given a complex sentence $d$ and its simplified version $d'$, we assume that all tokens $w_i$ in $d$ which are absent in $d'$ are candidate words for deletion or substitution during the text simplification process and should therefore be highlighted in complexity explanation (i.e., $c_i = 1$).
In particular, we use the following evaluation metrics for complexity explanation: 1) Tokenwise Precision (P), which measures the proportion of highlighted tokens in $d$ that are truly removed in $d'$; 2) Tokenwise Recall (R), which measures the proportion of tokens removed in $d'$ that are actually highlighted in $d$; 3) Tokenwise F1, the harmonic mean of P and R; 4) word-level Edit Distance (ED) (Levenshtein, 1966) between the unhighlighted part of $d$ and the simplified document $d'$. Intuitively, a more successful complexity explanation would highlight most of the tokens that need to be simplified, thus the remaining parts in the complex sentences will be closer to the simplified version, achieving a lower edit distance (we also explore ED with a higher penalty cost for the substitution operation, namely values of 1, 1.5 and 2); and 5) Translation Edit Rate (TER) (Snover et al., 2006), which measures the minimum number of edits needed to change a hypothesis (the unhighlighted part of $d$) so that it exactly matches the closest reference (the simplified document $d'$). Note that these metrics are all proxies of the real editing process from $d$ to $d'$. When token-level edit history is available (e.g., through track changes), it is better to evaluate the highlights against these true changes. We compute all the metrics at sentence level and macro-average them.
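The token-wise metrics and the weighted edit distance above can be sketched in a few lines. The sketch assumes whitespace-tokenized sentences and exact string matching between tokens; the helper names (`gold_mask`, `precision_recall_f1`, `edit_distance`) are illustrative and TER is omitted because it additionally handles phrase shifts.

```python
from typing import List, Sequence, Tuple


def gold_mask(complex_tokens: Sequence[str], simple_tokens: Sequence[str]) -> List[int]:
    """Proxy ground truth: c_i = 1 iff token w_i of d does not occur in d'."""
    simple_set = set(simple_tokens)
    return [0 if tok in simple_set else 1 for tok in complex_tokens]


def precision_recall_f1(pred: Sequence[int], gold: Sequence[int]) -> Tuple[float, float, float]:
    """Token-wise P, R, and F1 of a predicted highlight mask against the gold mask."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    n_pred, n_gold = sum(pred), sum(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def edit_distance(source: Sequence[str], target: Sequence[str], sub_cost: float = 1.0) -> float:
    """Word-level Levenshtein distance; insertions and deletions cost 1, substitutions cost sub_cost."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [float(i)]
        for j, t in enumerate(target, start=1):
            cost = 0.0 if s == t else sub_cost
            curr.append(min(prev[j] + 1.0,        # delete s
                            curr[j - 1] + 1.0,    # insert t
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    d = "the physician administered the medication".split()
    d_prime = "the doctor gave the medicine".split()
    pred = [0, 1, 0, 0, 1]  # a hypothetical explanation h(d)
    print(precision_recall_f1(pred, gold_mask(d, d_prime)))
    unhighlighted = [tok for tok, c in zip(d, pred) if c == 0]
    print(edit_distance(unhighlighted, d_prime, sub_cost=1.5))
```

Macro-averaging then amounts to computing these values per sentence and taking the unweighted mean over all sentences in a test set.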

Datasets
We use three different datasets (Table 1) which cover different domains and application scenarios of text simplification. Our first dataset is Newsela (Xu et al., 2015), a corpus of news articles simplified by professional news editors. In our experiments we use the parallel Newsela corpus with the training, validation, and test splits made available in (Zhang and Lapata, 2017). Second, we use the WikiLarge corpus introduced in (Zhang and Lapata, 2017). The training subset of WikiLarge is created by assembling datasets of parallel aligned Wikipedia-Simple Wikipedia sentence pairs available in the literature (Kauchak, 2013). While this training set is obtained through automatic alignment procedures which can be noisy, the validation and test subsets of WikiLarge contain complex sentences with simplifications provided by Amazon Mechanical Turk workers (Xu et al., 2016); we increase the size of validation and test on top of the splits made available in (Zhang and Lapata, 2017). Third, we use the dataset released by the Biendata competition, which asks participants to match research papers from various scientific disciplines with press releases that describe them. Arguably, rewriting scientific papers into press releases has mixed objectives that are not simply text simplification. We include this task to test the generalizability of our explainable pipeline (over various definitions of simplification). We use alignments at title level. On average, a complex sentence in Newsela, WikiLarge, and Biendata contains 23.07, 25.14, and 13.43 tokens, respectively, and the corresponding simplified version is shorter, with 12.75, 18.56, and 10.10 tokens.

Ground Truth Labels
The original datasets contain aligned complex-simple sentence pairs instead of classification labels for complexity prediction. We infer ground-truth complexity labels for each sentence such that: label 1 is assigned to every sentence for which there is an aligned simpler version not identical to itself (the sentence is complex and needs to be simplified); label 0 is assigned to all simple counterparts of complex sentences, as well as to those sentences that have corresponding "simple" versions identical to themselves (i.e., these sentences do not need to be simplified). For complex sentences that have label 1, we further identify which tokens are not present in corresponding simple versions.
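The labeling rule above can be sketched as a small function over aligned pairs. The function name `infer_labels` is illustrative, and one detail is an assumption made here rather than something stated in the text: when the same sentence appears as the complex side of one pair and the simple side of another, this sketch lets label 1 take precedence.

```python
from typing import Dict, Iterable, Tuple


def infer_labels(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Derive binary complexity labels from aligned (complex, simple) sentence pairs.

    Label 1: the sentence has an aligned simpler version that differs from it.
    Label 0: the sentence is a simple counterpart, or its "simple" version is identical.
    Assumption (not from the paper): label 1 wins if a sentence qualifies for both.
    """
    labels: Dict[str, int] = {}
    for complex_sent, simple_sent in pairs:
        if complex_sent == simple_sent:
            # Unchanged by the human editor: no simplification needed.
            labels.setdefault(complex_sent, 0)
        else:
            labels[complex_sent] = 1
            labels.setdefault(simple_sent, 0)
    return labels


if __name__ == "__main__":
    toy_pairs = [
        ("The physician administered the medication .", "The doctor gave the medicine ."),
        ("The dog barked .", "The dog barked ."),
    ]
    for sentence, label in infer_labels(toy_pairs).items():
        print(label, sentence)
```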

Model Training
For all shallow and deep classifiers we find the best hyperparameters using random search on validation, with early stopping. We use grid search on validation to fine-tune hyperparameters of the pre-trained models, such as maximum sequence length, batch size, learning rate, and number of epochs. For ULMFiT on Newsela, we set batch size to 128 and learning rate to 1e-3. For BERT on WikiLarge, batch size is 32, learning rate is 2e-5, and maximum sequence length is 128. For XLNeT on Biendata, batch size is 32, learning rate is 2e-5, and maximum sequence length is 32. We use grid search on validation to fine-tune the complexity explanation models, including the extractive adversarial network. For LR and LIME we determine the maximum number of words to highlight based on TER score on validation (please see Table 2); for SHAP we highlight all features with positive assigned weights, all based on TER. For extractive adversarial networks batch size is set to 256, learning rate is 1e-4, and adversarial weight loss equals 1; in addition, sparsity weight is 1 for Newsela and Biendata, and 0.6 for WikiLarge; lastly, coherence weight is 0.05 for Newsela, 0.012 for WikiLarge, and 0.0001 for Biendata.

Complexity Prediction
In Table 3, we evaluate how well the representative shallow, deep, and pre-trained classification models can determine whether a sentence needs to be simplified at all. We test for statistical significance of the best classification results compared to all other models using a two-tailed z-test.
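For readers who want to reproduce this kind of significance check, one common form of a two-tailed z-test compares two accuracies as proportions with a pooled standard error. The sketch below assumes this pooled two-proportion variant and treats the two sets of predictions as independent samples; the text does not specify which variant was used, and the counts in the usage example are made up.

```python
import math
from typing import Tuple


def two_proportion_z_test(correct_a: int, n_a: int,
                          correct_b: int, n_b: int) -> Tuple[float, float]:
    """Two-tailed z-test for the difference between two accuracies.

    Returns (z, p_value) using the pooled standard error under H0: p_a == p_b.
    """
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        raise ValueError("standard error is zero; accuracies are degenerate")
    z = (p_a - p_b) / se
    # Two-tailed p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 800 vs. 750 correct out of 1000 test sentences each.
    z, p = two_proportion_z_test(800, 1000, 750, 1000)
    print(f"z = {z:.3f}, p = {p:.4f}")
```

Because competing models are usually scored on the same test sentences, a paired test such as McNemar's is a reasonable alternative when per-sentence predictions are available.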
In general, the best performing models can achieve around 80% accuracy on two datasets (Newsela and WikiLarge) and a very high performance on Biendata (> 95%). This difference reflects how the difficulty of complexity prediction varies across domains: distinguishing highly specialized scientific content from public-facing press releases is relatively easy (Biendata).
Deep classification models in general outperform shallow ones; however, with carefully designed handcrafted features and proper hyperparameter optimization, shallow models tend to approach the results of the deep classifiers. Overall, models pre-trained on large datasets and fine-tuned for text simplification yield superior classification performance. For Newsela the best performing classification model is ULMFiT (accuracy = 80.83%, recall = 76.87%), which significantly (p < 0.01) surpasses all other classifiers except for XLNeT and CNN (char-level). On WikiLarge, BERT presents the highest accuracy (81.45%, p < 0.01), and recall = 83.30%. On Biendata, XLNeT yields the highest accuracy (95.48%, p < 0.01) with recall = 94.93%, although the numerical difference to other pre-trained language models is small. This is consistent with recent findings in other natural language processing tasks (Cohan et al., 2019).
Table 3 note: Shallow models perform similarly and some are omitted for space. The difference between the best performing model and other models is statistically significant: p < 0.05 (*), p < 0.01 (**), except for †, where the difference between this model and the best performing model is not statistically significant.

Complexity Explanation
We evaluate how well complexity classification can be explained, or how accurately the complex parts of a sentence can be highlighted.
Results (Table 4) show that highlighting words in the AoA lexicon or LR features are rather strong baselines, indicating that most complexity of a sentence still comes from word usage. Highlighting more LR features leads to a slight drop in precision and a better recall. Although LSTM and LR perform comparably on complexity classification, using LIME to explain LSTM presents better recall, F1, and TER (at similar precision) compared to using LIME to explain LR. The LIME & LSTM combination is reasonably strong on all datasets, as is SHAP & LR. TER is a reliable indicator of the difficulty of the remainder (unhighlighted part) of the complex sentence. ED with a substitution penalty of 1.5 efficiently captures the variations among the explanations. On Newsela and Biendata, the extractive adversarial networks yield solid performances (especially TER and ED 1.5), indicating that jointly making predictions and generating explanations reinforces each other. Table 5 provides examples of highlighted complex sentences by each explanatory model.

Benefit of Complexity Prediction
One may question whether explainable prediction of text complexity is still a necessary preliminary step in the pipeline if a strong, end-to-end simplification generator is used. We show that it is. We consider the scenario where a pre-trained, end-to-end text simplification model is blindly applied to texts regardless of their complexity level, compared to only simplifying those considered complex by the best performing complexity predictor in Table 3. Such a comparison demonstrates whether adding complexity prediction as a preliminary step is beneficial to a text simplification process when a state-of-the-art, end-to-end simplifier is already in place. From literature we select the current best text simplification models on WikiLarge and Newsela which have released pre-trained models:
• ACCESS (Martin et al., 2020), a controllable sequence-to-sequence simplification model that reported the highest performance (41.87 SARI) on WikiLarge.
We apply the author-released, pre-trained ACCESS and DMLMTL on all sentences from the validation and testing sets of all three datasets. We do not use the training examples as the pre-trained models may have already seen them. Presumably, a smart model should not further simplify an input sentence if it is already simple enough. However, to our surprise, a majority of the out-of-sample simple sentences are still changed by both models (above 90% by DMLMTL and above 70% by ACCESS, please see Table 6).
We further quantify the difference with vs. without complexity prediction as a preliminary step. Intuitively, without complexity prediction, an already simple sentence is likely to be overly simplified, resulting in a loss in text simplification metrics. In contrast, an imperfect complexity predictor may mistake a complex sentence for a simple one, which misses the opportunity of simplification and results in a loss as well. The empirical question is which loss is higher. From Table 7, we see that after directly adding a complexity prediction step before either of the state-of-the-art simplification models, there is a considerable drop of errors in three text simplification metrics: Edit Distance (ED), TER, and Fréchet Embedding Distance (FED), which measures the difference between a simplified text and the ground truth in a semantic space (de Masson d'Autume et al., 2019). For ED alone, the improvements are between 30% and 50%. This result is very encouraging: considering that the complexity predictors are only 80% accurate and that the complexity predictor and the simplification models do not depend on each other, there is considerable room to optimize this gain. Indeed, the benefit is higher on Biendata where the complexity predictor is more accurate.
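The gating step evaluated here is simple to express in code: the simplifier is only invoked when the predictor labels the input as complex. The sketch below shows that control flow only; `toy_predict_complex` and `toy_simplify` are hypothetical stand-ins defined for illustration and bear no relation to the trained classifiers or to ACCESS and DMLMTL.

```python
from typing import Callable


def simplify_if_complex(text: str,
                        predict_complex: Callable[[str], int],
                        simplify: Callable[[str], str]) -> str:
    """Apply g(d) only when f(d) = 1; otherwise return the input unchanged."""
    return simplify(text) if predict_complex(text) == 1 else text


def toy_predict_complex(text: str) -> int:
    """Hypothetical stand-in for f: treat sentences longer than 8 tokens as complex."""
    return 1 if len(text.split()) > 8 else 0


def toy_simplify(text: str) -> str:
    """Hypothetical stand-in for a black-box g: keep only the first 5 tokens."""
    return " ".join(text.split()[:5])


if __name__ == "__main__":
    already_simple = "Healthy diet linked to lower risk of disease"
    complex_input = ("Dramatic changes are urgently needed in farming practices "
                     "to keep pace with accelerating climate change")
    print(simplify_if_complex(already_simple, toy_predict_complex, toy_simplify))
    print(simplify_if_complex(complex_input, toy_predict_complex, toy_simplify))
```

Because the predictor and the simplifier are passed in as independent callables, either can be swapped without retraining the other, which mirrors the modularity of the pipeline.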
Qualitatively, one could frequently observe syntactic, semantic, and logical mistakes in the model-simplified version of simple sentences. We give a few examples below.
• Healthy diet linked to lower risk of chronic lung disease → Healthy diet linked to lung disease (DMLMTL)
• Dramatic changes needed in farming practices to keep pace with climate change → changes needed to cause climate change (DMLMTL)
• Social workers can help patients recover from mild traumatic brain injuries → Social workers can cause better problems . (DMLMTL)
All these qualitative and quantitative results suggest that the state-of-the-art black-box models tend to oversimplify and distort the meanings of out-of-sample input that is already simple. Evidently, the lack of transparency and explainability has limited the application of these end-to-end black-box models in reality, especially to out-of-sample data, context, and domains. The pitfall can be avoided with the proposed pipeline and simply with explainable complexity prediction as a preliminary step. Even though this explainable preliminary does not necessarily reflect how a black-box simplification model "thinks", adding it to the model is able to yield better out-of-sample performance.

Conclusions
We formally decompose the ambiguous notion of text simplification into a compact, transparent, and logically dependent pipeline of sub-tasks, where explainable prediction of text complexity is identified as the preliminary step. We conduct a systematic analysis of its two sub-tasks, namely complexity prediction and complexity explanation, and show that they can be either solved separately or jointly through an extractive adversarial network. While pre-trained neural language models achieve significantly better performance on complexity prediction, an extractive adversarial network that solves the two tasks jointly shows a promising advantage in complexity explanation. Using complexity prediction as a preliminary step reduces the error of the state-of-the-art text simplification models by a large margin. Future work should integrate a rationale extractor into the pre-trained neural language models and extend it for simplification generation.