Enhancing Automatic Readability Assessment with Pre-training and Soft Labels for Ordinal Regression



Introduction
Readability assessment quantifies the difficulty of a text, that is, the degree to which it can be easily read and understood (McLaughlin, 1969; Klare, 2000). Since an automatic readability assessment (ARA) system can assign a text to a difficulty grade, it is useful for identifying texts or books that are suitable for individuals according to their language proficiency and intellectual and psychological development. ARA research harks back to the last century (Lively and Pressey, 1923; Klare, 1963) and has attracted rising attention in recent years, with impressive performance achieved by many neural approaches (Tanaka-Ishii et al., 2010; Azpiazu and Pera, 2019; Tseng et al., 2019; Schicchi et al., 2020; Azpiazu and Pera, 2020; Deutsch et al., 2020; Martinc et al., 2021; Lee et al., 2021; Vajjala, 2021; Lee and Vajjala, 2022).
There are, however, a number of limitations in the design and training of current ARA models. First, even though difficulty grades are clearly ordinal in nature, most systems approach the task as multi-class classification with independent labels. During training, texts in adjacent grades (e.g., Grades 2 and 3) are not treated as more similar than those in distant grades (e.g., Grades 2 and 6). Second, although a good initialization can optimize performance in many natural language processing tasks (Tamborrino et al., 2020), state-of-the-art ARA systems generally rely on random initialization (Azpiazu and Pera, 2019; Martinc et al., 2021).
This paper aims to further improve ARA performance by investigating the following research questions:

Ordinal information Can the use of soft labels for ordinal regression (Diaz and Marathe, 2019) improve performance?

Model initialization Can the model be better initialized through pre-training on pairwise relative prediction of text difficulty?
In contrast to most previous work, we conduct both within-corpus and cross-corpus experiments to answer these questions. Within-corpus evaluation may not accurately reflect ARA performance when the model is deployed on texts from other collections. Further, some features (e.g., text length, topics) could be domain-dependent and may not provide the same performance boost in other domains.
The rest of the paper is organized as follows. Following a review of previous work (Section 2), we propose our model (Section 3). We then describe our datasets (Section 4) and the experimental setup (Section 5). Finally, we discuss experimental results (Section 6).

Related Work
Early studies in ARA mostly focused on readability formulas, typically developed through empirical pedagogy and psychology (Klare, 1963; Davison and Kantor, 1982). Although these formulas have the advantage of being easily interpretable, they rely on surface features and cannot measure the structural or semantic complexity of a text.
Traditional machine learning methods have been applied to train statistical classifiers for ARA. These classifiers employ a large number of features related to vocabulary, semantics and syntax (Dell'Orletta et al., 2011; Francois and Fairon, 2012; Hancke et al., 2012; Sung et al., 2015; Denning et al., 2016; Arfé et al., 2018; Jiang et al., 2019). Although they often outperform readability formulas, feature engineering and selection can be time-consuming and labor-intensive.
Deep learning methods, which have shown impressive performance in NLP, have recently been applied to ARA. Pre-trained word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) and pre-trained language models such as BERT (Devlin et al., 2019) have been exploited by many neural ARA models (Tseng et al., 2019; Deutsch et al., 2020). Vec2Read (Azpiazu and Pera, 2019) captures important words and sentences through a multi-level attention mechanism, and uses a Bidirectional Long Short-Term Memory (Bi-LSTM) network to create representations of whole sentences and individual words. It has performed well on multilingual readability assessment by applying transfer learning (Azpiazu and Pera, 2020).
Most closely related to our model, Hierarchical Attention Networks (HAN) (Yang et al., 2016) consist of both word and sentence encoders to mimic the hierarchical structure of documents. The word encoder uses bidirectional gated recurrent units (Bi-GRU) (Bahdanau et al., 2014) to embed words while summarizing information from the context. A word-level attention mechanism then aggregates the most informative words to form a sentence vector. At the sentence level, another encoder likewise uses a Bi-GRU to embed sentences, and an attention mechanism aggregates the most informative sentences into a text vector. Originally developed for document classification, HAN has also shown competitive results in ARA (Martinc et al., 2021). Our proposed model follows the architecture of HAN, but uses a pre-trained BERT and a Bi-LSTM instead of Bi-GRUs in the word encoder and sentence encoder, respectively. Further, it adopts soft labels for ordinal regression and a novel pre-training task for initialization.
Combining neural models with hand-crafted linguistic features can further improve performance (Lee et al., 2021). Since our focus is on neural models that do not require feature engineering, we do not pursue this direction of research.

Proposed Model
We propose an ARA model based on hierarchical attention networks (HAN) (Yang et al., 2016) with two novel components: the use of soft labels to exploit the ordinal nature of the readability assessment task, and a novel pre-training task for model initialization. We will henceforth refer to this model as DTRA (deep text readability assessment).
The proposed model consists of three components (Figure 1). The feature representation component (Section 3.1), similar to HAN, constructs the representation of an input text. A fully connected layer following this component serves as a classifier, with cross entropy as the loss function. The soft-label component (Section 3.2) exploits the ordinal nature of our task by converting discrete grades into soft labels based on a distance metric between grades. The pre-training component (Section 3.3) aims to produce a good initialization for fine-tuning.

Feature Representation Component
The feature representation component produces the text-level representation of an input text. As shown in Figure 1, it consists of the word encoder, word attention, sentence encoder, and sentence attention modules.

Word Encoder Module
The word encoder module uses a pre-trained BERT as the feature extractor. It stacks 12 Transformer encoder layers (Vaswani et al., 2017) connected through residual connections. Its input consists of token embeddings, segment embeddings and position embeddings, as in the original BERT (Devlin et al., 2019). Each sentence in the input text is fed into BERT token by token, with all sentences processed by the same shared BERT parameters. Let h^t_i = BERT(x_i) denote the output of the word encoder module for the i-th sentence x_i.
Unlike Vec2Read (Azpiazu and Pera, 2019) and HAN (Yang et al., 2016), which respectively use a Bi-LSTM and a Bi-GRU together with pre-trained static word embeddings in the word encoder module, we use BERT to take advantage of its dynamic word embeddings, and also to avoid word segmentation ambiguity for Chinese texts.

Word Attention Module
Similar to Vec2Read (Azpiazu and Pera, 2019), we use a token-level attention mechanism to pay more attention to those words of higher significance for the readability assessment of the text. It consists of a single-hidden-layer neural network that assigns a weight a^t_ij to each token representation h^t_ij:

a^t_ij = softmax_j( W^t_2 ReLU(W^t_1 h^t_ij + b^t_1) + b^t_2 )

where W^t_1 and W^t_2 are the weights of the hidden and output layers, respectively; b^t_1 and b^t_2 are their associated bias vectors; and ReLU is the rectified linear unit activation function defined as ReLU(x) = max{0, x}. Then, we set

ĥ^s_i = Σ_j a^t_ij h^t_ij

as the representation of the i-th sentence, for i = 1, ..., l. We denote ĥ^s := [ĥ^s_1, ĥ^s_2, ..., ĥ^s_l].
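As a concrete illustration, the two-layer scoring followed by a softmax can be sketched in plain Python with toy dimensions (the weight values, the helper functions and the dimensions below are illustrative assumptions, not the trained parameters):

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v, b):
    # W: list of rows; returns W @ v + b
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(H, W1, b1, W2, b2):
    # H: token vectors h_ij of one sentence; returns their attention-weighted sum
    scores = [matvec(W2, relu(matvec(W1, h, b1)), b2)[0] for h in H]
    a = softmax(scores)  # attention weights a_ij, summing to 1
    dim = len(H[0])
    return [sum(a_j * h[d] for a_j, h in zip(a, H)) for d in range(dim)]

# Toy example: 3 tokens with 2-d embeddings, 2-d hidden layer
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]; b1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]; b2 = [0.0]
s_vec = attend(H, W1, b1, W2, b2)  # sentence vector ĥ^s_i
```

The same pattern, applied to sentence vectors instead of token vectors, yields the sentence attention module described in Section 3.1.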

Sentence Encoder Module
We capture the sentence-order information in the sentence-level representation with a Bi-LSTM, motivated by its sequential nature. Specifically, we sequentially feed the sentence-level representation ĥ^s to the Bi-LSTM in the original sentence order of the text to yield a new sentence-level representation h^s = [h^s_1, h^s_2, ..., h^s_l] that incorporates the correct sentence order, i.e., h^s = Bi-LSTM(ĥ^s).
By incorporating the context of its neighboring sentences, this enhanced sentence-level representation is intended to improve the readability assessment accuracy, since the sentence order may encode text logic and cohesion that can facilitate readability assessment.

Sentence Attention Module
Similar to Vec2Read (Azpiazu and Pera, 2019), we use a sentence-level attention mechanism to assign an attention weight to each sentence to reflect its importance in the readability assessment of the text. A single-hidden-layer neural network is used in the sentence-level attention. Specifically, let a^s_i be the attention weight corresponding to h^s_i:

a^s_i = softmax_i( W^s_2 ReLU(W^s_1 h^s_i + b^s_1) + b^s_2 )

where W^s_1 and W^s_2 are the weights, and b^s_1 and b^s_2 the bias vectors, of the hidden and output layers, respectively.
The final text-level representation h_out of a text for classification is thus h_out = Σ_{k=1}^{l} a^s_k h^s_k. Following the sentence attention module, a fully connected layer serves as the classifier, trained with cross entropy as the loss function.

Soft Labels for Ordinal Regression
ARA is an ordinal classification task since the labels have an underlying order from easy to difficult (e.g., Grade 1 to Grade 12). The severity of a classification error therefore depends on the distance between the gold and predicted labels. During training, there should be a greater penalty for predicting a Grade 3 text as Grade 6 (a distance of three), for example, than as Grade 2 (a distance of one). Since most existing models use "hard" labels, however, they treat all wrong classes as infinitely far away from the true class.
We use soft labels (Diaz and Marathe, 2019) to exploit the ordinal nature of the readability assessment task. Given an ordinal classification task with K categories, the soft label is defined as:

y_i = exp(-ϕ(r_i, r_t)) / Σ_{k=1}^{K} exp(-ϕ(r_k, r_t))    (4)

where r_i ∈ Y = {r_1, r_2, ..., r_K} is the i-th category, r_t is the true category and ϕ(r_i, r_t) is a distance metric between the two categories. The boundary between adjacent grades tends to be vague. For example, a Grade 3 text may not be clearly more difficult than a Grade 2 text or easier than a Grade 4 text. However, it should be more difficult than texts at grades farther away. We therefore take the distance metric ϕ(r_i, r_t) in (4) to be a piece-wise constant function (5) of the gap between grades, where c is a positive hyper-parameter that represents the distance between the true label and its adjacent labels. During training, we convert the original hard labels into soft labels according to (4) and (5). We empirically set c to 1.2 according to the experimental results in Appendix B.
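Under these definitions, converting a hard label into a soft label can be sketched as follows (the softmax-over-negative-distances form follows Diaz and Marathe (2019); the exact piecewise distance below, which returns 0 for the true grade, c for adjacent grades and the raw grade gap otherwise, is an illustrative assumption rather than the paper's equation (5)):

```python
import math

def phi(i, t, c=1.2):
    # illustrative piecewise distance between grade index i and true grade index t
    gap = abs(i - t)
    if gap == 0:
        return 0.0
    if gap == 1:
        return c
    return float(gap)

def soft_label(t, K, c=1.2):
    # softmax over negative distances, as in Diaz and Marathe (2019)
    exps = [math.exp(-phi(i, t, c)) for i in range(K)]
    z = sum(exps)
    return [e / z for e in exps]

y = soft_label(t=2, K=5)  # true grade is the middle of 5 categories
```

The resulting distribution peaks at the true grade, puts symmetric mass on adjacent grades, and decays with grade distance, which is exactly the ordinal structure a hard one-hot label discards.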

Pre-Training Component
A good initialization is crucial for a neural language model given the highly non-convex nature of the training loss (Tamborrino et al., 2020). To this end, we propose a pre-training task based on the prediction of the pairwise relative difficulty of texts: given any two texts, the task is to predict whether the first has a higher, lower, or the same level of readability as the other. We hypothesize that accurate performance on this related task yields a good initialization of parameters for the fine-tuning stage of an ARA model. We will refer to this pre-training task as Text Readability Order Prediction (TROP).
As shown in Figure 1, we randomly select two texts from the training set and use the feature representation component of the proposed model to construct their representations h^1_out and h^2_out. We then feed their concatenation h_TROP = [h^1_out; h^2_out] into a three-way classifier, using cross entropy loss as the training objective.
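Constructing TROP training pairs then amounts to sampling two texts and labeling the pair by comparing their grades (a minimal sketch; the label encoding and the uniform sampling scheme are our assumptions, not specified in the paper):

```python
import random

def make_trop_pairs(texts, n_pairs, seed=0):
    # texts: list of (text, grade) tuples
    # label: 0 = first text easier, 1 = same grade, 2 = first text harder
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        (x1, g1), (x2, g2) = rng.choice(texts), rng.choice(texts)
        label = 1 if g1 == g2 else (0 if g1 < g2 else 2)
        pairs.append((x1, x2, label))
    return pairs

data = [("easy text", 1), ("medium text", 3), ("hard text", 5)]
pairs = make_trop_pairs(data, n_pairs=8)
```

Each (x1, x2, label) triple would then be encoded by the shared feature representation component and scored by the three-way classifier.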

Data
Our evaluation makes use of five datasets in English and Chinese. We split every dataset into training, development and test sets in an 8:1:1 ratio.
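The 8:1:1 split can be sketched as follows (shuffling with a fixed seed is an assumption; the paper does not specify the splitting procedure beyond the ratio):

```python
import random

def split_8_1_1(items, seed=0):
    # shuffle, then carve off 80% / 10% / 10% for train / dev / test
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])

train, dev, test = split_8_1_1(range(100))
```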
Newsela The Newsela corpus 1 contains 10,786 texts distributed among levels 2-12 for English and Spanish. Similar to Martinc et al. (2021), we removed the Spanish documents and focused only on readability assessment for the English ones. Thus, the total number of Newsela samples used in our experiments is 9,565.
OneStopEnglish The OneStopEnglish corpus 2 was created for learners of English as a second language. It contains a total of 567 English texts, with each text written in three versions: elementary, intermediate and advanced.
WeeBit The WeeBit corpus consists of 6,388 English texts in five grades from WeeklyReader 3 and BBC-Bitesize 4. For a balanced dataset, we randomly sampled 625 texts in each grade.
CMT The China Mainland Textbook (CMT) corpus (Cheng et al., 2020) consists of a total of 2,723,430 characters, distributed in 2,621 texts across twelve grades, all taken from Chinese textbooks from the first grade of primary school to the third grade of high school in mainland China. Table 1 reports detailed statistics on CMT.

The two Chinese datasets facilitate cross-corpus evaluation since they follow the same grade scale defined by the national standard, but their materials are compiled independently from different sources. Cross-corpus evaluation would be difficult for the English datasets because of the lack of direct correspondence between their scales.

Experimental Set-up
This section presents the baselines to which we will compare our proposed model, the evaluation metrics, and implementation details.

Neural model baselines
Vec2Read (Azpiazu and Pera, 2019) uses pre-trained static word embeddings, a Bi-LSTM, and word- and sentence-level attention mechanisms. The embedding size and the hidden layer size of the Bi-LSTM were set to 300 and 128, respectively. When adapting it to the Chinese corpora, we used FastText (Bojanowski et al., 2017) to produce the pre-trained word embeddings.

BERT (Devlin et al., 2019) uses the default BERT model for fine-tuning and the default learning rate (2e-5).

ALBERT (Lan et al., 2019) uses factorized embedding parameterization and cross-layer parameter sharing to reduce the model size (from 108M to 12M parameters).

Longformer (Beltagy et al., 2020) uses a variant of the self-attention mechanism that scales linearly with sequence length to process long texts.

HAN (Martinc et al., 2021) uses two Bi-LSTMs with word- and sentence-level attention mechanisms to encode word and sentence representations. We used the same settings as Martinc et al. (2021), where the word and sentence embedding sizes were 200 and 100, respectively.

Lite-DTRA To reduce the storage memory requirement on the hardware, we provide a lite version of the proposed model, where the pre-trained BERT with frozen parameters is replaced by a lite version of BERT, i.e., ALBERT (Lan et al., 2019), which allows the model to be trained in an end-to-end way.

Traditional classifier baselines
We report the performance of traditional machine learning methods on the Chinese datasets (CMT and CMER): Logistic Regression, Support Vector Machine (SVM), Random Forest, and Naive Bayes. As shown in Table 7 (Appendix A), we manually extracted 43 features at the lexical, syntactic, semantic and cohesion levels, mainly taken from Sung et al. (2015). These traditional machine learning methods are not evaluated on the English datasets since their performance has already been extensively reported in previous research (Martinc et al., 2021; Lee et al., 2021).
The hyperparameters were tuned on the development data via 10-fold cross-validation. All methods were implemented in Matlab R2017b in an Intel(R) Xeon(R) E5-2667 environment.

Evaluation metrics
Our evaluation metrics include classification accuracy (C-acc), adjacent accuracy (A-acc) and the macro F1-measure (F1). Adjacent accuracy is defined as the proportion of samples with predicted labels adjacent to the gold labels (Sung et al., 2015), motivated by the strong ambiguity between adjacent classes.
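The two accuracy metrics can be computed as follows (a sketch; gold and predicted labels are assumed to be integer grade indices, and a prediction is counted as adjacent when it is within one grade of the gold label):

```python
def classification_accuracy(gold, pred):
    # fraction of exact matches
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def adjacent_accuracy(gold, pred):
    # a prediction counts as correct if it is within one grade of the gold label
    return sum(abs(g - p) <= 1 for g, p in zip(gold, pred)) / len(gold)

gold = [1, 2, 3, 4]
pred = [1, 3, 5, 4]
c_acc = classification_accuracy(gold, pred)  # 2/4 = 0.5
a_acc = adjacent_accuracy(gold, pred)        # 3/4 = 0.75
```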

Implementation details
For DTRA, we used a pre-trained BERT 6 in the word encoder module, where the number of Transformer encoder layers is 12 and the output feature size is 768. The sizes of the input, hidden and output layers in the token-level attention mechanism are 768, 192 and 1, respectively. In particular, we froze the parameters of BERT in DTRA due to the storage memory limitation of the hardware. The sizes of the input and hidden layers in the Bi-LSTM are 768 and 256, respectively, and the sizes of the input, hidden and output layers in the sentence-level attention are 512, 128 and 1, respectively. Following the sentence attention module, there is a fully connected layer as the classifier. We used the cross-entropy loss and the Adam algorithm (Kingma and Ba, 2015) as the optimizer to fine-tune the proposed model. Weight decay regularization with a regularization parameter of 0.01 was also adopted. In the pre-training component, the initial learning rates of Adam for the three modules and the fully connected layer were all 7e-5, while in the fine-tuning stage, they were set to 1e-5 and 4e-5, respectively. Lite-DTRA follows the same settings as DTRA, except that the frozen BERT is replaced with ALBERT (Lan et al., 2019).

Experimental Results
We report experimental results on the English datasets (Section 6.1) and the Chinese datasets (Section 6.2). We then present an ablation study (Section 6.3) and a comparison between soft labels and regression (Section 6.4). Finally, we discuss the results of a cross-corpus evaluation with few-shot learning (Section 6.5).

English datasets
As shown in Table 2, DTRA achieved higher accuracy on Newsela (83.26%) and OneStopEnglish (85.00%) than all five baseline neural models. These compare favorably with the best results achieved by HAN on Newsela (81.38%) and OneStopEnglish (78.72%) reported by Martinc et al. (2021), and by BERT on OneStopEnglish reported by Lee et al. (2021). 7 On WeeBit, however, BERT achieved the highest accuracy, which is similar to the result reported by Martinc et al. (2021).
Evaluation on the F1-measure exhibits the same trend. These experimental results suggest the effectiveness of the soft labels and pre-training. The individual contribution of these components will be further analyzed in the ablation study. In all settings, Lite-DTRA offered slightly better performance than DTRA. This may be attributable to the end-to-end training with ALBERT, in contrast to the frozen parameters of the pre-trained BERT used for feature extraction.
To visualize the performance of the proposed models, Figure 2 shows the confusion matrices of the top four models (Longformer, HAN, DTRA and Lite-DTRA) in terms of accuracy on Newsela. The values in the confusion matrices of DTRA and Lite-DTRA are more concentrated on the diagonal than those of the other two models, in particular at the seventh and ninth grades.

Chinese datasets
As shown in Table 3, DTRA achieved 44.42% accuracy on CMT and 26.50% on CMER. The lower accuracy in comparison to the English results is expected, since CMT and CMER each contain only approximately 2,000 texts but have 12 grades. On both accuracy and the F1-measure, DTRA outperformed all four neural baseline models as well as all statistical classifiers. On CMT, HAN achieved the second highest accuracy (42.53%), while the LR classifier performed second best (24.98%) on CMER.

Ablation Studies
We conducted an ablation study to measure the contribution of the soft labels and the pre-training to DTRA's performance.

Table 3: ARA performance on the Chinese datasets in terms of accuracy (C-acc), adjacent accuracy (A-acc) and F1, in percentages. The best and second best results are marked in bold and blue color, respectively.

The top of Table 4 compares the performance of the complete DTRA with its performance upon removal of the pre-training step and of the soft labels for ordinal regression. On most datasets, there was a decrease in both accuracy and F1 after removal of pre-training, indicating its utility for ARA. The use of soft labels improved the accuracy on all datasets except OneStopEnglish. In terms of the F1-measure, the soft labels were helpful on Newsela and WeeBit but slightly hurt performance on OneStopEnglish, CMT and CMER. The bottom of Table 4 compares Lite-DTRA with a counterpart version with frozen ALBERT parameters (referred to as Lite-DTRA-frozen). Lite-DTRA obtained better results on all metrics for all datasets, except F1 for CMER. This suggests that using the trainable ALBERT model is beneficial for Lite-DTRA in practice.

Soft labels vs. regression
ARA can be formulated as a regression, classification or ordinal regression task. We further examined the effect of the soft labels through a comparison with standard regression and multi-class classification.
The version of DTRA without pre-training, referred to as the Ordinal-DTRA model, serves as the reference point. We directly used the features output by the feature representation component to train a multi-class classification model, called the Classification-DTRA model. We used the same features to train a regression model, called the Regression-DTRA model.
Table 5 compares the accuracy of these three models. Ordinal-DTRA gave the best performance on all five datasets. Classification-DTRA achieved the second best performance, and outperformed Regression-DTRA by a substantial margin on most datasets. These results suggest that soft labels are more effective at capturing the ordinal nature of the readability grades than direct use of multi-class classification or regression.

Cross-corpus evaluation
Since ARA models may be used to predict the difficulty of texts from other sources, we gauge the robustness of our proposed model in a cross-corpus evaluation. We conducted experiments in two settings: (a) train DTRA on CMER and test on CMT; and (b) train DTRA on CMT and test on CMER. We did not attempt cross-corpus evaluation in English because of the lack of a direct mapping among the readability scales adopted in Newsela, OneStopEnglish and WeeBit.
In each setting, we further evaluated the impact of limited quantities of samples from the target corpus for few-shot learning. Specifically, we evaluated model performance when {0, 5, 10, 15, 20, 25, 30} samples from the target corpus were added to the training data. As shown in Table 6, when trained only on CMER, DTRA achieved 31.00% accuracy on CMT, a 13% degradation compared to the within-corpus setting (44.42%). It outperformed all other baselines on both accuracy and F1. When trained only on CMT, DTRA achieved 19.87% on CMER, a 7% degradation compared to the within-corpus setting (26.50%). While it outperformed all other baselines on F1, Vec2Read achieved the highest accuracy (21.95%). These results suggest that our models not only captured characteristics peculiar to textbooks (Section 6.1), but also learned textual difficulty features that can be effectively transferred to other texts.
Even a small amount of data from the target corpus could improve model performance. For example, just five texts per grade from the training set of the target corpus could already boost the accuracy of DTRA by about 5% absolute (36.11% on CMT and 25.17% on CMER). It also outperformed all other baselines in both accuracy and the F1-measure. The number of training samples from the target corpus is mostly positively correlated with model performance. Lite-DTRA tends to perform better than DTRA at leveraging limited data. With 15 samples, it achieved 41.59% accuracy on CMT, within 3% of its accuracy in the within-corpus setting; on CMER, it surpassed its within-corpus performance with an accuracy of 28.26%, which may indicate the relatively high quality of the CMT training data.
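The few-shot augmentation described above, adding k samples per grade from the target corpus to the source training data, can be sketched as follows (the per-grade sampling and the function interface are our assumptions):

```python
import random

def add_few_shot(source_train, target_train, k, seed=0):
    # group the target corpus by grade, then add k random samples per grade
    rng = random.Random(seed)
    by_grade = {}
    for text, grade in target_train:
        by_grade.setdefault(grade, []).append((text, grade))
    extra = []
    for grade in sorted(by_grade):
        items = by_grade[grade][:]
        rng.shuffle(items)
        extra.extend(items[:k])
    return list(source_train) + extra

# Toy example: 6 source texts, 10 target texts in each of 2 grades, k = 5
source = [(f"src{i}", 1) for i in range(6)]
target = [(f"tgt{i}_{g}", g) for g in (1, 2) for i in range(10)]
augmented = add_few_shot(source, target, k=5)
```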

Conclusion
This paper has proposed a deep learning model for text readability assessment that achieved competitive performance on the benchmark datasets Newsela, OneStopEnglish and WeeBit. Our model, which is based on the Hierarchical Attention Network (HAN) (Yang et al., 2016), incorporates two novel elements: soft labels for ordinal regression (Diaz and Marathe, 2019), and a pre-training task on pairwise relative text difficulty that aims to improve the model's initialization for fine-tuning.
We conducted experiments on both English and Chinese datasets to compare this model with a number of competitive neural models, including BERT, Vec2Read and HAN. The proposed model outperforms all baselines on most datasets in terms of both accuracy and F1. A lite version of the proposed model, with a reduced storage memory requirement, also offered competitive performance. An ablation study demonstrated that the pre-training and the soft labels brought benefits on most datasets. The proposed model also outperformed most baselines in the cross-corpus setting, demonstrating its ability to learn features of text difficulty that are transferable to other kinds of texts. There is still much room for improvement in the performance of the ARA model before it is ready for deployment in the classroom for automatic assignment of reading materials to students.

Figure 1 :
Figure 1: Overview of the proposed model, which consists of the feature representation component (Section 3.1), the soft-label component (Section 3.2), and the pre-training component (Section 3.3).

Figure 2 :
Figure 2: Confusion matrices of four deep learning models over Newsela. The horizontal and vertical axes of each figure represent the predicted categories and the true categories of the samples, respectively.

Table 1 :
Number of texts and their average length at each grade in the CMT and CMER corpora.

Table 2 :
ARA performance on the English datasets in terms of accuracy (C-acc), adjacent accuracy (A-acc) and F1, in percentages. The best and second best results are marked in bold and blue color, respectively.

Table 4 :
Ablation study on DTRA (top) and the impact of frozen ALBERT parameters (bottom). The best results are bolded.

Table 5 :
The accuracy of DTRA based on regression, classification and ordinal regression, in percentages. The best results are marked in bold.

Table 7 :
Features used for training the LR, SVM, RF and NB classifiers.