Learning Syntactic Dense Embedding with Correlation Graph for Automatic Readability Assessment

Deep learning models for automatic readability assessment generally discard the linguistic features traditionally used in machine learning models for the task. We propose to incorporate linguistic features into neural network models by learning syntactic dense embeddings based on linguistic features. To cope with the relationships between the features, we form a correlation graph among features and use it to learn their embeddings so that similar features are represented by similar embeddings. Experiments with six data sets of two proficiency levels demonstrate that our proposed methodology can complement a BERT-only model to achieve significantly better performance for automatic readability assessment.


Introduction
Readability is the ease with which a reader can understand a written text.¹ Readability prediction has been widely applied in education (Lennon and Burdick, 2004), book publishing (Pera and Ng, 2014), marketing (Chebat et al., 2003), newspaper readership (Pitler and Nenkova, 2008), and health information communication (Bernstam et al., 2005). Ever since the first study by Lively and Pressey in 1923, researchers have developed many popular readability formulas, including Flesch (Flesch, 1948), Fog (Gunning, 1969), and Lexile (Stenner et al., 1988). These traditional readability formulas are favored in domain applications for their simplicity, even though they are mostly based on shallow features and are known to lack accuracy (Bruce et al., 1981; Davison and Kantor, 1982; Graesser et al., 2004).

Their strong reliance on expert knowledge also makes these formulas difficult to adapt to a new domain.
Machine learning approaches, which incorporate a broader set of morphological, lexical, syntactic, and discourse features, have been shown to achieve better accuracy in readability assessment (Si and Callan, 2001; Collins-Thompson and Callan, 2005). Figure 1(a) describes a generic machine learning framework for Automatic Readability Assessment (ARA), in which manual feature engineering is an important step for extracting the linguistic features used to build readability classification models.
To bypass the need for heavy feature engineering, deep learning strategies have been studied to automatically detect patterns or extract features related to readability (Azpiazu and Pera, 2019; Martinc et al., 2019; Mohammadi and Khasteh, 2019). Figure 1(b) provides a generic neural network structure of the deep learning approach to ARA. While neural network models take word embeddings as input, they generally discard the linguistic features traditionally used in machine learning models (Deutsch et al., 2020). When linguistic features such as POS and morphological tags are incorporated at all, they are only used to guide the attention mechanism for the embedding representation of the text (Azpiazu and Pera, 2019). Pre-trained models such as BERT (Devlin et al., 2019) learn dense representations of text by informing the models with semantically neighboring words, sentences, or context. Despite recent attempts to assess BERT's ability to implicitly capture the structural properties of language (Goldberg, 2019; Jawahar et al., 2019; Kovaleva et al., 2019), it has been observed that BERT "tends to rely more on semantic than structural differences during the classification phase and therefore performs better on problems with distinct semantic differences between classes" (Martinc et al., 2019). Current BERT-based models for ARA thus clearly lack explicit consideration of syntactic (and structural) features, which are known to be crucial. In this study, we address the problem of augmenting BERT with the linguistic features widely used in ARA.
Figure 1: Proposed Dual-Model framework (c) as compared with generic machine learning (a), and generic deep learning framework (b) for Automatic Readability Assessment (ARA). Framework to learn syntactic dense embedding for ARA is provided in (d).
¹ https://en.wikipedia.org/wiki/Readability
To best integrate with BERT, we create syntactic dense embeddings as shown in Figure 1(d). An important problem we consider in this paper is the possible relationships between different features, since linguistic features defined by linguistic experts are often related. Table 1 shows three pairs of linguistic features for Chinese readability assessment. In the first example, the "percentage of conjunctions" and the "average height of parse tree" may be positively correlated because both reflect the complexity of the sentences. In the second example, the "percentage of unique functional words" in a document is negatively correlated with the "average number of characters per word" for that document because Chinese functional words are usually short (i.e., one or two characters). Using all these linguistic features as if they were independent may hinder the classifier. We propose to consider the possible relationships among linguistic features when creating their dense embeddings, with which we can complement the BERT embedding representations.
In this paper, we represent pairwise correlations between features as triplets with linguistic features as nodes and their correlations as edges. Positive correlation implies that two features behave similarly in influencing the readability level of the text and should be represented with similar embeddings. The set of triplets forms a graph (as illustrated in Figure 1(d)). We then learn the dense representations of linguistic features with graph-based models. By encoding the similarity knowledge in dense embeddings, the ARA classifier models are better informed and gain predictive strength. Our experiments on six data sets confirm the effectiveness of this approach.
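As a concrete illustration of the triplets described above, the following minimal Python sketch builds (feature, relation, feature) triplets from per-document feature values. It assumes Pearson correlation and a cut-off of 0.7 (the threshold stated in the Methodology section); the function names, the relation label `pos_corr`, and the toy values are hypothetical and are not taken from the released code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant feature carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def build_correlation_triplets(features, threshold=0.7):
    """features: dict mapping feature name -> list of per-document values.

    Returns (head, relation, tail) triplets for feature pairs whose
    correlation reaches the threshold; negative correlations are dropped.
    """
    names = sorted(features)
    triplets = []
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            r = pearson(features[names[a]], features[names[b]])
            if r >= threshold:  # assumption: the threshold is inclusive
                triplets.append((names[a], "pos_corr", names[b]))
    return triplets

# Toy values for four documents (illustrative only, not from the paper's data).
toy_features = {
    "pct_conjunctions": [0.02, 0.04, 0.06, 0.09],
    "avg_parse_tree_height": [3.0, 4.1, 5.2, 6.8],
    "pct_unique_function_words": [0.30, 0.22, 0.15, 0.08],
}
triplets = build_correlation_triplets(toy_features)
print(triplets)
```

In this toy example only the conjunction and parse-tree features are linked, since the third feature correlates negatively with both.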
We contribute to the research on Automatic Readability Assessment in the following ways:
(1) We provide three new data sets of linguistic features for document-level readability assessment of Chinese L1, Chinese L2 and English L2 learning.
(2) We verify that the correlation relationships among linguistic features can be utilized to learn syntactic dense embeddings.
(3) We propose a dual-channel neural network model (i.e., Dual-Model) to combine the syntactic dense embeddings and the BERT semantic dense embeddings for readability prediction.
(4) We verify, with six data sets of Chinese and English corpora for L1 and L2 language proficiencies, that the Dual-Model can significantly improve the predictive performance of the BERT-only model.
We provide our data and code at: https://github.com/luv2Lab/linguistic-feature-embedding.

Automatic Readability Assessment
Corpora for readability assessment are available for many languages. Among the most cited for English readability assessment are the WeeBit corpus by Vajjala and Meurers (2012, 2014) for English L1 learning and the Cambridge exam corpus by Xia et al. (2016) for English L2. For Chinese readability assessment, Sung et al. (2015) evaluated 30 linguistic features and classification models with textbooks in traditional Chinese. Qiu et al. (2017), Lu et al. (2019), and Zhu et al. (2019) designed features of different categories for machine learning methods for Chinese L1 and L2 readability assessment at the document and sentence levels. Similar work on other languages includes French (Todirascu et al., 2016), German (Hancke et al., 2012), Swedish (Pilán et al., 2016), and Japanese (Wang and Andersen, 2016). Azpiazu and Pera (2020) analyzed the most common linguistic features for six languages and evaluated multiple classifiers for cross-lingual readability assessment.
Most of the current work on applying graph-based methods or neural networks to readability assessment operates on word-level semantic embeddings. For example, Jiang et al. (2018) incorporated word-level difficulty from lexical knowledge sources into a knowledge graph and trained enriched word embedding representations. Martinc et al. (2019) applied three types of neural language models at the word level for unsupervised assessment. Mohammadi and Khasteh (2019) simplified the process of feature extraction with the GloVe model for word embedding and reinforcement learning for English and Persian readability assessment. Azpiazu and Pera (2019) presented a multi-attentive recurrent neural network model that takes raw words as input and incorporates an attention mechanism with POS and morphological tags. Deutsch et al. (2020) proposed a fusion model that adds the numerical output of a transformer to the linguistic features as input to SVM classifiers for readability prediction.
We notice that previous studies mostly treat linguistic features as independent: each feature is simply added alongside the others. However, two features can reflect the same type of linguistic phenomenon and thus be positively correlated in influencing the readability of a text. The correlation relationships among features may help learn dense representations of linguistic features that neural network models can use for better-informed predictions.

Feature Embedding
An important question in building neural network models is how to learn embedding representations. Feature binning has been studied to exploit the relatedness between different intervals of feature values in feature vector representation (Sil et al., 2017; Liu et al., 2016). In particular, Maddela and Xu (2018) applied smooth binning and projected each numerical feature into a vector representation with multiple Gaussian radial basis functions. This embedding approach captures the nuanced relationships between different intervals of feature values.
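The smooth binning idea can be illustrated with a minimal Python sketch. It assumes evenly spaced bin centers, an unnormalized Gaussian kernel exp(-gamma * (value - center)^2), and normalization so that the vector sums to one; the function name, the default `gamma`, and the number of bins are hypothetical choices for illustration and may differ from the configuration of Maddela and Xu (2018).

```python
import math

def gaussian_binning(value, low, high, num_bins=5, gamma=None):
    """Project one numerical feature value onto `num_bins` Gaussian radial
    basis functions whose centers are evenly spaced over [low, high]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    width = (high - low) / num_bins
    centers = [low + (j + 0.5) * width for j in range(num_bins)]
    if gamma is None:
        gamma = 1.0 / (width ** 2)  # hypothetical default tied to bin width
    activations = [math.exp(-gamma * (value - c) ** 2) for c in centers]
    total = sum(activations)
    return [a / total for a in activations]  # normalize to sum to 1

# A value of 0.42 on a [0, 1] scale activates mostly the middle bins.
vector = gaussian_binning(0.42, low=0.0, high=1.0)
print([round(v, 3) for v in vector])
```

Unlike hard binning, a value near a bin boundary contributes to both neighboring bins, so nearby values receive similar vectors.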
Methods similar to word embedding (Mikolov et al., 2013) have been applied to create embeddings of POS tags. Chen and Manning (2014) showed that POS tags and arc labels exhibit semantic similarity like words do, and that embeddings can capture the similarities between POS tags or arc labels. We hypothesize that the pairwise correlations among the linguistic features for ARA can also be used to learn embeddings, and we propose to use graph-based models for that purpose.
The resulting similarities learned from data-driven embedding may not fully reflect the similarities one has in mind for their application (Goldberg, 2017). Retrofitting (Faruqui et al., 2014) used information from WordNet, FrameNet and PPDB to improve pre-trained embedding vectors so that related words will have more similar embeddings. The method first constructs a graph $(V, E)$ where $V$ is the set of word types, and $E \subseteq V \times V$ indicates semantic relationships among pairs of words with ontology $\Omega$. Given an original embedding vector $q^{*}_i$, a new embedding $q_i$ is learned such that it is closer to $q^{*}_i$ and its neighbors $q_j$, $\forall j$ such that $(i, j) \in E$, with closeness measured by Euclidean distance. The objective is to minimize $\Psi(Q)$:
$$\Psi(Q) = \sum_{i=1}^{|V|} \Big[ \alpha_i \lVert q_i - q^{*}_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert q_i - q_j \rVert^2 \Big]$$
where $\alpha$ and $\beta$ control the importance of a word embedding $q_i$ being similar to itself in the original space or to another word in the same space connected by relational information.
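A minimal Python sketch of Retrofitting is given below. It assumes the iterative update in which each embedding is replaced by a weighted average of its original vector and its current neighbors, with the original vector weighted by 1 and each neighbor by the inverse of the node degree. The function name, the undirected treatment of edges, and the toy vectors are illustrative assumptions.

```python
import math

def retrofit(original, edges, alpha=1.0, iterations=10):
    """original: dict node -> list of floats; edges: iterable of (i, j) pairs.

    Returns new embeddings that stay close to the originals while moving
    toward their graph neighbors. Edges are treated as undirected.
    """
    neighbors = {node: set() for node in original}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    new = {node: list(vec) for node, vec in original.items()}
    for _ in range(iterations):
        for node, nbrs in neighbors.items():
            if not nbrs:
                continue  # isolated nodes keep their original embedding
            beta = 1.0 / len(nbrs)  # assumption: uniform neighbor weights
            updated = []
            for p in range(len(original[node])):
                neighbor_sum = sum(beta * new[n][p] for n in nbrs)
                numerator = alpha * original[node][p] + neighbor_sum
                updated.append(numerator / (alpha + beta * len(nbrs)))
            new[node] = updated
    return new

# Toy embeddings: "a" and "b" are linked, "c" is isolated.
original = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, -5.0]}
retrofitted = retrofit(original, edges=[("a", "b")])
print(math.dist(original["a"], original["b"]))
print(math.dist(retrofitted["a"], retrofitted["b"]))
```

The linked pair ends up closer together than it started, while the isolated node is unchanged.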
While Retrofitting is used to improve entity embeddings in a graph, knowledge graph embedding learns representations for both the entities and their relations. TransE (Bordes et al., 2013) is a representative translational distance model in which entities and relations are modeled in the same Euclidean space. Given two entity vectors $h$, $t$ and a translation vector $r$ between them, the model requires $h + r \approx t$ for the observed triple $(h, r, t)$. Hence, the TransE score function is low if $(h, r, t)$ holds, and high otherwise. To differentiate between correct and incorrect triples, TransE is trained with a margin-based pairwise ranking loss over their scores.
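The translational score and the ranking loss can be sketched in a few lines of Python. The sketch assumes an L2 distance for the score and a hinge loss with margin 1.0 between one correct triple and one corrupted triple; the function names and the toy vectors are hypothetical.

```python
import math

def transe_score(h, r, t):
    """Distance ||h + r - t||_2; low when the triple (h, r, t) holds."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_ranking_loss(pos_triple, neg_triple, margin=1.0):
    """Hinge loss that is zero once the correct triple scores at least
    `margin` lower than the corrupted triple."""
    gap = transe_score(*pos_triple) - transe_score(*neg_triple)
    return max(0.0, margin + gap)

h, r = [0.1, 0.2], [0.3, 0.1]
t_correct = [0.4, 0.3]     # h + r lands on this tail
t_corrupt = [2.0, -1.0]    # a tail far away from h + r

loss_ok = margin_ranking_loss((h, r, t_correct), (h, r, t_corrupt))
loss_bad = margin_ranking_loss((h, r, t_corrupt), (h, r, t_correct))
print(loss_ok, loss_bad)
```

The loss vanishes when the correct triple is already well separated from the corrupted one, and is positive when their roles are swapped.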

Methodology
Let $F = \{f_1, \dots, f_{n_f}\}$ (where $n_f$ is the number of features) be a linguistic feature set designed for readability assessment. Let matrix $X$ be the collection of the feature vector representations of the documents, where each document is a vector $x_i \in \mathbb{R}^{n_f}$. To construct the syntactic dense embeddings for document representation, we perform the following steps:
(1) We apply the Gaussian-binning method (Maddela and Xu, 2018) to $X$ such that each feature value is projected into a Gaussian embedding vector $\vec{g}$. We concatenate the $\vec{g}$ for all documents $x_i$ to form the initial data-driven embedding of feature $f_j$, for every $j$ with $1 \le j \le n_f$. The dimension $d$ of this embedding is therefore the number of documents times the number of Gaussian bins.
(2) We form a feature graph $G$ using positive correlations among the $n_f$ features by setting a correlation threshold of 0.7. We preserve only the positive correlations in the graph.
(3) Let the matrix $\tilde{M} \in \mathbb{R}^{d \times n_f}$ be the collection of embeddings of $f_j \in \mathbb{R}^{d}$. Given a feature graph $G$ and matrix $\tilde{M} \in \mathbb{R}^{d \times n_f}$, we apply TransE (Bordes et al., 2013) or Retrofitting (Faruqui et al., 2014) to learn optimized feature embeddings for each feature $f_j$. Instead of random initialization, we use the data-driven embedding of $f_j \in \mathbb{R}^{d}$ from Step (1) as the initial entity embedding for optimization. The syntactic latent space $\mathbb{R}^{d}$ is trained by TransE or Retrofitting respectively to encode the relationship knowledge implied by the correlations among linguistic features, so that the final dense embedding of linguistic feature $f_j$ will be closer to those positively correlated with it in graph $G$. We denote the matrix optimized by TransE or Retrofitting by $\hat{M} \in \mathbb{R}^{d \times n_f}$.
(4) To construct the syntactic dense embeddings of document representation with the embeddings of linguistic features, we perform a linear mapping to project the document feature vectors onto the syntactic latent space $\mathbb{R}^{d}$. Specifically, given a feature vector of document $x_i \in \mathbb{R}^{n_f}$ and an optimized syntactic matrix $\hat{M} \in \mathbb{R}^{d \times n_f}$, the projected document vector $s_i$ in the syntactic latent space $\mathbb{R}^{d}$ is defined as:
$$s_i = \hat{M} x_i = (s_{i1}, \dots, s_{id})^{\top}$$
where $s_{ip}$ ($1 \le p \le d$) is the projected value of the $n_f$ linguistic features of document $x_i$ at dimension $p$ of $\mathbb{R}^{d}$. We name $s_i \in \mathbb{R}^{d}$ the "syntactic dense embedding".
To construct semantic dense embeddings for the documents, we learn the BERT average embedding representations following the original procedures as shown in Figure 2, where the final BERT representation is the average over all tokens. An alternative approach is to use the [CLS] token embedding to represent the text and fine-tune it for prediction. In our pilot study, we experimented rigorously with different fine-tuning strategies for each of the six data sets. The best fine-tuning results, compared with the original BERT average embeddings, are reported in Appendix A. Our corpora are small, ranging in size from 326 to 2500, as described later in Table 2. Fine-tuning BERT with its 110M parameters may fit the training set very well but may not generalize well to the test set. In the pilot study, we found that the overall performance of the fine-tuned BERT was not better than that of the original BERT. Therefore, we present experiments with the original average BERT embeddings.
With the BERT dense embeddings and the syntactic dense embeddings, we propose a DNN dual-channel neural network model (i.e., Dual-Model) to predict the documents' readability levels. We first feed the BERT embeddings into a four-layer network and the syntactic dense embeddings into a two-layer network. We then concatenate the outputs of the two channels into combined syntactic-semantic dense embeddings as input to another two-layer network, with MLP and SoftMax layers, for readability classification. The model architecture is provided in Figure 3.
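The projection in Step (4) and the forward pass of the Dual-Model can be sketched in plain Python as follows. This is only an illustrative sketch: the layer widths, the ReLU activation, the random weights, and all names are hypothetical, and random vectors stand in for the optimized feature matrix and the BERT average embedding. The actual hyperparameters are those reported in Appendix C.

```python
import math
import random

def project(matrix, x):
    """Linear map s = M x, with M stored as d rows of n_f columns."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def dense(x, weights, bias, activate=True):
    """One fully connected layer, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if activate else out

def softmax(z):
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_channel(rng, sizes):
    return [random_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

def run_channel(x, layers):
    for weights, bias in layers:
        x = dense(x, weights, bias)
    return x

rng = random.Random(0)
n_f, d, bert_dim, n_classes = 3, 4, 8, 5  # toy sizes, all hypothetical

# Stand-ins for the optimized feature matrix and the BERT average embedding.
m_hat = [[rng.uniform(-1, 1) for _ in range(n_f)] for _ in range(d)]
bert_vec = [rng.uniform(-1, 1) for _ in range(bert_dim)]
x_doc = [0.2, 0.5, 0.1]  # toy document feature vector

syntactic = project(m_hat, x_doc)                          # syntactic dense embedding
bert_channel = make_channel(rng, [bert_dim, 8, 8, 6, 6])   # four layers
syn_channel = make_channel(rng, [d, 4, 4])                 # two layers

# Concatenate the two channel outputs, then apply an MLP and a softmax layer.
joint = run_channel(bert_vec, bert_channel) + run_channel(syntactic, syn_channel)
hidden = dense(joint, *random_layer(rng, len(joint), 6))
logits = dense(hidden, *random_layer(rng, 6, n_classes), activate=False)
probs = softmax(logits)
print(probs)
```

With untrained weights the output probabilities are meaningless; the sketch only shows how the two channels are combined before classification.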

Data Sets
To evaluate our proposed models, we use six readability data sets as shown in Table 2. We create three data sets for Chinese L1, Chinese L2, and English L2 readability assessment. The Chinese L1 data sets are textbooks for first language learning in primary school, secondary school, and high school education from three publishers. The Chinese L2 data sets are from 5 grades of 73 textbooks that are most widely used by 7 universities in China for teaching Chinese to international students, as described in Lu et al. (2019). The ENEW data set consists of 4 grades of English textbooks from the New Concept English series, which is one of the most widely used English L2 textbook series in China. We followed the data preparation of ENCT in Jiang et al. (2018) to prepare the ENEW corpus. The raw data of the Chinese L1 and ENEW data sets are publicly available from their textbook websites.²˒³
In addition, we use three benchmark corpora. We obtain the WeeBit data for English L1 from the authors of Vajjala and Meurers (2012, 2014). We re-extract the text from the HTML files and discard documents that are fill-in-the-blank tests or duplicates. We take the middle set of 500 documents by document length for each class to form a 2500-document WeeBit corpus. We obtained the Cambridge Exam data set for English L2 readability assessment (Xia et al., 2016) from their website.⁴ We found 5 duplicate documents in class FCE, leaving a total of 326 documents of five grade levels. We also downloaded the OneStopEnglish data set for English L2 learning from its website⁵ (Vajjala and Lučić, 2018).
Following the feature engineering methodology in previous work (Flesch, 1948; Gunning, 1969; Kincaid et al., 1975; Yang, 1970; Feng, 2010; Jiang et al., 2014; Sung et al., 2015; Qiu et al., 2017; Lu et al., 2019), we design 102 linguistic features for Chinese L1 and 111 features for Chinese L2 readability assessment. We design 33 features for English L2 with reference to Vajjala and Meurers (2012). We use the feature extraction code provided by Vajjala and Meurers (2012) to recalculate the 46 feature values for the 2500-document WeeBit corpus. We acquire the 155-feature calculation results from the OneStopEnglish corpus, drop the features that have zero value for all documents, and obtain the values of 140 features. In our pilot study with the ENEW data set, we found that our 33-feature design was effective, so we apply it to the Cambridge corpus as well. We provide linguistic feature descriptions in Appendix B.

Model Evaluation
Following our methodology, we have two implementations of the dual-channel model that combine syntactic and semantic dense embeddings for ARA: GFE-TransE+BERT and GFE-Retrofit+BERT. Both have the same network architecture as in Figure 3. The difference is that the Gaussian embeddings of features are used in TransE and Retrofitting respectively to learn the optimized feature embeddings based on the correlation graph and then produce the syntactic dense embeddings of documents. We compare our methodology with the following baselines:
(1) SVM and LR with document feature vector $x_i \in \mathbb{R}^{n_f}$, which are typical classification methods based on manual features.
(2) BERT-only DNN: This is a BERT-DNN network which has the same architecture as the right-hand side BERT channel in Figure 3. Using BERT for representation has been found effective (Martinc et al., 2019).
(3) Raw+BERT Model: This model concatenates the BERT DNN channel output with raw feature vectors $x_i \in \mathbb{R}^{n_f}$ to form the input to a neural network for prediction. It verifies whether feature embedding is actually needed or whether we could simply augment the BERT embedding with raw feature vectors for prediction.
(4) G-Doc+BERT: Following Maddela and Xu (2018), for each feature $f_j$ ($1 \le j \le n_f$) in $x_i = (f_1, \dots, f_{n_f})^{\top}$, we learn the Gaussian embedding $\vec{g}$ and concatenate all of them into a document embedding representation. We use this syntactic dense embedding, which is not trained with graph relations, as the left-channel input in the Dual-DNN model in Figure 3 to compare with our proposed method.
To evaluate model effectiveness, we use Accuracy and Distance-1 Adjacent Accuracy. Adjacent Accuracy means that predicting a text to be within one level of the true label is still considered accurate (Heilman et al., 2008). We perform 5-fold stratified cross-validation and report average Accuracy and Adjacent Accuracy. We provide the hyperparameters of the neural network models and the preprocessing procedures in Appendix C, and the test of correlation thresholds in Appendix D.
Table 3. Model comparisons. BERT's performances better than machine learning and other single-channel models are bolded. Dual-Channel Model's performances better than the BERT-only model are bolded and italicized. Performances of Dual-Channel with Graph-based Feature Embedding models (i.e., our proposed methodology) better than BERT and other Dual-channel models are bolded and starred. The best performances for each data set are bolded and underlined.
⁴ http://www.ilexir.co.uk/datasets/index.html
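The two evaluation metrics described above can be sketched in Python as follows. The sketch assumes that readability levels are encoded as integers so that the distance between two levels is their absolute difference; the function names and the toy labels are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted level equals the true level."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def adjacent_accuracy(y_true, y_pred, distance=1):
    """Fraction of documents predicted within `distance` levels of the truth."""
    hits = sum(abs(t - p) <= distance for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Toy labels on a five-level scale (illustrative only).
y_true = [1, 2, 3, 4, 5, 3]
y_pred = [1, 3, 3, 2, 5, 4]

acc = accuracy(y_true, y_pred)            # 3 exact matches out of 6
adj = adjacent_accuracy(y_true, y_pred)   # 5 predictions within one level
print(acc, adj)
```

Adjacent Accuracy is always at least as high as Accuracy, since every exact match also lies within one level of the true label.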

Results and Analysis
We first compare the BERT-only DNN model with two traditional machine learning models, SVM and Logistic Regression, and three other single-channel DNN models. Table 3 shows the Accuracy and Adjacent Accuracy in the first and second rows for each data set. We observe that the BERT-only DNN performs best on five of the six data sets, the exception being ENEW. This indicates that semantic embedding alone is very effective for ARA with neural network models, which outperform traditional machine learning models that use raw feature vector representations. This result is consistent with previous studies using neural network models (Martinc et al., 2019; Azpiazu and Pera, 2019).
Next, we compare the BERT-only model with the dual-channel DNN models Raw+BERT and G-Doc+BERT. We find that augmenting BERT with the raw feature value vector or with the document vector based on Gaussian embedding slightly improves the performance of BERT, showing that the raw linguistic features contain additional structural information about the text that is marginally but consistently useful to the neural models on all data sets.
The performances of our proposed method are presented in the last two columns of Table 3. We observe that the two Dual-Models achieve the best performance among all 10 models on five of the six data sets (all except ENEW) and are better than the BERT-only and the other dual-channel models. Moreover, except for Chinese L2, where the improvement is relatively small, the Dual-Model improvements are significant (Student's t-test, p < 0.05) on the other four data sets: Chinese L1, WeeBit, OneStopEnglish and Cambridge. These results strongly support our earlier hypothesis that the correlations between linguistic features provide additional useful information for learning syntactic dense embeddings that complement the semantic dense embeddings.
Comparing the last two columns of Table 3, we can observe generally similar performances in using TransE or Retrofitting on the feature graph. In theory, we impose a strict closeness constraint in Retrofitting, but let TransE learn the embedding for the correlation relation freely. The higher flexibility of TransE did not translate into better effectiveness. We speculate that the limited amount of training data may hinder our model from taking full advantage of the flexibility of TransE.
Figure 4 presents a comparison of t-SNE visualization of semantic and syntactic dense embeddings, and the concatenated embedding. The figure illustrates that the concatenated embedding can produce more closely clustered data points by grade levels.
To investigate how the Dual-Model improves over the BERT-only model in predicting different readability levels, we analyze the True Positive Rate (TPR) at each grade level. For each data set, we select from cross-validation the best GFE-TransE+BERT model and the BERT-only model and then apply them to the whole data set. We construct confusion matrices and calculate the TPR for each grade level as:
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
where $TP$ and $FN$ are the numbers of true positives and false negatives for that grade level. As shown in Figure 5, for Chinese L1 the largest improvements by the Dual-Model are spread out across Grades 3, 5, 8, and 10, whereas for Chinese L2 they occur at the two ends of the scale, Grades 1 and 5. In contrast, for the four English corpora, adding the syntactic dense embedding improves the BERT-only model more at the middle and higher grade levels. We also observe from Table 3 that the improvement on Chinese L1 is more pronounced. For example, the GFE-TransE+BERT model for Chinese L1 achieved a relative improvement of 19.4% over BERT-only (0.4732 vs. 0.3963), while for WeeBit the relative improvement over BERT-only was 3.88% (0.8672 vs. 0.8348).
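The per-grade TPR and the relative improvements quoted above can be reproduced with a short Python sketch. It assumes a confusion matrix whose rows are true grades and whose columns are predicted grades; the function names and the toy matrix are illustrative, while the accuracy values are the ones quoted from Table 3.

```python
def true_positive_rates(confusion):
    """confusion[i][j]: number of documents of true grade i predicted as j.

    Returns TP / (TP + FN) for each grade, i.e. the diagonal entry divided
    by the row total.
    """
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)  # TP + FN for grade i
        rates.append(row[i] / total if total else 0.0)
    return rates

def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline` as a fraction of the baseline."""
    return (new - baseline) / baseline

# Toy three-grade confusion matrix (illustrative only).
toy_confusion = [[8, 2, 0],
                 [1, 6, 3],
                 [0, 1, 9]]
tprs = true_positive_rates(toy_confusion)

# Accuracy values quoted in the text for GFE-TransE+BERT vs. BERT-only.
chinese_l1 = relative_improvement(0.4732, 0.3963)
weebit = relative_improvement(0.8672, 0.8348)
print(tprs, round(100 * chinese_l1, 1), round(100 * weebit, 2))
```

The two relative improvements round to 19.4% and 3.88%, matching the figures in the text.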
We speculate that the differences in improvement might be caused by two factors, among many others: (1) how important syntactic structure is for building the foundational knowledge in learning a certain language; and (2) how the semantic and syntactic knowledge of a certain language is organized throughout the learning process in order to lead learners toward a grasp of the language.
We construct the correlation graphs with positive correlation relationships only, although we observe that both positive and negative correlations exist among linguistic features. To investigate the effectiveness of learning embeddings that also consider negative correlations, we define an additional score function in TransE for negatively correlated features. We present performance comparisons with the GFE-TransE+BERT model in Table 4. We find that both models perform similarly, showing that positive correlation alone is sufficient for learning dense embeddings. We speculate that, in feature embedding, the most important thing is to bring similar features closer in the latent space, while pushing negatively correlated feature embeddings apart may not yield a better representation, since those features may already be well separated in the latent space.
Table 4. Comparing performances with positive-correlation-only graph and positive+negative correlation graph in GFE-TransE+BERT model

Conclusions
By combining the semantic dense embeddings and the syntactic dense embeddings in a dual-channel neural network model, we propose a new methodology for readability assessment that captures both the semantic and the syntactic knowledge related to readability discrepancies. Experiments with six data sets and two proficiency levels show that our Dual-Model is better than the semantic-alone and the syntactic-alone baselines. We show that complementing semantic dense embeddings with syntactic dense embeddings learned from a correlation graph of linguistic features can produce better-informed representations for readability assessment. In future work, we will study other applicable algorithms and linguistic phenomena that could benefit from learning a syntactic latent space and syntactic dense embedding representations.