Cross-Lingual Leveled Reading Based on Language-Invariant Features

Leveled reading (LR) aims to automatically classify texts according to different reading capabilities and provide appropriate reading materials to readers. However, most state-of-the-art LR methods rely on the availability of copious annotated resources, which prevents their adaptation to low-resource languages like Chinese. In our work, to tackle Chinese LR, we explore different language-transfer methods for English-Chinese LR. Specifically, we focus on adversarial training and cross-lingual pre-training to transfer the LR knowledge learned from annotated data in the resource-rich English language to Chinese. For evaluation, we introduce an age-based standard to align datasets with different leveling standards, and conduct experiments in both zero-shot and few-shot settings. Experiments show that the cross-lingual pre-training method captures language-invariant features more effectively than adversarial training. We also analyze the results and suggest further improvements for cross-lingual LR.


Introduction
Imagine searching for appropriate reading materials for a 10-year-old child in a bookstore: The Tale of Peter Rabbit is a bit outdated; Animal Farm, though it sounds suitable, is too allegorical; the Harry Potter series may be just right for the age. Leveled reading (LR) provides such selection guides by automatically classifying texts with regard to the reading level appropriate for readers, which has proven to be of importance in multiple fields, including education (Lennon and Burdick, 2004), health (Petkovic et al., 2015) and advertisement (Chebat et al., 2003). Different from traditional readability assessment (Aluisio et al., 2010; Madrazo Azpiazu and Pera, 2020), which is formulated as a binary classification problem, LR can be regarded as a multi-class classification task that provides specific reading levels with regard to the cognitive reading level instead of text quality. This fine-grained leveling forms a fundamental component in downstream applications, since it is essential to assign different levels even within the Harry Potter series as the stories get darker.
Most previous research focuses on English by extracting language-specific features, ranging from traditional readability formulas to machine learning methods. As the most widely studied language in LR, English holds mature LR standards with abundant reading materials, such as Lexile (Lennon and Burdick, 2004) and Accelerated Reader (Topping et al., 2008), and has recently developed a set of LR datasets for training automatic methods, such as the WeeBit (Vajjala and Meurers, 2012), NewSela (Xu et al., 2015) and OneStopEnglish (Vajjala and Lučić, 2018a) corpora. By contrast, low-resource languages like Chinese lack both established LR standards and training data, and as a result only a little LR research has been conducted in Chinese (Sun et al., 2020). Can we use the existing resources of English to guide the cross-lingual LR of low-resource languages such as Chinese?
There has been a recent trend towards learning language-invariant features to ease cross-lingual generalization from high-resource languages to low-resource languages (Litschko et al., 2018; Kondratyuk and Straka, 2019). We hypothesize that these language-invariant features also exist in LR, especially across equivalent reading levels in different languages, and that they may be automatically extracted through deep learning methods.
For example, reading materials at the equivalent level in different languages may tell a similar story and express the same thoughts, and they may exhibit similar changes in text structure and vocabulary as the level changes.
Thus, to verify our hypothesis and transfer English LR knowledge into Chinese, we explore both adversarial training and cross-lingual pre-training methods to extract language-invariant features from English and Chinese LR corpora to guide LR in Chinese. Overall, our contributions are summarized as follows:

• We organize the available LR datasets and process new LR datasets, including three LR corpora for English, and a variety of textbooks across 12 grade levels and extracurricular books in Chinese. We re-classify the datasets according to age into a uniform standard of reading levels to map both Chinese and English datasets for transfer learning.

• We explore the performance of two transfer learning methods, adversarial training and cross-lingual pre-training, on our aligned LR datasets.
Related Work

Leveled Reading Methods
Early works on LR devised various readability formulas, such as the Gunning Fog Index (Gunning, 1952), Automated Readability Index (Senter and Smith, 1967) and Flesch Reading Ease (Kincaid et al., 1975), which mainly rely on shallow language features based on ratios of characters, phrases and sentences. Later work adopted statistical machine learning methods based on extensive feature engineering, which generally improved accuracy by capturing semantic and contextual features (Vajjala and Meurers, 2012; Xia et al., 2016; Vajjala and Lučić, 2018b). Recently, Martinc et al. (2019) and Deutsch et al. (2020) used deep neural networks to enhance LR and achieved state-of-the-art performance. Due to resource limitations, only a few works study LR in Chinese (Liu et al., 2017; Sun et al., 2020), which does not have copious annotated data like English.

Cross-Lingual Methods
Cross-lingual methods transfer knowledge from high-resource languages with abundant annotated data to low-resource target languages with limited or even no annotated data. Some works trained cross-lingual representations based on bilingual parallel corpora (Mikolov et al., 2013; Gouws et al., 2015); other works used direct transfer methods by employing self-training or unsupervised models such as adversarial training (Chen et al., 2018) and heuristic initialization (Artetxe et al.). Madrazo Azpiazu and Pera (2020) first proposed a cross-lingual strategy for enhancing readability assessment as a binary classification problem, which showed improved accuracy for low-resource languages.

LR Datasets
In this section, we elaborate on the collected LR datasets, which are classified by grade-, letter-, and number-based standards. We re-align these datasets using an age-based standard, since each of these standards designates an approximate age range: different grades have corresponding age ranges (for example, grades 3-4 are suitable for children aged 8 to 9). The datasets we use in our models are all processed text data rather than printed books, so the physical manifestations of a printed book, such as text structure, page layout and illustration, are lost, although they play an important part in our task, as in many NLP tasks. In future work, we aim to use the data more comprehensively to recover this information.

Standard Benchmarks: Since leveling standards vary across datasets, previous methods are trained and evaluated independently on each dataset (Martinc et al., 2019; Deutsch et al., 2020). To align the English and Chinese datasets, we map each dataset into five reading levels with respect to different ages, as shown in Table 1. For example, the original standards of the WeeBit corpus overlap on neighboring levels, so we take the lower boundary of each overlapping level as the standard level. We re-classify the RAZ dataset according to age into three reading levels; for example, levels D to P are suitable for children aged 6 to 7.
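The age-based alignment described above can be sketched as a simple lookup from a reader's age to a unified level, with each source standard first mapped to an approximate age. The age boundaries and the grade-to-age table below are illustrative placeholders, not the exact mapping given in Table 1:

```python
# Hypothetical unified levels keyed by age range: (min_age, max_age, level).
# Boundaries are illustrative; the paper's actual mapping is in Table 1.
AGE_LEVELS = [(6, 7, 1), (8, 9, 2), (10, 11, 3), (12, 14, 4), (15, 17, 5)]

def age_to_level(age):
    """Map a reader age to a unified reading level."""
    for lo, hi, level in AGE_LEVELS:
        if lo <= age <= hi:
            return level
    raise ValueError(f"age {age} outside the covered range")

# Each dataset's native standard is mapped to an approximate age first,
# e.g. a grade-based standard (illustrative values):
GRADE_TO_AGE = {1: 6, 2: 7, 3: 8, 4: 9, 5: 10, 6: 11}

def grade_to_level(grade):
    """Align a grade-based label with the unified age-based levels."""
    return age_to_level(GRADE_TO_AGE[grade])
```

With this indirection through age, any grade-, letter-, or number-based standard can be projected onto the same five levels.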

Methodology
Adversarial training and pre-training are two recently popular deep learning methods that learn better text representations and have been applied to cross-lingual tasks to extract common features. Inspired by this, we apply both methods to our cross-lingual LR task.

Adversarial Model for LR
We extend the ADAN model (Chen et al., 2018) to capture language-invariant features. The network contains three main components: a joint feature extractor F that maps the input to a shared feature space, a language discriminator D that predicts whether the input is from English or Chinese, and an LR classifier R that classifies texts into their reading levels, as shown in Figure 1. If the language discriminator cannot distinguish between Chinese and English inputs, we conclude that the model has learned language-invariant features.
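A minimal NumPy sketch of this three-component structure, with hypothetical layer sizes; the adversarial training loop (gradient reversal between F and D) is omitted, and only the forward passes are shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Create weights for a small fully-connected network."""
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Forward pass with ReLU on all hidden layers."""
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

d_in, d_feat, n_levels = 16, 8, 5          # hypothetical dimensions
F = mlp([d_in, d_feat, d_feat, d_feat])    # joint feature extractor
R = mlp([d_feat, d_feat, d_feat, n_levels])  # LR classifier (5 levels)
D = mlp([d_feat, d_feat, d_feat, 2])       # language discriminator (en/zh)

x = rng.standard_normal((4, d_in))         # a batch of text representations
feats = forward(F, x)                      # shared feature space
level_logits = forward(R, feats)           # reading-level prediction
lang_logits = forward(D, feats)            # language prediction
```

During training, R is optimized to predict levels from English labels while F is optimized to fool D, so the shared features carry level information but little language information.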

Pre-training Model for LR
Cross-lingual Language Model (XLM) (Conneau and Lample, 2019) is a transformer-based (Vaswani et al., 2017) model pre-trained on the Wikipedias of 104 languages using masked language modeling, achieving state-of-the-art results on multiple cross-lingual tasks (Ruder et al., 2019), especially for low-resource languages, by training on high-resource languages. The model uses a shared vocabulary and adopts byte-pair encoding as the tokenizer. In our LR setting, we fine-tune XLM by adding a classification layer with softmax on top of XLM for LR prediction.

Dataset: We split the datasets described above into training, validation and test sets in an 8:1:1 ratio. Specifically, WeeBit and RAZ are used as the English training set during zero-shot training, and the CN-textbooks data is used for few-shot training. CN-extra books are used only as test data.
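The 8:1:1 split can be sketched as follows; the function name and seed are our own illustration, not part of the paper's released code:

```python
import random

def split_8_1_1(examples, seed=42):
    """Shuffle a dataset and split it into train/val/test at 8:1:1."""
    items = list(examples)
    random.Random(seed).shuffle(items)   # deterministic shuffle for reproducibility
    n = len(items)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

The same split is applied per dataset, so the English portions feed zero-shot training while the Chinese portions supply the few-shot and test material.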

Experimental Settings
XLM: We use the pre-trained XLM-RoBERTa (XLM-R) model downloaded from Hugging Face, unmodified. We run 20 epochs with a batch size of 32 during zero-shot and few-shot training. We adopt Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 1e-5. Since the pre-trained model has a limited input length and all our data consists of long texts, we divide each article into pieces according to paragraphs and take only the first 512 tokens of each piece to reduce the effect of XLM's length limit.

ADAN: The feature extractor F, LR classifier R and language discriminator D each have three fully-connected layers with ReLU activations. We adopt Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 5e-4.

Baselines: We apply existing English readability formulas to Chinese text, adopting two highly recognized and relatively suitable formulas for comparison: the Gunning Fog Index (GFI) (Gunning, 1952) and the Flesch Reading Ease (FRE) (Kincaid et al., 1975). Since these readability formulas originated with English texts, we apply them directly to the Chinese evaluation test set.
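The two baseline formulas are computed from surface counts as defined in the original papers; the helper signatures below are our own. Note that syllables and "complex words" (three or more syllables) are ill-defined for Chinese, which foreshadows the weak baseline results:

```python
def gunning_fog(words, sentences, complex_words):
    """Gunning Fog Index: 0.4 * (avg sentence length + % complex words).
    complex_words = number of words with three or more syllables."""
    return 0.4 * (words / sentences + 100.0 * complex_words / words)

def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
```

For example, a 100-word text with 10 sentences and 10 complex words has a Fog Index of 8.0 (roughly an 8th-grade reading level in the English interpretation).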

Results and Analysis
We show the experimental results of all methods in Table 2. From the table, we have the following three observations: (1) The readability formulas GFI and FRE perform the worst in both zero-shot and few-shot settings, which may result from the fact that word length is generally fixed for Chinese words, and thus is not an effective LR indicator.
(2) For the better-performing ADAN and XLM, results in the few-shot setting are generally better than in the zero-shot setting. (3) XLM outperforms ADAN in both the zero-shot and few-shot settings. These results show that ADAN and XLM can indeed assist LR in low-resource languages. Concerning the advantage of XLM over ADAN, we speculate that XLM better captures high-level semantics such as the topic and theme of the texts.
Since this paper mainly aims to explore different transfer methods for Chinese LR, we leave the investigation of high-level semantics to future work. To explore the impact of different datasets, we evaluate using the best-performing XLM method. Since the datasets differ in covered levels, we conduct experiments in two settings: one based on three levels (1 to 3) for readers aged 6 to 11, and the other based on five levels (1 to 5) for readers aged 6 to 17. As shown in Table 3, XLM trained on RAZ performs best in both settings, indicating that RAZ offers greater guidance for cross-lingual LR.
As shown in Figure 2, the overall experimental results show a clear trend: results on the edge levels are better than those on the middle levels. We speculate that one reason for this divergence may be that not all textbooks of one grade have the same difficulty. Specifically, RAZ covers human geography, cognition, fairy tales, legends and novels, which may assist LR with respect to differences in theme. In future work, it would be beneficial to analyze the impact of different text types on LR and to consider combining vocabulary, grammar, and other relevant information, which would provide better guidance for cross-lingual LR. In addition to improving the quality of the corpus and expanding it, we can explore more low-resource and cross-lingual methods for our task. Furthermore, we could incorporate additional knowledge about LR, such as vocabulary difficulty and topic information, into our model.

Conclusion
In our work, we explore two methods to tackle Chinese LR using deep neural networks without any extra features: the adversarial training model ADAN and the cross-lingual pre-trained language model XLM.
We organize and re-classify the LR datasets, including three LR corpora for English, and a variety of textbooks across 12 grade levels and recommended extracurricular books in Chinese. To the best of our knowledge, this is the first attempt to integrate different corpora and leverage neural language models for cross-lingual LR. Experimental results show that the cross-lingual pre-trained language model is more effective, and that we can leverage only an English corpus to predict the reading level of Chinese text, which alleviates the data scarcity problem in the low-resource Chinese language. Some limitations remain in both our datasets and methods, and we have suggested directions for future development.