Simin Rao
2022
Promoting Pre-trained LM with Linguistic Features on Automatic Readability Assessment
Shudi Hou | Simin Rao | Yu Xia | Sujian Li
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Automatic readability assessment (ARA) aims to classify the readability level of a passage automatically. In the past, manually selected linguistic features were used to classify passages. However, as the use of deep neural networks has surged, less work has focused on these linguistic features. Recently, many works have integrated linguistic features with pre-trained language models (PLMs) to supply information that PLMs are not good at capturing. Despite their initial success, the long-passage characteristic of ARA has not been sufficiently analyzed. We further investigate how linguistic features promote PLMs in ARA from the perspective of passage length. Using commonly used linguistic features and extensive experiments, we find that: (1) Linguistic features promote PLMs in ARA mainly on long passages. (2) The promotion of the features on PLMs becomes less significant when the dataset size exceeds 750 passages. (3) Analysis of commonly used ARA datasets shows that Newsela is actually not suitable for ARA. Our code is available at https://github.com/recorderhou/linguistic-features-in-ARA.
2021
阅读分级相关研究综述(A Survey of Leveled Reading)
Simin Rao (饶思敏) | Hua Zheng (郑婳) | Sujian Li (李素建)
Proceedings of the 20th Chinese National Conference on Computational Linguistics
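The concept of leveled reading was proposed by educators in the early twentieth century. As reading has come to be valued more and more, leveled reading has drawn increasing attention, and automatic leveled reading techniques have developed to a certain extent. This paper summarizes recent research progress in the field of leveled reading. It first introduces the existing standards for leveled reading and the various systems and corpus resources that have arisen from them. On this basis, it reviews the three classes of methods that are already widely used in automatic leveled reading: formula-based methods, traditional machine learning methods, and the recently popular deep learning methods. Drawing on experimental results, it outlines the strengths and weaknesses of the three classes and the directions in which they can be improved. Finally, the paper summarizes and looks ahead to the future development of leveled reading and the areas where it can be applied.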
Cross-Lingual Leveled Reading Based on Language-Invariant Features
Simin Rao | Hua Zheng | Sujian Li
Findings of the Association for Computational Linguistics: EMNLP 2021
Leveled reading (LR) aims to automatically classify texts by the cognitive levels of readers, which is fundamental to providing appropriate reading materials for different reading abilities. However, most state-of-the-art LR methods rely on the availability of copious annotated resources, which prevents their adaptation to low-resource languages like Chinese. In our work, to tackle LR in Chinese, we explore how different language transfer methods perform on English-Chinese LR. Specifically, we focus on adversarial training and cross-lingual pre-training methods to transfer the LR knowledge learned from annotated data in the resource-rich English language to Chinese. For evaluation, we first introduce an age-based standard to align datasets with different leveling standards. Then we conduct experiments in both zero-shot and few-shot settings. Comparing these two methods, quantitative and qualitative evaluations show that the cross-lingual pre-training method effectively captures the language-invariant features between English and Chinese. We also conduct analysis to propose further improvements in cross-lingual LR.