Emma Tseng


Automated Classification of Written Proficiency Levels on the CEFR-Scale through Complexity Contours and RNNs
Elma Kerz | Daniel Wiechmann | Yu Qiao | Emma Tseng | Marcus Ströbel
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

Automatically predicting the level of second language (L2) learner proficiency is an emerging topic of interest and research based on machine learning approaches to language learning and development. The key to the present paper is the combined use of what we refer to as ‘complexity contours’, a series of measurements of indices of L2 proficiency obtained by a computational tool that implements a sliding window technique, and recurrent neural network (RNN) classifiers that adequately capture the sequential information in those contours. We used the EF-Cambridge Open Language Database (Geertzen et al. 2013) with its labelled Common European Framework of Reference (CEFR) levels (Council of Europe 2018) to predict six classes of L2 proficiency levels (A1, A2, B1, B2, C1, C2) in the assessment of writing skills. Our experiments demonstrate that an RNN classifier trained on complexity contours achieves higher classification accuracy than one trained on text-average complexity scores. In a secondary experiment, we determined the relative importance of features from four distinct categories through a sensitivity-based pruning technique. Our approach makes an important contribution to the field of automated identification of language proficiency levels, more specifically, to the increasing efforts towards the empirical validation of CEFR levels.
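The contrast between a complexity contour and a text-average score can be illustrated with a minimal sketch. The abstract does not name the tool, the indices, or the window settings, so everything below is an assumption made for illustration: the function names (`complexity_contour`, `text_average`, `mean_word_length`), the window size and step, and the use of mean word length as a stand-in complexity index are all hypothetical and are not the authors' implementation.

```python
from typing import Callable, List

Sentence = List[str]  # a sentence is a list of tokens


def mean_word_length(tokens: List[str]) -> float:
    """Toy complexity index (assumption): average token length in characters."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def complexity_contour(
    sentences: List[Sentence],
    index: Callable[[List[str]], float],
    window_size: int = 2,
    step: int = 1,
) -> List[float]:
    """Slide a window of `window_size` sentences over the text and apply
    `index` to the tokens in each window, yielding one score per window."""
    if window_size < 1 or step < 1:
        raise ValueError("window_size and step must be positive")
    if not sentences:
        return []
    if len(sentences) <= window_size:
        # Text shorter than one window: treat the whole text as a single window.
        windows = [sentences]
    else:
        last_start = len(sentences) - window_size
        windows = [sentences[i:i + window_size] for i in range(0, last_start + 1, step)]
    return [index([tok for sent in w for tok in sent]) for w in windows]


def text_average(sentences: List[Sentence], index: Callable[[List[str]], float]) -> float:
    """Single score for the whole text, discarding sequential information."""
    return index([tok for sent in sentences for tok in sent])


# Hypothetical learner text, already split into sentences and tokens.
example = [
    ["I", "like", "cats"],
    ["Cats", "sleep"],
    ["Nevertheless", "linguists", "disagree"],
]

contour = complexity_contour(example, mean_word_length, window_size=2, step=1)
average = text_average(example, mean_word_length)
print(contour, average)
```

In this sketch the contour keeps one value per window, so a rise in complexity toward the end of the text stays visible, whereas the text average collapses it into a single number. A sequence model such as an RNN could take contours like this (one per index) as input, which is the kind of sequential information the abstract describes.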