Weixiong Zheng


2024

pdf bib
IndoCL: Benchmarking Indonesian Language Development Assessment
Nankai Lin | Hongyan Wu | Weixiong Zheng | Xingming Liao | Shengyi Jiang | Aimin Yang | Lixian Xiao
Findings of the Association for Computational Linguistics: EMNLP 2024

Recently, the field of language acquisition (LA) has significantly benefited from natural language processing technologies. A crucial task in LA involves tracking the evolution of language learners’ competence, namely language development assessment (LDA). However, the majority of LDA research focuses on high-resource languages, with limited attention directed toward low-resource languages. Moreover, existing methodologies primarily depend on linguistic rules and language characteristics, with a limited exploration of exploiting pre-trained language models (PLMs) for LDA. In this paper, we construct the IndoCL corpus (Indonesian Corpus of L2 Learners), which comprises compositions written by undergraduate students majoring in Indonesian language. Moreover, we propose a model for LDA tasks, which automatically extracts language-independent features, relieving laborious computation and reliance on specific language. The proposed model uses sequential information attention and similarity representation learning to capture the differences and common information from the first-written and second-written essays, respectively. It has demonstrated remarkable performance on both our self-constructed corpus and publicly available corpora. Our work could serve as a novel benchmark for Indonesian LDA tasks. We also explore the feasibility of using existing large-scale language models (LLMs) for LDA tasks. The results show significant potential for improving LLM performance in LDA tasks.