TallVocabL2Fi: A Tall Dataset of 15 Finnish L2 Learners’ Vocabulary

Frankie Robertson, Li-Hsin Chang, Sini Söyrinki


Abstract
Previous work concerning measurement of second language learners has tended to focus on the knowledge of small numbers of words, often geared towards measuring vocabulary size. This paper presents a “tall” dataset containing information about a few learners’ knowledge of many words, suitable for evaluating Vocabulary Inventory Prediction (VIP) techniques, including those based on Computerised Adaptive Testing (CAT). In comparison to previous comparable datasets, the learners are from varied backgrounds, so as to reduce the risk of overfitting when used for machine-learning-based VIP. The dataset contains both a self-rating test and a translation test, used to derive a measure of reliability for learner responses. The dataset creation process is documented, and the relationships between variables concerning the participants, such as their completion time, their language ability level, and the triangulated reliability of their self-assessment responses, are analysed. The word list is constructed by taking into account the extensive derivational morphology of Finnish, and infrequent words are included in order to account for explanatory variables beyond word frequency.
Anthology ID:
2022.lrec-1.685
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6377–6386
URL:
https://aclanthology.org/2022.lrec-1.685
Cite (ACL):
Frankie Robertson, Li-Hsin Chang, and Sini Söyrinki. 2022. TallVocabL2Fi: A Tall Dataset of 15 Finnish L2 Learners’ Vocabulary. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6377–6386, Marseille, France. European Language Resources Association.
Cite (Informal):
TallVocabL2Fi: A Tall Dataset of 15 Finnish L2 Learners’ Vocabulary (Robertson et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.685.pdf