A Multilingual Survey of Recent Lexical Complexity Prediction Resources through the Recommendations of the Complex 2.0 Framework

Matthew Shardlow, Kai North, Marcos Zampieri


Abstract
Lexical complexity prediction is the NLP task of using machine learning to predict the difficulty of a target word in context for a given user or user group. Multiple datasets exist for lexical complexity prediction, many of which have been published recently in diverse languages. In this survey, we discuss nine recent datasets (2018-2024), all of which provide lexical complexity prediction annotations. In particular, we identify eight languages (French, Spanish, Chinese, German, Russian, Japanese, Turkish and Portuguese) with at least one lexical complexity dataset. We do not consider the English datasets, which have already received significant treatment elsewhere in the literature. To survey these datasets, we use the recommendations of the Complex 2.0 Framework (Shardlow et al., 2022), identifying how the datasets differ along the following dimensions: annotation scale, context, multiple token instances, multiple token annotations, and diverse annotators. We conclude with future research challenges arising from our survey of existing lexical complexity prediction datasets.
Anthology ID:
2024.determit-1.5
Volume:
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Giorgio Maria Di Nunzio, Federica Vezzani, Liana Ermakova, Hosein Azarbonyad, Jaap Kamps
Venues:
DeTermIt | WS
Publisher:
ELRA and ICCL
Pages:
51–59
URL:
https://aclanthology.org/2024.determit-1.5
Cite (ACL):
Matthew Shardlow, Kai North, and Marcos Zampieri. 2024. A Multilingual Survey of Recent Lexical Complexity Prediction Resources through the Recommendations of the Complex 2.0 Framework. In Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, pages 51–59, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Multilingual Survey of Recent Lexical Complexity Prediction Resources through the Recommendations of the Complex 2.0 Framework (Shardlow et al., DeTermIt-WS 2024)
PDF:
https://aclanthology.org/2024.determit-1.5.pdf