Multilingual Resources for Lexical Complexity Prediction: A Review

Matthew Shardlow; Kai North; Marcos Zampieri

Abstract

Lexical complexity prediction is the NLP task of using machine learning to predict the difficulty of a target word in context for a given user or user group. Multiple datasets exist for lexical complexity prediction, many of which have been published recently in diverse languages. In this survey, we discuss nine recent datasets (2018-2024), all of which provide lexical complexity prediction annotations. In particular, we identify eight languages (French, Spanish, Chinese, German, Russian, Japanese, Turkish and Portuguese) with at least one lexical complexity dataset. We do not consider the English datasets, which have already received significant treatment elsewhere in the literature. To survey these datasets, we use the recommendations of the Complex 2.0 Framework (Shardlow et al., 2022), identifying how the datasets differ along the following dimensions: annotation scale, context, multiple token instances, multiple token annotations, and diverse annotators. We conclude with future research challenges arising from our survey of existing lexical complexity prediction datasets.

Anthology ID: 2024.determit-1.5
Volume: Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024
Month: May
Year: 2024
Address: Torino, Italia
Editors: Giorgio Maria Di Nunzio, Federica Vezzani, Liana Ermakova, Hosein Azarbonyad, Jaap Kamps
Venues: DeTermIt | WS
Publisher: ELRA and ICCL
Pages: 51–59
URL: https://aclanthology.org/2024.determit-1.5/
Cite (ACL): Matthew Shardlow, Kai North, and Marcos Zampieri. 2024. Multilingual Resources for Lexical Complexity Prediction: A Review. In Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024, pages 51–59, Torino, Italia. ELRA and ICCL.
Cite (Informal): Multilingual Resources for Lexical Complexity Prediction: A Review (Shardlow et al., DeTermIt 2024)
PDF: https://aclanthology.org/2024.determit-1.5.pdf
