An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework

Matthew Shardlow, Fernando Alva-Manchego, Riza Batista-Navarro, Stefan Bott, Saul Calderon Ramirez, Rémi Cardon, Thomas François, Akio Hayakawa, Andrea Horbach, Anna Hülsing, Yusuke Ide, Joseph Marvin Imperial, Adam Nohejl, Kai North, Laura Occhipinti, Nelson Peréz Rojas, Nishat Raihan, Tharindu Ranasinghe, Martin Solis Salazar, Marcos Zampieri, Horacio Saggion


Abstract
We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises of 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to support low-ability readers to engage with otherwise difficult texts in their native, often low-resourced, languages.
Anthology ID:
2024.readi-1.4
Volume:
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Rodrigo Wilkens, Rémi Cardon, Amalia Todirascu, Núria Gala
Venues:
READI | WS
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
38–46
Language:
URL:
https://aclanthology.org/2024.readi-1.4
DOI:
Bibkey:
Cite (ACL):
Matthew Shardlow, Fernando Alva-Manchego, Riza Batista-Navarro, Stefan Bott, Saul Calderon Ramirez, Rémi Cardon, Thomas François, Akio Hayakawa, Andrea Horbach, Anna Hülsing, Yusuke Ide, Joseph Marvin Imperial, Adam Nohejl, Kai North, Laura Occhipinti, Nelson Peréz Rojas, Nishat Raihan, Tharindu Ranasinghe, Martin Solis Salazar, et al.. 2024. An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework. In Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024, pages 38–46, Torino, Italia. ELRA and ICCL.
Cite (Informal):
An Extensible Massively Multilingual Lexical Simplification Pipeline Dataset using the MultiLS Framework (Shardlow et al., READI-WS 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.readi-1.4.pdf