Complex Word Identification for Italian Language: A Dictionary–based Approach

Laura Occhipinti


Abstract
Assessing word complexity in Italian poses significant challenges, particularly because no standardized dataset exists. This study introduces the first automatic model designed to identify word complexity for native Italian speakers. A dictionary of simple and complex words was constructed, and various configurations of linguistic features were explored to find the best statistical classifier based on the Random Forest algorithm. Using the probability that a word belongs to each class, the models’ predictions were compared with human assessments derived from a dataset annotated for complexity perception. Finally, Spearman correlation was used to analyze how closely the model predictions match the human judgements, relative to the human inter-annotator agreement. Our findings indicate that a model incorporating both linguistic features and word embeddings performed better than simpler models, and that its correlation with the human judgements was similar to the inter-annotator agreement. This study demonstrates the feasibility of an automatic system for detecting complexity in Italian that performs well and is comparably effective to humans on this subjective task.
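The evaluation step described in the abstract, comparing class probabilities with human complexity ratings through Spearman correlation, can be illustrated with a minimal sketch. This is not the author's code: the function names `average_ranks` and `spearman` and the toy numbers are assumptions made for illustration, and the probabilities merely stand in for the output of a trained classifier.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data, NOT taken from the paper:
# probability of the "complex" class for five words, and the mean human
# complexity rating for the same five words.
model_prob_complex = [0.05, 0.20, 0.45, 0.70, 0.90]
human_rating = [1.0, 2.0, 1.5, 4.0, 5.0]

rho = spearman(model_prob_complex, human_rating)
print(f"Spearman rho = {rho:.2f}")
```

Because Spearman correlation depends only on rank order, it suits this comparison: the model's probabilities and the human ratings live on different scales, and only their relative ordering of words is compared.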
Anthology ID:
2024.clib-1.12
Volume:
Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024)
Month:
September
Year:
2024
Address:
Sofia, Bulgaria
Venue:
CLIB
Publisher:
Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Pages:
119–129
URL:
https://aclanthology.org/2024.clib-1.12/
Cite (ACL):
Laura Occhipinti. 2024. Complex Word Identification for Italian Language: A Dictionary–based Approach. In Proceedings of the Sixth International Conference on Computational Linguistics in Bulgaria (CLIB 2024), pages 119–129, Sofia, Bulgaria. Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences.
Cite (Informal):
Complex Word Identification for Italian Language: A Dictionary–based Approach (Occhipinti, CLIB 2024)
PDF:
https://aclanthology.org/2024.clib-1.12.pdf