Jenny A. Ortiz-Zambrano

Also published as: Jenny A. Ortiz Zambrano

2021

CLexIS2: A New Corpus for Complex Word Identification Research in Computing Studies
Jenny A. Ortiz Zambrano | Arturo Montejo-Ráez
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Reading is a complex process not only because of the words or sections that are difficult for the reader to understand. Complex word identification (CWI) is the task of detecting the words in a document that are difficult for a given group of readers to understand. Annotated corpora for English learners are widely available, while they are less common for the Spanish language. In this article, we present CLexIS², a new corpus in Spanish intended to advance research in Lexical Simplification, specifically the identification and prediction of complex words in computing studies. Several metrics used to evaluate the complexity of texts in Spanish were applied, such as LC, LDI, ILFW, SSR, SCI, ASL, and CS. Furthermore, as an initial baseline, two experiments were performed to predict the complexity of words: one using a supervised learning approach and the other using an unsupervised solution based on the frequency of words in a general corpus.

Complex words identification using word-level features for SemEval-2021 Task 1
Jenny A. Ortiz-Zambrano | Arturo Montejo-Ráez
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This article describes a system to predict the complexity of words for the Lexical Complexity Prediction (LCP) shared task hosted at SemEval 2021 (Task 1), which used a new English dataset annotated on a Likert scale. Located in the Lexical Semantics track, the task consisted of predicting the complexity value of words in context. A machine learning approach was carried out based on the frequency of the words and several features added at word level. Over these features, a supervised random forest regression algorithm was trained. Several runs were performed with different values to observe the performance of the algorithm. For the evaluation, our best results were an MAE of 0.07347, an MSE of 0.00938, and an RMSE of 0.096871. Our experiments showed that, with a greater number of features, the precision of the classification increases.
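
As an illustration of the three error measures named in the abstract above, the following is a minimal sketch of how MAE, MSE, and RMSE are computed for predicted complexity scores. The function name `regression_errors` and the sample scores are hypothetical and chosen only for this example; they are not the authors' code or data.

```python
import math


def regression_errors(gold, predicted):
    """Return (MAE, MSE, RMSE) for two equal-length sequences of scores."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [g - p for g, p in zip(gold, predicted)]
    mae = sum(abs(d) for d in diffs) / len(diffs)  # mean absolute error
    mse = sum(d * d for d in diffs) / len(diffs)   # mean squared error
    return mae, mse, math.sqrt(mse)                # RMSE is the square root of MSE


# Made-up complexity scores in [0, 1], for illustration only.
gold = [0.25, 0.50, 0.75, 0.00]
predicted = [0.50, 0.50, 0.50, 0.50]
mae, mse, rmse = regression_errors(gold, predicted)
print(mae, mse, rmse)
```

Because RMSE is defined as the square root of MSE, the two values always move together, while MAE weights large and small errors equally.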

Co-authors

Arturo Montejo-Ráez 2

Venues

RANLP 1
SemEval 1
