Complex Word Identification: Challenges in Data Annotation and System Performance

Marcos Zampieri, Shervin Malmasi, Gustavo Paetzold, Lucia Specia


Abstract
This paper revisits the problem of complex word identification (CWI), following up on the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed poorly on the SemEval CWI dataset, and that one reason for this is the way in which the human annotation was performed.
Anthology ID:
W17-5910
Volume:
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)
Month:
December
Year:
2017
Address:
Taipei, Taiwan
Editors:
Yuen-Hsien Tseng, Hsin-Hsi Chen, Lung-Hao Lee, Liang-Chih Yu
Venue:
NLP-TEA
Publisher:
Asian Federation of Natural Language Processing
Pages:
59–63
URL:
https://aclanthology.org/W17-5910
Cite (ACL):
Marcos Zampieri, Shervin Malmasi, Gustavo Paetzold, and Lucia Specia. 2017. Complex Word Identification: Challenges in Data Annotation and System Performance. In Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pages 59–63, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Cite (Informal):
Complex Word Identification: Challenges in Data Annotation and System Performance (Zampieri et al., NLP-TEA 2017)
PDF:
https://aclanthology.org/W17-5910.pdf