CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups

Seid Muhie Yimam, Sanja Štajner, Martin Riedl, Chris Biemann


Abstract
Complex word identification (CWI) is an important task in text accessibility. However, due to the scarcity of CWI datasets, previous studies have only addressed this problem on Wikipedia sentences and have solely taken into account the needs of non-native English speakers. We collect a new CWI dataset (CWIG3G2) covering three text genres (News, WikiNews, and Wikipedia) annotated by both native and non-native English speakers. Unlike previous datasets, we cover single words, as well as complex phrases, and present them for judgment in a paragraph context. We present the first study on cross-genre and cross-group CWI, showing measurable influences in native language and genre types.
Anthology ID:
I17-2068
Volume:
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Month:
November
Year:
2017
Address:
Taipei, Taiwan
Venue:
IJCNLP
Publisher:
Asian Federation of Natural Language Processing
Pages:
401–407
URL:
https://aclanthology.org/I17-2068
Cite (ACL):
Seid Muhie Yimam, Sanja Štajner, Martin Riedl, and Chris Biemann. 2017. CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 401–407, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Cite (Informal):
CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups (Yimam et al., IJCNLP 2017)
PDF:
https://aclanthology.org/I17-2068.pdf