WiNER: A Wikipedia Annotated Corpus for Named Entity Recognition

Abbas Ghaddar; Philippe Langlais

Abstract

We revisit the idea of mining Wikipedia to generate named-entity annotations. We propose a new methodology, which we apply to English Wikipedia to build WiNER, a large, high-quality annotated corpus. We evaluate its usefulness on 6 NER tasks, comparing 4 popular state-of-the-art approaches. We show that LSTM-CRF is the approach that benefits the most from our corpus. We report impressive gains with this model when using a small portion of WiNER on top of the CONLL training material. Finally, we propose a simple but efficient method for exploiting the full range of WiNER, leading to further improvements.

Anthology ID: I17-1042
Volume: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Month: November
Year: 2017
Address: Taipei, Taiwan
Editors: Greg Kondrak, Taro Watanabe
Venue: IJCNLP
Publisher: Asian Federation of Natural Language Processing
Pages: 413–422
URL: https://aclanthology.org/I17-1042/
Cite (ACL): Abbas Ghaddar and Philippe Langlais. 2017. WiNER: A Wikipedia Annotated Corpus for Named Entity Recognition. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 413–422, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Cite (Informal): WiNER: A Wikipedia Annotated Corpus for Named Entity Recognition (Ghaddar & Langlais, IJCNLP 2017)
PDF: https://aclanthology.org/I17-1042.pdf
