On the Strength of Character Language Models for Multilingual Named Entity Recognition

Xiaodong Yu, Stephen Mayhew, Mark Sammons, Dan Roth


Abstract
Character-level patterns have been widely used as features in English Named Entity Recognition (NER) systems. However, to date there has been no direct investigation of the inherent differences between name and non-name tokens in text, nor whether this property holds across multiple languages. This paper analyzes the capabilities of corpus-agnostic Character-level Language Models (CLMs) in the binary task of distinguishing name tokens from non-name tokens. We demonstrate that CLMs provide a simple and powerful model for capturing these differences, identifying named entity tokens in a diverse set of languages at close to the performance of full NER systems. Moreover, by adding very simple CLM-based features we can significantly improve the performance of an off-the-shelf NER system for multiple languages.
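The binary task described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes an add-one-smoothed character bigram model, a toy training list, and a decision rule that picks whichever model gives the token the higher average log-probability per character. The names `CharNgramLM`, `is_name`, `name_lm`, and `other_lm` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


class CharNgramLM:
    """Add-one-smoothed character n-gram language model (illustrative only)."""

    def __init__(self, tokens, n=2):
        self.n = n
        self.ngrams = Counter()    # counts of (context + next character)
        self.contexts = Counter()  # counts of each context
        self.vocab = set()
        for tok in tokens:
            chars = self._pad(tok)
            self.vocab.update(chars)
            for i in range(n - 1, len(chars)):
                ctx = tuple(chars[i - n + 1:i])
                self.ngrams[ctx + (chars[i],)] += 1
                self.contexts[ctx] += 1

    def _pad(self, tok):
        # Boundary symbols let the model learn word-initial and word-final patterns.
        return ["<s>"] * (self.n - 1) + list(tok) + ["</s>"]

    def avg_log_prob(self, tok):
        """Average log-probability per predicted character (higher = better fit)."""
        chars = self._pad(tok)
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(chars)):
            ctx = tuple(chars[i - self.n + 1:i])
            num = self.ngrams[ctx + (chars[i],)] + 1
            den = self.contexts[ctx] + v
            total += math.log(num / den)
        return total / (len(chars) - self.n + 1)


def is_name(token, name_lm, other_lm):
    """Label a token as a name if the name model fits it better."""
    return name_lm.avg_log_prob(token) > other_lm.avg_log_prob(token)


# Toy training data (an assumption for this sketch, not taken from the paper).
name_lm = CharNgramLM(["Obama", "Brussels", "London", "Merkel", "Tokyo", "Roth"])
other_lm = CharNgramLM(["the", "and", "running", "quickly", "because", "with"])

for token in ["London", "running"]:
    print(token, is_name(token, name_lm, other_lm))
```

With realistic training lists drawn from an annotated corpus, the same two-model comparison could be run with a higher n-gram order or a neural character model; the scores could also be passed to an existing NER system as extra features, in the spirit of the last sentence of the abstract.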
Anthology ID:
D18-1345
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
3073–3077
URL:
https://aclanthology.org/D18-1345
DOI:
10.18653/v1/D18-1345
Cite (ACL):
Xiaodong Yu, Stephen Mayhew, Mark Sammons, and Dan Roth. 2018. On the Strength of Character Language Models for Multilingual Named Entity Recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3073–3077, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
On the Strength of Character Language Models for Multilingual Named Entity Recognition (Yu et al., EMNLP 2018)
PDF:
https://aclanthology.org/D18-1345.pdf
Data
CoNLL 2003