An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora

K Saravanan, Monojit Choudhury, Raghavendra Udupa, A Kumaran


Abstract
Named Entities (NEs) that occur in natural language text are important, especially with the advent of social media, and they play a critical role in the development of many natural language technologies. In this paper, we systematically analyze the patterns of occurrence and co-occurrence of NEs in standard large English news corpora, providing valuable insight for understanding the corpus and paving the way for the development of technologies that rely critically on handling NEs. We use two distinct approaches: a conventional statistical analysis that measures and reports the occurrence patterns of NEs in terms of frequency, growth, etc., and a complex-network-based analysis that measures the co-occurrence pattern in terms of connectivity, degree distribution, small-world phenomenon, etc. Our analysis indicates that (i) NEs form an open set in corpora and grow linearly, (ii) NEs divide into a kernel and a large periphery, with the peripheral NEs occurring rarely, and (iii) there is strong evidence of a small-world phenomenon. Our findings may suggest effective ways to construct NE lexicons to aid the efficient development of several natural language technologies.
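The two kinds of measurement named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of the document as the co-occurrence window, and the unweighted graph are all assumptions made here for illustration. It tracks how the set of distinct NEs grows as documents are read, then builds a co-occurrence graph and computes its degree distribution, average clustering coefficient, and average shortest-path length (high clustering together with short paths is the usual signature of a small-world network).

```python
from collections import Counter, deque
from itertools import combinations


def ne_growth(docs):
    """Number of distinct NEs seen after each document (hypothetical helper)."""
    seen, curve = set(), []
    for entities in docs:
        seen.update(entities)
        curve.append(len(seen))
    return curve


def build_cooccurrence_graph(docs):
    """Link two NEs whenever they appear in the same document.

    Assumption: the co-occurrence window is the whole document.
    """
    graph = {}
    for entities in docs:
        unique = set(entities)
        for ne in unique:
            graph.setdefault(ne, set())
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_distribution(graph):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(neighbours) for neighbours in graph.values())


def average_clustering(graph):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    total = 0.0
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph) if graph else 0.0


def average_path_length(graph):
    """Mean shortest-path length over reachable ordered pairs, via BFS."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    # Toy input: each inner list stands for the NEs tagged in one document.
    toy_docs = [
        ["Obama", "Washington", "Senate"],
        ["Obama", "Chicago"],
        ["Senate", "Washington"],
        ["Chicago", "Illinois"],
    ]
    g = build_cooccurrence_graph(toy_docs)
    print(ne_growth(toy_docs))
    print(sorted(degree_distribution(g).items()))
    print(average_clustering(g), average_path_length(g))
```

On a real corpus the input would come from an NE tagger, and the quadratic all-pairs BFS would likely need sampling or a graph library; the sketch only shows what each reported quantity means.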
Anthology ID:
L12-1139
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3118–3125
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/305_Paper.pdf
Cite (ACL):
K Saravanan, Monojit Choudhury, Raghavendra Udupa, and A Kumaran. 2012. An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3118–3125, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora (Saravanan et al., LREC 2012)