Simple Neologism Based Domain Independent Models to Predict Year of Authorship

Vivek Kulkarni, Yingtao Tian, Parth Dandiwala, Steve Skiena


Abstract
We present domain-independent models to date documents based only on neologism usage patterns. Our models capture patterns of neologism usage over time to date texts, provide insights into the temporal locality of word usage over a span of 150 years, and generalize to various domains such as News, Fiction, and Non-Fiction with competitive performance. Intriguingly, we show that by modeling only the distribution of usage counts over neologisms (the model being agnostic of the particular words themselves), we achieve competitive performance using several orders of magnitude fewer features (only 200 input features) than state-of-the-art models, some of which use 200K features.
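The abstract does not spell out how the word-agnostic features are built, so the following is only a minimal sketch of one way a "distribution of usage counts over neologisms" could be computed. It assumes a hypothetical lexicon `first_use_year` that maps each neologism to the year it entered usage, and it bins counts by that year. The function name, the one-bin-per-year layout, and the normalization are illustrative assumptions, not the authors' method.

```python
from collections import Counter


def usage_count_distribution(tokens, first_use_year, start_year, end_year):
    """Return a normalized histogram of neologism usage by first-use year.

    tokens: list of word tokens from one document.
    first_use_year: hypothetical dict mapping a lowercase word to the year
        it entered usage (an assumption; not taken from the paper).
    The result has one entry per year in [start_year, end_year]. It records
    how many neologism tokens fall into each year, not which words they were.
    """
    n_bins = end_year - start_year + 1
    counts = Counter()
    for tok in tokens:
        year = first_use_year.get(tok.lower())
        if year is not None and start_year <= year <= end_year:
            counts[year - start_year] += 1
    total = sum(counts.values())
    if total == 0:
        # No known neologisms in the document: return an all-zero vector.
        return [0.0] * n_bins
    return [counts[i] / total for i in range(n_bins)]


# Placeholder lexicon and document; the words and years are made up.
toy_lexicon = {"aaa": 1900, "bbb": 1902, "ccc": 1904}
doc = ["aaa", "bbb", "bbb", "zzz"]  # "zzz" is not in the lexicon and is ignored
features = usage_count_distribution(doc, toy_lexicon, 1900, 1904)
```

A vector like `features` could then be passed to any standard regressor to predict a year. Because it contains no word identities, its length depends only on the number of bins and not on vocabulary size.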
Anthology ID:
C18-1017
Volume:
Proceedings of the 27th International Conference on Computational Linguistics
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Editors:
Emily M. Bender, Leon Derczynski, Pierre Isabelle
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
202–212
URL:
https://aclanthology.org/C18-1017/
Cite (ACL):
Vivek Kulkarni, Yingtao Tian, Parth Dandiwala, and Steve Skiena. 2018. Simple Neologism Based Domain Independent Models to Predict Year of Authorship. In Proceedings of the 27th International Conference on Computational Linguistics, pages 202–212, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
Simple Neologism Based Domain Independent Models to Predict Year of Authorship (Kulkarni et al., COLING 2018)
PDF:
https://aclanthology.org/C18-1017.pdf