Simple Neologism Based Domain Independent Models to Predict Year of Authorship
Vivek Kulkarni | Yingtao Tian | Parth Dandiwala | Steve Skiena
Proceedings of the 27th International Conference on Computational Linguistics
We present domain-independent models that date documents based only on neologism usage patterns. Our models capture how neologism usage changes over time, provide insights into the temporal locality of word usage over a span of 150 years, and generalize to domains such as News, Fiction, and Non-Fiction with competitive performance. Notably, we show that by modeling only the distribution of usage counts over neologisms (the model being agnostic to the particular words themselves), we achieve competitive performance with only 200 input features, roughly three orders of magnitude fewer than the 200K features used by some state-of-the-art models.
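To make the idea of a word-agnostic feature vector concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical lexicon `first_year` that maps each neologism to the year it entered usage, and it bins a document's neologism counts by that year, so the model sees only the shape of the distribution and never the words themselves. The function name, the bin layout (`start_year`, `n_bins`), the normalization, and the toy lexicon years are all illustrative assumptions; the abstract specifies only that 200 input features are used.

```python
def neologism_histogram(tokens, first_year, start_year=1850, n_bins=150):
    """Return a normalized histogram of neologism usage by year of emergence.

    tokens:     list of word strings from one document
    first_year: dict mapping neologism -> year it entered usage (hypothetical lexicon)
    start_year: year covered by the first bin (assumed, not from the paper)
    n_bins:     number of one-year bins (assumed, not from the paper)
    """
    counts = [0] * n_bins
    for tok in tokens:
        year = first_year.get(tok.lower())
        if year is None:
            continue  # not a known neologism, so it contributes nothing
        idx = year - start_year
        if 0 <= idx < n_bins:
            counts[idx] += 1  # only the emergence year is recorded, not the word
    total = sum(counts)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


# Toy lexicon with made-up placeholder years (not real attestation dates).
toy_lexicon = {"telephone": 1880, "radio": 1920, "internet": 1990}
doc = "the radio and the internet and radio".split()
features = neologism_histogram(doc, toy_lexicon)
```

A vector like `features` could then be passed to any standard regressor to predict the year of authorship. Because the vector records only how much usage falls in each emergence year, two documents that use entirely different neologisms from the same period would produce similar inputs.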