Yunita Sari
2018
Topic or Style? Exploring the Most Useful Features for Authorship Attribution
Yunita Sari | Mark Stevenson | Andreas Vlachos
Proceedings of the 27th International Conference on Computational Linguistics
Approaches to authorship attribution, the task of identifying the author of a document, are based on analysis of individuals’ writing style and/or preferred topics. Although the problem has been widely explored, no previous studies have analysed the relationship between dataset characteristics and effectiveness of different types of features. This study carries out an analysis of four widely used datasets to explore how different types of features affect authorship attribution accuracy under varying conditions. The results of the analysis are applied to authorship attribution models based on both discrete and continuous representations. We apply the conclusions from our analysis to an extension of an existing approach to authorship attribution and outperform the prior state-of-the-art on two out of the four datasets used.
2017
Continuous N-gram Representations for Authorship Attribution
Yunita Sari | Andreas Vlachos | Mark Stevenson
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers
This paper presents work on using continuous representations for authorship attribution. In contrast to previous work, which uses discrete feature representations, our model learns continuous representations for n-gram features via a neural network jointly with the classification layer. Experimental results demonstrate that the proposed model outperforms the state-of-the-art on two datasets, while producing comparable results on the remaining two.
A Shallow Neural Network for Native Language Identification with Character N-grams
Yunita Sari | Muhammad Rifqi Fatchurrahman | Meisyarah Dwiastuti
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
This paper describes the systems submitted by the GadjahMada team to the Native Language Identification (NLI) Shared Task 2017. Our models use a continuous representation of character n-grams, which is learned jointly with a feed-forward neural network classifier. Character n-grams have been shown to be effective for style-based identification tasks, including NLI. Results on the test set demonstrate that the proposed model performs very well on the essay and fusion tracks, obtaining more than 0.8 on both F-macro score and accuracy.
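
The idea of a continuous character n-gram representation feeding a classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the averaging of n-gram embeddings, the single linear layer, the embedding size, the toy texts, and all function names (`char_ngrams`, `build_vocab`, `forward`) are assumptions made for illustration. Only the forward pass is shown; in a jointly trained model the embeddings and the classifier weights would both be updated by backpropagation, which is omitted here.

```python
import math
import random


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocab(texts, n=3):
    """Map each character n-gram seen in `texts` to an integer id."""
    vocab = {}
    for text in texts:
        for gram in char_ngrams(text, n):
            vocab.setdefault(gram, len(vocab))
    return vocab


def forward(text, vocab, embeddings, weights, n=3):
    """Average the n-gram embeddings, then apply a linear layer and softmax."""
    # N-grams that are not in the vocabulary are skipped.
    ids = [vocab[g] for g in char_ngrams(text, n) if g in vocab]
    dim = len(embeddings[0])
    if ids:
        doc = [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(dim)]
    else:
        doc = [0.0] * dim
    scores = [sum(w * x for w, x in zip(row, doc)) for row in weights]
    # Subtract the maximum score before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data and randomly initialised (untrained) parameters, for illustration only.
rng = random.Random(0)
texts = ["the cat sat", "a dog ran"]
vocab = build_vocab(texts)
dim, num_classes = 4, 2
embeddings = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_classes)]
probs = forward("the cat ran", vocab, embeddings, weights)
```

Here `probs` holds one probability per class (author or native language, depending on the task). Because the parameters are untrained, the values carry no meaning beyond showing the shape of the computation.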