Syntax Encoding with Application in Authorship Attribution

Richong Zhang, Zhiyuan Hu, Hongyu Guo, Yongyi Mao


Abstract
We propose a novel strategy to encode the syntax parse tree of a sentence into a learnable distributed representation. The proposed syntax encoding scheme is provably information-lossless. Specifically, an embedding vector is constructed for each word in the sentence, encoding the path in the syntax tree corresponding to that word. The one-to-one correspondence between these “syntax-embedding” vectors and the words (and hence their embedding vectors) in the sentence makes it easy to integrate such a representation with all word-level NLP models. We empirically show the benefits of the syntax embeddings in the authorship attribution domain, where our approach improves upon the prior art and achieves new performance records on five benchmark data sets.
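
To make the idea of one tree path per word concrete, here is a minimal sketch. It is not the paper's encoding: it only collects, for each word, the sequence of constituent labels from the root down to that word, and it does not reproduce the lossless scheme or the learned embedding step. The nested-tuple tree format, the function name `word_paths`, and the example sentence are all assumptions made for illustration.

```python
from typing import List, Tuple, Union

# Hypothetical tree format: a leaf is a word (str); an internal node is
# a tuple (label, child_1, child_2, ...).
Tree = Union[str, tuple]


def word_paths(tree: Tree, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, Tuple[str, ...]]]:
    """Return one (word, path) pair per word, in sentence order.

    The path is the tuple of node labels from the root down to the
    word's part-of-speech node.
    """
    if isinstance(tree, str):
        # Reached a word: the labels accumulated so far are its path.
        return [(tree, prefix)]
    label, *children = tree
    pairs = []
    for child in children:
        pairs.extend(word_paths(child, prefix + (label,)))
    return pairs


# Illustrative parse of "the cat sat" (assumed, not taken from the paper).
parse = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBD", "sat")))
paths = word_paths(parse)

for word, path in paths:
    print(word, "/".join(path))
```

Because the list has exactly one entry per word, each path could be turned into a vector and aligned with the corresponding word embedding, which is the kind of one-to-one correspondence the abstract describes. Label sequences alone do not determine the tree in general, so this sketch should not be read as information-lossless.
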
Anthology ID: D18-1294
Volume: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month: October-November
Year: 2018
Address: Brussels, Belgium
Editors: Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 2742–2753
URL: https://aclanthology.org/D18-1294
DOI: 10.18653/v1/D18-1294
Cite (ACL): Richong Zhang, Zhiyuan Hu, Hongyu Guo, and Yongyi Mao. 2018. Syntax Encoding with Application in Authorship Attribution. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2742–2753, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal): Syntax Encoding with Application in Authorship Attribution (Zhang et al., EMNLP 2018)
PDF: https://aclanthology.org/D18-1294.pdf