Small-Scale Cross-Language Authorship Attribution on Social Media Comments

Benjamin Murauer, Gunther Specht


Abstract
Cross-language authorship attribution is the challenging task of classifying documents by bilingual authors where the training documents are written in a different language than the evaluation documents. Traditional solutions rely either on translation, which enables the use of single-language features, or on language-independent feature extraction methods. More recently, transformer-based language models like BERT can also be pre-trained on multiple languages, making them intuitive candidates for cross-language classifiers, although they have not yet been used for this task. We perform extensive experiments to benchmark the performance of three different approaches to a small-scale cross-language authorship attribution problem: (1) using language-independent features with traditional classification models, (2) using multilingual pre-trained language models, and (3) using machine translation to allow single-language classification. As language-independent features, we use universal syntactic features such as part-of-speech tags and dependency graphs; as the pre-trained language model, we use multilingual BERT. We use a small-scale social media comments dataset that closely reflects practical scenarios. We show that applying machine translation drastically increases the performance of almost all approaches, and that the syntactic features in combination with the translation step achieve the best overall classification performance. In particular, we demonstrate that pre-trained language models are outperformed by traditional models in small-scale authorship attribution problems for every language combination analyzed in this paper.
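As a rough illustration of approach (1), the sketch below counts n-grams over universal part-of-speech tags and attributes a query document to the author with the most similar profile. It is only a minimal sketch under stated assumptions: the cosine-similarity nearest-profile rule is a simple stand-in and is not the classifier used in the paper, the function names (pos_ngrams, cosine, attribute) are hypothetical, and the toy tag sequences are invented; in practice they would come from a tagger applied to the training-language and evaluation-language texts.

```python
from collections import Counter
import math


def pos_ngrams(tags, n=2):
    """Count n-grams over a sequence of universal POS tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def attribute(train, query_tags, n=2):
    """Return the author whose POS n-gram profile is closest to the query.

    Assumption: a nearest-profile rule stands in for a trained classifier.
    """
    profiles = {author: pos_ngrams(tags, n) for author, tags in train.items()}
    query = pos_ngrams(query_tags, n)
    return max(profiles, key=lambda author: cosine(profiles[author], query))


# Invented toy data: tag sequences as a universal POS tagger might emit them.
# Training and query texts may be in different languages, since the tag
# inventory is shared across languages.
train = {
    "author_a": ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT",
                 "DET", "NOUN", "VERB", "PUNCT"],
    "author_b": ["PRON", "VERB", "ADV", "ADJ", "PUNCT",
                 "PRON", "VERB", "ADV", "PUNCT"],
}
query = ["DET", "NOUN", "VERB", "ADV", "PUNCT", "DET", "NOUN", "PUNCT"]
predicted = attribute(train, query)
```

Because the features are tag sequences rather than words, the same profile comparison applies whether or not the query is first machine-translated into the training language, which is the comparison the abstract describes.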
Anthology ID:
2021.mtsummit-loresmt.2
Volume:
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
Month:
August
Year:
2021
Address:
Virtual
Editors:
John Ortega, Atul Kr. Ojha, Katharina Kann, Chao-Hong Liu
Venue:
LoResMT
Publisher:
Association for Machine Translation in the Americas
Pages:
11–19
URL:
https://aclanthology.org/2021.mtsummit-loresmt.2
Cite (ACL):
Benjamin Murauer and Gunther Specht. 2021. Small-Scale Cross-Language Authorship Attribution on Social Media Comments. In Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021), pages 11–19, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
Small-Scale Cross-Language Authorship Attribution on Social Media Comments (Murauer & Specht, LoResMT 2021)
PDF:
https://aclanthology.org/2021.mtsummit-loresmt.2.pdf