Evaluating the role of non-lexical markers in GPT-2’s language modeling behavior
Roberta Rocca | Alejandro de la Vega
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems
Language as a fingerprint: Self-supervised learning of user encodings using transformers
Roberta Rocca | Tal Yarkoni
Findings of the Association for Computational Linguistics: EMNLP 2022
The way we talk carries information about who we are. Demographics, personality, clinical conditions, and political preferences influence what we speak about and how, suggesting that many individual attributes could be inferred from adequate encodings of linguistic behavior. Conversely, conditioning text representations on author attributes has been shown to improve model performance in many NLP tasks. Previous research on individual differences and language representations has mainly focused on predicting selected attributes from text, or on conditioning text representations on such attributes for author-based contextualization. Here, we present a self-supervised approach to learning language-based user encodings using transformers. Using a large corpus of Reddit submissions, we fine-tune DistilBERT with a user-based triplet loss. We show that fine-tuned models can pick up on complex linguistic signatures of users, and that they are able to infer rich information about them. Through a series of intrinsic analyses and probing tasks, we provide evidence that fine-tuning enhances models’ ability to abstract generalizable user information, which yields performance advantages for user-based downstream tasks. We discuss applications in language-based assessment and in contextualized and personalized NLP.
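As a rough illustration of what a user-based triplet loss might look like, the sketch below computes a standard triplet margin loss on toy vectors. It assumes the anchor and positive encodings come from texts by the same user and the negative from a different user. The Euclidean distance, the margin of 1.0, and the names `euclidean` and `triplet_loss` are illustrative assumptions, not details taken from the paper; in practice the vectors would be DistilBERT encodings rather than hand-written lists.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    Assumption: anchor and positive encode texts by the same user,
    negative encodes a text by a different user. The margin value is
    hypothetical.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy encodings standing in for model outputs (hypothetical values).
anchor = [0.0, 0.0]
same_user = [0.0, 1.0]    # distance 1 from the anchor
other_user = [3.0, 4.0]   # distance 5 from the anchor

# Same-user text is already much closer than other-user text: no penalty.
loss_easy = triplet_loss(anchor, same_user, other_user)

# Roles swapped: the "positive" is farther than the "negative", so the loss is large.
loss_hard = triplet_loss(anchor, other_user, same_user)
```

Minimizing this quantity over many triplets pulls encodings of texts by the same user together and pushes encodings of different users apart, which is the general idea behind learning user encodings this way.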