Mihaela Bošnjak


PANDORA Talks: Personality and Demographics on Reddit
Matej Gjurković | Mladen Karan | Iva Vukojević | Mihaela Bošnjak | Jan Šnajder
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

Personality and demographics are important variables in social sciences and computational sociolinguistics. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first dataset of Reddit comments of 10k users partially labeled with three personality models and demographics (age, gender, and location), including 1.6k users labeled with the well-established Big 5 personality model. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.


Data Set for Stance and Sentiment Analysis from User Comments on Croatian News
Mihaela Bošnjak | Mladen Karan
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing

Nowadays it is becoming more important than ever to find new ways of extracting useful information from the ever-growing amount of user-generated data available online. In this paper, we describe the creation of a data set that contains news articles and corresponding comments from the Croatian news outlet 24 sata. Our annotation scheme is specifically tailored for the task of detecting stance and sentiment in user comments, as well as assessing whether commentators' claims are verifiable. Through this data, we hope to get a better understanding of the public's viewpoint on various events. In addition, we explore the potential of applying supervised machine learning models to automate the annotation of more data.