WikiTalkEdit: A Dataset for modeling Editors’ behaviors on Wikipedia
Kokil Jaidka | Andrea Ceolin | Iknoor Singh | Niyati Chhaya | Lyle Ungar
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This study introduces and analyzes WikiTalkEdit, a dataset of conversations and edit histories from Wikipedia, for research in online cooperation and conversation modeling. The dataset comprises dialog triplets from the Wikipedia Talk pages, and editing actions on the corresponding articles being discussed. We show how the data supports the classic understanding of style matching, where positive emotion and the use of first-person pronouns predict a positive emotional change in a Wikipedia contributor. However, they do not predict editorial behavior. On the other hand, feedback invoking evidentiality and criticism, and references to Wikipedia’s community norms, is more likely to persuade the contributor to perform edits but is less likely to lead to a positive emotion. We developed baseline classifiers trained on pre-trained RoBERTa features that can predict editorial change with an F1 score of .54, as compared to an F1 score of .66 for predicting emotional change. A diagnostic analysis of persisting errors is also provided. We conclude with possible applications and recommendations for future work. The dataset is publicly available for the research community at https://github.com/kj2013/WikiTalkEdit/.


Identifying Locus of Control in Social Media Language
Masoud Rouhizadeh | Kokil Jaidka | Laura Smith | H. Andrew Schwartz | Anneke Buffone | Lyle Ungar
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Individuals express their locus of control, or “control”, in their language when they identify whether or not they are in control of their circumstances. Although control is a core concept underlying rhetorical style, it is not clear whether control is expressed by how or by what authors write. We explore the roles of syntax and semantics in expressing users’ sense of control (i.e., being “controlled by” or “in control of” their circumstances) in a corpus of annotated Facebook posts. We present rich insights into these linguistic aspects and find that while the language signaling control is easy to identify, it is more challenging to label it as internally or externally controlled, with lexical features outperforming syntactic features at the task. Our findings could have important implications for studying self-expression in social media.

Diachronic degradation of language models: Insights from social media
Kokil Jaidka | Niyati Chhaya | Lyle Ungar
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Natural languages change over time because they evolve to the needs of their users and the socio-technological environment. This study investigates the diachronic accuracy of pre-trained language models for downstream tasks in machine learning and user profiling. It asks the question: given that the social media platform and its users remain the same, how is language changing over time? How can these differences be used to track the changes in the affect around a particular topic? To our knowledge, this is the first study to show that it is possible to measure diachronic semantic drifts within social media and within the span of a few years.


Domain Adaptation from User-level Facebook Models to County-level Twitter Predictions
Daniel Rieman | Kokil Jaidka | H. Andrew Schwartz | Lyle Ungar
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Several studies have demonstrated how language models of user attributes, such as personality, can be built by using the Facebook language of social media users in conjunction with their responses to psychology questionnaires. It is challenging to apply these models to make general predictions about attributes of communities, such as personality distributions across US counties, because doing so requires 1. coping with the potential unavailability of the original training data owing to privacy and ethical regulations, 2. adapting Facebook language models to Twitter language without retraining the model, and 3. adapting from users to county-level collections of tweets. We propose a two-step algorithm, Target Side Domain Adaptation (TSDA), for such domain adaptation when no labeled Twitter/county data is available. TSDA corrects for the different word distributions between Facebook and Twitter and for the varying word distributions across counties by adjusting target side word frequencies; no changes to the trained model are made. In the case of predicting the Big Five county-level personality traits, TSDA outperforms a state-of-the-art domain adaptation method and gives county-level predictions that have fewer extreme outliers, higher year-to-year stability, and higher correlation with county-level outcomes.


Guillaume Cabanac | Muthu Kumar Chandrasekaran | Ingo Frommholz | Kokil Jaidka | Min-Yen Kan | Philipp Mayr | Dietmar Wolfram
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

Overview of the CL-SciSumm 2016 Shared Task
Kokil Jaidka | Muthu Kumar Chandrasekaran | Sajal Rustagi | Min-Yen Kan
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)


Deconstructing Human Literature Reviews – A Framework for Multi-Document Summarization
Kokil Jaidka | Christopher Khoo | Jin-Cheon Na
Proceedings of the 14th European Workshop on Natural Language Generation