Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)
David Bamman | Dirk Hovy | David Jurgens | Katherine Keith | Brendan O'Connor | Svitlana Volkova
recent years, there has been increased interest in building predictive models that harness natural language processing and machine learning techniques to detect emotions from various text sources, including social media posts, micro-blogs or news articles. Yet, deployment of such models in real-world sentiment and emotion applications faces challenges, in particular poor out-of-domain generalizability. This is likely due to domain-specific differences (e.g., topics, communicative goals, and annotation schemes) that make transfer between different models of emotion recognition difficult. In this work we propose approaches for text-based emotion detection that leverage transformer models (BERT and RoBERTa) in combination with Bidirectional Long Short-Term Memory (BiLSTM) networks trained on a comprehensive set of psycholinguistic features. First, we evaluate the performance of our models within-domain on two benchmark datasets GoEmotion (Demszky et al., 2020) and ISEAR (Scherer and Wallbott, 1994). Second, we conduct transfer learning experiments on six datasets from the Unified Emotion Dataset (Bostan and Klinger, 2018) to evaluate their out-of-domain robustness. We find that the proposed hybrid models improve the ability to generalize to out-of-distribution data compared to a standard transformer-based approach. Moreover, we observe that these models perform competitively on in-domain data.’
Monitoring the development of labor market skill requirements is an information need that is more and more approached by applying text mining methods to job advertisement data. We present an approach for fine-grained extraction and classification of skill requirements from German-speaking job advertisements. We adapt pre-trained transformer-based language models to the domain and task of computing meaningful representations of sentences or spans. By using context from job advertisements and the large ESCO domain ontology we improve our similarity-based unsupervised multi-label classification results. Our best model achieves a mean average precision of 0.969 on the skill class level.
Emotion classification in NLP assigns emotions to texts, such as sentences or paragraphs. With texts like “I felt guilty when he cried”, focusing on the sentence level disregards the standpoint of each participant in the situation: the writer (“I”) and the other entity (“he”) could in fact have different affective states. The emotions of different entities have been considered only partially in emotion semantic role labeling, a task that relates semantic roles to emotion cue words. Proposing a related task, we narrow the focus on the experiencers of events, and assign an emotion (if any holds) to each of them. To this end, we represent each emotion both categorically and with appraisal variables, as a psychological access to explaining why a person develops a particular emotion. On an event description corpus, our experiencer-aware models of emotions and appraisals outperform the experiencer-agnostic baselines, showing that disregarding event participants is an oversimplification for the emotion detection task.
Fertility intentions as verbalized in surveys are a poor predictor of actual fertility outcomes, the number of children people have. This can partly be explained by the uncertainty people have in their intentions. Such uncertainties are hard to capture through traditional survey questions, although open-ended questions can be used to get insight into people’s subjective narratives of the future that determine their intentions. Analyzing such answers to open-ended questions can be done through Natural Language Processing techniques. Traditional topic models (e.g., LSA and LDA), however, often fail to do since they rely on co-occurrences, which are often rare in short survey responses. The aim of this study was to apply and evaluate topic models on demographic survey data. In this study, we applied neural topic models (e.g. BERTopic, CombinedTM) based on language models to responses from Dutch women on their fertility plans, and compared the topics and their coherence scores from each model to expert judgments. Our results show that neural models produce topics more in line with human interpretation compared to LDA. However, the coherence score could only partly reflect on this, depending on the corpus used for calculation. This research is important because, first, it helps us develop more informed strategies on model selection and evaluation for topic modeling on survey data; and second, it shows that the field of demography has much to gain from adopting NLP methods.
Gender bias may emerge from an unequal representation of agency and power, for example, by portraying women frequently as passive and powerless (“She accepted her future”) and men as proactive and powerful (“He chose his future”). When language models learn from respective texts, they may reproduce or even amplify the bias. An effective way to mitigate bias is to generate counterfactual sentences with opposite agency and power to the training. Recent work targeted agency-specific verbs from a lexicon to this end. We argue that this is insufficient, due to the interaction of agency and power and their dependence on context. In this paper, we thus develop a new rewriting model that identifies verbs with the desired agency and power in the context of the given sentence. The verbs’ probability is then boosted to encourage the model to rewrite both connotations jointly. According to automatic metrics, our model effectively controls for power while being competitive in agency to the state of the art. In our main evaluation, human annotators favored its counterfactuals in terms of both connotations, also deeming its meaning preservation better.
From the start of the COVID-19 pandemic in Germany, different groups have been protesting measures implemented by different government bodies in Germany to control the pandemic. It was widely claimed that many of the offline and online protests were driven by conspiracy narratives disseminated through groups and channels on the messenger app Telegram. We investigate this claim by measuring the frequency of conspiracy narratives in messages from open Telegram chat groups of the Querdenken movement, set up to organize protests against COVID-19 restrictions in Germany. We furthermore explore the content of these messages using topic modelling. To this end, we collected 822k text messages sent between April 2020 and May 2022 in 34 chat groups. By fine-tuning a Distilbert model, using self-annotated data, we find that 8.24% of the sent messages contain signs of conspiracy narratives. This number is not static, however, as the share of conspiracy messages grew while the overall number of messages shows a downward trend since its peak at the end of 2020. We further find a mix of known conspiracy narratives make up the topics in our topic model. Our findings suggest that the Querdenken movement is getting smaller over time, but its remaining members focus even more on conspiracy narratives.
Community-level linguistic variation is a core concept in sociolinguistics. In this paper, we use conditioned neural language models to learn vector representations for 510 online communities. We use these representations to measure linguistic variation between commu-nities and investigate the degree to which linguistic variation corresponds with social connections between communities. We find that our sociolinguistic embeddings are highly correlated with a social network-based representation that does not use any linguistic input.
Studies on interpersonal conflict have a long history and contain many suggestions for conflict typology. We use this as the basis of a novel annotation scheme and release a new dataset of situations and conflict aspect annotations. We then build a classifier to predict whether someone will perceive the actions of one individual as right or wrong in a given situation. Our analyses include conflict aspects, but also generated clusters, which are human validated, and show differences in conflict content based on the relationship of participants to the author. Our findings have important implications for understanding conflict and social norms.
Participants in political discourse employ rhetorical strategies—such as hedging, attributions, or denials—to display varying degrees of belief commitments to claims proposed by themselves or others. Traditionally, political scientists have studied these epistemic phenomena through labor-intensive manual content analysis. We propose to help automate such work through epistemic stance prediction, drawn from research in computational semantics, to distinguish at the clausal level what is asserted, denied, or only ambivalently suggested by the author or other mentioned entities (belief holders). We first develop a simple RoBERTa-based model for multi-source stance predictions that outperforms more complex state-of-the-art modeling. Then we demonstrate its novel application to political science by conducting a large-scale analysis of the Mass Market Manifestos corpus of U.S. political opinion books, where we characterize trends in cited belief holders—respected allies and opposed bogeymen—across U.S. political ideologies.
Customers are rapidly turning to social media for customer support. While brand agents on these platforms are motivated and well-intentioned to help and engage with customers, their efforts are often ignored if their initial response to the customer does not match a specific tone, style, or topic the customer is aiming to receive. The length of a conversation can reflect the effort and quality of the initial response made by a brand toward collaborating and helping consumers, even when the overall sentiment of the conversation might not be very positive. Thus, through this study, we aim to bridge this critical gap in the existing literature by analyzing language’s content and stylistic aspects such as expressed empathy, psycho-linguistic features, dialogue tags, and metrics for quantifying personalization of the utterances that can influence the engagement of an interaction. This paper demonstrates that we can predict engagement using initial customer and brand posts.
Scandinavian countries are perceived as role-models when it comes to gender equality. With the advent of pre-trained language models and their widespread usage, we investigate to what extent gender-based harmful and toxic content exists in selected Scandinavian language models. We examine nine models, covering Danish, Swedish, and Norwegian, by manually creating template-based sentences and probing the models for completion. We evaluate the completions using two methods for measuring harmful and toxic completions and provide a thorough analysis of the results. We show that Scandinavian pre-trained language models contain harmful and gender-based stereotypes with similar values across all languages. This finding goes against the general expectations related to gender equality in Scandinavian countries and shows the possible problematic outcomes of using such models in real-world settings. Warning: Some of the examples provided in this paper can be upsetting and offensive.
While implicit embeddings so far have been mostly concerned with creating an overall representation of the user, we evaluate a different approach. By only considering content directed at a specific topic, we create sub-user embeddings, and measure their usefulness on the tasks of sarcasm and hate speech detection. In doing so, we show that task-related topics can have a noticeable effect on model performance, especially when dealing with intended expressions like sarcasm, but less so for hate speech, which is usually labelled as such on the receiving end.
Gender discrimination in hiring is a pertinent and persistent bias in society, and a common motivating example for exploring bias in NLP. However, the manifestation of gendered language in application materials has received limited attention. This paper investigates the framing of skills and background in CVs of self-identified men and women. We introduce a data set of 1.8K authentic, English-language, CVs from the US, covering 16 occupations, allowing us to partially control for the confound occupation-specific gender base rates. We find that (1) women use more verbs evoking impressions of low power; and (2) classifiers capture gender signal even after data balancing and removal of pronouns and named entities, and this holds for both transformer-based and linear classifiers.
We address dissonant stance detection, classifying conflicting stance between two input statements. Computational models for traditional stance detection have typically been trained to indicate pro/con for a given target topic (e.g. gun control) and thus do not generalize well to new topics. In this paper, we systematically evaluate the generalizability of dissonant stance detection to situations where examples of the topic have not been seen at all or have only been seen a few times. We show that dissonant stance detection models trained on only 8 topics, none of which are the target topic, can perform as well as those trained only on a target topic. Further, adding non-target topics boosts performance further up to approximately 32 topics where accuracies start to plateau. Taken together, our experiments suggest dissonant stance detection models can generalize to new unanticipated topics, an important attribute for the social scientific study of social media where new topics emerge daily.
While acknowledgments are often overlooked and sometimes entirely missing from publications, this short section of a paper can provide insights on the state of a field. We characterize and perform a textual analysis of acknowledgments in NLP conference proceedings across the last 17 years, revealing broader trends in funding and research directions in NLP as well as interesting phenomena including career incentives and the influence of defaults.
Word embedding models have been used in prior work to extract associations of intersectional identities within discourse concerning institutions of power, but restricted its focus on narratives of the nineteenth-century U.S. south. This paper leverages this prior work and introduces an initial study on the association of intersected identities with discourse concerning social institutions within social media from Nigeria. Specifically, we use word embedding models trained on tweets from Nigeria and extract associations of intersected social identities with institutions (e.g., domestic, culture, etc.) to provide insight into the alignment of identities with institutions. Our initial experiments indicate that identities at the intersection of gender and economic status groups have significant associations with discourse about the economic, political, and domestic institutions.
Layout detection is an essential step for accurately extracting structured contents from historical documents. The intricate and varied layouts present in these document images make it expensive to label the numerous layout regions that can be densely arranged on each page. Current active learning methods typically rank and label samples at the image level, where the annotation budget is not optimally spent due to the overexposure of common objects per image. Inspired by recent progress in semi-supervised learning and self-training, we propose OLALA, an Object-Level Active Learning framework for efficient document layout Annotation. OLALA aims to optimize the annotation process by selectively annotating only the most ambiguous regions within an image, while using automatically generated labels for the rest. Central to OLALA is a perturbation-based scoring function that determines which objects require manual annotation. Extensive experiments show that OLALA can significantly boost model performance and improve annotation efficiency, facilitating the extraction of masses of structured text for downstream NLP applications.
Data scarcity is a common problem in NLP, especially when the annotation pertains to nuanced socio-linguistic concepts that require specialized knowledge. As a result, few-shot identification of these concepts is desirable. Few-shot in-context learning using pre-trained Large Language Models (LLMs) has been recently applied successfully in many NLP tasks. In this paper, we study few-shot identification of a psycho-linguistic concept, Morality Frames (Roy et al., 2021), using LLMs. Morality frames are a representation framework that provides a holistic view of the moral sentiment expressed in text, identifying the relevant moral foundation (Haidt and Graham, 2007) and at a finer level of granularity, the moral sentiment expressed towards the entities mentioned in the text. Previous studies relied on human annotation to identify morality frames in text which is expensive. In this paper, we propose prompting based approaches using pretrained Large Language Models for identification of morality frames, relying only on few-shot exemplars. We compare our models’ performance with few-shot RoBERTa and found promising results.
Sarcasm is prevalent in all corners of social media, posing many challenges within Natural Language Processing (NLP), particularly for sentiment analysis. Sarcasm detection remains a largely unsolved problem in many NLP tasks due to its contradictory and typically derogatory nature as a figurative language construct. With recent strides in NLP, many pre-trained language models exist that have been trained on data from specific social media platforms, i.e., Twitter. In this paper, we evaluate the efficacy of multiple sarcasm detection datasets using machine and deep learning models. We create two new datasets - a manually annotated gold standard Sarcasm Annotated Dataset (SAD) and a Silver-Standard Sarcasm-annotated Dataset (S3D). Using a combination of existing sarcasm datasets with SAD, we train a sarcasm detection model over a social-media domain pre-trained language model, BERTweet, which yields an F1-score of 78.29%. Using an Ensemble model with an underlying majority technique, we further label S3D to produce a weakly supervised dataset containing over 100,000 tweets. We publicly release all the code, our manually annotated and weakly supervised datasets, and fine-tuned models for further research.
The Stereotype Content model (SCM) states that we tend to perceive minority groups as cold, incompetent or both. In this paper we adapt existing work to demonstrate that the Stereotype Content model holds for contextualised word embeddings, then use these results to evaluate a fine-tuning process designed to drive a language model away from stereotyped portrayals of minority groups. We find the SCM terms are better able to capture bias than demographic agnostic terms related to pleasantness. Further, we were able to reduce the presence of stereotypes in the model through a simple fine-tuning procedure that required minimal human and computer resources, without harming downstream performance. We present this work as a prototype of a debiasing procedure that aims to remove the need for a priori knowledge of the specifics of bias in the model.
Language models such as GPT-3 have caused a furore in the research community. Some studies found that GPT-3 has some creative abilities and makes mistakes that are on par with human behaviour. This paper answers a related question: Who is GPT-3? We administered two validated measurement tools to GPT-3 to assess its personality, the values it holds and its self-reported demographics. Our results show that GPT-3 scores similarly to human samples in terms of personality and - when provided with a model response memory - in terms of the values it holds. We provide the first evidence of psychological assessment of the GPT-3 model and thereby add to our understanding of this language model. We close with suggestions for future research that moves social science closer to language models and vice versa.