Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large Language Models with SocKET Benchmark

Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks. While LLMs are increasingly deployed in many forms including conversational agents that interact with humans, we lack a grounded benchmark to measure how well LLMs understand social language. Here, we introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge which we group into five categories: humor & sarcasm, offensiveness, sentiment & emotion, social factors, and trustworthiness. In tests on the benchmark, we demonstrate that current models attain only moderate performance but reveal significant potential for task transfer among different types and categories of tasks, as predicted from theory. Through zero-shot evaluations, we show that pretrained models already possess some innate but limited capabilities of social language understanding and that training on one category of tasks can improve zero-shot testing on others. Our benchmark provides a systematic way to analyze model performance on an important dimension of language and points to clear room for improvement to build more socially-aware LLMs. The associated resources are released at https://github.com/minjechoi/SOCKET.


Introduction
Interpersonal communication is more than just what is said. Understanding communication requires reasoning not only about the content of a message but the social implications drawn from that message (Halliday, 1995). As NLP systems, particularly Large Language Models (LLMs), are increasingly used in interpersonal settings, these models' abilities to understand social knowledge become critical. However, despite the recognized need for social knowledge (Hovy and Yang, 2021), the NLP field has limited abilities to test it. Here, we introduce SOCKET, a new benchmark for evaluating social knowledge.
Evaluating NLP systems has remained a key component for benchmarking the field's progress. Indeed, the rapid replacement of traditional models by LLM-based approaches was strongly motivated by substantial gains by LLMs on a variety of comprehensive Natural Language Understanding (NLU) benchmarks like SuperGLUE (Wang et al., 2019) and Natural Questions (Kwiatkowski et al., 2019). However, despite the fundamental social aspect of language, comprehensive benchmarks of social language remain absent. Instead, existing computational studies of social language have built individual datasets and models for specific types of information like empathy (Sharma et al., 2020), politeness (Danescu-Niculescu-Mizil et al., 2013), and humor (Van Hee et al., 2018). While beneficial, these semantic-level tasks omit broader social and narrative-level information (Li et al., 2021) and present only a narrow view of model performance.
We introduce SOCKET (Social Knowledge Evaluation Tests), a theory-grounded, systematic collection of 58 social language tasks. SOCKET covers five categories of social information: sentiment & emotion, trustworthiness, humor & sarcasm, offensiveness, and social factors, each motivated by specific theories. To examine models' generalizability, SOCKET includes four task formats: classification, regression, pairwise comparison, and span identification. This construction aims to assess not only NLP models' performance on individual tasks but also their ability to perform multiple task types and to productively benefit from related tasks and task categories during learning.
Our study offers the following three contributions to the research community. (1) We motivate a theoretically-grounded organization of social tasks (§2) and subsequently introduce a new easy-to-use benchmark, SOCKET, that systematically organizes 58 tasks (§3). (2) We benchmark multiple current LLM approaches to multitask NLU via standard supervised training and zero-shot LLMs (§4). Across all tests, our results show that baseline LLMs perform moderately at best but offer promising signs of being able to leverage task correlations. (3) We test the abilities of models to make use of cross-task transfer (§5), showing that multi-task training on strongly correlated tasks can maintain or even improve performance on specific tasks, but doing so on weakly correlated tasks can hurt the overall performance of LLMs (§6).

Social Information in Natural Language Processing

Language is inherently social, as meaning is constructed through social interactions (Wittgenstein, 1953). A substantial body of research in linguistic theory and communication studies has examined how social knowledge is communicated via language. Theories of language grounded in interaction and communication systems, such as Systemic Functional Linguistics (SFL; Halliday et al., 1989), assert that the function and appropriacy of language in a given context is the key to our understanding of language and its use (Eggins, 2004; Halliday et al., 1989; Halliday, 2004). We use these insights to probe linguistic models for their ability to capture social information, which we define as information conveyed through text about the broader metatextual function and contextual appropriacy of the utterances in conversation.

NLP Studies on Social Information Numerous studies have contributed to the development of datasets and models aimed at identifying nuanced social information in language across diverse contexts. Computational linguists have modeled multiple forms of social information in language like sentiment (Buechel and Hahn, 2017), politeness (Fu et al., 2020), humor (Meaney et al., 2021), offensiveness (ElSherief et al., 2021), and intimacy (Pei and Jurgens, 2020), often achieving state-of-the-art results close to human performance in their respective settings. Studies such as Park et al. (2021) have also leveraged given norms to train models to be more accurate in context-specific situations.
However, these plausible results may be achievable by focusing solely on the statistical and syntactic aspects of language rather than the social ones.
Whether to make advances in language understanding in research or to ensure reliability and safety in deployment, it is of vital importance to study whether models are truly capable of gaining a generalizable understanding of social factors before employing them for tasks that require such knowledge (Hovy and Yang, 2021). The necessity for such understanding is exemplified by studies showing that, when measuring the same concept, the performance of a model can vary greatly when tested on a different dataset due to factors such as changes in dialect, speaker demographics, and dataset domain (Miller et al., 2020; Blodgett et al., 2016; Wang et al., 2022a).
Despite this importance, efforts toward aggregating and synthesizing various datasets into themes have been less common. One notable exception is the work of Kang and Hovy (2021), where the authors combine existing datasets on different linguistic styles to introduce a benchmark which enables them to study cross-style language understanding. Similarly, we present a benchmark curated from over fifty different tasks on different aspects of social information, which we group into five distinctive categories.
Examining the social knowledge of LLMs In their current state, LLMs are ubiquitous in NLP applications and the study of computational linguistics. Their success is attributed to their ability to capture language characteristics from an immense amount of text data through pre-training and to flexibly apply this information to enhance performance on downstream tasks through fine-tuning. Pre-trained LLMs can quickly adapt to new tasks, achieving state-of-the-art performance in language understanding tasks (Chung et al., 2022a). LLMs have demonstrated less success when solving tasks directly related to social knowledge. For tasks that require social information such as detecting sarcasm (Farha et al., 2022) or patronizing language (Perez-Almendros et al., 2022), even state-of-the-art models exhibit only moderate performance. One major challenge is that, compared to humans, LLMs are less able to make predictions outside of the provided input and to reason based on innate social information (Sap et al., 2019b; Zhou et al., 2020). Yet it is this very social knowledge that is crucial for human interactions and conversations, and it is a milestone that must be reached for LLMs to communicate meaningfully with humans (Mahowald et al., 2023).
More recently, general-purpose LLMs trained with instruction-based prompts are known to achieve strong performance, putting them to use in several practical domains such as summarization, question answering, and classification (Sanh et al., 2022). A newly emerging trend is to use curated prompts to identify the psychological capabilities of instruction-guided LLMs. Ruis et al. (2022) and Hu et al. (2022a) examine pragmatic understanding capabilities using prompts. Coupled with additional steps such as chain-of-thought (CoT) reasoning, this prompt-based approach has large potential for understanding whether LLMs can reason like humans.
In that sense, SOCKET can function as a safety check to evaluate the social information stored in the parameters of an LLM. Especially for models that will be deployed for interactions with humans, our benchmark can be used to examine a model's overall social understanding across multiple dimensions, assisting practitioners in deciding whether an LLM to be deployed contains sufficient social knowledge to fulfill the tasks at hand.

The Inter-relatedness of Social Information
Social language understanding requires accurately perceiving different dimensions and facets of communication that relate to one another. Interpersonal communication makes frequent use of humor (Schnurr, 2010), mitigation (also known as hedging; Schneider, 2010), and swearing as a norm violation (Stapleton, 2003) in defining the contours of the social context for the speakers. Often, the pragmatics of these different dimensions of social language use are intertwined: communication on one dimension influences the interpretation of another, e.g., politeness and offensive speech (Culpeper, 2021), humor and politeness (Attardo, 2008), humor and offensiveness (Alberts, 1992), and mitigation and empathy (LI Hai-hui, 2019). Understanding one of these dimensions requires models to have the ability to recognize the related dimensions. While past computational work has largely focused on single dimensions, SOCKET fills a key gap by testing whether models can accurately recognize multiple, interrelated social dimensions, and whether models can benefit in their understanding from cross-task transfer.

SOCKET Task Selection
In this section, we describe the steps taken to create SOCKET, a unified benchmark designed to identify social information embedded in language in interpersonal communication contexts.
Possible datasets and tasks were identified through a systematic review of datasets published at ACL, EMNLP, NAACL, EACL, LREC, and SemEval since 2015. In this first pass, we selected more than 100 datasets and tasks that detect different types of social information in language. Tasks were selected based on membership in five categories of social language (described next) that are motivated as core aspects of social language understanding. For each category, we include tasks with several distinct objectives: binary and multi-class classification, regression, pairwise similarity detection, and span identification. Where possible, we aim for diversity within categories and ensure one task for each objective. Candidate tasks were removed if training a bert-base-uncased model on the task achieved test performance over 0.95, as such tasks would provide little insight into progress at recognizing social information.
Inspired by theories in interpersonal communication and interpersonal pragmatics, we provide a thematic organization of the tasks in SOCKET into five related categories of social knowledge: Humor & Sarcasm, Offensiveness, Sentiment & Emotion, Social Factors, and Trustworthiness.
Humor & Sarcasm The practice of humor in conversations and interactions plays a key role in maintaining and forming positive social relations (Holmes, 2006; Brown et al., 1987; Ziv, 2010). By nature, humor is a subjective concept that can differ depending on both demographic and contextual factors (Ruch, 2010), making humor detection a difficult task for LLMs. SOCKET includes a number of tasks on humor occurring in various contexts such as social media (Meaney et al., 2021), short jokes (Meaney et al., 2021), and news headlines (Hossain et al., 2020). We also include tasks that require detecting concepts related to humor such as sarcasm (Khodak et al., 2018) and irony (Van Hee et al., 2018).

Offensiveness Detecting offensiveness using computational methods has attracted substantial attention in recent years due to the ubiquity of online communication and the necessity of automated content moderation to combat abusive behaviors (Spertus, 1997). However, most existing studies focus only on limited types of offensive language (Jurgens et al., 2019). In this study, we consider offensiveness to be any explicit or implicit language directed towards individuals, entities, or groups (Waseem et al., 2017), and the tasks chosen are representative of this understanding. SOCKET includes offensiveness detection tasks covering different levels of harmful content and abusive language, including both explicit and implicit hate (ElSherief et al., 2021), abuse (Vidgen et al., 2021), and humor-related offensiveness (Meaney et al., 2021). We also include forms of bias directed towards people and groups, as social bias enforces harmful stereotypes (Sap et al., 2020).
Sentiment & Emotion Emotion is a core element of interpersonal communication that can be conveyed through human language in several ways (Majid, 2012). Social information is crucial to the ability not only to communicate but also to feel emotion. Theories of discretized emotion (Ekman, 1992) have been supported by empirical findings that humans use discrete labels learned through language to direct their emotional responses to stimuli (Lindquist and Barrett, 2008). In SOCKET, we include a wide range of tasks from various domains such as daily dialogue (Li et al., 2017), written responses to news stories (Buechel and Hahn, 2017), and tweets, using both textual syntax (Mohammad et al., 2018) and emojis (Barbieri et al., 2018).
Trustworthiness People can detect cues in language that determine the trustworthiness of a message (Newman et al., 2003), leading to studies that aim to quantify the level of trust in text using computational methods (Choi et al., 2020). In particular, this direction has gained attention in the NLP community following increased needs to combat and mitigate potential harms from the generation and dissemination of false information in online spaces (Wu et al., 2019). In SOCKET, we include tasks that require identifying perceived trust along several dimensions: impartiality (Pryzant et al., 2020), deception (Ott et al., 2011), propaganda (Martino et al., 2020), rumor (Ma et al., 2017), and bragging, as it is considered to be "unplain speaking" (Haiman, 1998; Jin et al., 2022).
Other Social Factors Finally, we include tasks of a more discursive and rhetorical type that are understood to be more reliant on the contextual elements of social distance, power, and solidarity. In SOCKET, the tasks included are empathy (Buechel et al., 2018), politeness (Hayati et al., 2021; Fu et al., 2020), intimacy (Pei and Jurgens, 2020), and complaints (Preoţiuc-Pietro et al., 2019).

Dataset Summary The final SOCKET benchmark contains 58 tasks from 35 datasets, grouped into the five categories shown in Figure 1. We represent multiple tasks belonging to the same dataset by adding the task name as a suffix following the dataset name and a # symbol.
The collection of datasets chosen for inclusion in the SOCKET benchmark makes it the first of its kind as a comprehensive benchmark for measuring language models' ability to capture underlying social information. Motivated by theories of systemic functional linguistics and interpersonal pragmatics, the SOCKET benchmark cuts across a number of dimensions of interpersonal communication, allowing it to also serve as a tool to better understand and interpret co-learning abilities and dependencies in sociolinguistic tasks. This ability allows researchers and users to more efficiently and effectively deploy NLP methods by providing empirical results regarding the limits and affordances of a variety of out-of-domain social language tasks.

A full comparison of all models across all settings can be found in Table 4 in the Appendix. The performance on each individual task using a DeBERTa-V3 model can be found in Table 7 in the Appendix.

Benchmarks on the Social Knowledge Capabilities of LLMs
We first train and evaluate several commonly used multitask LLMs on our datasets to obtain benchmark results, which provide a first glimpse of how well LLMs learn social knowledge tasks. Experiment details are described in Appendix §B.

BERT-based Finetuning
We first apply the standard process of finetuning pretrained LLMs. We select two of the most popular LLMs, BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), as well as two lightweight models known to achieve high performance when finetuned, DeBERTa-V3 (He et al., 2021) and MiniLM (Wang et al., 2020).
Prompt-based finetuning Recently, prompt-based finetuning has emerged as a flexible and effective means of adapting models to downstream tasks (Wei et al., 2021). As a benchmark, we include the performance of a T5 model (Raffel et al., 2020) trained on each task via finetuning. We design manual prompts for each task. For classification tasks, we include all the labels in the prompts; for regression tasks, we adopt a method similar to Gao et al. (2020), using two anchor words, "yes" and "no," and taking the probability of predicting "yes" as the final score. For span-based tasks, we train the model to directly generate the sequence outputs. A list of prompts can be found in Table 6 in the Appendix.
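To make the anchor-word scoring concrete, below is a minimal sketch of how the probability of the "yes" token can be read off a T5 model with Hugging Face Transformers. The model name, prompt wording, and the assumption that each anchor word maps to a single token are illustrative, not our exact experimental setup.

```python
# Sketch: scoring a regression prompt by the probability of the "yes" anchor
# token, in the spirit of Gao et al. (2020). Model and prompt are illustrative.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def yes_probability(prompt: str) -> float:
    """Return the probability mass on "yes" vs. "no" at the first decoder step."""
    inputs = tokenizer(prompt, return_tensors="pt")
    # The decoder starts from the decoder start token; we only need step one.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()  # renormalized over the two anchor words

score = yes_probability("Is the following text intimate? Text: I miss you so much.")
```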
Zero-shot predictions We further apply our designed prompts to test the performance of LLMs in a zero-shot setting where no further finetuning is performed. Using the same prompts proposed in Table 6, we test SOCKET on several widely used LLMs: GPT (Radford et al., 2018), GPT-J-6B (Wang and Komatsuzaki, 2021), OPT (Zhang et al., 2022), T5 (Raffel et al., 2020), LLaMA (Touvron et al., 2023), BLOOM (Workshop et al., 2023), FLAN-T5 (Chung et al., 2022b), and Alpaca (Taori et al., 2023; Wang et al., 2022b). Samples for which a model does not provide an appropriate label are automatically marked as incorrect. For each LLM variant, we test zero-shot results for different model sizes ranging between 110M and 13B parameters, which we report in Table 4 in the Appendix.
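As an illustration of this scoring convention, the sketch below shows one way the validity check could be implemented; the exact string normalization used in our evaluation scripts may differ.

```python
# Sketch: scoring a zero-shot generation when a model may answer outside the
# label set. Normalization rules here are illustrative assumptions.
def score_generation(generated: str, label_options: list[str], gold: str) -> bool:
    """A sample counts as correct only if the model emits a valid label."""
    answer = generated.strip().lower()
    valid = {opt.lower() for opt in label_options}
    if answer not in valid:
        return False  # no appropriate label -> automatically marked incorrect
    return answer == gold.lower()

assert score_generation(" Yes", ["yes", "no"], "yes") is True
assert score_generation("maybe?", ["yes", "no"], "yes") is False
```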

Results
We compare model performances across category type and task type, as shown in Table 2. DeBERTa-V3 achieves the best overall performance after full training on each of the SOCKET datasets, followed by other BERT-based models. The prompt-based finetuning of T5 performs worse than standard finetuning, especially on the pairwise classification and regression tasks. Meanwhile, zero-shot models perform close to baseline, indicating that prompts alone cannot lead to correct predictions in identifying social knowledge without further finetuning, and suggesting these models are less able to verbalize any inherent social knowledge.
Social knowledge can be hard to infer Our benchmark results reveal that even our best-performing model leaves significant room for improvement, scoring just above 0.7 overall, compared with the models' analogous performance on syntactic and discourse NLU tasks (He et al., 2021), which is often much higher. A comparison among categories of social knowledge reveals that humor & sarcasm is generally the easiest to detect, while trustworthiness is the hardest. This performance gap can be attributed to the level of understanding required for each dimension: while detecting humor or other social emotions can often be correlated with cues such as sentiment, detecting the level of trust within sentences requires more understanding of the context and may be harder to detect using computational models (Choi et al., 2020). At a task level, we observe that models struggle most in span detection tasks. This is a complex task due to its open-ended nature, and thus BERT-based finetuning does not perform as well as in other types of tasks. We highlight that learning the various aspects of social knowledge is indeed a challenge for current LLMs, and we thus call for future models with improved social capabilities.

Supervised models significantly outperform zero-shot models Table 2 reveals that, despite being much smaller in the number of parameters, finetuned supervised models such as BERT achieve much better performance than zero-shot models using state-of-the-art LLMs. In particular, all the LLMs tested in zero-shot settings performed poorly, many on par with random baselines, apart from FLAN-T5. Figure 1 reveals a detailed picture of how different modeling structures and parameter sizes affect a model's ability to comprehend social knowledge tasks in a zero-shot setting. Surprisingly, we find that of the various training schemes, FLAN-T5 is by far the most effective for inferring social knowledge. This is largely attributable to its instruction-finetuning scheme, in which the initial model is pretrained on more than 1,000 tasks. This allows the model to interpret various types of prompts, some of which may not have been included in the training data and are difficult for other models to interpret.

More parameters do not guarantee more social knowledge Another general trend we observe is a weak correlation between the number of parameters and overall performance within the same model architecture (ρ = 0.328, p = .10). This is to some extent determined by the model's ability to understand the task itself given an instruction prompt and a sample input, as larger models are capable of understanding a wider variety of tasks (refer to Table 5 in the Appendix for a comparison of parameter size and the percentage of samples that an LLM attempts to solve). Of course, it is also possible that larger LLMs encode a greater amount of social knowledge through their greater parameter sizes. Interestingly, we observe that for some models, larger size does not always guarantee better performance. This is especially the case for BLOOM, T5, and GPT, where the largest model is not always the best performer within the group.
Table 5 in the Appendix shows the ratio of instances where the model is able to follow the instructions. We show that instruction-tuned models like FLAN-T5 and Alpaca are generally able to follow the prompt instructions when answering the question, while other models may generate answers that are not among the provided options. Moreover, we find that pre-trained language models with more parameters may be more capable of following instructions even without instruction tuning. For example, the T5-3B model is able to follow the instructions for more instances than the smaller models (e.g., T5-large and T5-small). However, we also observe exceptions to this trend: the T5-11B model follows fewer instructions than the T5-3B model, suggesting the need for instruction tuning in large language models.

We also examine the zero-shot performances of different LLMs by restricting to instances for which they were able to provide appropriate predictions, that is, where the model understood the instruction prompt. Figure 3 reveals heterogeneity among model groups regarding the interplay between model size, coverage, and performance. For LLaMA models, for instance, while they sit in the bottom half for understanding instructions and making predictions, the predictions they do make are highly accurate compared to other models. Furthermore, moving from 7B to 13B parameters did not noticeably increase the ratio of samples on which the model could make inferences, while the accuracy on those valid samples actually decreased with size. The opposite holds for models such as OPT or FLAN-T5, where both coverage and performance increase with size. Overall, our results suggest that while LLMs do contain potential for understanding social knowledge, additional steps such as finetuning or instruction tuning are required to fully elicit the understanding capabilities of these models.

Do we see Cross-task Transfer of Social Knowledge?
In this section, we examine the relations and dependencies between tasks using the predictions of LLMs trained on different tasks, and we find strong dependencies among tasks that are supported by theory.
Quantifying Task Dependency We quantify the dependency between two tasks as follows. We finetune a pretrained LLM on task t_A to obtain a model m_A, which is used to make predictions on the test set of another task t_B. The correlation between the predicted values from model m_A and the true labels of the test set of t_B is considered the task dependency that t_A has on t_B. We report the absolute correlation value, as negatively correlated tasks are still informative. We describe how the correlations are obtained across different task types in the Appendix (§B.5). Span identification tasks are omitted from this analysis, resulting in 55 × 55 scores. We also measure the pairwise correlation between models m_A and m_B in addition to task dependency to gain an additional perspective on task similarity. Details for the model correlation can be found in Appendix §B.7.

The resulting task dependency scores, shown in Figure 2, reveal salient block structures within the categories, especially for the Offensiveness, Sentiment & Emotion, and Social Factors categories, suggesting the existence of shared knowledge within our thematically grouped tasks. These correlations align with existing findings from interpersonal pragmatics on the relationships between types of social knowledge. For instance, increased self-disclosure or pain-related interactions are known to promote both intimacy (questionintimacy) and empathy (empathy) (Parks, 1981; Cano and Williams, 2010), two elements within the Social Factors category, while the usage of emojis (tweet_emoji) as affective symbols is indicative of emotional states such as valence (emobank#valence) and arousal (emobank#arousal) (Fischer and Herbert, 2021), which belong to the Sentiment & Emotion category.
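As a concrete illustration of the dependency computation described above, below is a minimal sketch for a pair of regression tasks; `finetune` and `predict` are hypothetical helpers standing in for the training and inference code in Appendix B, and other task-type pairs use the correlation rules of §B.5 instead of Pearson's r.

```python
# Sketch: cross-task dependency of t_A on t_B (regression-regression case).
# `finetune` and `predict` are hypothetical helpers, not part of our release.
from scipy.stats import pearsonr

def task_dependency(pretrained_model, task_a, task_b):
    model_a = finetune(pretrained_model, task_a.train_set)   # m_A
    preds = predict(model_a, task_b.test_set.texts)          # m_A applied to t_B
    r, _ = pearsonr(preds, task_b.test_set.labels)
    return abs(r)  # negatively correlated tasks are still informative
```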
The Offensiveness category shows mixed results in comparison with Arango et al. (2019), whose results show that hate speech datasets are often overfitted and do not generalize well to other similar datasets. Figures 2 and 4, however, show that of the seven datasets included in SOCKET, five included at least one task that showed comparable correlations when tested both within and out of domain. PersonDirectedAbuse, a task labeled for offensive language specifically directed towards an individual, is actually predicted better by models fine-tuned on jigsaw# tasks than by a model trained on the task itself. Interestingly, correlations are scarce within the Humor & Sarcasm and Trustworthiness categories. This is consistent with findings from Hu et al. (2022b), which show that models without exposure to linguistic forms lack the requisite social information to perform well on non-literal pragmatic phenomena such as humor and deceit.
Another individual task that stands out is humor_rating from the Humor & Sarcasm category, which performs well as both the fine-tuning and the predicted task alongside a number of tasks from the Sentiment & Emotion category (particularly discretized emotion tasks), as well as hateoffensive in the Offensiveness category, which labels comments as either "hateful," "offensive," or neither. While relationships between offensiveness and humor have been theorized as early as Freud (1960), and sentiment recognition has been shown to bolster offensive language detection (Liu, 2012), relatively little has been said regarding connections among the three categories; this result thus presents an opportunity for further research in sociolinguistics.
We observe that politeness shows strong transfer with many of the offensive and hate speech detection tasks in the SOCKET benchmark. In particular, tasks with high correlation within the Offensiveness category are highly correlated in predicting the politeness classification task. This finding is supported by literature showing that impoliteness can fall under the umbrella of offensive language (Bączkowska, 2021) and that, although key differences exist in the pragmatics of the two, the constructs are closely related (Parvaresh, 2023; Culpeper, 2021).
Interestingly, regression tasks (from the hahackathon, emobank, and empathy datasets) in general have strong correlations with several other tasks. This trend suggests that tasks labeled with continuous variables may have more expressive power than ordinal or nominal categorization, and thus a higher potential for stronger task dependencies. This finding calls for more datasets with continuous labels, which require more annotation effort but allow models to capture more fine-grained concepts of social knowledge.

Can Multi-task Training improve Social Knowledge?
Our findings reveal significant task transfer, both within and across task categories, which hints at shared knowledge among tasks. Linguistic studies of social language also note the interrelated perceptions of different dimensions such as humor and offensiveness (Culpeper, 2021; Attardo, 2008; Alberts, 1992; LI Hai-hui, 2019). We now examine whether LLMs can learn a more robust sense of social knowledge by training on multiple tasks.
Recent studies have explored multi-task training of LLMs, i.e., training a single model on several different tasks simultaneously, which can improve performance on both seen and unseen tasks (Aghajanyan et al., 2021; Padmakumar et al., 2022). We apply multi-task training to SOCKET but make one clear distinction from prior work. Whereas previous studies have shown that multi-task training is especially effective when the grouped tasks are of similar types (Padmakumar et al., 2022), we introduce a new setting by instead grouping tasks by our defined categories of social knowledge. We expect that same-category tasks contain social knowledge that can be shared across tasks, resulting in LLMs that learn a more robust concept of the specific dimension than when trained on single tasks.
A popular method for multi-task training is pre-finetuning (Aghajanyan et al., 2021; Shi et al., 2022), which involves a first stage of finetuning on multiple tasks using task-specific heads on a shared encoder, then re-using the encoder for downstream tasks. We apply pre-finetuning in two different settings: (1) category-wise tasks, where we perform pre-finetuning on tasks grouped into the same category, and (2) all tasks, where all tasks of SOCKET are included in the pre-finetuning stage. Consistent with prior work, we perform the second finetuning stage on individual tasks using the pre-finetuned model as initial weights (Aghajanyan et al., 2021).
Other training details are identical to §4.
We compare category-wise performances across the three settings, shown in Table 3. At an individual task level, we further see that multi-task training improves performance for several tasks, especially in the Offensiveness category. However, when aggregated at the category level, we observe that our multi-task training settings yield significantly lower performance in the Humor & Sarcasm and Trustworthiness categories, which happen to have the lowest levels of within-task and cross-task dependencies (§5). The performance drop is less evident in categories with high dependency, indicating that while multi-task training on similar tasks may not always improve performance, it at least helps preserve performance while also learning new task-specific concepts. On the other hand, multi-task training on unrelated tasks can hurt overall performance, which calls for further investigation into when to apply multi-task training as a practice to improve the social knowledge of LLMs.

Conclusion
People increasingly interact with LLMs in natural conversation. To what degree are these models able to pick up on social cues? To help answer this question, we introduce SOCKET, an NLP benchmark to evaluate how well models perform at learning and recognizing concepts of social knowledge. We provide benchmark results using several popular models and case studies examining the inherent social capabilities of LLMs in a zero-shot setting. Surprisingly, LLMs perform moderately at best, with even large LLMs (>10B parameters) varying widely in their abilities. Additionally, we show that there exist significant task dependencies both within and across task categories, and that multi-task training on task categories can affect model performance. Our work contributes to the broader NLP community by fostering future efforts toward building and evaluating more socially responsible and coherent LLMs.

Limitations
Cross-cultural and multilingual expansions Culture and non-English languages are without doubt important aspects of language understanding. In this study, however, we make a clear distinction between cultural knowledge and social knowledge, the latter being our focus. Our work is grounded in social-psychological theory and the sociolinguistics of interpersonal communication, especially dyadic communication. Such studies are often aimed at phenomena that are widely shared across cultures, while recognizing that cultural variation exists in how those phenomena are perceived. In contrast, work in anthropology or cultural studies provides a different perspective and grounding. Such work frequently focuses on cross-cultural perspectives and what is or is not shared across cultures. For example, in language, the interpretation of whether something is polite can depend on gender norms (Mills, 2004) and cultural norms (Lorenzo-Dus and Bou-Franch, 2003), highlighting the potential context sensitivity. Similarly, the perception of toxicity can depend on the cultural identities of the reader (Sap et al., 2019a; Ghosh et al., 2021). While highly valuable to study, cultural knowledge is a separate (though interrelated) construct from social knowledge and not the focus of this benchmark, though we hope that our work inspires other benchmarks to help assess such differences.
Regarding multilingual data, SOCKET currently contains tasks based in English due to the limited availability of tasks in non-English languages. While there are a few datasets such as HAHA (Chiruzzo et al., 2020) in Spanish and DeTox (Demus et al., 2022) in German, we were not able to find sufficient numbers to provide a meaningful grouping. This highlights the importance of constructing datasets and frameworks capable of capturing social knowledge in a wide variety of languages, which we consider an important future step.

Additional dimensions of social knowledge
Even after our extensive literature review and data curation process, we acknowledge the existence of other dimensions of social knowledge that are not included in our current benchmark. Our aim was to focus on categories of social knowledge that have multiple tasks in order to get a broader assessment of model capabilities, e.g., multiple tests of a model's ability to recognize humor. Ultimately, some social aspects of language such as polysemy (Carston, 2021; Apresjan, 1974) and idioms (Strässler, 1982) either had too few similar datasets to form a theory-backed category, or there were no existing NLP datasets to test the construct. The latter is especially the case for linguistic techniques unique to identity- or community-specific dialects such as African-American English (Hyter et al., 2015; Rivers et al., 2012; Allan, 2007) and Queer Language (Barrett, 2006; Huebner, 2021; Harvey, 2000). Finally, because the work of task selection and category building was so dependent on literature in communication and pragmatics, the vast sea of socially informative linguistic features that have not been thoroughly studied in the literature is also not present. Thus, one potential room for improvement is the addition of new categories or constructs as additional data becomes available, perhaps putting additional focus on conversation-driven aspects. Further inclusion of other dimensions and corresponding tasks should be an ongoing goal for our benchmark to remain effective.
Technical limitations One major limitation of the current benchmark is that we only tested LLMs with up to 13B parameters. Recent studies show that LLMs may start to show emergent abilities when they are scaled up above a certain threshold (Wei et al., 2022). Due to limited computational and financial resources, we are not able to test all very large language models, though we welcome future researchers to work on our benchmark and evaluate the sociability of more LLMs.
Finally, we reiterate that our model performances using prompts were obtained by using curated prompts on pretrained models without any further finetuning. While it is widely known that instruction-based finetuning specific to downstream tasks can greatly improve performance, we deliberately chose not to do so. Finetuning LLMs with billions of parameters leaves a large carbon footprint, which we avoid for both financial and environmental reasons (Hu et al., 2021; Liu et al., 2022; Lester et al., 2021).

Ethical Considerations
The interpretation of social information in communication is highly subjective in that it can vary greatly depending on demographic and contextual factors. Nevertheless, several NLP datasets are created via crowdsourcing, which raises concerns about whether a dataset's labels are truly representative of our society (Talat et al., 2022). Even within our benchmark, there is the possibility that for tasks such as offensiveness or humor, the crowdsourced labels may undervalue phrases that disregard a specific demographic group, which may inevitably be picked up by LLMs trained and evaluated on these datasets. Improved versions of our benchmark should include datasets that are more inclusive in such contexts, which we leave as a call for future work.
There has been increasing concern over the amount of computing resources required for conducting deep learning research at scale, especially regarding LLMs, where task performance is improved through larger datasets, increased model parameters, and longer training hours. The time and amount of computing resources required for training LLMs have become nontrivial (Bender et al., 2021), and machine learning practitioners have become increasingly aware of the need to consider the carbon footprint of models and computing methods to minimize the risks of global warming. This, combined with limited transparency of experiment results, may harm the very concept of open science. Keeping this in mind, we focused on conducting easily reproducible experiments that can be run on a single GPU within the time frame of hours, or a couple of days at the longest. Some of our findings contribute to this direction, as can be seen in our investigation of multi-task training.
More importantly, we highlight that the main contribution of our study is a thoroughly designed public framework of tasks for examining the social knowledge of LLMs. While it is indeed important to develop and improve LLMs that can perform better on several tasks, we believe that correctly evaluating the level of social knowledge engraved in these models is an equally important task. Without such scrutiny, the users of LLMs deployed in practical settings may be vulnerable to socially undesirable or unethical content. We sincerely hope that our efforts in producing SOCKET can ease the difficulty of conducting future studies that aim to examine and improve the social understanding of LLMs.

A Details on dataset processing

A.1 Benchmark construction (§3)
The SOCKET dataset consists of 58 tasks from 35 unique, public datasets. The datasets that make up the benchmark are processed in a way that balances uniformity across datasets and tasks while minimizing deviations from the original datasets.
For all datasets, key changes from the original dataset are twofold:
• Duplicates and unlabeled items are removed from all datasets. If duplicates occur across data splits, the splits are recombined, reshuffled, and re-split.
• All datasets were made compatible with the Hugging Face Datasets package.

As our benchmark consists of four different task types (classification, regression, sentence pair detection, and span identification), we maintain a unified structure for each task: each sample is fed into the encoder of an LLM, and the output states are then fed into a task-specific head layer. For span detection tasks, we feed the last hidden layer into a bidirectional GRU (Chung et al., 2014), and then the output vectors of the GRU into a linear layer that transforms each vector into a dimension of 3, corresponding to the [B,I,O] labels for each token, following earlier work in span identification (Suman and Jain, 2021). For all other tasks, we feed the last hidden state of the encoder corresponding to the [CLS] token into a separate classifier/regression head consisting of two linear layers of hidden size 768 and a dropout probability of 0.1. We use the mean squared error loss for regression tasks and the cross-entropy loss for all other tasks.
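As an illustration, the two head types described above can be sketched in PyTorch as follows; the exact layer ordering and the choice of activation are assumptions, since the text specifies only the dimensions, the dropout probability, and the GRU.

```python
# Sketch of the task-specific heads described above (dimensions follow the
# text; the Tanh activation is an assumption).
import torch.nn as nn

class SpanHead(nn.Module):
    """Bidirectional GRU over the encoder's last hidden layer, then a linear
    projection to the 3 [B, I, O] tags per token."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True,
                          bidirectional=True)
        self.out = nn.Linear(2 * hidden_size, 3)   # B/I/O logits per token

    def forward(self, encoder_states):             # (batch, seq, hidden)
        states, _ = self.gru(encoder_states)
        return self.out(states)                    # (batch, seq, 3)

class ClassifierHead(nn.Module):
    """Two linear layers of hidden size 768 with dropout 0.1 over the [CLS]
    representation; num_labels=1 gives a regression head."""
    def __init__(self, hidden_size: int = 768, num_labels: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, 768), nn.Tanh(), nn.Dropout(0.1),
            nn.Linear(768, num_labels))

    def forward(self, cls_state):                  # (batch, hidden)
        return self.net(cls_state)
```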

B.3.2 Training strategies for language model finetuning (§4, §6)

When training models for the benchmark (§4) and the multi-task (§6) experiments, the learning rate was linearly increased for 6% of the training steps up to 1e-5 and linearly decreased afterward. All models were trained for a maximum of 10 epochs using three different seeds, with early stopping after validation performance did not increase for three consecutive epochs.
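A minimal sketch of this schedule, assuming the 1e-5 peak is set as the optimizer's base learning rate and the schedule is expressed as a multiplicative LambdaLR factor:

```python
# Sketch: linear warmup over the first 6% of steps, then linear decay to zero.
# The optimizer's base lr is assumed to be the 1e-5 peak described above.
import torch

def make_scheduler(optimizer, total_steps, warmup_frac=0.06):
    warmup_steps = max(1, int(total_steps * warmup_frac))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps                               # warmup
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```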
Our multi-task training in §6 requires two stages of training: (1) a pre-finetuning stage that simultaneously trains a model on multiple different tasks, and (2) a finetuning stage that loads the model trained in (1) and finetunes it on a single task. In the first stage, a single batch can include several different tasks and produce different types of losses. To obtain a unified loss that is differentiable, we aggregate the losses across samples and sum them, using the result for backpropagation. For both stages, we use the same aforementioned training steps and learning rate strategy.
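A minimal sketch of this loss aggregation, assuming hypothetical per-task dictionaries of head modules and loss functions over a shared encoder:

```python
# Sketch: one pre-finetuning step over a mixed batch containing several tasks.
# `heads` and `loss_fns` are hypothetical dicts mapping task names to the
# task-specific heads and loss functions described in §B.3.1.
def multitask_step(encoder, heads, loss_fns, mixed_batch, optimizer):
    total_loss = 0.0
    for task_name, (inputs, labels) in mixed_batch.items():
        hidden = encoder(**inputs).last_hidden_state
        logits = heads[task_name](hidden[:, 0])   # [CLS] state -> task head
        # (span tasks would instead pass the full token sequence)
        total_loss = total_loss + loss_fns[task_name](logits, labels)
    optimizer.zero_grad()
    total_loss.backward()                         # single unified backward pass
    optimizer.step()
    return float(total_loss)
```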
For all settings, the training batch size was set to 32 with 16-bit precision enabled. Validation was performed after each training epoch on the validation set, using (Pearson's r + 1) / 2 for regression tasks and the macro F1 score for all other tasks. If multiple tasks were considered due to multi-task training, the average of all task performances was used as the final validation score.
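The validation metric above amounts to the following small sketch:

```python
# Sketch: per-task validation score; multi-task runs average these values.
from scipy.stats import pearsonr
from sklearn.metrics import f1_score

def validation_score(task_type, preds, golds):
    if task_type == "regression":
        r, _ = pearsonr(preds, golds)
        return (r + 1) / 2                # rescale Pearson's r to [0, 1]
    return f1_score(golds, preds, average="macro")
```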

B.4 Details on prompt-based finetuning (§4, §5)
We use fixed-prompt finetuning for all prompt-based models. The batch size was set to 32 for training. For each task, we set the maximum number of epochs to 10 and apply early stopping based on the validation loss. The learning rate is set to 5e-5.
For classification tasks, the model is fine-tuned to generate the target label. For regression tasks, we first normalize the scores into (0,1) and then split the labels into two groups; the model is finetuned to predict "yes" or "no" in response to the prompt question. During inference, the probability of the "yes" token is used as the prediction score. For span tasks, we train the model to directly generate the sequence outputs (cf. §4).

Figure 3: A comparison of the ratio of valid samples for which the LLM was able to make an inference given the correct instruction prompt (x-axis) versus the overall scores when restricted to the samples on which the model was capable of making an inference (y-axis).

B.5 Details on zero-shot predictions (§4, §5)
We use manually designed prompts for all the zero-shot prediction tasks; the prompts are shown in Table 6.

B.6 Computing correlation scores of task dependencies (§5)

Because our framework consists of several task types, it is challenging to obtain a unified metric of correlation across different task comparisons. We use the following rules to obtain correlation values:

• Regression task & regression task: We compute the Pearson correlation coefficient of the two arrays.

• Regression task & binary classification task: We compute the point biserial correlation coefficient of a continuous array and a binary array.
• Regression task & multi-class classification task: We set up a linear regression using the one-hot encoded values of the multi-class array as independent variables and the continuous array as the dependent variable. We report the square root of the R-squared value of the regression as the correlation (Olsson et al., 1982).
• Binary classification task & binary classification task: We compute the Matthews' correlation coefficient (Matthews, 1975) from the two binary arrays.
• Binary or multi-class classification task & multi-class classification task: We compute the Cramér's V score (Cramér, 1999) from the two arrays of categorical variables.
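Putting these rules together, a minimal sketch of the dispatch is shown below, assuming categorical labels are 0-indexed integers; Cramér's V is computed from the chi-squared statistic of the contingency table.

```python
import numpy as np
from scipy.stats import chi2_contingency, pearsonr, pointbiserialr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import matthews_corrcoef

def cramers_v(x, y):
    # Contingency table over 0-indexed integer labels.
    table = np.zeros((int(max(x)) + 1, int(max(y)) + 1))
    for i, j in zip(x, y):
        table[int(i), int(j)] += 1
    chi2 = chi2_contingency(table)[0]
    return np.sqrt(chi2 / (len(x) * (min(table.shape) - 1)))

def correlation(a, a_type, b, b_type):
    if a_type == b_type == "regression":
        return pearsonr(a, b)[0]
    if {a_type, b_type} == {"regression", "binary"}:
        cont, binary = (a, b) if a_type == "regression" else (b, a)
        return pointbiserialr(binary, cont)[0]
    if "regression" in (a_type, b_type):          # regression vs. multi-class
        cont, multi = (a, b) if a_type == "regression" else (b, a)
        onehot = np.eye(int(max(multi)) + 1)[np.asarray(multi, dtype=int)]
        r2 = LinearRegression().fit(onehot, cont).score(onehot, cont)
        return np.sqrt(r2)                        # root of R-squared
    if a_type == b_type == "binary":
        return matthews_corrcoef(a, b)
    return cramers_v(a, b)                        # pairs involving multi-class
```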

B.7 Computing pairwise model similarities (§5)
We quantify the model similarity between two tasks as follows. We finetune a pretrained LLM on task t_A to obtain a model m_A, and another LLM on task t_B to obtain m_B. We obtain pairwise model similarities by running both models on a sufficiently large dataset (in this case, the entire test set of all tasks) and computing the correlation of the two inferred arrays. We construct an undirected graph (Figure 5) where the thickness and color of an edge represent the absolute correlation strength and polarity between the two models. The addition of polarity enables us to further discover strong negative correlations for task pairs such as politeness and offensiveness.

Figure 1: A comparison of LLMs on the aggregated scores tested on SOCKET under zero-shot settings. The overall performances vary greatly by model architecture, while larger models do not always guarantee better performance.

Figure 2: Heatmap of task dependency among all task pairs, annotated at the category level. Each value represents the absolute strength of correlation between the true labels of the test set of a specific task (columns) and the predictions made on that task using a model trained on a different task (rows). We observe strong correlations, especially within the Offensiveness, Sentiment & Emotion, and Social Factors categories. A larger version labeled at the task level is shown in Appendix Figure 4.

B.1 Computational resources (§4, §5, §6)

All of our experiments were conducted on an Ubuntu 22.04.1 machine with NVIDIA RTX A5000 and A6000 GPUs. The Python packages used in our experiments include PyTorch 1.13, Transformers 4.21.3, and PyTorch Lightning 1.6.4.

B.2 Comparison of all models

B.3 Details on language model finetuning (§4, §5, §6)

B.3.1 Task-specific heads (§4, §5, §6)

Figure 4: A detailed heatmap of Figure 2 showing task dependency among all task pairs as well as task labels. Each value represents the absolute strength of correlation between the true labels of the test set of a specific task (columns) and the predictions made on that task using a model trained on a different task (rows).

Table 1: The SOCKET benchmark covers 58 tasks in 5 categories of social information and four types of tasks: classification (CLS), regression (REG), pairwise comparison (PAIR), and span identification (SPAN).

Table 2: A comparison of the benchmark performances of different models and training schemes across the five categories (Humor & Sarcasm, Offensiveness, Sentiment & Emotion, Social Factors, Trustworthiness) and the four task types (CLS, PAIR, REG, SPAN), along with the average. Best-performing instances are shown in bold. The best-performing parameter size for each zero-shot model is shown (cf. Figure 1).