Nonverbal vocalizations are an essential component of human communication, conveying rich information without linguistic content. However, their computational analysis is hindered by the lack of lexical anchors in the data, compounded by biased and imbalanced data distributions. While disentangled representation learning has shown promise in isolating specific speech features, its application to nonverbal vocalizations remains unexplored. In this paper, we introduce N-CORE, a novel backbone-agnostic framework that disentangles intertwined features such as emotion and speaker information from nonverbal vocalizations by leveraging N views of each audio sample to learn invariance to specific transformations. N-CORE achieves competitive performance against state-of-the-art methods for emotion and speaker classification on the VIVAE, ReCANVo, and ReCANVo-Balanced datasets. We further propose an emotion perturbation function that disrupts affective information in audio signals while preserving speaker identity, enabling emotion-invariant speaker classification. Our work informs research directions in paralinguistic speech processing, including clinical diagnosis of atypical speech and longitudinal analysis of communicative development. Our code is available at https://github.com/SiddhantBikram/N-CORE.
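The core idea of learning invariance from N augmented views of the same sample can be illustrated with a minimal NumPy sketch. The encoder, the perturbation, and the pairwise cosine loss below are toy stand-ins chosen for illustration, not the N-CORE implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_encoder(x, W):
    """Toy stand-in for an audio encoder: a single tanh projection."""
    return np.tanh(W @ x)

def perturb(x, rng):
    """Hypothetical perturbation: small multiplicative noise standing in
    for a transformation the model should become invariant to."""
    return x * (1.0 + 0.05 * rng.standard_normal(x.shape))

def invariance_loss(views):
    """Mean pairwise (1 - cosine similarity) across N encoded views.
    Zero when all views map to the same direction in embedding space."""
    normed = [v / np.linalg.norm(v) for v in views]
    loss, pairs = 0.0, 0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            loss += 1.0 - float(normed[i] @ normed[j])
            pairs += 1
    return loss / pairs

x = rng.standard_normal(128)           # stand-in audio feature vector
W = rng.standard_normal((16, 128))     # toy encoder weights
views = [toy_encoder(perturb(x, rng), W) for _ in range(4)]  # N = 4 views
print(round(invariance_loss(views), 4))
```

Minimizing such a loss pushes the encoder to discard whatever the perturbation changes, which is the mechanism the abstract describes for separating emotion from speaker information.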
The prevalence of toxic behavior in online gaming communities necessitates robust detection methods to ensure user safety. We introduce GameTox, a novel dataset of 53K game chat utterances annotated for toxicity detection through intent classification and slot filling. This dataset captures the complex relationship between user intent and the specific linguistic features that contribute to toxic interactions. We extensively analyze the dataset to uncover key insights into the nature of toxic speech in gaming environments. Furthermore, we establish baseline performance metrics using state-of-the-art natural language processing models and large language models, demonstrating the dataset’s value for detecting toxic behavior and revealing the limitations of contemporary models. Our results indicate that combining intent detection with slot filling yields a significantly more granular and context-aware understanding of harmful messages. The dataset serves as a valuable resource for training advanced models that can effectively mitigate toxicity in online gaming and foster healthier digital spaces. Our dataset is publicly available at: https://github.com/shucoll/GameTox.
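A minimal sketch of the joint intent-plus-slot annotation scheme this kind of dataset uses, in the standard BIO formulation. The utterance, intent label, and slot types below are invented for illustration and are not drawn from GameTox:

```python
# An utterance-level intent label paired with BIO slot tags per token,
# the standard joint formulation for intent detection and slot filling.
example = {
    "utterance": ["you", "are", "trash", "uninstall", "now"],
    "intent": "insult",  # hypothetical intent label
    "slots": ["O", "O", "B-ABUSIVE_TERM", "B-DEMAND", "I-DEMAND"],
}

def extract_slots(tokens, tags):
    """Collect (slot_type, text) spans from BIO tags."""
    spans, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

print(example["intent"], extract_slots(example["utterance"], example["slots"]))
```

Pairing the utterance-level intent with token-level spans is what gives the "more granular and context-aware" signal the abstract refers to: a model sees not only that a message is an insult, but which words carry the abuse.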
The complexity of text-embedded images poses a formidable challenge in machine learning, as it demands multimodal understanding of the multiple aspects of expression they convey. While previous research in multimodal analysis has primarily focused on singular aspects such as hate speech and its subclasses, this study expands that focus to encompass multiple aspects of linguistics: hate, targets of hate, stance, and humor. We introduce PrideMM, a novel dataset of 5,063 text-embedded images associated with the LGBTQ+ Pride movement, addressing a serious gap in existing resources. We conduct extensive experiments on PrideMM using unimodal and multimodal baseline methods to establish benchmarks for each task. Additionally, we propose MemeCLIP, a novel framework for efficient downstream learning that preserves the knowledge of the pre-trained CLIP model. Our experiments show that MemeCLIP achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the performance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP.
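CLIP-style classification scores an image embedding against text-prompt embeddings by cosine similarity. A toy NumPy sketch of that scoring step follows; the vectors here are random stand-ins, not real CLIP embeddings, and the label names are illustrative:

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=0.01):
    """CLIP-style scoring: cosine similarity between a normalized image
    embedding and normalized text-prompt embeddings, then softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = np.stack([t / np.linalg.norm(t) for t in text_embs])
    logits = (txt @ img) / temperature
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

rng = np.random.default_rng(1)
labels = ["hateful", "not hateful"]   # binary hate task, as in the abstract
image_emb = rng.standard_normal(512)  # stand-in for an image embedding
# Make the first prompt embedding deliberately close to the image embedding.
text_embs = [image_emb + 0.1 * rng.standard_normal(512),
             rng.standard_normal(512)]
probs = zero_shot_scores(image_emb, text_embs)
print(labels[int(np.argmax(probs))])
```

Downstream methods in this space typically keep the pre-trained image and text encoders frozen and adapt only lightweight components on top, so the similarity structure sketched above is preserved rather than overwritten during fine-tuning.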