This work presents the Sociolinguistic Corpus of WhatsApp Chats in Spanish among College Students, a corpus of raw data for general use. Its purpose is to offer data for the study of of language and interactions via Instant Messaging (IM) among bachelors. Our paper consists of an overview of both the corpus’s content and demographic metadata. Furthermore, it presents the current research being conducted with it —namely parenthetical expressions, orality traits, and code-switching. This work also includes a brief outline of similar corpora and recent studies in the field of IM.
Computational Humor involves several tasks, such as humor recognition, humor generation, and humor scoring, for which it is useful to have human-curated data. In this work we present a corpus of 27,000 tweets written in Spanish and crowd-annotated by their humor value and funniness score, with about four annotations per tweet, tagged by 1,300 people over the Internet. It is equally divided between tweets coming from humorous and non-humorous accounts. The inter-annotator agreement Krippendorff’s alpha value is 0.5710. The dataset is available for general usage and can serve as a basis for humor detection and as a first step to tackle subjectivity.
Code-mixing is a linguistic phenomenon where multiple languages are used in the same occurrence that is increasingly common in multilingual societies. Code-mixed content on social media is also on the rise, prompting the need for tools to automatically understand such content. Automatic Parts-of-Speech (POS) tagging is an essential step in any Natural Language Processing (NLP) pipeline, but there is a lack of annotated data to train such models. In this work, we present a unique language tagged and POS-tagged dataset of code-mixed English-Hindi tweets related to five incidents in India that led to a lot of Twitter activity. Our dataset is unique in two dimensions: (i) it is larger than previous annotated datasets and (ii) it closely resembles typical real-world tweets. Additionally, we present a POS tagging model that is trained on this dataset to provide an example of how this dataset can be used. The model also shows the efficacy of our dataset in enabling the creation of code-mixed social media POS taggers.
The exponential rise of social media websites like Twitter, Facebook and Reddit in linguistically diverse geographical regions has led to hybridization of popular native languages with English in an effort to ease communication. The paper focuses on the classification of offensive tweets written in Hinglish language, which is a portmanteau of the Indic language Hindi with the Roman script. The paper introduces a novel tweet dataset, titled Hindi-English Offensive Tweet (HEOT) dataset, consisting of tweets in Hindi-English code switched language split into three classes: non-offensive, abusive and hate-speech. Further, we approach the problem of classification of the tweets in HEOT dataset using transfer learning wherein the proposed model employing Convolutional Neural Networks is pre-trained on tweets in English followed by retraining on Hinglish tweets.
This paper describes an overview of the Dialogue Emotion Recognition Challenge, EmotionX, at the Sixth SocialNLP Workshop, which recognizes the emotion of each utterance in dialogues. This challenge offers the EmotionLines dataset as the experimental materials. The EmotionLines dataset contains conversations from Friends TV show transcripts (Friends) and real chatting logs (EmotionPush), where every dialogue utterance is labeled with emotions. Organizers provide baseline results. 18 teams registered in this challenge and 5 of them submitted their results successfully. The best team achieves the unweighted accuracy 62.48 and 62.5 on EmotionPush and Friends, respectively. In this paper we present the task definition, test collection, the evaluation results of the groups that participated in this challenge, and their approach.
In this paper, we propose a self-attentive bidirectional long short-term memory (SA-BiLSTM) network to predict multiple emotions for the EmotionX challenge. The BiLSTM exhibits the power of modeling the word dependencies, and extracting the most relevant features for emotion classification. Building on top of BiLSTM, the self-attentive network can model the contextual dependencies between utterances which are helpful for classifying the ambiguous emotions. We achieve 59.6 and 55.0 unweighted accuracy scores in the Friends and the EmotionPush test sets, respectively.
In this paper, we model emotions in EmotionLines dataset using a convolutional-deconvolutional autoencoder (CNN-DCNN) framework. We show that adding a joint reconstruction loss improves performance. Quantitative evaluation with jointly trained network, augmented with linguistic features, reports best accuracies for emotion prediction; namely joy, sadness, anger, and neutral emotion in text.
This paper describes the working note on “EmotionX” shared task. It is hosted by SocialNLP 2018. The objective of this task is to detect the emotions, based on each speaker’s utterances that are in English. Taking this as multiclass text classification problem, we have experimented to develop a model to classify the target class. The primary challenge in this task is to detect the emotions in short messages, communicated through social media. This paper describes the participation of SmartDubai_NLP team in EmotionX shared task and our investigation to detect the emotions from utterance using Neural networks and Natural language understanding.
This paper presents our system submitted to the EmotionX challenge. It is an emotion detection task on dialogues in the EmotionLines dataset. We formulate this as a hierarchical network where network learns data representation at both utterance level and dialogue level. Our model is inspired by Hierarchical Attention network (HAN) and uses pre-trained word embeddings as features. We formulate emotion detection in dialogues as a sequence labeling problem to capture the dependencies among labels. We report the performance accuracy for four emotions (anger, joy, neutral and sadness). The model achieved unweighted accuracy of 55.38% on Friends test dataset and 56.73% on EmotionPush test dataset. We report an improvement of 22.51% in Friends dataset and 36.04% in EmotionPush dataset over baseline results.
This paper addresses the problem of automatic recognition of emotions in conversational text datasets for the EmotionX challenge. Emotion is a human characteristic expressed through several modalities (e.g., auditory, visual, tactile). Trying to detect emotions only from the text becomes a difficult task even for humans. This paper evaluates several neural architectures based on Attention Models, which allow extracting relevant parts of the context within a conversation to identify the emotion associated with each utterance. Empirical results in the validation datasets demonstrate the effectiveness of the approach compared to the reference models for some instances, and other cases show better results with simpler models.
In this paper, we discuss the enrichment of a manually developed resource, OntoSenseNet for Telugu. OntoSenseNet is a sense annotated resource that marks each verb of Telugu with a primary and a secondary sense. The area of research is relatively recent but has a large scope of development. We provide an introductory work to enrich the OntoSenseNet to promote further research in Telugu. Classifiers are adopted to learn the sense relevant features of the words in the resource and also to automate the tagging of sense-types for verbs. We perform a comparative analysis of different classifiers applied on OntoSenseNet. The results of the experiment prove that automated enrichment of the resource is effective using SVM classifiers and Adaboost ensemble.
A large amount of social media data is generated during natural disasters, and identifying the relevant portions of this data is critical for researchers attempting to understand human behavior, the effects of information sources, and preparatory actions undertaken during these events. In order to classify human behavior during hazard events, we employ machine learning for two tasks: identifying hurricane related tweets and classifying user evacuation behavior during hurricanes. We show that feature-based and deep learning methods provide different benefits for tweet classification, and ensemble-based methods using linguistic, temporal, and geospatial features can effectively classify user behavior.
In this study we propose a new approach to analyse the political discourse in on-line social networks such as Twitter. To do so, we have built a discourse classifier using Convolutional Neural Networks. Our model has been trained using election manifestos annotated manually by political scientists following the Regional Manifestos Project (RMP) methodology. In total, it has been trained with more than 88,000 sentences extracted from more that 100 annotated manifestos. Our approach takes into account the context of the phrase in order to classify it, like what was previously said and the political affiliation of the transmitter. To improve the classification results we have used a simplified political message taxonomy developed within the Electronic Regional Manifestos Project (E-RMP). Using this taxonomy, we have validated our approach analysing the Twitter activity of the main Spanish political parties during 2015 and 2016 Spanish general election and providing a study of their discourse.