Vivek Srivastava


2021

pdf bib
Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text
Vivek Srivastava | Mayank Singh
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Code-mixing is a frequent communication style among multilingual speakers where they mix words and phrases from two different languages in the same utterance of text or speech. Identifying and filtering code-mixed text is a challenging task due to its co-existence with monolingual and noisy text. Over the years, several code-mixing metrics have been extensively used to identify and validate code-mixed text quality. This paper demonstrates several inherent limitations of code-mixing metrics with examples from the already existing datasets that are popularly used across various experiments.

pdf bib
PoliWAM: An Exploration of a Large Scale Corpus of Political Discussions on WhatsApp Messenger
Vivek Srivastava | Mayank Singh
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

WhatsApp Messenger is one of the most popular channels for spreading information with a current reach of more than 180 countries and 2 billion people. Its widespread usage has made it one of the most popular media for information propagation among the masses during any socially engaging event. In the recent past, several countries have witnessed its effectiveness and influence in political and social campaigns. We observe a high surge in information and propaganda flow during election campaigning. In this paper, we explore a high-quality large-scale user-generated dataset curated from WhatsApp comprising of 281 groups, 31,078 unique users, and 223,404 messages shared before, during, and after the Indian General Elections 2019, encompassing all major Indian political parties and leaders. In addition to the raw noisy user-generated data, we present a fine-grained annotated dataset of 3,848 messages that will be useful to understand the various dimensions of WhatsApp political campaigning. We present several complementary insights into the investigative and sensational news stories from the same period. Exploratory data analysis and experiments showcase several exciting results and future research opportunities. To facilitate reproducible research, we make the anonymized datasets available in the public domain.

pdf bib
Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text
Vivek Srivastava | Mayank Singh
Proceedings of the 14th International Conference on Natural Language Generation

In this shared task, we seek the participating teams to investigate the factors influencing the quality of the code-mixed text generation systems. We synthetically generate code-mixed Hinglish sentences using two distinct approaches and employ human annotators to rate the generation quality. We propose two subtasks, quality rating prediction and annotators’ disagreement prediction of the synthetic Hinglish dataset. The proposed subtasks will put forward the reasoning and explanation of the factors influencing the quality and human perception of the code-mixed text.

pdf bib
What BERTs and GPTs know about your brand? Probing contextual language models for affect associations
Vivek Srivastava | Stephen Pilli | Savita Bhat | Niranjan Pedanekar | Shirish Karande
Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Investigating brand perception is fundamental to marketing strategies. In this regard, brand image, defined by a set of attributes (Aaker, 1997), is recognized as a key element in indicating how a brand is perceived by various stakeholders such as consumers and competitors. Traditional approaches (e.g., surveys) to monitor brand perceptions are time-consuming and inefficient. In the era of digital marketing, both brand managers and consumers engage with a vast amount of digital marketing content. The exponential growth of digital content has propelled the emergence of pre-trained language models such as BERT and GPT as essential tools in solving myriads of challenges with textual data. This paper seeks to investigate the extent of brand perceptions (i.e., brand and image attribute associations) these language models encode. We believe that any kind of bias for a brand and attribute pair may influence customer-centric downstream tasks such as recommender systems, sentiment analysis, and question-answering, e.g., suggesting a specific brand consistently when queried for innovative products. We use synthetic data and real-life data and report comparison results for five contextual LMs, viz. BERT, RoBERTa, DistilBERT, ALBERT and BART.

pdf bib
MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation
Ayush Garg | Sammed Kagi | Vivek Srivastava | Mayank Singh
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

pdf bib
HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text
Vivek Srivastava | Mayank Singh
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

2020

pdf bib
IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection
Vivek Srivastava | Mayank Singh
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Code-mixing is the phenomenon of using multiple languages in the same utterance. It is a frequently used pattern of communication on social media sites such as Facebook, Twitter, etc. Sentiment analysis of the monolingual text is a well-studied task. Code-mixing adds to the challenge of analyzing the sentiment of the text on various platforms such as social media, online gaming, forums, product reviews, etc. We present a candidate sentence generation and selection based approach on top of the Bi-LSTM based neural classifier to classify the Hinglish code-mixed text into one of the three sentiment classes positive, negative, or neutral. The proposed candidate sentence generation and selection based approach show an improvement in the system performance as compared to the Bi-LSTM based neural classifier. We can extend the proposed method to solve other problems with code-mixing in the textual data, such as humor-detection, intent classification, etc.

pdf bib
PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation
Vivek Srivastava | Mayank Singh
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Code-mixing is the phenomenon of using more than one language in a sentence. In the multilingual communities, it is a very frequently observed pattern of communication on social media platforms. Flexibility to use multiple languages in one text message might help to communicate efficiently with the target audience. But, the noisy user-generated code-mixed text adds to the challenge of processing and understanding natural language to a much larger extent. Machine translation from monolingual source to the target language is a well-studied research problem. Here, we demonstrate that widely popular and sophisticated translation systems such as Google Translate fail at times to translate code-mixed text effectively. To address this challenge, we present a parallel corpus of the 13,738 code-mixed Hindi-English sentences and their corresponding human translation in English. In addition, we also propose a translation pipeline build on top of Google Translate. The evaluation of the proposed pipeline on PHINC demonstrates an increase in the performance of the underlying system. With minimal effort, we can extend the dataset and the proposed approach to other code-mixing language pairs.