Rakesh Gosangi


2021

pdf bib
GupShup: Summarizing Open-Domain Code-Switched Conversations
Laiba Mehnaz | Debanjan Mahata | Rakesh Gosangi | Uma Sushmitha Gunturi | Riya Jain | Gauri Gupta | Amardeep Kumar | Isabelle G. Lee | Anish Acharya | Rajiv Ratn Shah
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Code-switching is the communication phenomenon where the speakers switch between different languages during a conversation. With the widespread adoption of conversational agents and chat platforms, code-switching has become an integral part of written conversations in many multi-lingual communities worldwide. Therefore, it is essential to develop techniques for understanding and summarizing these conversations. Towards this objective, we introduce the task of abstractive summarization of Hindi-English (Hi-En) code-switched conversations. We also develop the first code-switched conversation summarization dataset - GupShup, which contains over 6,800 Hi-En conversations and their corresponding human-annotated summaries in English (En) and Hi-En. We present a detailed account of the entire data collection and annotation process. We analyze the dataset using various code-switching statistics. We train state-of-the-art abstractive summarization models and report their performances using both automated metrics and human evaluation. Our results show that multi-lingual mBART and multi-view seq2seq models obtain the best performances on this new dataset. We also conduct an extensive qualitative analysis to provide insight into the models and some of their shortcomings.

pdf bib
On the Use of Context for Predicting Citation Worthiness of Sentences in Scholarly Articles
Rakesh Gosangi | Ravneet Arora | Mohsen Gheisarieha | Debanjan Mahata | Haimin Zhang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this paper, we study the importance of context in predicting the citation worthiness of sentences in scholarly articles. We formulate this problem as a sequence labeling task solved using a hierarchical BiLSTM model. We contribute a new benchmark dataset containing over two million sentences and their corresponding labels. We preserve the sentence order in this dataset and perform document-level train/test splits, which importantly allows incorporating contextual information in the modeling process. We evaluate the proposed approach on three benchmark datasets. Our results quantify the benefits of using context and contextual embeddings for citation worthiness. Lastly, through error analysis, we provide insights into cases where context plays an essential role in predicting citation worthiness.

2020

pdf bib
Two-Step Classification using Recasted Data for Low Resource Settings
Shagun Uppal | Vivek Gupta | Avinash Swaminathan | Haimin Zhang | Debanjan Mahata | Rakesh Gosangi | Rajiv Ratn Shah | Amanda Stent
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

An NLP model’s ability to reason should be independent of language. Previous works utilize Natural Language Inference (NLI) to understand the reasoning ability of models, mostly focusing on high resource languages like English. To address scarcity of data in low-resource languages such as Hindi, we use data recasting to create NLI datasets for four existing text classification datasets. Through experiments, we show that our recasted dataset is devoid of statistical irregularities and spurious patterns. We further study the consistency in predictions of the textual entailment models and propose a consistency regulariser to remove pairwise-inconsistencies in predictions. We propose a novel two-step classification method which uses textual-entailment predictions for classification task. We further improve the performance by using a joint-objective for classification and textual entailment. We therefore highlight the benefits of data recasting and improvements on classification performance using our approach with supporting experimental results.

pdf bib
Semi-Supervised Iterative Approach for Domain-Specific Complaint Detection in Social Media
Akash Gautam | Debanjan Mahata | Rakesh Gosangi | Rajiv Ratn Shah
Proceedings of the 3rd Workshop on e-Commerce and NLP

In this paper, we present a semi-supervised bootstrapping approach to detect product or service related complaints in social media. Our approach begins with a small collection of annotated samples which are used to identify a preliminary set of linguistic indicators pertinent to complaints. These indicators are then used to expand the dataset. The expanded dataset is again used to extract more indicators. This process is applied for several iterations until we can no longer find any new indicators. We evaluated this approach on a Twitter corpus specifically to detect complaints about transportation services. We started with an annotated set of 326 samples of transportation complaints, and after four iterations of the approach, we collected 2,840 indicators and over 3,700 tweets. We annotated a random sample of 700 tweets from the final dataset and observed that nearly half the samples were actual transportation complaints. Lastly, we also studied how different features based on semantics, orthographic properties, and sentiment contribute towards the prediction of complaints.

pdf bib
A Preliminary Exploration of GANs for Keyphrase Generation
Avinash Swaminathan | Haimin Zhang | Debanjan Mahata | Rakesh Gosangi | Rajiv Ratn Shah | Amanda Stent
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce a new keyphrase generation approach using Generative Adversarial Networks (GANs). For a given document, the generator produces a sequence of keyphrases, and the discriminator distinguishes between human-curated and machine-generated keyphrases. We evaluated this approach on standard benchmark datasets. We observed that our model achieves state-of-the-art performance in the generation of abstractive keyphrases and is comparable to the best performing extractive techniques. Although we achieve promising results using GANs, they are not significantly better than the state-of-the-art generative models. To our knowledge, this is one of the first works that use GANs for keyphrase generation. We present a detailed analysis of our observations and expect that these findings would help other researchers to further study the use of GANs for the task of keyphrase generation.

pdf bib
An Annotated Dataset of Discourse Modes in Hindi Stories
Swapnil Dhanwal | Hritwik Dutta | Hitesh Nankani | Nilay Shrivastava | Yaman Kumar | Junyi Jessy Li | Debanjan Mahata | Rakesh Gosangi | Haimin Zhang | Rajiv Ratn Shah | Amanda Stent
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a new corpus consisting of sentences from Hindi short stories annotated for five different discourse modes argumentative, narrative, descriptive, dialogic and informative. We present a detailed account of the entire data collection and annotation processes. The annotations have a very high inter-annotator agreement (0.87 k-alpha). We analyze the data in terms of label distributions, part of speech tags, and sentence lengths. We characterize the performance of various classification algorithms on this dataset and perform ablation studies to understand the nature of the linguistic models suitable for capturing the nuances of the embedded discourse structures in the presented corpus.

pdf bib
MIDAS at SemEval-2020 Task 10: Emphasis Selection Using Label Distribution Learning and Contextual Embeddings
Sarthak Anand | Pradyumna Gupta | Hemant Yadav | Debanjan Mahata | Rakesh Gosangi | Haimin Zhang | Rajiv Ratn Shah
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper presents our submission to the SemEval 2020 - Task 10 on emphasis selection in written text. We approach this emphasis selection problem as a sequence labeling task where we represent the underlying text with various contextual embedding models. We also employ label distribution learning to account for annotator disagreements. We experiment with the choice of model architectures, trainability of layers, and different contextual embeddings. Our best performing architecture is an ensemble of different models, which achieved an overall matching score of 0.783, placing us 15th out of 31 participating teams. Lastly, we analyze the results in terms of parts of speech tags, sentence lengths, and word ordering.