Ranbir Sanasam


2024

pdf bib
IndiSentiment140: Sentiment Analysis Dataset for Indian Languages with Emphasis on Low-Resource Languages using Machine Translation
Saurabh Kumar | Ranbir Sanasam | Sukumar Nandi
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Sentiment analysis, a fundamental aspect of Natural Language Processing (NLP), involves the classification of emotions, opinions, and attitudes in text data. In the context of India, with its vast linguistic diversity and low-resource languages, the challenge is to support sentiment analysis in numerous Indian languages. This study explores the use of machine translation to bridge this gap. The investigation examines the feasibility of machine translation for creating sentiment analysis datasets in 22 Indian languages. Google Translate, with its extensive language support, is employed for this purpose in translating the Sentiment140 dataset. The study aims to provide insights into the practicality of using machine translation in the context of India’s linguistic diversity for sentiment analysis datasets. Our findings indicate that a dataset generated using Google Translate has the potential to serve as a foundational framework for tackling the low-resource challenges commonly encountered in sentiment analysis for Indian languages.

2023

pdf bib
IndiSocialFT: Multilingual Word Representation for Indian languages in code-mixed environment
Saurabh Kumar | Ranbir Sanasam | Sukumar Nandi
Findings of the Association for Computational Linguistics: EMNLP 2023

The increasing number of Indian language users on the internet necessitates the development of Indian language technologies. In response to this demand, our paper presents a generalized representation vector for diverse text characteristics, including native scripts, transliterated text, multilingual, code-mixed, and social media-related attributes. We gather text from both social media and well-formed sources and utilize the FastText model to create the “IndiSocialFT” embedding. Through intrinsic and extrinsic evaluation methods, we compare IndiSocialFT with three popular pretrained embeddings trained over Indian languages. Our findings show that the proposed embedding surpasses the baselines in most cases and languages, demonstrating its suitability for various NLP applications.

2022

pdf bib
TEAM: A multitask learning based Taxonomy Expansion approach for Attach and Merge
Bornali Phukon | Anasua Mitra | Ranbir Sanasam | Priyankoo Sarmah
Findings of the Association for Computational Linguistics: NAACL 2022

Taxonomy expansion is a crucial task. Most of Automatic expansion of taxonomy are of two types, attach and merge. In a taxonomy like WordNet, both merge and attach are integral parts of the expansion operations but majority of study consider them separately. This paper proposes a novel mult-task learning-based deep learning method known as Taxonomy Expansion with Attach and Merge (TEAM) that performs both the merge and attach operations. To the best of our knowledge this is the first study which integrates both merge and attach operations in a single model. The proposed models have been evaluated on three separate WordNet taxonomies, viz., Assamese, Bangla, and Hindi. From the various experimental setups, it is shown that TEAM outperforms its state-of-the-art counterparts for attach operation, and also provides highly encouraging performance for the merge operation.