Natural Language Understanding (NLU) faces considerable difficulty in identifying multiple intents across different domains in low-resource languages. Our contributions are threefold: using semantically similar pivot languages for NLU tasks; building a vector database for efficient indexing and retrieval of high-resource language embeddings to support Retrieval Augmented Generation (RAG) in low-resource languages; and thoroughly investigating how segment-based strategies affect complex, multi-domain, multi-intent user utterances when Chain of Thought (CoT) prompting is combined with RAG. The study explored recursive approaches for identifying the most effective zero-shot instances for segment-based prompting. A comparative analysis evaluated sentence-based versus segment-based prompting across different domains and multiple intents. This research offers a promising avenue for addressing the challenges of NLU in low-resource languages, with potential applications in conversational agents and dialogue systems and a broader impact on linguistic understanding and inclusivity.
Trolling refers to any online user behaviour intended to provoke or instigate conflict, predominantly on social media. This paper aims to classify troll meme captions written in Tamil-English code-mixed text. Embeddings are obtained for the raw code-mixed text and for its translated and transliterated versions, and their relative performance is compared. Furthermore, this paper compares the performance of 11 different classification algorithms using accuracy and F1 score. We achieved a weighted F1 score of 0.74 with the pretrained MuRIL model.
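For readers unfamiliar with the metric, the weighted F1 score reported above is the mean of the per-class F1 scores, each weighted by the number of true examples of that class. The following is a minimal sketch in plain Python; the labels `troll` and `not_troll` and the toy predictions are hypothetical and are not taken from the paper's data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its support
    return score


# Hypothetical toy labels, for illustration only.
y_true = ["troll", "troll", "not_troll", "not_troll", "not_troll"]
y_pred = ["troll", "not_troll", "not_troll", "not_troll", "troll"]
score = weighted_f1(y_true, y_pred)
```

Weighting by support keeps a majority class from being under-represented in the average, which matters when troll and non-troll captions are imbalanced.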
Code-switching refers to textual or spoken data that contains multiple languages. Natural language processing (NLP) tasks such as sentiment analysis are harder on code-switched languages because of irregularities in sentence structure and word order. This paper presents experimental results from building kernel-based Extreme Learning Machines (ELMs) for sentiment analysis of Dravidian languages code-switched with English. Our results show that ELMs outperform traditional machine learning classifiers on various metrics and train faster than deep learning models. We also show that polynomial kernels perform better than other kernels in the ELM architecture. We achieved a median AUC of 0.79 with a polynomial kernel.
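As a rough illustration of why a kernel ELM trains quickly, the sketch below uses the commonly cited closed-form solution, in which the output weights are obtained by solving (K + I/C) beta = t, where K is the kernel matrix over the training set. This is an assumption about the general technique, not the paper's implementation: the kernel degree, `coef0`, the regularization constant `C`, and the toy XOR-style data are all hypothetical.

```python
def poly_kernel(a, b, degree=2, coef0=1.0):
    """Polynomial kernel (a . b + coef0) ** degree."""
    return (sum(x * y for x, y in zip(a, b)) + coef0) ** degree


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_kernel_elm(X, t, C=100.0):
    """Return output weights beta solving (K + I/C) beta = t."""
    n = len(X)
    K = [[poly_kernel(X[i], X[j]) + (1.0 / C if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, t)


def predict(X_train, beta, x):
    """Decision value for x; its sign gives the predicted class."""
    return sum(b * poly_kernel(x, xi) for b, xi in zip(beta, X_train))


# Hypothetical XOR-style data with targets in {-1, +1}.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
t = [-1.0, 1.0, 1.0, -1.0]
beta = fit_kernel_elm(X, t)
signs = [1.0 if predict(X, beta, x) > 0 else -1.0 for x in X]
```

Training reduces to a single linear solve with no iterative weight updates, which is consistent with the faster training the abstract reports relative to deep learning models.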
Event detection is a research area in text mining that has attracted attention during this decade because of the widespread availability of social media data, specifically Twitter data. Twitter has become a major source of information about real-world events because of its use of hashtags and its short character limit, which ensures a concise presentation of events. Previous work on event detection from tweets either applies only to localized events or breaking news, or misses many important events. This paper presents the problems associated with event detection from tweets and a tweet-segmentation based system for event detection called SEDTWik, an extension of previous work, that is able to detect newsworthy events occurring at different locations around the world and across a wide range of categories. The main idea is to split each tweet and hashtag into segments, extract bursty segments, cluster them, and summarize them. We evaluated our results on the well-known Events2012 corpus and achieved state-of-the-art results. Keywords: Event detection, Twitter, Social Media, Microblogging, Tweet segmentation, Text Mining, Wikipedia, Hashtag.
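The first two steps of the pipeline (splitting hashtags into segments and flagging bursty segments) can be sketched as follows. This is an illustrative approximation, not SEDTWik's actual method: the camel-case splitting rule, the z-score burst measure, and the names `split_hashtag` and `burst_score` are all assumptions introduced here.

```python
import math
import re


def split_hashtag(tag):
    """Split a camel-case hashtag such as '#WorldCupFinal' into words."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+"
    return re.findall(pattern, tag.lstrip("#"))


def burst_score(count_in_window, tweets_in_window, prior_prob):
    """Z-score of a segment's count in the current time window.

    Assumes the count follows a binomial distribution with the
    segment's long-run probability `prior_prob` (a hypothetical model).
    """
    expected = tweets_in_window * prior_prob
    variance = tweets_in_window * prior_prob * (1.0 - prior_prob)
    std = math.sqrt(variance)
    return (count_in_window - expected) / std if std else 0.0


words = split_hashtag("#WorldCupFinal")

# Hypothetical counts: a segment that normally appears in 1% of tweets
# shows up in 30 of the 1000 tweets in the current window.
z = burst_score(count_in_window=30, tweets_in_window=1000, prior_prob=0.01)
```

Segments whose score exceeds a chosen threshold would then be passed on to the clustering and summarization stages.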