Hosahalli Lakshmaiah Shashirekha

2025

Sentiment analysis is an essential task for interpreting subjective opinions and emotions in textual data, with significant implications across commercial and societal applications. This paper provides an overview of the shared task on Sentiment Analysis in Tamil and Tulu, organized as part of DravidianLangTech@NAACL 2025. The task comprises two components: one addressing Tamil and the other focusing on Tulu, both designed as multi-class classification challenges, wherein the sentiment of a given text must be categorized as positive, negative, neutral and unknown. The dataset was diligently organized by aggregating user-generated content from social media platforms such as YouTube and Twitter, ensuring linguistic diversity and real-world applicability. Participants applied a variety of computational approaches, ranging from classical machine learning algorithms such as Traditional Machine Learning Models, Deep Learning Models, Pre-trained Language Models and other Feature Representation Techniques to tackle the challenges posed by linguistic code-mixing, orthographic variations, and resource scarcity in these low resource languages.

pdf bib abs

NAYEL@DravidianLangTech-2025: Character N-gram and Machine Learning Coordination for Fake News Detection in Dravidian Languages
Hamada Nayel | Mohammed Aldawsari | Hosahalli Lakshmaiah Shashirekha
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This paper introduces the detailed description of the submitted model by the team NAYEL to Fake News Detection in Dravidian Languages shared task. The proposed model uses a simple character n-gram TF-IDF as a feature extraction approach integrated with an ensemble of various classical machine learning classification algorithms. While the simplicity of the proposed model structure, although it outperforms other complex structure models as the shared task results observed. The proposed model achieved a f1-score of 87.5% and secured the 5th rank.

2024

This paper provides a comprehensive summary of the “Homophobia and Transphobia Detection in Social Media Comments” shared task, which was held at the LT-EDI@EACL 2024. The objective of this task was to develop systems capable of identifying instances of homophobia and transphobia within social media comments. This challenge was extended across ten languages: English, Tamil, Malayalam, Telugu, Kannada, Gujarati, Hindi, Marathi, Spanish, and Tulu. Each comment in the dataset was annotated into three categories. The shared task attracted significant interest, with over 60 teams participating through the CodaLab platform. The submission of prediction from the participants was evaluated with the macro F1 score.

2023

pdf bib abs

MUCSD@DravidianLangTech2023: Predicting Sentiment in Social Media Text using Machine Learning Techniques
Sharal Coelho | Asha Hegde | Pooja Lamani | Kavya G | Hosahalli Lakshmaiah Shashirekha
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

User-generated social media texts are a blend of resource-rich languages like English and low-resource Dravidian languages like Tamil, Kannada, Tulu, etc. These texts referred to as code-mixing texts are enriching social media since they are written in two or more languages using either a common language script or various language scripts. However, due to the complex nature of the code-mixed text, in this paper, we - team MUCSD, describe a Machine learning (ML) models submitted to “Sentiment Analysis in Tamil and Tulu” shared task at DravidianLangTech@RANLP 2023. The proposed methodology makes use of ML models such as Linear Support Vector Classifier (LinearSVC), LR, and ensemble model (LR, DT, and SVM) to perform SA in Tamil and Tulu languages. The proposed LinearSVC model’s predictions submitted to the shared tasks, obtained 8th and 9th rank for Tamil-English and Tulu-English respectively.

pdf bib abs

In recent years, there has been a growing focus on Sentiment Analysis (SA) of code-mixed Dravidian languages. However, the majority of social media text in these languages is code-mixed, presenting a unique challenge. Despite this, there is currently lack of research on SA specifically tailored for code-mixed Dravidian languages, highlighting the need for further exploration and development in this domain. In this view, “Sentiment Analysis in Tamil and Tulu- DravidianLangTech” shared task at Recent Advances in Natural Language Processing (RANLP)- 2023 is organized. This shred consists two language tracks: code-mixed Tamil and Tulu and Tulu text is first ever explored in public domain for SA. We describe the task, its organization, and the submitted systems followed by the results. 57 research teams registered for the shared task and We received 27 systems each for code-mixed Tamil and Tulu texts. The performance of the systems (developed by participants) has been evaluated in terms of macro average F1 score. The top system for code-mixed Tamil and Tulu texts scored macro average F1 score of 0.32, and 0.542 respectively. The high quality and substantial quantity of submissions demonstrate a significant interest and attention in the analysis of code-mixed Dravidian languages. However, the current state of the art in this domain indicates the need for further advancements and improvements to effectively address the challenges posed by code-mixed Dravidian language SA.

pdf bib abs

MUNLP@DravidianLangTech2023: Learning Approaches for Sentiment Analysis in Code-mixed Tamil and Tulu Text
Asha Hegde | Kavya G | Sharal Coelho | Pooja Lamani | Hosahalli Lakshmaiah Shashirekha
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Sentiment Analysis (SA) examines the subjective content of a statement, such as opinions, assessments, feelings, or attitudes towards a subject, person, or a thing. Though several models are developed for SA in high-resource languages like English, Spanish, German, etc., uder-resourced languages like Dravidian languages are less explored. To address the challenges of SA in low resource Dravidian languages, in this paper, we team MUNLP describe the models submitted to “Sentiment Analysis in Tamil and Tulu- DravidianLangTech” shared task at Recent Advances in Natural Language Processing (RANLP)-2023. n-gramsSA, EmbeddingsSA and BERTSA are the models proposed for SA shared task. Among all the models, BERTSA exhibited a maximum macro F1 score of 0.26 for code-mixed Tamil texts securing 2nd place in the shared task. EmbeddingsSA exhibited maximum macro F1 score of 0.53 securing 2nd place for Tulu code-mixed texts.

pdf bib abs

KT2: Kannada-Tulu Parallel Corpus Construction for Neural Machine Translation
Asha Hegde | Hosahalli Lakshmaiah Shashirekha
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

In the last decade, Neural Machine Translation (NMT) has experienced substantial advances. However, its widespread success has revealed a limitation in terms of reduced proficiency when dealing with under-resourced language pairs, mainly due to the lack of parallel corpora in comparison to high-resourced language pairs like English-German, EnglishSpanish, and English-French. As a result, researchers have increasingly focused on implementing NMT techniques tailored to underresourced language pairs and thereby, the construction/collection of parallel corpora. In view of the scarcity of parallel corpus for underresourced languages, the strategies for building a Kannada-Tulu parallel corpus and baseline models for Machine Translation (MT) of Kannada-Tulu are described in this paper. Both Kannada and Tulu languages are under-resourced due to lack of processing tools and digital resources, especially parallel corpora, which are critical for MT development. Kannada-Tulu parallel corpus is constructed in two ways: i) Manual Translation and ii) Automatic Text Generation (ATG). Various encoderdecoder based NMT approaches, including Recurrent Neural Network (RNN), Bidirectional RNN (BiRNN), and transformer-based architectures, trained with Gated Recurrent Units (GRU) and Long Short Term Memory (LSTM) units, are explored as baseline models for Kannada to Tulu (Kan-Tul) and Tulu to Kannada (Kan-Tul) sentence-level translations. Additionally, the study explores sub-word tokenization techniques for Kannada-Tulu language pairs, and the performances of these NMT models are evaluated using Character n-gram Fscore (CHRF) and Bilingual Evaluation Understudy (BLEU) scores. Among the baselines, the transformer-based models outperformed other models with BLEU scores of 0.241 and 0.341 and CHRF scores of 0.502 and 0.598 for KanTul and Kan-Tul sentence-level translations, respectively.

pdf bib abs

MUCS@LT-EDI2023: Detecting Signs of Depression in Social Media Text
Sharal Coelho | Asha Hegde | Kavya G | Hosahalli Lakshmaiah Shashirekha
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

Depression can lead to significant changes in individuals’ posts on social media which is a important task to identify. Automated techniques must be created for the identification task as manually analyzing the growing volume of social media data is time-consuming. To address the signs of depression posts on social media, in this paper, we - team MUCS, describe a Transfer Learning (TL) model and Machine Learning (ML) models submitted to “Detecting Signs of Depression from Social Media Text” shared task organised by DepSign-LT-EDI@RANLP-2023. The TL model is trained using raw text Bidirectional Encoder Representations from Transformers (BERT) and the ML model is trained using Term Frequency-Inverse Document Frequency (TF-IDF) features separately. Among these three models, the TL model performed better with a macro averaged F1-score of 0.361 and placed 20th rank in the shared task.

pdf bib abs

MUCS@LT-EDI2023: Learning Approaches for Hope Speech Detection in Social Media Text
Asha Hegde | Kavya G | Sharal Coelho | Hosahalli Lakshmaiah Shashirekha
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

Hope plays a significant role in shaping human thoughts and actions and hope content has received limited attention in the realm of social media data analysis. The exploration of hope content helps to uncover the valuable insights into users’ aspirations, expectations, and emotional states. By delving into the analysis of hope content on social media platforms, researchers and analysts can gain a deeper understanding of how hope influences individuals’ behaviors, decisions, and overall well-being in the digital age. However, this area is rarely explored even for resource-high languages. To address the identification of hope text in social media platforms, this paper describes the models submitted by the team MUCS to “Hope Speech Detection for Equality, Diversity, and Inclusion (LT-EDI)” shared task organized at Recent Advances in Natural Language Processing (RANLP) - 2023. This shared task aims to classify a comment/post in English and code-mixed texts in three languages, namely, Bulgarian, Spanish, and Hindi into one of the two predefined categories, namely, “Hope speech” and “Non Hope speech”. Two models, namely: i) Hope_BERT - Linear Support Vector Classifier (LinearSVC) model trained by combining Bidirectional Encoder Representations from Transformers (BERT) embeddings and Term Frequency-Inverse Document Frequency (TF-IDF) of character n-grams with word boundary (char_wb) for English and ii) Hope_mBERT - LinearSVC model trained by combining Multilingual BERT (mBERT) embeddings and TF-IDF of char_wb for Bulgarian, Spanish, and Hindi code-mixed texts are proposed for the shared task to classify the given text into Hope or Non-Hope categories. The proposed models obtained 1st, 1st, 2nd, and 5th ranks for Spanish, Bulgarian, Hindi, and English texts respectively.

pdf bib abs

MUCS@LT-EDI2023: Homophobic/Transphobic Content Detection in Social Media Text using mBERT
Asha Hegde | Kavya G | Sharal Coelho | Hosahalli Lakshmaiah Shashirekha
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

Homophobic/Transphobic (H/T) content includes hate speech, discrimination text, and abusive comments against Gay, Lesbian, Bisexual, Transgender, Queer, and Intersex (LGBTQ) individuals. With the increase in user generated text in social media, there has been an increase in code-mixed H/T content, which poses challenges for efficient analysis and detection of H/T content on social media. The complex nature of code-mixed text necessitates the development of advanced tools and techniques to effectively tackle this issue in social media platforms. To tackle this issue, in this paper, we - team MUCS, describe the transformer based models submitted to “Homophobia/Transphobia Detection in social media comments” shared task in Language Technology for Equality, Diversity and Inclusion (LT-EDI) at Recent Advances in Natural Language Processing (RANLP)-2023. The proposed methodology makes use of resampling the training data to handle the data imbalance and this resampled data is used to fine-tune the Multilingual Bidirectional Encoder Representations from Transformers (mBERT) models. These models obtained 11th, 5th, 3rd, 3rd, and 7th ranks for English, Tamil, Malayalam, Spanish, and Hindi respectively in Task A and 8th, 2nd, and 2nd ranks for English, Tamil, and Malayalam respectively in Task B.

pdf bib abs

MUCS@DravidianLangTech2023: Leveraging Learning Models to Identify Abusive Comments in Code-mixed Dravidian Languages
Asha Hegde | Kavya G | Sharal Coelho | Hosahalli Lakshmaiah Shashirekha
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Abusive language detection in user-generated online content has become a pressing concern due to its negative impact on users and challenges for policy makers. Online platforms are faced with the task of moderating abusive content to mitigate societal harm, adhere to legal requirements, and foster inclusivity. Despite numerous methods developed for automated detection of abusive language, the problem continues to persist. This ongoing challenge necessitates further research and development to enhance the effectiveness of abusive content detection systems and implement proactive measures to create safer and more respectful online spaces. To address the automatic detection of abusive languages in social media platforms, this paper describes the models submitted by our team - MUCS to the shared task “Abusive Comment Detection in Tamil and Telugu” at DravidianLangTech - in Recent Advances in Natural Language Processing (RANLP) 2023. This shared task addresses the abusive comment detection in code-mixed Tamil, Telugu, and romanized Tamil (Tamil-English) texts. Two distinct models: i) AbusiveML - a model implemented utilizing Linear Support Vector Classifier (LinearSVC) algorithm fed with n-grams of words and character sequences within word boundary (char_wb) features and ii) AbusiveTL - a Transfer Learning (TL ) model with three different Bidirectional Encoder Representations from Transformers (BERT) models along with random oversampling to deal with data imbalance, are submitted to the shared task for detecting abusive language in the given code-mixed texts. The AbusiveTL model fared well among these two models, with macro F1 scores of 0.46, 0.74, and 0.49 for code-mixed Tamil, Telugu, and Tamil-English texts respectively.

pdf bib abs

MUCS@DravidianLangTech2023: Malayalam Fake News Detection Using Machine Learning Approach
Sharal Coelho | Asha Hegde | Kavya G | Hosahalli Lakshmaiah Shashirekha
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Social media is widely used to spread fake news, which affects a larger population. So it is considered as a very important task to detect fake news spread on social media platforms. To address the challenges in the identification of fake news in the Malayalam language, in this paper, we - team MUCS, describe the Machine Learning (ML) models submitted to “Fake News Detection in Dravidian Languages” at DravidianLangTech@RANLP 2023 shared task. Three different models, namely, Multinomial Naive Bayes (MNB), Logistic Regression (LR), and Ensemble model (MNB, LR, and SVM) are trained using Term Frequency - Inverse Document Frequency (TF-IDF) of word unigrams. Among the three models ensemble model performed better with a macro F1-score of 0.83 and placed 3rd rank in the shared task.

2022

pdf bib abs

MUCS@MixMT: IndicTrans-based Machine Translation for Hinglish Text
Asha Hegde | Hosahalli Lakshmaiah Shashirekha
Proceedings of the Seventh Conference on Machine Translation (WMT)

Code-mixing is the phenomena of mixing various linguistic units such as paragraphs, sentences, phrases, words, etc., of one language with that of the other language in any text. This code-mixing is predominantly used by social media users who know more than one language. Processing code-mixed text is challenging because of its characteristics and lack of tools that supports such data. Further, pretrained models can be used for the formal text and not for the informal text such as code-mixed. Developing efficient Machine Translation (MT) systems for code-mixed text is challenging due to lack of code-mixed training data. Further, existing MT systems developed to translate monolingual data are not portable to translate code-mixed text mainly due to its informal nature. To address the MT challenges of code-mixed text, this paper describes the proposed MT models submitted by our team MUCS, to the Code-mixed Machine Translation (MixMT) shared task in the Workshop on Machine Translation (WMT) organized in connection with Empirical models in Natural Language Processing (EMNLP) 2022. This shared has two subtasks: i) subtask 1 - to translate English sentences and their corresponding Hindi translations into Hinglish text and ii) subtask 2 - to translate Hinglish text into English text. The proposed models that translate the code-mixed English text to Hinglish (English-Hindli code-mixed text) and vice-versa, comprises of i) transliterating Hinglish text from Latin to Devanagari script and vice-versa, ii) pseudo translation generation using existing models, and iii) efficient target generation by combining the pseudo translations along with the training data provided by the shared task organizers. The proposed models obtained 5^th and 3^rd rank with Recall-Oriented Under-study for Gisting Evaluation (ROUGE) scores of 0.35806 and 0.55453 for subtask 1 and subtask 2 respectively.

pdf bib abs

Corpus Creation for Sentiment Analysis in Code-Mixed Tulu Text
Asha Hegde | Mudoor Devadas Anusha | Sharal Coelho | Hosahalli Lakshmaiah Shashirekha | Bharathi Raja Chakravarthi
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

Sentiment Analysis (SA) employing code-mixed data from social media helps in getting insights to the data and decision making for various applications. One such application is to analyze users’ emotions from comments of videos on YouTube. Social media comments do not adhere to the grammatical norms of any language and they often comprise a mix of languages and scripts. The lack of annotated code-mixed data for SA in a low-resource language like Tulu makes the SA a challenging task. To address the lack of annotated code-mixed Tulu data for SA, a gold standard trlingual code-mixed Tulu annotated corpus of 7,171 YouTube comments is created. Further, Machine Learning (ML) algorithms are employed as baseline models to evaluate the developed dataset and the performance of the ML algorithms are found to be encouraging.

2021

pdf bib abs

MUCIC at ComMA@ICON: Multilingual Gender Biased and Communal Language Identification Using N-grams and Multilingual Sentence Encoders
Fazlourrahman Balouchzahi | Oxana Vitman | Hosahalli Lakshmaiah Shashirekha | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification

Social media analytics are widely being explored by researchers for various applications. Prominent among them are identifying and blocking abusive contents especially targeting individuals and communities, for various reasons. The increasing abusive contents and the increasing number of users on social media demands automated tools to detect and filter the abusive contents as it is highly impossible to handle this manually. To address the challenges of detecting abusive contents, this paper describes the approaches proposed by our team MUCIC for Multilingual Gender Biased and Communal Language Identification shared task (ComMA@ICON) at International Conference on Natural Language Processing (ICON) 2021. This shared task dataset consists of code-mixed multi-script texts in Meitei, Bangla, Hindi as well as in Multilingual (a combination of Meitei, Bangla, Hindi, and English). The shared task is modeled as a multi-label Text Classification (TC) task combining word and char n-grams with vectors obtained from Multilingual Sentence Encoder (MSE) to train the Machine Learning (ML) classifiers using Pre-aggregation and Post-aggregation of labels. These approaches obtained the highest performance in the shared task for Meitei, Bangla, and Multilingual texts with instance-F1 scores of 0.350, 0.412, and 0.380 respectively using Pre-aggregation of labels.

pdf bib abs

MUM at ComMA@ICON: Multilingual Gender Biased and Communal Language Identification Using Supervised Learning Approaches
Asha Hegde | Mudoor Devadas Anusha | Sharal Coelho | Hosahalli Lakshmaiah Shashirekha
Proceedings of the 18th International Conference on Natural Language Processing: Shared Task on Multilingual Gender Biased and Communal Language Identification

Due to the rapid rise of social networks and micro-blogging websites, communication between people from different religion, caste, creed, cultural and psychological backgrounds has become more direct leading to the increase in cyber conflicts between people. This in turn has given rise to more and more hate speech and usage of abusive words to the point that it has become a serious problem creating negative impacts on the society. As a result, it is imperative to identify and filter such content on social media to prevent its further spread and the damage it is going to cause. Further, filtering such huge data requires automated tools since doing it manually is labor intensive and error prone. Added to this is the complex code-mixed and multi-scripted nature of social media text. To address the challenges of abusive content detection on social media, in this paper, we, team MUM, propose Machine Learning (ML) and Deep Learning (DL) models submitted to Multilingual Gender Biased and Communal Language Identification (ComMA@ICON) shared task at International Conference on Natural Language Processing (ICON) 2021. Word uni-grams, char n-grams, and emoji vectors are combined as features to train a ML Elastic-net regression model and multi-lingual Bidirectional Encoder Representations from Transformers (mBERT) is fine-tuned for a DL model. Out of the two, fine-tuned mBERT model performed better with an instance-F1 score of 0.326, 0.390, 0.343, 0.359 for Meitei, Bangla, Hindi, Multilingual texts respectively.

Venues

WMT1

Fix author