H. L. Shashirekha

Also published as: H.l Shashirekha, H L Shashirekha, H.l. Shashirekha, H. L Shashirekha

2024

Transformer-driven Multi-task Learning for Fake and Hateful Content Detection
Asha Hegde | H L Shashirekha
Proceedings of the 21st International Conference on Natural Language Processing (ICON): Shared Task on Decoding Fake Narratives in Spreading Hateful Stories (Faux-Hate)

Social media has revolutionized communica-tion these days in addition to facilitating thespread of fake and hate content. While fakecontent is the manipulation of facts by disin-formation, hate content is textual violence ordiscrimination targeting a group or an individ-ual. Fake narratives have the potential to spreadhate content making people aggressive or hurt-ing the sentiments of an individual or a group.Further, false narratives often dominate discus-sions on sensitive topics, amplifying harmfulmessages contributing to the rise of hate speech.Hence, understanding the relationship betweenhate speech driven by fake narratives is cru-cial in this digital age making it necessary todevelop automatic tools to identify fake andhate content. In this direction, Decoding FakeNarratives in Spreading Hateful Stories (Faux-Hate) - a shared task organized at the Inter-national Conference on Natural Language Pro-cessing (ICON) 2024, invites researchers totackle both fake and hate detection in socialmedia comments, with additional emphasis onidentifying the target and severity of hatefulspeech. The shared task consists of two sub-tasks - Task A (Identifying fake and hate con-tent) and Task B (Identifying the target andseverity of hateful speech). In this paper, we -team MUCS, describe the models proposed toaddress the challenges of this shared task. Wepropose two models: i) Hing_MTL - a Multi-task Learning (MTL) model implemented us-ing pre-trained Hinglish Bidirectional EncoderRepresentations from Transformers (Hinglish-BERT), and ii) Ensemble_MTL - a MTL modelimplemented by ensembling two pre-trainedmodels (HinglishBERT, and Multilingual Dis-tiled version of BERT (MDistilBERT)), to de-tect fake and hate content and identify the targetand severity of hateful speech. Ensemble_MTLmodel outperformed Hing_MTL model withmacro F1 scores of 0.7589 and 0.5746 for TaskA and Task B respectively, securing 6th placein both subtasks.

pdf bib abs

Misogynistic memes are a category of memes which contain disrespectful language targeting women on social media platforms. Hence, detecting such memes is necessary in order to maintain a healthy social media environment. To address the challenges of detecting misogynistic memes, “Multitask Meme classification - Unraveling Misogynistic and Trolls in Online Memes: LT-EDI@EACL 2024” shared task organized at European Chapter of the Association for Computational Linguistics (EACL) 2024, invites researchers to develop models to detect misogynistic memes in Tamil and Malayalam. The shared task has two subtasks, and in this paper, we - team MUCS, describe the learning models submitted to Task 1 - Identification of Misogynistic Memes in Tamil and Malayalam. As memes represent multi-modal data of image and text, three models: i) Bidirectional Encoder Representations from Transformers (BERT)+Residual Network (ResNet)-50, ii) Multilingual Representations for Indian Languages (MuRIL)+ResNet-50, and iii) multilingual BERT (mBERT)+ResNet50, are proposed based on joint representation of text and image, for detecting misogynistic memes in Tamil and Malayalam. Among the proposed models, mBERT+ResNet-50 and MuRIL+ ResNet-50 models obtained macro F1 scores of 0.73 and 0.87 for Tamil and Malayalam datasets respectively securing 1st rank for both the datasets in the shared task.

pdf bib abs

MUCS@DravidianLangTech-2024: Role of Learning Approaches in Strengthening Hate-Alert Systems for code-mixed text
Manavi K K | Sonali | Gauthamraj | Kavya G | Asha Hegde | H L Shashirekha
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

Hate and offensive language detection is the task of detecting hate and/or offensive content targetting a person or a group of people. Despite many efforts to detect hate and offensive content on social media platforms, the problem remains unsolved till date due to the ever growing social media users and their creativity to create and spread hate and offensive content. To address the automatic detection of hate and offensive content on social media platforms, this paper describes the learning models submitted by our team - MUCS to “Hate and Offensive Language Detection in Telugu Codemixed Text (HOLD-Telugu): DravidianLangTech@EACL” - a shared task organized at European Chapter of the Association for Computational Linguistics (EACL) 2024 invites the research community to address the challenges of detecting hate and offensive language in Telugu language. In this paper, we - team MUCS, describe the learning models submitted to the above mentioned shared task. Three models: Three models: i) LR model - a Machine Learning (ML) algorithm fed with TF-IDF of n-grams of subword, word and char_wb are in the range (1, 3), (1, 3), and (1, 5), ii) TL- a pretrained BERT models which makes use of Hate-speech-CNERG/bert-base-uncased-hatexplain model and iii) Ensemble model which is the combination of ML classifieres( MNB, LR, GNB) trained CountVectorizer with word and char ngrams of range (1, 3) and (1, 5) respectively. Proposed LR model trained with TF-IDF of subword, word and char n-grams outperformed the other models with macro F1 scores of 0.6501 securing 15th rankin the shared task for Telugu text.

pdf bib abs

MUCS@LT-EDI-2024: Learning Approaches to Empower Homophobic/Transphobic Comment Identification
Sonali | Nethravathi Gidnakanala | Raksha G | Kavya G | Asha Hegde | H L Shashirekha
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Homophobic/Transphobic (H/T) content includes hatred and discriminatory comments directed at Lesbian, Gay, Bisexual, Transgender, Queer (LGBTQ) individuals on social media platforms. As this unfavourable perception towards LGBTQ individuals may affect them physically and mentally, it is necessary to detect H/T content on social media. This demands automated tools to identify and address H/T content. In view of this, in this paper, we - team MUCS describe the learning models submitted to “Homophobia/Transphobia Detection in social media comments:LT-EDI@EACL 2024” shared task at European Chapter of the Association for Computational Linguistics (EACL) 2024. The learning models: i) Homo_Ensemble - an ensemble of Machine Learning (ML) algorithms trained with Term Frequency-Inverse Document Frequency (TFIDF) of syllable n-grams in the range (1, 3), ii) Homo_TL - a model based on Transfer Learning (TL) approach with Bidirectional Encoder Representations from Transformers (BERT) models, iii) Homo_probfuse - an ensemble of ML classifiers with soft voting trained using sentence embeddings (except for Hindi), and iv) Homo_FSL - Few-Shot Learning (FSL) models using Sentence Transformer (ST) (only for Tulu), are proposed to detect H/T content in the given languages. Among the models submitted to the shared task, the models that performed better for each language include: i) Homo_Ensemble model obtained macro F1 score of 0.95 securing 4th rank for Telugu language, ii) Homo_TL model obtained macro F1 scores of 0.49, 0.53, 0.45, 0.94, and 0.95 securing 2nd, 2nd, 1st, 1st, and 4th ranks for English, Marathi, Hindi, Kannada, and Gujarathi languages, respectively, iii) Homo_probfuse model obtained macro F1 scores of 0.86, 0.87, and 0.53 securing 2nd, 6th, and 2nd ranks for Tamil, Malayalam, and Spanish languages respectively, and iv) Homo_FSL model obtained a macro F1 score of 0.62 securing 2nd rank for Tulu dataset.

2023

pdf bib abs

MUCS@DravidianLangTech2023: Sentiment Analysis in Code-mixed Tamil and Tulu Texts using fastText
Rachana K | Prajnashree M | Asha Hegde | H. L Shashirekha
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

Sentiment Analysis (SA) is a field of computational study that focuses on analyzing and understanding people’s opinions, attitudes, and emotions towards an entity. An entity could be an individual, an event, a topic, a product etc., which is most likely to be covered by reviews and such reviews can be found in abundance on social media platforms. The increase in the number of social media users and the growing amount of user-generated code-mixed content such as reviews, comments, posts etc., on social media have resulted in a rising demand for efficient tools capable of effectively analyzing such content to detect the sentiments. However, SA of social media text is challenging due to the complex nature of the code-mixed text. To tackle this issue, in this paper, we team MUCS, describe learning models submitted to “Sentiment Analysis in Tamil and Tulu” -DravidianLangTech@Recent Advances In Natural Language Processing (RANLP) 2023. Using fastText embeddings to train the Machine Learning (ML) models to perform SA in code-mixed Tamil and Tulu texts, the proposed methodology exhibited F1 scores of 0.14 and 0.204 securing 13th and 15th rank for Tamil and Tulu texts respectively.

2022

pdf bib abs

Overview of CoLI-Kanglish: Word Level Language Identification in Code-mixed Kannada-English Texts at ICON 2022
F. Balouchzahi | S. Butt | A. Hegde | N. Ashraf | H.l. Shashirekha | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts

The task of Language Identification (LI) in text processing refers to automatically identifying the languages used in a text document. LI task is usually been studied at the document level and often in high-resource languages while giving less importance to low-resource languages. However, with the recent advance- ment in technologies, in a multilingual country like India, many low-resource language users post their comments using English and one or more language(s) in the form of code-mixed texts. Combination of Kannada and English is one such code-mixed text of mixing Kannada and English languages at various levels. To address the word level LI in code-mixed text, in CoLI-Kanglish shared task, we have focused on open-sourcing a Kannada-English code-mixed dataset for word level LI of Kannada, English and mixed-language words written in Roman script. The task includes classifying each word in the given text into one of six predefined categories, namely: Kannada (kn), English (en), Kannada-English (kn-en), Name (name), Lo-cation (location), and Other (other). Among the models submitted by all the participants, the best performing model obtained averaged-weighted and averaged-macro F1 scores of 0.86 and 0.62 respectively.

2021

pdf bib abs

MUCS@DravidianLangTech-EACL2021:COOLI-Code-Mixing Offensive Language Identification
Fazlourrahman Balouchzahi | Aparna B K | H L Shashirekha
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

This paper describes the models submitted by the team MUCS for Offensive Language Identification in Dravidian Languages-EACL 2021 shared task that aims at identifying and classifying code-mixed texts of three language pairs namely, Kannada-English (Kn-En), Malayalam-English (Ma-En), and Tamil-English (Ta-En) into six predefined categories (5 categories in Ma-En language pair). Two models, namely, COOLI-Ensemble and COOLI-Keras are trained with the char sequences extracted from the sentences combined with words as features. Out of the two proposed models, COOLI-Ensemble model (best among our models) obtained first rank for Ma-En language pair with 0.97 weighted F1-score and fourth and sixth ranks with 0.75 and 0.69 weighted F1-score for Ta-En and Kn-En language pairs respectively.

pdf bib abs

LA-SACo: A Study of Learning Approaches for Sentiments Analysis inCode-Mixing Texts
Fazlourrahman Balouchzahi | H L Shashirekha
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Substantial amount of text data which is increasingly being generated and shared on the internet and social media every second affect the society positively or negatively almost in any aspect of online world and also business and industries. Sentiments/opinions/reviews’ of users posted on social media are the valuable information that have motivated researchers to analyze them to get better insight and feedbacks about any product such as a video in Instagram, a movie in Netflix, or even new brand car introduced by BMW. Sentiments are usually written using a combination of languages such as English which is resource rich and regional languages such as Tamil, Kannada, Malayalam, etc. which are resource poor. However, due to technical constraints, many users prefer to pen their opinions in Roman script. These kinds of texts written in two or more languages using a common language script or different language scripts are called code-mixing texts. Code-mixed texts are increasing day-by-day with the increase in the number of users depending on various online platforms. Analyzing such texts pose a real challenge for the researchers. In view of the challenges posed by the code-mixed texts, this paper describes three proposed models namely, SACo-Ensemble, SACo-Keras, and SACo-ULMFiT using Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL) approaches respectively for the task of Sentiments Analysis in Tamil-English and Malayalam-English code-mixed texts.

pdf bib abs

MUCS@LT-EDI-EACL2021:CoHope-Hope Speech Detection for Equality, Diversity, and Inclusion in Code-Mixed Texts
Fazlourrahman Balouchzahi | Aparna B K | H L Shashirekha
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

This paper describes the models submitted by the team MUCS for “Hope Speech Detection for Equality, Diversity, and Inclusion-EACL 2021” shared task that aims at classifying a comment / post in English and code-mixed texts in two language pairs, namely, Tamil-English (Ta-En) and Malayalam-English (Ma-En) into one of the three predefined categories, namely, “Hope_speech”, “Non_hope_speech”, and “other_languages”. Three models namely, CoHope-ML, CoHope-NN, and CoHope-TL based on Ensemble of classifiers, Keras Neural Network (NN) and BiLSTM with Conv1d model respectively are proposed for the shared task. CoHope-ML, CoHope-NN models are trained on a feature set comprised of char sequences extracted from sentences combined with words for Ma-En and Ta-En code-mixed texts and a combination of word and char ngrams along with syntactic word ngrams for English text. CoHope-TL model consists of three major parts: training tokenizer, BERT Language Model (LM) training and then using pre-trained BERT LM as weights in BiLSTM-Conv1d model. Out of three proposed models, CoHope-ML model (best among our models) obtained 1st, 2nd, and 3rd ranks with weighted F1-scores of 0.85, 0.92, and 0.59 for Ma-En, English and Ta-En texts respectively.

pdf bib abs

MUCS@ - Machine Translation for Dravidian Languages using Stacked Long Short Term Memory
Asha Hegde | Ibrahim Gashaw | H. L. Shashirekha
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Dravidian language family is one of the largest language families in the world. In spite of its uniqueness, Dravidian languages have gained very less attention due to scarcity of resources to conduct language technology tasks such as translation, Parts-of-Speech tagging, Word Sense Disambiguation etc,. In this paper, we, team MUCS, describe sequence-to-sequence stacked Long Short Term Memory (LSTM) based Neural Machine Translation (NMT) model submitted to “Machine Translation in Dravidian languages”, a shared task organized by EACL-2021. The NMT model was applied on translation using English-Tamil, EnglishTelugu, English-Malayalam and Tamil-Telugu corpora provided by the organizers. Standard evaluation metrics namely Bilingual Evaluation Understudy (BLEU) and human evaluations are used to evaluate the model. Our models exhibited good accuracy for all the language pairs and obtained 2nd rank for TamilTelugu language pair.

2020

pdf bib abs

MUCS@Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation
Asha Hegde | H.l. Shashirekha
Proceedings of the 17th International Conference on Natural Language Processing (ICON): Adap-MT 2020 Shared Task

Machine Translation (MT) is the task of automatically converting the text in source language to text in target language by preserving the meaning. MT usually require large corpus for training the translation models. Due to scarcity of resources very less attention is given to translating into low resource languages and in particular into Indic languages. In this direction, a shared task called “Adap-MT 2020: Low Resource Domain Adaptation for Indic Machine Translation” is organized to illustrate the capability of general domain MT when translating into Indic languages and low resource domain adaptation of MT systems. In this paper, we, team MUCS, describe a simple word extraction based domain adaptation approach applied to English-Hindi MT only. MT in the proposed model is carried out using Open-NMT - a popular Neural Machine Translation tool. A general domain corpus is built effectively combining the available English-Hindi corpora and removing the duplicate sentences. Further, domain specific corpus is updated by extracting the sentences from generic corpus that contains the words given in the domain specific corpus. The proposed model exhibited satisfactory results for small domain specific AI and CHE corpora provided by the organizers in terms of BLEU score with 1.25 and 2.72 respectively. Further, this methodology is quite generic and can easily be extended to other low resource language pairs as well.

pdf bib abs

MUCS@TechDOfication using FineTuned Vectors and n-grams
Fazlourrahman Balouchzahi | M D Anusha | H L Shashirekha
Proceedings of the 17th International Conference on Natural Language Processing (ICON): TechDOfication 2020 Shared Task

The increase in domain specific text processing applications are demanding tools and techniques for domain specific Text Classification (TC) which may be helpful in many downstream applications like Machine Translation, Summarization, Question Answering etc. Further, many TC algorithms are applied on globally recognized languages like English giving less importance for local languages particularly Indian languages. To boost the research for technical domains and text processing activities in Indian languages, a shared task named ”TechDOfication2020” is organized by ICON’20. The objective of this shared task is to automatically identify the technical domain of a given text which provides information about coarse grained technical domains and fine grained subdomains in eight languages. To tackle this challenge we, team MUCS have proposed three models, namely, DL-FineTuned model applied for all subtasks, and VC-FineTuned and VC-ngrams models applied only for some subtasks. n-grams and word embedding with a step of fine-tuning are used as features and machine learning and deep learning algorithms are used as classifiers in the proposed models. The proposed models outperformed in most of subtasks and also obtained first rank in subTask1b (Bangla) and subTask1e (Malayalam) with f1 score of 0.8353 and 0.3851 respectively using DL-FineTuned model for both the subtasks.

2019

pdf bib abs

Language Modelling with NMT Query Translation for Amharic-Arabic Cross-Language Information Retrieval
Ibrahim Gashaw | H.l Shashirekha
Proceedings of the 16th International Conference on Natural Language Processing

This paper describes our first experiment on Neural Machine Translation (NMT) based query translation for Amharic-Arabic Cross-Language Information Retrieval (CLIR) task to retrieve relevant documents from Amharic and Arabic text collections in response to a query expressed in the Amharic language. We used a pre-trained NMT model to map a query in the source language into an equivalent query in the target language. The relevant documents are then retrieved using a Language Modeling (LM) based retrieval algorithm. Experiments are conducted on four conventional IR models, namely Uni-gram and Bi-gram LM, Probabilistic model, and Vector Space Model (VSM). The results obtained illustrate that the proposed Uni-gram LM outperforms all other models for both Amharic and Arabic language document collections.

H. L. Shashirekha

2024

2023

2022

2021

2020

2019

2018

2017

Co-authors

Venues