Bharathi B


2024

pdf bib
Findings of the Shared Task on Multimodal Social Media Data Analysis in Dravidian Languages (MSMDA-DL)@DravidianLangTech 2024
Premjith B | Jyothish G | Sowmya V | Bharathi Raja Chakravarthi | K Nandhini | Rajeswari Natarajan | Abirami Murugappan | Bharathi B | Saranya Rajiakodi | Rahul Ponnusamy | Jayanth Mohan | Mekapati Reddy
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

This paper presents the findings of the shared task on multimodal sentiment analysis, abusive language detection and hate speech detection in Dravidian languages. Through this shared task, researchers worldwide can submit models for three crucial social media data analysis challenges in Dravidian languages: sentiment analysis, abusive language detection, and hate speech detection. The aim is to build models for deriving fine-grained sentiment analysis from multimodal data in Tamil and Malayalam, identifying abusive and hate content from multimodal data in Tamil. Three modalities make up the multimodal data: text, audio, and video. YouTube videos were gathered to create the datasets for the tasks. Thirty-nine teams took part in the competition. However, only two teams, though, turned in their findings. The macro F1-score was used to assess the submissions

pdf bib
Wit Hub@DravidianLangTech-2024:Multimodal Social Media Data Analysis in Dravidian Languages using Machine Learning Models
Anierudh S | Abhishek R | Ashwin Sundar | Amrit Krishnan | Bharathi B
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

The main objective of the task is categorised into three subtasks. Subtask-1 Build models to determine the sentiment expressed in multimodal posts (or videos) in Tamil and Malayalam languages, leveraging textual, audio, and visual components. The videos are labelled into five categories: highly positive, positive, neutral, negative and highly negative. Subtask-2 Design machine models that effectively identify and classify abusive language within the multimodal context of social media posts in Tamil. The data are categorized into abusive and non-abusive categories. Subtask-3 Develop advanced models that accurately detect and categorize hate speech and offensive language in multimodal social media posts in Dravidian languages. The data points are categorized into Caste, Offensive, Racist and Sexist classes. In this session, the focus is primarily on Tamil language text data analysis. Various combination of machine learning models have been used to perform each tasks and do oversampling techniques to train models on biased dataset.

pdf bib
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion
Bharathi Raja Chakravarthi | Bharathi B | Paul Buitelaar | Thenmozhi Durairaj | György Kovács | Miguel Ángel García Cumbreras
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

pdf bib
Overview of the Third Shared Task on Speech Recognition for Vulnerable Individuals in Tamil
Bharathi B | Bharathi Raja Chakravarthi | Sripriya N | Rajeswari Natarajan | Suhasini S
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

The overview of the shared task on speech recognition for vulnerable individuals in Tamil (LT-EDI-2024) is described in this paper. The work comes with a Tamil dataset that was gath- ered from elderly individuals who identify as male, female, or transgender. The audio sam- ples were taken in public places such as marketplaces, vegetable shops, hospitals, etc. The training phase and the testing phase are when the dataset is made available. The task required of the participants was to handle audio signals using various models and techniques, and then turn in their results as transcriptions of the pro- vided test samples. The participant’s results were assessed using WER (Word Error Rate). The transformer-based approach was employed by the participants to achieve automatic voice recognition. This overview paper discusses the findings and various pre-trained transformer- based models that the participants employed.

pdf bib
SSN-Nova@LT-EDI 2024: Leveraging Vectorisation Techniques in an Ensemble Approach for Stress Identification in Low-Resource Languages
A Reddy | Ann Thomas | Pranav Moorthi | Bharathi B
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

This paper presents our submission for Shared task on Stress Identification in Dravidian Languages: StressIdent LT-EDI@EACL2024. The objective of this task is to identify stress levels in individuals based on their social media content. The system is tasked with analysing posts written in a code-mixed language of Tamil and Telugu and categorising them into two labels: “stressed” or “not stressed.” Our approach aimed to leverage feature extraction and juxtapose the performance of widely used traditional, deep learning and transformer models. Our research highlighted that building a pipeline with traditional classifiers proved to significantly improve their performance (0.98 and 0.93 F1-scores in Telugu and Tamil respectively), surpassing the baseline as well as deep learning and transformer models.

pdf bib
SSN-Nova@LT-EDI 2024: POS Tagging, Boosting Techniques and Voting Classifiers for Caste And Migration Hate Speech Detection
A Reddy | Ann Thomas | Pranav Moorthi | Bharathi B
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

This paper presents our submission for the shared task on Caste and Migration Hate Speech Detection: LT-EDI@EACL 20241 . This text classification task aims to foster the creation of models capable of identifying hate speech related to caste and migration. The dataset comprises social media comments, and the goal is to categorize them into negative and positive sentiments. Our approach explores back-translation for data augmentation to address sparse datasets in low-resource Dravidian languages. While Part-of-Speech (POS) tagging is valuable in natural language processing, our work highlights its ineffectiveness in Dravidian languages, with model performance drastically reducing from 0.73 to 0.67 on application. In analyzing boosting and ensemble methods, the voting classifier with traditional models outperforms others and the boosting techniques, underscoring the efficacy of simper models on low-resource data despite augmentation.

pdf bib
DRAVIDIAN LANGUAGE@ LT-EDI 2024:Pretrained Transformer based Automatic Speech Recognition system for Elderly People
Abirami. J | Aruna Devi. S | Dharunika Sasikumar | Bharathi B
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

In this paper, the main goal of the study is to create an automatic speech recognition (ASR) system that is tailored to the Tamil language. The dataset that was employed includes audio recordings that were obtained from vulnerable populations in the Tamil region, such as elderly men and women and transgender individuals. The pre-trained model Rajaram1996/wav2vec2- large-xlsr-53-tamil is used in the engineering of the ASR system. This existing model is finetuned using a variety of datasets that include typical Tamil voices. The system is then tested with a specific test dataset, and the transcriptions that are produced are sent in for assessment. The Word Error Rate is used to evaluate the system’s performance. Our system has a WER of 37.733.

pdf bib
Algorithm Alliance@LT-EDI-2024: Caste and Migration Hate Speech Detection
Saisandeep Sangeetham | Shreyamanisha Vinay | Kavin Rajan G | Abishna A | Bharathi B
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Caste and Migration speech refers to the use of language that distinguishes the offense, violence, and distress on their social, caste, and migration status. Here, caste hate speech targets the imbalance of an individual’s social status and focuses mainly on the degradation of their caste group. While the migration hate speech imposes the differences in nationality, culture, and individual status. These speeches are meant to affront the social status of these people. To detect this hate in the speech, our task on Caste and Migration Hate Speech Detection has been created which classifies human speech into genuine or stimulate categories. For this task, we used multiple classification models such as the train test split model to split the dataset into train and test data, Logistic regression, Support Vector Machine, MLP (multi-layer Perceptron) classifier, Random Forest classifier, KNN classifier, and Decision tree classification. Among these models, The SVM gave the highest macro average F1 score of 0.77 and the average accuracy for these models is around 0.75.

pdf bib
ASR TAMIL SSN@ LT-EDI-2024: Automatic Speech Recognition system for Elderly People
Suhasini S | Bharathi B
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

The results of the Shared Task on Speech Recognition for Vulnerable Individuals in Tamil (LT-EDI-2024) are discussed in this paper. The goal is to create an automated system for Tamil voice recognition. The older population that speaks Tamil is the source of the dataset used in this task. The proposed ASR system is designed with pre-trained model akashsivanandan/wav2vec2-large-xls-r300m-tamil-colab-final. The Tamil common speech dataset is utilized to fine-tune the pretrained model that powers our system. The suggested system receives the test data that was released from the task; transcriptions are then created for the test samples and delivered to the task. Word Error Rate (WER) is the evaluation statistic used to assess the provided result based on the task. Our Proposed system attained a WER of 29.297%.

2023

pdf bib
Findings of the Shared Task on Multimodal Abusive Language Detection and Sentiment Analysis in Tamil and Malayalam
Premjith B | Jyothish Lal G | Sowmya V | Bharathi Raja Chakravarthi | Rajeswari Natarajan | Nandhini K | Abirami Murugappan | Bharathi B | Kaushik M | Prasanth Sn | Aswin Raj R | Vijai Simmon S
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

This paper summarizes the shared task on multimodal abusive language detection and sentiment analysis in Dravidian languages as part of the third Workshop on Speech and Language Technologies for Dravidian Languages at RANLP 2023. This shared task provides a platform for researchers worldwide to submit their models on two crucial social media data analysis problems in Dravidian languages - abusive language detection and sentiment analysis. Abusive language detection identifies social media content with abusive information, whereas sentiment analysis refers to the problem of determining the sentiments expressed in a text. This task aims to build models for detecting abusive content and analyzing fine-grained sentiment from multimodal data in Tamil and Malayalam. The multimodal data consists of three modalities - video, audio and text. The datasets for both tasks were prepared by collecting videos from YouTube. Sixty teams participated in both tasks. However, only two teams submitted their results. The submissions were evaluated using macro F1-score.

pdf bib
NLP_SSN_CSE@DravidianLangTech: Fake News Detection in Dravidian Languages using Transformer Models
Varsha Balaji | Shahul Hameed T | Bharathi B
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

The proposed system procures a systematic workflow in fake news identification utilizing machine learning classification in order to recognize and distinguish between real and made-up news. Using the Natural Language Toolkit (NLTK), the procedure starts with data preprocessing, which includes operations like text cleaning, tokenization, and stemming. This guarantees that the data is translated into an analytically-ready format. The preprocessed data is subsequently supplied into transformer models like M-BERT, Albert, XLNET, and BERT. By utilizing their extensive training on substantial datasets to identify complex patterns and significant traits that discriminate between authentic and false news pieces, these transformer models excel at capturing contextual information. The most successful model among those used is M-BERT, which boasts an astounding F1 score of 0.74. This supports M-BERT’s supremacy over its competitors in the field of fake news identification, outperforming them in terms of performance. The program can draw more precise conclusions and more effectively counteract the spread of false information because of its comprehension of contextual nuance. Organizations and platforms can strengthen their fake news detection systems and their attempts to stop the spread of false information by utilizing M-BERT’s capabilities.

pdf bib
Overview of the Second Shared Task on Speech Recognition for Vulnerable Individuals in Tamil
Bharathi B | Bharathi Raja Chakravarthi | Subalalitha Cn | Sripriya Natarajan | Rajeswari Natarajan | S Suhasini | Swetha Valli
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

This paper manifest the overview of the shared task on Speech Recognition for Vulnerable individuals in Tamil(LT-EDI-ACL2023). Task is provided with an Tamil dataset, which is collected from elderly people of three different genders, male, female and transgender. The audio samples were recorded from the public locations like hospitals, markets, vegetable shop, etc. The dataset is released in two phase, training and testing phase. The partcipants were asked to use different models and methods to handle audio signals and submit the result as transcription of the test samples given. The result submitted by the participants was evaluated using WER (Word Error Rate). The participants used the transformer-based model for automatic speech recognition. The results and different pre-trained transformer based models used by the participants is discussed in this overview paper.

pdf bib
SANBAR@LT-EDI-2023:Automatic Speech Recognition: vulnerable old-aged and transgender people in Tamil
Saranya S | Bharathi B
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

An Automatic Speech Recognition systems for Tamil are designed to convert spoken lan- guage or speech signals into written Tamil text. Seniors go to banks, clinics and authoritative workplaces to address their regular necessities. A lot of older people are not aware of the use of the facilities available in public places or office. They need a person to help them. Like- wise, transgender people are deprived of pri- mary education because of social stigma, so speaking is the only way to help them meet their needs. In order to build speech enabled systems, spontaneous speech data is collected from seniors and transgender people who are deprived of using these facilities for their own benefit. The proposed system is developed with pretraind models are IIT Madras transformer ASR model and akashsivanandan/wav2vec2- large-xls-r-300m-tamil model. Both pretrained models are used to evaluate the test speech ut- terances, and obtainted the WER as 37.7144% and 40.55% respectively.

pdf bib
ASR_SSN_CSE@LTEDI- 2023: Pretrained Transformer based Automatic Speech Recognition system for Elderly People
Suhasini S | Bharathi B
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

Submission of the paper for the result submitted in Shared Task on Speech Recognition for Vulnerable Individuals in Tamil- LT-EDI-2023. The task is to develop an automatic speech recognition system for Tamil language. The dataset provided in the task is collected from the elderly people who converse in Tamil language. The proposed ASR system is designed with pre-trained model. The pre-trained model used in our system is fine-tuned with Tamil common voice dataset. The test data released from the task is given to the proposed system, now the transcriptions are generated for the test samples and the generated transcriptions is submitted to the task. The result submitted is evaluated by task, the evaluation metric used is Word Error Rate (WER). Our Proposed system attained a WER of 39.8091%.

pdf bib
CSE_SPEECH@LT-EDI-2023Automatic Speech Recognition vulnerable old-aged and transgender people in Tamil
Varsha Balaji | Archana Jp | Bharathi B
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

This paper centers on utilizing Automatic Speech Recognition (ASR) for defenseless old-aged and transgender people in Tamil. The Amrrs/wav2vec2-large-xlsr-53-tamil show accomplishes a Word Error Rate (WER) of 40%. By leveraging this demonstration, ASR innovation upgrades availability and inclusivity, helping those with discourse impedances, hearing impedances, and cognitive inabilities. Assist refinements are vital to diminish error and move forward the client involvement. This inquiry emphasizes the significance of ASR, particularly the Amrrs/wav2vec2-large-xlsr-53-tamil show, in encouraging successful communication and availability for defenseless populaces in Tamil.

pdf bib
TERCET@LT-EDI-2023: Hope Speech Detection for Equality, Diversity, and Inclusion
Priyadharshini Thandavamurthi | Samyuktaa Sivakumar | Shwetha Sureshnathan | Thenmozhi D. | Bharathi B | Gayathri Gl
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

Hope is a cheerful and optimistic state of mind which has its basis in the expectation of positive outcomes. Hope speech reflects the same as they are positive words that can motivate and encourage a person to do better. Non-hope speech reflects the exact opposite. They are meant to ridicule or put down someone and affect the person negatively. The shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI - RANLP 2023 was created with data sets in English, Spanish, Bulgarian and Hindi. The purpose of this task is to classify human-generated comments on the platform, YouTube, as Hope speech or non-Hope speech. We employed multiple traditional models such as SVM (support vector machine), Random Forest classifier, Naive Bayes and Logistic Regression. Support Vector Machine gave the highest macro average F1 score of 0.49 for the training data set and a macro average F1 score of 0.50 for the test data set.

pdf bib
Tercet@LT-EDI-2023: Homophobia/Transphobia Detection in social media comment
Shwetha Sureshnathan | Samyuktaa Sivakumar | Priyadharshini Thandavamurthi | Thenmozhi D. | Bharathi B | Kiruthika Chandrasekaran
Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion

The advent of social media platforms has revo- lutionized the way we interact, share, learn , ex- press and build our views and ideas. One major challenge of social media is hate speech. Homo- phobia and transphobia encompasses a range of negative attitudes and feelings towards people based on their sexual orientation or gender iden- tity. Homophobia refers to the fear, hatred, or prejudice against homosexuality, while trans- phobia involves discrimination against trans- gender individuals. Natural Language Process- ing can be used to identify homophobic and transphobic texts and help make social media a safer place. In this paper, we explore us- ing Support Vector Machine , Random Forest Classifier and Bert Model for homophobia and transphobia detection. The best model was a combination of LaBSE and SVM that achieved a weighted F1 score of 0.95.

2022

pdf bib
PANDAS@TamilNLP-ACL2022: Emotion Analysis in Tamil Text using Language Agnostic Embeddings
Divyasri K | Gayathri G L | Krithika Swaminathan | Thenmozhi Durairaj | Bharathi B | Senthil Kumar B
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

As the world around us continues to become increasingly digital, it has been acknowledged that there is a growing need for emotion analysis of social media content. The task of identifying the emotion in a given text has many practical applications ranging from screening public health to business and management. In this paper, we propose a language-agnostic model that focuses on emotion analysis in Tamil text. Our experiments yielded an F1-score of 0.010.

pdf bib
PANDAS@Abusive Comment Detection in Tamil Code-Mixed Data Using Custom Embeddings with LaBSE
Gayathri G L | Krithika Swaminathan | Divyasri K | Thenmozhi Durairaj | Bharathi B
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Abusive language has lately been prevalent in comments on various social media platforms. The increasing hostility observed on the internet calls for the creation of a system that can identify and flag such acerbic content, to prevent conflict and mental distress. This task becomes more challenging when low-resource languages like Tamil, as well as the often-observed Tamil-English code-mixed text, are involved. The approach used in this paper for the classification model includes different methods of feature extraction and the use of traditional classifiers. We propose a novel method of combining language-agnostic sentence embeddings with the TF-IDF vector representation that uses a curated corpus of words as vocabulary, to create a custom embedding, which is then passed to an SVM classifier. Our experimentation yielded an accuracy of 52% and an F1-score of 0.54.

pdf bib
SSNCSE_NLP@TamilNLP-ACL2022: Transformer based approach for Emotion analysis in Tamil language
Bharathi B | Josephine Varsha
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Emotion analysis is the process of identifying and analyzing the underlying emotions expressed in textual data. Identifying emotions from a textual conversation is a challenging task due to the absence of gestures, vocal intonation, and facial expressions. Once the chatbots and messengers detect and report the emotions of the user, a comfortable conversation can be carried out with no misunderstandings. Our task is to categorize text into a predefined notion of emotion. In this thesis, it is required to classify text into several emotional labels depending on the task. We have adopted the transformer model approach to identify the emotions present in the text sequence. Our task is to identify whether a given comment contains emotion, and the emotion it stands for. The datasets were provided to us by the LT-EDI organizers (CITATION) for two tasks, in the Tamil language. We have evaluated the datasets using the pretrained transformer models and we have obtained the micro-averaged F1 scores as 0.19 and 0.12 for Task1 and Task 2 respectively.

pdf bib
SSNCSE NLP@TamilNLP-ACL2022: Transformer based approach for detection of abusive comment for Tamil language
Bharathi B | Josephine Varsha
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

Social media platforms along with many other public forums on the Internet have shown a significant rise in the cases of abusive behavior such as Misogynism, Misandry, Homophobia, and Cyberbullying. To tackle these concerns, technologies are being developed and applied, as it is a tedious and time-consuming task to identify, report and block these offenders. Our task was to automate the process of identifying abusive comments and classify them into appropriate categories. The datasets provided by the DravidianLangTech@ACL2022 organizers were a code-mixed form of Tamil text. We trained the datasets using pre-trained transformer models such as BERT,m-BERT, and XLNET and achieved a weighted average of F1 scores of 0.96 for Tamil-English code mixed text and 0.59 for Tamil text.

pdf bib
Findings of the Shared Task on Multimodal Sentiment Analysis and Troll Meme Classification in Dravidian Languages
Premjith B | Bharathi Raja Chakravarthi | Malliga Subramanian | Bharathi B | Soman Kp | Dhanalakshmi V | Sreelakshmi K | Arunaggiri Pandian | Prasanna Kumaresan
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages

This paper presents the findings of the shared task on Multimodal Sentiment Analysis and Troll meme classification in Dravidian languages held at ACL 2022. Multimodal sentiment analysis deals with the identification of sentiment from video. In addition to video data, the task requires the analysis of corresponding text and audio features for the classification of movie reviews into five classes. We created a dataset for this task in Malayalam and Tamil. The Troll meme classification task aims to classify multimodal Troll memes into two categories. This task assumes the analysis of both text and image features for making better predictions. The performance of the participating teams was analysed using the F1-score. Only one team submitted their results in the Multimodal Sentiment Analysis task, whereas we received six submissions in the Troll meme classification task. The only team that participated in the Multimodal Sentiment Analysis shared task obtained an F1-score of 0.24. In the Troll meme classification task, the winning team achieved an F1-score of 0.596.

pdf bib
SUH_ASR@LT-EDI-ACL2022: Transformer based Approach for Speech Recognition for Vulnerable Individuals in Tamil
Suhasini S | Bharathi B
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

An Automatic Speech Recognition System is developed for addressing the Tamil conversational speech data of the elderly people andtransgender. The speech corpus used in this system is collected from the people who adhere their communication in Tamil at some primary places like bank, hospital, vegetable markets. Our ASR system is designed with pre-trained model which is used to recognize the speechdata. WER(Word Error Rate) calculation is used to analyse the performance of the ASR system. This evaluation could help to make acomparison of utterances between the elderly people and others. Similarly, the comparison between the transgender and other people isalso done. Our proposed ASR system achieves the word error rate as 39.65%.

pdf bib
SSNCSE_NLP@LT-EDI-ACL2022:Hope Speech Detection for Equality, Diversity and Inclusion using sentence transformers
Bharathi B | Dhanya Srinivasan | Josephine Varsha | Thenmozhi Durairaj | Senthil Kumar B
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

In recent times, applications have been developed to regulate and control the spread of negativity and toxicity on online platforms. The world is filled with serious problems like political & religious conflicts, wars, pandemics, and offensive hate speech is the last thing we desire. Our task was to classify a text into ‘Hope Speech’ and ‘Non-Hope Speech’. We searched for datasets acquired from YouTube comments that offer support, reassurance, inspiration, and insight, and the ones that don’t. The datasets were provided to us by the LTEDI organizers in English, Tamil, Spanish, Kannada, and Malayalam. To successfully identify and classify them, we employed several machine learning transformer models such as m-BERT, MLNet, BERT, XLMRoberta, and XLM_MLM. The observed results indicate that the BERT and m-BERT have obtained the best results among all the other techniques, gaining a weighted F1- score of 0.92, 0.71, 0.76, 0.87, and 0.83 for English, Tamil, Spanish, Kannada, and Malayalam respectively. This paper depicts our work for the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion at LTEDI 2021.

pdf bib
SSNCSE_NLP@LT-EDI-ACL2022: Homophobia/Transphobia Detection in Multiple Languages using SVM Classifiers and BERT-based Transformers
Krithika Swaminathan | Bharathi B | Gayathri G L | Hrishik Sampath
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Over the years, there has been a slow but steady change in the attitude of society towards different kinds of sexuality. However, on social media platforms, where people have the license to be anonymous, toxic comments targeted at homosexuals, transgenders and the LGBTQ+ community are not uncommon. Detection of homophobic comments on social media can be useful in making the internet a safer place for everyone. For this task, we used a combination of word embeddings and SVM Classifiers as well as some BERT-based transformers. We achieved a weighted F1-score of 0.93 on the English dataset, 0.75 on the Tamil dataset and 0.87 on the Tamil-English Code-Mixed dataset.

pdf bib
SSNCSE_NLP@LT-EDI-ACL2022: Speech Recognition for Vulnerable Individuals in Tamil using pre-trained XLSR models
Dhanya Srinivasan | Bharathi B | Thenmozhi Durairaj | Senthil Kumar B
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Automatic speech recognition is a tool used to transform human speech into a written form. It is used in a variety of avenues, such as in voice commands, customer, service and more. It has emerged as an essential tool in the digitisation of daily life. It has been known to be of vital importance in making the lives of elderly and disabled people much easier. In this paper we describe an automatic speech recognition model, determined by using three pre-trained models, fine-tuned from the Facebook XLSR Wav2Vec2 model, which was trained using the Common Voice Dataset. The best model for speech recognition in Tamil is determined by finding the word error rate of the data. This work explains the submission made by SSNCSE_NLP in the shared task organized by LT-EDI at ACL 2022. A word error rate of 39.4512 is achieved.

pdf bib
Findings of the Shared Task on Speech Recognition for Vulnerable Individuals in Tamil
Bharathi B | Bharathi Raja Chakravarthi | Subalalitha Cn | Sripriya N | Arunaggiri Pandian | Swetha Valli
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

This paper illustrates the overview of the sharedtask on automatic speech recognition in the Tamillanguage. In the shared task, spontaneousTamil speech data gathered from elderly andtransgender people was given for recognitionand evaluation. These utterances were collected from people when they communicatedin the public locations such as hospitals, markets, vegetable shop, etc. The speech corpusincludes utterances of male, female, and transgender and was split into training and testingdata. The given task was evaluated using WER(Word Error Rate). The participants used thetransformer-based model for automatic speechrecognition. Different results using differentpre-trained transformer models are discussedin this overview paper.

2021

pdf bib
SSNCSE_NLP@DravidianLangTech-EACL2021: Offensive Language Identification on Multilingual Code Mixing Text
Bharathi B | Agnusimmaculate Silvia A
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Social networks made a huge impact in almost all fields in recent years. Text messaging through the Internet or cellular phones has become a major medium of personal and commercial communication. Everyday we have to deal with texts, emails or different types of messages in which there are a variety of attacks and abusive phrases. It is the moderator’s decision which comments to remove from the platform because of violations and which ones to keep but an automatic software for detecting abusive languages would be useful in recent days. In this paper we describe an automatic offensive language identification from Dravidian languages with various machine learning algorithms. This is work is shared task in DravidanLangTech-EACL2021. The goal of this task is to identify offensive language content of the code-mixed dataset of comments/posts in Dravidian Languages ( (Tamil-English, Malayalam-English, and Kannada-English)) collected from social media. This work explains the submissions made by SSNCSE_NLP in DravidanLangTech-EACL2021 Code-mix tasks for Offensive language detection. We achieve F1 scores of 0.95 for Malayalam, 0.7 for Kannada and 0.73 for task2-Tamil on the test-set.

pdf bib
SSNCSE_NLP@DravidianLangTech-EACL2021: Meme classification for Tamil using machine learning approach
Bharathi B | Agnusimmaculate Silvia A
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Social media are interactive platforms that facilitate the creation or sharing of information, ideas or other forms of expression among people. This exchange is not free from offensive, trolling or malicious contents targeting users or communities. One way of trolling is by making memes. A meme is an image or video that represents the thoughts and feelings of a specific audience. The challenge of dealing with memes is that they are region-specific and their meaning is often obscured in humour or sarcasm. A meme is a form of media that spreads an idea or emotion across the internet. The multi modal nature of memes, postings of hateful memes or related events like trolling, cyberbullying are increasing day by day. Memes make it even more challenging since they express humour and sarcasm in an implicit way, because of which the meme may not be offensive if we only consider the text or the image. In this paper we proposed a approach for meme classification for Tamil language that considers only the text present in the meme. This work explains the submissions made by SSNCSE NLP in DravidanLangTechEACL2021 task for meme classification in Tamil language. We achieve F1 scores of 0.50 using the proposed approach using the test-set.

2020

pdf bib
Semi-supervised Fine-grained Approach for Arabic dialect detection task
Nitin Nikamanth Appiah Balaji | Bharathi B
Proceedings of the Fifth Arabic Natural Language Processing Workshop

Arabic being a language with numerous different dialects, it becomes extremely important to device a technique to distinguish each dialect efficiently. This paper focuses on the fine-grained country level and province level classification of Arabic dialects. The experiments in this paper are submissions done to the NADI 2020 shared Dialect detection task. Various text feature extraction techniques such as TF-IDF, AraVec, multilingual BERT and Fasttext embedding models are studied. We thereby, propose an approach of text embedding based model with macro average F1 score of 0.2232 for task1 and 0.0483 for task2, with the help of semi supervised learning approach.