Anitha R

2024

Enhancing Trust and Interpretability in Malayalam Sentiment Analysis with Explainable AI
Anitha R | Rajeev R R | Meharuniza Nazeem | Navaneeth S
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

Explainable AI is increasingly used in natural language processing (NLP), especially for low-resource languages such as Malayalam. This study builds on our earlier research on sentiment analysis, which uses identified views to classify and understand the context. We applied two machine learning classifiers, Support Vector Machine (SVM) and Random Forest (RF), to sentiment analysis on the Kerala political opinion corpus. We construct feature vectors using Bag-of-Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) features. Our analysis shows that the Random Forest classifier outperforms SVM in both accuracy and efficiency, reaching an accuracy of 85.07%. Using Local Interpretable Model-Agnostic Explanations (LIME) as a foundation, we address the interpretability of text classification and sentiment analysis models. This integration increases user confidence and model adoption by offering concise and understandable justifications for model predictions. The study lays the groundwork for future developments in the area by demonstrating the significance of explainable AI in NLP for low-resource languages.
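
As an illustration of the BoW and TF-IDF feature construction mentioned in the abstract, here is a minimal sketch in plain Python. It is not the authors' pipeline: the whitespace tokenizer, the unsmoothed weighting tf * log(N / df), the function names, and the toy English documents are all assumptions made for the example, and common libraries use smoothed or normalized variants of this formula.

```python
import math
from collections import Counter


def bow_vectors(docs):
    """Return (vocab, count vectors) using whitespace tokenization (an assumption)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    # One row per document, one column per vocabulary term.
    return vocab, [[c[term] for term in vocab] for c in counts]


def tfidf_vectors(docs):
    """Weight raw counts by the unsmoothed inverse document frequency log(N / df)."""
    vocab, bow = bow_vectors(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    # Every vocabulary term occurs at least once, so df is never zero.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[row[j] * idf[j] for j in range(len(vocab))] for row in bow]


if __name__ == "__main__":
    # Hypothetical toy documents, not drawn from the corpus in the paper.
    docs = ["good good policy", "bad policy"]
    vocab, vectors = tfidf_vectors(docs)
    for doc, vec in zip(docs, vectors):
        print(doc, dict(zip(vocab, vec)))
```

A term that appears in every document, such as "policy" in the toy input, gets weight zero under this formula. That is the intended effect of the IDF factor: terms shared by all documents carry no discriminative signal for a classifier.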

Comprehensive Plagiarism Detection in Malayalam Texts Through Web and Database Integration
Meharuniza Nazeem | Parvathy Raj | Rajeev R. R | Anitha R | Navaneeth S
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

Plagiarism detection techniques have become essential for recognizing instances of plagiarism, particularly in academia, where scientific papers and documents are of prime importance. We propose an application that offers a comprehensive solution for detecting plagiarism in scholarly articles written in Malayalam, enabling users to submit texts, analyze them for plagiarism, and review the results interactively. With the increasing accessibility of digital content, maintaining originality in academic writing has become more difficult. Our research addresses this challenge by providing a solution tailored to the Malayalam language. The application aids researchers and academic institutions in detecting potential plagiarism by combining access to web-based content with algorithmic text analysis. The study contributes to the field of plagiarism detection for low-resource languages such as Malayalam and offers a practical way to preserve the originality of Malayalam scholarly work. The performance of four algorithms (SequenceMatcher, N-Grams, Rabin-Karp, and Cosine Similarity) is thoroughly evaluated. Cosine Similarity, with a 92.45% detection rate, outperformed the others, significantly surpassing Rabin-Karp (65.3%), N-Grams (58.7%), and SequenceMatcher (51.4%). Building on this result, a user-friendly web application was developed that integrates web search and database comparison features with the Cosine Similarity algorithm.
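
To make the four compared techniques concrete, the following is a minimal sketch of each in plain Python. It is not the authors' implementation: the function names, the whitespace tokenizer, the use of Jaccard overlap for word n-grams, and the hash parameters (`base`, `mod`) are assumptions chosen for illustration, and the abstract does not say how the reported detection rates were computed.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    # Whitespace tokenization (an assumption); real Malayalam text would
    # likely need language-aware preprocessing.
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    sq_a = sum(c * c for c in va.values())
    sq_b = sum(c * c for c in vb.values())
    return dot / math.sqrt(sq_a * sq_b) if sq_a and sq_b else 0.0


def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the sets of word n-grams of two texts."""
    ta, tb = tokenize(a), tokenize(b)
    ga = {tuple(ta[i:i + n]) for i in range(len(ta) - n + 1)}
    gb = {tuple(tb[i:i + n]) for i in range(len(tb) - n + 1)}
    return len(ga & gb) / len(ga | gb) if ga and gb else 0.0


def sequence_ratio(a, b):
    """Character-level similarity ratio from difflib.SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def rabin_karp_find(pattern, text, base=256, mod=1_000_003):
    """Index of the first occurrence of pattern in text via a rolling hash, else -1.

    An empty pattern is treated as not found in this sketch.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the leading character in the window
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + ord(pattern[i])) % mod
        ht = (ht * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # Compare the actual strings on a hash match to rule out collisions.
        if hp == ht and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Slide the window: drop text[i], append text[i + m].
            ht = ((ht - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the paper's data.
    source = "the committee approved the new education policy"
    suspect = "the committee approved the revised education policy"
    print(cosine_similarity(source, suspect))
    print(ngram_overlap(source, suspect))
    print(sequence_ratio(source, suspect))
    print(rabin_karp_find("approved the", suspect))
```

The four measures answer different questions. Cosine similarity ignores word order, n-gram overlap rewards shared phrases, SequenceMatcher compares character sequences, and Rabin-Karp only finds exact substring matches. That difference is one plausible reason their detection rates could diverge on paraphrased text.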

Open-Source OCR Libraries: A Comprehensive Study for Low Resource Language
Meharuniza Nazeem | Anitha R | Navaneeth S | Rajeev R. R
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

This paper reviews numerous OCR programs and libraries employed for optical character recognition tasks. Tesseract OCR, an open-source program that supports multiple languages and image formats, is highlighted for its accuracy and adaptability. Python-based libraries like EasyOCR, MMOCR, and PaddleOCR are also mentioned, which provide user-friendly interfaces and trained models for text extraction, detection, and recognition. EasyOCR emphasizes ease of use and simplicity, while MMOCR and PaddleOCR offer comprehensive OCR capabilities and support for a wide range of languages. According to our study, which evaluates various OCR libraries, Tesseract OCR performs remarkably well in terms of accuracy for Indian languages like Malayalam. We focused on five OCR libraries (Tesseract OCR, MMOCR, PaddleOCR, EasyOCR, and Keras OCR) and tested them across several languages, including English, Hindi, Arabic, Tamil, and Malayalam. During our comparison, we found that Tesseract OCR was the only library that supported the Malayalam language. While the other libraries did not support Malayalam, Tesseract OCR performed well across all tested languages, achieving accuracy rates of 92% in English, 93% in Hindi, 78% in Tamil, 74% in Arabic, and 93% in Malayalam.
