2024
Empowering SW Security: CodeBERT and Machine Learning Approaches to Vulnerability Detection
Lov Kumar | Vikram Singh | Srivalli Patel | Pratyush Mishra
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Software (SW) systems experience faults after deployment, raising concerns about reliability and leading to financial losses, reputational damage, and safety risks. This paper presents a novel approach that uses CodeBERT, a state-of-the-art neural code representation model pre-trained on multiple programming languages, and employs various code metrics to predict SW faults. The study comprehensively evaluates the trained models by analyzing a publicly available codebase and employing diverse machine learning models, feature selection techniques, and class balancing through SMOTE. The results show that SMOTE significantly enhances vulnerability detection performance, particularly in accuracy, AUC, sensitivity, and specificity. The EXTR classifier consistently outperforms the others, with an average AUC of 0.82, and the features selected using the GA feature selection technique achieve a mean AUC of 0.84. Interestingly, among the employed embedding techniques, SW metrics combined with CodeBERT (SMCBERT) stand out as the top performer, achieving the highest mean AUC score of 0.80, making models trained on SMCBERT the best for SW vulnerability prediction.
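The class-balancing step mentioned in the abstract above can be illustrated with a minimal, dependency-free sketch of SMOTE-style oversampling: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours. This is not the authors' implementation; the function name `smote_oversample`, the parameter `k`, and the toy data are assumptions made purely for illustration.

```python
import random

def smote_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy, made-up feature vectors: 6 majority samples vs. 3 minority samples.
majority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]]

# Generate just enough synthetic samples to match the majority class size.
new_points = smote_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In practice, oversampling of this kind is normally applied only to the training split, so that synthetic samples do not leak into evaluation data.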
Mocktails of Translation, Ensemble Learning and Embeddings to tackle Hinglish NLP challenges
Lov Kumar | Vikram Singh | Proksh | Pratyush Mishra
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
Social media has become a global platform where users express opinions on diverse contemporary topics, often blending dominant languages with native tongues, leading to code-mixed, context-rich content. A typical example is Hinglish, where Hindi elements are embedded in English texts. This linguistic mixture challenges traditional NLP systems, which rely on monolingual resources and struggle to process multilingual content. Sentiment analysis for code-mixed data, mainly involving Indian languages, remains largely unexplored. This paper introduces a novel approach for sentiment analysis of code-mixed Hinglish data, combining translation, different stacking classifier architectures, and embedding techniques. We utilize pre-trained LoRA weights of a fine-tuned Gemma-2B model to translate Hinglish into English, followed by four pre-trained meta-embeddings: GloVe-T, Word2Vec, TF-IDF, and fastText. SMOTE is applied to balance skewed data, and dimensionality reduction is performed before implementing machine learning models and stacking classifier ensembles. Three ensemble architectures, combining 22 base classifiers with a Logistic Regression meta-classifier, test different meta-embedding combinations. Experimental results show that the TF-W2V-FST (TF-IDF, Word2Vec, fastText) combination performs best, with the radial basis SVM achieving the highest accuracy (91.53%) and AUC (0.96). This research contributes a novel and effective technique to sentiment analysis for code-mixed data.
2022
BPHC@DravidianLangTech-ACL2022-A comparative analysis of classical and pre-trained models for troll meme classification in Tamil
Achyuta V | Mithun Kumar S R | Aruna Malapati | Lov Kumar
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages
Trolling refers to any user behaviour on the internet that intentionally provokes or instigates conflict, predominantly on social media. This paper aims to classify troll meme captions in Tamil-English code-mixed form. Embeddings are obtained for the raw code-mixed text and for the translated and transliterated versions of the text, and their relative performances are compared. Furthermore, this paper compares the performance of 11 different classification algorithms using accuracy and F1-score. We achieve a weighted F1 score of 0.74 using the MuRIL pretrained model.
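For readers unfamiliar with the metric reported above, the weighted F1 score is the per-class F1 averaged with weights proportional to each class's support (its number of true instances). The sketch below computes it from scratch; the labels `troll` and `not_troll` and the toy predictions are made up for illustration and are not data from the paper.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true instances of this class
        score += f1 * support / total
    return score

# Hypothetical labels: 3 troll captions and 5 non-troll captions.
y_true = ["troll", "troll", "troll",
          "not_troll", "not_troll", "not_troll", "not_troll", "not_troll"]
y_pred = ["troll", "troll", "not_troll",
          "not_troll", "not_troll", "not_troll", "troll", "not_troll"]

print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

Here the `troll` class has F1 = 2/3 with support 3 and the `not_troll` class has F1 = 0.8 with support 5, giving (3 × 2/3 + 5 × 0.8) / 8 = 0.75.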
Sentiment Analysis on Code-Switched Dravidian Languages with Kernel Based Extreme Learning Machines
Mithun Kumar S R | Lov Kumar | Aruna Malapati
Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages
Code-switching refers to textual or spoken data containing multiple languages. Applying natural language processing (NLP) tasks such as sentiment analysis to code-switched languages is harder because of irregularities in sentence structure and ordering. This paper presents experimental results from building kernel-based Extreme Learning Machines (ELM) for sentiment analysis of Dravidian languages code-switched with English. Our results show that ELM performs better than traditional machine learning classifiers on various metrics and trains faster than deep learning models. We also show that polynomial kernels perform better than other kernels in the ELM architecture. We achieve a median AUC of 0.79 with a polynomial kernel.
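A kernel-based ELM with a polynomial kernel, as described above, can be sketched using the standard closed-form solution: output weights β = (K + I/C)⁻¹ t, where K is the kernel matrix over the training samples and C is a regularization constant. The code below is a dependency-free illustration, not the authors' implementation; the kernel parameters (`degree`, `coef0`), the value of `C`, the ±1 label encoding, and the XOR-style toy data are all assumptions.

```python
def poly_kernel(a, b, degree=2, coef0=1.0):
    """Polynomial kernel (a . b + coef0) ** degree."""
    return (sum(x * y for x, y in zip(a, b)) + coef0) ** degree

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_kernel_elm(X, t, C=100.0):
    """Return output weights beta = (K + I/C)^-1 t."""
    n = len(X)
    K = [[poly_kernel(X[i], X[j]) + (1.0 / C if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, t)

def predict(X_train, beta, x):
    """Sign of the kernel expansion sum_i beta_i * k(x, x_i)."""
    score = sum(b * poly_kernel(x, xi) for b, xi in zip(beta, X_train))
    return 1 if score >= 0 else -1

# Toy XOR-style data with hypothetical sentiment labels (+1 / -1).
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
t = [-1.0, 1.0, 1.0, -1.0]

beta = fit_kernel_elm(X, t)
print([predict(X, beta, x) for x in X])
```

Because training reduces to a single linear solve rather than iterative gradient updates, this formulation is consistent with the abstract's observation that ELM trains faster than deep learning models.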
2021
Prediction of Video Game Development Problems Based on Postmortems using Different Word Embedding Techniques
Anirudh A | Aman RAJ Singh | Anjali Goyal | Lov Kumar | N L Bhanu Murthy
Proceedings of the 18th International Conference on Natural Language Processing (ICON)
The interactive entertainment industry has been actively involved in the development, marketing, and sale of video games over the past decade. The increasing interest in video games has led to a growing number of video game development techniques and methods. The industry has emerged as an immensely large sector and has now grown larger than the movie and music industries combined. The postmortem of a game outlines and analyzes the game’s history, team goals, what went right, and what went wrong with the game. Despite its significance, there is little understanding of the challenges encountered by programmers. Postmortems are not properly maintained and are informally written, leading to a lack of trustworthiness. In this study, we perform a systematic analysis of the different problems faced in video game development. The need for automation and ML techniques arises because they could help game developers easily identify the exact problem from its description and hence find a solution more easily. This work could also help developers identify frequent mistakes that could be avoided, and it provides researchers with a starting point for further studying game development in the context of software engineering.