Pratyush Mishra


2024

Empowering SW Security: CodeBERT and Machine Learning Approaches to Vulnerability Detection
Lov Kumar | Vikram Singh | Srivalli Patel | Pratyush Mishra
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

Software (SW) systems experience faults after deployment, raising concerns about reliability and leading to financial losses, reputational damage, and safety risks. This paper presents a novel approach that uses CodeBERT, a state-of-the-art neural code representation model pre-trained on multiple programming languages, together with various code metrics to predict SW faults. The study comprehensively evaluates the trained models by analyzing a publicly available codebase and employing diverse machine learning models, feature selection techniques, and class balancing through SMOTE. The results show that SMOTE significantly enhances vulnerability detection performance, particularly in accuracy, AUC, sensitivity, and specificity. The EXTR classifier consistently outperforms the others, with an average AUC of 0.82, as do the features selected by the GA feature selection technique, which achieve a mean AUC of 0.84. Interestingly, among the employed embedding techniques, SW metrics combined with CodeBERT (SMCBERT) stand out as the top performer, achieving the highest mean AUC score of 0.80 and making models trained on SMCBERT the best for SW vulnerability prediction.
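The pipeline the abstract describes (class balancing with SMOTE, then a tree-ensemble classifier scored by AUC) can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes EXTR refers to scikit-learn's `ExtraTreesClassifier`, uses synthetic features in place of the CodeBERT/code-metric (SMCBERT) embeddings, and implements SMOTE's interpolation step by hand so the example needs only scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Synthesize n_new minority samples by interpolating between each
    minority point and one of its k nearest minority neighbors (the core
    idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)               # column 0 is the point itself
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]      # a random true neighbor
        out.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.vstack(out)

# Imbalanced synthetic data standing in for SMCBERT feature vectors
# (roughly 9:1 non-vulnerable to vulnerable).
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample the minority class until the training set is balanced.
X_min = X_tr[y_tr == 1]
X_syn = smote(X_min, n_new=(y_tr == 0).sum() - len(X_min))
X_bal = np.vstack([X_tr, X_syn])
y_bal = np.concatenate([y_tr, np.ones(len(X_syn), dtype=int)])

# Fit Extra Trees on the balanced data and evaluate by AUC, as in the paper.
clf = ExtraTreesClassifier(random_state=0).fit(X_bal, y_bal)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

In practice a library implementation such as imbalanced-learn's `SMOTE` would replace the hand-rolled helper; the sketch spells out the interpolation only to show what the balancing step does.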

Mocktails of Translation, Ensemble Learning and Embeddings to tackle Hinglish NLP challenges
Lov Kumar | Vikram Singh | Proksh | Pratyush Mishra
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

Social media has become a global platform where users express opinions on diverse contemporary topics, often blending dominant languages with native tongues, producing code-mixed, context-rich content. A typical example is Hinglish, where Hindi elements are embedded in English text. This linguistic mixture challenges traditional NLP systems, which rely on monolingual resources and struggle to process multilingual content. Sentiment analysis for code-mixed data, particularly involving Indian languages, remains largely unexplored. This paper introduces a novel approach to sentiment analysis of code-mixed Hinglish data that combines translation, different stacking classifier architectures, and embedding techniques. We use pre-trained LoRA weights of a fine-tuned Gemma-2B model to translate Hinglish into English, followed by four pre-trained meta-embeddings: GloVe-T, Word2Vec, TF-IDF, and fastText. SMOTE is applied to balance the skewed data, and dimensionality reduction is performed before training machine learning models and stacking classifier ensembles. Three ensemble architectures, each combining 22 base classifiers with a Logistic Regression meta-classifier, test different meta-embedding combinations. Experimental results show that the TF-W2V-FST (TF-IDF, Word2Vec, fastText) combination performs best, with an SVM using a radial basis function kernel achieving the highest accuracy (91.53%) and AUC (0.96). This research contributes a novel and effective technique for sentiment analysis of code-mixed data.
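The stacking architecture the abstract outlines (base classifiers whose predictions feed a Logistic Regression meta-classifier over text embeddings) can be sketched with scikit-learn. This is a toy illustration under stated assumptions: a handful of English sentences stand in for the Gemma-2B-translated Hinglish data, TF-IDF stands in for the fused TF-W2V-FST meta-embeddings, and only three base classifiers are shown in place of the paper's 22 (including an RBF-kernel SVM, the paper's strongest single model).

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy sentiment data standing in for translated Hinglish posts (1 = positive).
texts = ["the movie was wonderful", "what a fantastic show",
         "i loved every minute", "truly great acting",
         "the movie was terrible", "what an awful show",
         "i hated every minute", "truly bad acting"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A reduced set of base classifiers; their out-of-fold outputs become the
# features for the Logistic Regression meta-classifier.
base = [("svm", SVC(kernel="rbf")),                   # RBF-kernel SVM
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", MultinomialNB())]

model = make_pipeline(
    TfidfVectorizer(),                                # stands in for meta-embeddings
    StackingClassifier(estimators=base,
                       final_estimator=LogisticRegression(),
                       cv=2))
model.fit(texts, labels)
pred = model.predict(["a fantastic wonderful movie"])[0]
```

In the paper's setup, the embedding stage would instead concatenate TF-IDF, Word2Vec, and fastText vectors (after SMOTE and dimensionality reduction), and the stacking layer would draw on all 22 base classifiers.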