Prince Kumar


2024

DocCGen: Document-based Controlled Code Generation
Sameer Pimparkhede | Mehant Kammakomati | Srikanth G. Tamilselvam | Prince Kumar | Ashok Pon Kumar | Pushpak Bhattacharyya
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent developments show that Large Language Models (LLMs) produce state-of-the-art performance on natural language (NL) to code generation for resource-rich general-purpose languages like C++, Java, and Python. However, their practical usage for structured domain-specific languages (DSLs) such as YAML and JSON is limited due to domain-specific schemas, grammar, and customizations generally unseen by LLMs during pre-training. Efforts have been made to mitigate this challenge via in-context learning with relevant examples or via fine-tuning. However, these approaches suffer from limited DSL samples and prompt sensitivity, whereas enterprises typically maintain good documentation of their DSLs. Therefore, we propose DocCGen, a framework that leverages such rich knowledge by breaking the NL-to-Code generation task for structured code languages into a two-step process. First, it detects the correct libraries using the library documentation that best matches the NL query. Then, it utilizes schema rules extracted from the documentation of these libraries to constrain the decoding. We evaluate our framework on two complex structured languages, Ansible YAML and Bash command, in two settings: Out-of-domain (OOD) and In-domain (ID). Our extensive experiments show that DocCGen consistently improves language models of different sizes across all six evaluation metrics, reducing syntactic and semantic errors in structured code.
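To make the two-step idea concrete, here is a minimal sketch, not the authors' implementation: TF-IDF retrieval stands in for library detection, and a simple key filter stands in for schema-constrained decoding. The toy documentation strings and schema rules below are illustrative assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Step 1: library detection -- retrieve the library whose documentation best
# matches the NL query (TF-IDF here for brevity; any retriever would do).
library_docs = {
    "ansible.builtin.copy": "Copy files to remote locations. Options: src, dest, mode.",
    "ansible.builtin.yum": "Manage packages with the yum package manager. Options: name, state.",
}
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(library_docs.values())

def detect_library(nl_query: str) -> str:
    query_vec = vectorizer.transform([nl_query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    return list(library_docs)[scores.argmax()]

# Step 2: schema-constrained decoding, reduced to its essence -- candidate
# field names the detected library's schema does not allow are rejected.
schema_rules = {"ansible.builtin.copy": {"src", "dest", "mode"}}

def constrain(library: str, candidate_keys: list[str]) -> list[str]:
    allowed = schema_rules.get(library, set())
    return [key for key in candidate_keys if key in allowed]

library = detect_library("copy a config file to the remote host")
print(library)                                            # ansible.builtin.copy
print(constrain(library, ["src", "owner_name", "dest"]))  # ['src', 'dest']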

Read between the lines - Functionality Extraction From READMEs
Prince Kumar | Srikanth Tamilselvam | Dinesh Garg
Findings of the Association for Computational Linguistics: NAACL 2024

While text summarization is a well-known NLP task, in this paper we introduce a novel and useful variant of it: functionality extraction from Git README files. Though this task is text2text generation at an abstract level, it involves its own peculiarities and challenges that make existing text2text generation systems not very useful. The motivation behind this task stems from a recent surge in research and development activities around the use of large language models for code-related tasks, such as code refactoring and code summarization. We also release a human-annotated dataset called FuncRead and develop a battery of models for the task. Our exhaustive experimentation shows that small fine-tuned models beat any baseline models that can be designed using popular black-box or white-box large language models (LLMs) such as ChatGPT and Bard. Our best fine-tuned 7-billion-parameter CodeLlama model exhibits 70% and 20% gains in F1 score over ChatGPT and Bard, respectively.
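As an illustration of the task itself, the sketch below prompts an instruction-tuned code LM to list functionalities from a README. The prompt wording, the toy README, and the choice of the CodeLlama-7B-Instruct checkpoint are assumptions for illustration, not the FuncRead training recipe.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "codellama/CodeLlama-7b-Instruct-hf"  # assumed checkpoint; any instruction-tuned code LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

readme = """# imgtool
A tiny CLI for images. It can resize pictures, convert between PNG and JPEG,
and strip EXIF metadata before sharing."""

prompt = (
    "Read the README below and list each functionality the software offers, one per line.\n\n"
    + readme
    + "\n\nFunctionalities:\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))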

2023

Prompting with Pseudo-Code Instructions
Mayank Mishra | Prince Kumar | Riyaz Bhat | Rudra Murthy | Danish Contractor | Srikanth Tamilselvam
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Prompting with natural language instructions has recently emerged as a popular method of harnessing the capabilities of large language models (LLMs). Given the inherent ambiguity present in natural language, it is intuitive to consider the possible advantages of prompting with less ambiguous prompt styles, such as pseudo-code. In this paper, we explore whether prompting via pseudo-code instructions helps improve the performance of pre-trained language models. We manually create a dataset of pseudo-code prompts for 132 different tasks spanning classification, QA, and generative language tasks, sourced from the Super-NaturalInstructions dataset. Using these prompts along with their counterparts in natural language, we study their performance on two LLM families: BLOOM and CodeGen. Our experiments show that using pseudo-code instructions leads to better results, with an average absolute increase of 7-16 points in F1 scores for classification tasks and a relative improvement of 12-38% in aggregate ROUGE-L scores across all tasks. We include detailed ablation studies which indicate that code comments, docstrings, and the structural clues encoded in pseudo-code all contribute towards the improvement in performance. To the best of our knowledge, our work is the first to demonstrate how pseudo-code prompts can be helpful in improving the performance of pre-trained LMs.
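For intuition, here is a minimal sketch of the contrast the paper studies: the same classification task expressed once as a natural language instruction and once as a pseudo-code instruction with a function signature, docstring, and comment. The task and wording are illustrative; the exact prompt format of the released dataset is not reproduced here.

# The same task, expressed two ways. Both are plain strings passed to an LM;
# the pseudo-code version makes the input, output type, and decision rule explicit.
natural_language_prompt = (
    "Decide whether the following movie review is positive or negative.\n"
    "Review: {review}\n"
    "Answer:"
)

pseudo_code_prompt = '''def classify_sentiment(review: str) -> str:
    """Return "positive" if the movie review expresses a favorable
    opinion, otherwise return "negative"."""
    # review: a single movie review in English

classify_sentiment("{review}")'''

review = "A warm, funny film that never overstays its welcome."
print(natural_language_prompt.format(review=review))
print(pseudo_code_prompt.format(review=review))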

2021

CFILT IIT Bombay@LT-EDI-EACL2021: Hope Speech Detection for Equality, Diversity, and Inclusion using Multilingual Representation from Transformers
Pankaj Singh | Prince Kumar | Pushpak Bhattacharyya
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

With the internet becoming part and parcel of our lives, engagement on social media has increased substantially. Identifying and eliminating offensive content from social media has become a top priority to prevent any kind of violence. However, detecting encouraging, supportive, and positive content is equally important to prevent censorship from being misused to attack freedom of speech. This paper presents our system for the shared task Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI, EACL 2021. The data for this shared task, collected from YouTube comments, is provided in English, Tamil, and Malayalam. It is a multiclass classification problem where each data instance is categorized into one of three classes: ‘Hope speech’, ‘Not hope speech’, and ‘Not in intended language’. We propose a system that employs multilingual transformer models to obtain text representations and classify them into one of the three classes. We explored multilingual models trained specifically for Indian languages alongside generic multilingual models. Our system ranked 2nd for English, 2nd for Malayalam, and 7th for Tamil on the final leaderboard published by the organizers, obtaining weighted F1-scores of 0.92, 0.84, and 0.55, respectively, on the hidden test dataset used for the competition. We have made our system publicly available on GitHub.
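A minimal sketch of the described setup, assuming xlm-roberta-base as the multilingual encoder and this label order; the actual shared-task system and the specific Indian-language models it explored may differ.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "xlm-roberta-base"  # assumed; the paper explores several multilingual encoders
LABELS = ["Hope speech", "Not hope speech", "Not in intended language"]

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=len(LABELS))

def classify(comment: str) -> str:
    inputs = tokenizer(comment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]

# The freshly initialized classification head is untrained, so predictions are
# arbitrary until the model is fine-tuned on the shared-task data.
print(classify("We stand with you, things will get better."))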