2024
The Counterfeit Conundrum: Can Code Language Models Grasp the Nuances of Their Incorrect Generations?
Alex Gu | Wen-Ding Li | Naman Jain | Theo Olausson | Celine Lee | Koushik Sen | Armando Solar-Lezama
Findings of the Association for Computational Linguistics: ACL 2024
While language models are increasingly proficient at code generation, they still frequently generate incorrect programs. Many of these programs are obviously wrong, but others are more subtle and pass weaker correctness checks such as being able to compile. In this work, we focus on these counterfeit samples: programs sampled from a language model that 1) have a high enough log-probability to be generated at a moderate temperature and 2) pass weak correctness checks. Overall, we discover that most models have a very shallow understanding of counterfeits, which manifests in three clear failure modes. First, models mistakenly classify them as correct. Second, models are worse at reasoning about the execution behaviour of counterfeits and often predict their execution results as if they were correct. Third, when models are asked to fix counterfeits, the likelihood of successful repair is often even lower than that of sampling a correct program from scratch. Counterfeits also have very unexpected properties: first, counterfeit programs for problems that are easier for a model to solve are not necessarily easier to detect and only slightly easier to execute and repair. Second, counterfeits from a given model are just as confusing to the model itself as they are to other models. Finally, both strong and weak models are able to generate counterfeit samples that equally challenge all models. In light of our findings, we recommend that care and caution be taken when relying on models to understand their own samples, especially when no external feedback is incorporated.
2020
What’s in a Name? Are BERT Named Entity Representations just as Good for any other Name?
Sriram Balasubramanian | Naman Jain | Gaurav Jindal | Abhijeet Awasthi | Sunita Sarawagi
Proceedings of the 5th Workshop on Representation Learning for NLP
We evaluate named entity representations of BERT-based NLP models by investigating their robustness to replacements from the same typed class in the input. We highlight that on several tasks, while such perturbations are natural, state-of-the-art trained models are surprisingly brittle. The brittleness continues even with the recent entity-aware BERT models. We also try to discern the cause of this non-robustness, considering factors such as tokenization and frequency of occurrence. Then we provide a simple method that ensembles predictions from multiple replacements while jointly modeling the uncertainty of type annotations and label predictions. Experiments on three NLP tasks show that our method enhances robustness and increases accuracy on both natural and adversarial datasets.
A Multi-Dimensional View of Aggression when voicing Opinion
Arjit Srivastava | Avijit Vajpayee | Syed Sarfaraz Akhtar | Naman Jain | Vinay Singh | Manish Shrivastava
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
The advent of social media has immensely increased the number of opinions and arguments voiced on the internet. These virtual debates often present cases of aggression. While research has largely focused on analyzing aggression and stance in isolation from each other, this work is the first attempt to gain an extensive and fine-grained understanding of patterns of aggression and figurative language use when voicing opinion. We present a Hindi-English code-mixed dataset of opinion on the politico-social issue of '2016 India banknote demonetisation' and annotate it across multiple dimensions such as aggression, hate speech, emotion arousal and figurative language usage (such as sarcasm/irony, metaphors/similes, puns/word-play).
2016
A House United: Bridging the Script and Lexical Barrier between Hindi and Urdu
Riyaz A. Bhat | Irshad A. Bhat | Naman Jain | Dipti Misra Sharma
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
In Computational Linguistics, Hindi and Urdu are not viewed as a monolithic entity and have received separate attention with respect to their text processing. From part-of-speech tagging to machine translation, models are separately trained for both Hindi and Urdu despite the fact that they represent the same language. The main reasons are their divergent literary vocabularies and separate orthographies, and probably also their political status and the social perception that they are two separate languages. In this article, we propose a simple but efficient approach to bridge the lexical and orthographic differences between Hindi and Urdu texts. With respect to text processing, addressing the differences between the Hindi and Urdu texts would be beneficial in the following ways: (a) instead of training separate models, their individual resources can be augmented to train single, unified models for better generalization, and (b) their individual text processing applications can be used interchangeably under varied resource conditions. To remove the script barrier, we learn accurate statistical transliteration models which use sentence-level decoding to resolve word ambiguity. Similarly, we learn cross-register word embeddings from the harmonized Hindi and Urdu corpora to nullify their lexical divergences. As a proof of concept, we evaluate our approach on Hindi and Urdu dependency parsing under two scenarios: (a) resource sharing, and (b) resource augmentation. We demonstrate that a neural network-based dependency parser trained on augmented, harmonized Hindi and Urdu resources performs significantly better than the parsing models trained separately on the individual resources. We also show that we can achieve near state-of-the-art results when the parsers are used interchangeably.
2014
Language Identification in Code-Switching Scenario
Naman Jain | Riyaz Ahmad Bhat
Proceedings of the First Workshop on Computational Approaches to Code Switching
Adapting Predicate Frames for Urdu PropBanking
Riyaz Ahmad Bhat | Naman Jain | Ashwini Vaidya | Martha Palmer | Tafseer Ahmed Khan | Dipti Misra Sharma | James Babani
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants
2013
Effective Parsing for Human Aided NLP Systems
Naman Jain | Sambhav Jain
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)
Exploring Semantic Information in Hindi WordNet for Hindi Dependency Parsing
Sambhav Jain | Naman Jain | Aniruddha Tammewar | Riyaz Ahmad Bhat | Dipti Sharma
Proceedings of the Sixth International Joint Conference on Natural Language Processing
2012
Two-stage Approach for Hindi Dependency Parsing Using MaltParser
Naman Jain | Karan Singla | Aniruddha Tammewar | Sambhav Jain
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages