Somak Aditya


2023

pdf bib
SYNC: A Structurally Guided Hard Negative Curricula for Generalizable Neural Code Search
Atharva Naik | Soumitra Das | Jyothi Vedurada | Somak Aditya
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Prover: Generating Intermediate Steps for NLI with Commonsense Knowledge Retrieval and Next-Step Prediction
Deepanway Ghosal | Somak Aditya | Monojit Choudhury
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Robust Information-Masking Approach for Domain Counterfactual Generation
Pengfei Hong | Rishabh Bhardwaj | Navonil Majumder | Somak Aditya | Soujanya Poria
Findings of the Association for Computational Linguistics: ACL 2023

Domain shift is a big challenge in NLP. Many approaches, thus, resort to learning domain-invariant features to mitigate the hurdles of domain shift during inference. Such methods, however, inexorably fail to leverage the domain-specific nuances relevant to the task at hand. To avoid such drawbacks, domain counterfactual generation has recently been proposed that aims to transform a text from the source domain to a given target domain. To achieve this, the existing method uses a frequency-based approach to identify and mask the source-domain-specific tokens in a text. A pretrained LM is then prompted to fill the masks with target-domain-specific tokens. We, however, have observed that, due to limitations of the available data, such a frequency-based method may either miss some domain-token associations or lead to some spurious domain-token associations. To this end, we additionally employ attention norm-based scores to identify additional token-domain associations from a domain classifier. To minimize spurious associations, we also devise an iterative unmasking heuristic that unmasks the masked tokens to minimize the confidence of a domain classifier in the source domain. Our experiments empirically show that the counterfactual samples sourced from our masked text lead to improved domain transfer across various classification tasks. The proposed approach outperforms the baselines on 10 out of 12 domain-counterfactual classification settings with an average of 1.7% improvement in accuracy metric.

2022

pdf bib
Vector Space Interpolation for Query Expansion
Deepanway Ghosal | Somak Aditya | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Topic-sensitive query set expansion is an important area of research that aims to improve search results for information retrieval. It is particularly crucial for queries related to sensitive and emerging topics. In this work, we describe a method for query set expansion about emerging topics using vector space interpolation. We use a transformer model called OPTIMUS, which is suitable for vector space manipulation due to its variational autoencoder nature. One of our proposed methods – Dirichlet interpolation shows promising results for query expansion. Our methods effectively generate new queries about the sensitive topic by incorporating set-level diversity, which is not captured by traditional sentence-level augmentation methods such as paraphrasing or back-translation.

pdf bib
Multilingual CheckList: Generation and Evaluation
Karthikeyan K | Shaily Bhatt | Pankaj Singh | Somak Aditya | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Multilingual evaluation benchmarks usually contain limited high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generate Multilingual CheckLists. We device an algorithm –Template Extraction Algorithm (TEA) for automatically extracting target language CheckList templates from machine translated instances of a source language templates. We compare the TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze different approaches to creating CheckLists in Hindi. Furthermore, we experiment with 9 more different languages. We find that TEA followed by human verification is ideal for scaling Checklist-based evaluation to multiple languages while TEA gives a good estimates of model performance. We release the code of TEA and the CheckLists created at aka.ms/multilingualchecklist

2021

pdf bib
Analyzing the Effects of Reasoning Types on Cross-Lingual Transfer Performance
Karthikeyan K | Aalok Sathe | Somak Aditya | Monojit Choudhury
Proceedings of the 1st Workshop on Multilingual Representation Learning

Multilingual language models achieve impressive zero-shot accuracies in many languages in complex tasks such as Natural Language Inference (NLI). Examples in NLI (and equivalent complex tasks) often pertain to various types of sub-tasks, requiring different kinds of reasoning. Certain types of reasoning have proven to be more difficult to learn in a monolingual context, and in the crosslingual context, similar observations may shed light on zero-shot transfer efficiency and few-shot sample selection. Hence, to investigate the effects of types of reasoning on transfer performance, we propose a category-annotated multilingual NLI dataset and discuss the challenges to scale monolingual annotations to multiple languages. We statistically observe interesting effects that the confluence of reasoning types and language similarities have on transfer performance.

2020

pdf bib
TaxiNLI: Taking a Ride up the NLU Hill
Pratik Joshi | Somak Aditya | Aalok Sathe | Monojit Choudhury
Proceedings of the 24th Conference on Computational Natural Language Learning

Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance in the Natural Language Inference (NLI) task. Since NLI examples encompass a variety of linguistic, logical, and reasoning phenomena, it remains unclear as to which specific concepts are learnt by the trained systems and where they can achieve strong generalization. To investigate this question, we propose a taxonomic hierarchy of categories that are relevant for the NLI task. We introduce TaxiNLI, a new dataset, that has 10k examples from the MNLI dataset with these taxonomic labels. Through various experiments on TaxiNLI, we observe that whereas for certain taxonomic categories SOTA neural models have achieved near perfect accuracies—a large jump over the previous models—some categories still remain difficult. Our work adds to the growing body of literature that shows the gaps in the current NLI systems and datasets through a systematic presentation and analysis of reasoning categories.

2015

pdf bib
Identifying Various Kinds of Event Mentions in K-Parser Output
Arpit Sharma | Nguyen Vo | Somak Aditya | Chitta Baral
Proceedings of the The 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

pdf bib
Recognizing Social Constructs from Textual Conversation
Somak Aditya | Chitta Baral | Nguyen Ha Vo | Joohyung Lee | Jieping Ye | Zaw Naung | Barry Lumpkin | Jenny Hastings | Richard Scherl | Dawn M. Sweet | Daniela Inclezan
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies