Sandipan Dandapat

Also published as: Sandipan Dandpat


2022

Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models
Kabir Ahuja | Shanu Kumar | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Massively Multilingual Transformer based Language Models have been observed to be surprisingly effective on zero-shot transfer across languages, though the performance varies from language to language depending on the pivot language(s) used for fine-tuning. In this work, we build upon some of the existing techniques for predicting the zero-shot performance on a task, by modeling it as a multi-task learning problem. We jointly train predictive models for different tasks, which helps us build more accurate predictors for tasks where we have test data in very few languages to measure the actual performance of the model. Our approach also lends us the ability to perform a much more robust feature selection, and identify a common set of features that influence zero-shot performance across a variety of tasks.

Beyond Static models and test sets: Benchmarking the potential of pre-trained models across tasks and languages
Kabir Ahuja | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP

Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity. We argue that this makes the existing practices in multilingual evaluation unreliable and does not provide a full picture of the performance of MMLMs across the linguistic landscape. We propose that the recent work done in Performance Prediction for NLP tasks can serve as a potential solution in fixing benchmarking in Multilingual NLP by utilizing features related to data and language typology to estimate the performance of an MMLM on different languages. We compare performance prediction with translating test data in a case study on four different multilingual datasets, and observe that these methods can provide reliable estimates of the performance that are often on par with the translation-based approaches, without the need for any additional translation or evaluation costs.

2021

A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist
Shaily Bhatt | Rahul Jain | Sandipan Dandapat | Sunayana Sitaram
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Despite state-of-the-art performance, NLP systems can be fragile in real-world situations. This is often due to insufficient understanding of the capabilities and limitations of models and the heavy reliance on standard evaluation benchmarks. Research into non-standard evaluation to mitigate this brittleness is gaining increasing attention. Notably, the behavioral testing principle ‘Checklist’, which decouples testing from implementation, revealed significant failures in state-of-the-art models for multiple tasks. In this paper, we present a case study of using Checklist in a practical scenario. We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique for improving the model using insights from Checklist. We lay out the challenges and open questions based on our observations of using Checklist for human-in-loop evaluation and improvement of NLP systems. Disclaimer: The paper contains examples of content with offensive language. The examples do not represent the views of the authors or their employers towards any person(s), group(s), practice(s), or entity/entities.

2020

GLUECoS: An Evaluation Benchmark for Code-Switched NLP
Simran Khanuja | Sandipan Dandapat | Anirudh Srinivasan | Sunayana Sitaram | Monojit Choudhury
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering and a new task for code-switching, Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.

A New Dataset for Natural Language Inference from Code-mixed Conversations
Simran Khanuja | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Proceedings of the The 4th Workshop on Computational Approaches to Code Switching

Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and hypotheses are in code-mixed Hindi-English. We use data from Hindi movies (Bollywood) as premises, and crowd-source hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol based on observations from the pilot. Currently, the data collected consists of 400 premises in the form of code-mixed conversation snippets and 2240 code-mixed hypotheses. We conduct an extensive analysis to infer the linguistic phenomena commonly observed in the dataset obtained. We evaluate the dataset using a standard mBERT-based pipeline for NLI and report results.

Code-mixed parse trees and how to find them
Anirudh Srinivasan | Sandipan Dandapat | Monojit Choudhury
Proceedings of the The 4th Workshop on Computational Approaches to Code Switching

In this paper, we explore methods of obtaining parse trees of code-mixed sentences and analyse the obtained trees. Existing work has shown that linguistic theories can be used to generate code-mixed sentences from a set of parallel sentences. We build upon this work, using one of these theories, the Equivalence-Constraint theory, to obtain the parse trees of synthetically generated code-mixed sentences and evaluate them with a neural constituency parser. We highlight the lack of a dataset of non-synthetic code-mixed constituency parse trees and how it makes our evaluation difficult. To complete our evaluation, we convert a code-mixed dependency parse tree set into “pseudo constituency trees” and find that a parser trained on synthetically generated trees is able to parse these reasonably well too.

2019

Processing and Understanding Mixed Language Data
Monojit Choudhury | Anirudh Srinivasan | Sandipan Dandapat
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts

Multilingual communities exhibit code-mixing, that is, mixing of two or more socially stable languages in a single conversation, sometimes even in a single utterance. This phenomenon has been widely studied by linguists and interaction scientists in the spoken language of such communities. However, with the prevalence of social media and other informal interactive platforms, code-switching is now also ubiquitously observed in user-generated text. As multilingual communities are more the norm from a global perspective, it becomes essential that code-switched text and speech are adequately handled by language technologies and NUIs. Code-mixing is extremely prevalent in all multilingual societies. Current studies have shown that as much as 20% of user-generated content from some geographies, like South Asia, parts of Europe, and Singapore, is code-mixed. Thus, it is very important to handle code-mixed content as a part of NLP systems and applications for these geographies. In the past 5 years, there has been an active interest in computational models for code-mixing with a substantive research outcome in terms of publications, datasets and systems. However, it is not easy to find a single point of access for a complete and coherent overview of the research. This tutorial aims to fill this gap and provide new researchers in the area with a foundation in both linguistic and computational aspects of code-mixing. We hope that this then becomes a starting point for those who wish to pursue research, design, development and deployment of code-mixed systems in multilingual societies.

INMT: Interactive Neural Machine Translation Prediction
Sebastin Santy | Sandipan Dandapat | Monojit Choudhury | Kalika Bali
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

In this paper, we demonstrate an Interactive Machine Translation interface that assists human translators with on-the-fly hints and suggestions. This makes the end-to-end translation process faster and more efficient, and produces high-quality translations. We augment the OpenNMT backend with a mechanism to accept the user input and generate conditioned translations.

2018

Identifying Transferable Information Across Domains for Cross-domain Sentiment Classification
Raksha Sharma | Pushpak Bhattacharyya | Sandipan Dandapat | Himanshu Sharad Bhatt
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Getting manually labeled data in each domain is always an expensive and time-consuming task. Cross-domain sentiment analysis has emerged as a demanding concept where a labeled source domain facilitates a sentiment classifier for an unlabeled target domain. However, polarity orientation (positive or negative) and the significance of a word to express an opinion often differ from one domain to another. Owing to these differences, cross-domain sentiment classification is still a challenging task. In this paper, we propose that words that do not change their polarity and significance represent the transferable (usable) information across domains for cross-domain sentiment classification. We present a novel approach based on the χ² test and cosine similarity between context vectors of words to identify polarity-preserving significant words across domains. Furthermore, we show that a weighted ensemble of the classifiers enhances the cross-domain classification performance.

Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data
Adithya Pratapa | Gayatri Bhat | Monojit Choudhury | Sunayana Sitaram | Sandipan Dandapat | Kalika Bali
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Training language models for Code-mixed (CM) language is known to be a difficult problem because of a lack of data, compounded by the increased confusability due to the presence of more than one language. We present a computational technique for the creation of grammatically valid artificial CM data based on the Equivalence Constraint Theory. We show that when training examples are sampled appropriately from this synthetic data and presented in a certain order (aka training curriculum) along with monolingual and real CM data, it can significantly reduce the perplexity of an RNN-based language model. We also show that randomly generated CM data does not help in decreasing the perplexity of the LMs.

Translating Web Search Queries into Natural Language Questions
Adarsh Kumar | Sandipan Dandapat | Sushil Chordia
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

A Fluctuation Smoothing Approach for Unsupervised Automatic Short Answer Grading
Shourya Roy | Sandipan Dandapat | Y. Narahari
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)

We offer a fluctuation smoothing computational approach for unsupervised automatic short answer grading (ASAG) techniques in the educational ecosystem. A major drawback of the existing techniques is the significant effect that variations in model answers could have on their performances. The proposed fluctuation smoothing approach, based on classical sequential pattern mining, exploits lexical overlap in students’ answers to any typical question. We empirically demonstrate using multiple datasets that the proposed approach improves the overall performance and significantly reduces (up to 63%) variation in performance (standard deviation) of unsupervised ASAG techniques. We bring in additional benchmarks such as (a) paraphrasing of model answers and (b) using answers by k top performing students as model answers, to amplify the benefits of the proposed approach.

Wisdom of Students: A Consistent Automatic Short Answer Grading Technique
Shourya Roy | Sandipan Dandapat | Ajay Nagesh | Y. Narahari
Proceedings of the 13th International Conference on Natural Language Processing

SODA: Service Oriented Domain Adaptation Architecture for Microblog Categorization
Himanshu Sharad Bhatt | Sandipan Dandapat | Peddamuthu Balaji | Shourya Roy | Sharmistha Jat | Deepali Semwal
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2014

MTWatch: A Tool for the Analysis of Noisy Parallel Data
Sandipan Dandapat | Declan Groves
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

State-of-the-art statistical machine translation (SMT) techniques require good-quality parallel data to build a translation model. The availability of large parallel corpora has rapidly increased over the past decade. However, these newly developed parallel data often contain significant noise. In this paper, we describe our approach for classifying good-quality parallel sentence pairs from noisy parallel data. We use 10 different features within a Support Vector Machine (SVM)-based model for our classification task. We report a reasonably good classification accuracy and its positive effect on overall MT accuracy.

Hierarchical Recursive Tagset for Annotating Cooking Recipes
Sharath Reddy Gunamgari | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 11th International Conference on Natural Language Processing

2013

TMTprime: A Recommender System for MT and TM Integration
Aswarth Abhilash Dara | Sandipan Dandapat | Declan Groves | Josef van Genabith
Proceedings of the 2013 NAACL HLT Demonstration Session

2012

Approximate Sentence Retrieval for Scalable and Efficient Example-Based Machine Translation
Johannes Leveling | Debasis Ganguly | Sandipan Dandapat | Gareth Jones
Proceedings of COLING 2012

Combining EBMT, SMT, TM and IR Technologies for Quality and Scale
Sandipan Dandapat | Sara Morrissey | Andy Way | Josef van Genabith
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

2011

Using Example-Based MT to Support Statistical MT when Translating Homogeneous Data in a Resource-Poor Setting
Sandipan Dandapat | Sara Morrissey | Andy Way | Mikel L. Forcada
Proceedings of the 15th Annual conference of the European Association for Machine Translation

2010

Mitigating Problems in Analogy-based EBMT with SMT and vice versa: A Case Study with Named Entity Transliteration
Sandipan Dandapat | Sara Morrissey | Sudip Kumar Naskar | Harold Somers
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

MATREX: The DCU MT System for WMT 2010
Sergio Penkale | Rejwanul Haque | Sandipan Dandapat | Pratyush Banerjee | Ankit K. Srivastava | Jinhua Du | Pavel Pecina | Sudip Kumar Naskar | Mikel L. Forcada | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

Large-Coverage Root Lexicon Extraction for Hindi
Cohan Sujay Carlos | Monojit Choudhury | Sandipan Dandapat
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Complex Linguistic Annotation – No Easy Way Out! A Case from Bangla and Hindi POS Labeling Tasks
Sandipan Dandapat | Priyanka Biswas | Monojit Choudhury | Kalika Bali
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

English-Hindi Transliteration Using Context-Informed PB-SMT: the DCU System for NEWS 2009
Rejwanul Haque | Sandipan Dandapat | Ankit Kumar Srivastava | Sudip Kumar Naskar | Andy Way
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

2008

Prototype Machine Translation System From Text-To-Indian Sign Language
Tirthankar Dasgupta | Sandipan Dandpat | Anupam Basu
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

A Hybrid Named Entity Recognition System for South and South East Asian Languages
Sujan Kumar Saha | Sanjay Chatterji | Sandipan Dandapat | Sudeshna Sarkar | Pabitra Mitra
Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages

Bengali and Hindi to English CLIR Evaluation
Debasis Mandal | Sandipan Dandapat | Mayank Gupta | Pratyush Banerjee | Sudeshna Sarkar
Proceedings of the 2nd workshop on Cross Lingual Information Access (CLIA) Addressing the Information Need of Multilingual Societies

2007

Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario
Sandipan Dandapat | Sudeshna Sarkar | Anupam Basu
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions