Sandipan Dandapat - ACL Anthology

Sandipan Dandapat

Also published as: Sandipan Dandpat

2025

Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Saketh Reddy Vemula | Sandipan Dandapat | Dipti Sharma | Parameswari Krishnamurthy
The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for languages with complex morphology. In this paper, we conduct a comprehensive evaluation of tokenizers using small-sized BERT models—from pre-training through fine-tuning—for Telugu (agglutinative), along with preliminary evaluation in Hindi (primarily fusional with some agglutination) and English (fusional). To evaluate morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms.Our experiments reveal two key findings for Telugu. First, the choice of tokenizer algorithm is the most significant factor influencing performance, with Unigram-based tokenizers consistently outperforming BPE across most settings. Second, while better morphological alignment shows a moderate, positive correlation with performance on text classification and structure prediction tasks, its impact is secondary to the tokenizer algorithm. Notably, hybrid approaches that use morphological information for pre-segmentation significantly boost the performance of BPE, though not Unigram. Our results further showcase the need for comprehensive intrinsic evaluation metrics for tokenizers that could explain downstream performance trends consistently.

SAGE: A Generic Framework for LLM Safety Evaluation
Madhur Jindal | Hari Shrawgi | Parag Agrawal | Sandipan Dandapat
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track

As Large Language Models are rapidly deployed across diverse applications from healthcare to financial advice, safety evaluation struggles to keep pace. Current benchmarks focus on single-turn interactions with generic policies, failing to capture the conversational dynamics of real-world usage and the application-specific harms that emerge in context. Such potential oversights can lead to harms that go unnoticed in standard safety benchmarks and other current evaluation methodologies. To address these needs for robust AI safety evaluation, we introduce SAGE (Safety AI Generic Evaluation), an automated modular framework designed for customized and dynamic harm evaluations. SAGE employs prompted adversarial agents with diverse personalities based on the Big Five model, enabling system-aware multi-turn conversations that adapt to target applications and harm policies. We evaluate seven state-of-the-art LLMs across three applications and harm policies. Multi-turn experiments show that harm increases with conversation length, model behavior varies significantly when exposed to different user personalities and scenarios, and some models minimize harm via high refusal rates that reduce usefulness. We also demonstrate policy sensitivity within a harm category where tightening a child-focused sexual policy substantially increases measured defects across applications. These results motivate adaptive, policy-aware, and context-specific testing for safer real-world deployment.

LLM Safety for Children
Prasanjit Rath | Hari Shrawgi | Parag Agrawal | Sandipan Dandapat
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

This paper analyzes the safety of Large Language Models (LLMs) in interactions with children below age of 18 years. Despite the transformative applications of LLMs in various aspects of children’s lives, such as education and therapy, there remains a significant gap in understanding and mitigating potential content harms specific to this demographic. The study acknowledges the diverse nature of children, often overlooked by standard safety evaluations, and proposes a comprehensive approach to evaluating LLM safety specifically for children. We list down potential risks that children may encounter when using LLM-powered applications. Additionally, we develop Child User Models that reflect the varied personalities and interests of children, informed by literature in child care and psychology. These user models aim to bridge the existing gap in child safety literature across various fields. We utilize Child User Models to evaluate the safety of six state-of-the-art LLMs. Our observations reveal significant safety gaps in LLMs, particularly in categories harmful to children but not adults.

Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided Strategy Selection
Shanu Kumar | Saish Mendke | Karody Lubna Abdul Rahman | Santosh Kurasa | Parag Agrawal | Sandipan Dandapat
Proceedings of the 31st International Conference on Computational Linguistics

Chain-of-thought (CoT) prompting has significantly enhanced the the capability of large language models (LLMs) by structuring their reasoning processes. However, existing methods face critical limitations: handcrafted demonstrations require extensive human expertise, while trigger phrases are prone to inaccuracies. In this paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method, a novel approach that improves CoT prompting by utilizing uncertainty estimates to select effective demonstrations without needing access to model parameters. Unlike traditional methods, ZEUS offers high sensitivity in distinguishing between helpful and ineffective questions, ensuring more precise and reliable selection. Our extensive evaluation shows that ZEUS consistently outperforms existing CoT strategies across four challenging reasoning benchmarks, demonstrating its robustness and scalability.

PROTECT: Policy-Related Organizational Value Taxonomy for Ethical Compliance and Trust
Avni Mittal | Sree Hari Nagaralu | Sandipan Dandapat
Proceedings of the Third Workshop on Social Influence in Conversations (SICon 2025)

This paper presents PROTECT, a novel policy-driven organizational value taxonomy designed to enhance ethical compliance and trust within organizations. Drawing on established human value systems and leveraging large language models, PROTECT generates values tailored to organizational contexts and clusters them into a refined taxonomy. This taxonomy serves as the basis for creating a comprehensive dataset of compliance scenarios, each linked to specific values and paired with both compliant and non-compliant responses. By systematically varying value emphasis, we illustrate how different LLM personas emerge, reflecting diverse compliance behaviors. The dataset, directly grounded in the taxonomy, enables consistent evaluation and training of LLMs on value-sensitive tasks. While PROTECT offers a robust foundation for aligning AI systems with organizational standards, our experiments also reveal current limitations in model accuracy, highlighting the need for further improvements. Together, the taxonomy and dataset represent complementary, foundational contributions toward value-aligned AI in organizational settings.

Comparing Language Models of Different Scales for Security-Focused Tabular Query Generation and Reasoning
Varivashya Poladi | Sandipan Dandapat
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Security-related data often exists in complex, multi-table formats and is scarce due to privacy and compliance constraints—posing a major challenge for training and evaluating language models (LMs) on security reasoning tasks. In this work, we systematically investigate the performance of large language models (LLMs) across different parameter scales in generating and solving multi-step, semantically rich queries over realistic security scenarios represented through three interlinked tabular datasets. We assess models on three key axes (i) their ability to formulate insightful, high complexity security questions; (ii) the quality and coherence of their reasoning chains; and (iii) their accuracy in deriving actionable answers from the underlying data. To address data scarcity, we propose a diffusion-based synthetic data generation pipeline that amplifies the existing dataset while preserving domain semantics and statistical structure. Our findings reveal that while large models often outperform in reasoning depth and query formulation, smaller models show surprising efficiency and accuracy. The study provides actionable insights for deploying generative models in security analytics and opens avenues for synthetic data-driven evaluation of LLMs in low-resource, high-stakes domains.

LITMUS++ : An Agentic System for Predictive Analysis of Low-Resource Languages Across Tasks and Models
Avni Mittal | Shanu Kumar | Sandipan Dandapat | Monojit Choudhury
Proceedings of The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations

We present LITMUS++, an agentic system for predicting language-model performance for queries of the form “How will a Model perform on a Task in a Language?”, a persistent challenge in multilingual and low-resource settings, settings where benchmarks are incomplete or unavailable. Unlike static evaluation suites or opaque LLM-as-judge pipelines, LITMUS++ implements an agentic, auditable workflow: a Directed Acyclic Graph of specialized Thought Agents that generate hypotheses, retrieve multilingual evidence, select predictive features, and train lightweight regressors with calibrated uncertainty. The system supports interactive querying through a chat-style interface, enabling users to inspect reasoning traces and cited evidence. Experiments across six tasks and five multilingual scenarios show that LITMUS++ delivers accurate and interpretable performance predictions, including in low-resource and unseen conditions. Code is available at https://github.com/AvniMittal13/litmus_plus_plus.

2024

Uncovering Stereotypes in Large Language Models: A Task Complexity-based Approach
Hari Shrawgi | Prasanjit Rath | Tushar Singhal | Sandipan Dandapat
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent Large Language Models (LLMs) have unlocked unprecedented applications of AI. As these models continue to transform human life, there are growing socio-ethical concerns around their inherent stereotypes that can lead to bias in their applications. There is an urgent need for holistic bias evaluation of these LLMs. Few such benchmarks exist today and evaluation techniques that do exist are either non-holistic or may provide a false sense of security as LLMs become better at hiding their biases on simpler tasks. We address these issues with an extensible benchmark - LLM Stereotype Index (LSI). LSI is grounded on Social Progress Index, a holistic social benchmark. We also test the breadth and depth of bias protection provided by LLMs via a variety of tasks with varying complexities. Our findings show that both ChatGPT and GPT-4 have strong inherent prejudice with respect to nationality, gender, race, and religion. The exhibition of such issues becomes increasingly apparent as we increase task complexity. Furthermore, GPT-4 is better at hiding the biases, but when displayed it is more significant. Our findings highlight the harms and divide that these LLMs can bring to society if we do not take very diligent care in their use.

INMT-Lite: Accelerating Low-Resource Language Data Collection via Offline Interactive Neural Machine Translation
Harshita Diddee | Anurag Shukla | Tanuja Ganu | Vivek Seshadri | Sandipan Dandapat | Monojit Choudhury | Kalika Bali
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

A steady increase in the performance of Massively Multilingual Models (MMLMs) has contributed to their rapidly increasing use in data collection pipelines. Interactive Neural Machine Translation (INMT) systems are one class of tools that can utilize MMLMs to promote such data collection in several under-resourced languages. However, these tools are often not adapted to the deployment constraints that native language speakers operate in, as bloated, online inference-oriented MMLMs trained for data-rich languages, drive them. INMT-Lite addresses these challenges through its support of (1) three different modes of Internet-independent deployment and (2) a suite of four assistive interfaces suitable for (3) data-sparse languages. We perform an extensive user study for INMT-Lite with an under-resourced language community, Gondi, to find that INMT-Lite improves the data generation experience of community members along multiple axes, such as cognitive load, task productivity, and interface interaction time and effort, without compromising on the quality of the generated translations.INMT-Lite’s code is open-sourced to further research in this domain.

2023

Performance and Risk Trade-offs for Multi-word Text Prediction at Scale
Aniket Vashishtha | S Sai Prasad | Payal Bajaj | Vishrav Chaudhary | Kate Cook | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Findings of the Association for Computational Linguistics: EACL 2023

Large Language Models such as GPT-3 are well-suited for text prediction tasks, which can help and delight users during text composition. LLMs are known to generate ethically inappropriate predictions even for seemingly innocuous contexts. Toxicity detection followed by filtering is a common strategy for mitigating the harm from such predictions. However, as we shall argue in this paper, in the context of text prediction, it is not sufficient to detect and filter toxic content. One also needs to ensure factual correctness and group-level fairness of the predictions; failing to do so can make the system ineffective and nonsensical at best, and unfair and detrimental to the users at worst. We discuss the gaps and challenges of toxicity detection approaches - from blocklist-based approaches to sophisticated state-of-the-art neural classifiers - by evaluating them on the text prediction task for English against a manually crafted CheckList of harms targeted at different groups and different levels of severity.

Problematic Webpage Identification: A Trilogy of Hatespeech, Search Engines and GPT
Ojasvin Sood | Sandipan Dandapat
The 7th Workshop on Online Abuse and Harms (WOAH)

In this paper, we introduce a fine-tuned transformer-based model focused on problematic webpage classification to identify webpages promoting hate and violence of various forms. Due to the unavailability of labelled problematic webpage data, first we propose a novel webpage data collection strategy which leverages well-studied short-text hate speech datasets. We have introduced a custom GPT-4 few-shot prompt annotation scheme taking various webpage features to label the prohibitively expensive webpage annotation task. The resulting annotated data is used to build our problematic webpage classification model. We report the accuracy (87.6% F1-score) of our webpage classification model and conduct a detailed comparison of it against other state-of-the-art hate speech classification model on problematic webpage identification task. Finally, we have showcased the importance of various webpage features in identifying a problematic webpage.

DiTTO: A Feature Representation Imitation Approach for Improving Cross-Lingual Transfer
Shanu Kumar | Soujanya Abbaraju | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Zero-shot cross-lingual transfer is promising, however has been shown to be sub-optimal, with inferior transfer performance across low-resource languages. In this work, we envision languages as domains for improving zero-shot transfer by jointly reducing the feature incongruity between the source and the target language and increasing the generalization capabilities of pre-trained multilingual transformers. We show that our approach, DiTTO, significantly outperforms the standard zero-shot fine-tuning method on multiple datasets across all languages using solely unlabeled instances in the target language. Empirical results show that jointly reducing feature incongruity for multiple target languages is vital for successful cross-lingual transfer. Moreover, our model enables better cross-lingual transfer than standard fine-tuning methods, even in the few-shot setting.

Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models
Raghav Jain | Daivik Sojitra | Arkadeep Acharya | Sriparna Saha | Adam Jatowt | Sandipan Dandapat
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Temporal reasoning represents a vital component of human communication and understanding, yet remains an underexplored area within the context of Large Language Models (LLMs). Despite LLMs demonstrating significant proficiency in a range of tasks, a comprehensive, large-scale analysis of their temporal reasoning capabilities is missing. Our paper addresses this gap, presenting the first extensive benchmarking of LLMs on temporal reasoning tasks. We critically evaluate 8 different LLMs across 6 datasets using 3 distinct prompting strategies. Additionally, we broaden the scope of our evaluation by including in our analysis 2 Code Generation LMs. Beyond broad benchmarking of models and prompts, we also conduct a fine-grained investigation of performance across different categories of temporal tasks. We further analyze the LLMs on varying temporal aspects, offering insights into their proficiency in understanding and predicting the continuity, sequence, and progression of events over time. Our findings reveal a nuanced depiction of the capabilities and limitations of the models within temporal reasoning, offering a comprehensive reference for future research in this pivotal domain.

2022

”Diversity and Uncertainty in Moderation” are the Key to Data Selection for Multilingual Few-shot Transfer
Shanu Kumar | Sandipan Dandapat | Monojit Choudhury
Findings of the Association for Computational Linguistics: NAACL 2022

Few-shot transfer often shows substantial gain over zero-shot transfer (CITATION), which is a practically useful trade-off between fully supervised and unsupervised learning approaches for multilingual pretained model-based systems. This paper explores various strategies for selecting data for annotation that can result in a better few-shot transfer. The proposed approaches rely on multiple measures such as data entropy using n-gram language model, predictive entropy, and gradient embedding. We propose a loss embedding method for sequence labeling tasks, which induces diversity and uncertainty sampling similar to gradient embedding. The proposed data selection strategies are evaluated and compared for POS tagging, NER, and NLI tasks for up to 20 languages. Our experiments show that the gradient and loss embedding-based strategies consistently outperform random data selection baselines, with gains varying with the initial performance of the zero-shot transfer. Furthermore, the proposed method shows similar trends in improvement even when the model is fine-tuned using a lower proportion of the original task-specific labeled training data for zero-shot transfer.

Beyond Static models and test sets: Benchmarking the potential of pre-trained models across tasks and languages
Kabir Ahuja | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP

Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity. We argue that this makes the existing practices in multilingual evaluation unreliable and does not provide a full picture of the performance of MMLMs across the linguistic landscape. We propose that the recent work done in Performance Prediction for NLP tasks can serve as a potential solution in fixing benchmarking in Multilingual NLP by utilizing features related to data and language typology to estimate the performance of an MMLM on different languages. We compare performance prediction with translating test data with a case study on four different multilingual datasets, and observe that these methods can provide reliable estimates of the performance that are often on-par with the translation based approaches, without the need for any additional translation as well as evaluation costs.

Multilingual CheckList: Generation and Evaluation
Karthikeyan K | Shaily Bhatt | Pankaj Singh | Somak Aditya | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

Multilingual evaluation benchmarks usually contain limited high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generate Multilingual CheckLists. We device an algorithm –Template Extraction Algorithm (TEA) for automatically extracting target language CheckList templates from machine translated instances of a source language templates. We compare the TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze different approaches to creating CheckLists in Hindi. Furthermore, we experiment with 9 more different languages. We find that TEA followed by human verification is ideal for scaling Checklist-based evaluation to multiple languages while TEA gives a good estimates of model performance. We release the code of TEA and the CheckLists created at aka.ms/multilingualchecklist

Too Brittle to Touch: Comparing the Stability of Quantization and Distillation towards Developing Low-Resource MT Models
Harshita Diddee | Sandipan Dandapat | Monojit Choudhury | Tanuja Ganu | Kalika Bali
Proceedings of the Seventh Conference on Machine Translation (WMT)

Leveraging shared learning through Massively Multilingual Models, state-of-the-art Machine translation (MT) models are often able to adapt to the paucity of data for low-resource languages. However, this performance comes at the cost of significantly bloated models which aren’t practically deployable. Knowledge Distillation is one popular technique to develop competitive lightweight models: In this work, we first evaluate its use in compressing MT models, focusing specifically on languages with extremely limited training data. Through our analysis across 8 languages, we find that the variance in the performance of the distilled models due to their dependence on priors including the amount of synthetic data used for distillation, the student architecture, training hyper-parameters and confidence of the teacher models, makes distillation a brittle compression mechanism. To mitigate this, we further explore the use of post-training quantization for the compression of these models. Here, we find that while Distillation provides gains across some low-resource languages, Quantization provides more consistent performance trends for the entire range of languages, especially the lowest-resource languages in our target set.

Proceedings of the First Workshop on Scaling Up Multilingual Evaluation
Kabir Ahuja | Antonios Anastasopoulos | Barun Patra | Graham Neubig | Monojit Choudhury | Sandipan Dandapat | Sunayana Sitaram | Vishrav Chaudhary
Proceedings of the First Workshop on Scaling Up Multilingual Evaluation

The SUMEval 2022 Shared Task on Performance Prediction of Multilingual Pre-trained Language Models
Kabir Ahuja | Antonios Anastasopoulos | Barun Patra | Graham Neubig | Monojit Choudhury | Sandipan Dandapat | Sunayana Sitaram | Vishrav Chaudhary
Proceedings of the First Workshop on Scaling Up Multilingual Evaluation

Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models
Kabir Ahuja | Shanu Kumar | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Massively Multilingual Transformer based Language Models have been observed to be surprisingly effective on zero-shot transfer across languages, though the performance varies from language to language depending on the pivot language(s) used for fine-tuning. In this work, we build upon some of the existing techniques for predicting the zero-shot performance on a task, by modeling it as a multi-task learning problem. We jointly train predictive models for different tasks which helps us build more accurate predictors for tasks where we have test data in very few languages to measure the actual performance of the model. Our approach also lends us the ability to perform a much more robust feature selection, and identify a common set of features that influence zero-shot performance across a variety of tasks.

On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data
Kabir Ahuja | Monojit Choudhury | Sandipan Dandapat
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Borrowing ideas from Production functions in micro-economics, in this paper we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data for task-specific fine-tuning of massively multilingual language models. We illustrate the effectiveness of our framework through a case-study on the TyDIQA-GoldP dataset. One of the interesting conclusion of the study is that if the cost of machine translation is greater than zero, the optimal performance at least cost is always achieved with at least some or only manually-created data. To our knowledge, this is the first attempt towards extending the concept of production functions to study data collection strategies for training multilingual models, and can serve as a valuable tool for other similar cost vs data trade-offs in NLP.

On the Calibration of Massively Multilingual Language Models
Kabir Ahuja | Sunayana Sitaram | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid on how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages or those which are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well in improving calibration in the zero-shot scenario. We also find that few-shot examples in the language can further help reduce calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding what language and model-specific factors influence it, and pointing out the strategies to improve the same.

Vector Space Interpolation for Query Expansion
Deepanway Ghosal | Somak Aditya | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Topic-sensitive query set expansion is an important area of research that aims to improve search results for information retrieval. It is particularly crucial for queries related to sensitive and emerging topics. In this work, we describe a method for query set expansion about emerging topics using vector space interpolation. We use a transformer model called OPTIMUS, which is suitable for vector space manipulation due to its variational autoencoder nature. One of our proposed methods – Dirichlet interpolation shows promising results for query expansion. Our methods effectively generate new queries about the sensitive topic by incorporating set-level diversity, which is not captured by traditional sentence-level augmentation methods such as paraphrasing or back-translation.

2021

A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist
Shaily Bhatt | Rahul Jain | Sandipan Dandapat | Sunayana Sitaram
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Despite state-of-the-art performance, NLP systems can be fragile in real-world situations. This is often due to insufficient understanding of the capabilities and limitations of models and the heavy reliance on standard evaluation benchmarks. Research into non-standard evaluation to mitigate this brittleness is gaining increasing attention. Notably, the behavioral testing principle ‘Checklist’, which decouples testing from implementation revealed significant failures in state-of-the-art models for multiple tasks. In this paper, we present a case study of using Checklist in a practical scenario. We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique for improving the model using insights from Checklist. We lay out the challenges and open questions based on our observations of using Checklist for human-in-loop evaluation and improvement of NLP systems. Disclaimer: The paper contains examples of content with offensive language. The examples do not represent the views of the authors or their employers towards any person(s), group(s), practice(s), or entity/entities.

On the Universality of Deep Contextual Language Models
Shaily Bhatt | Poonam Goyal | Sandipan Dandapat | Monojit Choudhury | Sunayana Sitaram
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Deep Contextual Language Models (LMs) like ELMO, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning. Furthermore, multilingual versions of such models like XLM-R and mBERT have given promising results in zero-shot cross-lingual transfer, potentially enabling NLP applications in many under-served and under-resourced languages. Due to this initial success, pre-trained models are being used as ‘Universal Language Models’ as the starting point across diverse tasks, domains, and languages. This work explores the notion of ‘Universality’ by identifying seven dimensions across which a universal model should be able to scale, that is, perform equally well or reasonably well, to be useful across diverse settings. We outline the current theoretical and empirical results that support model performance across these dimensions, along with extensions that may help address some of their current limitations. Through this survey, we lay the foundation for understanding the capabilities and limitations of massive contextual language models and help discern research gaps and directions for future work to make these LMs inclusive and fair to diverse applications, users, and linguistic phenomena.

2020

A New Dataset for Natural Language Inference from Code-mixed Conversations
Simran Khanuja | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and hypotheses are in code-mixed Hindi-English. We use data from Hindi movies (Bollywood) as premises, and crowd-source hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol based on observations from the pilot. Currently, the data collected consists of 400 premises in the form of code-mixed conversation snippets and 2240 code-mixed hypotheses. We conduct an extensive analysis to infer the linguistic phenomena commonly observed in the dataset obtained. We evaluate the dataset using a standard mBERT-based pipeline for NLI and report results.

Code-mixed parse trees and how to find them
Anirudh Srinivasan | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

In this paper, we explore the methods of obtaining parse trees of code-mixed sentences and analyse the obtained trees. Existing work has shown that linguistic theories can be used to generate code-mixed sentences from a set of parallel sentences. We build upon this work, using one of these theories, the Equivalence-Constraint theory to obtain the parse trees of synthetically generated code-mixed sentences and evaluate them with a neural constituency parser. We highlight the lack of a dataset non-synthetic code-mixed constituency parse trees and how it makes our evaluation difficult. To complete our evaluation, we convert a code-mixed dependency parse tree set into “pseudo constituency trees” and find that a parser trained on synthetically generated trees is able to decently parse these as well.

GLUECoS: An Evaluation Benchmark for Code-Switched NLP
Simran Khanuja | Sandipan Dandapat | Anirudh Srinivasan | Sunayana Sitaram | Monojit Choudhury
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering and a new task for code-switching, Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.

2019

Processing and Understanding Mixed Language Data
Monojit Choudhury | Anirudh Srinivasan | Sandipan Dandapat
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): Tutorial Abstracts

Multilingual communities exhibit code-mixing, that is, mixing of two or more socially stable languages in a single conversation, sometimes even in a single utterance. This phenomenon has been widely studied by linguists and interaction scientists in the spoken language of such communities. However, with the prevalence of social media and other informal interactive platforms, code-switching is now also ubiquitously observed in user-generated text. As multilingual communities are more the norm from a global perspective, it becomes essential that code-switched text and speech are adequately handled by language technologies and NUIs.Code-mixing is extremely prevalent in all multilingual societies. Current studies have shown that as much as 20% of user generated content from some geographies, like South Asia, parts of Europe, and Singapore, are code-mixed. Thus, it is very important to handle code-mixed content as a part of NLP systems and applications for these geographies.In the past 5 years, there has been an active interest in computational models for code-mixing with a substantive research outcome in terms of publications, datasets and systems. However, it is not easy to find a single point of access for a complete and coherent overview of the research. This tutorial is expecting to fill this gap and provide new researchers in the area with a foundation in both linguistic and computational aspects of code-mixing. We hope that this then becomes a starting point for those who wish to pursue research, design, development and deployment of code-mixed systems in multilingual societies.

INMT: Interactive Neural Machine Translation Prediction
Sebastin Santy | Sandipan Dandapat | Monojit Choudhury | Kalika Bali
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

In this paper, we demonstrate an Interactive Machine Translation interface, that assists human translators with on-the-fly hints and suggestions. This makes the end-to-end translation process faster, more efficient and creates high-quality translations. We augment the OpenNMT backend with a mechanism to accept the user input and generate conditioned translations.

2018

Identifying Transferable Information Across Domains for Cross-domain Sentiment Classification
Raksha Sharma | Pushpak Bhattacharyya | Sandipan Dandapat | Himanshu Sharad Bhatt
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Getting manually labeled data in each domain is always an expensive and a time consuming task. Cross-domain sentiment analysis has emerged as a demanding concept where a labeled source domain facilitates a sentiment classifier for an unlabeled target domain. However, polarity orientation (positive or negative) and the significance of a word to express an opinion often differ from one domain to another domain. Owing to these differences, cross-domain sentiment classification is still a challenging task. In this paper, we propose that words that do not change their polarity and significance represent the transferable (usable) information across domains for cross-domain sentiment classification. We present a novel approach based on χ2 test and cosine-similarity between context vector of words to identify polarity preserving significant words across domains. Furthermore, we show that a weighted ensemble of the classifiers enhances the cross-domain classification performance.

Training Deployable General Domain MT for a Low Resource Language Pair: English-Bangla
Sandipan Dandapat | William Lewis
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

A large percentage of the world’s population speaks a language of the Indian subcontinent, what we will call here Indic languages, comprising languages from both Indo-European (e.g., Hindi, Bangla, Gujarati, etc.) and Dravidian (e.g., Tamil, Telugu, Malayalam, etc.) families, upwards of 1.5 Billion people. A universal characteristic of Indic languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general domain English–Bangla MT systems which are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores, however, using NMT systems, we have gained significant improvement over SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (throughput enhanced by an IME (Input Method Editor) designed specifically for QWERTY keyboard entry for Devanagari scripted languages), back-translation, different regularization techniques, dataset augmentation and early stopping.

Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data
Adithya Pratapa | Gayatri Bhat | Monojit Choudhury | Sunayana Sitaram | Sandipan Dandapat | Kalika Bali
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Training language models for Code-mixed (CM) language is known to be a difficult problem because of lack of data compounded by the increased confusability due to the presence of more than one language. We present a computational technique for creation of grammatically valid artificial CM data based on the Equivalence Constraint Theory. We show that when training examples are sampled appropriately from this synthetic data and presented in certain order (aka training curriculum) along with monolingual and real CM data, it can significantly reduce the perplexity of an RNN-based language model. We also show that randomly generated CM data does not help in decreasing the perplexity of the LMs.

Iterative Data Augmentation for Neural Machine Translation: a Low Resource Case Study for English-Telugu
Sandipan Dandapat | Christian Federmann
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

Telugu is the fifteenth most commonly spoken language in the world with an estimated reach of 75 million people in the Indian subcontinent. At the same time, it is a severely low resourced language. In this paper, we present work on English–Telugu general domain machine translation (MT) systems using small amounts of parallel data. The baseline statistical (SMT) and neural MT (NMT) systems do not yield acceptable translation quality, mostly due to limited resources. However, the use of synthetic parallel data (generated using back translation, based on an NMT engine) significantly improves translation quality and allows NMT to outperform SMT. We extend back translation and propose a new, iterative data augmentation (IDA) method. Filtering of synthetic data and IDA both further boost translation quality of our final NMT systems, as measured by BLEU scores on all test sets and based on state-of-the-art human evaluation.

Translating Web Search Queries into Natural Language Questions
Adarsh Kumar | Sandipan Dandapat | Sushil Chordia
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Wisdom of Students: A Consistent Automatic Short Answer Grading Technique
Shourya Roy | Sandipan Dandapat | Ajay Nagesh | Y. Narahari
Proceedings of the 13th International Conference on Natural Language Processing

A Fluctuation Smoothing Approach for Unsupervised Automatic Short Answer Grading
Shourya Roy | Sandipan Dandapat | Y. Narahari
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)

We offer a fluctuation smoothing computational approach for unsupervised automatic short answer grading (ASAG) techniques in the educational ecosystem. A major drawback of the existing techniques is the significant effect that variations in model answers could have on their performances. The proposed fluctuation smoothing approach, based on classical sequential pattern mining, exploits lexical overlap in students’ answers to any typical question. We empirically demonstrate using multiple datasets that the proposed approach improves the overall performance and significantly reduces (up to 63%) variation in performance (standard deviation) of unsupervised ASAG techniques. We bring in additional benchmarks such as (a) paraphrasing of model answers and (b) using answers by k top performing students as model answers, to amplify the benefits of the proposed approach.

SODA:Service Oriented Domain Adaptation Architecture for Microblog Categorization
Himanshu Sharad Bhatt | Sandipan Dandapat | Peddamuthu Balaji | Shourya Roy | Sharmistha Jat | Deepali Semwal
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2014

MTWatch: A Tool for the Analysis of Noisy Parallel Data
Sandipan Dandapat | Declan Groves
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

State-of-the-art statistical machine translation (SMT) technique requires a good quality parallel data to build a translation model. The availability of large parallel corpora has rapidly increased over the past decade. However, often these newly developed parallel data contains contain significant noise. In this paper, we describe our approach for classifying good quality parallel sentence pairs from noisy parallel data. We use 10 different features within a Support Vector Machine (SVM)-based model for our classification task. We report a reasonably good classification accuracy and its positive effect on overall MT accuracy.

Hierarchical Recursive Tagset for Annotating Cooking Recipes
Sharath Reddy Gunamgari | Sandipan Dandapat | Monojit Choudhury
Proceedings of the 11th International Conference on Natural Language Processing

2013

TMTprime: A Recommender System for MT and TM Integration
Aswarth Abhilash Dara | Sandipan Dandapat | Declan Groves | Josef van Genabith
Proceedings of the 2013 NAACL HLT Demonstration Session

2012

Combining EBMT, SMT, TM and IR Technologies for Quality and Scale
Sandipan Dandapat | Sara Morrissey | Andy Way | Josef van Genabith
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

Approximate Sentence Retrieval for Scalable and Efficient Example-Based Machine Translation
Johannes Leveling | Debasis Ganguly | Sandipan Dandapat | Gareth Jones
Proceedings of COLING 2012

2011

Using Example-Based MT to Support Statistical MT when Translating Homogeneous Data in a Resource-Poor Setting
Sandipan Dandapat | Sara Morrissey | Andy Way | Mikel L. Forcada
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

2010

Mitigating Problems in Analogy-based EBMT with SMT and vice versa: A Case Study with Named Entity Transliteration
Sandipan Dandapat | Sara Morrissey | Sudip Kumar Naskar | Harold Somers
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

MATREX: The DCU MT System for WMT 2010
Sergio Penkale | Rejwanul Haque | Sandipan Dandapat | Pratyush Banerjee | Ankit K. Srivastava | Jinhua Du | Pavel Pecina | Sudip Kumar Naskar | Mikel L. Forcada | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

English-Hindi Transliteration Using Context-Informed PB-SMT: the DCU System for NEWS 2009
Rejwanul Haque | Sandipan Dandapat | Ankit Kumar Srivastava | Sudip Kumar Naskar | Andy Way
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

Large-Coverage Root Lexicon Extraction for Hindi
Cohan Sujay Carlos | Monojit Choudhury | Sandipan Dandapat
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Complex Linguistic Annotation – No Easy Way Out! A Case from Bangla and Hindi POS Labeling Tasks
Sandipan Dandapat | Priyanka Biswas | Monojit Choudhury | Kalika Bali
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2008

A Hybrid Named Entity Recognition System for South and South East Asian Languages
Sujan Kumar Saha | Sanjay Chatterji | Sandipan Dandapat | Sudeshna Sarkar | Pabitra Mitra
Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages

Bengali and Hindi to English CLIR Evaluation
Debasis Mandal | Sandipan Dandapat | Mayank Gupta | Pratyush Banerjee | Sudeshna Sarkar
Proceedings of the 2nd workshop on Cross Lingual Information Access (CLIA) Addressing the Information Need of Multilingual Societies

Prototype Machine Translation System From Text-To-Indian Sign Language
Tirthankar Dasgupta | Sandipan Dandpat | Anupam Basu
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

2007

Automatic Part-of-Speech Tagging for Bengali: An Approach for Morphologically Rich Languages in a Poor Resource Scenario
Sandipan Dandapat | Sudeshna Sarkar | Anupam Basu
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

Co-authors

Parag Agrawal 3

Vishrav Chaudhary 3

Sara Morrissey 3

Sudip Kumar Naskar 3

Sudeshna Sarkar 3

Anirudh Srinivasan 3

Antonios Anastasopoulos 2

Pratyush Banerjee 2

Himanshu Sharad Bhatt 2

Harshita Diddee 2

Mikel L. Forcada 2

Declan Groves 2

Rejwanul Haque 2

Simran Khanuja 2

Graham Neubig 2

Prasanjit Rath 2

Ankit Srivastava 2

Josef van Genabith 2

Soujanya Abbaraju 1

Arkadeep Acharya 1

Peddamuthu Balaji 1

Pushpak Bhattacharyya 1

Priyanka Biswas 1

Cohan Sujay Carlos 1

Sanjay Chatterji 1

Sushil Chordia 1

Aswarth Abhilash Dara 1

Tirthankar Dasgupta 1

Christian Federmann 1

Debasis Ganguly 1

Deepanway Ghosal 1

Sharath Reddy Gunamgari 1

Sree Hari Nagaralu 1

Sharmistha Jat 1

Madhur Jindal 1

Karthikeyan K 1

Parameswari Krishnamurthy 1

Santosh Kurasa 1

Johannes Leveling 1

William Lewis 1

Debasis Mandal 1

Pabitra Mitra 1

Sergio Penkale 1

Varivashya Poladi 1

Adithya Pratapa 1

Karody Lubna Abdul Rahman 1

Sujan Kumar Saha 1

Sriparna Saha 1

Sebastin Santy 1

Deepali Semwal 1

Vivek Seshadri 1

Dipti Misra Sharma 1

Raksha Sharma 1

Anurag Shukla 1

Tushar Singhal 1

Daivik Sojitra 1

Harold Somers 1

Aniket Vashishtha 1

Saketh Reddy Vemula 1

Venues