Arindam Chatterjee


2023

Lost in Translation No More: Fine-tuned transformer-based models for CodeMix to English Machine Translation
Arindam Chatterjee | Chhavi Sharma | Yashwanth V.p. | Niraj Kumar | Ayush Raj | Asif Ekbal
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

Codemixing, the linguistic phenomenon where a speaker alternates between two or more languages within a conversation or even a single utterance, presents a significant challenge for machine translation systems due to its syntactic complexity and contextual nuances. This paper introduces a set of advanced transformer-based models fine-tuned specifically for translating codemixed text to English, more specifically, Hindi-English (colloquially referred to as Hinglish) codemixed text into English. Unlike standard bilingual corpora, codemixed data requires an understanding of the intricacies of grammatical structures and cultural contexts embedded within the language blend. Existing machine translation efforts in codemixed languages have largely been constrained by the paucity of robust datasets and models that can capture the nuanced semantic and syntactic interplay characteristic of such languages. We present a novel dataset, PACMAN trans, for Hinglish to English machine translation, based on the PACMAN strategy and meticulously curated to represent natural codemixing patterns. Our generic fine-tuned translation models trained on the novel data outperform current state-of-the-art Large Language Models (LLMs) by 38% in terms of BLEU score. Further, when fine-tuned on custom benchmark datasets, our focused dual fine-tuned models surpass the PHINC dataset BLEU score benchmark by 22%. Our comparative analysis illustrates significant improvements in translation quality, showcasing the potential of fine-tuning transformer models in bridging the linguistic divide in codemixed language translation. The success of our models reflects a promising step forward in the quest to provide seamless translation services for the ever-growing multilingual population and the complex linguistic phenomena they generate.

2022

PACMAN: PArallel CodeMixed dAta generatioN for POS tagging
Arindam Chatterjee | Chhavi Sharma | Ayush Raj | Asif Ekbal
Proceedings of the 19th International Conference on Natural Language Processing (ICON)

Code-mixing or code-switching is the mixing of languages in the same context, predominantly observed in multilingual societies. The existing code-mixed datasets are small and primarily contain social media text that does not adhere to standard spelling and grammar. Computational models built on such data fail to generalise on unseen code-mixed data. To address the unavailability of quality code-mixed annotated datasets, we explore the combined task of generating annotated code-mixed data, and building computational models from this generated data, specifically for code-mixed Part-Of-Speech (POS) tagging. We introduce PACMAN (PArallel CodeMixed dAta generatioN), a synthetically generated code-mixed POS tagged dataset with more than 50K samples, which is the largest annotated code-mixed dataset. We build POS taggers using classical machine learning and deep learning based techniques on the generated data to report an F1-score of 98% (8% above current State-of-the-art (SOTA)). To determine the efficacy of our data, we compare it against the existing benchmark in code-mixed POS tagging. PACMAN outperforms the benchmark, ratifying that our dataset and, subsequently, our POS tagging models are generalised and capable of handling even natural code-mixed and monolingual data.

2021

Towards Explainable Dialogue System: Explaining Intent Classification using Saliency Techniques
Ratnesh Joshi | Arindam Chatterjee | Asif Ekbal
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Deep learning based methods have shown tremendous success in several Natural Language Processing (NLP) tasks. The recent trends in the usage of deep learning based models for natural language tasks have produced incredible performance in several application areas. However, one major problem that most of these models face is the lack of transparency, i.e. the actual decision process of the underlying model is not explainable. In this paper, we first address a very fundamental problem of Natural Language Understanding (NLU), i.e. intent detection, using a Bi-directional Long Short Term Memory (BiLSTM) network. To determine the defining feature(s) that lead to a specific intent class, we use the Layerwise Relevance Propagation (LRP) algorithm. In the process, we conclude that the eLRP (epsilon Layerwise Relevance Propagation) saliency method is a prominent method for highlighting the important features of the input responsible for the current classification, which yields significant insights into the inner workings of the black-box model, such as the reasons for misclassification.

2012

Eating Your Own Cooking: Automatically Linking Wordnet Synsets of Two Languages
Salil Joshi | Arindam Chatterjee | Arun Karthikeyan Karra | Pushpak Bhattacharyya
Proceedings of COLING 2012: Demonstration Papers

Discrimination-Net for Hindi
Diptesh Kanojia | Arindam Chatterjee | Salil Joshi | Pushpak Bhattacharyya
Proceedings of COLING 2012: Demonstration Papers

2011

Together We Can: Bilingual Bootstrapping for WSD
Mitesh M. Khapra | Salil Joshi | Arindam Chatterjee | Pushpak Bhattacharyya
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies