Sourabrata Mukherjee


2024

Multilingual Text Style Transfer: Datasets & Models for Indian Languages
Sourabrata Mukherjee | Atul Kr. Ojha | Akanksha Bansal | Deepak Alok | John P. McCrae | Ondřej Dušek
Proceedings of the 17th International Natural Language Generation Conference

Text style transfer (TST) involves altering the linguistic style of a text while preserving its style-independent content. This paper focuses on sentiment transfer, a popular TST subtask, across a spectrum of Indian languages: Hindi, Magahi, Malayalam, Marathi, Punjabi, Odia, Telugu, and Urdu, expanding upon previous work on English-Bangla sentiment transfer. We introduce dedicated datasets of 1,000 positive and 1,000 negative style-parallel sentences for each of these eight languages. We then evaluate the performance of various benchmark models categorized into parallel, non-parallel, cross-lingual, and shared learning approaches, including the Llama2 and GPT-3.5 large language models (LLMs). Our experiments highlight the significance of parallel data in TST and demonstrate the effectiveness of the Masked Style Filling (MSF) approach in non-parallel techniques. Moreover, cross-lingual and joint multilingual learning methods show promise, offering insights into selecting optimal models tailored to the specific language and task requirements. To the best of our knowledge, this work represents the first comprehensive exploration of the TST task as sentiment transfer across a diverse set of languages.
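
The Masked Style Filling (MSF) result highlighted above lends itself to a small illustration. The sketch below shows the general MSF idea only, not the paper's implementation: style-bearing words are masked and a masked language model proposes target-style replacements. The toy sentiment lexicons and the choice of xlm-roberta-base are assumptions for illustration.

```python
# Minimal sketch of the Masked Style Filling idea (negative -> positive).
# The lexicons and model are illustrative assumptions, not the paper's setup.
from transformers import pipeline

NEGATIVE = {"bad", "terrible", "awful", "boring"}       # toy source-style lexicon
POSITIVE = {"good", "great", "wonderful", "excellent"}  # toy target-style lexicon

fill = pipeline("fill-mask", model="xlm-roberta-base")  # <mask> is XLM-R's mask token

def to_positive(sentence: str) -> str:
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        core = tok.strip(".,!?")
        if core.lower() in NEGATIVE:
            # Mask only the first style-bearing word, keeping its punctuation.
            tokens[i] = tok.replace(core, "<mask>")
            break
    else:
        return sentence  # nothing style-bearing found; leave the text as-is
    predictions = fill(" ".join(tokens))
    for cand in predictions:
        if cand["token_str"].strip().lower() in POSITIVE:
            return cand["sequence"]      # best fill with the target polarity
    return predictions[0]["sequence"]    # fall back to the top prediction

print(to_positive("The movie was terrible."))
```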

Are Large Language Models Actually Good at Text Style Transfer?
Sourabrata Mukherjee | Atul Kr. Ojha | Ondřej Dušek
Proceedings of the 17th International Natural Language Generation Conference

We analyze the performance of large language models (LLMs) on Text Style Transfer (TST), specifically focusing on sentiment transfer and text detoxification across three languages: English, Hindi, and Bengali. Text Style Transfer involves modifying the linguistic style of a text while preserving its core content. We evaluate the capabilities of pre-trained LLMs using zero-shot and few-shot prompting as well as parameter-efficient finetuning on publicly available datasets. Our evaluation, using automatic metrics, GPT-4, and human judgments, reveals that while some prompted LLMs perform well in English, their performance on other languages (Hindi, Bengali) remains average. However, finetuning significantly improves results over zero-shot and few-shot prompting, making the models comparable to the previous state of the art. This underscores the necessity of dedicated datasets and specialized models for effective TST.
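
As an illustration of the prompting setups evaluated here, the following is a hedged sketch of zero- and few-shot sentiment-transfer prompts; the prompt wording, the example pair, and the model choice are assumptions, not the paper's exact configuration.

```python
# Sketch of zero-/few-shot prompting for sentiment transfer.
# Assumes the openai>=1.0 SDK and an API key in the environment.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = [  # illustrative example pair, not from the paper's data
    ("The service was slow and rude.", "The service was quick and friendly."),
]

def sentiment_transfer(text: str, target: str = "positive",
                       few_shot: bool = False) -> str:
    prompt = (f"Rewrite the following sentence so its sentiment is {target}, "
              "changing as little of the other content as possible.\n")
    if few_shot:
        for src, tgt in FEW_SHOT:
            prompt += f"Input: {src}\nOutput: {tgt}\n"
    prompt += f"Input: {text}\nOutput:"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output for evaluation
    )
    return resp.choices[0].message.content.strip()

print(sentiment_transfer("The food was bland and cold.", few_shot=True))
```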

2023

Low-Resource Text Style Transfer for Bangla: Data & Models
Sourabrata Mukherjee | Akanksha Bansal | Pritha Majumdar | Atul Kr. Ojha | Ondřej Dušek
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

Text style transfer (TST) involves modifying the linguistic style of a given text while retaining its core content. This paper addresses the challenging task of text style transfer in the Bangla language, which is low-resourced in this area. We present a novel Bangla dataset that facilitates text sentiment transfer, a subtask of TST, enabling the transformation of positive-sentiment sentences into negative ones and vice versa. To establish a high-quality base for further research, we refined and corrected an existing English dataset of 1,000 sentences for sentiment transfer based on Yelp reviews, and we introduce a new human-translated Bangla dataset that parallels its English counterpart. Furthermore, we offer multiple benchmark models that serve as validation of the dataset and as baselines for further research.

UFAL-ULD at BLP-2023 Task 1: Violence Detection in Bangla Text
Sourabrata Mukherjee | Atul Kr. Ojha | Ondřej Dušek
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In this paper, we present the UFAL-ULD team’s system, designed as part of BLP Shared Task 1: Violence Inciting Text Detection (VITD). The task aims to classify text into Direct, Indirect, or Non-violence categories, with the particular challenge of identifying incitement to violence. We experimented with several pre-trained sequence classification models, including XLM-RoBERTa, BanglaBERT, Bangla BERT Base, and Multilingual BERT. Our best-performing model was based on the XLM-RoBERTa-base architecture, which outperformed the baseline models. Our system was ranked 20th among the 27 teams that participated in the task.
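
For readers who want a concrete starting point, the following is a minimal fine-tuning sketch for a three-way classifier on the XLM-RoBERTa-base backbone named above; the file names, label mapping, and hyperparameters are illustrative assumptions, not the team's exact configuration.

```python
# Minimal fine-tuning sketch for 3-way violence classification with XLM-R.
# Assumes hypothetical CSV files with "text" and "label" (0/1/2) columns.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)  # 0=Non-violence, 1=Indirect, 2=Direct

ds = load_dataset("csv", data_files={"train": "vitd_train.csv",
                                     "validation": "vitd_dev.csv"})
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=256),
            batched=True)  # padding is handled dynamically by the Trainer's collator

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-vitd",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```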

UFAL-ULD at BLP-2023 Task 2: Sentiment Classification in Bangla Text
Sourabrata Mukherjee | Atul Kr. Ojha | Ondřej Dušek
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In this paper, we present the UFAL-ULD team’s system for BLP Shared Task 2: Sentiment Analysis of Bangla Social Media Posts. The task involves classifying text into Positive, Negative, or Neutral sentiment. As part of this task, we conducted a series of experiments with several pre-trained sequence classification models – XLM-RoBERTa, BanglaBERT, Bangla BERT Base, and Multilingual BERT. Among these, our best-performing model was based on the XLM-RoBERTa-base architecture, which outperformed the baseline models. Our system was ranked 19th among the 30 teams that participated in the task.
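
Since the fine-tuning recipe mirrors the Task 1 sketch above, here is only the inference side: loading a fine-tuned checkpoint and mapping its outputs to the three sentiment classes. The checkpoint path and label mapping are placeholders, not the team's released artifacts.

```python
# Inference sketch for the three-class sentiment setup described above.
from transformers import pipeline

classify = pipeline("text-classification",
                    model="xlmr-bangla-sentiment")  # hypothetical local checkpoint
LABELS = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

for post in ["খুব ভালো লাগলো!"]:  # "Really liked it!" (example Bangla post)
    pred = classify(post)[0]
    print(post, "->", LABELS.get(pred["label"], pred["label"]),
          f"({pred['score']:.2f})")
```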

Polite Chatbot: A Text Style Transfer Application
Sourabrata Mukherjee | Vojtěch Hudeček | Ondřej Dušek
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

Generating polite responses is essential to building intelligent and engaging dialogue systems. However, this task is far from well explored, due to the difficulty of rendering a particular style in coherent responses, especially when parallel datasets of regular-to-polite pairs are usually unavailable. This paper proposes a polite chatbot that can produce responses that are both polite and coherent with the given context. In this study, a politeness transfer model is first used to generate synthetic dialogue pairs of contexts and polite utterances. Then, these synthetic pairs are employed to train a dialogue model. Automatic and human evaluations demonstrate that our method outperforms baselines in producing polite dialogue responses while staying competitive in terms of coherence with the given context.
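
The two-stage pipeline described above can be sketched as follows; the politeness-transfer checkpoint name is a placeholder, and the second stage (dialogue model training) is only indicated in a comment.

```python
# Structural sketch of the two-stage pipeline: (1) a politeness transfer model
# rewrites responses in existing dialogue data; (2) the resulting synthetic
# (context, polite response) pairs train a dialogue model.
from transformers import pipeline

politeness = pipeline("text2text-generation",
                      model="politeness-transfer-model")  # hypothetical checkpoint

def make_synthetic_pairs(dialogues):
    """Rewrite each response politely, keeping its context unchanged."""
    pairs = []
    for context, response in dialogues:
        polite = politeness(response, max_length=64)[0]["generated_text"]
        pairs.append((context, polite))
    return pairs

raw = [("I lost my booking.", "Check the website yourself.")]
print(make_synthetic_pairs(raw))  # -> [("I lost my booking.", <polite rewrite>)]
# Stage 2 (not shown): fine-tune a dialogue model on these pairs with a
# standard seq2seq objective over (context -> polite response).
```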

Text Detoxification as Style Transfer in English and Hindi
Sourabrata Mukherjee | Akanksha Bansal | Atul Kr. Ojha | John P. McCrae | Ondřej Dušek
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

This paper focuses on text detoxification, i.e., automatically converting toxic text into non-toxic text. This task contributes to safer and more respectful online communication and can be considered a Text Style Transfer (TST) task, where the text’s style changes while its content is preserved. We present three approaches: (i) knowledge transfer from a similar task, (ii) a multi-task learning approach combining sequence-to-sequence modeling with various toxicity classification tasks, and (iii) a delete-and-reconstruct approach. To support our research, we utilize a dataset provided by Dementieva et al. (2021), which contains multiple versions of detoxified texts corresponding to toxic texts. In our experiments, we selected the best variants with the help of expert human annotators, creating a dataset where each toxic sentence is paired with a single, appropriate detoxified version. Additionally, we introduce a small Hindi parallel dataset, aligned with a part of the English dataset and suitable for evaluation purposes. Our results demonstrate that our approaches effectively balance text detoxification with preserving the content and maintaining fluency.
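
As a rough illustration of approach (ii), the sketch below combines a BART sequence-to-sequence detoxification loss with an auxiliary toxicity classification loss on pooled encoder states; the loss weighting, pooling, and head design are assumptions, not the paper's exact architecture.

```python
# Multi-task sketch: seq2seq detoxification loss + auxiliary toxicity loss.
import torch
from torch import nn
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-base")
seq2seq = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tox_head = nn.Linear(seq2seq.config.d_model, 2)  # auxiliary head: toxic vs. non-toxic

def multitask_loss(src_texts, tgt_texts, tox_labels, alpha=0.5):
    batch = tok(src_texts, return_tensors="pt", padding=True)
    labels = tok(tgt_texts, return_tensors="pt", padding=True).input_ids
    labels[labels == tok.pad_token_id] = -100            # ignore padding in the LM loss
    out = seq2seq(**batch, labels=labels)
    pooled = out.encoder_last_hidden_state.mean(dim=1)   # pooled source representation
    tox_loss = nn.functional.cross_entropy(tox_head(pooled),
                                           torch.tensor(tox_labels))
    return out.loss + alpha * tox_loss                   # joint detox + toxicity objective

loss = multitask_loss(["you are an idiot"], ["that was not a smart move"], [1])
loss.backward()
```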

Leveraging Low-resource Parallel Data for Text Style Transfer
Sourabrata Mukherjee | Ondřej Dušek
Proceedings of the 16th International Natural Language Generation Conference

Text style transfer (TST) involves transforming a text into a desired style while approximately preserving its content. The biggest challenge in TST is the general lack of parallel data. Many existing approaches rely on complex models using substantial non-parallel data, with mixed results. In this paper, we leverage a pretrained BART language model with minimal parallel data and incorporate low-resource methods such as hyperparameter tuning, data augmentation, and self-training, which have not previously been explored in TST. We further include novel style-based rewards in the training loss. Through extensive experiments on sentiment transfer, a sub-task of TST, we demonstrate that our simple yet effective approaches achieve well-balanced results, surpassing non-parallel approaches and highlighting the usefulness of parallel data even in small amounts.
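
The self-training component mentioned here can be illustrated with a short loop: the current model rewrites unlabeled sentences, a style classifier filters the outputs, and confident pairs become pseudo-parallel training data. The rewriter checkpoint name and confidence threshold below are assumptions; only the general pattern is taken from the abstract.

```python
# Self-training sketch: generate, filter by style classifier, keep confident pairs.
from transformers import pipeline

rewrite = pipeline("text2text-generation",
                   model="bart-sentiment-tst")  # hypothetical fine-tuned BART
style_clf = pipeline("text-classification",
                     model="distilbert-base-uncased-finetuned-sst-2-english")

def self_training_round(unlabeled, target="POSITIVE", threshold=0.9):
    pseudo_pairs = []
    for src in unlabeled:
        hyp = rewrite(src, max_length=64)[0]["generated_text"]
        pred = style_clf(hyp)[0]
        # Keep only outputs the classifier confidently assigns the target style.
        if pred["label"] == target and pred["score"] >= threshold:
            pseudo_pairs.append((src, hyp))
    return pseudo_pairs  # appended to the parallel data for the next round

print(self_training_round(["The room was dirty and noisy."]))
```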

Stylized Dialog Response Generation
Sourabrata Mukherjee
Proceedings of the 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems

My primary research focus lies in the domain of Text Style Transfer (TST), a fascinating area within Natural Language Processing (NLP). TST involves the transformation of text into a desired style while approximately preserving its underlying content. In my research, I am also driven by the goal of incorporating TST techniques into NLP systems, particularly within the realm of dialogue systems. I am intrigued by the concept of Stylized Dialog Response Generation, which aims to enhance the versatility and adaptability of dialog systems in generating text responses with specific style attributes. By advancing our understanding of TST and its integration into dialogue systems, my research seeks to contribute to the broader field of human-computer interaction. Through the development of robust and versatile dialogue systems with enhanced style transfer capabilities, we can facilitate more engaging and personalized conversational experiences.