Mayukh Das


2024

Toximatics: Towards Understanding Toxicity in Real-Life Social Situations
Mayukh Das | Wolf-Tilo Balke
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

The proliferation of social media has increased the visibility and effects of hate speech. To address this, NLP solutions have been developed to identify both explicit and implicit forms of hate speech. Typically, these approaches evaluate the toxicity of utterances in isolation, ignoring the context. Drawing on pragmatics, our study examines how contextual factors can influence the perceived toxicity of utterances, thereby anchoring assessments in a more nuanced semantic framework. We present Toximatics, a dataset of context-dependent utterances and their toxicity scores. We also introduce a novel synthetic data generation pipeline designed to create context-utterance pairs at scale with controlled polarity. This pipeline can enhance existing hate speech datasets by adding contextual information to utterances, either preserving or altering their polarity, and can also generate entirely new pairs from seed statements. We used both capabilities to create Toximatics. To address biases in state-of-the-art hate speech datasets, which often skew towards specific sensitive topics such as politics, race, and gender, we propose a method to generate neutral utterances typical of various social settings. These are then contextualized to show how neutrality can shift to toxicity or benignity depending on the surrounding context. The evaluation results clearly indicate that current models underperform on this dataset.
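As a rough illustration of what such context-utterance pairs might look like, here is a minimal Python sketch. The names (ContextUtterancePair, Polarity, contextualize) and the fixed scores are assumptions made for illustration, not the dataset's actual schema or the paper's pipeline, which relies on model-based generation and proper scoring.

from dataclasses import dataclass
from enum import Enum
from typing import List


class Polarity(Enum):
    TOXIC = "toxic"
    NEUTRAL = "neutral"
    BENIGN = "benign"


@dataclass
class ContextUtterancePair:
    context: str           # the social situation surrounding the utterance
    utterance: str         # the statement whose toxicity is being judged
    toxicity_score: float  # context-dependent toxicity in [0, 1]


def contextualize(utterance: str, target: Polarity, templates: List[str]) -> ContextUtterancePair:
    """Toy stand-in for a generation pipeline: wrap a seed utterance in a
    context template chosen to push it towards the target polarity."""
    context = templates[0].format(utterance=utterance)
    # A real pipeline would prompt a generator model and score the pair properly;
    # the fixed values below are placeholders.
    score = {Polarity.TOXIC: 0.9, Polarity.NEUTRAL: 0.5, Polarity.BENIGN: 0.1}[target]
    return ContextUtterancePair(context=context, utterance=utterance, toxicity_score=score)


if __name__ == "__main__":
    pair = contextualize(
        "You really outdid yourself this time.",
        Polarity.TOXIC,
        ["Said to a colleague right after their project failed publicly. Utterance: {utterance}"],
    )
    print(pair)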

2022

Quantifying Bias from Decoding Techniques in Natural Language Generation
Mayukh Das | Wolf-Tilo Balke
Proceedings of the 29th International Conference on Computational Linguistics

Natural language generation (NLG) models can propagate social bias towards particular demographics. Though several studies have investigated bias arising from the data and the model, the NLG task distinctively uses a stochastic decoder that can positively or negatively impact the bias-sensitive tokens initially predicted by the model. To address this gap in research, we present an extensive analysis of bias from decoding techniques for open-domain language generation, considering the entire decoding space. We analyze to what extent bias metrics such as toxicity and sentiment are impacted by the individual components of decoder algorithms. To this end, we also analyze the trade-off between bias scores and human-annotated generation quality throughout the decoding space. Together, these methods reveal the imperative of testing inference-time bias and provide evidence of the usefulness of inspecting the entire decoding spectrum.
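A minimal sketch of the kind of decoder sweep described above, assuming a Hugging Face causal LM (gpt2 used here as a stand-in) and a hypothetical toxicity_score function in place of a real classifier such as Perspective API or Detoxify. The grid values and prompt are illustrative, not the paper's setup.

from itertools import product

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


def toxicity_score(text: str) -> float:
    """Hypothetical placeholder: plug in a real toxicity classifier here."""
    return 0.0


prompt = "The new neighbours moved in and"
inputs = tokenizer(prompt, return_tensors="pt")

# Sweep a grid over the decoding space instead of a single default setting.
for temperature, top_p in product([0.7, 1.0, 1.3], [0.5, 0.9, 1.0]):
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"temperature={temperature}, top_p={top_p}, toxicity={toxicity_score(text):.2f}")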

2017

Towards Problem Solving Agents that Communicate and Learn
Anjali Narayan-Chen | Colin Graber | Mayukh Das | Md Rakibul Islam | Soham Dan | Sriraam Natarajan | Janardhan Rao Doppa | Julia Hockenmaier | Martha Palmer | Dan Roth
Proceedings of the First Workshop on Language Grounding for Robotics

Agents that communicate back and forth with humans to help them execute non-linguistic tasks are a long-sought goal of AI. These agents need to translate between utterances and actionable meaning representations that can be interpreted by task-specific problem solvers in a context-dependent manner. They should also be able to learn such actionable interpretations for new predicates on the fly. We define an agent architecture for this scenario and present a series of experiments in the Blocks World domain that illustrate how our architecture supports language learning and problem solving in this domain.
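A toy sketch of the translate-then-solve loop the abstract describes, written from scratch rather than taken from the paper: utterances are mapped to simple actionable meaning representations for a Blocks World solver, and new predicates can be added to the lexicon on the fly. All names (BlocksWorldAgent, interpret, learn_predicate) are illustrative assumptions.

from typing import Callable, Dict

Action = Dict[str, str]  # a toy actionable meaning representation


class BlocksWorldAgent:
    def __init__(self) -> None:
        # Learned lexicon: verb -> function building an actionable interpretation.
        self.lexicon: Dict[str, Callable[[str, str], Action]] = {
            "stack": lambda x, y: {"op": "stack", "block": x, "target": y},
            "unstack": lambda x, y: {"op": "unstack", "block": x, "target": y},
        }

    def interpret(self, utterance: str) -> Action:
        """Translate commands of the form '<verb> <block> on <block>'."""
        tokens = utterance.split()
        verb = tokens[0].lower()
        if verb not in self.lexicon:
            # In the full architecture the agent would ask the human to define it.
            raise ValueError(f"Unknown predicate {verb!r}; please teach me its meaning.")
        block, target = tokens[1], tokens[3]
        return self.lexicon[verb](block, target)

    def learn_predicate(self, verb: str, op: str) -> None:
        """Add an actionable interpretation for a new predicate on the fly."""
        self.lexicon[verb] = lambda x, y: {"op": op, "block": x, "target": y}


agent = BlocksWorldAgent()
print(agent.interpret("stack A on B"))       # {'op': 'stack', 'block': 'A', 'target': 'B'}
agent.learn_predicate("balance", "balance")  # teach a new predicate at runtime
print(agent.interpret("balance C on D"))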