Sachin Kumar


Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
Sachin Kumar | Vidhisha Balachandran | Lucille Njoo | Antonios Anastasopoulos | Yulia Tsvetkov
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Recent advances in the capacity of large language models to generate human-like text have resulted in their increased adoption in user-facing settings. In parallel, these improvements have prompted a heated discourse around the risks of societal harms they introduce, whether inadvertent or malicious. Several studies have explored these harms and called for their mitigation via development of safer, fairer models. Going beyond enumerating the risks of harms, this work provides a survey of practical methods for addressing potential threats and societal harms from language generation models. We draw on several prior works’ taxonomies of language model risks to present a structured overview of strategies for detecting and ameliorating different kinds of risks/harms of language generators. Bridging diverse strands of research, this survey aims to serve as a practical guide for both LM researchers and practitioners, with explanations of different strategies’ motivations, their limitations, and open problems for future research.

Enhancing Telugu News Understanding: Comparative Study of ML Algorithms for Category Prediction
Manish Rama Gopal Nadella | Venkata Krishna Rayalu Garapati | Eswar Sudhan S.k. | Gouthami Jangala | Soman K.p. | Sachin Kumar
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages

As one of the most widely used languages in India, Telugu has a sizable audience and a large library of news articles. Predicting the categories of Telugu news items not only supports efficient organization but also enables trend analysis, targeted advertising, and personalized recommendations. To identify the most effective method for accurate Telugu news category prediction, this study compares various machine learning (ML) techniques, including support vector machines (SVM), random forests, and naive Bayes. Accuracy, precision, recall, and F1-score are used as performance indicators to gauge how well these algorithms perform. The comparative analysis addresses the particular difficulties and complexities of the Telugu language and adds to the body of knowledge on news category prediction. For Telugu-speaking consumers, the study aims to improve news organization and recommendation systems, providing more relevant and customized news consumption experiences. Our results show that, although other models can be considered for further research and comparison, Word2Vec skip-gram features with a polynomial-kernel SVM form the best-performing combination.
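As a rough illustration of the best-performing configuration above, the sketch below pairs averaged Word2Vec skip-gram document vectors with a polynomial-kernel SVM using gensim and scikit-learn; the toy corpus, labels, and preprocessing are hypothetical stand-ins, not the paper's actual pipeline.

    # Illustrative sketch: Word2Vec skip-gram features + polynomial-kernel SVM.
    # The toy corpus and labels are hypothetical, not the paper's data.
    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report

    docs = [["telugu", "cricket", "match", "score"], ["assembly", "election", "results"]]
    labels = ["sports", "politics"]

    # sg=1 selects the skip-gram objective.
    w2v = Word2Vec(sentences=docs, vector_size=100, window=5, sg=1, min_count=1)

    def doc_vector(tokens):
        # Average the word vectors of in-vocabulary tokens to get a document vector.
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    X = np.stack([doc_vector(d) for d in docs])
    clf = SVC(kernel="poly", degree=3).fit(X, labels)
    print(classification_report(labels, clf.predict(X)))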

Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia | Sachin Kumar | Hila Gonen | Jungo Kasai | David Mortensen | Noah Smith | Yulia Tsvetkov
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of “tokens” processed or generated by the underlying language models. What constitutes a token, however, is training data and model dependent with a large variance in the number of tokens required to convey the same information in different languages. In this work, we analyze the effect of this non-uniformity on the fairness of an API’s pricing policy across languages. We conduct a systematic analysis of the cost and utility of OpenAI’s language model API on multilingual benchmarks in 22 typologically diverse languages. We show evidence that speakers of a large number of the supported languages are overcharged while obtaining poorer results. These speakers also tend to come from regions where the APIs are less affordable to begin with. Through these analyses, we aim to increase transparency around language model APIs’ pricing policies and encourage the vendors to make them more equitable.
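To make the token-count disparity concrete, one can tokenize parallel sentences and compare their lengths; the sketch below uses the open tiktoken tokenizer and a made-up flat per-token price purely for illustration, not OpenAI's actual tokenizers or billing.

    # Illustrative sketch: compare token counts (and hence estimated API cost)
    # for the same content in different languages. Prices and sentences are hypothetical.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    price_per_1k_tokens = 0.002  # hypothetical flat rate in USD

    parallel = {
        "English": "The weather is nice today.",
        "Telugu": "ఈ రోజు వాతావరణం బాగుంది.",
    }
    for lang, sent in parallel.items():
        n = len(enc.encode(sent))
        print(f"{lang}: {n} tokens, est. cost ${n / 1000 * price_per_1k_tokens:.6f}")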

Mitigating Societal Harms in Large Language Models
Sachin Kumar | Vidhisha Balachandran | Lucille Njoo | Antonios Anastasopoulos | Yulia Tsvetkov
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Numerous recent studies have highlighted societal harms that can be caused by language technologies deployed in the wild. While several surveys, tutorials, and workshops have discussed the risks of harms in specific contexts – e.g., detecting and mitigating gender bias in NLP models – no prior work has developed a unified typology of technical approaches for mitigating harms of language generation models. Our tutorial is based on a survey we recently wrote that proposes such a typology. We will provide an overview of potential social issues in language generation, including toxicity, social biases, misinformation, factual inconsistency, and privacy violations. Our primary focus will be on how to systematically identify risks and how to eliminate them at various stages of the development pipeline, from data collection, to model training, to inference/language generation. Through this tutorial, we aim to equip NLP researchers and engineers with a suite of practical tools for mitigating safety risks from pretrained language generation models.

SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control
Xiaochuang Han | Sachin Kumar | Yulia Tsvetkov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite the growing success of diffusion models in continuous-valued domains (e.g., images), similar efforts for discrete domains such as text have yet to match the performance of autoregressive language models. In this work, we present SSD-LM—a diffusion-based language model with two key design choices. First, SSD-LM is semi-autoregressive, iteratively generating blocks of text, allowing for flexible output length at decoding time while enabling local bidirectional context updates. Second, it is simplex-based, performing diffusion on the natural vocabulary space rather than a learned latent space, allowing us to incorporate classifier guidance and modular control using off-the-shelf classifiers without any adaptation. We evaluate SSD-LM on unconstrained text generation benchmarks, and show that it matches or outperforms strong autoregressive GPT-2 models across standard quality and diversity metrics, while vastly outperforming diffusion-based baselines. On controlled text generation, SSD-LM also outperforms competitive baselines, with an extra advantage in modularity.
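A greatly simplified, hypothetical sketch of the decoding loop described above: text is produced block by block, each block's vocabulary logits are iteratively denoised, and an off-the-shelf classifier can steer the logits through its gradient. The denoiser and classifier callables are assumed interfaces, not the released SSD-LM code.

    # Greatly simplified sketch of semi-autoregressive simplex diffusion decoding.
    # `denoiser` and `classifier` are hypothetical callables; not the SSD-LM release.
    import torch

    def generate(denoiser, vocab_size, num_blocks, block_len, steps, classifier=None, guidance=0.0):
        context = []                                              # token ids decoded so far
        for _ in range(num_blocks):
            logits = torch.randn(block_len, vocab_size)           # start the block from noise
            for t in reversed(range(1, steps + 1)):
                noisy = logits + torch.randn_like(logits) * (t / steps)
                logits = denoiser(context, noisy, t)              # predict clean vocabulary logits
                if classifier is not None and guidance > 0:       # modular, off-the-shelf guidance
                    probs = torch.softmax(logits, dim=-1).detach().requires_grad_(True)
                    score = classifier(context, probs)            # scalar, e.g. log p(attribute | text)
                    logits = logits + guidance * torch.autograd.grad(score, probs)[0]
            context.extend(logits.argmax(dim=-1).tolist())        # commit the block and continue
        return context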

On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
Tianxing He | Jingyu Zhang | Tianle Wang | Sachin Kumar | Kyunghyun Cho | James Glass | Yulia Tsvetkov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics: stress tests with synthetic data. Basically, we design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. We examine a range of recently proposed evaluation metrics based on pretrained language models, for the tasks of open-ended generation, translation, and summarization. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics. For example, we find that BERTScore is confused by truncation errors in summarization, and MAUVE (built on top of GPT-2) is insensitive to errors at the beginning or middle of generations. Further, we investigate the reasons behind these blind spots and suggest practical workarounds for a more reliable evaluation of text generation. We have released our code and data publicly.
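The stress-test recipe can be reproduced in miniature: apply a synthetic corruption (here, truncation) to otherwise good outputs and check whether the metric's score drops accordingly. The metric argument below is a placeholder for any model-based metric; nothing here reproduces the paper's exact setup.

    # Illustrative stress test: a reliable metric should score truncated outputs
    # noticeably lower than intact ones. `metric` is a hypothetical placeholder
    # for any corpus-level metric returning a single float.
    def truncate(text, keep_ratio=0.5):
        words = text.split()
        return " ".join(words[: max(1, int(len(words) * keep_ratio))])

    def stress_test(metric, candidates, references):
        clean = metric(candidates, references)
        corrupted = metric([truncate(c) for c in candidates], references)
        drop = clean - corrupted
        print(f"clean={clean:.3f} corrupted={corrupted:.3f} drop={drop:.3f}")
        return drop  # a near-zero or negative drop signals a blind spot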

Minding Language Models’ (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker
Melanie Sclar | Sachin Kumar | Peter West | Alane Suhr | Yejin Choi | Yulia Tsvetkov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Theory of Mind (ToM)—the ability to reason about the mental states of other people—is a key element of our social intelligence. Yet, despite their ever more impressive performance, large-scale neural language models still lack basic theory of mind capabilities out-of-the-box. We posit that simply scaling up models will not imbue them with theory of mind due to the inherently symbolic and implicit nature of the phenomenon, and instead investigate an alternative: can we design a decoding-time algorithm that enhances theory of mind of off-the-shelf neural language models without explicit supervision? We present SymbolicToM, a plug-and-play approach to reason about the belief states of multiple characters in reading comprehension tasks via explicit symbolic representation. More concretely, our approach tracks each entity’s beliefs, their estimation of other entities’ beliefs, and higher-order levels of reasoning, all through graphical representations, allowing for more precise and interpretable reasoning than previous approaches. Empirical results on the well-known ToMi benchmark (Le et al., 2019) demonstrate that SymbolicToM dramatically enhances off-the-shelf neural networks’ theory of mind in a zero-shot setting while showing robust out-of-distribution performance compared to supervised baselines. Our work also reveals spurious patterns in existing theory of mind benchmarks, emphasizing the importance of out-of-distribution evaluation and methods that do not overfit a particular dataset.
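A bare-bones version of the belief-tracking idea can be written as nested per-character states: what each character believes, and what each character thinks the others believe, updated only when a character witnesses an event. The toy below is a hypothetical simplification, far coarser than SymbolicToM's graphical representations.

    # Toy sketch of multi-character belief tracking (hypothetical, greatly simplified).
    # beliefs[a][a] is what `a` believes; beliefs[a][b] is what `a` thinks `b` believes.
    def track_beliefs(events, characters):
        beliefs = {a: {b: {} for b in characters} for a in characters}
        for obj, location, witnesses in events:   # each event: an object moves, seen by `witnesses`
            for a in witnesses:
                for b in witnesses:               # observers also update beliefs about other observers
                    beliefs[a][b][obj] = location
        return beliefs

    # Sally puts the ball in the basket (both watch); Anne moves it to the box while Sally is away.
    events = [("ball", "basket", {"Sally", "Anne"}), ("ball", "box", {"Anne"})]
    b = track_beliefs(events, {"Sally", "Anne"})
    print(b["Sally"]["Sally"]["ball"])   # basket -> Sally holds a false belief
    print(b["Anne"]["Sally"]["ball"])    # basket -> Anne knows Sally did not see the move
    print(b["Anne"]["Anne"]["ball"])     # box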


Gradient-based Constrained Sampling from Language Models
Sachin Kumar | Biswajit Paria | Yulia Tsvetkov
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Large pretrained language models are successful at generating fluent text but are notoriously hard to controllably sample from. In this work, we study constrained sampling from such language models, i.e., generating text that satisfies user-defined constraints, while maintaining fluency and the model’s performance in a downstream task. We propose MuCoLa—a sampling procedure that combines the log-likelihood of the language model with arbitrary (differentiable) constraints in a single energy function, and then generates samples in a non-autoregressive manner. Specifically, it initializes the entire output sequence with noise and follows a Markov chain defined by Langevin Dynamics using the gradients of this energy. We evaluate MuCoLa on text generation with soft and hard constraints as well as their combinations, obtaining significant improvements over competitive baselines for toxicity avoidance, sentiment control, and keyword-guided generation.
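Conceptually, the sampler performs Langevin dynamics on the output embeddings: repeatedly take a gradient step on an energy that sums the LM negative log-likelihood and weighted constraint penalties, then add Gaussian noise. The lm_energy and constraint callables below are assumed interfaces; this is a sketch, not the released MuCoLa implementation.

    # Simplified sketch of gradient-based constrained sampling via Langevin dynamics.
    # `lm_energy` and `constraint` are hypothetical differentiable callables returning scalars.
    import torch

    def langevin_sample(lm_energy, constraint, seq_len, emb_dim, steps=500, step_size=0.1, lam=1.0):
        e = torch.randn(seq_len, emb_dim, requires_grad=True)    # initialize output embeddings with noise
        for _ in range(steps):
            energy = lm_energy(e) + lam * constraint(e)          # E(e) = -log p_LM + weighted constraints
            grad, = torch.autograd.grad(energy, e)
            with torch.no_grad():
                e -= step_size * grad                            # gradient step on the energy
                e += (2 * step_size) ** 0.5 * torch.randn_like(e)  # Langevin noise
        return e  # map each row to its nearest vocabulary embedding to read out tokens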

Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation
Melanie Sclar | Peter West | Sachin Kumar | Yulia Tsvetkov | Yejin Choi
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We present Referee, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision), while allowing direct control for compression ratio. Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation (West et al., 2022), where latent knowledge in pre-trained language models is distilled via explicit examples sampled from the teacher models, further purified with three types of filters: length, fidelity, and Information Bottleneck. Moreover, we uniquely propose iterative distillation of knowledge, where student models from the previous iteration of distillation serve as teacher models in the next iteration. Starting off from a relatively modest set of GPT3-generated summaries, we demonstrate how iterative knowledge distillation can lead to considerably smaller, but better summarizers with sharper controllability. A useful by-product of this iterative distillation process is a high-quality dataset of sentence-summary pairs with varying degrees of compression ratios. Empirical results demonstrate that the final student models vastly outperform the much larger GPT3-Instruct model in terms of the controllability of compression ratios, without compromising the quality of resulting summarization.
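At a high level, each distillation round samples candidate summaries from the current teacher, keeps only those passing the filters, trains a smaller student on the survivors, and promotes the student to teacher. The sketch below is schematic, with all helpers passed in as hypothetical callables rather than taken from the paper's code.

    # Schematic sketch of iterative symbolic knowledge distillation for summarization.
    # `sample_fn`, `filters`, and `train_fn` are hypothetical callables.
    def iterative_distillation(teacher, sentences, sample_fn, filters, train_fn, iterations=3):
        for _ in range(iterations):
            pairs = [(s, y)
                     for s in sentences
                     for y in sample_fn(teacher, s)       # sample candidate summaries from the teacher
                     if all(f(s, y) for f in filters)]    # length / fidelity / Information Bottleneck
            teacher = train_fn(pairs)                     # the trained student becomes the next teacher
        return teacher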


Improving the Diversity of Unsupervised Paraphrasing with Embedding Outputs
Monisha Jegadeesan | Sachin Kumar | John Wieting | Yulia Tsvetkov
Proceedings of the 1st Workshop on Multilingual Representation Learning

We present a novel technique for zero-shot paraphrase generation. The key contribution is an end-to-end multilingual paraphrasing model that is trained using translated parallel corpora to generate paraphrases into “meaning spaces” – replacing the final softmax layer with word embeddings. This architectural modification, plus a training procedure that incorporates an autoencoding objective, enables effective parameter sharing across languages for more fluent monolingual rewriting, and facilitates fluency and diversity in the generated outputs. Our continuous-output paraphrase generation models outperform zero-shot paraphrasing baselines when evaluated on two languages using a battery of computational metrics as well as in human assessment.
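The core architectural change is to predict a word embedding at each step instead of a softmax distribution, train toward the target word's pretrained embedding, and decode by nearest-neighbor lookup. The sketch below shows only that output layer, with illustrative shapes and a simple cosine loss standing in for the paper's actual training objective; it is not the full multilingual model.

    # Sketch of a continuous (embedding) output layer: the model emits a vector per step,
    # the loss pulls it toward the target word's pretrained embedding, and decoding is a
    # nearest-neighbor lookup. Shapes and the loss are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    class EmbeddingOutputHead(torch.nn.Module):
        def __init__(self, hidden_dim, pretrained_embeddings):   # (vocab_size, emb_dim), kept fixed
            super().__init__()
            self.proj = torch.nn.Linear(hidden_dim, pretrained_embeddings.size(1))
            self.register_buffer("emb", F.normalize(pretrained_embeddings, dim=-1))

        def loss(self, hidden, target_ids):
            pred = F.normalize(self.proj(hidden), dim=-1)
            return (1 - (pred * self.emb[target_ids]).sum(-1)).mean()  # cosine distance to target

        def decode(self, hidden):
            pred = F.normalize(self.proj(hidden), dim=-1)
            return (pred @ self.emb.t()).argmax(-1)                    # nearest neighbor in the table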

Machine Translation into Low-resource Language Varieties
Sachin Kumar | Antonios Anastasopoulos | Shuly Wintner | Yulia Tsvetkov
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

State-of-the-art machine translation (MT) systems are typically trained to generate “standard” target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from the standard language. Such varieties are often low-resource, and hence do not benefit from contemporary NLP solutions, MT included. We propose a general framework to rapidly adapt MT systems to generate language varieties that are close to, but different from, the standard target language, using no parallel (source–variety) data. This also includes adaptation of MT systems to low-resource typologically-related target languages. We experiment with adapting an English–Russian MT system to generate Ukrainian and Belarusian, an English–Norwegian Bokmål system to generate Nynorsk, and an English–Arabic system to generate four Arabic dialects, obtaining significant improvements over competitive baselines.


A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with Bilingual Semantic Similarity Rewards
Zi-Yi Dou | Sachin Kumar | Yulia Tsvetkov
Proceedings of the Fourth Workshop on Neural Generation and Translation

Cross-lingual text summarization aims at generating a document summary in one language given input in another language. It is a practically important but under-explored task, primarily due to the dearth of available data. Existing methods resort to machine translation to synthesize training data, but such pipeline approaches suffer from error propagation. In this work, we propose an end-to-end cross-lingual text summarization model. The model uses reinforcement learning to directly optimize a bilingual semantic similarity metric between the summaries generated in a target language and gold summaries in a source language. We also introduce techniques to pre-train the model leveraging monolingual summarization and machine translation objectives. Experimental results in both English–Chinese and English–German cross-lingual summarization settings demonstrate the effectiveness of our methods. In addition, we find that reinforcement learning models with bilingual semantic similarity as rewards generate more fluent sentences than strong baselines.
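The reinforcement-learning objective can be sketched as a REINFORCE-style update whose reward is the bilingual semantic similarity between the sampled target-language summary and the gold source-language summary. The model.sample and similarity interfaces below are hypothetical placeholders, not the paper's implementation.

    # Sketch of a REINFORCE-style update with a bilingual similarity reward.
    # `model.sample` and `similarity` are hypothetical interfaces.
    import torch

    def rl_step(model, similarity, source_doc, gold_source_summary, optimizer, baseline=0.0):
        summary_ids, log_probs = model.sample(source_doc)        # sampled target-language summary
        reward = similarity(summary_ids, gold_source_summary)    # bilingual semantic similarity score
        loss = -(reward - baseline) * log_probs.sum()            # policy-gradient surrogate loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return reward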


Topics to Avoid: Demoting Latent Confounds in Text Classification
Sachin Kumar | Shuly Wintner | Noah A. Smith | Yulia Tsvetkov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author’s native language is Swedish). We propose a method that represents the latent topical confounds and a model which “unlearns” confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.
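One common way to realize this alternating adversarial setup is to (a) update the confound predictor on frozen text representations, then (b) update the encoder and label predictor to classify the label well while pushing the confound predictor toward uninformative (high-entropy) outputs. The sketch below is a generic version of that recipe with hypothetical modules, not the paper's exact model.

    # Generic sketch of alternating adversarial training to demote a latent confound.
    # `encoder`, `label_head`, and `confound_head` are hypothetical torch modules;
    # `opt_main` is assumed to cover encoder + label_head parameters only.
    import torch
    import torch.nn.functional as F

    def alternating_step(encoder, label_head, confound_head, opt_main, opt_conf,
                         x, y_label, y_confound, lam=1.0):
        # (a) train the confound predictor on a frozen representation
        with torch.no_grad():
            h = encoder(x)
        conf_loss = F.cross_entropy(confound_head(h), y_confound)
        opt_conf.zero_grad()
        conf_loss.backward()
        opt_conf.step()

        # (b) train encoder + label head: predict the label, but push the
        #     confound predictor toward a uniform (high-entropy) output
        h = encoder(x)
        label_loss = F.cross_entropy(label_head(h), y_label)
        conf_logits = confound_head(h)
        entropy = -(F.log_softmax(conf_logits, -1) * F.softmax(conf_logits, -1)).sum(-1).mean()
        main_loss = label_loss - lam * entropy
        opt_main.zero_grad()
        main_loss.backward()
        opt_main.step()
        return label_loss.item(), conf_loss.item()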

A Margin-based Loss with Synthetic Negative Samples for Continuous-output Machine Translation
Gayatri Bhat | Sachin Kumar | Yulia Tsvetkov
Proceedings of the 3rd Workshop on Neural Generation and Translation

Neural models that eliminate the softmax bottleneck by generating word embeddings (rather than multinomial distributions over a vocabulary) attain faster training with fewer learnable parameters. These models are currently trained by maximizing densities of pretrained target embeddings under von Mises-Fisher distributions parameterized by corresponding model-predicted embeddings. This work explores the utility of margin-based loss functions in optimizing such models. We present syn-margin loss, a novel margin-based loss that uses a synthetic negative sample constructed from only the predicted and target embeddings at every step. The loss is efficient to compute, and we use a geometric analysis to argue that it is more consistent and interpretable than other margin-based losses. Empirically, we find that syn-margin provides small but significant improvements over both vMF and standard margin-based losses in continuous-output neural machine translation.
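To make the idea concrete, the sketch below implements a margin loss whose negative sample is synthesized from only the predicted and target embeddings; taking the negative to be the normalized component of the prediction orthogonal to the target is one plausible construction for illustration, not necessarily the paper's exact definition.

    # Hedged sketch of a syn-margin-style loss: a margin loss whose negative sample is
    # synthesized from the predicted and target embeddings alone. The specific negative
    # used here (prediction component orthogonal to the target) is an illustrative choice.
    import torch
    import torch.nn.functional as F

    def syn_margin_loss(pred, target, margin=0.5, eps=1e-8):
        pred = F.normalize(pred, dim=-1)
        target = F.normalize(target, dim=-1)
        # synthetic negative: remove the target direction from the prediction, renormalize
        orth = pred - (pred * target).sum(-1, keepdim=True) * target
        negative = orth / (orth.norm(dim=-1, keepdim=True) + eps)
        pos_sim = (pred * target).sum(-1)
        neg_sim = (pred * negative).sum(-1)
        return F.relu(margin - pos_sim + neg_sim).mean()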