Mitigating Societal Harms in Large Language Models



Motivation
With the widespread success and increasing adoption of natural language processing (NLP) technologies in user-facing products, including machine translation (Vaswani et al., 2017; Lewis et al., 2020), dialogue systems (Andreas et al., 2020; Gangadharaiah and Narayanaswamy, 2020), and recommendation systems (Jannach et al., 2020), the NLP community is becoming increasingly aware that we have a responsibility to evaluate the effects of our research and mitigate harmful outcomes (Bender et al., 2021). Indeed, models have been shown to introduce vulnerabilities and threats, both inadvertent and malicious, to individual users, social groups, and content integrity. Without social context and content control, deployed language generators have quickly devolved into racist, homophobic, and hateful comments (Hunt, 2016; Jang, 2021; Wolf et al., 2017; Vincent, 2022), compromised user privacy (Carlini et al., 2021), spread disinformation (Shao et al., 2018), and even encouraged suicide (Daws, 2020). Prior works have outlined these risks (Maynez et al., 2020; Sheng et al., 2021; Weidinger et al., 2021), proposed taxonomies (Weidinger et al., 2022), discussed their points of origin, and advocated for research on the ethical development of LMs (Bender et al., 2021; Solaiman et al., 2019).
However, there is little work that summarizes actionable approaches and technical solutions for preventing or mitigating these harms. This is the purpose of our tutorial, which is based on a survey we recently conducted (Kumar et al., 2022). In this tutorial, we aim to provide a comprehensive, unified taxonomy of relevant mitigation strategies proposed in prior literature, focusing specifically on language generation models.

Tutorial Content and Relevance
What are language models? A brief background: To build common ground for discussing the risk mitigation strategies, this tutorial will begin with a brief overview of recent trends in language modeling and pretraining. We will cover both causal (Radford et al., 2019; Brown et al., 2020) and non-causal language models (Devlin et al., 2019), highlighting their differences and their impact on NLP research. We will briefly discuss how pretrained models can be adapted to different tasks, covering model finetuning (both complete and adapter-based) as well as prompt-based formulations for solving NLP tasks. We will also discuss their scale, in terms of both model parameters and training data size.
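To make the prompt-based formulation concrete, the following is a minimal sketch (our own illustrative example, not tutorial material) that casts sentiment classification as text completion with an off-the-shelf causal LM; the model choice and prompt wording are assumptions chosen purely for illustration.

```python
# Minimal sketch: prompt-based task formulation with a causal LM.
# The task is posed as text completion instead of training a classifier.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ("Review: The movie was a waste of time.\n"
          "Sentiment (positive or negative):")
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=3, do_sample=False,
                     pad_token_id=tok.eos_token_id)
# Print only the newly generated continuation, i.e., the model's "answer".
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```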
How can language models cause societal harm? After presenting the background on language models, we will give a formal definition of harms based on the taxonomy defined in prior work (Barocas et al., 2017) and focus on representational harms in this tutorial. Highlighting the impact of heedlessly using web data, which is usually population-imbalanced (Bender et al., 2021) and contains language biased against specific populations, we will discuss how language models tend to reinforce and amplify bias against sub-populations based on personal and social attributes such as gender (Stanovsky et al., 2019; de Vassimon Manela et al., 2021), race (Liang et al., 2021; Field et al., 2021), region (Huang et al., 2020), demographics (Huang et al., 2020), and age (Nangia et al., 2020), among others. We will also discuss that, by not being grounded in real-world knowledge, these models pick up on spurious statistical correlations in the data and generate (in other words, hallucinate) factually incorrect content, which can potentially be used to spread misinformation (Zellers et al., 2020; Kryscinski et al., 2020). Much of the content of this section is borrowed from the course on Ethics in NLP developed at Carnegie Mellon University and the University of Washington by organizer Yulia Tsvetkov.
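As a small illustration of the kind of probing such studies perform, the toy example below (our own, and far simpler than the cited benchmarks) inspects a masked LM's top pronoun completions for two occupation words; the sentence template and occupations are assumptions chosen only for illustration.

```python
# Toy probe of gendered associations in a masked LM. Rigorous measurement
# requires dedicated benchmarks such as those in the works cited above
# (e.g., Nangia et al., 2020); this only shows the basic mechanism.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for occupation in ["nurse", "engineer"]:
    preds = fill(f"The {occupation} said that [MASK] would be late.", top_k=3)
    print(occupation, "->", [p["token_str"] for p in preds])
```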
Can we reduce or mitigate such harms? Finally, in this part, we will focus on work on mitigating the harmful effects of language generation systems. While this is still a nascent field of research, several solutions have been proposed, which we group into four categories, visualized in Fig. 1. We organize and discuss in detail intervention strategies based on where they fit in the different stages of LM development: data collection, modeling, post-factum decoding, and application. Within each of these categories, our taxonomy brings together prior works that have been treated as disjoint areas targeting different types of harms (toxic/biased language and misinformation).
Since LMs learn and amplify biases present in the training data, we will first discuss data-level interventions, which focus on either (1) filtering the pretraining corpora to create more balanced datasets (Jia et al., 2020) or (2) finetuning trained LMs on sanitized data (Gehman et al., 2020a). Second, we will review model-level interventions, where we consider approaches that modify either the architecture or the training objectives to induce or remove desired biases (Nan et al., 2021; Cao and Wang, 2021). Third, we will present methods that modify model outputs post generation, using decoding and editing methods to demote or remove harmful content (Yang and Klein, 2021; Kumar et al., 2021; Cao et al., 2020). These techniques are especially useful when it is impossible to modify the data, the model, or even the decoding strategy, as with GPT-3 (Brown et al., 2020), which is available only through an API. Finally, we will end with application-level interventions, where we show how methods to flag and redact harmful content allow applications to shield users from such content (Vaidya et al., 2020; Sun et al., 2019).
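As a simple illustration of where output-level interventions plug in, the sketch below (our own example, not a method from the cited works) hard-blocks a small, hypothetical word blocklist at decoding time; the cited approaches instead steer generation with learned attribute models or edit the generated text post hoc.

```python
# Minimal sketch of a decoding-time intervention: forbidding a tiny,
# purely illustrative blocklist of words during generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical blocklist; real systems use curated lexicons or classifiers.
blocklist = ["stupid", "idiot"]
bad_words_ids = [tok(" " + w, add_special_tokens=False).input_ids for w in blocklist]

prompt = "My new neighbors are"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    top_p=0.9,
    bad_words_ids=bad_words_ids,   # these token sequences can never be generated
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0], skip_special_tokens=True))
```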
Throughout the tutorial, we will highlight both detection and mitigation approaches, as well as their specific limitations and shortcomings. By the end of the tutorial, participants will be better informed about where to focus future research efforts.
Due to the vast range of societal harms and their mitigation strategies, we do not plan an exhaustive treatment of this material. One central goal is to raise participants' awareness of the relevant issues, so that when they return to their research they will be better able to notice ways in which their work with large language models might impact different varieties of users. To achieve this goal, we will aim for a "T-shape" in terms of breadth and depth: we will briefly mention a number of core questions and then drill down into a few particular case studies to see how these issues play out in real research settings.

Tutorial Structure
We propose a cutting-edge tutorial on an emerging area that has not been previously covered in ACL/EMNLP/NAACL/COLING tutorials. This will be a discussion-style tutorial where the organizers present material, with structured time throughout for questions and discussion amongst attendees. The duration of the tutorial will be 3 hours, with 5-minute breaks at the end of each hour. The outline of the talk is as follows:
1. Brief Introduction to Language Models (10 mins) - We will provide a quick background on the current state of NLP research, with an introduction to language models and their capabilities.
2. Possible Harms of Language Technologies (15 mins) - We will briefly cover examples of ethical concerns, societal harms, and biases present in current NLP tools.
• Decoding Techniques - Research on search and sampling algorithms for controllable generation by promoting or demoting specific properties in output text (Zhang et al., 2022; Krishna et al., 2022; King et al., 2022).
• Post-Factum Editing - Research on editing or revising generated text to remove harmful content (Pryzant et al., 2020; He et al., 2021; Balachandran et al., 2022).
5. Model Level Interventions (30 mins) - Techniques to modify or optimize model parameters to prevent risky generations.
• Architecture and Training - Research on objectives and model architectures to enforce safe and reliable text generation (Yu et al., 2022; Nan et al., 2021; Falke et al., 2019).
• Finetuning and Model Editing - Research on editing or finetuning model parameters to incorporate safety constraints through new objectives (Gururangan et al., 2020; Chan et al., 2021; Gehman et al., 2020b; Chronopoulou et al., 2020).
6. Data Level Interventions (30 mins) - Techniques to curate clean training data to prevent models from learning from harmful text.
• Data Filtration - Research on filtering or removing training data instances containing toxic or harmful content (Ngo et al., 2021; Brown et al., 2020); a minimal filtering sketch follows this outline.

• Data Augmentation - Research on adding safer examples to datasets to offset the effect of problematic data (Mathew et al., 2018; Dinan et al., 2020; Stafanovičs et al., 2020).

Open Problems and Future Research (20 mins)
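As referenced under Data Filtration above, the following is a minimal sketch of corpus filtering with a learned toxicity classifier; the checkpoint name, its label set, and the threshold are hypothetical placeholders, and production pipelines for large pretraining corpora combine such classifiers with heuristics, blocklists, and deduplication at much larger scale.

```python
# Minimal data-filtration sketch. The model name "my-org/toxicity-classifier",
# its "toxic" label, and the 0.5 threshold are hypothetical placeholders.
from transformers import pipeline

toxicity = pipeline("text-classification", model="my-org/toxicity-classifier")

corpus = [
    "A recipe for vegetable soup with seasonal produce.",
    "An insulting rant targeting a minority group.",   # should be filtered out
]

def keep(document: str, threshold: float = 0.5) -> bool:
    """Keep a document only if its estimated toxicity is below the threshold."""
    pred = toxicity(document[:1000])[0]   # crude truncation for long documents
    toxic_score = pred["score"] if pred["label"] == "toxic" else 1.0 - pred["score"]
    return toxic_score < threshold

clean_corpus = [doc for doc in corpus if keep(doc)]
print(f"kept {len(clean_corpus)} of {len(corpus)} documents")
```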
The tutorial will be a series of presentations with a set of references to related research papers and external demos. The presentation will cover a wide array of research on these topics from across the field. We will share the slides with the participants in advance. We will additionally share an online repository of relevant research material, along with links to available code and demos, to help participants navigate and use relevant research in their own work. No copyright issues are expected, as we will use open-source material.

Organizers
Sachin Kumar is a sixth-year Ph.D. candidate at the Language Technologies Institute, School of Computer Science at CMU. Sachin's research tackles critical technical problems in core language generation with deep learning, such as open-vocabulary generation, detection and demotion of spurious confounders, and controllable generation.
Vidhisha Balachandran (she/her) is a fourth-year Ph.D. student at the Language Technologies Institute, School of Computer Science at CMU. Her current research focuses on building interpretable and reliable NLP models, with a focus on summarization, factuality, and KB-based reasoning.
Lucille Njoo (she/her) is a second-year Ph.D. student at the Paul G. Allen School of Computer Science and Engineering at the University of Washington. She works at the intersection of NLP, ethics, and computational social science, focusing on identifying societal harms in NLP models.
Antonios Anastasopoulos (he/him) is an Assistant Professor in the Department of Computer Science at George Mason University, USA. His research focuses on NLP for local and low-resource languages and varieties, cross-lingual learning and multilinguality, and cross-lingual fairness.
Yulia Tsvetkov (she/her) is an Assistant Professor at the Paul G. Allen School of Computer Science and Engineering at the University of Washington, USA. Her research focuses on computational ethics, multilingual NLP, and machine learning for NLP. She developed a course on Computational Ethics in NLP and has been teaching it at both the undergraduate and graduate levels since 2017, and she is a co-chair of the ACL Ethics Committee.

Audience and Pre-Requisites
We expect participants from a wide array of backgrounds, including researchers, engineers, and end users of NLP technologies. Based on prior iterations of the tutorial, we expect an audience size of 50-100. No prior experience with NLP/ML is required, but we believe that our tutorial will most benefit those who are currently using NLP, or intend to use NLP tools in the near future, in their research or products. An optional list of papers is presented in our survey paper (Kumar et al., 2022).

Diversity
The content of this tutorial highlights the impact of LMs on diverse users, and we therefore aim to reach a wide and diverse audience. We will advertise this tutorial to diverse groups of researchers (e.g., Masakhane, LatinX, North Africans, Disabled in AI, Indigenous in AI, Khipu) to bring in participants from various backgrounds. A previous version of this tutorial attracted an audience diverse in gender and race, as well as in professional background, including researchers, beginners, and industry practitioners. Accordingly, our content will be made accessible to such audiences. Our own team is also diverse across multiple demographic attributes as well as professional expertise.

Logistics
Previous Editions This is the second iteration of the tutorial. The first edition was presented at The Web Conference 2022. While the previous iteration was aimed at a general CS audience with less NLP background, this iteration will be adapted for an NLP-focused audience. This entails a deeper technical treatment of the interventions, including data, models, and objectives.
Our tutorial is related and complementary to prior ACL tutorials on bias and fairness in NLP (Socially Responsible NLP at NAACL 2018, Bias and Fairness in NLP at EMNLP 2019, Integrating Ethics into the NLP Curriculum at ACL 2020). While those tutorials highlight social harms in NLP and discuss their detection, primarily focusing on representation learning and text classification, our tutorial will focus on practical methods to identify and mitigate harms in large language models and language generation.
Venue We prefer EMNLP or ACL, but any venue would work for us.
Technical Requirements We will not require equipment beyond standard presentation facilities: an LCD projector, a computer with PowerPoint and Acrobat Reader, and an internet connection.
Public Release We will publicly release all tutorial materials, including prerecorded lectures, uploaded prior to the tutorial as a backup. These will be hosted on an open-access platform and linked from our university websites.

Ethics Statement
Although the aim of this tutorial is to improve the safety and inclusivity of NLP technologies and to equip practitioners with tools to do so, we are well aware that, as a not perfectly diverse group of researchers, we might incorporate our own biases into the tutorial structure and its technical focus. We will acknowledge this limitation in our tutorial, as well as the fact that the field of computational ethics is developing rapidly, and thus the content of our tutorial is inherently incomplete.

Figure 1: Overview of intervention strategies. Our survey presents a taxonomy of intervention strategies organized around the different phases where they can be applied.