Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models

Considerable effort has been dedicated to mitigating toxicity, but existing methods often require drastic modifications to model parameters or the use of computationally intensive auxiliary models. Furthermore, previous approaches have often neglected the crucial factor of language's evolving nature over time. In this work, we present a comprehensive perspective on toxicity mitigation that takes into account its changing nature. We introduce Goodtriever, a flexible methodology that matches the current state-of-the-art toxicity mitigation while achieving 43% relative latency reduction during inference and being more computationally efficient. By incorporating a retrieval-based approach at decoding time, Goodtriever enables toxicity-controlled text generation. Our research advocates for an increased focus on adaptable mitigation techniques, which better reflect the data drift models face when deployed in the wild. Code and data are available at https://github.com/for-ai/goodtriever.


Introduction
Large-scale pretrained language models (LMs) have demonstrated remarkable progress in capabilities (Radford et al., 2019;Brown et al., 2020).However, an unintended consequence of this progress is the generation of toxic and harmful language, including hate speech, insults, profanities, and threats (Gehman et al., 2020;Bender et al., 2021).With the widespread adoption of large language model systems such as ChatGPT (OpenAI, 2022;Liu et al., 2023) and OpenAssistant (Köpf et al., 2023), there is a need for techniques that can effectively mitigate the generation of toxic and harmful text (Rae et al., 2021;Deshpande et al., 2023).To address this challenge, it is essential not only to measure and understand the origins of toxic text generation but also to take effective steps towards its mitigation in LMs.
Prior research on detoxification has primarily focused on two computationally expensive approaches: finetuning or constrained decoding (Zhang et al., 2022a).Finetuning requires modifications to a pretrained LM parameters through additional training on carefully curated data.On the other hand, constrained decoding relies on an auxiliary model or processing module that modifies the next-token probabilities at inference time.Both of these approaches are known to be highly compute-intensive (Zhang et al., 2022a).
In addition to the drawbacks of the aforementioned techniques, the academic treatment of toxic language mitigation has predominantly assumed that toxicity remains static over time.Most of the existing research has focused on building specialized models for specific domains or locales, which lack flexibility once trained and may have limited applicability across different tasks and domains (Wang et al., 2022;Gururangan et al., 2020).However, human language is shaped by a cumulative culture, constantly building upon itself and evolving over time (Silvey, 2016).Similarly, the ways in which language can cause harm, such as offensive and harassing text (Gehman et al., 2020), also evolve (Lopez-Zafra and Garcia-Retamero, 2021;Charlesworth and Banaji, 2022).
In this work, we propose a flexible technique called GOODTRIEVER (Figure 1) that effectively tackles both static and lifelong toxicity mitigation.Our approach is designed to handle domain shifts and builds upon recent advancements in language modeling, which have successfully incorporated external memory to enhance performance (Khandelwal et al., 2019;Lewis et al., 2020;Guu et al., 2020;Borgeaud et al., 2022;Izacard et al., 2022).More specifically, GOODTRIEVER combines a large LM with two external datastores.These datastores control text generation based on desirable (non-toxic) and undesirable (toxic) attributes.This property allows for convenient and immediate incorporation of new knowledge, as well as the ability to edit, correct and remove existing information without requiring any retraining of the LM.
We conduct extensive experiments for static and continual toxicity mitigation, showing that GOODTRIEVER achieves comparable performance to state-of-the-art methods while being far less compute-intensive on the static tasks.We also show that GOODTRIEVER achieves competitive results to the method of multitask finetuning on continual toxicity mitigation tasks.Our key contributions are: • We introduce a flexible method that enables the integration of multiple retrieval mechanisms into LMs.This approach matches stateof-the-art toxicity mitigation scores while reducing inference time by 43% and minimizing computational requirements, particularly in terms of parameters.
• We evaluate the efficacy of GOODTRIEVER across different model sizes and families, namely GPT2 (Radford et al., 2019), Pythia (Biderman et al., 2023), andOPT (Zhang et al., 2022b).By varying the base model sizes from 124M to 6.9B parameters, we show that GOODTRIEVER remains efficient in mitigating toxicity even as the model size increases.
• We explore the task of continual toxicity mitigation.Through our experiments, GOODTRIEVER achieves competitive performance compared to the expected baseline upper bound, which is a model finetuned on all available data.Our method demonstrates its flexible controllability capabilities by promptly mitigating toxicity for each newly added domain.

Controlled Text Generation with Retrieval-Augmented Models
Language models (LMs) define probability distributions over sequence of tokens.Given a context sequence of tokens c t = (w 1 , . . ., w t−1 ), the probability distribution p(w t |c t ) over the target token w t is estimated using next-token prediction where the probabilities are modeled as the product of conditional probabilities for each token in the sequence, given the tokens that came before it from the autoregressive LMs.These models are typically implemented using a transformer network, parameterized by a set of parameters θ: where c t is the context sequence of tokens preceding w t , also referred to as its prefix.
Retrieval-augmented LMs compute next token distributions based not only on the immediately preceding context c t and the model parameters θ, but also on an external datastore C, from which examples are retrieved and incorporated into the base LM's prediction.Specifically, for predicting w t , the retrieval operation from C depends on its prefix: 2.1 GOODTRIEVER Formalization GOODTRIEVER, illustrated in Figure 1, is an inference-time method for controlled text generation.In addition to the standard, parametric, nextword prediction, GOODTRIEVER accesses information retrieved from a pair of datastores that contains toxic and non-toxic samples to model text with undesirable and desirable attributes respectively.In the following, we will detail the components of our method.
We define the function f (•), which takes a context c as input and produces a fixed-length vector representation.As an example, in a Transformer model, f (c) can be defined to map the context c to an intermediate representation obtained from a self-attention layer within the model.For the i th example (c i , w i ) ∈ D, the key-value pair (k i , v i ) is formed, where k i denotes the vector representation of the context f (c i ) and v i denotes the value associated with the target word w i .GOODTRIEVER creates two datastores: (K − , V − ) from toxic examples and (K + , V + ) from non-toxic examples.
Inference.During inference, the parameteric component of the LM generates the output distribution p LM (w t |c t ; θ) over the next tokens, produces the corresponding context representation f (c t ), given the text input context c t and the logits z t ∈ R |V| , where V is the model's vocabulary.Then the nonparametric component of the LM queries each datastore (K, V) with the f (c t ) representation to retrieve N , the k-nearest neighbors (k-NN) according to Euclidean distance function d(•, •).Next, the token probabilities p kN N are computed over these neighbors by applying a softmax with temperature T to the neighbors' negative distances and aggregating over each token of the vocabulary, as in the following: A temperature higher than 1 tends to flatten the distribution and prevents overfitting (Khandelwal et al., 2020).More details about how the temperature parameter impacts GOODTRIEVER performance are in Appendix C.3.For each context c t , we obtain three sets of probability distributions: the next token distributions i) from the base language model p LM , ii) from the toxic datastore p − kN N and iii) from the non-toxic datastore p + kN N respectively and their corresponding logits z t , z − t , z + t .Ensembling.kNN-LM interpolates the nearest neighbor distribution p kN N with the base LM distribution p LM using a tuned parameter to produce the final next-token distribution.kNN-LM only allows to augment the model with a single datastore.Here we introduce a method that allows us to combine multiple nearest neighbor distributions computed based on different datastores with the base LM probability distribution.Our method is based on product of experts which is first proposed by Hinton (2002).That idea allows us to combine toxic and non-toxic datastore outputs with base LM as: where α is the tuned parameter that controls the impact of the datastores over the base model.Equation 5 corresponds to the following: This equation indicates that a token possesses a high probability if it satisfies the condition of having high probabilities under both p LM and p + kN N , while simultaneously having a low probability under p − kN N .With this equation, we gain the flexibility to incorporate multiple datastores with the LM, allowing us to combine their logits through addition or subtraction.(Liu et al., 2021;Fan et al., 2018;Holtzman et al., 2019), we truncate the logits z prior to ensembling with the toxic and non-toxic datastores using nucleoussampling (Holtzman et al., 2019).This process effectively eliminates the unreliable tail of the distribution, leading to enhanced fluency in the generated content.
Baselines.We compare GOODTRIEVER to different toxicity mitigation techniques: DEXPERTS (Liu et al., 2021), GeDi (Krause et al., 2020), PPLM (Dathathri et al., 2019), DAPT (Gururangan et al., 2020) and UDDIA (Yang et al., 2022).In Appendix B.1, we include a brief overview of each technique.In addition to these techniques, we also report results for the toxic-only variation of GOODTRIEVER.In this case, the non-toxic logits are replaced by the base LM logits in Equation 5.

Evaluation
To evaluate the toxicity degeneration and capabilities of mitigation of different techniques, we adopt the protocol outlined by Gehman et al. ( 2020) and use the samples selected by Liu et al. (2021), a random selection of 10K non-toxic prompts from the REALTOXICITYPROMPTS (RTP) dataset.For each prompt, the models generate 25 continuations of 20 tokens.We evaluate models for three sets of metrics: toxicity, fluency, and diversity which we briefly introduce below.
Toxicity.Following the methodology proposed by Gehman et al. (2020), we measure toxicity using two metrics.Expected Maximum Toxicity (EMT) is the maximum toxicity over k model generations for a given prompt.A higher EMT indicates a greater expected toxicity in the worst-case scenario.The Toxicity Probability is the empirical probability of generating a span with TOXICITY > 0.5 at least once among the k generations.This metric captures the frequency of toxicity generation by the model.It is important to note that toxicity scores from the Perspective API3 tend to change over time and become lower (Pozzobon et al., 2023).This poses challenges in making direct comparisons.To ensure fair comparisons between techniques, we adhere to the protocol recommended by Pozzobon et al. (2023) and rescore all previously generated model continuations using the same version of the Perspective API.
Fluency.Generation fluency is the mean perplexity of generated continuations.In line with best practices from prior work (Liu et al., 2021;Yang et al., 2022), we score perplexity using a larger pretrained LM from the same family as our primary base model, GPT2-XL.Lower perplexity is generally preferable, however if lower perplexity is accompanied by reduced diversity, it signifies repetitive output, which is undesirable.Ideally, the post-toxicity mitigation technique should exhibit comparable perplexity levels to the base model.
Diversity.Generation diversity is measured by the number of distinct n-grams in generated responses scaled by the number of generated tokens (Li et al., 2015).We report diversity results for unigrams, bigrams, and 3-grams (dist-1, dist-2, and dist-3, where 'dist' denotes 'distinct').A higher diversity score indicates a greater variety of unique n-grams generated by the model and is desirable as it signifies a broader range of possible continuations for each prompt.

Results
Table 1 presents the results of GOODTRIEVER when compared to the baselines.GOODTRIEVER is competitive with previous state-of-the-art (SOTA) methods and even outperforms the SOTA EMT for GOODTRIEVER (small) at a cost of slightly higher perplexity.Qualitative examples of generated continuations using GOODTRIEVER versus the base model are available in the Appendix E.
In Table 2, we show that our method significantly reduces latency and computational costs compared to the previous SOTA method, DEXPERTS.In terms of inference time, GOODTRIEVER (large) achieves a 43% reduction compared to DEXPERTS, while consuming three times fewer parameters.
Table 1: Generations from DAPT, GeDi, PPLM, and UDDIA were rescored with Perspective API to obtain up-todate toxicity metrics (Pozzobon et al., 2023).DEXPERTS was entirely re-run in our code.Perplexity is computed for a sample of 1000 prompts.We also conducted ablation studies to investigate the impact of 1) datastore size, 2) number of neighbors, 3) temperature parameters, and 4) automatic labeling of the datastore samples.We briefly summarize the findings below, with full treatment in Appendix C. Datastore size.Our observations indicate that toxicity mitigation occurs even with small amounts of data in both the toxic and non-toxic datastores.GPT2's raw EMT value is 0.39, as shown in Table 1.Remarkably, for all combinations of GOODTRIEVER sizes in Figure 2, the maximum EMT is 0.26, a highly competitive performance compared to the baselines presented in Table 1.
The size of the toxic datastore appears to directly impact the diversity of the generated output.When the datastore is too small (< 500K tokens), the diversity metrics fall below an acceptable threshold, only marginally matching the scores of the base model.Regarding fluency, both datastores exhibit a clear trend: as the amount of toxic data increases and the amount of non-toxic data decreases, perplexity values rise.
Number of retrieved k neighbors.Figure 6 (in Appendix C.2) shows the impact of k neighbors retrieved for each datastore.Two types of experiments are performed: 1) varying number of neighbors for one datastore while keeping the other fixed at the maximum value of 1024, and 2) varying number of neighbors for both datastores.
Increasing the number of neighbors contributes to a decrease in toxicity across all settings.In scenario (1), retrieving more neighbors from the non-toxic datastore leads to a significant reduction in perplexity and diversity.For instance, when retrieving a single non-toxic neighbor and 1024 toxic neighbors, the perplexity is around 2000.However, when retrieving 1024 tokens from each datastore, the perplexity decreases to approximately 30.Similarly, the diversity metric improves from 0.2 to nearly 0.6 for the same numbers of retrieved neighbors.Conversely, when varying only the number of retrieved neighbors for the toxic datastore, perplexity increases while diversity also rises.These findings align with the observations presented in previous section, highlighting the significant influence of the toxic datastore on diversity metrics.Alpha vs. Temperature parameters.Figure 7 (in Appendix C.3) shows the impacts of α and kNN softmax temperature T .In our framework, α determines the weighting of the next token probabilities sourced from the datastores.T is the softmax temperature to build the probability distributions from the datastores, with higher values flattening the distribution and preventing overfitting (Khandelwal et al., 2020).As depicted in Figure 7, increasing the value of α leads to a trade-off between toxicity mitigation and perplexity for all evaluated temperatures.Conversely, larger values of T allow for more aggressive utilization of the probabilities from the datastores (with larger α values), as increasing T decreases perplexity while maintaining diversity close to the baseline.3 we show how GOODTRIEVER performs across GPT2, Pythia (Biderman et al., 2023) and OPT (Zhang et al., 2022b) model families.This allows us to understand generalization across model families and quantify how retrievalaugmented toxicity mitigation scales with model size.Applying GOODTRIEVER to the OPT family required some tuning of parameters for satisfactory results.Results are shown for α = 0.5 and T = 500.For Pythia, T = 500 was used.

Different Model Sizes and Families. In
We observe consistent mitigation performance across all variants GOODTRIEVER in terms of model size and family.The EMT is reduced by a maximum relative value of 48% in GPT2-small (from 0.39 to 0.20) and a minimum of 24% in OPT 1.3B (from 0.45 to 0.34).We don't see a clear trend between mitigation performance and model sizes.The OPT 6.7B model shows a higher relative reduction in toxicity than its 1.3B version, while the Pythia 1B has a higher relative reduction compared to its 6.9B version.It is noteworthy that models within the same family show similar base toxicity, a finding that is in line with previous work (Rae et al., 2021).
Automatic Labeling the Datastores.We performed additional experiments to demonstrate the robustness of GOODTRIEVER by substantially reducing the size of the datastores and automatically annotating them.We perform such experiments with two datasets as datastores: Jigsaw, our main dataset, and a subset of REALTOXICITYPROMPTS (RTP) not used for evaluation.Base models are kept the same, and so are generation parameters described in Appendix B.4.
In Table 4 we show results of GOODTRIEVER with substantially smaller automatically annotated datastores by Perspective API.We also show results of human-annotated datastores for a smaller-scale Jigsaw datastore.Respectively for toxic and nontoxic datastores, reported experiments have about 16x and 40x smaller datastores than the results shown in Table 1.
Surprisingly, at this data-constraint regime, both variants of automatically-labeled GOODTRIEVER datastores (Jigsaw and RTP) achieve lower toxicity metrics than the variant with a full-sized humanannotated Jigsaw from Table 1.Most likely due to smaller toxic datastores (i.e. Figure 2), diversity is slightly lower for all new variants.It is also remarkable how GOODTRIEVER with the randomly subsampled human-annotated Jigsaw performs on par with its much larger version from Table 1.

Continual Toxicity Mitigation
Work to date has often treated toxicity as a fixed characteristic, disregarding its variations over time and among different demographic groups (Goldfarb-Tarrant et al., 2023).One of the key advantages of GOODTRIEVER lies in its adaptability, facilitated by semi-parametric language models (Khandelwal et al., 2019;Izacard et al., 2022).We demonstrate the benefits of this flexible representation of toxicity by benchmarking GOODTRIEVER on the task of continual toxicity mitigation.The goal of this task is to continuously adapt to new types of toxicity while maintaining effective mitigation for previously encountered domains.

Experimental Setting
Data.To properly evaluate the task of continual toxicity mitigation, we introduce a controlled toxic dataset consisting of five well-defined domains, each associated with a specific demographic group.Our dataset is derived from CivilComments-  WILDS (Koh et al., 2021), which is a subset of the previously mentioned Jigsaw dataset, annotated with demographic information.Appendix D provides further details on the data processing steps and the topics covered in each domain (see Table 7).Throughout our experiments, we keep the non-toxic datastore fixed at a size of 50K sentences, focusing on examining shifts in toxicity.
Continual Learning Baselines.We compare the continual mitigation capabilities of GOODTRIEVER with two CL baselines based on DEXPERTS (Liu et al., 2021): 1) multitask and 2) continual finetuning.In the multitask setting, the experts are finetuned from scratch using all current and previous domain data at each step.This baseline aims to achieve the upper-bound performance among benchmarked CL techniques since it can directly optimize for all available data.In the continual finetuning setting, the experts have access to data from the current domain, while reusing the experts from previous steps for finetuning.This protocol closely resembles GOODTRIEVER's access to data, but it is expected to be a lower-bound due to catas-trophic forgetting (CF) (Goodfellow et al., 2013).
For both of these models, the non-toxic expert was trained (and kept fixed) with the same samples from GOODTRIEVER's non-toxic datastore.
Evaluation.To preserve the demographic context of our domains, we refrain from employing prompt/completion separation as done in the RTP dataset (Gehman et al., 2020).Instead, we take a set of 200 toxic sequences from each domain in the processed dataset, which serve as our prompts for evaluating the models.We report the same metrics as described in section 2.

Results
Continual mitigation results are shown in Figures 5  and 8 as well as in Table 8 in Appendix D. As expected, the continually finetuned DEXPERTS model performs the worst in the task due to CF.Its mitigation capabilities are not improved as new domains are incorporated for finetuning.We observe that GOODTRIEVER results are competitive to the multitask finetune baseline, which has the advantage of optimizing directly for all previous and current domains.These results are particularly significant as GOODTRIEVER does not require finetuning on all prior datasets, which can be costly and time-consuming at scale.We also note that these results were achieved without exploring specialized adaptation techniques that have been employed by other retrieval methods, such as specialized sample selection for the datastores or online adaptation of the interpolation parameter (Peng et al., 2023;Huang et al., 2023;Bhardwaj et al., 2022) (Welbl et al., 2021;Sheng et al., 2019), updating a pretrained model's hidden representations based on the gradient of a classifier with respect to the desired class (Dathathri et al., 2019), or directly adjusting the distribution using signals from a toxicity classifier (Krause et al., 2020).A notable approach in this category is DEXPERTS (Liu et al., 2021), which studies controllable text generation by combining a trained expert model trained on non-toxic data and a trained anti-expert model trained on toxic data using the Product of Experts (PoE) (Hinton, 2002).Similar to DExperts, (Hallinan et al., 2022) presented a text detoxification algorithm that combines an expert and an anti-expert with an LM using PoE.In contrast to DExperts, we do not leverage auxiliary models but rather utilizing the retrieval-augmented techniques.This avoids exploding parameter count and minimizes latency while preserving performance.Our technique also avoids directly adjusting the output distribution using signals from a toxicity classifier as done by Krause et al. (2020), which can impact fluency.
It augments an LM with one external memory or datastore that is consulted to modify the next-token probabilities.To our knowledge, we are the first to apply a retrieval-augmented approach to toxicity mitigation.In contrast to a standard kNN-LM, we augment multiple datastores with base LM by using an entirely different interpolation technique that utilizes PoE and mitigate toxicity with the aid of two datastores: one with toxic and another with non-toxic examples.
Continual Learning (CL) in LMs remains relatively unexplored, with only a limited number of works focusing on adapting language models to emerging corpora across various domains and timelines (Gururangan et al., 2020;Jang et al., 2021;Jin et al., 2021).There has been some work on toxicity classification that explores possible variations in terms of in-text demographic citations (Borkan et al., 2019) or the onset of new hate ideologies over time (Qian et al., 2021).Borkan et al. (2019) introduce a human-labeled dataset with 450K samples with demographic identity citations, later adapted to be the CivilComments-WILDS dataset (Koh et al., 2021).Qian et al. (2021) investigates lifelong hate-group classification applied to tweets and show how the major hate-speech topics change over time (Qian et al., 2021).To the best of our knowledge, our research is the first to tackle lifelong toxicity mitigation within the context of CL.

Conclusion
We present GOODTRIEVER, a novel method for toxicity mitigation, which utilizes multiple retrieval mechanisms to effectively adapt to the changing nature of language and toxicity without compromising linguistic quality.GOODTRIEVER achieves 43% decrease in inference time when compared to previous state-of-the-art while maintaining a comparable toxicity mitigation performance.We also show how GOODTRIEVER mitigates toxicity consistently across model sizes and families.Unlike prior approaches, GOODTRIEVER remains flexible and competitive in the face of evolving data.

Limitations
In this work, as in prior works we use the toxicity definitions from Perspective API for our datastore and evaluation.We understand the definition of what is toxic is extremely subjective and that there's no perfect answer for how toxic a given sentence is.We also don't evaluate if our mitigation technique amplifies biases against marginalized groups, as investigated in previous work (Xu et al., 2021;Welbl et al., 2021).On the technical aspects, other limitations are: (1) supporting only HuggingFace models implemented in the PyTorch framework; (2) our evaluation of the impact of GOODTRIEVER is limited to the metrics we propose and qualitative inspection (as visualized in Appendix E).Finally, as real-world applications continue to become increasingly multilingual and multicultural, it becomes crucial to develop toxicity mitigation strategies that can effectively address toxicity in cross-lingual systems.We acknowledge the need for such an approach and leave it as an area for future application of GOODTRIEVER.

Ethics Statement
Our research investigates the usage of retrieval models during decoding time of text generation to suppress toxic language and enhance the harmlessness of generated content.It is important to note that while our method for toxic language suppression significantly reduces the probability of generating toxic language, it does not entirely eliminate it.While extensive experimentation has demonstrated a significant decrease in model toxicity, we advise careful consideration when applying our method in real-world applications.
We are fully aware that the datasets used in our research, as well as some of the sample generated content may potentially include offensive or objectionable material.We acknowledge that exposure to such datasets and generated content could potentially be unpleasant or uncomfortable for the readers.However, we employ these datasets and generations to better understand, examine, and mitigate the harmful effects of toxic language generation in language models.
While our method is primarily developed to mitigate toxicity in LMs, we acknowledge the potential of its misuse to generate harmful texts by altering the usage of datastores, such as designating toxic attributes as desirable and non-toxic attributes as non-desirable.It is crucial to emphasize that our research is driven by the goal of promoting responsible and ethical use of language models, with a focus on ensuring the generation of safer content.We firmly discourage any attempts to exploit our method for malicious purposes, as it directly contradicts our ethical principles.Zihan Zhang, Meng Fang, Ling Chen, and Mohammad-Reza Namazi-Rad.2022c.Is neural topic modelling better than clustering?an empirical study on clustering with contextual embeddings for topics.arXiv preprint arXiv:2204.09874.

A Extended Related Work
Retrieval-Augmented LMs.These methods involve the retrieval of documents from a textual knowledge corpus, which are subsequently utilized to perform various language tasks.The integration of retrieval components with LMs has gained significant attention in recent studies, particularly in the field of language modeling (Min et al., 2022;Borgeaud et al., 2022) and question answering (Lewis et al., 2020;Izacard and Grave, 2020;Izacard et al., 2022;Guu et al., 2020).In addition to these explicit text retrieval methods, there is a category of models known as semi-parametric language models (Sukhbaatar et al., 2019;Wu et al., 2022) that employ memory to store text as key-value pairs.One prominent example is the kNN-LM (Khandelwal et al., 2019), that extends a pretrained LM by linearly interpolating its next word distribution with a k-nearest neighbors (kNN) model.The kNN-LM utilizes non-parametric external memory to store previously encountered text examples.During testing, this memory is leveraged to enhance the predictions of the parametric LM, eliminating the need for training or retraining.The simplicity and effectiveness of the kNN-LM has prompted the development of several methods for investigating semiparametric LMs (Khandelwal et al., 2020;Jiang et al., 2021;Meng et al., 2021;Yogatama et al., 2021;Das et al., 2022;Drozdov et al., 2022;Zhong et al., 2022;Peng et al., 2023).In this work, we develop a semi-parametric model based on kNN-LM that effectively mitigates toxicity during text generation tasks.
Continual Learning (CL) in LM.There are a few more studies in addition to the ones given in Section 5. LAMOL (Sun et al., 2019) has introduced a method that simultaneously learns tasks and generates training samples, enabling the model to replay pseudo-samples from previous tasks without requiring additional memory or model capacity and ELLE (Qin et al., 2022) has employed a combination of replay-based and parameter isolation based methods for continual pre-training.However, to the best of our knowledge, our research is the first to tackle lifelong toxicity mitigation within the context of CL.

B.1 Baselines of comparison
We leverage open-sourced continuations (Liu et al., 2021;Yang et al., 2022) for all models except DEX-PERTS.To ensure comparability, we rescore the toxicity scores, making certain that they adhere to the same version of the Perspective API (Pozzobon et al., 2023).
DAPT finetunes an LM for additional steps on domain-specific data.The base language model, GPT2-large, is fine-tuned on the non-toxic subset of the OpenWebText corpus, as specified by Liu et al. (2021).
GeDi uses class-conditional language models (CC-LM) to steer a larger LMs' next-token probabilities with Bayes rule to favor a given controlled attribute (Krause et al., 2020).The authors used GPT2-XL as a base model and GPT2-medium as the CC-LM fine-tuned on the Jigsaw dataset for detoxification.
PPLM updates the base language model's hidden activations using a toxicity classifier finetuned on the Jigsaw dataset (Dathathri et al., 2019).Due to high computational cost, PPLM is evaluated on a random subset of 1K non-toxic prompts.
UDDIA removes dependencies between a protected attribute, that in our case is toxicity, and text produced by LMs by rectifying the probability space.For toxicity mitigation, they leverage PPLM's classifier (Dathathri et al., 2019) and a novel redo mechanism that determines which layers need to have hidden activations modified (Yang et al., 2022).
DEXPERTS (Liu et al., 2021) addresses controllable text generation by combining an expert model trained on non-toxic data, and an anti-expert model trained on toxic data.In the original codebase, we were able to achieve a slightly lower EMT score of 0.19 instead of 0.21 as obtained by our codebase, but the inference time was more than 5 times higher.
The average inference time for each continuation of 20 tokens was of 0.19 seconds in the original code versus 0.033 in our implementation.We believe the differences come from the main libraries' versioning differences, particularly the transformers library.As we prioritized a fair comparison in terms of inference time, we show the results of our implementation of DEXPERTS.

B.2 Pretrained Language Models
All pretrained language models are available at the HuggingFace transformers library (Wolf et al., 2019)

B.3 Dataset Details
The details of toxic and non-toxic datastore datasets, which are processed versions of the Jigsaw Unintended Bias dataset, are provided in Table 5.The numbers of tokens are reported for experiments based on the GPT2 family of models.We compare toxicity metrics for multiple model sizes and families.All results from sections 3.3 and 3.3 were performed for the 10K non-toxic prompts from REALTOXICITYPROMPTS selected previously by (Liu et al., 2021).For inference, we used exclusively A100 40GB GPUs.
In Table 6, we present the parameters used for GOODTRIEVER-based models across all sizes and families.Additionally, we provide the nucleoussampling (Holtzman et al., 2019), also referred to as top-p sampling value.Top-p is a technique employed in language generation, selecting the next word or token in a sequence based on a restricted subset known as the nucleus, consisting of the most probable candidates.Typically, top-p is set to a high value (e.g., 0.9) to limit the long tail of lowprobability tokens that may be sampled.

C.1 Datastore size
To understand the impact of datastore size on the metrics, we modify the experimental protocol from section 3 so that evaluation is performed on a selection of 100 non-toxic prompts.Results are seen in Figure 2, which conveys the trade-offs of the metrics under these settings.
Fluency exhibits a clear trend for both datastores: as the toxic data increases and the non-toxic data decreases, perplexity values rise.When we have larger toxic datastores, increasing the non-toxic datastore decreases perplexity.
The size of the toxic datastore appears to directly influence the diversity of generated content.When the toxic datastore is too small (< 500K tokens), the diversity metrics fall below the acceptable rate that is of marginally matching the base model scores.In this case, as we're controlling the generation for toxicity, the small toxic datastores may hold the model hostage to repeating a small selection of safe sentences for each prompt, although this is a counterintuitive and unexplored hypothesis in our work.As models become more repetitive (less  1.Although experimental settings are not strictly comparable as the datastore size experiments use a sample of 100 nontoxic prompts instead of the total 10K.Additionally, the EMT results do not vary monotonically as we increase or decrease datastores' size.Interestingly, the best EMT of 0.19 is achieved with a selection of 1M and 10M toxic and non-toxic tokens, respectively, which represents approximately 10% and 25% of the full-sized datastores.This raises the question: how can we select toxic and non-toxic samples to add to the datastores to observe a monotonic decline in toxicity?

C.2 Number of retrieved k neighbors
As we discussed in Section 3.3, the impact of the number of neighbors (k) retrieved for each datastore is illustrated in Figure 6.Building on the discussion in Section 3.3, we observe that maintaining an equal number of retrieved neighbors from both datastores (scenario (2) or 'both' in the plots) results in better control over perplexity and diversity compared to scenario (1).However, toxicity levels decrease with a higher number of retrieved neighbors.This suggests that by retrieving an equal number of neighbors from each datastore, we can more effectively mitigate toxicity while preserving the desired perplexity and diversity of generated content.

C.3 Alpha vs. Temperature parameters
As we discussed in Section 3.3, Figure 7 shows the impacts of α and kNN softmax temperature T .The figure demonstrates that increasing the value of α leads to a trade-off between toxicity mitigation and perplexity for all evaluated temperatures.Conversely, larger values of T allow for more aggressive utilization of the probabilities from the datastores (with larger α values), as increasing T decreases perplexity while maintaining diversity close to the baseline.
Based on these experiments, we use T = 100 and α = 2.0 for all GOODTRIEVER runs, except for GOODTRIEVER with toxic datastore only and for OPT family results.Respectively, we use T = 25 and α = 1.5, and T = 500 and α = 0.5.As mentioned in Section 4.1, we utilize the CivilComments-WILDS dataset (Koh et al., 2021) as the initial data for our continual learning experiments.The dataset undergoes preprocessing steps: 1) merging the original train and validation splits, 2) filtering out comments with more than a single demographic citation, and 3) retaining only toxic comments.Following this, we extract sentence embeddings using SimCSE (Gao et al., 2021), reduce their dimensionality with UMAP, and perform clustering using k-means.This approach, similar to the one adopted by Zhang et al. (2022c), demonstrates how directly clustering high-quality sentence embeddings can lead to coherent and diverse topics.Visual inspection in UMAP's 2D space allows us to select five well-defined clusters.Table 7 presents the number of samples and top words from each domain, extracted as described by Zhang et al. (2022c).Figure 8 displays the domainspecific results as each domain is added, providing further insights into the experiment's outcomes.

D Continual Learning Experiments
Static Baselines.We benchmark the overall performances of CL approaches relative to the off-theshelf GPT2 model, as well as the GOODTRIEVER and DEXPERTS models from section 2. GPT2 is our main lower bound, while both GOODTRIEVER and DEXPERTS trained on Jigsaw are expected up-per bounds as they have much more in-domain data (more than 2M comments) than is available in our controlled CL experiments trained with a fraction of the WILDS data (up to 2.4K samples for each domain, WILDS is a subset of Jigsaw).In Table 8, you can find the domain-specific results for the baselines and the mitigation techniques that we are benchmarking.

E Continuation Examples
Table 9 and Table 10 present prompt completion examples along with their toxicity scores for the evaluated models.We also show the prompt and its original continuation scores.Prompts were selected based on high toxicity scores from the offthe-shelf GPT2-large model.The tables showcase three completions for each prompt from the evaluated models.

Figure 1 :
Figure 1: An illustration of GOODTRIEVER.The toxic and non-toxic datastores are built with toxic and non-toxic examples respectively.For a given test context, we (1) embed and search for the k most similar contexts in each datastore and (2) ensemble the next token probabilities from the LM with the datastores' probabilities.

Figure 3 :
Figure3: Relative difference of metrics between GOODTRIEVER and their base models.Relative EMT (↓) reduction is achieved for all GOODTRIEVER variants compared to their base model.

Figure 4 :
Figure 4: Absolute EMT (↓) for GOODTRIEVER models and their base models.GOODTRIEVER consistently reduces EMT for different model sizes and families.
4 and Table

Figure 5 :
Figure 5: Overall EMT for each continual learning technique benchmarked.GOODTRIEVER has competitive performance with the multitask fine-tune technique.

Figure 6 :
Figure 6: Impact of varying the number of retrieved neighbors from each datastore on GOODTRIEVER (GPT2 Large) metrics.

Table 2 :
Inference time corresponds to the time to generate a single continuation of 20 tokens on an A100 GPU.We report mean values over three runs of 100 prompts with 25 continuations per prompt.We compare GOODTRIEVER inference time with DEXPERTS, the previous SOTA for mitigation and inference time trade-offs.The base model is GPT2-large for both GOODTRIEVER and DEXPERTS.

Table 3 :
Toxicity mitigation results for different model families and sizes, sizes are ranging from 124M to 6.9B.We show how GOODTRIEVER has consistent mitigation performance even with larger models.The highest absolute decrease in EMT is of 0.19, while the minimum is of 0.11.

Table 4 :
GOODTRIEVER (Large) results when coupled with human or automatically annotated datastores.With 16x and 40x less toxic and non-toxic tokens in the datastores, respectively, automatically labeled datastores lead to better mitigation results than the human-annotated datastores from Table1.
. Our code currently supports Causal Language Models from this library implemented in the PyTorch framework.The kNN retrieval of GOODTRIEVER is built upon the open-sourced code by Alon et al. (2022) 4 .

Table 5 :
Dataset details for GOODTRIEVER GPT2 based models experiments.

Table 7 :
Topics and number of samples from each domain of the continual mitigation experiments.