Text Detoxification using Large Pre-trained Neural Models

We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second method uses BERT to replace toxic words with their non-offensive synonyms. We make the method more flexible by enabling BERT to replace mask tokens with a variable number of words. Finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. We compare our models with a number of methods for style transfer. The models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. Both methods we suggest yield new SOTA results.


Introduction
Identification of toxicity in user texts is an active area of research (Zampieri et al., 2020; D'Sa et al., 2020; Han and Tsvetkov, 2020). The task of automatically rewriting offensive content has attracted less attention, yet it may find various useful applications, such as making the online world a better place by suggesting a more neutral version of an emotional comment to the user who is posting it. The existing works on text detoxification (dos Santos et al., 2018; Tran et al., 2020; Laugier et al., 2021) cast this task as style transfer. The style transfer task is generally understood as rewriting a text so that its content stays the same while one or several attributes which constitute the "style" are altered, such as authorship (Voigt et al., 2018), sentiment (Shen et al., 2017), or degree of politeness (Madaan et al., 2020). Despite the goal of preserving the content, in many cases changing the style attributes changes the meaning of a sentence significantly. So in fact the goal of many style transfer models is to transform a sentence into a somewhat similar sentence of a different style on the same topic. We suggest that detoxification needs better preservation of the original meaning than many other style transfer tasks, such as sentiment transfer, so it should be performed differently.
We present two models for text detoxification, which have extra control for content preservation. The first model, ParaGeDi, is capable of fully regenerating the input. It is based on two ideas: external control of an output of a generation model by a class-conditioned LM (Krause et al., 2020) and formulation of style transfer task as paraphrasing (Krishna et al., 2020). Being based on a paraphraser model, ParaGeDi explicitly aims at preserving the meaning of the original sentence. The second approach, CondBERT, inspired by Wu et al. (2019a), follows the pointwise editing setup. It uses BERT to replace toxic spans found in the sentence with their non-toxic alternatives. The semantic similarity is maintained by showing the original text to BERT and reranking its hypotheses based on the similarity between the original words and their substitutes. Interestingly, BERT does not need any class-conditional pre-training to successfully change the text style from toxic to normal.
In addition, we perform a large-scale evaluation of style transfer models on the detoxification task, comparing our new models with baselines and state-of-the-art approaches. We release our code and data. Our contributions are as follows:
• We propose two novel detoxification methods based on pre-trained neural language models: ParaGeDi (paraphrasing GeDi) and CondBERT (conditional BERT).
• We conduct an evaluation of these models and their comparison with a number of state-of-the-art models for text detoxification and sentiment transfer and release the detoxification dataset.
• We create an English parallel corpus for the detoxification task by retrieving toxic/safe sentence pairs from the ParaNMT dataset (Wieting and Gimpel, 2018). We show that it can further improve our best-performing models.

Related Work
One of the most straightforward ways of solving the style transfer task is to "translate" a source sentence into the target style using a supervised encoder-decoder model (Rao and Tetreault, 2018). Since the source and the target are in the same language, pre-trained LMs such as GPT-2 (Radford et al., 2019) can be applied to this task: fine-tuning them on relatively small parallel corpora gives a good result (Wang et al., 2019). However, this method is used quite rarely because of the lack of sufficiently large parallel data. The rest of the described models are trained in an unsupervised way.
Pointwise Editing Models A relatively easy yet efficient style transfer method is to leave the sentence intact and manipulate only individual words associated with the style. The Delete-Retrieve-Generate (DRG) framework (Li et al., 2018) was the first effort to perform such transfer. It proposes four methods based on this principle. Two of them perform well on our data: DRG-RetrieveOnly retrieves a sentence with the opposite style which is similar to the original sentence and returns it, and DRG-TemplateBased takes the style attributes from it and plugs them into the original sentence. Here, the performance depends on the methods for the identification of style markers and retrieval of replacements. Words associated with style are typically identified based on their frequencies, as in the original paper, although some works use attention weights as features instead (Sudhakar et al., 2019). Alternatively, style transfer can use Masked Language Modelling (MLM). An MLM trained on a dataset with style labels picks a replacement word based not only on the context, but also on the style label. An example of such a model is Mask & Infill (Wu et al., 2019b). It is most similar to the CondBERT method we propose. However, CondBERT performs additional control over the style and the content preservation and is able to make multi-word replacements. Another similar model of this type is described by Malmi et al. (2020). It has a more complicated structure: there, two MLMs trained on corpora of different styles perform replacements jointly.
End-to-end Architectures In contrast to these models, there exist end-to-end architectures for style transfer. They encode the source sentence, then manipulate the resulting hidden representation in order to incorporate a new style, and then decode it. Some of them disentangle the hidden representation into the representation of content and style. The others force the encoder to represent style-independent content (Hu et al., 2017). Alternatively, the model DualRL by Luo et al. (2019) performs a direct transfer from the source to the target style. The task is paired with the dual task (back transfer to the source style) which allows models to train without parallel data. The Deep Latent Sequence Model (DLSM) by He et al. (2020) uses amortized variational inference to jointly train models for the primal and dual tasks. The Stable Style Transformer (SST) method (Lee, 2020) trains a pair of sequence-to-sequence transformers for primal and dual tasks using cross-entropy of a pretrained style classifier as an additional discriminative loss. The Style Transfer as Paraphrase (STRAP) method by Krishna et al. (2020) views a style transfer model as a paraphraser that adds attributes of a particular style to a text. The authors create pseudo-parallel data by transferring style-marked texts to neutral with a pre-trained general-purpose paraphraser and then train sequence-to-sequence models on these neutral-to-styled parallel datasets. Our ParaGeDi model is conceptually similar to these methods. However, unlike these methods, the style is not infused into the model or a sentence representation but is imposed on the generator by another model.

Detoxification
Detoxification of text is a relatively new style transfer task. The first work on this topic, by dos Santos et al. (2018), is an end-to-end seq2seq model trained on a non-parallel corpus with autoencoder loss, style classification loss and cycle-consistency loss. A more recent work by Tran et al. (2020) uses a pipeline of models: a search engine finds non-toxic sentences similar to the given toxic ones, an MLM fills the gaps that were not matched in the found sentences, and a seq2seq model edits the generated sentence to make it more fluent. Finally, Laugier et al. (2021) detoxify sentences by fine-tuning T5 as a denoising autoencoder with additional cycle-consistency loss. Dathathri et al. (2020) and Krause et al. (2020) approach a similar problem: preventing a language model from generating toxic text. They do not need to preserve the meaning of the input text. However, the idea of applying a discriminator to control an LM during generation can be used for style transfer, as we show in our experiments.

Paraphrasing GeDi Model
The recently proposed GeDi model (Krause et al., 2020) performs text generation from scratch guided by a language model informed about specific attributes of a text, e.g. style or topic. We extend this model by enabling it to paraphrase the input text.

GeDi
The original GeDi model consists of two components: a generation model (GPT-2) and a discrimination model, which is also a GPT-2 trained on sentences with additional sentence-level style labeling: during training the style label is prepended to a sentence. This makes the discriminating model learn the word distributions conditioned on a particular label. At each generation step, the distribution of the next token predicted by the main model $P_{LM}$ is modified using a discriminator term $P_D$, which is derived from an additional class-conditional language model, and the Bayes rule:

$$P(x_t \mid x_{<t}, c) \propto P_{LM}(x_t \mid x_{<t})\, P_D(c \mid x_t, x_{<t})$$

Here, $x_t$ is the current token, $x_{<t}$ is the prefix of the text, and $c$ is the desired attribute (e.g. toxicity or sentiment), one of $C$ classes. The first term is produced by the main language model $P_{LM}$, and the second term is calculated using the Bayes rule and the additional class-conditional language model $P_{CC}$. Thus, the tokens which are more likely to appear in a text of the chosen style get a higher probability:

$$P_D(c \mid x_t, x_{<t}) \propto P(c)\, P_{CC}(x_{<t} \mid c)\, P_{CC}(x_t \mid x_{<t}, c)$$

The name GeDi stands for Generative Discriminator, because a language model, which is generative by its nature, is used as a discriminator for guiding the generation process. GeDi was successfully applied to guiding a GPT-2 language model towards generating texts of particular topics and making the generated text less toxic.
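The reweighting step described above can be illustrated with a minimal sketch. It assumes toy dictionaries in place of real model outputs; the function names (`class_posterior`, `gedi_next_token`) and the data layout are hypothetical and are not taken from the GeDi code. Only the Bayes-rule combination follows the text.

```python
import math


def class_posterior(seq_logp_by_class, log_prior):
    """Class posterior via the Bayes rule.

    seq_logp_by_class: dict class -> log P_CC(x_t, x_<t | c)
    log_prior:         dict class -> log P(c)
    Returns a dict class -> P(c | x_t, x_<t), normalised over classes.
    """
    joint = {c: log_prior[c] + lp for c, lp in seq_logp_by_class.items()}
    top = max(joint.values())  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {c: math.exp(v - log_z) for c, v in joint.items()}


def gedi_next_token(p_lm, cc_logp, prefix_logp, log_prior, target):
    """Reweight the main LM's next-token distribution towards class `target`.

    p_lm:        dict token -> P_LM(x_t | x_<t)
    cc_logp:     dict class -> dict token -> log P_CC(x_t | x_<t, c)
    prefix_logp: dict class -> log P_CC(x_<t | c)
    """
    scores = {}
    for token, p in p_lm.items():
        seq_logp = {c: prefix_logp[c] + cc_logp[c][token] for c in cc_logp}
        scores[token] = p * class_posterior(seq_logp, log_prior)[target]
    total = sum(scores.values())
    return {token: s / total for token, s in scores.items()}
```

In this toy setting, a token that the main model prefers can still lose probability mass if the class-conditional model considers it much more likely under the undesired class.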

ParaGeDi
In order to enable GeDi to preserve the meaning of the input text, we replace the regular language model with a paraphraser. If we denote the original text by $x$, the generated text of length $T$ by $y$, and the desired style by $c$, ParaGeDi models the following probability:

$$P(y_t \mid y_{<t}, x, c) \propto P_{LM}(y_t \mid y_{<t}, x)\, P(c \mid y_t, y_{<t}, x) \approx P_{LM}(y_t \mid y_{<t}, x)\, P_{CC}(c \mid y_t, y_{<t})$$

The last step is an approximation because the class probability should be conditioned on both $x$ and $y$. However, this approximation, although not being fully justified, allows us to decouple the paraphraser model (which requires a parallel corpus for training) from the style model (which requires only texts with style labels, not necessarily parallel). The paraphraser and the style model can be trained independently. Moreover, we can plug in any paraphraser as long as it shares the vocabulary with the class-conditional LM. The third (optional) component of this model is a reranker: an external model which reweighs the hypotheses generated by the style LM-guided paraphraser with respect to the style. Our reranker is a pre-trained toxicity classifier which chooses the least toxic hypothesis generated by the ParaGeDi model. Figure 1 illustrates the workflow of our model.

[Figure 1, example input: "cmon man, the article was complete trash!"; partially generated output: "Man, the article was a whole ____"]
ParaGeDi is trained as follows. Its loss $L_{ParaGeDi}$ consists of a linear combination of two losses: the generative loss $L_G$ used in LM training, and the discriminative loss $L_D$ which further pushes different classes away from one another.
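One way to write this linear combination is sketched below. The exact weighting is an assumption: it is chosen to agree with the parameter selection discussion, where the single training hyperparameter $\lambda$ controls the strength of the discriminative loss and $\lambda = 1$ corresponds to the absence of the generative loss.

```latex
L_{ParaGeDi} = \lambda\, L_D + (1 - \lambda)\, L_G, \qquad 0 \le \lambda \le 1
```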
We enhance the model with a number of inference heuristics that improve content preservation and increase the style transfer accuracy. First, we use a heuristic from the original GeDi model. We raise the conditional LM probability to the power $w > 1$, which biases the discriminator towards the correct class during generation:

$$P(y_t \mid y_{<t}, x, c) \propto P_{LM}(y_t \mid y_{<t}, x)\, P_{CC}(c \mid y_t, y_{<t})^{w}$$

Besides that, we suggest two new heuristics:
• Smoothing of probabilities: adding a small $\alpha > 0$ to all probabilities from the conditional language model discourages the generation of tokens with low probability conditional on all classes. We denote the resulting smoothed class probability by $P_{\alpha}(c \mid x_t, x_{<t})$.
• Asymmetric lower and upper bounds ($l$ and $u$) for class-conditional corrections:

$$P_{\alpha,l,u}(c \mid x_t, x_{<t}) = \max(l, \min(u, P_{\alpha}(c \mid x_t, x_{<t})))$$
By decreasing the value of u we discourage the insertion of new tokens, as opposed to prohibiting existing tokens. For the problem of detoxification, it means that the model will try less to insert polite words than to delete toxic words from the sentence.
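The three inference heuristics (style strength, smoothing, and asymmetric bounds) can be combined as in the following sketch. Several details are assumptions made for illustration only: the smoothing is implemented by adding `alpha` to each unnormalised class score before normalisation, the bounds are applied before the power, and the function name `corrected_class_weight` is hypothetical.

```python
def corrected_class_weight(joint, target, w, alpha, lower, upper):
    """Return the style correction factor for one candidate token.

    joint:  dict class -> P(c) * P_CC(x_t, x_<t | c), unnormalised class scores
    target: the desired class c
    w:      style strength exponent (w > 1 sharpens the correction)
    alpha:  smoothing constant (assumed form: added to every class score)
    lower, upper: bounds l and u applied to the smoothed class probability
    """
    # Smoothing: when all class scores are tiny, the posterior moves towards
    # uniform, so the token gets no strong boost from the style model.
    smoothed = {c: score + alpha for c, score in joint.items()}
    posterior = smoothed[target] / sum(smoothed.values())
    # Asymmetric bounds: max(l, min(u, P_alpha)).
    clipped = max(lower, min(upper, posterior))
    # Style strength: raise the class probability to the power w.
    return clipped ** w
```

The returned factor would be multiplied into the paraphraser's probability for the same token. Lowering `upper` caps how strongly any token can be promoted, while `lower` limits how strongly a token can be suppressed.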

Conditional BERT Model
Since BERT (Devlin et al., 2019) has been trained on the task of filling in gaps ("masked LM"), we can use it to insert non-toxic words instead of the toxic ones. This approach has been suggested by Wu et al. (2019a) as a method of data augmentation. The authors identify words belonging to the source style and replace them with the [MASK] token, and the BERT model then inserts new words of the desired style in the designated places. To push BERT towards the needed style, the authors fine-tune BERT on a style-labelled dataset by replacing the segmentation embeddings of the original BERT with trainable style embeddings.
We perform some changes to this model to adapt it for the detoxification task. While in the original conditional BERT model the words are masked randomly, we select the words associated with toxicity. This can be done in different ways, e.g. by training a word-level toxicity classifier or manually creating a vocabulary of rude and toxic words. We use a method which does not require any additional data or human effort. We train a logistic bag-of-words toxicity classifier. This is a logistic regression model which classifies sentences as toxic or neutral and uses their words as features. As a byproduct of the training process, each feature (word) yields a weight which roughly corresponds to its importance for classification. The words with the highest weights are usually toxic. We use the normalised weights from the classifier as the toxicity score. The overview of CondBERT is shown in Figure 2.
For each word in a sentence, we compute the toxicity score and then define toxic words as the words with the score above a threshold $t = \max(t_{\min}, \max(s_1, s_2, \ldots, s_n)/2)$, where $s_1, s_2, \ldots, s_n$ are the scores of all words in a sentence and $t_{\min} = 0.2$ is a minimum toxicity score. This adaptive threshold allows balancing the percentage of toxic words in a sentence so that we avoid cases when too many or no words are marked as toxic.
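The adaptive threshold above is simple enough to sketch directly. The function name `toxic_word_indices` is hypothetical, and the word scores are assumed to be already normalised classifier weights.

```python
def toxic_word_indices(scores, t_min=0.2):
    """Return the positions of words marked as toxic.

    scores: per-word toxicity scores s_1..s_n for one sentence
    t_min:  minimum toxicity score (0.2 in the text)
    A word is toxic if its score is above t = max(t_min, max(scores) / 2).
    """
    if not scores:
        return []
    threshold = max(t_min, max(scores) / 2)
    return [i for i, s in enumerate(scores) if s > threshold]
```

With scores `[0.05, 0.1, 0.9, 0.5]` the threshold is 0.45 and the last two words are selected. With scores `[0.05, 0.1, 0.15]` the floor of 0.2 applies and no word is selected.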
To preserve the meaning of the replaced word, we employ the content preservation heuristics suggested by Arefyev et al. (2020): (i) Preserve the original tokens instead of masking them before the replacement; (ii) Rerank the replacement words suggested by BERT by the similarity of their embedding with the embedding of the original word.
Despite using class-specific sentence embeddings, conditional BERT often predicts toxic words, apparently paying more attention to the context than to the embeddings of the desired class. To force the model to generate non-toxic words we calculate the toxicity of each token in BERT vocabulary and penalize the predicted probabilities of tokens with positive toxicities.
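The reranking and toxicity penalty described in the last two paragraphs can be sketched as follows. The additive scoring formula, the weights `sim_weight` and `penalty`, and the function names are all assumptions made for illustration; the text only states that candidates are reranked by embedding similarity to the original word and that tokens with positive toxicity are penalised.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(candidates, original_vec, embeddings, toxicity,
           sim_weight=1.0, penalty=1.0):
    """Order replacement candidates from best to worst.

    candidates:   dict token -> log-probability proposed by the masked LM
    original_vec: embedding of the word being replaced
    embeddings:   dict token -> embedding vector
    toxicity:     dict token -> toxicity score (positive means toxic)
    """
    scored = {}
    for token, logp in candidates.items():
        score = logp + sim_weight * cosine(original_vec, embeddings[token])
        if toxicity[token] > 0:
            # Hypothetical penalty form: subtract the scaled toxicity.
            score -= penalty * toxicity[token]
        scored[token] = score
    return sorted(scored, key=scored.get, reverse=True)
```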
Finally, we enable BERT to replace a single [MASK] token with multiple tokens. We generate each next token progressively by beam search and score each multitoken sequence by the harmonic mean of the probabilities of its tokens.
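Scoring multi-token replacements by the harmonic mean of token probabilities can be sketched as below. The beam search itself is omitted; the hypotheses are assumed to be given as pairs of tokens and their probabilities, and the function names are hypothetical.

```python
def harmonic_mean(probs):
    """Harmonic mean of token probabilities; 0.0 if empty or any is non-positive."""
    if not probs or any(p <= 0 for p in probs):
        return 0.0
    return len(probs) / sum(1.0 / p for p in probs)


def best_replacement(hypotheses):
    """Pick the token sequence with the highest harmonic-mean score.

    hypotheses: list of (tokens, token_probs) pairs, where token_probs[i]
    is the model probability of tokens[i] given the previous tokens.
    """
    return max(hypotheses, key=lambda h: harmonic_mean(h[1]))[0]
```

Unlike a plain product of probabilities, this score does not automatically favour shorter sequences, and unlike an arithmetic mean it is pulled down sharply by a single unlikely token.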

Detoxification Experiments
We train the two new models and a number of other systems for text detoxification. Below we describe datasets, evaluation setup, and results.

Toxicity Classifier
We train two binary classifiers of toxicity. One of them is used to rerank hypotheses in the ParaGeDi model, and the other participates in the evaluation. We train these two classifiers on different sets of data. The overall dataset is the merge of the English parts of the three datasets by Jigsaw (Jigsaw, 2018, 2019, 2020), containing around 2 million examples. We split it into two parts and fine-tune a RoBERTa model (Liu et al., 2019) on each part. We use the roberta-large model from the original repository. The classifiers perform closely on the test set of the first Jigsaw competition, reaching an AUC-ROC of 0.98 and an F1-score of 0.76.

Dataset
For training and testing of the style transfer models, we use the English data from the first Jigsaw competition (Jigsaw, 2018). The majority of our methods are trained on non-parallel corpora of source and target styles. To prepare the toxic dataset, we divide the comments labelled as toxic into sentences (the original comments are often too long) and classify each of them with our toxicity classifier. Sentences classified as toxic are used as the toxic part of the dataset (we find 154,771 of them). To select the neutral part of the dataset, we randomly pick the same number of non-toxic sentences from the sentence-separated Jigsaw data. The test set is prepared analogously to the test set of the Jigsaw competition: we use 10,000 sentences with the highest toxicity score according to our classifier.

Metrics
There is no parallel test set available for the detoxification task, so we cannot use BLEU, METEOR or ROUGE metrics and resort to referenceless evaluation. Style transfer models need to change the style, preserve content and produce a fluent text. These parameters are often inversely correlated, so we need a compound metric to find a balance between them. We follow the evaluation strategy of Krishna et al. (2020) and use the metric J, which is the product of sentence-level style accuracy, content preservation, and fluency. The system-level J is the average of the sentence-level scores. Style accuracy (ACC) is measured with the pre-trained toxicity classifier described in Section 5.1. Content preservation (SIM) is evaluated as the similarity of sentence-level embeddings of the original and transformed texts computed by the model of Wieting et al. (2019). Fluency (FL) is measured with a classifier of linguistic acceptability trained on the CoLA dataset (Warstadt et al., 2019). In addition, we tried a similar aggregated metric GM (Pang and Gimpel, 2019; Laugier et al., 2021) which uses perplexity as the measure of fluency and employs a different aggregation method.
Our preliminary experiments showed that J and GM are strongly correlated, so we keep only J for further evaluation.
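The system-level J described above is the mean of per-sentence products, which is not the same as the product of the three system-level averages. A minimal sketch, assuming the per-sentence ACC, SIM and FL values have already been computed by the respective models (the function name `joint_score` is hypothetical):

```python
def joint_score(acc, sim, fl):
    """System-level J: average over sentences of ACC * SIM * FL.

    acc, sim, fl: equal-length lists with one score per sentence.
    """
    if not (len(acc) == len(sim) == len(fl)):
        raise ValueError("all metric lists must have the same length")
    if not acc:
        raise ValueError("at least one sentence is required")
    return sum(a * s * f for a, s, f in zip(acc, sim, fl)) / len(acc)
```

For two sentences with ACC `[1, 0]`, SIM `[0.8, 0.9]` and FL `[1, 1]`, J is 0.4, whereas multiplying the averaged metrics would give 0.425. A sentence contributes to J only if it succeeds on all three criteria at once.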

Implementation Details
For ParaGeDi, we use a pre-trained T5-based (Raffel et al., 2020) paraphraser, fine-tuned on a random subsample of the ParaNMT dataset (Wieting and Gimpel, 2018). As a discriminator, we fine-tune the gpt2-medium model (Radford et al., 2019) on the training part of the Jigsaw-1 dataset using two control codes for toxic and polite texts. Before fine-tuning, we change the vocabulary of the discriminator to match that of T5, and update its embeddings accordingly. We train the discriminator using a combined generative and discriminative loss from Krause et al. (2020), adapting their code for this purpose.
We use beam search decoding with 10 beams to generate paraphrase candidates with the paraphraser and discriminator described above. We apply the classifier from Section 5.1 to select the least toxic candidate from the 10 resulting paraphrases.

Competing Methods
We compare our models with state-of-the-art methods described in Section 2: DRG-TemplateBased, DRG-RetrieveOnly, Mask&Infill, DLSM, STRAP, and SST. We also implement three other baselines: Machine Translation, Detoxifying GPT-2, and Paraphraser. We do not directly compare our models with GeDi, because it is a language model and was not explicitly trained to transform texts.
Machine Translation There is evidence that automatic translation tends to eliminate toxicity (Prabhumoye et al., 2018). Thus, we use a chain of Machine Translation models for detoxification. Namely, we perform English → Pivot → English translation. We choose French and Igbo as pivot languages. French is resource-rich and structurally similar to English, which ensures high-quality translations. Conversely, Igbo is low-resourced and syntactically different. Both experiments are conducted using the Google Translate API.
Detoxifying GPT-2 GPT-2 (Radford et al., 2019) can be adapted to a wide range of NLP tasks using a very small task-specific dataset. We experiment with the model's ability to perform sequence-to-sequence tasks. We train it on a parallel dataset of 200 toxic and safe sentences. We randomly select toxic sentences from the Google Jigsaw toxic comments dataset (Jigsaw, 2018) and manually rewrite them in the neutral tone.
Paraphraser Krishna et al. (2020) suggest that a general-purpose paraphraser can remove style markers from text. We check this assumption.

Results
The performance of all tested models is given in Table 1. Both ParaGeDi and CondBERT outperform other models by a large margin. The success of CondBERT is explained by its use of heuristics targeted at the components of the metric: (i) it is penalized for generating toxic tokens, which ensures a high ACC score, (ii) over 80% of tokens stay unchanged, and the replacements are selected with respect to the similarity to the original words, increasing the overall SIM score, (iii) MLM is pre-trained to replace masked tokens with plausible substitutes, increasing FL. ParaGeDi is behind in terms of similarity but has a slightly higher fluency because generation is a better strategy in terms of text naturalness than pointwise corrections. The closest competitor of our models is Mask&Infill which uses similar principles as CondBERT. However, some engineering decisions (e.g. masking of all words at once) result in a substantial drop in fluency and some decrease in style transfer accuracy.
Surprisingly, many advanced models perform below the simplistic DRG models TemplateBased and RetrieveOnly. TemplateBased achieves a high similarity because it keeps most of the original sentence intact, and RetrieveOnly yields a high similarity and style transfer accuracy, because it outputs real non-toxic sentences from the training data. DLSM and SST models perform full re-generation of text (as opposed to pointwise corrections). More importantly, their decoders are trained from scratch on a relatively small dataset, hence their low fluency scores. Conversely, STRAP, which also generates the sentence, has access to larger pseudo-parallel data, resulting in higher fluency.
Another finding is that MT has detoxification ability. However, it is inversely correlated with its quality: the En→Ig→En detoxifies 37% of sentences but has low SIM and FL scores. Conversely, En→Fr→En yields a better output which keeps most of the original features, including toxicity. The same applies to the T5 paraphraser. On the other hand, the GPT-2 model can be trained to detoxify even on a very small number of parallel sentences (200 in our experiments). Although it performs below many other models, we suggest that training it on a larger parallel dataset can boost its performance. We show examples of the paraphrases by the best-performing models in Table 2.
Additional examples and qualitative analysis can be found in Appendices F and E, respectively.

Parameter Selection
Our models use multiple parameters and heuristics. We perform an ablation study to explore their usefulness. It turns out that the crucial features of CondBERT are multiword replacement, which ensures high fluency, and the toxicity penalty, which increases style strength. On the other hand, masking of all tokens at once as well as control of similarity do not affect the quality. More details on the CondBERT ablation study are given in Appendix B.
ParaGeDi has only one training hyperparameter λ which controls the strength of its discriminative loss. We discover that its value has only a marginal effect on the overall quality: the value of J decreases only for λ = 1 which constitutes the absence of generative loss (see Figure 3). The style strength control influences the style accuracy, whereas the use of the word probability upper bound increases the similarity, and the absence of beam search decreases fluency. On the other hand, reranking, beam size, and smoothing do not affect the model performance. An ablation study of the ParaGeDi model can be found in Appendix C.

Mining a Parallel Detoxifying Corpus
The STRAP model (Krishna et al., 2020) is based on the assumption that a regular paraphraser can transform a stylistically marked text into a neutral text. Although our experiments show that a paraphraser alone is a weak detoxifier, a large paraphrase corpus can still contain detoxifying sentence pairs. To find them, we classify the sentences of the ParaNMT paraphrase dataset (Wieting and Gimpel, 2018) with our toxicity classifier described in Section 5.1 and obtain 500,000 paraphrase pairs where one sentence is more toxic than the other (for more details on the data collection process please see Appendix D). We then compare the regular paraphraser from Section 5.4 fine-tuned on a random subset of ParaNMT (regular) with its version fine-tuned on the mined toxic/safe parallel paraphrase corpus (mined). We also plug both paraphrasers into the ParaGeDi model and compare the overall performance. The results are shown in Table 3.
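The mining step can be pictured with the following sketch. It is only an illustration under stated assumptions: the selection rule (a minimum gap `margin` between the two toxicity scores) and the names `mine_detox_pairs` and `toxicity_score` are hypothetical, and the actual collection procedure is the one described in Appendix D.

```python
def mine_detox_pairs(pairs, toxicity_score, margin):
    """Select paraphrase pairs in which one side is clearly more toxic.

    pairs:          iterable of (sentence_a, sentence_b) paraphrase pairs
    toxicity_score: callable mapping a sentence to a toxicity score in [0, 1]
    margin:         minimum score gap required to keep a pair (assumed rule)
    Returns a list of (toxic_sentence, safe_sentence) tuples.
    """
    mined = []
    for a, b in pairs:
        score_a, score_b = toxicity_score(a), toxicity_score(b)
        if abs(score_a - score_b) >= margin:
            # Orient every kept pair as (more toxic, less toxic).
            mined.append((a, b) if score_a > score_b else (b, a))
    return mined
```

The oriented pairs can then serve as source and target sides when fine-tuning a sequence-to-sequence paraphraser.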

Discussion of Results
None of the paraphrasers can fully detoxify the test set, but the mined paraphraser gets a better ACC than the regular one (42% vs 15%). When we replace the regular paraphraser with the detoxifying one in ParaGeDi, both detoxification rate and fluency improve without loss in the similarity score. This leaves us with the J score of 0.54, which is the highest score we obtained in our detoxification experiment. We do not include it in the main results (Table 1) because this model is not unsupervised. However, this result shows that the general-purpose ParaNMT corpus contains a large number of toxic/safe paraphrase pairs. We believe that mining parallel training sets from large corpora, as opposed to unsupervised methods of style transfer, is a fruitful direction.

Human Evaluation of Detoxification
While the automatic reference-free evaluation is cheap and fast, it may be unreliable. Toxicity and fluency classifiers are not perfect and can return erroneous evaluations. The embedding distance which is used to measure the content preservation was shown to weakly correlate with human judgements (Yamshchikov et al., 2021). Thus, we evaluate the best-performing models manually.

Experimental Setup
We design our manual evaluation setup to be as close as possible to the automatic evaluation. We evaluate our models along the same three metrics: style accuracy (ACC_m), content similarity (SIM_m), and fluency (FL_m). For all metrics we use a ternary scale: {0, 0.5, 1} corresponding to a bad, partially acceptable, and fully acceptable sentence. We ask five annotators to evaluate the models. Annotators are NLP researchers with an MSc degree or above and with a good command of English. Prior to the annotation, we arranged a preliminary round to reach a common understanding of the annotation. Each sentence is evaluated by three annotators, and the final score for a sentence is computed as the average of their scores. We measure the inter-annotator agreement in terms of Krippendorff's α. We obtained the score of 0.42 for the style accuracy, 0.31 for content preservation, and 0.52 for fluency: a moderate agreement for style and fluency annotation, and low agreement for content annotation.
We evaluate three models: our new models ParaGeDi and CondBERT, and Mask&Infill whose automatic scores were the highest among the existing models. The evaluation was conducted on 200 source sentences, each of which was transformed by each of the evaluated models. The input (toxic) sentences for the evaluation were manually pre-selected to filter out disfluent or senseless utterances (this pre-selection did not consider the outputs). To compensate for the low inter-annotator agreement, we annotate each sample three times and report the average score.

Discussion of Results
We show the performance of models in terms of human evaluation in Table 4. The model scores are the averaged sentence scores. We combine the three metrics into a joint quality score which we denote as J m . Sentence-level J m is a multiplication of sentence ACC m , SIM m , and FL m , and the model J m scores are the average of sentence scores. This manual evaluation corroborates the superiority of our models over Mask&Infill model. At the same time, it confirms that our two models are not significantly different. Although ParaGeDi outperforms CondBERT in terms of all metrics, the difference in scores is statistically significant only for FL m . Besides the evaluation itself, we investigated to what extent the automatic metrics reflect the human judgements. To do that, we compute their Spearman's ρ correlation score with human judgements (see Table 6). For style, we consider the accuracy of toxicity classifier that we used for the evaluation (ACC) and its version which returns the confidence instead of the binary label (ACC-soft). For content we compare SIM (embedding similarity used for computing the J score) and BLEU score between the original and detoxified sentence. For fluency, we consider the linguistic acceptability classifier (FL) and perplexity of the GPT-2 (Radford et al., 2019) language model (PPL) which is used for evaluating fluency in many works on style transfer and other generation tasks.
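Spearman's ρ between an automatic metric and the averaged human scores can be computed as the Pearson correlation of ranks. The sketch below uses average ranks for ties, which matter here because the manual scores take only a few distinct values. It is a plain-Python illustration with hypothetical function names; in practice a library routine would normally be used.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```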
This evaluation shows that the tested metrics of content preservation show only weak correlation with manual scores, which agrees with the previous research (Yamshchikov et al., 2021). The correlation of automatic style and fluency metrics with human judgements is moderate. It turns out that the confidence of the style classifier is a better style accuracy metric than a binary classifier and the acceptability classifier works better than perplexity, confirming the criticism of perplexity as a fluency metric (Krishna et al., 2020).

Table 5: Performance of the sentiment transfer models on the YELP dataset. The models are sorted with respect to the aggregated J score. * indicates the score which is significantly higher than the next best model with p < 0.01.

Sentiment Transfer Experiments
Text detoxification is not as well-established as other style transfer tasks, which makes it difficult to put our models in the context of other works on style transfer. Thus, we conduct an experiment on a different domain, namely, sentiment transfer.
Experimental Setup We train ParaGeDi and CondBERT on the Yelp reviews dataset (Li et al., 2018) and compare them with Mask&Infill, SST, DRG-TemplateBased, DRG-RetrieveOnly, and DualRL models (see Section 2). We tune the hyperparameters of ParaGeDi and CondBERT on the Yelp development set and use the outputs of other models generated by their authors.
We evaluate the models using J as in our detoxification experiments. For the evaluation of style transfer accuracy, we train two sentiment classifiers on two disjoint parts of the Yelp dataset as in Section 5.1. We use one for inference and another for evaluation. We also compute the BLEU score against human references provided by Li et al. (2018). The results are shown in Table 5, averaged over two transfer directions.
Discussion of Results ParaGeDi outperforms other models in terms of J. As before, the other models fail to generate fluent texts because they replace only specific words or because they learn to generate texts from scratch. ParaGeDi is the only competitor which combines pre-trained models with full regeneration. The performance of the CondBERT model is low on this task, corroborating that detoxification and style transfer for other domains require different techniques.
On the other hand, the BLEU score questions this result. Compared to the human references, the best-performing model is DualRL followed by the two MLM-based models: Mask&Infill and our CondBERT. The evaluation of reference human answers also questions the referenceless metrics. First, we see that the ACC score is limited by the classifier performance. Since it gives only 0.81 to presumably 100% correct manually written sentences, the small differences in ACC should not be considered significant, and the ACC above 0.81 is unreliable. Overall, since the score of human answers is close to those of ParaGeDi and Mask&Infill, ParaGeDi can still be considered a strong style transfer model, and more precise evaluation should be done by humans because metrics cannot distinguish between the models at this level.

Conclusion
We present two style transfer models tailored for detoxification, i.e. transfer from toxic to non-toxic texts. Both of them combine high-quality pre-trained LMs with extra style guidance. ParaGeDi is based on a paraphraser guided by a style-conditioned GPT-2 model. The CondBERT model is based on BERT, which does not need any fine-tuning, and all style control is performed with a pre-trained toxicity classifier. We conduct a large-scale study of style transfer models exploiting both automatic and manual evaluation. Our experiments show that the proposed methods outperform other state-of-the-art style transfer models on the tasks of detoxification and sentiment transfer.

Ethical Statement
Toxicity is a sensitive topic where the unexpected results and byproducts of research can cause harm. Therefore, we would like to consider some ethical concerns related to our work.
On Definition of Toxicity Toxicity is an umbrella term for almost any undesirable behaviour on the Internet. It ranges from "mild" phenomena like condescending language (Perez Almendros et al., 2020) to grave insults or oppression based on racial or other social-demographic characteristics.
While annotators agree when labelling serious cases of toxicity such as hate speech (Fortuna and Nunes, 2018), the labelling of less severe toxicity is subjective and depends on the annotator's background (Al Kuwatly et al., 2020). This can cause the underestimation of certain types of toxicity. To define toxicity in the most objective feasible way, we adopt a data-driven approach, which is formally defined in Appendix A. Both models we propose recognise toxicity based on a toxicity-labelled dataset and do not require any additional manually created dictionaries or rules. Thus, their understanding of toxicity can be tuned with the input data. This ensures that, given a corpus with unbiased toxicity labelling, our models can produce unbiased detoxification.
On the other hand, if the training corpus is biased, the model can reproduce the biases, so it should be applied with caution.
Toxification of Texts The detoxification task implies the possibility of performing the opposite transformation, i.e. rewriting a neutral text into a toxic one. Various style transfer models, including ours, could in principle be used to complete this task. However, in the case of CondBERT, the quality of such a transformation would be bad, and it would be almost impossible to pass the results of this "toxification" off as real toxic sentences. The reason for that is the structure of toxic data.
One of the main properties of toxic style is the presence of lexical markers of this style (rude or obscene words). Such markers (i) carry most of the stylistic information of a sentence (i.e. their presence is a strong indicator of this class), (ii) have synonyms which are free from this stylistic information. Both our methods strongly rely on these properties. They identify toxic words and replace them with non-toxic synonyms. On the other hand, when performing the opposite transformation, we cannot use these properties any more. First, there do not exist non-toxic words which are strong indicators of neutral (non-toxic) style. Second, it is almost infeasible to identify non-toxic words which have toxic synonyms and replace them appropriately. Therefore, we suggest that CondBERT is not suitable for toxification.
The above arguments do not prove that CondBERT or ParaGeDi cannot be applied for toxification. However, they suggest that the quality of the resulting text might not be higher than with simpler toxification methods (e.g. handwritten rules for inserting rude words).
Detoxification as a Censorship Another concern is the fact that the detoxification technology could be used to rewrite user-generated messages, which might be considered a form of censorship. We would like to look at that from a different perspective. Social media already perform censorship, e.g. Instagram provides tools for removal of messages based on automatically identified harmful content. 5 On the other hand, we suggest mitigating this policy by rewriting toxic messages instead of removing them altogether. Last but not least, we suggest that user messages should not be modified without user consent. The detoxification models should be used for suggesting detoxifying edits rather than performing them automatically.
At the same time, detoxification models can make chatbots safer by detoxifying (if necessary) their answers before sending them to users. A toxic comment automatically generated by a neural chatbot may be the result of pre-training on biased textual data, a problem which is currently not completely solved (Gehman et al., 2020). Therefore, detoxification of automatically generated content might be a valid use case for minimizing reputational losses for the company that created such an unmoderated chatbot (Babakov et al., 2021).

A Definition of Text Detoxification Task
In our work, we adhere to the data-driven definition of toxicity. The toxicity is a particular binary value associated with a text: {toxic, neutral}. We assume that this textual characteristic is measurable using a function $\sigma(x_i) \to s_i$ that obtains as input a text $x_i$ and returns the corresponding style label $s_i$. For instance, it can be implemented using a text classifier. Let us assume a set of two discrete mutually exclusive styles $S = \{s^{src}, s^{tg}\}$ which correspond to the source toxic and target neutral styles. Let us consider two text corpora $D^{src} = \{d^{src}_1, d^{src}_2, ..., d^{src}_n\}$ and $D^{tg} = \{d^{tg}_1, d^{tg}_2, ..., d^{tg}_m\}$ belonging to the source and target styles $s^{src}$ and $s^{tg}$, respectively. For each text $d_i$, let us assume that it has a style $s_i$ measurable with the function $\sigma : D \to S$. There also exists a binary function $\delta : D \times D \to [0, 1]$ that indicates the semantic similarity of two input texts and a unary function $\psi : D \to [0, 1]$ that indicates the degree of the text fluency. In general, the sizes of the source and the target corpora $D^{src}$ and $D^{tg}$ are different ($n \neq m$) and the texts in them are not aligned, i.e., in general, $\delta(d^{src}_i, d^{tg}_i) \neq 1$. If $n = m$ and $\delta(d^{src}_i, d^{tg}_i) = 1$ for all texts, this is a special case of a parallel style-aligned corpus.
Given the introduced notations, we define the task of text detoxification as follows: A text detoxification model is a function $\alpha : S \times S \times D \to D$ that, given a source style $s^{src}$, a target style $s^{tg}$, and an input text $d^{src}$, produces an output text $d^{tg}$ such that:
• The style of the text changes from the source style $s^{src}$ to the target style $s^{tg}$: $\sigma(d^{src}) \neq \sigma(d^{tg})$, $\sigma(d^{tg}) = s^{tg}$;
• The content of the source text is saved in the target text as much as required for the task: $\delta(d^{src}, d^{tg}) \geq t_{\delta}$;
• The fluency of the target text achieves the required level: $\psi(d^{tg}) \geq t_{\psi}$,
where $t_{\delta}$ and $t_{\psi}$ are the error tolerance threshold values for the content preservation ($\delta$) and fluency ($\psi$) functions.
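The three conditions above can be checked mechanically once σ, δ, and ψ are available. The following is a minimal sketch, not the evaluation code of this work: the function `is_valid_detoxification`, the default thresholds, and the toy stand-ins for σ (a one-word lexicon), δ (token overlap), and ψ are all hypothetical and serve only to make the example runnable.

```python
from typing import Callable


def is_valid_detoxification(
    d_src: str,
    d_tg: str,
    sigma: Callable[[str], str],         # style function: text -> style label
    delta: Callable[[str, str], float],  # semantic similarity in [0, 1]
    psi: Callable[[str], float],         # fluency in [0, 1]
    s_tg: str = "neutral",
    t_delta: float = 0.5,                # assumed threshold, task-dependent
    t_psi: float = 0.5,                  # assumed threshold, task-dependent
) -> bool:
    """Check the three conditions of the task definition for one text pair."""
    style_changed = sigma(d_src) != sigma(d_tg) and sigma(d_tg) == s_tg
    content_kept = delta(d_src, d_tg) >= t_delta
    fluent = psi(d_tg) >= t_psi
    return style_changed and content_kept and fluent


# Toy stand-ins (assumptions for illustration only, not real classifiers).
TOY_TOXIC_WORDS = {"idiot"}


def toy_sigma(text: str) -> str:
    return "toxic" if TOY_TOXIC_WORDS & set(text.lower().split()) else "neutral"


def toy_delta(a: str, b: str) -> float:
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def toy_psi(text: str) -> float:
    return 1.0 if text.strip() else 0.0
```

With these stand-ins, the pair "start there first you idiot" and "start there first" satisfies all three conditions at the default thresholds, while returning the input unchanged fails the style condition.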
When removing the toxicity from a text, we inevitably change a part of its meaning, so full content preservation cannot be reached. However, we should attempt to save the content as much as possible and adjust $t_{\delta}$ to the needs of this task.
Thus, the task of obtaining a text detoxification model with the best parameter set may be viewed as maximizing the probability $P(d^{tg} \mid d^{src}, s^{src}, s^{tg})$ given the three above-mentioned constraints based on parallel or non-parallel text corpora $D^{src}$ and $D^{tg}$.
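One possible way to write this objective compactly is given below. The parameter vector $\theta$ of the model and the subscript on $P$ are notational assumptions introduced here for illustration; the constraints are the three conditions of the task definition.

```latex
\[
\theta^{*} = \arg\max_{\theta} \; P_{\theta}\!\left(d^{tg} \mid d^{src}, s^{src}, s^{tg}\right)
\quad \text{subject to} \quad
\begin{cases}
\sigma(d^{src}) \neq \sigma(d^{tg}), \quad \sigma(d^{tg}) = s^{tg}, \\
\delta(d^{src}, d^{tg}) \geq t_{\delta}, \\
\psi(d^{tg}) \geq t_{\psi}.
\end{cases}
\]
```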

B CondBERT Ablation Study
Our CondBERT model was inspired by Wu et al. (2019a) and is similar to Wu et al. (2019b), but has some unique properties. We test their importance with an ablation study on the detoxification task. We use two heuristics for content preservation: not masking the toxic tokens and reranking replacement candidates with respect to their similarity to the original tokens. Removing any of these heuristics leads to lower content preservation and higher style accuracy, showing the inverse correlation of these properties (see Table 7). However, the J score for these models stays the same. On the other hand, turning off the possibility of filling a single mask with multiple words reduces the fluency and style accuracy, although it obviously yields a better content preservation score, because the output sentence contains fewer new words. This affects the J score, which is reduced for this model. Finally, the greatest impact on the J metric is caused by eliminating the toxicity penalty. The ACC is reduced dramatically, and although the other two metrics grow slightly compared to the full model, they cannot compensate for the low style accuracy.

C ParaGeDi Ablation Study
The ParaGeDi model consists of a paraphraser, a language model trained with the generative-discriminative loss, and a style classifier for reranking hypotheses. In addition to that, we use a number of heuristics during inference. We conduct an ablation study on the detoxification task to understand the usefulness of these components. We test the following variations of ParaGeDi:
• no discriminative loss (λ = 0),
• no generative loss (λ = 1),
• no upper bound (u = ∞),
• no discriminator (w = 0),
• no extra control of style strength (w = 1),
• no probability smoothing (α = 0),
• no reranking,
• no beam search (beam size of 1).
In each configuration, all other parameters are fixed. The performance of models is given in Table 8.
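To make the roles of λ and w in the list above more concrete, here is a simplified single-step sketch. It is an assumption-laden illustration rather than the ParaGeDi implementation: it assumes that λ weights the discriminative term (as the list implies, since λ = 0 removes it), that w is an exponent on the style posterior, and that class priors are equal. The upper bound u and the smoothing parameter α are not modelled, and the function names are hypothetical.

```python
def combined_loss(gen_loss, disc_loss, lam):
    """Assumed mixture: lam = 0 keeps only the generative loss,
    lam = 1 keeps only the discriminative loss."""
    return lam * disc_loss + (1.0 - lam) * gen_loss


def reweight_next_token(lm_probs, p_given_target, p_given_other, w):
    """Reweight next-token probabilities of the paraphraser by a style posterior.

    lm_probs:        token -> probability under the paraphraser
    p_given_target:  token -> probability under the target-style conditional LM
    p_given_other:   token -> probability under the opposite-style conditional LM
    w:               style strength; w = 0 switches the discriminator off,
                     w = 1 applies the posterior without extra amplification
    """
    scores = {}
    for token, p_lm in lm_probs.items():
        a = p_given_target[token]
        b = p_given_other[token]
        posterior = a / (a + b)  # P(target style | token), equal class priors
        scores[token] = p_lm * posterior ** w
    total = sum(scores.values())
    return {token: s / total for token, s in scores.items()}
```

In this sketch, setting w = 0 returns the paraphraser distribution unchanged, and larger w pushes probability mass towards tokens that the target-style model prefers, which mirrors the trade-off between style accuracy and similarity reported in the ablation.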
Decreasing the number of beams leads to the deterioration of fluency and of style strength because of the smaller number of options for the reranker to choose from. Removing the reranker leads to lower style strength with small gains in similarity or fluency. Turning off the smoothing of probabilities makes similarity and fluency degrade a little. Removing the upper bound on the discriminative correction leads to nearly 100% style transfer but to very low similarity of the generated sentences to the original ones, as the model starts hallucinating. Decreasing the w parameter reduces style accuracy but improves fluency and similarity, showing a clear trade-off between them.
The individual components of the loss are slightly less important for style than inference parameters. With only the discriminative loss the model is still able to successfully transform style in 77% of the cases, and the generative loss alone is able to change the style in 90% of the cases. The latter figure shows that the model equipped with style labels can discriminate between styles even if it was not explicitly trained to do that. On the other hand, the elimination of the generative loss results in a significant drop in fluency. Although the class-conditional LM in ParaGeDi is a GPT-2 model which has already been trained for the generation task, the lack of generation-based fine-tuning reduces the quality of the resulting text.

D Details of Mining the Parallel Corpus
Here we describe in more detail the process of mining a detoxifying parallel paraphrase corpus. We use the ParaNMT dataset (Wieting and Gimpel, 2018), comprised of 50 million English sentences and their paraphrases back-translated from Czech. We filter the dataset, keeping only the sentence pairs with moderate similarity (0.6 to 0.95) and similar length (with a difference of at most 40%), which is approximately 50% of the dataset. We compute the similarity as the cosine similarity between the averaged BERT embeddings of all words in a sentence. After this similarity- and length-based filtering, we score each sentence with the RoBERTa-based toxicity classifier from Section 5.1 and keep only the pairs with a difference in toxicity scores of at least 50%. Thus, we obtain 500,000 sentence pairs. Examples are given in Table 9.
Manual inspection of a random sample of the selected pairs shows that around 10% of them are invalid paraphrases, 40% are in fact both toxic or both safe, and around 50% of them are valid detoxifying paraphrases. This suggests that with more rigorous filtering we could obtain a corpus for detoxification of around 250,000 high-quality parallel sentences, which is larger than the majority of existing parallel style transfer datasets.
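The filtering criteria above can be sketched as a single predicate over a sentence pair. The thresholds come from the text; everything else is an assumption made for illustration: the function names, measuring length in characters, taking the relative length difference with respect to the longer sentence, and reading "difference in toxicity scores of at least 50%" as an absolute gap of 0.5 between classifier scores. Sentence embeddings and toxicity scores are passed in as precomputed values, so no particular model API is implied.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relative_length_difference(a, b):
    """Length gap relative to the longer sentence (character-based assumption)."""
    longest = max(len(a), len(b))
    return abs(len(a) - len(b)) / longest if longest else 0.0


def keep_pair(sent_a, sent_b, emb_a, emb_b, tox_a, tox_b,
              min_sim=0.6, max_sim=0.95, max_len_diff=0.4, min_tox_gap=0.5):
    """Return True if the pair passes the similarity, length, and toxicity filters."""
    sim = cosine_similarity(emb_a, emb_b)
    if not (min_sim <= sim <= max_sim):
        return False  # too dissimilar, or a near-copy
    if relative_length_difference(sent_a, sent_b) > max_len_diff:
        return False
    return abs(tox_a - tox_b) >= min_tox_gap
```

The upper similarity bound discards near-identical pairs, which cannot differ much in style, while the lower bound discards pairs that are unlikely to be paraphrases at all.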

E Qualitative Analysis
Both automatic and manual joint scores show that our best models are halfway between useless and perfect. But the actual success rate is much less than half. We call a detoxified sentence "perfect" if all three annotators gave it the maximal scores for all three aspects. With this definition, only 20% of ParaGeDi sentences and 14% of CondBERT sentences are perfect, and only about 1.5% of Mask&Infill sentences are perfect.
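The "perfect" criterion can be expressed as a simple check over annotations. This is a minimal sketch: the aspect names and the maximal score values are supplied by the caller because the annotation scale is not restated in this section, and the function names are hypothetical.

```python
def is_perfect(annotations, max_scores):
    """True if every annotator gave the maximal score for every aspect.

    annotations: list of dicts, one per annotator, mapping aspect -> score
    max_scores:  dict mapping aspect -> maximal attainable score
    """
    return all(
        annotation[aspect] == max_scores[aspect]
        for annotation in annotations
        for aspect in max_scores
    )


def perfect_rate(all_annotations, max_scores):
    """Share of sentences that are perfect under the definition above."""
    if not all_annotations:
        return 0.0
    perfect = sum(is_perfect(a, max_scores) for a in all_annotations)
    return perfect / len(all_annotations)
```

Because a single non-maximal score from any annotator disqualifies a sentence, this rate is much stricter than an averaged joint score, which explains the gap between the two.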
As Table 4 shows, the main cause of imperfection for all models is distortion of meaning. Below we describe our manual investigation into the causes of this distortion.
In half of the cases, ParaGeDi conveys the meaning more or less adequately. Its mistakes include:
• replacement of toxic words by similarly looking less toxic words with different meaning (e.g. "whore" → "Who's who", "stop behaving like fascist thugs" → "Stop looking at fascism", "taxman massive cunt , only outdone by linuxcunt himself ." → "Taxman's massive cut, outdone by Linuxcune himself.").
• replacement of sentence meaning ("the election was yours to lose" → "the election is to be won", "this crap hole institute run by motherfuckers and bastards" → "a deloitte institute for mothers and their children")
• avoiding the toxic or difficult part, for example "why we gotta have this miscegenation crap ?" → "Why do we need to have it?".
In some cases, however, ParaGeDi masks or rephrases the toxic part of the message, while still preserving the general meaning, for example "start there first you idiot!" → "Let's start there first." In general, ParaGeDi makes the impression of fantasising too much, because it often rewrites the whole sentence, and from time to time changes its structure significantly.

Table 9: Examples of mined detoxifying paraphrases. Here sim is similarity between sentences, computed by (Wieting and Gimpel, 2018), ld is relative difference in length, tox ref and tox trn are toxicity scores calculated by our classifier.
So now their spirits are cursed, walking back roads, waterways, and if they find an unfaithful man, they kill him, and that man is never seen again. | their souls are cursed, they guard the paths, he says, and when they encounter an unfaithful man, he will be killed, and his body will never be found.
The DLSM and Template-based DRG models often preserve the meaning by just preserving the toxic words, so their total success rate is low. The Retrieve-only DRG model almost never preserves the meaning. The Mask&Infill model seems to be overfitted: it often replaces toxic words with irrelevant non-toxic words (e.g. "crap" → "compelling") that the model apparently considers to be the "markers" of the non-toxic style. These properties make the baselines unsuitable for the detoxification task without adaptation, and the CondBERT model is in fact such an adaptation.
Typical mistakes of both ParaGeDi and CondBERT can be attributed mostly to the insufficiency of semantic understanding: they often replace toxic words with semantically related words of different (often opposite) meaning, or simply with similarly looking words. We conjecture that a paraphraser trained on a larger corpus (we have used only 2% of ParaNMT) or on more difficult examples would improve the ability of ParaGeDi to preserve meaning.
Generally, our models produce the impression of not being mature enough for fully automatic use with texts where meaning is important. However, they can be used to suggest detoxification options to human writers, or to detoxify the output of chit-chat bots where the cost of producing an inarticulate utterance is considerably less than the cost of producing a toxic one.

F Examples of Detoxification
Input: try not to make wikipedia look so stupid .
ParaGeDi (ours): Try not to make Wikipedia seem like a bad idea.
CondBERT (ours): try not to make wikipedia look so unsettling .
Mask&Infill: try not to make wikipedia look so compelling .

Input: i will make sure i revert any stupid edits you make from now on .
ParaGeDi (ours): I'll be sure to correct any wrong edits that you make. from now on.
CondBERT (ours): i will make sure i do not make any mistake about any edits you make from now on .
Mask&Infill: i will make sure i revert any compelling edits you make from now on .