CONPROMPT: Pre-training a Language Model with Machine-Generated Data for Implicit Hate Speech Detection



Introduction
Warning: this paper contains content that can be offensive and upsetting.
Figure 1: The generalization issue of existing pre-trained language models (HateBERT and fBERT) in the hate speech domain when adapted to the implicit hate speech detection task. The performance of the pre-trained models drops severely on cross-dataset evaluation, although all datasets target the same task (implicit hate speech detection). Each model is fine-tuned on the IMPLICIT HATE CORPUS (IHC) dataset.

Implicit hate speech is a disparaging statement targeting a certain group without explicit cues such as swear words. For example, "we as a society should not take care of those with mental illness" is an example of implicit hate speech targeting the
Mentally Disabled group. Since there is a lack of explicit cues, it is difficult to detect implicit hate speech using methods such as lexicon-based approaches (Waseem et al., 2017; Ocampo et al., 2023). Training pre-trained language models on implicit hate speech datasets has shown satisfactory performance on in-dataset evaluation (e.g., a model trained on the training set of dataset A is evaluated on the test set of dataset A) (ElSherief et al., 2021). However, the trained models fail to generalize to other implicit hate speech datasets (Kim et al., 2022b). In other words, the performance of the models drops consistently on cross-dataset evaluation (e.g., a model trained on the training set of dataset A is evaluated on the test set of dataset B of the same task).
One possible way to improve the generalization ability is to further pre-train models on a relevant large corpus. However, existing models pre-trained on abusive or hate speech corpora are not specialized in implicit hate speech. For example, existing pre-trained language models in the hate speech domain such as HateBERT (Caselli et al., 2021) and fBERT (Sarkar et al., 2021) suffer from performance drops on cross-dataset evaluation across implicit hate speech datasets (Figure 1). We suspect that the lack of knowledge relevant to implicit hate speech makes the existing pre-trained models rely on spurious correlations such as identity term bias (i.e., classifying a text as hateful just because of the presence of identity terms such as Asian).

Figure 2: … (Hartvigsen et al., 2022). Given a machine-generated statement (on the right side of the gray box), the example statements in its origin prompt (on the left side of the gray box) are considered positive samples for contrastive learning. The pre-training process would enable a model to learn useful features relevant to implicit hate speech, such as target group and toxicity.
Recently, Hartvigsen et al. (2022) presented TOXIGEN, a large-scale dataset with over 250k samples for implicit hate speech detection, generated using GPT-3 (Brown et al., 2020). They aim to generate implicit statements about certain target groups (e.g., Asian) with a certain toxicity (i.e., toxic or non-toxic). They encourage GPT-3 to generate such statements by providing it with a set of example statements (i.e., a prompt) toward a certain target group with a certain toxicity. For instance, given a set of example statements on the Mentally Disabled group with toxic as the toxicity label, GPT-3 tends to generate toxic statements about the Mentally Disabled group (the gray box in Figure 2).
We pre-train a language model for implicit hate speech detection by leveraging the machine-generated TOXIGEN as a dataset. We propose a novel pre-training approach that can fully leverage machine-generated data. Specifically, we present CONPROMPT, a pre-training approach which utilizes machine-generated statements and their origin prompts as positive pairs for contrastive learning (Figure 2). In the machine generation process of TOXIGEN, a machine-generated statement resembles the examples of its origin prompt. For example, given a set of examples with {implicit, Asian, toxic} properties as a prompt, GPT-3 tends to generate statements with similar {implicit, Asian, toxic} properties. Inspired by this, we conjecture that making the representations of a machine-generated statement and the examples in its origin prompt similar would enable the model to learn the common properties between them. Since the examples in the prompts of TOXIGEN are carefully curated to carry desirable properties regarding implicit hate speech (i.e., target group and toxicity), we expect that pre-training on TOXIGEN with CONPROMPT would result in a model with useful features for implicit hate speech. To this end, we present TOXIGEN-CONPROMPT, a BERT further pre-trained for implicit hate speech detection by pre-training on TOXIGEN using CONPROMPT.
We use cross-dataset evaluation settings across three implicit hate speech datasets to evaluate the generalization ability of TOXIGEN-CONPROMPT. TOXIGEN-CONPROMPT consistently outperforms other pre-trained language models. This shows the effectiveness of the proposed pre-training approach, CONPROMPT, on generalization ability. We also observe that TOXIGEN-CONPROMPT mitigates the identity term bias compared to BERT, while other MLM-based pre-trained models in the hate speech domain rather exacerbate the identity term bias. This further emphasizes the advantage of TOXIGEN-CONPROMPT, showing its suitability as a pre-trained model for implicit hate speech detection with superior generalization ability and reduced unintended bias. In addition, we conduct analyses to investigate the representation quality of TOXIGEN-CONPROMPT. We confirm that TOXIGEN-CONPROMPT has learned desirable features (i.e., target group and toxicity) for implicit hate speech-related tasks. We look forward to its potential usage in implicit hate speech-related tasks.
Our main contributions are as follows:
(1) We propose a novel pre-training approach, CONPROMPT, which can fully leverage machine-generated datasets.
(2) We present TOXIGEN-CONPROMPT, a pre-trained BERT for implicit hate speech detection using CONPROMPT. TOXIGEN-CONPROMPT shows superior generalization ability on implicit hate speech detection compared to other pre-trained language models.
(3) We show the effectiveness of TOXIGEN-CONPROMPT in mitigating the identity term bias, which is a major issue in hate speech detection.
(4) We demonstrate that TOXIGEN-CONPROMPT has learned desirable features (i.e., toxicity and target group) regarding implicit hate speech via extensive analyses of its representation quality.

ToxiGen
Recently, Hartvigsen et al. (2022) presented a large-scale dataset for implicit hate speech detection. The authors proposed to use a set of human-curated examples as a prompt to encourage GPT-3 to generate samples for an implicit hate speech dataset. They determined desired properties of machine-generated statements so that the generated statements can be used as samples for the dataset. The desired properties of a machine-generated statement they considered are as follows:

• IMPLICITNESS: a machine-generated statement should be in implicit form (i.e., without explicit hateful words such as slurs).

SimCSE
Gao et al. (2021) proposed a contrastive learning method for enhancing sentence embeddings. They proposed supervised SimCSE, which uses the entailment relationship in natural language inference (NLI) datasets to construct positive pairs for contrastive learning. In other words, given a sentence x_i (premise), an entailment hypothesis of x_i is considered a positive sample x_i^pos for x_i. Given an anchor sentence x_i, the authors proposed:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h(x_i),\, h(x_i^{pos}))/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h(x_i),\, h(x_j^{pos}))/\tau}}, \qquad (1)$$

where N is the number of sentences in a mini-batch, h(·) is the representation of a sentence from an encoder, and τ is a temperature hyperparameter. sim(h(x_i), h(x_j)) is the similarity between the two representations h(x_i) and h(x_j), for which the authors use cosine similarity.
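As a concrete illustration, the objective in Eq. 1 can be sketched in a few lines of NumPy; `simcse_loss` is an illustrative helper name, not the authors' code, which operates on encoder outputs inside a training loop:

```python
import numpy as np

def simcse_loss(h_anchor, h_pos, tau=0.05):
    """Contrastive objective in the style of Eq. 1: for each anchor i,
    h_pos[i] is its positive; the positives of the other anchors in the
    mini-batch serve as in-batch negatives."""
    a = h_anchor / np.linalg.norm(h_anchor, axis=1, keepdims=True)
    b = h_pos / np.linalg.norm(h_pos, axis=1, keepdims=True)
    logits = (a @ b.T) / tau  # (N, N) cosine similarities scaled by 1/tau
    # Row-wise log-softmax; the matching positive sits on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

With a low temperature, matching anchor/positive pairs drive the loss toward zero, while mismatched pairs are heavily penalized.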

CONPROMPT
CONPROMPT aims at making the representation of a generated statement and those of the example statements in its origin prompt similar. Since the example statements exhibit some desired properties and the generated statement resembles them, pulling a generated statement toward its origin example statements would enable a model to learn such desired properties.
SimCSE chooses one positive sample per anchor sample. Here, we use a machine-generated statement g_i as an anchor sample. As a positive sample for the anchor g_i, we propose to use an example statement from the origin prompt. We denote P(g_i) as a function that returns the set of example statements (i.e., a prompt) from which g_i originated. Given a prompt P(g_i) = {s_1, ..., s_m} which consists of m example statements, we randomly select one example statement s_i^pos as a positive sample for g_i. If we simply follow SimCSE, the resulting objective for an anchor is:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h(g_i),\, h(s_i^{pos}))/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h(g_i),\, h(s_j^{pos}))/\tau}}. \qquad (2)$$

However, several example statements (i.e., some of {s_1, ..., s_m}) of the same prompt P(g_i) can exist in a mini-batch. In such a case, example statements of the same prompt P(g_i) would be considered negative samples and pushed away from the generated statement g_i in the representation space, which is not what we intend. We expect any example statements in P(g_i) to be considered positive samples if they are included in a mini-batch. Thus, we modify Eq. 2 to include such example statements as positive samples for an anchor g_i, leveraging the membership relation:

$$\ell_i^{con} = -\frac{1}{|S_i|} \sum_{s^{pos} \in S_i} \log \frac{e^{\mathrm{sim}(h(g_i),\, h(s^{pos}))/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h(g_i),\, h(s_j^{pos}))/\tau}}, \qquad (3)$$

where S_i = {s_j^pos | 1 ≤ j ≤ N, s_j^pos ∈ P(g_i)} is the set of sampled example statements in the mini-batch that belong to the prompt P(g_i), and ℓ_i^con is the proposed objective given a generated statement g_i. While this modification is inspired by Khosla et al. (2020), where the label information (i.e., whether a sample has the same label as the anchor) is used to allow several positive samples, we propose to use the membership relation (i.e., whether an example statement is an element of a prompt).
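A minimal NumPy sketch of the membership-relation objective (Eq. 3), under the assumption that prompt membership can be encoded as one integer id per anchor/example pair; `conprompt_loss` and `prompt_ids` are illustrative names, not the paper's implementation:

```python
import numpy as np

def conprompt_loss(h_gen, h_ex, prompt_ids, tau=0.03):
    """h_gen: (N, d) representations of generated statements g_i.
    h_ex:  (N, d) representations of the sampled example statements.
    prompt_ids[i] identifies the origin prompt P(g_i) of pair i.
    Example statements whose prompt id matches the anchor's are treated
    as positives (the membership relation); the rest are negatives."""
    g = h_gen / np.linalg.norm(h_gen, axis=1, keepdims=True)
    e = h_ex / np.linalg.norm(h_ex, axis=1, keepdims=True)
    logits = (g @ e.T) / tau  # (N, N) scaled cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    prompt_ids = np.asarray(prompt_ids)
    losses = [
        -log_probs[i, prompt_ids == prompt_ids[i]].mean()  # average over positives
        for i in range(len(prompt_ids))
    ]
    return float(np.mean(losses))
```

Treating all in-batch members of the same prompt as positives is exactly what prevents the unintended push-away effect described above.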
For N samples in a mini-batch, our contrastive objective is:

$$\mathcal{L}_{cl} = \frac{1}{N} \sum_{i=1}^{N} \ell_i^{con}. \qquad (4)$$

We also use a masked language model (MLM) objective L_mlm on the machine-generated statements, since Gao et al. (2021) showed that incorporating the MLM objective was beneficial for performance on transfer tasks. Following Devlin et al. (2019), after choosing 15% of the tokens, we mask 80% of the chosen tokens, replace 10% of them with random tokens, and leave the remaining 10% unchanged.
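The 15% / 80-10-10 corruption scheme above can be sketched as follows; `mlm_corrupt` is an illustrative helper operating on token ids, with -100 as the conventional ignore-label used by common MLM implementations:

```python
import random

def mlm_corrupt(token_ids, vocab_size, mask_id, rng=None):
    """BERT-style masking (Devlin et al., 2019): select 15% of tokens as
    prediction targets; of those, 80% become [MASK], 10% become a random
    token, and 10% are left unchanged."""
    rng = rng or random.Random(0)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)  # -100: no prediction at this position
    for i, tok in enumerate(token_ids):
        if rng.random() < 0.15:       # chosen as a prediction target
            labels[i] = tok           # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_id                     # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = rng.randrange(vocab_size)   # 10%: random token
            # remaining 10%: keep the original token unchanged
    return corrupted, labels
```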
Our final objective for pre-training is:

$$\mathcal{L} = \mathcal{L}_{cl} + \lambda \mathcal{L}_{mlm},$$

where λ is a weighting hyperparameter; we set λ = 0.1, which showed the best performance on the transfer tasks in Gao et al. (2021).
ToxiGen-ConPrompt We present TOXIGEN-CONPROMPT, a BERT pre-trained using TOXIGEN as the dataset and the proposed CONPROMPT as the pre-training approach. Before pre-training, we discovered that email information, URLs, and user or channel mentions are included in the machine-generated statements in TOXIGEN. Since this can cause harm to society regarding privacy, we anonymize them following Ramponi and Tonelli (2022). We then use the anonymized dataset as the pre-training source for TOXIGEN-CONPROMPT. More details on the process are described in the Ethics Statement section. We use bert-base-uncased as the initial model and further pre-train it on TOXIGEN. We use cosine similarity to calculate the similarity and set τ = 0.03. We use the train subset of TOXIGEN for pre-training, which consists of 250,934 machine-generated statements and 23,322 human-curated prompts with 522 distinct example statements. For pre-training TOXIGEN-CONPROMPT, we set the learning rate to 5e-5, the max sequence length to 64, and the batch size to 256, and train for 5 epochs. We use the representation of [CLS] from BERT as h(·). We utilize 4 NVIDIA RTX 3090 GPUs with a batch size of 64 per device.

Setup
Table 1: Full fine-tuning and probing results of the pre-trained language models. In the full fine-tuning setup (denoted as Full), we train both the encoder and the classifier. In the probing setup (denoted as Probing), we freeze the encoder and train the classifier. TOXIGEN-CONPROMPT consistently outperforms all other pre-trained language models in 11 out of 12 cross-dataset evaluation settings, demonstrating its superior generalization ability.

Cross-dataset Evaluation Implicit hate speech detection is the task of classifying whether a statement is hate or non-hate (i.e., binary classification), where most of the hateful statements are in implicit forms. While one can evaluate a model on the test set of the same dataset that is used for training (i.e., in-dataset evaluation), in-dataset evaluation is considered an unreliable way to evaluate the generalization ability of a model in hate speech detection
due to unintended biases in datasets (Wiegand et al., 2019). A model can achieve high performance on in-dataset evaluation by exploiting unintended biases in a dataset, such as an identity term bias. For example, when the term Asian appears more frequently in samples labeled as hate than in those labeled as non-hate, a model might classify a statement as hate solely based on the presence of the term Asian, resulting in high in-dataset performance. Such performance lacks reliability as an indicator of generalization ability. As a result, cross-dataset evaluation is a common experimental setup in hate speech detection to test the generalization ability of models (Caselli et al., 2020; Nejadgholi and Kiritchenko, 2020; Caselli et al., 2021; Wullach et al., 2021; Ramponi and Tonelli, 2022). In particular, Kim et al. (2022b) proposed a cross-dataset evaluation setup for implicit hate speech detection. They use three implicit hate speech detection datasets, train a model on one of them, and evaluate the trained model on the other two for cross-dataset evaluation. While they compared various fine-tuning approaches using this setup, we follow their setup to evaluate the generalization ability of various pre-trained models on implicit hate speech detection.
Dataset We follow most of the dataset settings in Kim et al. (2022b). We use the following datasets: IMPLICIT HATE CORPUS (IHC) (ElSherief et al., 2021), DYNAHATE (DH) (Vidgen et al., 2021), and SOCIAL BIAS INFERENCE CORPUS-HATE (SBIC-H). Considering the definition of hate speech, instead of using SOCIAL BIAS INFERENCE CORPUS (SBIC) as utilized in Kim et al. (2022b), we use a subset of it (i.e., SBIC-H). Within the SBIC dataset, we treat offensive-labeled samples with a target group as the hate class and non-offensive samples as the non-hate class. We do not use the samples that are labeled as offensive without a target group. This dataset setup is in line with AlKhamissi et al. (2022). Further information regarding the datasets can be found in Appendix C.

Baseline Pre-trained Language Models For a fair comparison between pre-trained models, we use pre-trained models that are based on bert-base-uncased. As baselines, we experiment with three existing pre-trained models with different pre-training sources: 1) BERT; 2) HateBERT; 3) fBERT. Please refer to Appendix D for the details.

Fine-tuning Setup We fine-tune each pre-trained model on a dataset using the cross-entropy loss with binary labels (i.e., hate or non-hate), which is a general fine-tuning approach. We leave the investigation of combining various fine-tuning approaches with the pre-trained models as a future research direction.
For a thorough comparison of the generalization ability of the pre-trained models, we conduct two types of experiments: 1) full fine-tuning and 2) probing. In the full fine-tuning experiment, we fine-tune each pre-trained language model (encoder) with a classifier on top of it. Though it is common practice to fine-tune both the encoder and the classifier, fine-tuning often leads to catastrophic forgetting (McCloskey and Cohen, 1989), which makes the comparison between the pre-trained models indirect. Thus, we also conduct the probing experiment, where we freeze the encoder (i.e., the pre-trained model) and solely train a classifier with one linear layer, enabling a more direct comparison of the pre-trained representations, similarly to the method in Aghajanyan et al. (2021). The implementation details of fine-tuning can be found in Appendix E.
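The probing setup can be sketched as follows, assuming the frozen encoder's sentence representations are available as a fixed feature matrix; `train_linear_probe` is an illustrative helper, not the paper's implementation:

```python
import numpy as np

def train_linear_probe(features, labels, lr=0.5, epochs=300):
    """Probing sketch: the encoder is frozen, so `features` (N, d) are
    fixed inputs; only a single linear layer is trained with the binary
    cross-entropy loss (hate vs. non-hate)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=features.shape[1])
    b = 0.0
    y = np.asarray(labels, dtype=float)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # predicted P(hate)
        grad = (p - y) / len(y)                        # gradient w.r.t. logits
        w -= lr * (features.T @ grad)                  # only the probe is updated
        b -= lr * grad.sum()
    return w, b
```

Because the encoder's parameters never change, differences in probing accuracy reflect differences in representation quality alone.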

Results
The results are shown in Table 1. We focus on the cross-dataset evaluation results, as cross-dataset evaluation is considered a more reliable way to evaluate generalization ability in hate speech detection.

Full Fine-tuning Experiment All pre-trained models show comparable performance across in-dataset evaluation settings. Importantly, the proposed TOXIGEN-CONPROMPT achieves the best performance on 5 out of 6 cross-dataset evaluation settings (with comparable performance on the DH → IHC setting). In particular, on the cross-dataset evaluation settings using the IHC dataset as the training set, TOXIGEN-CONPROMPT outperforms the best-performing existing pre-trained language model by a large margin of 4.54%p (IHC → SBIC-H) and 4.61%p (IHC → DH). This verifies the generalization ability of TOXIGEN-CONPROMPT, gained by learning useful features for implicit hate speech through pre-training with CONPROMPT. We analyze the useful features that TOXIGEN-CONPROMPT has learned in Section 5.2.

Probing Experiment TOXIGEN-CONPROMPT outperforms the other pre-trained language models on in-dataset evaluation. For example, in the IHC → IHC setting, TOXIGEN-CONPROMPT outperforms BERT, HateBERT, and fBERT by 12.88%p, 5.10%p, and 2.21%p, respectively. On all 6 cross-dataset evaluation settings, TOXIGEN-CONPROMPT consistently shows the best performance. For example, using the IHC dataset as the training set, TOXIGEN-CONPROMPT outperforms the best-performing existing pre-trained language model by 11.52%p (IHC → SBIC-H) and 6.70%p (IHC → DH). By consistently outperforming existing pre-trained models while keeping the encoder frozen, TOXIGEN-CONPROMPT clearly demonstrates its superior representation quality. Regarding the DH → IHC setting, while TOXIGEN-CONPROMPT shows the second-best performance in the full fine-tuning experiment, it shows the best performance in the probing experiment by a large margin (a 7.77%p gap with the best-performing existing pre-trained model). We conjecture that catastrophic forgetting during fine-tuning degraded the representation quality of TOXIGEN-CONPROMPT. As a future direction, it would be worth investigating fine-tuning approaches that can better preserve and leverage the high-quality representation of TOXIGEN-CONPROMPT.

Comparison with the MLM Objective
Since the MLM objective is the most common approach to pre-training a language model, we compare our pre-training approach, CONPROMPT, with the MLM objective. We experiment with the two pre-training approaches using TOXIGEN as the pre-training source. That is, we pre-train a model on TOXIGEN solely with the MLM objective on the machine-generated statements and compare this MLM version with TOXIGEN-CONPROMPT. The cross-dataset evaluation results are shown in Table 2 and the in-dataset evaluation results are presented in Appendix G. In the full fine-tuning experiment, CONPROMPT outperforms the MLM objective on 5 out of 6 cross-dataset evaluation settings.

Table 3: Analysis of the identity term bias using the HATECHECK dataset. We measure the false positive rate (FPR) on the neutral or positive statements with identity terms (F18, F19). TOXIGEN-CONPROMPT shows the lowest FPR, successfully mitigating the identity term bias compared to other pre-trained models.
In the probing experiment, CONPROMPT consistently outperforms MLM on all 6 cross-dataset evaluation settings. Overall, these results verify the effectiveness of CONPROMPT over MLM as a pre-training approach for generalization. CONPROMPT also outperforms MLM on DH → IHC, where MLM outperformed CONPROMPT in the full fine-tuning experiment. Since the probing experiment enables a more direct comparison between pre-trained models than the full fine-tuning experiment, we take this as strong evidence for the superior contribution of CONPROMPT over MLM in terms of generalization. We further analyze an additional advantage of CONPROMPT over MLM (i.e., in terms of identity term bias) in Section 5.1.

Identity Term Bias
One of the issues with hate speech detectors is their overreliance on identity terms such as Asian or Black to classify a text as hateful (i.e., identity term bias) (ElSherief et al., 2021; Sen et al., 2022).
A model can learn such unintended biases when there is a high prevalence of instances containing identity terms within the hateful class. We analyze the identity term bias of each pre-trained language model using the HATECHECK dataset (Röttger et al., 2021). There are 315 neutral or positive statements with protected group identifiers in HATECHECK. For example, "I love disabled people." is a positive statement with the identity term disabled people. We evaluate the extent to which each model incorrectly classifies such cases as hate speech. We use each model fine-tuned on one of IHC, SBIC-H, and DH to measure the false positive rate (FPR) on the 315 statements. For each pre-trained model, we report the average FPR of the models fine-tuned on each dataset (Table 3). Overall, TOXIGEN-CONPROMPT shows the lowest FPR (30.88) among the pre-trained models. Interestingly, the other models pre-trained on hate speech-related corpora using the MLM objective (i.e., HateBERT, fBERT, and ToxiGen-MLM) show higher FPR than BERT. Note that these models were further pre-trained from BERT; the further pre-training rather exacerbated the identity term bias. The results indicate that solely employing the MLM objective on a hate speech-related corpus tends to amplify the identity term bias. Furthermore, in contrast to TOXIGEN-CONPROMPT, TOXIGEN-MLM shows the highest FPR (37.14). This highlights the superior effectiveness of CONPROMPT in mitigating the identity term bias compared to MLM.
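The FPR metric used here is simply the fraction of non-hateful statements that a model flags as hate; a minimal sketch (`false_positive_rate` is an illustrative helper name):

```python
def false_positive_rate(preds, labels):
    """FPR on HateCheck-style functional tests: the fraction of
    non-hateful statements (label 0) that the model predicts as
    hate (prediction 1)."""
    # Collect predictions made on the negative (non-hateful) statements.
    negatives = [p for p, y in zip(preds, labels) if y == 0]
    return sum(negatives) / len(negatives) if negatives else 0.0
```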

Representation Quality Regarding Implicit Hate Speeches
We hypothesize that TOXIGEN-CONPROMPT has learned desirable representations regarding implicit hate speech. We conjecture that the model has learned features regarding the target group and toxicity during the pre-training process. We analyze the representation quality of the model in terms of target group and toxicity, utilizing the human-annotated test set of TOXIGEN. Details of the dataset are given in Appendix H. We compare the representations with those of the SimCSE model, which produces high-quality sentence embeddings in the general domain.
In Figure 3, we visualize the representations of the SimCSE model and TOXIGEN-CONPROMPT with t-SNE (van der Maaten and Hinton, 2008).

Figure 3: … Each color represents a target group (13 target groups in total). Toxic statements are plotted with circle-shaped markers, and non-toxic statements with X-shaped markers. We can observe that the samples with the same target group are more densely clustered in TOXIGEN-CONPROMPT.

The samples with the same target group are more closely clustered with TOXIGEN-CONPROMPT than with the SimCSE model. To analyze the representations more deeply with regard to target group and toxicity label, given each statement, we retrieve statements among the 768 statements (excluding the statement itself) based on cosine similarity. When using the SimCSE model, 42.78% of the top-1 retrieval results have the same target group and toxicity label as the query statement. For TOXIGEN-CONPROMPT, a higher proportion (62.03%) of the top-1 retrieval results have the same target group and toxicity label. In Table 4, we present the top-1 retrieval results for two example queries. While TOXIGEN-CONPROMPT consistently retrieves statements with the same target group and toxicity label as the query, the SimCSE model retrieves statements with a different target group or toxicity label. We speculate that this is because TOXIGEN-CONPROMPT has learned relevant features by pulling together machine-generated statements and their origin prompts. Note that prompts are carefully curated to have desirable properties (i.e., all the example statements in a prompt share the same target group and toxicity label). Since target group and toxicity label are both important for implicit hate speech, we believe that TOXIGEN-CONPROMPT has learned desirable representations for implicit hate speech-related tasks.
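The top-1 retrieval analysis can be sketched as follows, assuming statement embeddings and their gold target-group/toxicity annotations are available as arrays; `top1_agreement` is an illustrative helper name:

```python
import numpy as np

def top1_agreement(embeddings, groups, toxic):
    """For each statement, retrieve its nearest neighbour by cosine
    similarity (excluding the statement itself) and report the fraction
    of queries whose neighbour shares both target group and toxicity."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    np.fill_diagonal(sims, -np.inf)   # never retrieve the query itself
    nn = sims.argmax(axis=1)          # index of the top-1 neighbour
    groups, toxic = np.asarray(groups), np.asarray(toxic)
    return float(((groups[nn] == groups) & (toxic[nn] == toxic)).mean())
```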

Ablation Study
We investigate the contribution of two components of CONPROMPT: 1) modifying the SimCSE objective using the membership relation (i.e., modifying Eq. 2 to Eq. 3 by allowing multiple positive samples in a mini-batch); and 2) using MLM as an auxiliary objective. We ablate each component to observe its effectiveness. That is, 1) instead of using the proposed Eq. 3 for a given generated statement g_i, we use Eq. 2 as the objective (denoted as -Membership Relation), and 2) we pre-train a model without the MLM objective (i.e., solely using Eq. 4 given a mini-batch; denoted as -MLM).
We report the cross-dataset evaluation results in Table 5. The in-dataset evaluation results can be found in Appendix I. We observe that ablating the proposed membership relation-based modification (-Membership Relation) leads to performance degradation. The difference between this model (-Membership Relation) and TOXIGEN-CONPROMPT is that, given g_i, the example statements in the prompt P(g_i) other than s_i^pos are considered negative samples in a mini-batch. Since such example statements of the prompt P(g_i) included in a mini-batch would have weakened the pulling strength we intended and resulted in worse generalization ability, this verifies that our idea of pulling a generated statement toward the example statements in its origin prompt is effective for better generalization.
Ablating masked language modeling (-MLM) also results in performance degradation, which implies that leveraging MLM as an auxiliary objective is beneficial for boosting the generalization ability. Considering the large drop when ablating MLM, we conjecture that learning token-level features is important for implicit hate speech detection. While this has been confirmed by many previous works on contrastive learning, we note that we use machine-generated sentences for the MLM objective. We confirm that the effectiveness of using MLM as an auxiliary objective also holds for sentences generated by GPT-3.

Discussion
We propose a pre-training approach that leverages machine-generated data. The use of machine-generated data to pre-train a model requires careful consideration given the unpredictable nature of machines. We emphasize the importance of validating machine-generated data before using it to pre-train a model.
The dataset (i.e., TOXIGEN) we leveraged to pre-train TOXIGEN-CONPROMPT was thoroughly validated in Hartvigsen et al. (2022), including human validation. Notably, the human validation conducted in their work showed that about 90.5% of the machine-generated statements were considered to be written by humans, which demonstrates the high quality of the machine-generated data. Furthermore, 98.2% of the statements in TOXIGEN were considered to be implicit, a proportion higher than that of many other hate speech datasets. You can refer to Hartvigsen et al. (2022) for the detailed validation results of TOXIGEN. The high quality of TOXIGEN confirms its suitability as a pre-training source.
Leveraging machine-generated data led to the superior performance of TOXIGEN-CONPROMPT compared to other pre-trained models. We remark that the other existing pre-trained models were pre-trained on human-generated data. Similarly, recent works have demonstrated the superiority of models trained on machine-generated data over those trained on human-generated data (West et al., 2022; Kim et al., 2022a). For instance, West et al. (2022) trained a commonsense model using a machine-generated knowledge graph and empirically showed the effectiveness of using machine-generated data over human-generated data. Therefore, we believe that developing methodologies for leveraging machine-generated data is a promising direction that can benefit our society more than it poses harm when utilized with care, such as validating machine-generated data before using it. We hope our approach can serve as a significant step in this direction, particularly for implicit hate speech detection.

Conclusions
We have proposed a pre-training strategy, CONPROMPT, to fully leverage machine-generated data. Given a machine-generated sentence, we have cast the idea of utilizing the examples from its prompt as positive samples for contrastive learning. We have presented TOXIGEN-CONPROMPT, a language model pre-trained on a machine-generated implicit hate speech dataset. TOXIGEN-CONPROMPT outperforms various pre-trained hate speech language models, including HateBERT and fBERT, on cross-dataset evaluation, demonstrating its generalization ability on implicit hate speech detection. In addition, TOXIGEN-CONPROMPT is effective in reducing the identity term bias. We have demonstrated that TOXIGEN-CONPROMPT learns desirable features of target group and toxicity in terms of implicit hate speech.

Limitations
Although we have shown the promising performance of TOXIGEN-CONPROMPT, and thus the effectiveness of our pre-training approach (CONPROMPT), through extensive experiments on implicit hate speech detection, there are some limitations. First, we focus on one specific task (i.e., detection) regarding implicit hate speech. For example, there is a generation task which produces the implied meaning of implicit hate speech as an explanation. Since TOXIGEN-CONPROMPT is pre-trained to learn implicit hate speech-related features, TOXIGEN-CONPROMPT, as an encoder, can be adapted to such tasks to further investigate its generalization ability. Second, regarding CONPROMPT, while it can be used with any machine-generated dataset built from example-based prompts, we only show its effectiveness in implicit hate speech detection. CONPROMPT could be tested more broadly with machine-generated datasets in other domains to further validate its effectiveness.

Ethical Considerations
Privacy Issue of the Machine-generated Data Since our pre-training approach uses machine-generated statements as a source for pre-training, we emphasize careful pre-processing of machine-generated samples. As the data generation process in TOXIGEN itself showed, a large language model such as GPT-3 can generate toxic or undesirable information in its content. Before pre-training TOXIGEN-CONPROMPT on TOXIGEN, we found that some private information such as URLs, user or channel mentions, and email addresses exists in the machine-generated statements in TOXIGEN. As mentioned in Section 3, we anonymize such private information following Ramponi and Tonelli (2022). Specifically, we first define the patterns associated with URLs, user or channel mentions, and email addresses. Second, we detect the predefined patterns within the machine-generated statements. Third, we substitute the matched patterns with designated placeholders. In detail, we replace the matched patterns of URLs with '[URL]', user or channel mentions with '[USER]', and email addresses with '[EMAIL]'. We implement the process using the 'sub()' function in the 're' module. We present some example codes that we used for the anonymization in Appendix J. You can refer to our code in the public repository for the full version of the implementation.

Potential Misuse The pre-training source of TOXIGEN-CONPROMPT includes toxic statements. While we utilize such toxic statements on purpose, aiming to pre-train a better model for implicit hate speech detection, the pre-trained model necessitates careful handling. Here, we discuss some behaviors of our model that can lead to potential misuse, so that our model is utilized for the good of society rather than being misused unintentionally or maliciously. (1) As our model was trained with the MLM objective as one of its training objectives, our model might generate toxic statements with its MLM head. (2) As our model learned features regarding the
implicit hate speeches (Section 5.2), our model might retrieve some similar toxic statements given a toxic statement.While these behaviors can be utilized for social good such as constructing training data for hate speech detectors, one can potentially misuse such behaviors.We strongly emphasize the need for cautious handling to prevent unintentional misuse and warn against malicious exploitation of our model.We repeatedly inform and emphasize this when sharing our code and model to prevent any misuse of our model.
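The anonymization procedure described above can be sketched as follows. The regular expressions here are simplified illustrations; the actual patterns in our implementation follow Ramponi and Tonelli (2022) and may differ:

```python
import re

# Placeholder substitutions for private information found in
# machine-generated statements. The email pattern is applied before
# the mention pattern so that the '@' inside an address is not
# mistaken for a user mention. (Patterns are illustrative only.)
PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), "[URL]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"@\w+"), "[USER]"),
]

def anonymize(text: str) -> str:
    """Replace URLs, email addresses, and user/channel mentions
    with designated placeholders."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```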

A Related Work
Hate speech detection is the task of classifying whether a statement includes a hateful expression or not. There have been many works to automatically classify whether a text is hateful using lexicon-based methods (Gitari et al., 2015; Lee et al., 2018; Wiegand et al., 2018) and neural network-based approaches (Gambäck and Sikdar, 2017; Lee et al., 2019; AlKhamissi et al., 2022). In hate speech detection, the generalization issue of a model has been studied actively (Swamy et al., 2019; Caselli et al., 2020; Nejadgholi and Kiritchenko, 2020; Caselli et al., 2021; Wullach et al., 2021; Ramponi and Tonelli, 2022; Kim et al., 2022b). Although a model might show somewhat satisfactory performance on in-dataset evaluation, this performance is overestimated since the model exploits undesirable biases or spurious correlations as shortcuts (Arango et al., 2019; Wiegand et al., 2019). Thus, cross-dataset evaluation across different datasets has been considered a more reliable way to evaluate the generalization ability of a model on hate speech detection. Kim et al. (2022b) reported that existing pre-trained language models suffer from performance degradation in cross-dataset evaluation settings for implicit hate speech detection.
Many works have studied ways to improve the generalization ability of hate speech detectors. Wullach et al. (2021) proposed augmenting a downstream dataset with machine-generated samples. While both our work and Wullach et al. (2021) leverage machine generation to improve the generalization ability of a model, there are a few differences: 1) they propose data augmentation using machine generation, whereas we propose a training strategy that can fully utilize machine-generated samples; 2) their approach fine-tunes a model on each downstream task, while our approach pre-trains a model. As for the generalization ability of implicit hate speech detectors, Kim et al. (2022b) proposed using the implied meaning of a hateful comment in the fine-tuning stage. Thus, their approach can only be applied to downstream datasets that have human-annotated implied meanings. On the other hand, once pre-trained using CONPROMPT, TOXIGEN-CONPROMPT shows improved generalization ability across several downstream datasets without such additional resources for the fine-tuning process.

B Statistics of ToxiGen Training Portion
The statistics of the TOXIGEN training portion used to pre-train TOXIGEN-CONPROMPT are shown in Table 6. We note that the 'Target Group' and 'Toxicity Label' of the generated statements are those of their origin prompts, used as proxies.
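The proxy labeling above can be sketched as follows; the field names here are illustrative, not the exact TOXIGEN schema:

```python
# Generated statements inherit the target group and toxicity label of
# their origin prompt as proxy labels.
# (Field names are illustrative, not the actual TOXIGEN schema.)

def attach_proxy_labels(generations, prompts):
    """Copy the origin prompt's labels onto each generated statement."""
    labeled = []
    for g in generations:
        meta = prompts[g["origin_prompt"]]
        labeled.append({**g,
                        "target_group": meta["target_group"],
                        "toxicity": meta["toxicity"]})
    return labeled
```

A statement generated from a toxic prompt targeting a given group is thus labeled with that group and toxicity regardless of its own content, which is why we refer to these labels as proxies.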
C Dataset for Fine-tuning

D Details of Existing Pre-trained Language Models
• BERT (Devlin et al., 2019) is pre-trained on a large corpus from BookCorpus and Wikipedia. We note that these sources cover generic domains rather than hate speech-related domains.
• HateBERT (Caselli et al., 2021) is further pre-trained from BERT with the MLM objective on 1,478,348 comments from banned communities on Reddit.
• fBERT (Sarkar et al., 2021) is also a further pre-trained BERT with the MLM objective on 1.4M comments from the offensive language dataset SOLID.

E Implementation Details of Fine-tuning
For the full fine-tuning experiment, for each dataset, we fine-tune pre-trained models for 6 epochs with batch size 8 and search the learning rate among {5e-6, 1e-5, 2e-5, 3e-5, 5e-5} following Kim et al. (2022b). We validate each fine-tuned model at the end of every epoch and use the model with the best validation macro F1 score on in-dataset evaluation to report the results. We use these settings for Section 5 as well.
For the probing experiment, we freeze the encoder and only train a linear classifier for 30 epochs with batch size 8. We search the learning rate among {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}. We validate each model at the end of every epoch and use the model with the best validation macro F1 score on in-dataset evaluation.
The reported results are the macro F1 scores on the test set of each dataset, and we average the results of 5 fine-tuned models with different random seeds (0, 1, 2, 3, 4). We use one NVIDIA RTX 3090 GPU to fine-tune each model. For each pre-trained language model, we use its pre-trained weights for the encoder and randomly initialize the classifier.
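The checkpoint selection described above can be sketched as follows. This is a simplified, framework-free illustration (not our actual training code): `macro_f1` is computed from scratch, and `candidates` stands in for the validation predictions collected during the learning-rate search.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 over the label set, computed from scratch."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

def select_best(candidates):
    """Pick the setting (e.g., a learning rate/epoch pair) with the
    highest validation macro F1; `candidates` maps each setting to
    its (y_true, y_pred) validation outputs."""
    return max(candidates, key=lambda k: macro_f1(*candidates[k]))
```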

F Implementation Details of Pre-training
We described the implementation details of TOXIGEN-CONPROMPT in Section 3. Here, we describe the implementation details of the variants of TOXIGEN-CONPROMPT, such as the model pre-trained only with the MLM objective (Section 4.3) and the ablated versions of TOXIGEN-CONPROMPT (Section 5.3). While keeping other settings the same, we pre-trained the model only with the MLM objective for 25 epochs because we empirically found that more epochs are required for better performance with the MLM objective. For the ablated versions of TOXIGEN-CONPROMPT examined in the analysis, we used the same pre-training hyperparameters as for TOXIGEN-CONPROMPT.

G Results of Comparison with the MLM Objective
We report the results of the different pre-training approaches, including the in-dataset evaluation results, in Table 8.

H Details on Representation Quality Regarding Implicit Hate Speeches
Given the 940 samples in the human-annotated test set of TOXIGEN, we only use the ones where half or more of the elements in the 'predicted_group' list refer to the value in 'target_group'. In addition, we calculate the maximum of 'toxicity_ai' and 'toxicity_human' and discard the sample if this value is 3 (since it can be considered ambiguous). If the value is greater than 3, we use 'toxic' as the toxicity label, and if it is less than 3, we use 'non-toxic'. Finally, we use a total of 769 samples for the analysis in Section 5.2. In Table 9, we present the top-3 retrieval results for the two example queries in Table 4.
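The filtering above can be sketched as follows. This is a minimal illustration using the field names mentioned in the text; in particular, checking agreement via exact string equality between a 'predicted_group' entry and 'target_group' is a simplifying assumption.

```python
def filter_sample(sample):
    """Return the toxicity label for a usable sample, or None to discard it.

    A sample is kept only if (1) at least half of the entries in
    'predicted_group' refer to the annotated 'target_group' (modeled here
    as exact equality, a simplification), and (2) the maximum of
    'toxicity_ai' and 'toxicity_human' is not 3 (ambiguous).
    """
    groups = sample["predicted_group"]
    matches = sum(g == sample["target_group"] for g in groups)
    if matches * 2 < len(groups):
        return None
    score = max(sample["toxicity_ai"], sample["toxicity_human"])
    if score == 3:
        return None
    return "toxic" if score > 3 else "non-toxic"
```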

I Results of Ablation Study
We report the results of the ablation study, including the in-dataset evaluation results, in Table 10.

J Example Codes for Anonymization
We show the example code we used to anonymize private information such as email addresses, URLs, and user/channel mentions in Figure 5. We implemented it following Ramponi and Tonelli (2022). You can refer to the full version of the code in our public repository.

K Scalability of CONPROMPT

We note that CONPROMPT leverages a machine-generated dataset, which is the result of feeding prompts with some examples to GPT-3. One can continuously increase the amount of data by showing some examples of the target domain (in our case, toxic/non-toxic statements toward minority groups) to GPT-3. Likewise, CONPROMPT can further improve the performance of TOXIGEN-CONPROMPT by enlarging the pre-training dataset (TOXIGEN) with GPT-3. We conduct an experiment varying the size of the pre-training dataset (from 50K to 250K) to simulate the scenario in which GPT-3 continuously generates hateful/benign statements. Figure 4 shows the cross-dataset evaluation results when fine-tuning TOXIGEN-CONPROMPT on the IHC dataset. We observe that performance improves consistently as the number of generated statements in TOXIGEN increases, which suggests that our pre-training strategy, CONPROMPT, is scalable to continuously generated hateful/benign samples. We empirically observe similar trends in the experiments with the other datasets as the training set. CONPROMPT shows its scalability on four out of six cross-dataset evaluations (except for SBIC-H → IHC and DH → IHC). From these results, we can expect further performance gains by generating more statements with GPT-3. We report the full results of the scalability experiment in Table 11.

We also emphasize that the membership relation we use (whether an example statement s_i is an element of the origin prompt P(g_i) of a machine-generated statement g_i) can be naturally obtained in the process of machine generation. That is, one only needs to keep track of which prompt was used to generate each statement. Thus, we believe CONPROMPT is easy to deploy for any machine-generation process with example-based prompting, and we look forward to its potential usage.
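Keeping this membership relation and deriving positive pairs for contrastive learning can be sketched as follows; the data structures here are illustrative, not our actual bookkeeping code:

```python
# During generation, record which prompt produced each statement:
# `origin` maps a generated statement to its prompt id, and
# `prompt_examples` maps a prompt id to its example statements.

def positive_pairs(origin, prompt_examples):
    """Pair each generated statement with every example statement
    in its origin prompt (the positive samples for contrastive
    learning)."""
    return [(g, s)
            for g, pid in origin.items()
            for s in prompt_examples[pid]]
```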

Figure 2 :
Figure 2: An overview of the proposed pre-training approach, CONPROMPT. The gray box area is the machine-generation process used to generate the TOXIGEN dataset (Hartvigsen et al., 2022). Given a machine-generated statement (on the right side of the gray box), the example statements in its origin prompt (on the left side of the gray box) are considered positive samples for contrastive learning. The pre-training process would enable a model to learn some useful features relevant to implicit hate speeches, such as target group and toxicity.

Figure 3 :
Figure 3: Visualization of the representations of toxic and non-toxic statements in the TOXIGEN test set using t-SNE. Each color represents a target group (13 target groups in total). Toxic statements are plotted with circle-shaped markers, and non-toxic statements with X-shaped markers. We can observe that samples with the same target group are more densely clustered in TOXIGEN-CONPROMPT.

Figure 4 :
Figure 4: Scalability experiment using the IHC dataset as the training set. The x-axis indicates the number of machine-generated statements in the TOXIGEN dataset used for pre-training, and the y-axis indicates the F1 score of each model.

Table 2 :
Comparison of pre-training approaches on the cross-dataset evaluation settings. The proposed CONPROMPT consistently outperforms the MLM objective. This indicates the superior contribution of CONPROMPT in improving the generalization ability of models.
Query: if someone has a physical disability, they are lamer than anybody else (folks with physical disabilities, toxic)

Table 4 :
The top-1 retrieved statements with the SimCSE model and TOXIGEN-CONPROMPT. The % in bold indicates the percentage of top-1 retrieval results that have the same target group and toxicity as the query.
TOXIGEN-CONPROMPT shows a much higher proportion (62.03%), which is a desirable feature for tasks relevant to implicit hate speeches.

Table 5 :
Ablation study to investigate the contribution of the two components in CONPROMPT. We report the cross-dataset evaluation results. The performance degradation when ablating each component demonstrates the effectiveness of each component. SH in the table denotes SBIC-H.

Table 6 :
The statistics of the TOXIGEN training portion.