Probing LLMs for hate speech detection: strengths and vulnerabilities

Recently, efforts have been made by social media platforms as well as researchers to detect hateful or toxic language using large language models. However, none of these works aim to use explanations, additional context or victim community information in the detection process. We utilise different prompt variations and input information to evaluate large language models in the zero-shot setting (without adding any in-context examples). We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets - HateXplain, implicit hate and ToxicSpans. We find that, on average, including the target information in the pipeline improves the model performance substantially (~20-30%) over the baseline across the datasets. Adding the rationales/explanations to the pipeline also has a considerable effect (~10-20% over the baseline) across the datasets. In addition, we provide a typology of the error cases where these large language models fail to (i) classify and (ii) explain the reason for the decisions they take. Such vulnerable points automatically constitute 'jailbreak' prompts for these models, and industry-scale safeguard techniques need to be developed to make the models robust against such prompts.


Introduction
Abusive language has become a perpetual problem in today's online social media. An ever-increasing number of individuals are falling prey to online harassment, abuse and cyberbullying, as established in a recent study by Pew Research (Vogels, 2021). In the online setting, such abusive behaviours can lead to traumatization of the victims (Vedeler et al., 2019), affecting them psychologically. Furthermore, widespread usage of such content may lead to increased bias against the target community, making violence normative (Luft, 2019). Many gruesome incidents, like the mass shooting at the Pittsburgh synagogue, the Charlottesville car attack, etc., have been caused by perpetrators consuming/producing such abusive content.
In response to this, social media platforms have implemented moderation policies to reduce the spread of hateful/abusive content. One important step in content moderation is the filtering of abusive content. A common way to do this is to train language models on human-annotated content for classification. However, this approach is challenging: it requires heavy resources in terms of labour and expertise to annotate hateful content. The exercise also exposes the annotators to a wide array of hateful content, which is almost always psychologically very taxing. Therefore, many recent works have tried to understand whether large language models (LLMs) can be used for detecting such abusive language (Huang et al., 2023; Ziems et al., 2023), but none of these study the role of additional context as input (output) to (from) such LLMs.
We, for the first time, introduce several prompt variations and input instructions to probe two of the LLMs (GPT-3.5 and text-davinci) across three datasets - HateXplain (Mathew et al., 2021), implicit hate (ElSherief et al., 2021) and ToxicSpans (Pavlopoulos et al., 2022). Note that all three datasets contain ground truth explanations in the form of either rationales (Mathew et al., 2021; Pavlopoulos et al., 2022) or implied statements (ElSherief et al., 2021) that tell why an annotator took a particular labelling decision. In addition, two of the datasets also contain information about the target/victim community against whom the hate speech was hurled. In particular, we design prompts that contain (a) only the hate post as input and a query for the output label (vanilla), (b) the post as well as the definition of hate speech as input and a query for the output label, (c) the post (-/+ definition) and the target community information as input and a query for the output label, (d) the post (-/+ definition) and the explanation as input and a query for the output label, (e) the post (-/+ definition) as input and a query for the output label and the target community, and (f) the post (-/+ definition) as input and a query for the output label and the explanation. We record the performance of all these approaches and also identify the most confusing cases. In order to facilitate future research, we further provide a typology of the error cases where the LLMs fail to classify and usually provide poor explanations for the classification decisions taken. This typology naturally constitutes the 'jailbreak' prompts to which such LLMs are vulnerable, thus pointing to the exact directions in which industry-scale safeguards need to be built.
We make the following observations.
• In terms of vanilla prompts (case (a)), we find that flan-T5-large performs the best among the three models; we also observe that text-davinci-003 is better than gpt-3.5-turbo-0301, although the latter is a more recent version.
• Our proposed prompt strategies individually benefit the LLMs in most cases. The prompt with the target community in the pipeline gives the best performance, with ∼ 20 − 30% improvements over the vanilla setup. None of the LLMs benefit when multiple prompt strategies are combined.
• In a detailed error analysis, we find that misclassification of the non-hate/non-toxic class is the most common error for the implicit hate and ToxicSpans datasets, while for the HateXplain dataset the majority of misclassifications are from the normal to the offensive class. There are also a large number of cases where the models confuse the hate speech and the offensive class.
• From the typology induced from the error cases, we find many interesting patterns. These LLMs make errors due to the presence of sensitive or controversial terms in otherwise non-hateful posts. Posts with negation words or words expressing support for a community are misclassified as hateful. Ideological posts and posts containing opinions or fact-check information about news articles are often misclassified as toxic/hateful. On the other hand, many offensive/hateful posts are marked normal by the models, either due to a vocabulary gap or the presence of unknown or polysemous words. Similarly, these models fail to classify implicitly toxic posts and mark them as non-toxic.

We make our codes and resources used for this research publicly available for reproducibility purposes.

Related works
Large language models: Based on the training setup, the model architecture and the use cases, LLMs can be broadly classified into encoder-only, encoder-decoder and decoder-only types. In recent years, decoder-only LLMs have seen a huge surge with industry-scale releases like ChatGPT, GPT-4, Bard, LLaMA, etc. Decoder-only LLMs have been used for benchmarks like GLUE (Zhong et al., 2023) where there is no downstream application involved. In the classification setting, LLMs have been extensively used for sentiment analysis (Zhang et al., 2023b). They have also been heavily used in NLI and QA tasks on multiple datasets (Chowdhery et al., 2022). In the generation setting, LLMs have found applications in summarization (Zhang et al., 2023a), machine translation (Chowdhery et al., 2022) and open-ended generation (Brown et al., 2020).
LLMs for hate speech detection: In (Zhu et al., 2023), the authors use ChatGPT to relabel various datasets, one of which is on hate speech detection, and find that the agreement with human annotations is still quite poor. The authors in (Li et al., 2023) use ChatGPT to classify a comment as harmful (i.e., hateful, offensive, or toxic - HOT) and find that the model is better at identifying non-HOT comments than HOT comments. Finally, in (Huang et al., 2023) the authors attempt to classify implicit hate speech using ChatGPT. However, their prompt was framed as a 'yes/no' question (rather than based on the exact classes, i.e., implicit hate, explicit hate, non-hate as in the original study (ElSherief et al., 2021)), which makes the problem lose its original fervour.

Datasets and metrics
For all these datasets, we utilise (create) a test dataset for our experiments.
Implicit hate dataset: The implicit hate (ElSherief et al., 2021) corpus is a specialized collection of data aimed at detecting hate speech. It provides detailed labels (implicit_hate, explicit_hate or not_hate) for each message, including information about the implied meaning behind the content. The dataset comprises 22,056 tweets sourced from major extremist groups in the United States.
We test on a subset of 2147 samples (108 entries labelled explicit_hate, 710 entries labelled implicit_hate, and 1329 entries labelled not_hate) which are sampled from the entire dataset in a stratified fashion. Note that we do not have explanations and targets for all the posts. The implied statements and targets are available only for the samples labelled implicit_hate. Hence, when we experiment with explanations as input (see section 5), we pass the input post itself as the implied statement for the explicit_hate and not_hate data points; neither of these categories requires an additional implied statement as there is nothing implied in them. In the case of targets as inputs (see section 5), we remove the explicit_hate data points since they target some victim community but the targets are not present in the annotated dataset. The targets for not_hate are set as 'none'.
HateXplain dataset: HateXplain (Mathew et al., 2021) is a benchmark dataset specifically designed to address bias and explainability in the domain of hate speech. It provides comprehensive annotations for each post, encompassing three key perspectives: classification (hate speech, offensive, or normal), the targeted community, and the rationales - the specific sections of a post that influenced the labelling decision (hate, offensive, or normal). We test on the already released test dataset containing 1924 samples (594 entries labelled as hate speech, 782 entries labelled as normal, and 548 entries labelled as offensive). Note that we do not have rationales for the normal posts. In the explanations-as-input experiments (see section 5), the complete post (tokenized) is taken as the rationale for the normal posts.
ToxicSpans dataset: The ToxicSpans (Pavlopoulos et al., 2022) dataset is a subset (containing 11,006 samples labelled toxic) of the Civil Comments dataset (1.2M posts). The dataset also contains the toxic spans, i.e., the regions of the texts found toxic. Not all the posts have toxic spans annotated. We create a test set of 2000 samples by picking 1000 samples labelled toxic from this dataset (where the toxic spans were available), and 1000 samples labelled non-toxic from the Civil Comments dataset (Borkan et al., 2019). Note that we have the spans/rationales marked only for the toxic data points. For the non-toxic posts, in the explanations-as-input experiments (see section 5), the complete post (tokenized) is taken as the rationale.
Metrics: For primary evaluation, we rely on classification performance. We use precision, recall, accuracy and macro F1-score, which are all standard metrics.
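Since class sizes differ substantially across our test sets, macro F1 (which weights every class equally) is the metric we primarily report. A minimal sketch of the computation:

```python
def macro_f1(gold, pred):
    """Macro F1: compute per-class precision/recall/F1 and average the
    per-class F1 values, weighting every class equally regardless of
    its frequency in the test set."""
    classes = sorted(set(gold) | set(pred))
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

In practice a library implementation (e.g., scikit-learn's `f1_score` with `average='macro'`) computes the same quantity; the sketch only makes the class-balanced averaging explicit.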
The data points which have rationales are additionally evaluated using generation metrics. For the natural language explanations (implicit hate dataset) we use BERTScore averaged over all data points. BERTScore (Zhang et al.) computes a similarity score for each token in the candidate sentence with each token in the reference sentence; however, instead of exact matches, it computes token similarity using contextual embeddings. For the extractive explanations (HateXplain/ToxicSpans datasets), we use the average sentence-BLEU (Papineni et al., 2002) score, which is standard among the generation metrics.
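For reference, sentence-BLEU combines clipped n-gram precision (n = 1..4) with a brevity penalty. A simplified, add-one-smoothed sketch (not the exact NLTK implementation, which offers several smoothing variants) is:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU (Papineni et al., 2002):
    clipped n-gram precision for n = 1..max_n, geometric mean,
    brevity penalty. Add-one smoothing keeps a single missing
    n-gram order from zeroing the whole score."""
    ref, cand = reference.split(), candidate.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(ngrams(cand, n))
        ref_ngrams = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1))
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec / max_n)
```

An identical candidate and reference score 1.0, and any divergence lowers the score, which is the behaviour we rely on when ranking generated rationales against the ground truth.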

Models
For our experiments we utilise three LLMs. Two of these are from the proprietary GPT-3.5 model series while the third is an open-source one from the T5 series.
GPT-3.5 models improve over their predecessor GPT-3 (Brown et al., 2020). Both of the models we use are highly advanced language models capable of generating human-like text based on the provided prompts, but they differ in some key ways. As per their documentation, GPT-3 was optimized on code completion tasks to create code-davinci-002. This was further improved using instruction finetuning (Ouyang et al., 2022) to create text-davinci-002, which was later upgraded to text-davinci-003, trained on a larger dataset, making it better at higher-quality text generation, following instructions and handling longer context. gpt-3.5-turbo-0301 is an improvement over text-davinci-003. We choose these two models for our study, namely gpt-3.5-turbo-0301 and text-davinci-003.
The third model we use is the open-source flan-T5-large, an instruction-finetuned variant of the popular T5 model (Chung et al., 2022). As per their documentation, the model was instruction-finetuned with an emphasis on scaling the number of tasks, scaling the model size and introducing chain-of-thought data in the finetuning pipeline. The authors claim that this particular sort of finetuning has benefited the T5 models greatly, allowing them to outperform much larger models like GPT-3.

Prompts
In this section, we list the prompt variations used in this work.A concise summary of the different variants is noted in the Appendix A (Table 7) and the details for each are discussed in the subsections below.

Dataset        Label list
HateXplain     normal, offensive or hate speech
Implicit hate  explicit_hate, implicit_hate, or not_hate
ToxicSpans     toxic or non_toxic

Table 1: The list of labels for each dataset.
Vanilla prompts: In this category, we use a prompt template where we ask the model to classify the given post into a label out of a list_of_labels. In addition, we also provide a few example_outputs (one class per line) to help the models generate a properly formatted answer. The list_of_labels for each dataset is noted in Table 1.
(+) definitions: In the vanilla prompts we assume that the LLMs are, to an extent, aware of the labels for classification. Here, we provide the definitions as additional context to the LLMs, which can help them understand the classification task better. These definitions are added as a list where each label's definition is separated by a new line. We denote this prompt template as Vanilla + Defn in Appendix A, Table 7. Individual definitions for the tasks are added in Appendix B.
(+) explanations: Recently, there has been a huge interest in developing explainable deep learning models where the prediction decision is supported by an explanation (Bhatt et al., 2020). We test two hypotheses - (a) whether providing explanations to LLMs as inputs (corresponding to the templates Vanilla + Exp (input) and Vanilla + Defn + Exp (input)) improves their labelling decisions, and (b) whether asking LLMs for an explanation of their labelling decision forces them to predict better labels as well as generate relevant explanations (corresponding to the templates Vanilla + Exp (output) and Vanilla + Defn + Exp (output)). For the HateXplain and ToxicSpans datasets the ground truth explanations are in the form of rationales (i.e., the part(s) of the post that the annotator marked as the reason for his/her labelling decision). For the implicit hate dataset the ground truth explanations are in the form of implied statements. In the templates Vanilla + Exp (output) and Vanilla + Defn + Exp (output) we use two variables, explanation_type and explanation_format (see Appendix A, Table 7). For each dataset, we note their values below.
• HateXplain: Here, explanation_type is "extract the words from the post that you found as hate speech or offensive" and explanation_format is "the list of extracted words, separated by ". Enclose the list with <<< >>>".
• ToxicSpans: Here, explanation_type is "extract the words from the post that you found as toxic" and explanation_format is "the list of extracted words, separated by ". Enclose the list with <<< >>>".
• Implicit hate: Here, explanation_type is "with an explanation in 15 words" and explanation_format is "the explanation enclosed in <<< >>>".

In the templates Vanilla + Exp (input) and Vanilla + Defn + Exp (input) we use a single variable, explanation (see Appendix A, Table 7). For each dataset, we note its value below.
• HateXplain: explanation is "the rationales {rationales} as an explanation".
• ToxicSpans: explanation is "the span {span} as an explanation".

(+) targets: A very important piece of information in any hate speech detection pipeline is the victim community that the abusive/hate speech targets. Here again we test two hypotheses - (a) whether providing the target to LLMs as input (corresponding to the templates Vanilla + Tar (input) and Vanilla + Defn + Tar (input)) improves their labelling decisions, and (b) whether asking LLMs for the target information forces them to predict better labels as well as generate correct targets (corresponding to the templates Vanilla + Tar (output) and Vanilla + Defn + Tar (output)). In the templates Vanilla + Tar (output) and Vanilla + Defn + Tar (output) we use two variables, target_type and target_format (see Appendix A, Table 7). For all the datasets, we replace the variable target_type with "also mention which group of people does it target" and the variable target_format with "list targeted groups enclosed in <<< >>>". In the templates Vanilla + Tar (input) and Vanilla + Defn + Tar (input) we replace the variable targets (see Appendix A, Table 7) with the ground truth target community information, which is only applicable for the HateXplain and implicit hate datasets.
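The prompt variants above differ only in which optional pieces (definitions, explanation/target as input, explanation/target queried as output) are concatenated around the post. A hypothetical sketch of how such templates could be assembled (the variable names and exact wording are illustrative, not the authors' verbatim templates from Appendix A, Table 7):

```python
def build_prompt(post, labels, definitions=None, explanation=None, target=None,
                 ask_explanation=False, ask_target=False):
    """Illustrative assembly of the prompt variants described above.
    `definitions` maps each label to its definition; `explanation` and
    `target` are ground-truth context supplied as input; the `ask_*`
    flags request that piece of information in the output instead."""
    parts = [f"Classify the following post into one of: {', '.join(labels)}."]
    if definitions:  # Vanilla + Defn
        parts.append("Label definitions:\n" + "\n".join(
            f"- {lab}: {d}" for lab, d in definitions.items()))
    if explanation:  # Exp (input): rationale or implied statement
        parts.append(f"Consider {explanation} while deciding.")
    if target:       # Tar (input): victim community
        parts.append(f"The post targets: {target}.")
    parts.append(f"Post: {post}")
    if ask_explanation:  # Exp (output)
        parts.append("Also extract the words you found toxic, enclosed in <<< >>>.")
    if ask_target:       # Tar (output)
        parts.append("Also mention which group of people it targets, enclosed in <<< >>>.")
    return "\n".join(parts)
```

Combinations (e.g., Vanilla + Defn + Tar (input)) then correspond simply to passing several optional arguments at once.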

Results
In this section, we note the results of the different prompt variations on the models for the three datasets. As a baseline, we consider BERT-HateXplain (Mathew et al., 2020) and run the model on the implicit hate and ToxicSpans datasets.
We take the results for the HateXplain dataset from the original paper. Table 2 shows the results of this baseline.
Our key results are noted in Tables 3, 4 and 5 for the HateXplain, implicit hate and ToxicSpans datasets, respectively.
(+) definitions: Adding definitions to the prompts does not always help in improving the performance.
In terms of F1-score, for HateXplain, we see an improvement of 25.64% for gpt-3.5-turbo-0301 while the performance worsens by 17.78% for text-davinci-003 and 20.34% for flan-T5-large. The situation is reversed for the ToxicSpans dataset, where we see an improvement of 1.40% for text-davinci-003 and 13.75% for flan-T5-large while it worsens by 7.35% for gpt-3.5-turbo-0301. For ToxicSpans, adding definitions to the input prompt gives the best performance across all the prompting strategies. For implicit hate, there is an improvement of 12.50%, 16.67% and 4.76% for gpt-3.5-turbo-0301, text-davinci-003 and flan-T5-large respectively.
(+) explanations: As discussed earlier, we exploit the power of explanations/rationales in two ways. For the case where we ask the model to generate explanations along with the label in the output, we see a trend similar to adding definitions. In terms of F1-score, for HateXplain, we see an improvement of 23.07% for gpt-3.5-turbo-0301 while the performance worsens by 8.89% for text-davinci-003 and 10.17% for flan-T5-large. The situation is similar for the ToxicSpans dataset: we see comparable performance for gpt-3.5-turbo-0301, while it worsens by 9.86% for text-davinci-003 but improves by 18.84% for flan-T5-large. For implicit hate, there is an improvement of 6.25% and 22.22% for gpt-3.5-turbo-0301 and text-davinci-003 respectively, but performance worsens by 11.11% for flan-T5-large. We further compare the generated explanations with the available ground truth explanations using BERTScore for the implicit hate dataset and sentence-BLEU for the other two datasets. We observe that the gpt-3.5-turbo-0301 model generates better explanations (averaged over all data points) than the text-davinci-003 model for the HateXplain and implicit hate datasets; for the ToxicSpans dataset the results are reversed.
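Scoring generated explanations first requires pulling the <<< >>>-enclosed span out of the model response, since our output formats ask for the explanation (or target) to be enclosed that way. A minimal sketch of such a parser (ours, not a library API; the exact delimiter handling is an assumption about well-behaved model output):

```python
import re

def extract_enclosed(output):
    """Pull the <<< ... >>>-enclosed explanation/target from a model
    response. Returns None when the model ignored the requested
    format, a failure mode that must be counted separately rather
    than scored."""
    m = re.search(r"<<<(.*?)>>>", output, flags=re.DOTALL)
    return m.group(1).strip() if m else None
```

The extracted span is then compared against the ground truth rationale with sentence-BLEU (HateXplain/ToxicSpans) or against the implied statement with BERTScore (implicit hate).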
(+) targets: We exploit the power of targets again in two ways. For the case where we ask the model to generate the target along with the label in the output, we again see a trend similar to adding definitions. In terms of F1-score, for HateXplain, we see an improvement of 28.20% for gpt-3.5-turbo-0301 while the performance worsens by 15.56% for text-davinci-003 and 11.86% for flan-T5-large. The situation is similar for the ToxicSpans dataset: we see comparable performance for gpt-3.5-turbo-0301, while it worsens by 11.27% for text-davinci-003 and 18.84% for flan-T5-large.
For the case where we use the targets as inputs in the prompt, we see improvement for both the datasets where the target is present in the ground truth, except for flan-T5-large. In terms of F1-score, for HateXplain, we see an improvement of 33.33% for gpt-3.5-turbo-0301 and 26.67% for text-davinci-003, and a drop in performance by 6.78% for flan-T5-large. For the implicit hate dataset, the ground truth targets are present only for the implicit hate data points. Thus, while comparing with the Vanilla setup, we only consider the implicit hate data points to have a fair comparison. The revised F1-scores for the Vanilla setup become 0.52 for gpt-3.5-turbo-0301 and 0.68 for text-davinci-003. Comparison with these revised values shows an improvement of 17.30% for gpt-3.5-turbo-0301, while for text-davinci-003 the results remain roughly unchanged. In the case of flan-T5-large the performance drops by 9.52%. For ToxicSpans we do not have the targets annotated and thus we skip this experiment for this dataset. Overall, targets as inputs outperform most of the other prompt strategies across both these models and the two datasets. This leads us to believe that future annotation exercises for training hate speech detection models should almost always benefit if the target information is also annotated.
Combinations: Next, we evaluate the cases where we utilise definitions along with another additional prompt strategy (target/explanation at input/output). Adding explanations either at input or output along with definitions does not help the models much: the performance either remains comparable (e.g., for gpt-3.5-turbo-0301 on HateXplain, (+) explanation (input) with definition vs. (+) explanation (input)) or worsens (e.g., for gpt-3.5-turbo-0301 on HateXplain, (+) explanation (output) with definition vs. (+) explanation (output)). The quality of the explanations generated (in terms of average BERTScore or sentence-BLEU) is almost always worse in the presence of the definitions. A similar situation arises when we add definitions with targets (input/output), where the performance either remains comparable or worsens. Only in the case of the ToxicSpans dataset do we see that adding both explanation (input) and definition gives the best performance for the gpt-3.5-turbo-0301 and flan-T5-large models.
Cases of misclassification: For the implicit hate dataset we observe that, in the case of the gpt-3.5-turbo-0301 model, for all the different prompt variants the largest number of misclassifications is from the non-hate to the implicit hate class.
For the text-davinci-003 model the major observations are as follows. In the vanilla setup (with or without definition), the implicit hate class is most often confused with the explicit hate class. However, if the prompt has an explanation component (either at input or at output), once again there is a large number of misclassifications from the non-hate to the implicit hate class. For flan-T5-large we observe that the model fails to classify the implicit hate class: implicit hate is classified as non-hate or explicit hate, following the trend of the other two models.
For the HateXplain dataset we observe that, in the case of the gpt-3.5-turbo-0301 model, for all the different prompt variants the largest number of misclassifications is from the normal to the offensive class. Another confusion that the model often faces is between the hate and the offensive class; if the prompts do not contain the definitions (of hate/offensive speech), then offensive speech is largely mislabelled as hate speech, while the results are exactly reversed if the prompts contain the definitions. For the text-davinci-003 model, once again we observe that for all the different prompt variants the largest number of misclassifications is from the normal to the offensive class. Further, irrespective of whether the prompts contain the definitions or not, hate speech is heavily mislabelled as offensive speech. For flan-T5-large, both the offensive and hate speech classes are misclassified as normal speech for all prompt variations. For the ToxicSpans dataset, in the case of the gpt-3.5-turbo-0301 model, the non-toxic data points are heavily misclassified as toxic for all prompt variants. Curiously, adding definitions to the prompts increases this misclassification. These observations remain similar for the text-davinci-003 model. For flan-T5-large, non-toxic data points are misclassified as toxic for all the prompt variations except when the target is asked to be generated, in which case toxic data points are misclassified as non-toxic.
In order to better elicit the above observations, we present the confusion matrices for the best prompt combinations for each model and each dataset in Appendix C.
Typology induction: In order to analyse the data points where the models are most vulnerable, we employ the following heuristics for each dataset. In these heuristics we always consider the Vanilla + Defn + Exp (output) setup.
For the implicit hate dataset we sort the data points in non-decreasing order based on the BERTScore between the ground truth implied statement and the generated explanation. Starting from the top of this list, we consider the data points that are either not_hate in the ground truth and misclassified as implicit_hate, or vice versa, by all three models. Note that these data points constitute the cases where the models misclassify and provide poor explanations for their classification decisions. These data points are then passed through LDA (Jelodar et al., 2019) (number of topics, K = 3) to induce the types.
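The per-dataset selection heuristic can be sketched as follows (the field names are hypothetical, not the authors' code; the score is BERTScore or sentence-BLEU depending on the dataset):

```python
def select_vulnerable_points(data, k=80):
    """Sketch of the error-selection heuristic described above: sort by
    explanation-quality score in non-decreasing order (worst-explained
    points first), keep only the points that all three models
    misclassified, and retain up to k of them for topic modelling."""
    ranked = sorted(data, key=lambda d: d["score"])
    vulnerable = [d for d in ranked
                  if all(p != d["gold"] for p in d["predictions"])]
    return vulnerable[:k]
```

The retained points are the ones where the models both mislabel the post and justify the label poorly, which is exactly the joint failure mode the typology targets.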
For the HateXplain and ToxicSpans datasets we sort the data points in non-decreasing order based on the sentence-BLEU scores between the ground truth rationales and the generated rationales. We select 80 low-scoring data points according to the BLEU/BERTScore measure. For the HateXplain (ToxicSpans) dataset, starting from the top of this list, we consider the data points that are either normal (non_toxic) in the ground truth and misclassified as hate speech/offensive (toxic), or vice versa, by all three models. These points are then passed through LDA (number of topics, K = 3) to induce the types. For each topic, we choose the four words that have the highest probability of association with the topic. This collected set of words is then manually coded with topic names by two researchers with long experience in hate speech research. In order to obtain semantic clarity on the derived topic names, a meticulous manual annotation process is undertaken. This annotation task is done by three domain experts and, notably, unanimous consensus is reached among all annotators regarding the semantic interpretation and nomenclature of each delineated topic. We note the emerging typologies in Table 6.
Observations from the induced typology: For each dataset, we note some of the interesting typologies that emerged. (1) For the implicit hate dataset, one curious case of misclassification (non-hate → implicit hate) is the presence of the word 'racist', while in another category we find that the model marked posts as implicit hate because they have pro-white sentiments. Even when posts representing news articles or opinion pieces contain sensitive words like 'blacks', 'antifa', etc., the models label them as implicit hate. (2) For the HateXplain dataset, we find that hateful or offensive posts get misclassified as normal due to multiple reasons - (a) the presence of negations like 'neither', 'nor', 'no', etc. in the post, (b) a vocabulary gap where the models do not know the usage of words like 'muzrat' used in the post, and (c) the presence of polysemous words in the post with one of the meanings usually being a slang term (usually not found in standard English dictionaries), e.g., 'dyke', 'furries', 'furfaggotry', etc. (3) For the ToxicSpans dataset, we note that ideological posts and posts with fact-checking statements about political news often get misclassified as toxic. On the other hand, posts that are implicitly toxic often get misclassified as non-toxic by the models.

Conclusion
In this work, we extensively study three LLMs across three datasets on a variety of prompt setups. Overall, these models, though advanced, still face a lot of challenges in hate speech detection in their vanilla setting. Our prompt strategies, when applied individually, improve the performance of these models. However, merging the prompt strategies does not help much. Last, we perform a detailed error analysis and typology coding to find 'jailbreak' points where these models are vulnerable.

Limitations
We mostly focus on English datasets in this paper since we wanted to present a detailed analysis of the use of additional context (explanation, target community), which is often not present in multilingual datasets. We use the LLMs as black boxes; hence, we do not know the inner workings of the proprietary LLMs. Lastly, although we cover many well-thought-out prompt variations in our paper, these variations are not exhaustive.
Ethical statement
Here we discuss the ethical considerations that were not explicitly noted in the body of the paper. We use three LLMs - gpt-3.5-turbo-0301, text-davinci-003 and flan-T5-large - to detect hate speech. These experiments were purely done from a research point of view; the actual application of such models in the real world should be done with caution. This is also evident from the challenges these models face while classifying different forms of hate.
A Prompt strategies
Table 7 shows the different prompt variants we use in this study.

B.3 ToxicSpans dataset
• Toxic: In social media and online forums, toxic content can be defined as rude, disrespectful, or unreasonable posts that would make users want to leave the conversation.
• Non_toxic: Speech or text that is not toxic and is fit for use in conversation.

C.3 Flan-T5
The confusion matrices for the three best prompt combinations for the three datasets in connection with flan-T5-large are illustrated in Figures 7, 8, and 9.

Figure 9: Vanilla + explanation_input for ToxicSpans.

Table 6: The typologies induced from the error cases for each dataset. GT: ground truth, PR: prediction, n_h: not_hate, imp_h: implicit_hate.

B.1 Implicit hate dataset
• implicit_hate: Implicit hate speech is defined by coded or indirect language that disparages a person or group on the basis of protected characteristics like race, gender, and cultural identity.
• explicit_hate: Explicit hate refers to openly expressed, direct forms of hatred and prejudice toward individuals or groups based on their characteristics.
• not_hate: This class refers to speech or actions that do not involve any form of hatred, prejudice, or discrimination toward individuals or groups based on their characteristics.

B.2 HateXplain dataset
• hate speech: Any speech or text that attacks a person or group on the basis of attributes such as race, religion, ethnic origin, national origin, gender, disability, sexual orientation, or gender identity.
• offensive: Text or speech which uses abusive slurs or derogatory terms but may not be hate speech.
• normal: Text which is neither offensive nor hate speech and adheres to social norms.