Unveiling Identity Biases in Toxicity Detection: A Game-Focused Dataset and Reactivity Analysis Approach



Introduction
Online spaces are valuable for exchanging ideas and discussing common interests globally. However, these interactions are often marred by toxic comments and content, evident on platforms like Facebook (Ciftci et al., 2017), Twitter (Watanabe et al., 2018), and Reddit (Mohan et al., 2017). The videogame industry is also not immune to harm and harassment, as evidenced by the rising toxicity in written communications among players (ADL, 2022). This high level of toxicity affects not only gaming choices but also the personal lives of the players involved. Consequently, platforms (Hanu and Unitary team, 2020; Muralikumar et al., 2023) and the videogame industry (Miller, 2019; Shi, 2019; Unity, 2021) have turned towards language models for toxicity detection and content moderation due to their excellent performance and contextual understanding.
Although these models can effectively capture toxic content, they can perpetuate and amplify social biases present in their training datasets (Angwin et al., 2016; Caliskan et al., 2017; Dixon et al., 2018; Savoldi et al., 2021). Biases can emerge during dataset creation: when practitioners sample data, when annotators label data based on personal understanding, culture, and experiences, and when practitioners aggregate labels. In this study, we focus specifically on the issue of models overestimating the toxicity of terms associated with certain concepts, leading to problematic false positives and even false negatives (Dixon et al., 2018; Kiritchenko and Mohammad, 2018; Prabhakaran et al., 2019; Sap et al., 2019; Garg et al., 2022). Existing research has primarily focused on biases in toxicity detection without considering the specific use-case of in-game chat, despite the widespread presence of toxicity in that particular context.
To fill this research gap, we begin by meticulously constructing a dataset that aims to uncover identity biases present in language models. This dataset encompasses biases commonly observed within the English-speaking gaming community in North America. Next, we introduce a novel approach that combines reactivity analysis and the model's performance to identify highly sensitive terms. We apply this method to assess the effectiveness of ToxBuster (Yang et al., 2023), a model specifically trained for in-game chat, as well as Perspective API. Through our evaluation, we demonstrate the efficacy of our approach and its potential for evaluating various forms of biases in toxicity detection models. Both the dataset and the method are described in this paper as a proof of concept, as they do not cover every possible bias. Additional iterations are possible and strongly encouraged.
In summary, our contributions are two-fold, presenting a prototype of the following: 1. an identity bias benchmark dataset for toxicity detection models, and 2. a novel method that combines reactivity analysis and model performance to identify sensitive terms possibly conveying biases.

Related Work
Toxicity detection is inherently complex and subjective, with different definitions and interpretations among researchers (Garg et al., 2022; Kowert, 2020). Biases also vary across communities, influenced by culture, origin, and socio-political context. In this study, we define biases as "prejudice in favour of or against one thing, person, or group compared with another usually in a way that's considered to be unfair" (University of California). Natural language processing encompasses a wide range of types of biases, categorized by their sources or the type of harm they cause (Sap et al., 2019; Garg et al., 2022). Our focus lies specifically on lexical identity biases, which refer to biases conveyed by terms related to one's identity or characteristics (Zhou et al., 2021). Initially, we address the questions posed by Blodgett et al. (2020), more precisely "what kinds of system behaviours are harmful, in what ways, to whom, and why". The harmful behaviours we examine are false positives and false negatives concerning identity biases. In other words, we aim to identify terms that the model consistently tags as toxic or non-toxic, even when they are used in the opposite manner. A false positive resulting from an identity bias prevents marginalized and possibly minority communities from discussing and engaging with members of their own group (Zhou et al., 2021), thereby erasing proper representation of that social group (Dev et al., 2021a; Blodgett et al., 2022). Conversely, a false negative neglects to flag a sentence containing a term that should be identified as toxic, usually associated with an oppressing or majority community.
Previous work on bias detection has focused on creating evaluation metrics (Prabhakaran et al., 2019) or corpora (Kiritchenko and Mohammad, 2018) that consider binary dimensions, such as gender (male, female) or race (black, white), to determine the model's leanings. However, online spaces such as videogames encompass diverse communities that cannot be easily classified into such binary labels, rendering this analysis insufficient. Kiritchenko and Mohammad (2018) developed a dataset of sentence templates and human names to identify biases in a model; however, these longer sentences and names are not typically found in videogame chat.

Methodology
Here, we present in four sections our method for identifying biases and the oversensitive terms that toxicity models flag. Section 3.1 discusses the generation of a synthetic dataset using sentence templates and a set of manually identified terms associated with potential biases. Section 3.2 outlines the annotation process for this dataset. Section 3.3 focuses on using reactivity analysis to highlight terms and biases. Lastly, Section 3.4 describes the two toxicity detection models that we measure.

Dataset Creation
Following Kiritchenko and Mohammad (2018), we create a set of sentence templates and a corresponding set of terms. Initially, a list of terms was generated by identifying categories commonly assessed in identity bias research, such as race, gender, sexual orientation, and age (Dev et al., 2021b; Garg et al., 2022). To ensure the templates' and terms' relevance to the gaming domain, we sample thousands of chat lines containing this first list of terms to identify sentence formulations frequently used within the chat of an online competitive multiplayer game, where communication is permitted among teammates as well as with enemy teams. Each team consists of 5 players.
In order to ensure inclusivity in our template selection and bias categories, we sought input from volunteer members of various employee resource groups within our organization. These groups consist of employees with diverse backgrounds and perspectives. The four employees consulted provided valuable feedback on the chosen terms, sentences, and bias dimensions, and proposed additional ones.
Through this process, we compile a final list consisting of 46 terms categorized into 12 categories, as shown in Appendix A. We prioritize terms that exhibit ambiguity in their offensive nature, as they can be used in either a toxic or non-toxic manner, e.g., you look gay would be toxic, while they discussed gay marriage is not (Wang et al., 2014; Zhou et al., 2021). Terms considered to be insults or slurs without any ambiguity regarding their offensive nature in North American English (e.g., ret*rd) were not included in this list. We note that the term fairy is included in the sexual orientation category, as it can be used to refer to a homosexual man.
We carefully select 22 grammatically correct sentence templates, presented in Table 5 of Appendix A, to reflect sentences found within in-game chat lines. These templates are also chosen to ensure variation in the syntactic function of the evaluated terms, allowing them to serve as either subjects or objects within sentences. Each sentence includes one or two "tags", representing variables to be replaced by a single term or by a noun phrase consisting of two terms. The <Attribute> tag must be a single-term adjective, such as black, white, trans, old or young. The <Person> tag represents a singular noun (female, male, artist, developer...), either alone or preceded by an <Attribute> if one is not already included in the sentence (e.g., black female, black male, black artist). Similarly, the <Groups> tag functions like <Person>, but with the noun in its plural form (e.g., black females, black males, black artists). Note that a term belonging to <Attribute> in its singular form and to <Groups> in its plural form (black woman, but also trans blacks) cannot be combined with itself (black blacks).
Using the sentence templates and the list of terms, a total of 16,008 synthetic chat lines were generated. Examples are shown in Table 1. The dataset is diverse and encompasses categories of biases that are not limited to binary options; its main purpose is to evaluate the model's reactivity to each term irrespective of its category. We emphasize that the dataset serves as a prototype, indicating the potential use of a broad range of terms and templates to subsequently expand the dataset.
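The template expansion step can be sketched as follows. The terms, templates, and naive pluralization below are toy stand-ins for illustration only, not the 46 terms and 22 templates of Appendix A:

```python
from itertools import product

# Toy term lists (placeholders for the curated lists in Appendix A).
ATTRIBUTES = ["black", "white", "trans", "old", "young"]
PERSONS = ["female", "male", "artist", "developer"]

# Toy templates with one tag each; the real templates vary the syntactic
# role of the evaluated term (subject or object).
TEMPLATES = [
    "you look {attribute}",              # <Attribute> slot
    "the {person} on your team is bad",  # <Person> slot
    "all {groups} play like this",       # <Groups> slot (plural of <Person>)
]

def simple_plural(noun: str) -> str:
    """Naive pluralization; the real dataset uses curated plural forms."""
    return noun + "s"

def generate_lines():
    lines = []
    for template in TEMPLATES:
        if "{attribute}" in template:
            lines += [template.format(attribute=a) for a in ATTRIBUTES]
        elif "{person}" in template:
            # A <Person> may appear alone or preceded by an <Attribute>.
            persons = PERSONS + [f"{a} {p}" for a, p in product(ATTRIBUTES, PERSONS)]
            lines += [template.format(person=p) for p in persons]
        elif "{groups}" in template:
            groups = [simple_plural(p) for p in PERSONS]
            groups += [f"{a} {simple_plural(p)}" for a, p in product(ATTRIBUTES, PERSONS)]
            lines += [template.format(groups=g) for g in groups]
    return lines

lines = generate_lines()
```

With the full term and template lists, and the constraint that a term cannot be combined with itself, this style of exhaustive expansion yields the 16,008 lines reported above.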

Dataset Annotation
Our dataset is annotated through a two-step process. Firstly, a sample of the dataset is manually annotated. Secondly, a random forest model is trained to propagate these annotations to the entire dataset.
We avoid making assumptions about the toxicity, or lack thereof, of a term, a sentence template, or their combination by including various categories for terms and sentiment polarities for templates. This prevents us from rejecting terms and templates that may seem inoffensive by themselves or in combination, yet might be evaluated as toxic by a human. Human annotations decide which sentences are toxic or non-toxic. Our method may then reveal unexpected biases, while unbiased terms and templates will manifest their neutrality.

Manual Annotation
We obtain a set of ground truth labels from four participants. These participants were recruited from within the game company that developed ToxBuster (Yang et al., 2023), which is further described in Section 3.4.1.
A total of 1,363 lines, a subset of the complete dataset, was annotated. The decision to annotate only a fraction of the dataset is due both to the exploratory nature of this research and to the limited resources available for annotating a large dataset. We further discuss this decision and its motivation in the Limitations section. Annotation guidelines and details are provided in Appendix B. The process ultimately allowed us to obtain a binary label for each line of the subset.

Annotation Propagation
We propagate the manual annotations to the full dataset by training a random forest. In particular, we perform a 5-fold cross-validation over five mtry values (see Appendix C) with a 20-80 train-test split. The best performing parameters are ntree=500 and mtry=15, with an F1 score of 90.4% on the test set.
Ideally, this step would not be needed as the whole dataset would be manually annotated.

Reactivity Analysis
We will now elaborate on the process of identifying biases and the models' reactivity to certain terms in the dataset.
Our objective is to compare the analysis results of each toxicity detection model with the ground truth annotations from the annotated dataset.
We conduct a reactivity analysis by calculating the average predictive difference in the probability of toxicity for each lemmatized term. In other words, we measure the difference in each sentence's toxic probability when the specific term is present or absent (Gelman and Hill, 2006), providing insights into the influence of the term's presence on toxicity.
We calculate the reactivity score of a term by determining the average predictive difference over all sentence templates. The predictive difference between the absence and presence of a term is calculated as shown in Equation 1, where u^(0) represents the absence of the term, u^(1) its presence, and v the vector of all other inputs at that data point. We utilize the coefficients from a regularized logistic regression to estimate the probabilities.
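Equation 1 did not survive extraction; a plausible reconstruction of the average predictive difference, following Gelman and Hill (2006) and scaled by 100 to match the score range described below (the scaling factor is our assumption), is:

```latex
\Delta_u \;=\; \frac{100}{n} \sum_{i=1}^{n}
\left[ \Pr\!\left(y_i = 1 \,\middle|\, u^{(1)}, v_i\right)
     - \Pr\!\left(y_i = 1 \,\middle|\, u^{(0)}, v_i\right) \right]
\tag{1}
```

where n is the number of generated sentences considered for the term, u^(1) and u^(0) denote the presence and absence of the term, and v_i is the vector of all other inputs at data point i.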
The reactivity score can be interpreted in the following way. The highest possible score a term could obtain is 100, while the lowest is -100. A score of zero indicates that a term's presence has no impact on the toxicity of a sentence and that the context of the sentence matters more. A positive score suggests that the presence of the term increases the likelihood of the sentence being flagged as toxic. Conversely, a negative score indicates that the sentence is more likely to be flagged as non-toxic when the term is present. In our specific use-case, a model with scores closer to zero is preferred as it indicates that the model is less likely to systematically react to the presence of a term.
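The computation can be sketched as follows, using scikit-learn's (L2-)regularized logistic regression as a stand-in for the paper's model; the features and labels here are synthetic toys, not the actual dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy design matrix: column 0 flags the presence of the term under study,
# the remaining columns stand for the other inputs v (e.g., template indicators).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4)).astype(float)
# Synthetic labels correlated with the term's presence (an assumption for the demo).
y = (X[:, 0] + rng.random(200) > 0.9).astype(int)

# Regularized logistic regression (L2 by default in scikit-learn).
model = LogisticRegression(C=1.0).fit(X, y)

def reactivity(model, X, term_col=0):
    """Average predictive difference for a term, scaled to [-100, 100]."""
    X_present, X_absent = X.copy(), X.copy()
    X_present[:, term_col] = 1.0   # u^(1): term present, v unchanged
    X_absent[:, term_col] = 0.0    # u^(0): term absent, v unchanged
    diff = model.predict_proba(X_present)[:, 1] - model.predict_proba(X_absent)[:, 1]
    return 100.0 * diff.mean()

score = reactivity(model, X)
```

Because the toy labels correlate positively with the term's presence, the resulting score is positive; a score near zero would indicate the term itself has little influence.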

Models Specifications
We detail here the two toxicity models on which we perform the reactivity analysis.

ToxBuster
ToxBuster is a model based on BERT (Devlin et al., 2019) that is currently being developed by Ubisoft La Forge as a research and development effort (Yang et al., 2023). The model was fine-tuned specifically to predict toxic spans of text in in-game chat, utilizing 8 different classes of toxicity. It achieves an F1 score of 83.25. For this analysis, we adapt the model by considering a sentence to be toxic if any token within the sentence is predicted to be toxic.
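Since ToxBuster itself is not public, the sentence-level adaptation can be sketched over hypothetical token-level predictions; the class labels below (`"O"` for non-toxic, `"Insult"`) are illustrative placeholders, not ToxBuster's actual eight classes:

```python
def sentence_is_toxic(token_predictions, non_toxic_label="O"):
    """A sentence is toxic if any token is predicted as part of a toxic span."""
    return any(label != non_toxic_label for _, label in token_predictions)

# Mock (token, predicted class) outputs for two chat lines.
clean = [("good", "O"), ("game", "O")]
flagged = [("you", "O"), ("look", "O"), ("gay", "Insult")]
```

This any-token rule deliberately errs on the side of flagging: a single toxic-span token is enough to label the whole line toxic.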

Perspective API
Perspective API is a tool built by Jigsaw with the purpose to "help mitigate toxicity and ensure healthy dialogue online" (Lees et al., 2022). The model undergoes regular updates, and the complete dataset is not publicly available. The data used in part includes the Jigsaw datasets (citations), which comprise comments from Wikipedia and news posts. In our analysis, we utilize version v1alpha1. According to the API guide, the toxicity threshold can vary depending on the specific use-case. For our analysis, we consider a sentence as toxic if the toxicity score returned by the API is ≥ 0.5.
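The thresholding step can be sketched as follows. The response layout mirrors Perspective API's documented JSON format, but the payload below is mocked rather than fetched from the API:

```python
TOXICITY_THRESHOLD = 0.5  # use-case dependent, per the API guide

def label_from_response(response, threshold=TOXICITY_THRESHOLD):
    """Map a Perspective API response to a binary label.

    Reads the TOXICITY summary score from the documented response layout
    and applies the paper's >= 0.5 decision rule."""
    score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
    return "toxic" if score >= threshold else "non-toxic"

# Mocked response for illustration (not an actual API call).
mock_response = {
    "attributeScores": {"TOXICITY": {"summaryScore": {"value": 0.83}}}
}
```

In practice the response would come from the `comments:analyze` endpoint; only the score extraction and thresholding matter for our analysis.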

Results and Discussion
After obtaining all the annotations and predictions, we calculate the proportion of toxic labels for each source, as depicted in Table 2. Right away, we notice a disparity in toxic label proportions across the annotations and the two toxicity detection models, suggesting the presence of biases and varying effectiveness. This prompts a closer examination of the models' performance, which is detailed in the next sections. We first performed the reactivity analysis on the models and on the propagated human annotations. Precision, recall and F1 scores are used to evaluate the models' predictions on the dataset, using the propagation as a gold label.

Human Annotations Propagation
To establish a ground truth and reference point, we present the reactivity scores for the toxicity labels resulting from propagation. Table 3 displays the ten highest and ten lowest scores. Terms not included have scores between -0.5 and 0, indicating a low impact on the annotators' decisions.
The results highlight that, from the annotators' perspective, the term homo has a stronger association with toxicity compared to other terms. This raises questions as to how this term is perceived both in-game and in general, particularly in North America. Insights gathered during the discussion when selecting identity terms suggest that the expression "no homo" is often used by people to assert their non-homosexuality and can perpetuate the notion that displaying feminine characteristics implies homosexuality. Although not included in the dataset, this illustrates how the term can be commonly used by individuals who do not identify as part of the referenced community, and this usage can be seen as harmful and toxic. Among the lowest scores, we observe three terms related to objects, as well as the term fairy. This aligns with expectations, as these terms are relevant within the context of a video game, even though fairy falls under the sexual orientation category. Additionally, four other terms with negative scores are related to the occupation category, indicating that these terms may not be problematic, or simply not discussed, by players in general. This table will serve as a baseline to evaluate the two models under scrutiny.

Perspective API
The same analysis was conducted with Perspective API. The terms with the 15 highest and 15 lowest reactivity scores can be found in Appendix D, where the terms are arranged by their precision and recall. Figure 1 provides a visualization of the main cluster in the plot. For reference, Perspective API achieves an F1 score of 62.82% on the annotated dataset. On the dataset ToxBuster is trained on, Perspective API obtains an F1 score of 36.81% (Yang et al., 2023). This disparity suggests that assessing the model using this dataset and method is worthwhile. Even though the dataset is built to reflect real-world game chat and the model is trained mostly on social media data, the model still performs well on our dataset.
The reactivity score of each term is represented by the colour of the points. The complete table can be found in Appendix D. The remaining terms' scores range between 0 and 1.02, with only 3 terms exceeding zero. As previously mentioned, this is a desirable outcome. We note that sexual orientation dominates the top three reactivity scores. These results align closely with the observations obtained from the ground truth data, where the term homo is also at the very top. In total, five terms (homo, gay, trans, black and jew) appear in both the Perspective API's top 10 results and the propagation results. The five other terms (lesbian, fat, autistic, autist, obese) introduce different categories compared to those seen in the propagation, namely weight and neurodivergence. The higher reactivity score for these terms compared to the ground truth raises the question of whether there is a need for a higher sensitivity to these terms than to others. Figure 1 depicts all terms with a positive score, forming a cluster on the top-right edge of Figure 4. Terms with a score of zero are in the low end of the cluster. A high precision but low recall for these terms indicates that the model is conservative, less prone to false positives, and has overall low sensitivity.
Analyzing the negative scores, objects and the term fairy are the lowest. Age and occupation can also be found, but are much closer to zero. Figure 4 shows that many terms associated with these negative scores are predominantly located in the lower left, and predicted with low precision and recall. This aligns with the reactivity analysis, suggesting that the model is indeed less sensitive to these terms as well.

Figure 1: Perspective API: performance metrics and reactivity analysis results.

ToxBuster
The 15 highest and 15 lowest reactivity scores for ToxBuster are shown in Figure 3, and Figure 2 illustrates the main cluster. ToxBuster achieves an overall F1 score of 67.16%. Similar to Perspective API, there are no particularly strong outliers for any term. However, the reactivity scores for terms not included are all above zero, with only three terms approaching zero: guy (0.02), fireman (0.02) and student (0.81). The remaining scores range between 3.28 and 11.52. Just like in the propagation results, the same three sexual orientation terms are in the top scores. We do notice that the term homo is ranked lower this time, but it still has the highest F1 score (90.4%). Our reactivity analysis results align with the gold standard. Terms that exhibit substantial variation compared to the propagation and Perspective API results are asian and mexican from the "origin" category, muslim from "religion", black and brown from "color", and finally trans from "gender". These are all terms having a meaningful impact on the model's decision, which may indicate an exaggerated sensitivity and the presence of biases.
In terms of performance, these terms form a large cluster characterized by high recall (91% to 100%) and low precision (48% to 61%). The high recall scores are expected since the system identifies most terms as toxic, including those that shouldn't be. This is also consistent with the toxic label proportion of 88.38%, which is considerably higher than the annotation propagation's and Perspective API's. The low precision for most terms indicates a significant number of false positives.
On the negative scores side, the "occupation" and "objects" categories are the lowest. One term that stands out is yellow, which appears among the highest scores of the propagation. It is possible that the data ToxBuster is trained on lacks sufficient examples of yellow being used in a toxic way, even though human annotators perceive it as offensive.

Common Findings
For both ToxBuster and Perspective API, the terms weapon, gun, house and fairy have the lowest F1 scores, despite being identified as non-toxic by both the models and the human annotators. Our hypothesis is that the context around these terms impacts toxicity more than the term itself. For example, the term weapon is often accompanied by an <Attribute> that would trigger the model, even if weapon by itself strongly indicates non-toxicity. Generally, terms with negative reactivity scores have lower recall compared to terms with positive scores, validating the fact that the models are less sensitive to the presence of these terms.
Although the models' results differ, they are still comparable and serve as great examples of how the dataset and analysis paradigm can be used to learn substantial information about their predictions.

Discussion
Based on the combined analysis of reactivity and performance, we can create a watchlist of terms and categories that are commonly used to express biases. Terms that have a high reactivity score in both the ground truth and the models, and that show better performance, should not be included on the list.
As an example, for Perspective API, we would consider the "weight" and "neurodivergence" categories as well as the term lesbian. Additionally, the term homo is included for a different reason: it has a high reactivity score for both the model and the ground truth, which contradicts the low recall score associated with it. This suggests that the reactivity for homo in Perspective API might not be high enough.
For ToxBuster, we would put terms from the "origin" category on the watchlist, notably the terms asian and mexican. The terms black, brown and trans would also be added. From the negative reactivity scores, the term yellow would be included.
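One possible way to operationalize such a watchlist is sketched below; the deviation and performance thresholds are illustrative assumptions, not values from our analysis:

```python
def build_watchlist(terms):
    """Flag candidate watchlist terms.

    terms: dict mapping a term to a tuple
      (model_reactivity, ground_truth_reactivity, f1).
    A term is flagged if the model's reactivity deviates notably from the
    ground truth, or if it is highly reactive yet performs poorly.
    Thresholds (5.0 points of reactivity, 0.6 F1) are illustrative only."""
    watchlist = []
    for name, (model_r, truth_r, f1) in terms.items():
        if abs(model_r - truth_r) > 5.0 or (model_r > 5.0 and f1 < 0.6):
            watchlist.append(name)
    return watchlist

# Hypothetical scores, loosely inspired by the patterns discussed above.
watch = build_watchlist({
    "asian": (12.0, 1.0, 0.55),   # model far more reactive than annotators
    "gay": (8.0, 7.5, 0.90),      # reactive, but aligned with ground truth
    "house": (-3.0, -2.5, 0.40),  # low reactivity on both sides
})
```

Under these toy thresholds, only the misaligned term is flagged; an aligned, well-performing term stays off the list, matching the rule stated above.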

Conclusion
We have developed an evaluation dataset that includes examples of in-game chat lines with a range of terms related to identity biases. We propose a novel method that utilizes reactivity analysis and model performance to identify sensitive terms conveying biases in toxicity detection models, though it still requires further human intervention. We have applied this analysis to ToxBuster and Perspective API, demonstrating the potential application of our method and dataset to models beyond the gaming domain.
Through this process, we have generated an interesting list of terms that trigger the models to varying degrees, along with insights on establishing a baseline through human annotations. These findings can contribute to providing explanations for the models' predictions, which is crucial in bias and fairness research. We now have a clearer roadmap for future steps, including obtaining a reliable ground truth through diverse human annotators, evaluating the models with different settings and parameters, and incorporating linguistic and sociolinguistic considerations to enhance our understanding of how these terms operate in both gaming and non-gaming contexts.

Limitations
As previously mentioned, the dataset was created with a North American English-speaking linguistic community in mind. This signifies that the dataset can be used to evaluate models that have been trained in English, but also that the identity biases it evaluates are only relevant to this specific community. Covering biases for multiple communities, even for the same language, can have complex implications.
We mentioned in Section 3 that only a portion of the dataset was annotated (1,363 lines out of 16,008). The number of lines was determined to ensure the participants would have sufficient time to annotate them, while also being able to take breaks, over the span of two weeks. As this was an internal test, participants had other assignments to attend to and were not required to work full-time on the task. Considering that there are 48 terms and 22 sentence templates, the lines were selected randomly, with some manual adjustments to add or remove sentences so that a term appearing 10 times would not appear 10 times in the same type of sentence. However, it was impossible to cover all the interactions between terms and templates, which makes the small dataset inherently unbalanced.
Although the propagation of human annotations was chosen carefully, there is still a risk that the algorithm inserted other biases, or that its predictions were not representative of the human annotations. We wish to mitigate this issue, as well as the imbalance issue, by annotating the complete dataset to get more solid results.
The overall performance of ToxBuster and Perspective API is considerably low compared to their performance on other datasets. For this, we have several hypotheses to be addressed in future works. For instance, the models may not have been specifically tuned for this type of dataset, or the models' sensitivity may not be adequately adjusted for the task. These factors may also contribute to the observation that terms with higher reactivity scores have better performance in both models compared to the average.

…researchers, unless they themselves chose to mention their involvement. As addressed in Section 5, the annotation process was done on a small part of the dataset as a way to assess the difficulty of the task for annotators. However, to better comply with ethical principles and with a purpose of better diversity and inclusion, the whole dataset would need to be annotated.

A Categories, Terms and Templates
Table 4 shows the different categories considered, the terms they contain, and the tags associated with them. Table 5 details the possible sentence templates.

B Manual Annotation Guidelines
Each participant was given a two-week period to annotate the lines at their convenience, with an estimated time requirement of 3 to 5 hours to annotate all the lines. The lines were presented to each participant in a random order. To align them on the annotation task, the following instructions were provided: 1. Imagine you are in a multiplayer game, with the chat open to communicate with other players. 2. Your job is to decide if the chat line you read targets and is harmful to a specific person or a group of people, and to indicate your level of confidence in your answer.
3. When reading a line, do not try to infer the author's intent: simply consider the line in itself.

For the assessment of targeted harm in chat lines, we collect a binary label (i.e., does this chat line target and is it harmful to a specific person or a group of people?). Given the highly subjective nature of evaluating toxicity and harmful content, we incorporate a Likert scale to gauge the annotators' confidence, where 1 represents "Very uncertain" and 5 indicates "Absolutely certain".
Subsequently, we aggregate the annotations by calculating a weighted score based on the annotators' confidence levels.
To determine the final binary label of toxic or non-toxic, we consider any sentence with a score ≥ 0.5 as toxic.
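A minimal sketch of the aggregation step: the paper specifies only a confidence-weighted score thresholded at 0.5, so the exact normalization below is our assumption:

```python
def aggregate(annotations, threshold=0.5):
    """Confidence-weighted aggregation of (binary_label, confidence) pairs.

    Each annotation is a 0/1 harm label with a 1-5 Likert confidence.
    The weighted score is the confidence-weighted mean of the labels,
    normalized to [0, 1]; scores >= threshold map to the toxic label."""
    total_confidence = sum(conf for _, conf in annotations)
    weighted_score = sum(label * conf for label, conf in annotations) / total_confidence
    return (1 if weighted_score >= threshold else 0), weighted_score
```

For example, three confident "toxic" votes against two hesitant "non-toxic" votes would land above the 0.5 threshold, so confident annotators carry more weight than uncertain ones.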

C Propagation Algorithm
For training and evaluating the model, we used the R package caret (Kuhn, 2008), with the random forest ("rf") method on a 20-80 train-test split of the annotated dataset. The random forest model was trained using 5-fold cross-validation. The tuning grid, automatically determined by the caret package, consisted of the "mtry" hyperparameter with values [2, 15, 28, 41, 54]. Model evaluation was performed using accuracy. The optimal configuration was found at mtry = 15, resulting in the following performance metrics on the holdout test set: accuracy = 0.89, precision = 0.87, recall = 0.94, and F1-score = 0.90. It is important to note that caret uses a default value of ntree = 500.
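For readers working in Python, the caret setup can be approximated with scikit-learn, where `max_features` plays the role of mtry and `n_estimators` that of ntree. The data below is synthetic, and the 20-80 split is read as 20% train / 80% test:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the annotated lines: 54 binary features (e.g.,
# term/template indicators) with one perfectly informative feature.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(400, 54)).astype(float)
y = (X[:, 15] > 0).astype(int)

# 20-80 train-test split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, random_state=42, stratify=y)

# 5-fold CV over the mtry-style grid, with ntree = 500.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=42),
    param_grid={"max_features": [2, 15, 28, 41, 54]},
    cv=5, scoring="accuracy")
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
```

On this toy data the search recovers the informative feature easily; on real annotations the holdout metrics would resemble those reported above rather than near-perfect accuracy.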

D Performance Metrics and Reactivity
Here, two figures (ToxBuster: Figure 3; Perspective API: Figure 4) allow for a visualization of the terms with the 15 highest and 15 lowest reactivity scores, arranged according to their respective recall and precision scores. The results for all terms are found in the corresponding tables (ToxBuster: Table 6; Perspective API: Table 7).

Figure 4: Perspective API: visualization of performance metrics and reactivity analysis results for the 15 highest and 15 lowest reactivity scores.

Table 2: Proportion of toxic labels for each label source. (Columns: Source of label; % of Toxic labels.)

Table 3: Propagation: ten highest and ten lowest reactivity scores.

Table 4: Categories, terms and their associated tags.

Table 5: Sentence templates for generating the synthetic dataset.

Table 7: Perspective API: full table with performance metrics and reactivity scores, ordered from the highest reactivity score to the lowest.