Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI

Detecting “toxic” language in internet content is a pressing social and technical challenge. In this work, we focus on Perspective API from Jigsaw, a state-of-the-art tool that promises to score the “toxicity” of text, with a recent model update that claims impressive results (Lees et al., 2022). We seek to challenge certain normative claims about toxic language by proposing a new benchmark, Selected Adversarial SemanticS, or SASS. We evaluate Perspective on SASS and compare it to low-effort alternatives, like zero-shot and few-shot GPT-3 prompt models, in binary classification settings. We find that Perspective exhibits troubling shortcomings across a number of our toxicity categories. SASS provides a new tool for evaluating performance on previously undetected toxic language that avoids common normative pitfalls. Our work leads us to emphasize the importance of questioning assumptions made by tools already in deployment for toxicity detection in order to anticipate and prevent disparate harms.


Introduction
Social media and online discussion forums are plagued with toxic content: 41% of American adults have personally experienced online harassment (Vogels, 2021).
This problem can be further exacerbated by learning algorithms with unchecked incentives. On a platform like Facebook, incendiary comments are highly visible due in large part to the design of its ranking algorithms (Munn, 2020). Experiments by Cheng et al. (2017) and Kim et al. (2021) have shown that users are more likely to write toxic or "troll" posts when the initial comments in a conversation exhibit toxic or "trolling" behavior, which can be interpreted as evidence that "trolling is contagious." This creates the potential for a negative feedback loop in which companies are incentivized to promote toxic content and toxic content in turn provokes users to post more malicious content. In response, researchers and corporations alike have invested in systems that can detect toxicity at massive scale to help moderate content on social media and internet forums (Lees et al., 2022; Pavlopoulos et al., 2017; Wulczyn et al., 2017). Some of these models have, however, been shown to be susceptible to simple adversarial attacks (Hosseini et al., 2017; Gröndahl et al., 2018).
Text: You look so good in that dress, and probably even better with it off.
Toxicity (Perspective): 22.9%
Toxicity (Human): 85.0%

Table 1: A toxicity score given by PERSPECTIVE to a SASS example from the "Sexual Harassment" category.
Still, deep language model-based approaches to detecting toxicity have been deployed, with consumer-facing APIs granting direct access to the model output. One such model is PERSPECTIVE (Lees et al., 2022), provided by Jigsaw (a Google subsidiary), which assigns text a "toxicity" score ∈ [0, 1]. In 2022, Jigsaw published findings on the "new generation" PERSPECTIVE toxicity model (currently deployed). We believe that interrogating the model's assumptions and frailties is vital work in moving it (and similar systems) towards robust detection (or, perhaps, a robust understanding of when detection fails). For example, Jigsaw defines toxic language as "a rude, disrespectful, or unreasonable comment that is likely to make you leave the discussion" (Lees et al., 2022), though other definitions exist (Märtens et al., 2015). We were unable to find an original source for Jigsaw's definition.

Contributions Existing models and benchmarks rely on aggregating binary responses to text collected from crowdworkers into a ground truth "probability of toxicity": a crowdworker is prompted with "Is this text toxic?", and the responses are aggregated into the toxicity score Pr[toxic] = |yes responses| / |total responses|. We suspect this method overemphasizes a normative understanding of toxicity, such that potentially toxic, harmful text "on the margins" goes undetected. Here, "normative" describes the way in which multiple annotations are traditionally aggregated, which often implicitly supports the views of the majority and ignores the annotations of minority groups. In response, we isolate a set of natural language categories that fulfill the definition of toxicity (as stated earlier) but go largely undetected, due in part, we believe, to these normative assumptions behind the ground truth toxicity examples in existing training and benchmark data.
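To make the aggregation scheme concrete, here is a minimal sketch in Python (the function and variable names are ours, purely for illustration):

```python
# Minimal sketch of the crowdworker aggregation described above.
# Each annotation is a yes/no answer to "Is this text toxic?".
def aggregate_toxicity(annotations):
    """Return Pr[toxic] = |yes responses| / |total responses|."""
    return sum(annotations) / len(annotations)

# Five annotators, two of whom (a minority) judge the text toxic:
score = aggregate_toxicity([True, True, False, False, False])
print(score)  # 0.4 -- below a 0.5 threshold, the minority view is discarded
```

Under such a scheme, any downstream binarization at 0.5 erases the minority judgment entirely, which is precisely the normative behavior SASS is designed to probe.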
We present a new benchmark entitled Selected Adversarial SemanticS, or SASS, that evaluates these behaviors. SASS contains natural language examples (each approximately 1-2 sentences in length) across previously underexplored "toxicity" categories (like manipulation and gaslighting) as well as categories that have received attention (like "sexism" (Sun et al., 2019)), and includes a "human" toxicity score ∈ [0, 1] for each example. Table 1 shows an example from the "Sexual Harassment" category. SASS follows a filtered/unfiltered approach to adversarial benchmarking, as in Lin et al. (2021). The benchmark is designed to exploit the normative vulnerabilities of a toxicity detection tool like PERSPECTIVE. Specifically, PERSPECTIVE makes the ambiguous claim that it can "identify abusive [or toxic] comments" (Jigsaw), but does not clarify that these abusive comments are determined, in essence, by the majority opinion of random annotators. Our position is that PERSPECTIVE should either be clear about the limitations of its toxicity tool (i.e., that it detects toxic content according to majority opinion) or adjust the PERSPECTIVE model to better account for minority annotations.
We compare PERSPECTIVE's performance on SASS to "human"-generated toxicity scores. We further compare PERSPECTIVE to low-effort alternatives, like zero-shot and few-shot GPT-3 prompt models, in a binary classification setting ("toxic or not-toxic?") (Brown et al., 2020). Code for our project can be found in this repository.

Related Work
Past PERSPECTIVE Model Works such as Hosseini et al. (2017) and Gröndahl et al. (2018) focused on generating adversarial attacks to test how the former version of PERSPECTIVE responded to word boundary changes, word appending, misspellings, and more. Gröndahl et al. (2018) further tested how toxicity detection models responded to offensive but non-hateful sentences. The toxicity score of a test sentence increases sharply when the word "F***" is added (You are great → You are F*** great, 0.03 → 0.82). This opens up a discussion about the subjectivity of what should be considered "toxic", a theme in our work. We pose new open questions that draw a clear connection between "toxicity" and normative concerns (Arhin et al., 2021). Another promising approach to fortifying toxicity detectors is to probe a student model with a few annotated examples in order to recover veiled toxicity, most of it annotated incorrectly in a pre-existing dataset, and then re-annotate it, making the model more robust (Han and Tsvetkov, 2020); we do not attempt this in our work.
Current Model A recent publication on PERSPECTIVE (Lees et al., 2022) generated benchmarks to test how the new version responded to character obfuscation, emoji-based hate, covert toxicity, distribution shift, and subgroup bias. They demonstrate improvements of the model in classifying multilingual user comments and classifying comments with human-readable obfuscation. Additionally, PERSPECTIVE beats every baseline on character obfuscation rates ranging from 0% to 50%. Character-level perturbations and distractors degrade the performance of ELMo- and BERT-based toxicity models, reducing detection recall by more than 50% in some cases (Kurita et al., 2019). Separate detection tools, like the HATECHECK system from Röttger et al. (2020), present a set of 29 automated functional tests to check identification of types of "hateful behavior" by toxicity or hate speech detection models. A large dynamically generated dataset from Vidgen et al. (2020), designed to improve hate speech detection during training, showed impressive performance increases in toxicity and hate speech detection tasks. Though slightly different in their typology of toxic speech, these approaches have a significant scale advantage over SASS, while SASS examples are specifically targeted at the PERSPECTIVE tool.

Benchmarking with SASS
The SASS benchmark contains 250 manually created natural language examples across 10 nuanced "toxicity" categories (e.g., stereotyping, classism, blackmail). These categories were selected via a process of literature review and vulnerability testing on PERSPECTIVE and other toxicity tools, to determine their strengths and weaknesses. As we sought to challenge PERSPECTIVE and other toxicity tools, we believe this to be a sufficient process for determining our categories, although we acknowledge that it introduces some unavoidable author bias. The examples are each 1-2 sentences long and are designed to exploit vulnerabilities in toxicity detection systems like PERSPECTIVE. Samples from SASS in each category are shown in Table 2.
Eight of SASS's categories are aimed at generating "False Negative" (FN) scores (a score that significantly underestimates the toxicity of some text), one category is aimed at "False Positive" (FP) scores (a score that overestimates toxicity), and one category is "Neutral," a control demonstrating the model's performance on "normal," non-toxic sentences. SASS is heavily biased towards examples that generate an FN score, which we argue may be more harmful than an FP score, as an FN means toxic content has gone undetected. For each category, the benchmark contains 15 "filtered" and 10 "unfiltered" examples, drawing inspiration from Lin et al. (2021). We generate filtered examples by brainstorming toxic comments and evaluating them with PERSPECTIVE to ensure a toxicity score below 0.5. Then, using the knowledge gained from creating the filtered examples, we generate an additional set of 10 examples per category without first testing them on PERSPECTIVE.
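As an illustration of the filtering step, the sketch below scores candidate comments with the public Perspective API and keeps those under the 0.5 threshold. The request and response structure follow Google's documented commentanalyzer endpoint; the API key, candidate list, and helper name are placeholders of ours, not necessarily the tooling the benchmark was built with.

```python
from googleapiclient import discovery  # pip install google-api-python-client

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def perspective_score(text):
    """Return PERSPECTIVE's TOXICITY summary score in [0, 1]."""
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    response = client.comments().analyze(body=body).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# "Filtered" examples are candidates that PERSPECTIVE scores below 0.5.
candidates = ["Your secret can be our secret as long as you do what I tell you."]
filtered = [c for c in candidates if perspective_score(c) < 0.5]
```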
Human Ground Truth The benchmark also contains a "human" toxicity score ∈ [0, 1] for each comment, which can be used as a baseline for evaluating toxicity detection tools on SASS. The human toxicity scores are the average of the authors' per-comment ratings (each author scored blindly).
Here, we scored examples on a scale of 0-10, using Jigsaw's definition of toxicity, i.e. "how likely [the example is to] make [a user] leave the discussion" (0 = highly unlikely, 10 = highly likely). Significantly, we aligned these ratings with the assumptions laid out in A.2.2 (in the appendix) for consistency and to combat benchmarking pitfalls (Blodgett et al., 2021).
We further performed z-normalization, following Pavlick and Kwiatkowski (2019). Each author may have treated the "0-10 toxicity scale" differently, so this normalization process ensures that the final aggregate scores are not overly biased by any single author's interpretation of the scale.
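A minimal sketch of this step follows. Note that the final min-max rescaling of z-scores back into [0, 1] before averaging is our assumption, since the exact rescaling used is not spelled out above.

```python
import numpy as np

def human_toxicity_scores(raw):
    """Aggregate per-author 0-10 ratings into one score per example.

    `raw` has shape (n_authors, n_examples). Each author's ratings are
    z-normalized so that no single interpretation of the scale dominates,
    then (our assumption) min-max rescaled to [0, 1] and averaged.
    """
    raw = np.asarray(raw, dtype=float)
    z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)
    lo, hi = z.min(axis=1, keepdims=True), z.max(axis=1, keepdims=True)
    return ((z - lo) / (hi - lo)).mean(axis=0)

# Two authors rating three examples on the 0-10 scale:
print(human_toxicity_scores([[0, 6, 9], [1, 4, 10]]))  # [0.0, 0.5, 1.0]
```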
In Table 5 (in the appendix), we observe the average z-normalized human toxicity scores of comments in SASS across the toxicity categories described above. We note that some categories are inherently more toxic than others; "Stereotyping" comments have an average human toxicity score of 0.81 versus 0.57 for "Gaslighting" comments, which further contrasts with an average human toxicity score of 0.007 for "Neutral" comments.

Experiments and Discussion
Binary Toxicity Classification We showcase the utility of SASS by evaluating PERSPECTIVE and GPT-3 against the human baseline in a binary classification setting. It is important to note that PERSPECTIVE and GPT-3 are very different systems, trained with distinct objectives and on different amounts and sources of data. We believe the comparison is still useful because it provides a "low-effort alternative" against which to check that our examples are not overly complicated. Note that GPT-3 was not fine-tuned explicitly for this task, so we prompt the system in zero-, one-, and few-shot settings for binary toxicity classification. We binarize the PERSPECTIVE and z-normalized human baseline toxicity scores by labeling comments with scores > 0.5 as "toxic".
The binarized ground truth human labels on SASS contain 72.4% toxic labels versus 27.6% non-toxic labels. We use these thresholded human labels as ground truth and evaluate PERSPECTIVE's and GPT-3's performance on SASS in Table 3.
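The binary evaluation then reduces to thresholding both score vectors and computing standard classification metrics; a minimal sketch (scikit-learn is our choice here, and the score vectors are toy values, not SASS data):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy scores; in practice these come from SASS and the models under test.
human = np.array([0.9, 0.8, 0.2, 0.7, 0.6])        # z-normalized human baseline
perspective = np.array([0.1, 0.6, 0.3, 0.2, 0.4])  # PERSPECTIVE toxicity scores

y_true = human > 0.5        # ground truth: "toxic" iff score > 0.5
y_pred = perspective > 0.5  # binarized model predictions

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# On this toy data: precision=1.00 recall=0.25 F1=0.40 -- the
# high-precision/low-recall pattern discussed below.
```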
Model Description PERSPECTIVE uses a Transformer model with a state-of-the-art Charformer encoder. The model is pretrained on a proprietary corpus including data collected from the past version of PERSPECTIVE and related online forums. This dataset is mixed in equal parts with the mC4 corpus, which contains multilingual documents (Lees et al., 2022). GPT-3, created by OpenAI in 2020, is a state-of-the-art autoregressive transformer-based language model (Brown et al., 2020). GPT-3 is trained on a massive amount of internet text data, predominantly Common Crawl (https://commoncrawl.org/) and WebText2 (Radford et al., 2019), and generates human-like language in an open prompt setting.
Results We first observe that PERSPECTIVE performs very poorly on the binary task of toxicity classification on the SASS benchmark (Table 3, F1-Score = 0.08). Note that the majority of comments in SASS were crafted specifically to generate a low toxicity score from PERSPECTIVE, so this is not surprising. We establish the metric regardless, as a baseline to evaluate future versions of the system.
We also examine the performance of GPT-3 in multiple prompt settings for binary (true/false) toxic content classification in Table 3; see Appendix A.1 for details on prompt generation. (Recall that the "Neutral" and "False Positive" categories are inherently non-toxic, accounting for 20% of non-toxic labels.) Each system yields relatively high precision and low recall, generally indicating a significant under-prediction of toxicity on SASS. GPT-3 has more success than a thresholded PERSPECTIVE in classifying harmful comments in SASS as toxic across the board. GPT-3-FEW (F1-Score = 0.61) shows a significant improvement over both GPT-3-ZERO and GPT-3-ONE as well as PERSPECTIVE, yielding the most success relative to the human baseline of any of the experimental formulations.
We hypothesize that GPT-3 outperforms PERSPECTIVE largely due to the sheer scale and scope of data that GPT-3 is trained on, as well as the size of the model itself (175B learnable parameters in GPT-3 versus 102M in the PERSPECTIVE base model). While GPT-3 is not trained for the toxicity detection task specifically, by learning from such a massive amount of internet text data spanning millions of contexts, the model has likely been exposed to a much wider range of potentially toxic material than PERSPECTIVE.
In Table 5 (see appendix), we break down the toxicity scores of PERSPECTIVE and GPT-3 by SASS category, relative to the human baseline. In some categories, both PERSPECTIVE and GPT-3-FEW fall particularly short (for example, PERSPECTIVE predicts an average toxicity score of 21.9% for "Sexual Harassment" comments versus the 80% human baseline). Relative to other categories from SASS, PERSPECTIVE rates comments in "Sarcasm" and "Stereotyping" as similarly highly toxic, while humans rated the toxicity of "Stereotyping" comments significantly higher than those in "Sarcasm." This raises the question of how to properly threshold scores from a toxicity detection system in-the-wild, which Lees et al. (2022) do not comment on, though it seems a reasonable use case for platforms flagging toxic content.
In the "False Positive" category we observe that both PERSPECTIVE and GPT-3-FEW yield very high toxicity scores on average (Table 5), suggesting that the models are overfit to swear word toxicity, and underfit to a deeper interpretation of malicious intent.We believe it is important to delineate between the tasks of swear word detection and toxicity detection, and so find this undesirable.Allowing harmful comments to slip through the cracks is arguably more dangerous than unintentionally removing content with positive intent, but both of these scenarios could be upsetting to a downstream user.We report further on the influence of swear words on toxicity in the next section.
Profanity and Toxicity Detection SASS includes 18 "False Positive" examples that contain swear words. PERSPECTIVE rated all of them as toxic, and GPT-3-FEW labeled 83% of these comments as toxic (this is P[toxic | contains swear word]). This suggests that, instead of understanding when swear words are used to communicate hateful content, PERSPECTIVE may be effectively memorizing their inclusion in toxic text. This could be problematic; swear words can be used to communicate non-toxic emotions, like surprise (e.g., "Holy f*** I got the job!") or excitement (e.g., "Oh sh**! Congratulations.") and should not necessarily be treated equivalently to toxic speech. Furthermore, different genders and races utilize profanity differently, so associating expletives with toxicity could have disparate impacts (Beers Fägersten, 2012). Past work by Gröndahl et al. (2018) evaluating an older version of PERSPECTIVE also detected this issue.
As shown in Table 6 (see appendix), of the 34 SASS examples that PERSPECTIVE rated as toxic, 52% contained a profanity, versus only 11.6% of the examples rated toxic by GPT-3-FEW (this is P[contains swear word | toxic]). A lot of hateful content does not explicitly contain offensive words, and it is troubling that PERSPECTIVE relies so heavily on them in our benchmark.
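The two statistics above are estimated in opposite directions from the same predictions; a minimal sketch with toy arrays (illustrative only, not the benchmark data):

```python
import numpy as np

has_swear = np.array([True, True, True, False, False])   # contains a swear word
pred_toxic = np.array([True, True, False, True, False])  # model flagged as toxic

# P[toxic | contains swear word]: of the swearing examples, the flagged fraction.
p_toxic_given_swear = pred_toxic[has_swear].mean()  # 2/3 on this toy data

# P[contains swear word | toxic]: of the flagged examples, the swearing fraction.
p_swear_given_toxic = has_swear[pred_toxic].mean()  # 2/3 on this toy data

print(p_toxic_given_swear, p_swear_given_toxic)
```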
TweetEval We were surprised that GPT-3-FEW performed better than PERSPECTIVE in the binary classification scenario on the SASS benchmark, and so sought to validate the finding with another prominent toxicity benchmark, TweetEval. We randomly selected 1,000 examples from TweetEval's "Hate Speech Detection" benchmark (Barbieri et al., 2020). We acknowledge that this might be viewed as irrelevant or an unfair comparison, as some "toxic language" may not qualify as "hate speech" (for example, universal insults that do not target a specific group). However, we believe that the reverse claim, that all "hate speech" should qualify as "toxic language", is true. Therefore, evaluating both PERSPECTIVE and GPT-3-FEW on a "hate speech" benchmark, despite both being used to detect "toxic language," is a valid comparison. We found that PERSPECTIVE had an F1-Score of 0.48 and GPT-3-FEW an F1-Score of 0.52 (Table 7, see appendix). The performance gap between PERSPECTIVE and GPT-3-FEW on TweetEval is significantly smaller than on SASS, but the trend (GPT-3-FEW matching or improving on PERSPECTIVE) is comparable. We attribute the smaller gap on TweetEval to the design of SASS, which specifically targets vulnerabilities of the PERSPECTIVE model. Significantly, we were able to validate that GPT-3-FEW, in the binary setting, is a good point of comparison for PERSPECTIVE on another benchmark, and does not only perform well on SASS-specific examples.
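A sketch of this sampling step, assuming the Hugging Face datasets distribution of TweetEval (the seed and split choice are ours; they are not specified above):

```python
import random
from datasets import load_dataset  # pip install datasets

dataset = load_dataset("tweet_eval", "hate", split="test")  # hate speech subtask

random.seed(0)  # our choice, for reproducibility
sample = dataset.select(random.sample(range(len(dataset)), k=1000))

for example in sample.select(range(2)):
    print(example["label"], example["text"])  # label: 1 = hateful, 0 = not
```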

Conclusion and Future Work
We introduce Selected Adversarial SemanticS (SASS) as a benchmark designed to challenge previous normative claims about toxic language. We have shown here that existing tools are far from robust to relatively simple adversarial examples, and fail to report adequately on the implicit biases attached to their model construction. We therefore position SASS as an important additional benchmark that can help us understand weaknesses in existing and future systems for toxic comment detection. Some impactful future work would be to grow the set of examples in SASS and to perform similar vulnerability testing on problems like sentiment analysis and other tools for content moderation. Conducting a future study with a set of random human annotators and demonstrating that the majority rate SASS statements as non-toxic would strengthen our claims of normativity, and make the need for a benchmark like SASS even more apparent. Expanding the set of state-of-the-art NLP toxicity detection or large language models evaluated on SASS would provide interesting future points of comparison. Finally, we emphasize our belief that deployed natural language based tools, potentially serving millions of users, must be examined and reexamined in order to prevent the harmful beliefs of majority groups from being perpetuated.

Ethical Considerations
SASS, the new benchmark proposed in this paper, seeks to address normative claims made by toxicity detection tools that rely on majority opinion to determine malicious content.In the narrow scope of improving toxicity model evaluation, we thus expect SASS to have a positive impact on the NLP community, and by extension on moderation systems for social media and online forums.
However, thousands of content moderators, whose job descriptions include toxic content detection, are currently employed by companies such as Meta. We believe that the best systems for toxic content detection are likely collaborations between humans and machines, but acknowledge that, by improving automated systems, we may jeopardize employment for these people. Still, it is unclear whether content moderation is a task people should perform at all, and automating toxicity detection may reduce the exposure of people to harmful content that could have severe mental health consequences (Steiger et al., 2021).
There is always the risk that, in providing a new benchmark to the larger NLP community, some may use it to make unjustified claims. Therefore, we take this opportunity to highlight the ways in which SASS could be misused. We acknowledge that any benchmark, especially a relatively small one like SASS, will reflect the inherent biases of its authors. No category of SASS is designed to be exhaustive; rather, each is designed to provide an initial probe, a check for model vulnerabilities. Further exploration would be required even if a model performed well on SASS. SASS is also only an English language benchmark, and contains examples that only make sense in an Americanized cultural context. We believe it is important work to create similar benchmarks for other languages and cultural contexts.
Acknowledgements

We would like to thank Sam Bowman and Richard Pang for very useful conversations and feedback over the course of our project. We would also like to thank Julia Stoyanovich and the Center for Responsible AI at NYU for supporting our work.

A Appendix
A.1 Generating prompts for GPT-3

As described in Section 4, GPT-3 was not explicitly trained to classify phrases as toxic or not toxic. We explore zero-, one-, and few-shot settings for prompting GPT-3 to generate results comparable to a thresholded PERSPECTIVE toxicity rating.
Here we describe the prompts used in these experiments.

GPT-3-ZERO: "True or False, the phrase '{phrase}' is toxic? Answer:"

GPT-3-ONE: "Determine if the given phrase is toxic:
Men are superior to women in every way. => True
{phrase} => "

GPT-3-FEW: "Determine if the given phrase is toxic:
Men are superior to women in every way. => True
Grilling with the neighbors is my favorite. => False
{phrase} => "

Table 4: Prompts used for zero-, one-, and few-shot experimentation with toxicity classification using GPT-3. The samples given in the one- and few-shot settings are not present in the SASS benchmark.
In the future, it would be interesting to explore prompting GPT-3 for this task in different ways, including prompting with only toxic comments, using more examples in the few-shot setting, and prompting using examples from SASS.
Additionally, note that in the binary toxicity classification experiment in Section 4, for each prompt setting we take the mode of the predicted labels across 3 runs, due to minor variability in responses, and use that to evaluate GPT-3 in each setting. Intriguing future lines of work here could include quantifying the variability across experiments with GPT-3 and analyzing how the prompt settings and the prompts themselves affect this variability.

The stereotyping benchmarks from Blodgett et al. (2021) are fundamentally different from SASS. Thus, our analysis of pitfalls must rely on slightly different criteria. Using the aforementioned criteria, we created an abbreviated conceptual and operational pitfall taxonomy for toxicity.

A.2.2 Conceptual and operational pitfalls in toxicity benchmarks
Recall that the definition of toxicity according to PERSPECTIVE/Jigsaw is: "a rude, disrespectful, or unreasonable comment that is likely to make you leave the discussion." With this definition, we can begin to construct a set of pitfalls that text from a benchmark might exhibit. However, in order to minimize subjectivity, we outline three major assumptions about examples in our benchmark SASS (and therefore, about what we prescribe as the behavior of a system that "detects toxicity"):

Assume adversarial reading. Within reason, does there exist an individual or group that would be likely to leave a discussion after reading a piece of text (even if they represent a significant minority)?
Assume adversarial context/subtext. Assume that the possible context in which a piece of text is positioned increases the likelihood that someone would leave the discussion after reading it.

Assume bad intentions. Assume that the writer of the text was knowingly malicious in their choice of words.
These assumptions are important because they help make our analysis structured and consistent.
Here are the pitfalls we use in evaluating toxicity, constructed from Blodgett et al. (2021):

1. Conceptual Pitfalls (toxicity)

(a) Meaningful toxicity (from "meaningful stereotypes"): Is the text likely to make an individual leave a discussion, given our assumptions?

(b) Descriptively true statements: Is the text true/factual?

(c) Offensive language: Is the text toxic purely due to swear words? (We believe in delineating between swear word detection and toxic language detection as natural language tasks, though one could argue that swear words themselves are toxic to some people. It is not clear how to resolve this conflict.)

2. Operational Pitfalls (toxicity)

(a) Invalid toxicity markers (from "invalid perturbations," "incommensurable groups," and "logical failures"): Does the text properly signify something that is rude/disrespectful/unreasonable?

(b) Text is not naturalistic: Does the text read as something that would actually be written or uttered?

(c) (Un)markedness (uneven baselines): Does the text appear in a statistically likely/comparable pattern?
A.3 Full benchmark

Text: Your secret can be our secret as long as you do what I tell you. Category: Blackmail (Unfiltered). Toxicity (Perspective): 11.6%. Toxicity (Human): high.

Text: I don't understand why I have to pay taxes to feed lazy low-income families. Category: Classism (Unfiltered). Toxicity (Perspective): 17.4%. Toxicity (Human): high.

Text: We don't admit people from your race in this forum.

Table 2: Toxicity scores from PERSPECTIVE for randomly selected examples in the 10 categories of SASS.