Analyzing the Limits of Self-Supervision in Handling Bias in Language

Prompting inputs with natural language task descriptions has emerged as a popular mechanism to elicit reasonably accurate outputs from large-scale generative language models with little to no in-context supervision. This also helps gain insight into how well language models capture the semantics of a wide range of downstream tasks purely from self-supervised pre-training on massive corpora of unlabeled text. Such models have naturally also been exposed to a lot of undesirable content like racist and sexist language, and there is limited work examining how aware models are along these dimensions. In this paper, we define and comprehensively evaluate how well such language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing. We define three broad classes of task descriptions for these tasks: statement, question, and completion, with numerous lexical variants within each class. We study the efficacy of prompting for each task using these classes and the null task description across several decoding methods and few-shot examples. Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation. We believe our work is an important step towards unbiased language models by quantifying the limits of current self-supervision objectives at accomplishing such sociologically challenging tasks.


Introduction
Transformer-based language models (Vaswani et al., 2017), pre-trained using self-supervision on unlabeled textual corpora, have become ubiquitous (Radford et al., 2019; Brown et al., 2020) in natural language processing (NLP) due to their general applicability and compelling performance across a wide spectrum of natural language tasks, ranging from machine translation (Arivazhagan et al., 2019; Tran et al., 2021) to question answering (Raffel et al., 2020) and dialogue (Bao et al., 2021).

Figure 1: Four tasks defined using natural language task descriptions for bias: diagnosis, identification, extraction and rephrasing, performed by prompting a self-supervised generative language model. Example content from the figure:
Extracted Text: all of these leftists who make their opinion pieces just for them to get posted to every left-leaning sub are absolute garbage.
Rephrasing: Rephrase the text to remove bias.
Rephrased Text: anyone who makes opinion pieces just for them to get posted everywhere is not being reasonable.
To further improve these models' ability to generalize, there has also been interest and success (Shoeybi et al., 2019; Rajbhandari et al., 2020, 2021) in scaling them to billions of parameters and terabytes of unlabeled data. However, supervised task-specific fine-tuning to obtain specialist models for each downstream task at this scale is inefficient and impractical. In-context prompting with natural language task descriptions has been demonstrated (Brown et al., 2020; Weller et al., 2020; Liu et al., 2021b) to be an interpretable, general-purpose technique to query these "foundation" models (Bommasani et al., 2021) to solve several downstream tasks with reasonably high accuracy. While parameter-efficient techniques such as soft-prompt tuning (Lester et al., 2021) and adapter fine-tuning (Rebuffi et al., 2017; Houlsby et al., 2019) have been designed to avoid fine-tuning full specialist models for each task, the original interpretable prompting paradigm can be seen as a mechanism to validate the efficacy of current self-supervision techniques in capturing the semantics of various downstream tasks from unlabeled data.
On the other hand, recent studies (Sheng et al., 2019; Gehman et al., 2020; Nangia et al., 2020; Nadeem et al., 2021) demonstrate that self-supervised language models have learned inaccurate and disturbing biases such as racism and sexism against a variety of groups from the web-scale unlabeled data that they were pre-trained on. Hence, a critical first step toward making these "foundation" models aware and adept at handling bias is quantifying how weak/strong their foundations really are for such complex sociological tasks. In this paper, we take this step and use the natural language task-prompting paradigm to analyze how well self-supervision captures the semantics of the downstream tasks of bias diagnosis (is there bias in a piece of text?), identification (what types of bias exist?), extraction (what parts of the text are biased?) and rephrasing (rephrase biased language to remove bias). The tasks are illustrated in Figure 1 with example natural language task descriptions.
We define broad classes of natural language task descriptions: statement, question and completion, for the aforementioned tasks, and construct numerous lexical variants per class. We study the efficacy of prompting for each task using these classes and the null task description across several few-shot examples and decoding methods. Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation. We observe that performance on the coarse-grained bias diagnosis task is poor, achieving only 42.1% accuracy in the zero-shot setting. Although we observe improvements with in-context supervision, the best performance is only slightly above random chance. We observe that fine-grained bias identification generally benefits from non-null task descriptions and few-shot examples. We also observe large disparities in performance across different bias dimensions (differences as large as 75% for exact match in a zero-shot setting), indicating a skew in internal model biases across dimensions. Qualitative analysis also shows that the phrasing of a task description can have an outsized impact on accuracy of identification. We observe that bias extraction performs best with a span-based decoding strategy compared to alternatives such as unconstrained decoding. We collect our own crowdsourced annotations for bias rephrasing, with several rounds of data verification and refinement to ensure quality. We find that models generally perform poorly on this task, but that larger model size does improve performance. Overall, our work indicates that self-supervised autoregressive language models are largely challenged by tasks intended to diagnose, identify, extract and rephrase bias in language when prompted with a comprehensive set of task descriptions.

Related Work
There has been a large body of work focused on defining and measuring social bias during natural language generation (Sheng et al., 2019; Nadeem et al., 2021; Dev et al., 2021), neural toxic degeneration of language models using prompts (Gehman et al., 2020), understanding social bias implications (Sap et al., 2019), and various bias mitigation strategies (Liu et al., 2021a; Lauscher et al., 2021; Geva et al., 2022; Wang et al., 2022; Guo et al., 2022). Recently, Liang et al. (2021) proposed new benchmarks and metrics to measure representational biases in text. Ma et al. (2020) proposed PowerTransformer, a language model trained with auxiliary objectives such as paraphrasing and reconstruction, and proposed bias-controlled generation for rephrasing. Several datasets have also been released for measuring and rephrasing social bias. Nangia et al. (2020) introduced a dataset with crowdsourced stereotype pairs across different kinds of bias, and Borkan et al. (2019) released a large test set of online comments annotated for unintended bias. More recently, Vidgen et al. (2021) released a dataset annotated with bias labels and spans of biased text in language.
Brown et al. (2020) introduced GPT-3 and demonstrated that in-context few-shot learning, with and without natural language task descriptions, could yield close to state-of-the-art fine-tuning results for several NLP tasks. This was followed by several studies exploring language models with task descriptions and in-context examples (Weller et al., 2020; Schick and Schütze, 2021a,b). There is also work that discusses limitations of this approach: Efrat and Levy (2020) discovered that models perform poorly with task descriptions on both simple and more complex tasks, and Webson and Pavlick (2021) found that models do not understand the meaning of task descriptions for natural language inference and are sensitive to the choice of language model verbalizers.
Our work is inspired by self-diagnosis as proposed by Schick et al. (2021), wherein a language model is prompted to generatively predict whether or not a given piece of text contains a specific bias attribute, such as a threat or sexually explicit language. The task description itself contains the bias attribute, which is derived from the Perspective API.¹ While they find that this generative binary prediction approach works fairly well, it comes with a drawback: diagnosing the mere presence of bias requires pre-defining all bias attributes and prompting the language model as many times as the number of bias attributes. In contrast, our work decouples self-diagnosis into diagnosis of the presence or absence of bias and fine-grained generative identification of the dimension in which a piece of text is biased. We adopt and study this approach to better understand how well auto-regressive language models pre-trained using self-supervision separately capture the semantics of diagnosis and identification. Our work is also broader in scope than Schick et al. (2021), since we also consider the tasks of bias extraction and rephrasing along with several broad classes and lexical variants of task descriptions, few-shot example sampling, and decoding strategies.

Methods
Let M denote an auto-regressive language model with a byte-pair encoding tokenizer T_M, and let p_M(w | w_1, ..., w_k) denote the probability that M assigns to w being the next token, given previous tokens w_1, ..., w_k. Let x denote a document containing a sequence of K tokens. x can be unbiased or contain a biased sub-sequence x_{s:e} defined by start and end token indices s, e ∈ [1, ..., K] where s < e. If the biased sub-sequence exists, it is labeled with a dimension of bias b ∈ B, where B is the set of possible bias dimensions. Finally, x_{s:e} is paired with a rephrased version x^R_{s:e} in which the bias targeting b has been removed. While in principle there can be multiple biased sub-sequences in x, we limit our definition to one sub-sequence due to dataset limitations.

¹ https://support.perspectiveapi.com/s/about-the-api-attributes-and-languages
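As a concrete illustration, the per-document annotation described above can be held in a small record type. This is a sketch: the field names and example values are ours, not from any released codebase.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BiasExample:
    """One document x with an optional biased sub-sequence and its rephrase."""
    tokens: list                       # the K tokens of document x
    span: Optional[tuple] = None       # (s, e) with 1 <= s < e <= K, or None if unbiased
    dimension: Optional[str] = None    # b in B, e.g. "gender"
    rephrase: Optional[str] = None     # x^R_{s:e} with the bias removed

    def biased_subsequence(self):
        # x_{s:e}: tokens from index s through e (1-indexed, inclusive)
        if self.span is None:
            return None
        s, e = self.span
        return self.tokens[s - 1:e]

# Illustrative instance loosely based on the Figure 1 example.
ex = BiasExample(
    tokens="all of these leftists are garbage".split(),
    span=(1, 6),
    dimension="political affiliation",
    rephrase="some opinion pieces posted everywhere are not reasonable",
)
```

The record makes the single-span restriction explicit: a document carries at most one `(s, e)` pair, one dimension, and one rephrase.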

Bias Diagnosis
This task aims to diagnose whether or not bias exists in document x, for which we define two general task descriptions (see Table 1). We also utilize a null or empty task description as a baseline against which to measure the efficacy of non-null task descriptions.
Question: Is there bias in the previous text? Answer:
Question: Does the above text contain bias? Answer:

Decoding: We estimate the probability that the input document x contains bias as

    p(bias | x) = p_M(Yes | f, template(x)) / [p_M(Yes | f, template(x)) + p_M(No | f, template(x))],

where template(x) denotes the input x followed by its task description, and f denotes the in-context examples associated with x. We threshold on this probability (> 0.5) to diagnose if x is biased.
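This renormalized Yes/No thresholding can be sketched as follows. The `toy_lm` function and its scores are stand-ins for a real model's next-token log-probabilities, purely for illustration:

```python
import math

def diagnose(next_token_logprob, prompt, threshold=0.5):
    """Estimate p(bias | x) by renormalizing the model's mass on 'Yes' vs 'No'."""
    log_yes = next_token_logprob(prompt, " Yes")
    log_no = next_token_logprob(prompt, " No")
    p_yes = math.exp(log_yes) / (math.exp(log_yes) + math.exp(log_no))
    return p_yes > threshold, p_yes

# Toy stand-in model that slightly prefers "Yes" for this prompt.
def toy_lm(prompt, token):
    return -0.5 if token == " Yes" else -1.5

biased, p = diagnose(toy_lm, "<doc> Is there bias in the previous text? Answer:")
```

Because the two probabilities are renormalized against each other, the 0.5 threshold corresponds simply to whichever of Yes/No the model prefers.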

Bias Identification
This task aims to identify the fine-grained dimension b ∈ B on which document x is biased. We create numerous lexical variants of task descriptions that broadly belong to one of three classes: statement, question and completion. The first two contain descriptions that are explicit statements and questions respectively, while the third contains descriptions that are incomplete, fill-in-the-blank style. These classes allow us to investigate the sensitivity of performance to the phrasing of a task description. Specifically, we create 24 statement-type, 12 question-type and 72 completion-type task descriptions. See Table 2 for examples from each class and Tables A3 and A5 in the appendix for all variants. We also compare against performing the task using a null or empty task description to study the efficacy of using non-null task descriptions.

Decoding: We first tokenize each fine-grained bias dimension b ∈ B using T_M to obtain a list of byte-pair encoding tokens. We denote the set of these lists by B^{T_M} and the set of the first byte-pair encoding tokens of all bias dimensions in B by B^{T_M}_1. In the first decoding time-step, we constrain the output vocabulary in model M's logits to the set of all first byte-pair encoding tokens for B, i.e., the set B^{T_M}_1, perform a softmax across the constrained logits, and estimate the probability that the bias dimension begins with each such token. We use argmax decoding to identify the first byte-pair encoding token and feed it back into the input to M for the next decoding time-step. Now, however, the output vocabulary is constrained to the set of all plausible next byte-pair encoding tokens plus the end-of-sentence token [EOS], and argmax decoding is used again. The [EOS] accounts for the possibility that the first decoded token is a fully formed bias dimension in B, i.e., one consisting of a single byte-pair encoding token. This process continues until we hit the end-of-sentence token [EOS]. As an example, suppose there is a bias dimension called "political affiliation" in B and there is no other bias dimension that starts with the token "political". If the model M decodes "political" in the first time-step, we feed it back into the input and constrain the vocabulary of M's output logits to [[EOS], "affiliation"] before performing a softmax followed by argmax decoding for the next time-step.
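The constrained decoding loop above can be sketched as follows. The hand-split token lists and the `prefs` scorer are toy stand-ins for real byte-pair encodings and model logits:

```python
EOS = "[EOS]"

def constrained_decode(score, label_tokens):
    """Greedily decode a label, constraining each step to tokens that can
    continue some label in the set (plus EOS once a full label is formed)."""
    decoded = []
    while True:
        # Candidate continuations given what has been decoded so far.
        cands = set()
        for toks in label_tokens:
            if toks[:len(decoded)] == decoded:
                if len(toks) == len(decoded):
                    cands.add(EOS)       # decoded prefix is already a complete label
                else:
                    cands.add(toks[len(decoded)])
        best = max(cands, key=lambda t: score(decoded, t))  # argmax over constrained vocab
        if best == EOS:
            return decoded
        decoded.append(best)

labels = [["political", "affiliation"], ["gender"], ["race"]]
# Toy scorer that prefers the "political affiliation" path.
prefs = {"political": 2.0, "affiliation": 2.0, "gender": 1.0, "race": 0.5, EOS: 0.0}
label = constrained_decode(lambda decoded, t: prefs[t], labels)
```

After "political" is decoded, the only valid continuation is "affiliation", mirroring the worked example in the text.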

Bias Extraction
This task aims to extract a biased sub-sequence x_{s:e} from a biased input document x containing K tokens, where s, e ∈ [1, ..., K] are start and end token indices such that s < e.
As with bias identification, we investigate three classes of task descriptions for bias extraction with several lexical variants in each class: statement (36 variants), question (18 variants) and completion (108 variants), and compare against performance with the null/empty task description. See Table 3 for examples from each class and Tables A3 and A5 in the appendix for all variants.
Decoding: To decode token y_t, we first obtain model M's logits given (f, template(x), y_{<t}), where template(x) denotes the input x followed by its task description, f denotes the in-context examples associated with x, and y_{<t} is the sequence of tokens decoded until time-step t. Since the extracted sub-sequence x_{s:e} can at most be the length of the input document x, we set the maximum length for the decoding loop to the length of x in each of the following decoding settings considered for extraction:

1. Unconstrained: we perform temperature scaling (T = 0.7) and top-k filtering (k = 50) of the logits, followed by a softmax and multinomial sampling to decode the next token (Holtzman et al., 2019).
2. Constrained: we adopt the same approach as above but with M's logits constrained to V = V_x ∪ {[EOS]}, where V_x is the set of byte-pair encoding tokens in the input document x and [EOS] is the end-of-sentence token.
3. Span-based: we use a similar approach as in bias identification but adopt multinomial sampling for decoding, i.e., in the first time-step, V = V_x, but in the second and future time-steps, the vocabulary is constrained to the set of plausible next byte-pair encoding tokens in x or [EOS]. For example, in Figure 1, when M generates "opinion" at a given time-step, we force it to generate from the candidates [[EOS], "pieces"] at the next time-step to maintain the span.
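The span-based strategy can be sketched as follows. For determinism the sketch uses argmax where the paper samples multinomially, the toy document has unique tokens, and `prefs` stands in for model logits:

```python
EOS = "[EOS]"

def span_extract(score, doc_tokens):
    """Decode a contiguous sub-sequence of doc_tokens: the first token may start
    anywhere in x, and each later step must extend the chosen span or emit EOS."""
    out = []
    start = None
    for _ in range(len(doc_tokens)):  # extracted span is at most len(x)
        if start is None:
            positions = {t: i for i, t in enumerate(doc_tokens)}  # any start position
            allowed = list(positions)
        else:
            nxt = start + len(out)    # only the next token of the span, or EOS
            allowed = ([doc_tokens[nxt]] if nxt < len(doc_tokens) else []) + [EOS]
        best = max(allowed, key=lambda t: score(out, t))
        if best == EOS:
            break
        if start is None:
            start = positions[best]
        out.append(best)
    return out

doc = "those opinion pieces are absolute garbage".split()
# Toy scorer that favors starting at "absolute" and continuing the span.
prefs = {"absolute": 3.0, "garbage": 2.0, EOS: 1.0}
span = span_extract(lambda out, t: prefs.get(t, 0.0), doc)
```

Constraining each step to the single in-document continuation (or EOS) is what guarantees the output is a contiguous span of x rather than an arbitrary bag of its tokens.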

Bias Rephrasing
This task aims to rephrase a biased sub-sequence x_{s:e} of a biased input document x into x^R_{s:e} such that the bias is removed. As with prior tasks, we investigate three classes of task descriptions for bias rephrasing with several lexical variants in each class: statement (12 variants), question (6 variants) and completion (36 variants), and compare against performance with the null/empty task description. See Tables A3 and A5 in the appendix for all variants.
Decoding: We perform unconstrained decoding in a manner similar to the corresponding setting for bias extraction. Since rephrasing is done directly on the biased sub-sequence, we also set x = x_{s:e}.
Experimental Setup

We use the Contextual Abuse Dataset (CAD; Vidgen et al., 2021) for all four tasks. For identification, we directly use the CAD bias dimension to label datapoints. For extraction, we directly use the CAD rationales as the labels. For rephrasing, we collect our own labels as follows.
Rephrase Data Collection: We use in-house native (US) English-speaking crowd-workers to collect rephrases of CAD rationales such that the bias in each rationale is removed. Our collection protocol follows three phases: collect, verify, and refine. First, we collect initial rephrases from the crowd-workers.
For quality, we ask a separate set of crowd-workers to verify the rephrases and provide feedback if a rephrase is considered incorrect. Finally, we task a third set of crowd-workers to refine the rephrase as per the feedback received from verification. We complete two rounds of refinement and verification for the eval set and one round of refinement and verification for the train set. We post-process the data to remove empty rephrases. Crowd-worker instructions are in the appendix, Table A6.

Models
We use the auto-regressive text generation models GPT-Neo and GPT-J. GPT-Neo is an open-source reproduction of certain smaller sizes of GPT-3, pre-trained on The Pile (Gao et al., 2020); we use the 1.3B-parameter version. GPT-J is a 6B-parameter open-source variant of GPT-3, also pre-trained on The Pile. We use GPT-J to evaluate the more challenging rephrasing task and compare against GPT-Neo to understand the effect of model size on performance.

Evaluation Metrics
Diagnosis: We use Accuracy and overall F1 scores to evaluate GPT-Neo's ability to diagnose text for the presence or absence of bias.
Identification: We adopt a strict Exact Match to evaluate GPT-Neo's ability to generate the correct bias dimension tokens. Partial matches (e.g., predicting only the first byte-pair encoding token of a bias dimension containing multiple byte-pair encoding tokens) are counted as incorrect. Thus, a model must generate the full bias dimension token(s) to be marked correct.
Extraction: We use standard natural language generation metrics (BLEU, METEOR, token-level F1) to evaluate GPT-Neo's ability to extract biased spans in a generative setting, comparing the ground-truth rationale to the model's generation.
Rephrasing: We use standard natural language generation metrics (BLEU, METEOR, token-level F1) to evaluate GPT-Neo's ability to rephrase biased spans in a generative setting, comparing the ground-truth rephrased rationale to the model's generation.
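Token-level F1, used for both extraction and rephrasing, can be computed as below. This whitespace-tokenized sketch is ours and may differ from the paper's exact tokenization:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, counting multiplicity."""
    pred, ref = prediction.split(), reference.split()
    # Multiset intersection: each shared token counts as many times as it
    # appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("anyone who posts opinion pieces",
                 "anyone who makes opinion pieces")
```

Here four of five tokens overlap, so precision, recall, and F1 are all 0.8.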

Few-Shot Variations and Sampling
It has been shown that increasing the number of few-shot examples in-context might lead to improved performance for certain tasks (Brown et al., 2020). We use the following settings for the number of few-shot examples n: 0, 5, 10 and 20, and study its effect on performance. We experiment with two strategies for sampling in-context examples: random and oracle.

Random Sampling: For each example document x in the CAD evaluation set, we randomly sample n labeled examples from the CAD training set and use them in-context. We utilize this approach for all four tasks being analyzed.
Oracle Sampling: We utilize this method for bias identification only, since it uses a defined set of labels. For each example document x with ground-truth bias dimension b in the CAD evaluation set, we sample similar labeled examples from the broader coarse-grained bias category that b belongs to in the CAD taxonomy. This is more realistic since such a weak oracle can easily be constructed in practice via techniques such as TF-IDF or cluster-based sampling. Our procedure ensures label diversity for in-context examples and also avoids corpus-level sampling skew due to differing label distributions in the train set. See Section A.2 in the appendix for the exact procedure.
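The weak-oracle sampler might look as follows. The toy taxonomy and training pool are our own stand-ins, and the paper's exact procedure is given in Section A.2:

```python
import random

def oracle_sample(train_set, taxonomy, gold_dimension, n, seed=0):
    """Sample n in-context examples whose labels share the coarse-grained
    category of gold_dimension, cycling through that category's fine-grained
    dimensions to keep the sampled label set diverse."""
    coarse = next(c for c, dims in taxonomy.items() if gold_dimension in dims)
    pools = {d: [ex for ex in train_set if ex["label"] == d]
             for d in taxonomy[coarse]}
    if not any(pools.values()):
        return []
    rng = random.Random(seed)
    picked = []
    while len(picked) < n:
        # Round-robin over the category's dimensions for label diversity.
        for d in taxonomy[coarse]:
            if len(picked) == n:
                break
            if pools[d]:
                picked.append(rng.choice(pools[d]))
    return picked

taxonomy = {"identity-directed": ["gender", "race"],
            "affiliation-directed": ["political affiliation"]}
train = [{"text": f"doc {i}", "label": lab}
         for i, lab in enumerate(["gender", "race", "gender", "political affiliation"])]
shots = oracle_sample(train, taxonomy, "gender", n=2)
```

Round-robin sampling within the gold label's coarse category is one way to realize both properties claimed above: label diversity among the shots and independence from the train set's skewed label distribution.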

Bias Diagnosis Results
Table 4 demonstrates how coarse-grained bias diagnosis performs with the question class of task descriptions and the null task description. We observe that the null task description performs better than the question class, indicating that non-null task descriptions are not particularly helpful for this task and that the decoding mechanism is able to perform the task using the document alone. We also observe an improvement in accuracy when increasing the number of few-shot examples from 0 to 5. We further note that the Yes label makes up 49% of the data and the No label 51% (see Table A1 in the appendix for more detail); thus, random accuracy is higher than the accuracy in some settings reported in Table 4.
Even with no explicit information about the task, the decoding mechanism is able to perform the task using the document alone, but only to a limited degree of success. We also observe relatively stronger performance for the gender and political affiliation dimensions.
We then provide in-context examples for our inputs via the two sampling strategies described in Section 4.4. Tables A7 and A8 show the effect of increasing in-context examples when using weak-oracle versus random sampling respectively. We observe that using a weak oracle to sample in-context examples greatly improves exact match performance across all task descriptions and bias dimensions. However, we also observe that exact match generally saturates or drops after 10 in-context examples. We also observe improvements in exact match relative to zero-shot performance in Table 9, specifically noting consistent improvements from the zero-shot setting to the few-shot setting with 5 in-context examples across most settings. This indicates that a small number of few-shot examples is useful for the model to learn the identification task, but a sizable gap in performance still exists.
The type of coarse-grained category that each fine-grained bias dimension belongs to also has an effect on performance. Bias dimensions that belong to the Affiliation-directed Abuse coarse-grained category tend to see better performance in the weak-oracle sampling setting, which we hypothesize to be due to the higher likelihood of sampling an example with the ground-truth label (as the label set for this category is small).

Table 6 reports bias extraction performance with span-based decoding for each class of task descriptions (mean ± standard deviation across lexical variants; columns are the number of in-context examples n):

             n=0          n=5          n=10         n=20
  Statement
   F1-token  41.4 ± 1.5   51.0 ± 0.6   50.6 ± 0.5   51.6 ± 0.5
   BLEU-4    36.8 ± 1.2   36.0 ± 0.5   35.7 ± 0.6   36.1 ± 0.7
   METEOR    50.3 ± 2.1   66.8 ± 0.8   67.2 ± 0.9   67.8 ± 0.8
  Question
   F1-token  41.3 ± 1.0   50.4 ± 0.8   50.9 ± 0.9   50.9 ± 0.6
   BLEU-4    36.5 ± 1.1   36.5 ± 0.8   36.9 ± 1.0   36.5 ± 0.8
   METEOR    49.7 ± 1.2   65.0 ± 1.4   66.0 ± 1.2   65.8 ± 1.1
  Completion
   F1-token  36.4 ± 1.5   50.8 ± 0.5   51.7 ± 0.5   52.2 ± 0.5

Lexical Variation Analysis: We further investigate the task description classes that result in high standard deviations for certain bias dimensions. Specifically, we observe the high standard deviation for the political affiliation bias dimension in Table 9 and use the completion-style task description class as a case study (where the standard deviation is 35.5). We retrieve the lexically varied task descriptions that achieve greater than 80% accuracy and those that achieve less than 10% accuracy for the political affiliation dimension, and observe that all task descriptions in the well-performing subset include the word "bias", whereas all task descriptions in the poorly-performing subset include the word "toxicity". This indicates that the choice of words in a task description may affect some bias dimensions more than others.
We observe that GPT-Neo's performance starts to saturate quickly with an increase in in-context examples, and depending on the task description, GPT-J can hit peak performance as early as the zero-shot setting (e.g., with the statement and question classes). We calculate overlap metrics comparing the ground-truth rephrased rationale against the original rationale and obtain the following scores: F1-token = 49.9, BLEU-4 = 48.9, and METEOR = 55.2. This relatively high overlap between the rationale and the rephrase is a property of the data, indicating that a model can learn to simply reconstruct the input as an effective way to game the metric. We show examples of correct and incorrect rephrases by GPT-J in Table 10. We observe that most model outputs are rephrases of the rationales that still contain bias. A model output is deemed correct if it is an unbiased rephrase of the original rationale. Most model rephrases fall in the category of (incorrect) rationale paraphrases that retain the original bias. GPT-J also seems to have an easier time correcting word-based bias as opposed to sentiment-based bias, and we see that the model is able to successfully replace slurs even though the overall biased meaning of a rationale remains. The model also occasionally generates outputs that are unbiased but retain no semantics of the original rationale. Finally, there are cases in which the model generates its own examples with inputs/outputs and the task description but does not rephrase the given rationale.

We also experimented with task descriptions that include the specific bias dimension that the rephrase should target, e.g., "Rephrase the previous text to remove bias targeting gender". We implemented this for each class of task descriptions and observed no improvements for either model, indicating that the models do not benefit from information about the type of bias they should rephrase.

Conclusion
In this paper, we used the natural language task-prompting paradigm with popular auto-regressive language models to comprehensively analyze how well self-supervised pre-training captures the semantics of four tasks: bias diagnosis, identification, extraction and rephrasing. We performed experiments across multiple classes of task descriptions with numerous lexical variations, decoding mechanisms, and different in-context examples, varying both their number and sampling method. We find that such models are largely challenged when prompted to perform these tasks and also exhibit large disparities in performance across different bias dimensions. We further demonstrate and discuss potential biases and task description sensitivities that such language models exhibit. We hope our work promotes future research on curating pre-training corpora and enhancing self-supervision during pre-training (Lewis et al., 2020) toward building language models that are more aware of and adept at handling biases present in language, which would ultimately provide a path to safer adoption for downstream use-cases.

Limitations
Model Sizes: Zero and few-shot in-context learning with task descriptions is seen as a phenomenon that emerges at very large model sizes. However, at the time of our work, the largest publicly available model checkpoint was only about 6B parameters (GPT-J). It would certainly be interesting to re-run our analyses on newly released larger decoder models such as OPT-66B from Meta.

CAD Annotations: We rely on the CAD dataset annotations and taxonomy for the construction and evaluation of our tasks. Here we discuss several limitations of CAD that may impact our work. First, each CAD rationale is labeled via a target bias dimension. Therefore, there may exist multiple biased rationales in each instance, and some may not be annotated if they do not contain bias in the target bias dimension. This limits our ability to evaluate whether every possible biased rationale in the text was extracted, or whether every specific kind of bias in the text was identified, for a single example. Additionally, we restrict our tasks to the bias dimensions and categories defined by CAD, but we recognize that other kinds of bias or abuse may exist in text.

Task Descriptions: Our prompt-based evaluation of auto-regressive language models for their ability to handle bias in language uses a thorough and principled enumeration of a variety of task descriptions for all defined tasks. However, it is quite possible that there are optimally performing task descriptions that we missed. The prompting literature is still in its infancy, and there are new methods for finding the right task description for a given task that we leave for future exploration.

Ethical Considerations
Compute Efficiency: The guiding principle behind our work is that large language models must be pre-trained to be aware of bias in language and adept at mitigating it, enabling safer adoption for downstream use-cases. Hence, our work on benchmarking language models for such capabilities involves no fine-tuning and has none of the extra computational and storage costs associated with fine-tuning, leading to a low carbon footprint.

Rephrase Data Collection: We use a particular set of annotator guidelines when collecting the rephrase data, which define the types of rephrases we target and may not exhaustively represent all interpretations of biased language. This includes instructions to not make assumptions about the source of a comment or writer identity, which may potentially lead to rephrases of non-abusive or in-group language. Additionally, if there is ambiguity about whether statements in a CAD document implicitly exhibit bias, we ask annotators to try to preserve the factual content of the document but remove any assumed intent among all individuals in the target group. While we focus on this interpretation of rephrasing bias, there may be other approaches not covered in this work that we leave for future work to explore.

A.1 Data Distributions
We provide label distributions for the evaluation set and the train set from which in-context examples are sampled in Tables A1 and A2 for bias diagnosis and identification respectively.

Table 1: Question-type task descriptions for diagnosis.

Table 2: Classes of task descriptions for identification and an example from each class.

Table 3: Classes of task descriptions for extraction and an example from each class.
Data Filtering & Labels: We filter CAD to a subset that satisfies the following requirements: i) the instance includes a rationale; ii) the rationale is a sub-sequence of the input document in the instance; iii) the rationale is biased in one of the target dimensions included in the coarse-grained categories listed above; iv) if the instance is in the training set, the input document must be less than or equal to 150 words. We combine the CAD development and test sets to create our evaluation dataset and use the CAD train set to sample in-context examples. Final evaluation dataset sizes for each task are: diagnosis 1209; identification 580; extraction 496; rephrasing 437. We refer the reader to Tables A1 and A2 in the appendix for a thorough breakdown of the diagnosis and identification evaluation and training sets by label. For diagnosis, we map any datapoint in CAD that is labeled with a bias dimension to the diagnosis label Yes and use the CAD Neutral label to map datapoints to the diagnosis label No.

The results in Table 4 suggest the task is more challenging than expected from the self-diagnosis formulation of Schick et al. (2021).

Table 4: Zero and few-shot bias diagnosis results.

Table 6: Bias extraction performance with span-based decoding for all classes of task descriptions. We report mean and standard deviation across runs with all lexical variants in each class.
Additionally, the null task description tends to perform well for most bias dimensions with in-context examples, indicating that the task description might not be as important as the in-context examples.

Table 7: Bias rephrasing with GPT-Neo for all classes of task descriptions. We report mean and standard deviation across runs with all lexical variants.

Table 8: Bias rephrasing with GPT-J for all classes of task descriptions. We report mean and standard deviation across runs with all lexical variants.

Table 10: Examples of CAD rationales and output rephrases by GPT-J. Outputs marked correct are unbiased rephrases of the rationales.

Table A1: Label distributions for train (from which in-context examples are sampled) and evaluation sets used for bias diagnosis.
3. Iterate through each fine-grained bias dimension in S^C_b and randomly sample an example with the respective label from the CAD train set.

Table A7: Few-shot bias identification with weak-oracle sampling of in-context train examples across different classes of task descriptions. We report mean and standard deviation of Exact Match across lexical variants of task descriptions and 3 sets of train examples. PNG = Perceived Negative Groups, Pol. Affil. = Political Affiliation.

Table A8: Few-shot bias identification with random sampling of in-context train examples across different classes of task descriptions. We report mean and standard deviation of Exact Match across lexical variants of task descriptions and 3 sets of train examples. PNG = Perceived Negative Groups, Pol. Affil. = Political Affiliation.