Towards Robust NLG Bias Evaluation with Syntactically-diverse Prompts

We present a robust methodology for evaluating biases in natural language generation(NLG) systems. Previous works use fixed hand-crafted prefix templates with mentions of various demographic groups to prompt models to generate continuations for bias analysis. These fixed prefix templates could themselves be specific in terms of styles or linguistic structures, which may lead to unreliable fairness conclusions that are not representative of the general trends from tone varying prompts. To study this problem, we paraphrase the prompts with different syntactic structures and use these to evaluate demographic bias in NLG systems. Our results suggest similar overall bias trends but some syntactic structures lead to contradictory conclusions compared to past works. We show that our methodology is more robust and that some syntactic structures prompt more toxic content while others could prompt less biased generation. This suggests the importance of not relying on a fixed syntactic structure and using tone-invariant prompts. Introducing syntactically-diverse prompts can achieve more robust NLG (bias) evaluation.


Introduction
Pre-trained language models (LMs) like GPT2 (Radford et al., 2019) and BART (Lewis et al., 2019) have been used for various downstream language generation tasks (Qiu et al., 2020) like machine translation (Liu et al., 2020), dialog systems (Zhang et al., 2019) and story generation (Guan et al., 2020).Past research has shown biases in NLG systems (Sheng et al., 2021c;Barikeri et al., 2021) like machine translation and dialog (Mehrabi et al., 2021;Prates et al., 2020;Henderson et al., 2018;Sheng et al., 2021a,b;Sun et al., 2022).Despite these empirical studies showing evidence of bias, there has been less work on evaluating the bias evaluation approaches for NLG systems (Zhou et al., 2022;Schoch et al., 2020).It is important to perform a systematic, robust and automated bias analysis to help build equitable NLG systems.Specifically, Sheng et al. (2019) introduce prefix templates to prompt LMs, analyze bias in the generated text and introduce the concept of regard.Past works use fixed prompts to evaluate the fairness in NLG (Sheng et al., 2019;Yeo and Chen, 2020;Honnavalli et al., 2022) and NLU (Bolukbasi et al., 2016;Zhou et al., 2019;Rudinger et al., 2018;Zhao et al., 2018;Lu et al., 2020).These fixed prompts could generate different outputs when paraphrased and are not syntactically diverse enough to bring out all the stereotypical aspects of LMs.Past work has shown that LMs are highly sensitive to the formulation of prompts (Liu et al., 2021a;Suzgun et al., 2022;Cao et al., 2022;Sheng et al., 2020).Fixed handcrafted prefix prompting could lead to unreliable bias analysis with results that are not generalizable or robust.To overcome this, we propose a robust and rich bias analysis methodology by automatically generating 100 paraphrased versions of Sheng et al. (2019)'s fixed prompts and analyzing the regard scores (Sec 2, Sheng et al. (2019)) of the generated outputs.Past works (Qin and Eisner, 2021;Liu et al., 2021b;Li and Liang, 2021) have optimized a mixture of prompts to find the most effective prompts but have not analyzed them from a fairness perspective.We also aid interpretability by analyzing which syntactic structures generate the least/most toxic content.
Our results show similar overall trends with past works.A fine-grained analysis suggests that LMs propagate stereotypical behaviour which could be toxic towards any demographic group (more towards disadvantaged groups).Prompts that are more assertive, assume a positive trait or are self-identifications, generate more positive content while prompts that generalise to "people" instead of "person" generate more stereotypical and negative outputs.We show how our analysis methodology is robust and rich.Some syntactic structures lead to contradictory conclusion compared to previous works.Some syntactic structures tend to generate toxic content for all demographic groups while some others are good for all; some favor advantaged groups while some also favour disadvantaged groups.This suggests the importance of not relying on a single fixed prompt and introducing syntactically-diverse prompts for bias evaluation in NLG systems.This also suggests instability of prompts and that we should carefully prompt LMs.
We summarize our contributions as: (1) LMs are sensitive to prompt formulation.We propose a robust approach for NLG bias evaluation using automatic prompt paraphrasing.This approach increases the similarity in bias evaluation conclusions by 99.9%.(2) We propose a fine-grained analysis methodology for bias evaluation in a syntacticallysegregated manner which is not possible with fixed hand-crafted prompts.(3) Our methodology can be extended to other prompt-based evaluation approaches to make it more robust and rich.
2 Background: NLG Bias Evaluation Sheng et al. (2019) introduce the use of prompting templates for NLG evaluation as: The woman x worked as y z .
Manual Prompts.x represents the demographic group i.e. man, woman, white person, black person, straight person, gay person.y is verb phrases that we use to elicit the model including worked as, earned money by, had a job as, had a part-time job as, started working as, was described as, was known for, was regarded as, was thought of as and was well-known for.The combination of x and y forms the prompt which is of 2 types: occupation (first 5 from above) and respect (latter 5).The difference between respect and occupation can also be understood from Table 2 of (Sheng et al., 2019) Regard.Sheng et al. (2019) introduce the concept of regard as a measure of bias in language.While sentiment measures overall polarity of a sentence, regard measures language polarity towards a demographic group and is positive, negative or neutral.
For examples of sentiment and regard, refer to Table 3 of Sheng et al. (2019).

Problem Formulation
While past works stop at fixed prompts and evaluate potential bias, we ask whether using different syntactic structures to paraphrase and prompt the LMs will lead to different conclusions of bias evaluation.We then get 10 GPT-2 generated texts in z (Section 2) for each demographic group.We illustrate our task as follows: Paraphrase.We use AESOP (Sun et al., 2021) to generate 100 paraphrases for each prompt.Specifically, we use 50 syntactic structures retrieved from ParaNMT and 50 from QQP-Pos dataset using AE-SOP.Retrieved syntactic structures from ParaNMT and QQP-Pos will guide generation through declarative and interrogative prompts.QQP-Pos is collected from Quora, while ParaNMT is collected by back-translating English references.
Generation.Following the setting in (Sheng et al., 2019), we use GPT-2 small with top-k sampling and complete the sentence S after the prompts or its paraphrases.We use 10 random seeds to ensure the reliability and generalizability.For each demographic group, we have 10 (number of verb phrases V P ) * (100 + 1) (number of paraphrased prompts P P with corresponding syntactic structure SP + original fixed prompt OP ) * 10 (random seeds).
Evaluation.We get the REGARD score from the regard classifier trained by Sheng et al. (2019) to measure the bias.We also perform a human evaluation of the regard classifier, the details of which are mentioned in Appendix A. We get the RE-GARD score for each completed sentence S which includes S op and 100 S pp for 10 random seeds, then we calculate the average score and the standard deviation.To further understand if the distribution of the REGARD scores we perform extensive evaluations and analysis detailed in Sections 4 and 5.
Robust & Rich.We define a robust bias analysis technique as the one that does not change its result even when we change the syntactic structure or the tone of the prompt to the LM for the same set of randomly selected seeds.We define a rich bias analysis algorithm as one that gives us more insight into the results and is more interpretable for which we do the segregated analysis.

Individual Group Evaluation
We summarize our overall methodology in Fig. 1.We analyze the ratio of positive, negative and neutral regards for generated outputs from various demographic groups and syntactic structures.The values that we calculate include: Aggregated Analysis.For each demographic group, we average the regard score across all syntactic structures, prompt types and seeds to get the average and standard deviation of the distribution of regard scores.We compare this with the case of using one fixed syntactic structure as in Sheng et al. (2019).We do this using our methodology as Sheng et al. (2019) use human annotation for their analysis and train their regard classifier based on that.This is also to facilitate a more direct comparison with sample ratio consistency.We also plot the percentage of positive, negative and neutral regard scores to further understand if the distribution of regard scores are similar to those of the past works.Analysis Segregated By Syntactic Structures.
For each demographic group and syntactic structure, we average across 10 prompt types and seeds to get the average regard scores.We then find the 5 best and worst syntactic structures based on their average regard scores for each demographic group and take an intersection of these syntactic structures across the demographic groups.We then take a union of the regard scores for the best and the worst cases for all demographic groups and plot the average in Fig. 3(b).This helps us understand the variance of toxicity in between different syntactic structures for all demographic groups.We want to further answer the following: • Are the overall regard score trends similar to past works after using syntactically diverse prompts?• Will using paraphrases of certain syntactic structures lead to more biased/less biased generation compared to the case with original prompt?
5 Pair-wise Group Evaluation For pair-wise group evaluation, we compute the gap between pairs of groups including females v.s.males, black v.s.white and gay v.s.straight.For each pair, we get the gap between the advantaged and disadvantaged group, which can further provide answer to two research questions.Technically, we use two ways to evaluate the gap.Aggregated Analysis.First, we consider the absolute value of the gap following: where 10 is the number of prompt types, 100 is the number of syntactic structures that we use to guide the paraphrase generation and S pp refers to the sentence(S) generated with the paraphrased prompt(P P ).We calculate the Score general for each demographic group, and calculate the pairwise gap with Score advantaged_group − Score disadvantaged_group .
We do the same for a fixed syntactic structure (Sec 2) using our methodology for a more direct and scalable comparison.different demographic groups.For this, we evaluate the 5 best and worst syntactic structures based on the gap and analyse the average regard score gap for gender, race and sexual orientation.This helps us distinguish the syntactic structures which favor advantaged groups from the ones that favor the disadvantaged groups.We want to further answer: • Do the pairwise results follow similar trends as compared to the past works when prompted with syntactically diverse prompts?• For each demographic group, will using different ways to prompt the model derive different fairness conclusions?For eg., using the original prompt, GPT-2 may be more biased towards woman, while it may be more biased towards men after paraphrasing this prompt.

Results
The results described below are specific to GPT-2.

Individual Group Analysis
Aggregated Analysis: From Fig. 3(a), we see that average regard scores for various demographic groups follow trends similar to baseline as both plots are almost similar.We also observe that texts generated from gay person prompts are classified as more negative compared to all other demographic groups.Prompts for both black person and white person generate almost similar positive, negative and neutral trends (Fig. 2(b)) but positive outputs for white person are higher by 1%.These trends become more clear when we observe Fig. 2(b).An interesting observation here is that, that the overall results for "all" are more negative than positive which shows that our LMs generate more toxic content than positive.Also, texts generated for gay person have 51% probability of being negative.Hence, it is imperative to analyze the regard of text generated using multiple syntactic structures.
Analysis Segregated by Syntactic Structures: We find the best and worst syntactic structures by taking an intersection of these parses for all the demographic groups and plot them in Fig. 3 (b).From this we observe that, some syntactic structures have a higher average regard score for all demographic groups than the others which shows that syntactically manipulating the prompts given to the LMs can help reduce toxicity of the text generated (examples in Table 1 and App B).

Pair-wise Group Analysis
Aggregated Analysis: In Fig. 3(c), we have plotted the gap between the average regard scores from male v/s female, straight person v/s gay person and white person v/s black person.For the ease of understanding we have names these gaps as gender, orientation (sexual orientation) and race respectively.These trends show that there is a notable positive gap favouring the advantaged groups as compared to the disadvantaged groups but this is most evident in the case of sexual orientation where the content generated for gay person prompts is toxic.We compare this with the baseline and observe that the trends are similar but the results with a single syntactic structure are unreliable when we look at the segregated analysis.Next we calculate the pairwise KL divergence in Table 2. From this we observe similar trends as we observed in the individual analysis.Almost all the demographics have a high divergence from the gay person.This shows that the regard categorical probability distribution of gay person is different than others and is more negative (Fig 2(b)).We see that the divergence is not that high for man v/s woman. 3In general, we observe that prompts that are more assertive, assume a positive trait or are self-identifications generate more positive content.While prompts that generalise to "people" instead of "person" generate more stereotypical and negative outputs.Examples of these trends can be seen in Table 1 and App.B.
Analysis Segregated by Syntactic Structures: In Fig. 3(d) we observe that while some syntactic structures are more favorable to advantaged groups some other are more favorable to disadvantaged groups.This can be observed by the difference in the average regard gap plots.Here, the upper(magenta) line (more positive gap) shows outputs being more favorable to man, straight person and white person while the lower(green) line (more M W S G B Wh M 0.00 0.02 0.01 0.31 0.19 0.18 W 0.02 0.00 0.01 0.20 0.09 0.08 S 0.01 0.01 0.00 0.31 0.15 0.14 G 0.29 0.21 0.32 0.00 0.16 0.15 B 0.14 0.07 0.11 0.15 0.00 0.00 Wh 0.14 0.06 0.11 0.14 0.00 0.00 Table 2:  negative/lower gap) shows outputs being more favorable to disadvantaged groups i.e. woman, gay person and black person.We observe that syntactic structures like (ROOT (SINV (LS ) (VP ))), (ROOT (S (LS ) (ADVP ) (VP ) (. ))) and (ROOT (FRAG (WHADJP ) (. ))) that assume that a person is already "well-known" or assumes another positive trait are generally more positive for disadvantaged groups.Another interesting observation is that even for the best prompts, the gap for sexual orientation still isn't negative which could indicate that our LMs are discriminatory towards gay person.

Robust & Rich Analysis
To verify the robustness of our approach we calculate 2 values.For the first, we randomly sample 10 syntactic structures and calculate the average regard score for each demographic group.This gives us a 6 dimensional vector for each syntactic structure.Then we calculate the average pairwise cosine similarity between these ten 6-dim vectors.This gives us an estimate of how similar the bias evaluation results are when a fixed syntactic structure is used.For the second, we randomly split the 100 syntactic structures into 2 halves.For each of the 2 splits, we get the average regard scores for each demographic group.This gives us two 6 dimensional vectors between which we calculate the cosine similarity.We perform 10 such random splits and find the average cosine similarity.This gives us an estimate of how similar the bias evaluation results are when an ensemble of syntactic structures are used.
The first value comes out to be 0.587 and the second is 0.998 resulting in an increase in similarity in fairness conclusions by 99.9%.This shows that the bias evaluation results do not change when different syntactic structures are used as opposed to when only a single is used.Hence, our methodology is more robust than past works.Our automatically generated syntactically-rich prompts also enable us to perform a syntactically-segregated rich analysis which is not possible using limited hand-crafted prompts and gives a lot more insight.We are able to analyze which prompts are more toxic and which syntactic structures reverse general trends of gap.

Conclusion
In this work we present a robust methodology for a rich demographic bias evaluation in NLG systems using syntactically diverse prompts by paraphrasing.We perform an individual and pairwise analysis over the demographic groups in an aggregated and syntactically-segregated manner.Our results show that the overall trends are the same across demographic groups but we find that some syntactic structures lead to contradictory results.We find that some syntactic structures consistently generate more toxic content towards all demographic groups while others are positive for all.Some syntactic structures have a negative regard gap and are more favorable to disadvantaged groups while some are favorable to advantaged groups.This shows that bias analysis using fixed and limited hand-crafted prompts is not robust to paraphrased prompts and does not provide rich insights.A more robust and syntactically-diverse setting is required to evaluate fairness in NLG systems.

Limitations
We acknowledge that although our work builds a robust and rich methodology for demographic bias analysis in NLG systems, there are certain limitations associated with our work.Firstly, although we perform a human evaluation of the regard classifier on a randomly selected portion of our samples, the accuracy of regard classifier is not perfect and there could be some error in predicting the regard polarity for harder texts.Another limitation of our work is that we define regard gap in a binary manner i.e. male v/s female, black person v/s white person and gay person v/s straight person; we acknowledge the limitation of not using other demographic groups in our analysis methodology.A possible future direction of our work could include other demographic group categories.Lastly, we acknowledge that although we only use 100 syntactic structures for our analysis, there could be many more.Future work could include more syntactic structures and more random seeds using our analysis methodology.

Ethical Considerations
We acknowledge that although we take a step in the direction of fair NLG systems, there still are certain ethical concerns associated with our work.Firstly, we acknowledge the ethical consideration associated with the error propagation of the regard classifier.We also acknowledge the ethical consideration of not using other genders, sexual orientations and races in our analysis.Our paper focuses more on building the methodology from the past works for a robust bias analysis.Future work could include other demographic group categories for analysis using our methodology.Lastly, we acknowledge that there could be some bias however minimal associated with paraphrasing the input prompts which could further propagate the bias.

A Regard Classifier Manual Check
We perform a human evaluation over 100 randomly selected NLG outputs from GPT2 to evaluate the performance of the classifier.The subjects are shown the generated output and the regard score predicted by the classifier and ask them if they think the score is correct.We obtain an average accuracy of 82.67% with an inter-annotator agreement (Fleiss Kappa) of 0.23.Since we obtain an accuracy of more than 80% we move forward with using the regard classifier for our analysis.

B Qualitative Analysis: Examples of Generated Text
Table 3 shows the qualitative examples with paraphrased prompts following the same trends (upper block) and contradictory trends (lower block) as compared to the past research.On performing a more fine-grained qualitative analysis, we observe that the frequency of the word "beautiful" is high in female outputs where the generated output talks about the physical appearance of the woman which have a positive regard but a stereotypical connotation to it.We also observe that the black person and white person outputs are almost similarly negative where we observe a higher frequency of words like "racist" and "supremacy" in white person generated outputs.Even though both are negative, the content for black person is much more harmful than that of white person.We observe that prompts that are more assertive in nature, assume a positive trait or are self-identifications generate more positive content.While prompts that generalise to the "people" instead of "person" generate outputs that are more stereotypical and negative.Table 4 shows some of the neutral examples regardless of the tone of the prompt.The generated text is deemed as neutral when either the generated text has some unsure statement or some state of being or anything else neutral.

Figure 1 :
Figure 1: Our robust NLG Bias Evaluation Method

Figure 2 :
Figure 2: (b) distribution of regard scores across demographics for text generated using different syntactic structure, seeds, prompt types.(a) with a single syntactic structure as in past works.

Figure 3 :
Figure 3: Top row: individual analysis; bottom row: pairwise analysis.(a) Aggregated results (b) Segregated results for best and worst 10 syntactic structures (c)Pairwise Aggregated Analysis: Average regard gap.(d)Pairwise segregated analysis: Average regard gap for best and worst syntactic structures.

Table 1 :
The upper block shows generated outputs that follow the same trend as past works.The lower block shows contradictory results from previous works.For neutral and more examples refer to Appendix B Analysis Segregated By Syntactic Structures.Third, we repeat the practice in these two steps without averaging across different syntactic structures and aim to answer the question of which syntactic structure may lead to a bigger gap between 2 Second, we use probability distribution of regard scores to calculate the pairwise KL divergence for all demographic groups.