Lawyers are Dishonest? Quantifying Representational Harms in Commonsense Knowledge Resources

Warning: this paper contains content that may be offensive or upsetting. Commonsense knowledge bases (CSKBs) are increasingly used for various natural language processing tasks. Since CSKBs are mostly human-generated and may reflect societal biases, it is important to ensure that such biases are not conflated with the notion of commonsense. Here we focus on two widely used CSKBs, ConceptNet and GenericsKB, and establish the presence of bias in the form of two types of representational harms: overgeneralization of polarized perceptions and representation disparity across different demographic groups. Next, we find similar representational harms in downstream models that use ConceptNet. Finally, we propose a filtering-based approach for mitigating such harms and observe that it reduces these issues in both resources and models, but at the cost of a performance drop, leaving room for future work to build fairer and stronger commonsense models.


Introduction
Commonsense knowledge is important for a wide range of natural language processing (NLP) tasks as a way to incorporate information about everyday situations necessary for human language understanding. Numerous models have included knowledge resources such as ConceptNet (Speer et al., 2017) for question answering (Lin et al., 2019), sarcasm generation (Chakrabarty et al., 2020), and dialogue response generation (Zhou et al., 2018, 2021), among others. However, commonsense knowledge resources are mostly human-generated, either crowdsourced from the public (Sap et al., 2019) or crawled from massive web corpora (Bhakthavatsalam et al., 2020). For example, ConceptNet originated from the Open Mind Common Sense project, which collects commonsense statements online from web users (Singh et al., 2002), and GenericsKB consists of text crawled from public websites. One issue with this approach is that crowdsourcing workers and web page writers may conflate their own prejudices with the notion of commonsense. For instance, we have found that querying ConceptNet for some target words such as "church" results in biased triples, as shown in Table 1.

The potentially biased nature of commonsense knowledge bases (CSKBs), given their increasing popularity, raises the urgent need to quantify biases both in the knowledge resources and in the downstream models that use these resources. We present the first study measuring bias in two large CSKBs: ConceptNet (Speer et al., 2017), the most widely used knowledge graph in commonsense reasoning tasks, and GenericsKB (Bhakthavatsalam et al., 2020), which expresses knowledge in the form of natural language sentences and has gained increasing usage. We formalize a new quantification of "representational harms," i.e., how social groups (referred to as "targets") are perceived in the context of CSKBs (Barocas et al., 2017; Blodgett et al., 2020).
We consider two types of such harms in the context of CSKBs. One is intra-target overgeneralization, where "common sense" in these resources may unfairly attribute a polarized (negative or positive) characteristic to all members of a target class, e.g., "lawyers are dishonest." The other is inter-target disparity, occurring when targets have significantly different coverage in the CSKB, in terms of both the number of statements about the targets (e.g., "Persian" might have far fewer commonsense statements than "British") and the perception toward the targets ("islam" might have more negative commonsense statements than "christian").
We propose a quantification of overgeneralization and disparity in CSKBs using two proxy measures of polarized perceptions: sentiment and regard (Sheng et al., 2019). Applying the proposed metrics of bias to ConceptNet and GenericsKB, we find harmful overgeneralizations of both negative and positive perceptions over many target groups, indicating that human biases have been conflated with "common sense" in these resources. We find severe disparities across the targets in demographic categories such as professions and genders, both in the number of statements and the polarized perceptions about the targets.
We then examine two generative downstream tasks and corresponding models that use ConceptNet. Specifically, we focus on automatic knowledge graph construction and story generation and quantify biases in COMeT (Bosselut et al., 2019) and CommonsenseStoryGen (CSG) (Guan et al., 2020). We find that these models also contain the harmful overgeneralizations and disparities found in ConceptNet. We then design a simple mitigation method that filters unwanted triples from ConceptNet according to our measures. We retrain COMeT on the filtered ConceptNet and show that our proposed mitigation approach helps reduce both overgeneralization and disparity in the COMeT model, but leads to a performance drop in the quality of the generated triples according to human evaluations. We open-source our data and prompts for evaluating biases in commonsense resources and models for future work.

Representational Harms
Representational harms occur "when systems reinforce the subordination of some groups along the lines of identity" and can be further categorized into stereotyping, recognition, denigration, and underrepresentation (Barocas et al., 2017; Crawford, 2017; Blodgett et al., 2020). This work aims to formalize representational harms specifically for a set of statements about target groups (Nadeem et al., 2020); e.g., "lawyer is related to dishonest" is a statement about the group "lawyer." When measuring such harms, we consider the core concept of polarized perceptions: non-neutral views that can take the form of either prejudice, expressing negative views, or favoritism, expressing positive views toward the target perceived in the statement (Mehrabi et al., 2021).
More formally, let $S = \{s_1, s_2, \ldots, s_n\}$ denote the set of $n$ natural language statements $s_i$ and let $T = \{t_1, t_2, \ldots, t_m\}$ denote the set of $m$ targets such that each $t_j$ appears at least once in $S$. Each statement $s_i$ contains a target $t_j$. We write $s_i^{+/-}(t_j)$ to indicate that the statement expresses a positive or negative perception toward the target $t_j$. We are interested in quantifying the representational harms toward $T$ in the set $S$.

Two Types of Harms
To adapt the definition of representational harms to a sentence set, we define two sub-types of harms, intra-target overgeneralization and inter-target disparity, aiming to cover different categories of representational harms (Barocas et al., 2017; Crawford, 2017). Overgeneralization directly examines whether targets such as "lawyer" or "lady" are perceived positively or negatively in the statements (examples in Table 1), covering categories including stereotyping, denigration, and favoritism. Disparity examines differences across targets in representation (do some targets have fewer associated statements and lower coverage?) and in polarized perceptions (are some targets perceived more positively or negatively than others?).

Intra-target Overgeneralization The ideal sentence set depicting a target group such as "lawyer" or "lady" should show neither favoritism nor prejudice toward that group. Overgeneralization means unfairly attributing a polarized (negative or positive) characteristic to all members of a target group $t_j$. In the context of a sentence set $S$, we define overgeneralization through the polarized sentences in the set regarding the targets. When a statement $s_i$ expresses a negatively polarized view toward a target group, such as "lawyers are dishonest," it demonstrates prejudice toward lawyers; it is an overgeneralized statement because it implies that all lawyers are dishonest. The same logic applies to positively polarized statements such as "British people are brilliant," which constitute favoritism.
To measure the polarization of a sentence $s_i$, we use two approximations: sentiment polarity and regard (Sheng et al., 2019). We define an intra-target overgeneralization score using these measures for every target in $S$. Formally, for each target $t_j$, we collect all statements in $S$ that contain $t_j$ to form a target-specific set $S_{t_j}$; we then apply the sentiment and regard classifiers individually to all statements in $S_{t_j}$ and report the average percentage of statements classified as having positive or negative sentiment or regard for the target $t_j$. The positive and negative overgeneralization bias (favoritism and prejudice) for $t_j$ is quantified as

$$O^{+/-}(S, t_j) = \frac{|S_{t_j}^{+/-}|}{|S_{t_j}|}, \quad (1, 2)$$

where $|S_{t_j}^{+/-}|$ is the number of statements with positive/negative polarization measured by sentiment or regard for the target $t_j$, i.e., $s_i^{+/-}(t_j)$, and $|S_{t_j}|$ is the number of statements in $S$ containing the target $t_j$.

Inter-target Disparity In addition to overgeneralized non-neutral views for each target group, we also study inter-target disparity, i.e., how differently a target $t_j$ is perceived in the set $S$ compared to other targets. We consider two aspects of disparity across $T$ in $S$: 1) representation disparity, defined as the difference in the number of associated statements $|S_{t_j}|$ between targets $t_j \in T$, denoted $D_R(S, T)$; and 2) the difference in the computed overgeneralization bias $O^{+/-}(S, t_j)$ between targets $t_j \in T$, denoted $D_O(S, T)$. Note that compared to $O^{+/-}(S, t_j)$, $D_O(S, T)$ is calculated over the full population of targets, thus measuring inter-target disparity. For both aspects, we measure disparity using variance as follows:

$$D_R(S, T) = \mathbb{E}_{t_j \in T}\!\left[\left(|S_{t_j}| - \overline{|S_t|}\right)^2\right], \quad (3)$$

$$D_O^{+/-}(S, T) = \mathbb{E}_{t_j \in T}\!\left[\left(O^{+/-}(S, t_j) - \overline{O^{+/-}(S, t)}\right)^2\right], \quad (4)$$

where $\overline{|S_t|}$ indicates the average number of statements per target in $T$ and $\overline{O^{+/-}(S, t)}$ is the average overgeneralization bias over targets, "+" for favoritism and "−" for prejudice. The expectation $\mathbb{E}$ is taken over all targets $t_j \in T$.
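To make the metrics concrete, below is a minimal Python sketch of Eqs. (1)-(4), assuming each statement has already been labeled positive/negative/neutral by a sentiment or regard classifier; the toy data and labels are purely illustrative.

```python
# Minimal sketch of the overgeneralization and disparity metrics (Eqs. 1-4).
from collections import defaultdict
from statistics import pvariance

# Hypothetical input: (statement, target, label) tuples, where label is one of
# "positive", "negative", or "neutral" from a sentiment/regard classifier.
labeled = [
    ("lawyers are dishonest", "lawyer", "negative"),
    ("lawyer works in a court", "lawyer", "neutral"),
    ("british people are brilliant", "british", "positive"),
]

def overgeneralization(labeled, polarity):
    """O^{+/-}(S, t_j): fraction of a target's statements with the given polarity."""
    total, polarized = defaultdict(int), defaultdict(int)
    for _, target, label in labeled:
        total[target] += 1
        polarized[target] += (label == polarity)
    return {t: polarized[t] / total[t] for t in total}

def representation_disparity(labeled):
    """D_R(S, T): variance of per-target statement counts (Eq. 3)."""
    counts = defaultdict(int)
    for _, target, _ in labeled:
        counts[target] += 1
    return pvariance(counts.values())

def overgeneralization_disparity(labeled, polarity):
    """D_O^{+/-}(S, T): variance of per-target overgeneralization scores (Eq. 4)."""
    return pvariance(overgeneralization(labeled, polarity).values())

print(overgeneralization(labeled, "negative"))          # {'lawyer': 0.5, 'british': 0.0}
print(representation_disparity(labeled))                # variance over counts {2, 1}
print(overgeneralization_disparity(labeled, "negative"))
```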

Measuring Polarized Perceptions
Prior work (Sheng et al., 2019) demonstrated that sentiment and regard are effective measures of bias (polarized views toward a target group). Although this is still an active area of research, these measures are promising proxies that many works in ethical NLP have used to measure bias. We acknowledge that such proxies remain imperfect and can produce noisy labels. To put this to the test, and to show that these measures can still serve as reliable proxies despite these problems, we conduct studies in this section with human evaluators in the loop, as well as a comparison against a keyword-based approach.

To determine the polarization associated with a statement toward a group, we apply sentiment and regard classifiers to the statement containing the target group and obtain the corresponding labels from each classifier. We then categorize the statement as favoritism, prejudice, or neutral based on the positive, negative, or neutral label obtained from each classifier.

Crowdsourcing Human Labels To validate the quality of these polarity proxies, we conducted a crowdsourcing study to solicit human labels of statement polarity. We asked Amazon Mechanical Turk workers to label statements from GenericsKB (Bhakthavatsalam et al., 2020) and ConceptNet as expressing favoritism, prejudice, or neutrality toward a target group. 3,000 instances were labeled from ConceptNet and more than 1,500 from GenericsKB. The inter-annotator agreement in terms of Fleiss' kappa (Fleiss, 1971) was 0.5007 for GenericsKB and 0.3827 for ConceptNet.

Alignment with Human Labels We compare the human labels with those obtained from the sentiment and regard classifiers to check the validity of these measures as proxies for overgeneralization. As shown in Table 2, we found reasonable agreement in terms of accuracy between the sentiment and regard labels and the human labels. This also confirms previous work (Sheng et al., 2019) in which sentiment and regard were shown to be good proxies for measuring bias.

Table 3: Comparison of sentiment, regard, and a baseline keyword-based approach in terms of favoritism and prejudice recall/precision/F1 scores.
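The sketch below illustrates this categorization, assuming the vaderSentiment package and the ±0.05 compound-score cut-offs described in Appendix D; the regard classifier (a fine-tuned BERT model from Sheng et al. (2019)) is not reproduced here, but its labels would be mapped through the same function.

```python
# Sketch of mapping classifier labels to favoritism / prejudice / neutral.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def sentiment_label(statement, threshold=0.05):
    # VADER compound score lies in [-1, 1]; +/-0.05 cut-offs per Gilbert and Hutto (2014).
    score = analyzer.polarity_scores(statement)["compound"]
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

def to_perception(label):
    # Map a positive/negative/neutral classifier label to the paper's categories;
    # the same mapping applies to labels from a regard classifier.
    return {"positive": "favoritism", "negative": "prejudice"}.get(label, "neutral")

print(to_perception(sentiment_label("lawyers are dishonest")))  # prejudice
```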

Comparison with Keyword-based Approach
We also compare the sentiment and regard classifiers to a keyword-based baseline, in which we collect a list of polarized words that could represent favoritism and prejudice from LIWC (Tausczik and Pennebaker, 2010) and Empath (Fast et al., 2016). This method labels statements from ConceptNet and GenericsKB as positively/negatively overgeneralized if they contain words from the keyword list. As shown in Table 3, this method has significantly lower recall and overall F1 than the sentiment and regard measures in identifying favoritism and prejudice.
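A minimal sketch of this baseline follows; the word lists here are small illustrative stand-ins for the LIWC and Empath categories actually used.

```python
# Keyword-matching baseline: label a statement by lexicon lookup only.
POSITIVE_WORDS = {"brilliant", "beautiful", "honest", "smart"}
NEGATIVE_WORDS = {"dishonest", "ugly", "lazy", "terror"}

def keyword_label(statement):
    tokens = set(statement.lower().split())
    if tokens & NEGATIVE_WORDS:
        return "prejudice"
    if tokens & POSITIVE_WORDS:
        return "favoritism"
    return "neutral"

# Misses polarized statements containing no listed keyword,
# which is why recall is low in Table 3.
print(keyword_label("lawyers are dishonest"))          # prejudice
print(keyword_label("lawyers never tell the truth"))   # neutral (false negative)
```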

Representational Harms in CSKBs
Our formalization of representational harms is defined over statements. GenericsKB uses naturally occurring sentences, so our measures (in Sec. 2) directly apply. For ConceptNet, we convert each KB triple containing the targets t j ∈ T into a natural language statement, as detailed in the following sub-sections.

Data Preparation
Selection of Target Groups Let G be the graph of the CSKB, consisting of commonsense assertions in the form of triples $(s, r, o)$ representing subject-relation-object. Our study focuses on triples whose subject $s$ or object $o$ is a member $t_j \in T$. We first collect a list of targets $t_j$ from the StereoSet dataset (Nadeem et al., 2020). The collected targets are organized into 4 categories: origin, gender, religion, and profession. We renamed the "race" category from Nadeem et al. (2020) to "origin" to be more precise, as words such as "British" may not necessarily represent a race but rather the origin or nationality of a person. These 4 categories together contain 321 target words. We further include some additional targets missing from Nadeem et al. (2020), such as "Armenian," resulting in a total of 329 targets (see Appendix Tables 14-15 for the full list).

Collection of CSKB Triples
We collect all triples from ConceptNet 5.7 that contain the target words in each category, resulting in more than 100k triples. For GenericsKB, we use the GenericsKB-Best set (Bhakthavatsalam et al., 2020), which contains filtered, high-quality sentences, and extract the sentences that have one of our target words as their annotated topic, resulting in around 30k statements (sentences). A minimal sketch of the ConceptNet collection step follows.
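This sketch assumes the tab-separated fields of the public ConceptNet assertions dump (assertion URI, relation, start concept, end concept, JSON metadata); the target list is abbreviated and illustrative.

```python
# Sketch: collect English triples whose subject or object mentions a target word.
import csv

TARGETS = {"lawyer", "british", "muslim"}  # abbreviated; the paper uses 329 targets

def concept_word(uri):
    # e.g. "/c/en/lawyer" -> "lawyer"; non-English concepts return None.
    parts = uri.split("/")
    return parts[3].replace("_", " ") if len(parts) > 3 and parts[2] == "en" else None

def collect_triples(path):
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            _, rel, start, end = row[:4]
            s, o = concept_word(start), concept_word(end)
            if s in TARGETS or o in TARGETS:
                yield s, rel.split("/")[-1], o  # e.g. ("lawyer", "RelatedTo", "dishonest")
```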

Converting Triples to Statement Sentences
We convert every triple to a sentence by mapping the relation r to its natural language form and concatenating s, the relation phrase, and o to form a statement $s_i$ in $S$. To convert triples from ConceptNet into natural sentences, we use the same mapping as COMeT (Bosselut et al., 2019), covering all standard ConceptNet relation types plus the "InstanceOf" relation. For instance, the triple (American, IsA, citizen_of_America), which contains the target "American," is converted to the sentence "American is a citizen of America". We then apply the sentiment and regard classifiers to obtain labels for these statements that measure polarized perceptions.

Quantifying Harms During classification with the sentiment and regard classifiers, we mask all demographic information in the sentences to prevent biases in the classifiers themselves from affecting our analysis. We obtain sentiment labels for the masked sentences using the VADER sentiment analysis tool (Gilbert and Hutto, 2014), a rule-based sentiment analyzer. For regard, we use the fine-tuned BERT model from Sheng et al. (2019). After obtaining the labels, we use Eq. (1) and (2) to measure overgeneralization and Eq. (4) for disparity in overgeneralization.
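A minimal sketch of the conversion and masking steps follows; the relation mapping is abbreviated and only in the style of COMeT's templates, and the mask token is a hypothetical choice rather than the exact one used in our experiments.

```python
# Sketch: triple-to-sentence conversion plus demographic masking.
RELATION_PHRASES = {            # abbreviated; the paper covers all 34 relations
    "IsA": "is a",
    "RelatedTo": "is related to",
    "CapableOf": "is capable of",
    "InstanceOf": "is an instance of",
}

def triple_to_sentence(subj, rel, obj):
    return f"{subj} {RELATION_PHRASES[rel]} {obj}".replace("_", " ")

def mask_target(sentence, target, mask="person"):
    # Mask the demographic mention so classifier biases about the target word
    # itself do not leak into the polarity labels.
    return sentence.replace(target, mask)

s = triple_to_sentence("American", "IsA", "citizen_of_America")
print(s)                           # American is a citizen of America
print(mask_target(s, "American"))  # person is a citizen of America
```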

Analysis of Representational Harms
Results on Overgeneralization We quantify overgeneralization using Eq. (1) and (2). The overall percentage of polarized (favoritism or prejudice) triples in ConceptNet is 4.5% (4.6k triples) for sentiment and 3.4% (3.6k triples) for regard. For GenericsKB, the percentages are 36.5% for sentiment (11k sentences) and 38.6% for regard (11k sentences). Both KBs contain sentences expressing polarized perceptions of either favoritism or prejudice, and of the two, GenericsKB has a much higher rate.

Figure 1: Box plots of negative and positive regard/sentiment percentages for targets in the 4 categories for both CSKBs. Outlier target groups with high regard and sentiment percentages show the severity of the overgeneralization issue; the span of the boxes indicates large variation/disparity in the number of negative or positive triples for groups in the same category.

Figure 2: Examples of targets from the "Profession" and "Religion" categories from Nadeem et al. (2020), labeled by the regard measure. Regions indicate favoritism, prejudice, both prejudice and favoritism, and negligible bias. Higher negative regard percentages indicate prejudice-leaning and higher positive regard percentages indicate favoritism-leaning. We also compare ConceptNet and GenericsKB on the "Religion" category and find similar polarized perceptions of certain groups, despite a much larger percentage range for GenericsKB.
Taking a closer look, Figure 1 presents box plots of negative and positive regard/sentiment percentages for targets in the 4 categories for both CSKBs. The presence of outliers in these plots is a testament to the fact that targets can be harmed through overgeneralization: their sentiment and regard percentages span up to 30% for positive sentiment in ConceptNet and 80% in GenericsKB, and up to 17% for negative regard in ConceptNet and 100% in GenericsKB. We again find qualitatively similar trends of representational harms across the two KBs, such as the box shapes for the "Gender" and "Religion" categories, indicating common biases in knowledge resources. Echoing our earlier findings on overgeneralization rates, the scales of biased percentages are much higher in GenericsKB than in ConceptNet.
Regions of Overgeneralization By plotting the negative and positive regard percentages for each target along the x and y coordinates, Figure 2 demonstrates the issue of overgeneralization in different categories. For example, in "Profession," some target professions such as "CEO" are associated with a higher positive regard percentage (blue region) and thus higher overgeneralization in terms of favoritism. In contrast, professions such as "politician" are associated with a higher negative regard percentage (red region), representing higher overgeneralization in terms of prejudice. In addition, some professions, such as "psychologist," are associated with both high negative and high positive regard percentages (purple region), i.e., high positive and negative overgeneralization.
Figure 3: Four bar plots, one from each category, each demonstrating a certain aspect of bias: in the "Origin" category we observe extreme overgeneralization toward "british"; in the "Gender" category both target groups are overgeneralized; in "Religion" we observe extreme prejudice toward "muslim"; and in "Profession" extreme favoritism toward "teacher". Each case is accompanied by examples of negative and positive associations detected by sentiment, e.g., "Clever dick has the context british." vs. "Brilliance has the context british."; "Cynicism is related to greek." vs. "Greek is related to merry."; "Cock slave is related to man." vs. "Man is related to integrity."; "Slut is related to woman." vs. "Beauty is related to woman."; "Saffron terror is related to hindu."; "War on terrorism is related to muslim." and "Muslim is a creationist."; "Teacher is capable of subject student to humiliation." vs. "Teacher causes the desire to study."; and "Ugly american is related to businessperson."

Figure 4: Box plots demonstrating the representation disparity in terms of the number of triples/sentences for the "Gender" and "Profession" categories from ConceptNet and GenericsKB. We find similarly severe disparities in the two KBs, with the number of sentences ranging much more widely for GenericsKB than for ConceptNet.

ConceptNet vs GenericsKB We compare ConceptNet and GenericsKB on the "Religion" category and see that certain targets carry similar biases in both KBs, e.g., "christian" shows both favoritism and prejudice, and "sharia" is prejudiced against in both. We also find interesting discrepancies between the two KBs: GenericsKB's overall percentages of positive and negative bias are much higher than ConceptNet's, as indicated by the axis scales (0-60% for GenericsKB vs. 0-16% for ConceptNet). This aligns with our finding that GenericsKB has a higher rate of overgeneralization. Figure 3 further demonstrates how severe the problem of overgeneralization can be, along with concrete examples. For instance, in the "Origin" category, "british" is overgeneralized because the bar plot shows high values for both positive (blue) and negative (red) sentiment. In the "Profession" category, we see favoritism toward "teacher" because the bar plot shows high values for positive (blue) sentiment. And in the "Religion" category, the high negative sentiment percentage for "muslim" illustrates the severity of prejudice toward that target.

Representation Disparity
We first quantify the disparity in the number of triples for each target word in the 4 categories, using Eq. (3). Table 4 shows extremely high variance in both CSKBs. Figure 4 shows box plots of the number of triples available in ConceptNet and sentences in GenericsKB for different targets within two categories. The counts range from zero to thousands of triples across targets in the two KBs, and GenericsKB has more severe outliers, with some targets having as many as around 6k sentences. We also include sample bar plots for some of the targets within each category separately to highlight the existing disparities among them.

Overgeneralization Disparity
We further analyze the disparities among targets in terms of overgeneralization (favoritism and prejudice perceptions measured by sentiment and regard) using Eq. (4), shown in Table 4. We find that GenericsKB has much higher variance than ConceptNet. To better illustrate the disparity, the box plots in Figure 1 show the variation of overgeneralization across different groups for the 4 categories. These plots show the dispersion of negative sentiment/regard percentages, which represent prejudice against targets, as well as positive sentiment/regard percentages for favoritism toward targets. We can observe that targets such as "muslim" (shown in Figure 3) may be perceived negatively significantly more often than others. The same trend holds for positive sentiment and regard scores. Figure 2 also shows qualitatively that the targets are not clustered together with similar negative and positive regard percentages, but rather spread across different regions.

CSKB Completion
As a popular downstream application, we first consider the task of commonsense knowledge base completion, which aims to automatically augment a CSKB with generated facts (Li et al., 2016). We focus our analysis on the COMeT model (Bosselut et al., 2019), built by fine-tuning a pre-trained GPT model (Radford et al., 2018) on ConceptNet triples. COMeT has been shown to generate unseen commonsense knowledge in ConceptNet with high quality, and much recent work has used it to provide commonsense background knowledge (Shwartz et al., 2020; Chakrabarty et al., 2020).
Data We collect statements from COMeT as follows: we input the same target words used for ConceptNet as prompts and collect triples for all relations available in the model. Specifically, we collect the top 10 beam-search generations for each of the 34 relations COMeT learned from ConceptNet. We generate triples for all the targets we consider, resulting in 112k statements; the triples are converted to sentences and the target words masked using the same process as for ConceptNet. An illustrative sketch of such a collection loop follows.
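This sketch is written against the Hugging Face generation API as a stand-in for the COMeT repository's own decoding scripts; the checkpoint name, prompt format, and relation list are assumptions, not the repository's exact interface.

```python
# Illustrative collection loop: top-k beam-search generations per (target, relation).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder for a COMeT checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")

RELATIONS = ["IsA", "CapableOf", "Causes"]          # abbreviated; COMeT has 34

def generate_objects(target, relation, beams=10):
    prompt = f"{target} {relation}"                 # assumed prompt format
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs["input_ids"],
        num_beams=beams,
        num_return_sequences=beams,                 # top 10 beams per relation
        max_new_tokens=10,
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

statements = {
    (t, r): generate_objects(t, r) for t in ["lawyer", "teacher"] for r in RELATIONS
}
```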
Overgeneralization From the results of the analysis on statements generated by COMeT, one can observe that the overgeneralization issue persists in the generated statements. For instance, for the "Religion" category, the mean negative regard is approximately 25%, illustrating prejudice toward the targets in that category. In addition, sentiment scores as high as 50% for some targets in some categories show the severity of the overgeneralization bias. Additional qualitative examples are included in Table 5.

Table 5: Examples of biased triples generated by COMeT: (mexican, Causes, poverty); (muslim, Causes, terrorism); (policeman, Causes, death); (brother, CapableOf, be mother fxxker).

Disparity in Overgeneralization Note that COMeT does not have the data imbalance problem: since COMeT is a generative model, we generate an equal number of statements for each target, so disparity in the number of triples is not an issue for this task. However, the disparity in overgeneralization remains. For instance, the COMeT results in Figure 5 show that variance exists in both the regard and sentiment measures, indicating disparity in overgeneralization: some targets are still extremely favored or disfavored, according to their regard and sentiment percentages, compared to other targets.

Figure 5: Negative and positive sentiment and regard results from COMeT and CSG.

Story Generation

Data To analyze bias in the story output of CSG, we prompt the CSG model using sentences about the social perception of a certain target. We split our targets into four groups: people, locations, professions, and others. Next, we manually write 30 templates inspired by the prefix templates for bias in NLG (Sheng et al., 2019); examples are listed in Table 6. We then generate prompts by filling the corresponding templates with target names (a minimal sketch follows Table 6), resulting in around 3k prompts for CSG. CSG generates a total of 12k sentences, and we calculate regard and sentiment percentages over all the generated sentences for a given story.

Table 6: Example prompt templates for story generation for different targets, inspired by Sheng et al. (2019). XYZ is replaced with the target name.
People: "XYZ was asked about what did they work as." ; "People are discussing about XYZ."
Locations: "People are discussing about people from XYZ." ; "People are discussing about traveling to XYZ."
Professions: "People are talking about what they think of XYZ." ; "The XYZ was asked about what was they regarded as."
Others: "People are discussing about XYZ." ; "People are talking about what they think of XYZ."
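The sketch below shows the prompt construction from the templates in Table 6, with abbreviated template and target lists standing in for the full sets.

```python
# Sketch: fill the Table 6 templates with target names to build CSG prompts.
TEMPLATES = {
    "people": ["XYZ was asked about what did they work as.",
               "People are discussing about XYZ."],
    "locations": ["People are discussing about people from XYZ.",
                  "People are discussing about traveling to XYZ."],
}
TARGETS = {
    "people": ["lawyer", "teacher"],       # abbreviated, illustrative target lists
    "locations": ["Germany", "Ukraine"],
}

prompts = [
    template.replace("XYZ", target)
    for group, templates in TEMPLATES.items()
    for template in templates
    for target in TARGETS[group]
]
print(prompts[0])  # lawyer was asked about what did they work as.
```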
Overgeneralization From Figure 5, we observe similar patterns: the overgeneralization issue persists. For instance, as shown in Figure 5, categories like "Religion" span up to 60% negative associations in terms of regard and sentiment scores.

Disparity in Overgeneralization
As with the COMeT analysis, since we generate an equal number of statements for each target in this task, we do not observe disparity in the number of statements as we did with ConceptNet. However, as illustrated in Figure 5, the disparity in overgeneralization remains problematic. For instance, in Figure 5 the disparity in the "Religion" category for negative sentiment spans from 0% to 60%. In addition, the "Origin" category for the CSG task has a significant spread, similar to other categories such as "Religion" and "Gender".

Bias Mitigation on CSKB Completion
To mitigate the observed representational harms in ConceptNet and their effects on downstream tasks, we propose a pre-processing data filtering technique that reduces the effect of existing representational harms in ConceptNet. We apply our mitigation technique to COMeT as a case study.

Mitigation Approach Our pre-processing technique relies on data filtering. The ConceptNet triples are first passed through the regard and sentiment classifiers and are included in the training data of the downstream task only if they contain no representational harms under our regard and sentiment measures. In other words, all triples assigned a positive or negative label by the regard or sentiment classifiers are filtered out, and only triples labeled neutral are used (a minimal sketch follows).
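This sketch is parameterized by the labeling helpers sketched in Section 2.2 (`sentiment_label`, plus a stand-in `regard_label` for the fine-tuned BERT regard classifier) and a triple-to-sentence converter such as `triple_to_sentence` above.

```python
# Sketch of the pre-processing filter: keep only triples whose sentence form
# is labeled neutral by both the sentiment and the regard classifier.
def filter_triples(triples, to_sentence, sentiment_label, regard_label):
    kept = []
    for subj, rel, obj in triples:
        sentence = to_sentence(subj, rel, obj)
        if sentiment_label(sentence) == "neutral" and regard_label(sentence) == "neutral":
            kept.append((subj, rel, obj))
    return kept

# e.g., filtered = filter_triples(conceptnet_triples, triple_to_sentence,
#                                 sentiment_label, regard_label)
```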

Results on Overgeneralization
To measure the effectiveness of mitigation against overgeneralization, we track the overall mean fraction of neutral triples, since an increase indicates reduced favoritism and prejudice according to the sentiment and regard measures. We report the effect on overgeneralization as the Neutral Sentiment Mean (NSM) for sentiment and the Neutral Regard Mean (NRM) for regard. As demonstrated in Table 7, by increasing the overall neutral sentiment and regard means, our filtered model reduces the unwanted positive and negative associations and thereby the overgeneralization issue.

Results on Disparity in Overgeneralization
To measure the effectiveness of mitigation against disparity in overgeneralization, we consider the reduction in variance across targets. We report the disparity in overgeneralization as the Neutral Sentiment Variance (NSV) for sentiment and the Neutral Regard Variance (NRV) for regard. As shown in Table 7, our filtering technique reduces the variance and disparities among targets relative to the standard COMeT model in terms of both regard and sentiment.
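The sketch below shows how NSM and NSV (and analogously NRM and NRV) can be computed from per-target labels; the data is illustrative.

```python
# Sketch: Neutral Sentiment Mean (NSM) and Neutral Sentiment Variance (NSV).
from collections import defaultdict
from statistics import mean, pvariance

# Same (statement, target, label) shape as the earlier sketch.
labeled = [
    ("lawyer works in a court", "lawyer", "neutral"),
    ("lawyers are dishonest", "lawyer", "negative"),
    ("british people are brilliant", "british", "positive"),
    ("british people live in britain", "british", "neutral"),
]

def neutral_fraction_per_target(labeled):
    total, neutral = defaultdict(int), defaultdict(int)
    for _, target, label in labeled:
        total[target] += 1
        neutral[target] += (label == "neutral")
    return [neutral[t] / total[t] for t in total]

fracs = neutral_fraction_per_target(labeled)
nsm, nsv = mean(fracs), pvariance(fracs)  # filtering should raise NSM and lower NSV
print(nsm, nsv)
```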

Human Evaluation of Mitigation Results
In addition to reporting regard and sentiment scores, we perform a human evaluation on 3,000 generated triples from the standard COMeT and COMeT-Filtered models on Amazon Mechanical Turk, assessing both the quality of the generated triples and their bias from a human perspective. From the results in Table 7, one can observe that COMeT-Filtered is judged to cause less overall overgeneralization harm: humans rated more of its triples as neutral, containing neither negative nor positive associations. This is reported as the Human Neutral Mean (HNM) in Table 7. However, this comes with a quality trade-off: COMeT-Filtered is rated lower than standard COMeT in terms of the validity of its triples. We encourage future work to improve generation quality. In addition, we measure inter-annotator agreement and report Fleiss' kappa scores (Fleiss, 1971) of 0.4788 and 0.6407 for the quality and representational-harm ratings, respectively, for the standard COMeT model, and 0.4983 and 0.6498 for COMeT-Filtered.
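A minimal sketch of the agreement computation, assuming statsmodels; `ratings` is a hypothetical items-by-raters matrix of category indices from the three workers, not our actual annotation data.

```python
# Sketch: Fleiss' kappa over per-item category counts.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = items, columns = raters; 0 = neutral, 1 = favoritism, 2 = prejudice.
ratings = np.array([[0, 0, 2],
                    [2, 2, 2],
                    [1, 0, 1]])

table, _ = aggregate_raters(ratings)  # convert to per-item category counts
print(fleiss_kappa(table))
```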

Related Work
Work on fairness in NLP has expanded to different applications and domains, including coreference resolution (Zhao et al., 2018a), named entity recognition (Mehrabi et al., 2020), machine translation (Font and Costa-jussà, 2019), and word embeddings (Bolukbasi et al., 2016; Zhao et al., 2018b; Zhou et al., 2019), as well as surveys (Sun et al., 2019; Blodgett et al., 2020; Mehrabi et al., 2021). Despite this extensive research, little attention has been given to representational harms in tools and models used for commonsense reasoning. Injecting commonsense knowledge into NLP tasks is gaining attention (Storks et al., 2019; Chang et al., 2020). In our work, we study two downstream tasks in this area and show how they are affected by existing biases in upstream commonsense knowledge resources like ConceptNet. Although Sweeney and Najafian (2019) have previously shown that ConceptNet word embeddings (Speer, 2017) are less biased than other embeddings, we demonstrate that harmful biases still exist in ConceptNet and need to be carefully studied.

Conclusion
Incorporating commonsense knowledge into models is becoming a popular trend, as it is important for models to mimic how humans utilize commonsense knowledge when performing different tasks. One danger of mimicking humans is adopting their biases. We performed a study analyzing existing representational harms in two commonsense knowledge resources and their effects on downstream tasks and models. We analyzed two harms, overgeneralization and disparity, using sentiment and regard as measures. In addition, we introduced a pre-processing mitigation technique and evaluated it using our measures as well as human evaluations. Future directions include designing more effective mitigation techniques that do not harm model quality.

Ethics and Broader Impact
This work primarily advocates for more ethical commonsense reasoning resources and models. In the near future, there will likely be more efforts to incorporate commonsense into NLP models. Conflating human biases with commonsense is harmful. Thus, pointing out the existing problems and proposing simple solutions can have a significant and broad impact on the community. We acknowledge that our paper contains disturbing content, but these egregious examples are representative of the knowledge supplied to NLP models. Our goal is not to devalue any work or any target group, but to raise awareness of these problems in the AI community. We also acknowledge that we do not cover all possible target groups in each category, such as non-binary gender groups; however, we incorporated groups from Nadeem et al. (2020) and made extensions to fill gaps in these groups. Additionally, we considered ethical aspects throughout our studies. For instance, when running Mechanical Turk experiments, we made sure to warn workers about the potentially offensive content of our task and to pay them a reasonable amount for their work (around $11 per hour, well above the minimum wage). We hope that our material will help the research community treat these problems as serious issues and work toward addressing them in a more rigorous fashion.

A Qualitative Examples
We include additional material in the appendix, both more qualitative analysis and detailed experimental results that could not fit in the main text due to space limitations. For instance, Table 9 includes additional qualitative results demonstrating harmful triples in ConceptNet, as well as examples from the COMeT model.

B Mitigation Framework
We provide a visual overview of our mitigation framework in Figure 7, along with detailed results comparing COMeT and COMeT-Filtered over different categories.

C Human Evaluation
COMeT vs Filtered-COMeT For human evaluations, we sample the top 3 generated triples for each of the "CapableOf", "Causes", and "HasProperty" relations for all the groups in each category, resulting in around 1,000 triples per model, and ask three Mechanical Turk workers to rate each triple for quality (whether it is valid commonsense knowledge) and bias (whether it shows favoritism, prejudice, or is neutral toward the demographic group). This gave us around 3,000 ratings per model (around 6,000 in total). Figure 10 shows a sample survey from the Amazon Mechanical Turk platform. We also recorded inter-annotator agreement with Fleiss' kappa, reported in the main text; these numbers indicate reasonable agreement. Notably, annotators agreed more on the bias ratings than on the quality ratings, and bias reduction is the main strength of our COMeT-Filtered model. While it is easier for annotators to judge whether something is biased, it may be harder to judge the quality of generated commonsense. That said, the agreement is reasonable and acceptable for both tasks.

Table 8: Agreement of regard and sentiment labels with human labels on COMeT and COMeT-Filtered triples. The higher the percentage, the more closely the measure agrees with humans' perception of bias, and the better it serves as a proxy for measuring bias.

COMeT, Regard: 71.6%
COMeT, Sentiment: 59.2%
COMeT-Filtered, Regard: 72.8%
COMeT-Filtered, Sentiment: 54.7%
ConceptNet vs GenericsKB For this task, we also asked three Mechanical Turk workers to rate 1,000 instances from ConceptNet and more than 500 instances from GenericsKB. The statements were chosen randomly, and we made sure that each type (favoritism, prejudice, and neutral) was well represented.

D Experimental Details
Sentiment Analysis For sentiment analysis, we used a threshold of greater than or equal to 0.05 for positive sentiment classification and a threshold of less than or equal to -0.05 for negative sentiment classification, as suggested in (Gilbert and Hutto, 2014).

Filtered-COMeT and COMeT We used the same configuration for training Filtered-COMeT as config_0.json in the COMeT repository (https://github.com/atcbosselut/comet-commonsense), from which details for training COMeT can be obtained as well. The train, test, and two dev sets were adopted from the COMeT repository (ConceptNet train100k.txt, test.txt, dev1.txt, and dev2.txt) and augmented according to our filtering approach. Our model is initialized from the pre-trained GPT model with 768 hidden dimensions and 12 layers and heads, the same as COMeT. We trained Filtered-COMeT on an Nvidia GeForce RTX 2080 using the Adam optimizer for 100,000 iterations.

Commonsense Story Generation Experimental details can be found in the CommonsenseStoryGen repository.

Figure 9: Examples of targets and the regions they fall under within each category, considering sentiment as a measure. The corresponding regions are the prejudice, favoritism, and negligible-bias regions.

Figure 10: Example of the survey provided to Mechanical Turk workers for human evaluation. The survey instructions read: "In this task you will rate machine generated commonsense knowledge in terms of quality and bias. More specifically, you will rate sentences that should represent commonsense knowledge. First, you will rate the quality of the sentence based on whether it is valid commonsense knowledge or not (1 indicating valid knowledge and 2 invalid knowledge). Second, you will rate it for the presence of positive bias, negative bias, or lack of bias (neutral) toward the demographic group mentioned in the sentence. For instance, the sentence 'Judge works in court' is valid commonsense knowledge, while 'judge works in school' is invalid commonsense knowledge. In addition, 'Americans are good people' has positive bias toward Americans, while 'Americans are bad people' has negative bias toward Americans. 'Americans live in America' is neutral commonsense knowledge." Workers then rate each item (e.g., "Hispanic causes discrimination") for quality (valid/invalid) and bias (positive bias / neutral / negative bias), and are warned that responses are checked for quality and that low-quality HITs will be rejected.

Sample stories generated by CSG illustrate the outputs we evaluate. For the prompt "People are discussing about people from ukraine.": "we are not sure about the situation. a lot of people are complaining about the way ukraine is going. we are not sure about it. it is interesting to listen." For "People are discussing about people from germany.": "they are very skilled in germany. they can speak german. people are very good at germany. the people have a lot of germany skills." For "People are discussing about hindu.": "i found out a lot of people believe hinduism. i tried to convince people that hinduism was wrong. i tried to talk to people that believe just for the sake of their faith. i ended up getting into a fight with a friend of my that's right."

Figure 12: Wordcloud of phrases that appear in triples with negative regard and sentiment labels for the "british" and "female" targets. We considered the triples in which these words appeared.