Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark

Warning: this paper contains content that may be offensive or upsetting. Among the safety concerns that hinder the deployment of open-domain dialog systems (e.g., offensive languages, biases, and toxic behaviors), social bias presents an insidious challenge. Addressing this challenge requires rigorous analyses and normative reasoning. In this paper, we focus our investigation on social bias measurement to facilitate the development of unbiased dialog systems. We ﬁrst propose a novel D IAL -B IAS F RAMEWORK for analyzing the social bias in conversations using a holistic method beyond bias lexicons or dichotomous annotations. Leveraging the proposed framework, we further introduce the CD IAL -B IAS D ATASET which is, to the best of our knowledge, the ﬁrst annotated Chinese social bias dialog dataset. We also establish a ﬁne-grained dialog bias measurement benchmark, and conduct in-depth analyses to shed light on the utility of detailed annotations in the proposed dataset. Lastly, we evaluate several representative Chinese generative models using our classiﬁers to unveil the presence of social bias in these systems. 1


Introduction
In recent years, significant efforts have been devoted to the development of open-domain dialog systems that are pre-trained on large-scale data to generate responses to user inputs (Freitas et al., 2020;Zhou et al., 2021a;Bao et al., 2021;Thoppilan et al., 2022;Mi et al., 2022). However, neural approaches that underlie these conversational agents may pick up many unsafe features from the large-scale data they train on, e.g., offensive * The first two authors have equal contribution. 1 The proposed dataset and codes are available at: https://github.com/para-zhou/CDial-Bias. and violent languages, social biases, etc. (Dinan et al., 2021;Barikeri et al., 2021;Weidinger et al., 2021;Sun et al., 2022). It is important to note that social biases that convey negative stereotypes or prejudices about specific populations are usually stated in implicit expressions rather than explicit words (Sap et al., 2020;Blodgett et al., 2020), and are therefore difficult to detect. Consequently, undetected biased responses from dialog systems may have an immense negative impact on the wide deployment of dialog systems (Sheng et al., 2021). Therefore, addressing social bias issues in conversational systems is a research problem of great importance.
The problem of social bias detection (Bordia and Bowman, 2019; Cheng et al., 2021) has drawn increasing attention recently. Existing approaches mostly focus on the token or utterance levels (Nadeem et al., 2021;Smith et al., 2022;Jiang et al., 2022). Thus, these approaches cannot easily generalize to detect biased responses in conversations that are highly dependent on the context (Baheti et al., 2021;Sun et al., 2022).
Furthermore, we also contend that social bias detection can not be sufficiently modeled as a binary classification task. It is often difficult to judge the bias attitude contained in a statement due to the subtlety in the expression and the subjective nature of the decision (Sap et al., 2019. Rather than formulating the social bias measurement as a dichotomy problem (Founta et al., 2018;Sun et al., 2022), we consider a detailed analysis and consecutive reasoning framework to guide the annotation process (Sap et al., 2019;Davidson et al., 2019). Such a conceptual framework may lead to a better understanding of why a data entry may be biased (Ribeiro et al., 2016;Blodgett et al., 2020), which may also enhance the model's ability in identifying bias (Sap et al., 2020).
In this paper, we introduce the DIAL-BIAS FRAMEWORK for analyzing social bias in conversations. The framework decomposes the analyses into four sequential steps: identifying (1) contextsensitivity, (2) data type, (3) targeted group, and (4) implicated attitude. In addition, to facilitate research in this field, we develop the CDIAL-BIAS DATASET, a Chinese dialog bias dataset that contains 28k context-response pairs labeled via the proposed framework. The dataset covers four widely-discussed bias topics: Race, Gender, Region, and Occupation. This well-annotated dataset has not only the bias attitude label, but also four auxiliary labels collected through the data crawling and sequential labeling procedure. Furthermore, we establish a fine-grained bias measurement benchmark and conduct comprehensive experiments and in-depth analyses on the CDIAL-BIAS DATASET. We test related off-the-shelf APIs and show that current resources cannot sufficiently handle the social bias issues contained in this dataset. Additionally, we demonstrate that adequately considering the auxiliary labels in the DIAL-BIAS FRAMEWORK is essential for bias identification in dialogs.
The contribution of this work is threefold: • We propose a comprehensive framework, the DIAL-BIAS FRAMEWORK, for understanding social bias in dialogs, encompassing four aspects: context-sensitivity, data type, targeted group, and implied attitude.
• Guided by the DIAL-BIAS FRAMEWORK, we collect and finely annotate the first highquality Chinese dialog bias dataset CDIAL-BIAS DATASET, which covers four popular bias topics.
• Based on the CDIAL-BIAS DATASET, we provide a fine-grained dialog bias measurement benchmark with in-depth empirical analyses. We also establish social bias measurements of representative dialog and language models.

DIAL-BIAS FRAMEWORK
To aid the judgment of social bias in a conversation scenario, we compose a framework that dissects the decision process into four subtasks.
Step 1: Considering Context Sensitivity. Some utterances are self-contained (i.e., Context- Independent) in terms of expressing meaning, while some others are Context-Sensitive. In real-world conversations, there are many contextsensitive responses, that can be interpreted in various ways according to the conversational contexts. Our experimental results in § 4.3 also show the differences between these two types of responses .
Step 2: Judging Data Type. Most bias-related research focuses on the Bias-Expressing (BE) data that state over-generalized judgment towards a certain group. To enrich the study of the biasidentification task, we also include another significant portion of bias-related data: Bias-Discussing (BD). This data is not stereotyping but discussing the phenomenon of "bias", which can have very different expressions from BE data and negatively impact certain populations. Except for these two types of data, expressions that are Irrelevant to the bias topic are also determined and the labeling process would be ended for the Irrelevant data. More detailed data type taxonomy and examples are provided in Appendix A.1.

Taxonomy Definition
Examples Anti-Bias prohibiting bias towards certain groups.  Step 3: Specifying Targeted Group. Identifying which population(s) are the biased statements targeted at, or which group(s) of people may be offended, is essential for bias identification and measurement (Blodgett et al., 2020). We present this information in free text, and it can be used to better understand and identify bias w.r.t. different groups.
Step 4: Inferring Implied Attitude. We observe that there are widespread types of biasrelevant data in human-human conversations, and the bias attitude often goes beyond a yes/no answer. Furthermore, we contend that Anti-Bias opinions that prohibit discrimination or undesired stereotypes (Nadeem et al., 2021) are useful for training more socially responsible systems (Kim et al., 2022) by directing them towards anti-biased responses. Therefore, we extend the bias classification task from a simple dichotomy (biased v.s. unbiased) to a trichotomy (Anti-Bias, Neutral, and Biased). We present detailed definitions and examples in Table 1. Following the above proposed framework, we present two examples in Figure 1. We can interpret Example 1 (upper Figure 1)

Dataset Collection
We introduce the CDIAL-BIAS DATASET, which contains 28k context-response pairs with annotated labels. To the best of our knowledge, this is the first well-annotated Chinese dialog social bias dataset.

Data Source
We crawl and build conversational data related to social bias from a Chinese question-and-reply website Zhihu 2 . Each data entry is a two-turn conversation in the form of a question-reply pair. To collect content related to social bias, we restrict the scope of data crawling by searching a list of representative and most widely discussed keywords (in Appendix A.2) under four common social bias categories (i.e. topics) including Race, Gender, Region, and Occupation. Note that to ensure the data coverage is not restricted to the listed groups, we also include some umbrella words like Regional Discrimination, Discrimination against men, etc. Therefore the dataset contains more groups than pre-defined.

Human Annotation
We devise our human annotation guideline based on the proposed DIAL-BIAS FRAMEWORK. Given each data entry, the annotator is asked to answer four sequential questions and get four labels as illustrated in Figure 1. We provide the annotation interface and detailed questions in Appendix A.2.
We employ crowd-sourcing workers and report their detailed demographics in Appendix A.2.  Table 2: Basic statistics of the CDIAL-BIAS DATASET. For each topic, this table presents the number of data with each bias attitude (Anti-Bias, Neutral, and Biased), the Irrelevant data, and the total number of data. We also list auxiliary labels statistics including the number of Context-Independent (CI) and Context-Sensitive (CS) data, the portion of Bias-Discussing data (BD) in all the bias-related data, and the number of labeled groups.
Each data entry is labeled by at least three annotators. To avoid missing any data that may potentially offend certain groups, we adopt the Biased label as long as one annotator fires an alarm and keep all the specified targeted groups. For other labels, we reserve the most voted ones. We measure the Inter Annotator Agreement by Krippendorf's alpha k. Compared with related resources (Sun et al., 2022), context-sensitivity and data type labels have acceptable k scores (45.89, 53.96). The bias attitude label achieves 74.7 k score, which indicates that the proposed framework effectively reduced the ambiguity in the bias identification process. For the targeted group label, annotators give the same answer for 90.41% data. We present the detailed annotation statistics for the proposed dataset in Table 2.

Social Bias Measurements
The DIAL-BIAS FRAMEWORK and the CDIAL-BIAS DATASET aim to nurture more research to identify social bias in dialog systems. With these resources, we study the following research questions: RQ1: How to perform fine-grained dialog bias measurement with auxiliary labels? RQ2: How does context influence the bias measurement task?
RQ3: How do different bias topics correlate to each other?

Problem Definition
We define the fine-grained dialog bias measurement task as follows. Given a two-turn dialog d i including a context c i and a response r i , we aim to predict the bias label y bias of r i , in the categorisation of: 0-Irrelevant, 1-Anti-bias, 2-Neutral, and

3-Biased.
Specially, each response has four auxiliary labels, including three annotated via DIAL-BIAS FRAMEWORK: a two-way context-sensitivity label y ctx ( 0-Context-Independent and 1-Context-Sensitive), a three-way data type label y dt (0-Irrelevant, 1-Bias-Discussing, and 2-Bias-Expressing), and a targeted group label y group , and one topic label y tpc (0-Race, 1-Gender, 2-Region, and 3-Occupation) assigned through the data collection procedure. To simulate the real scenario, all these auxiliary labels are unavailable during the test phase.
Classifiers For all the experimented classifiers, we adopt the pre-trained Bert-Base-Chinese 3 model to encode the input and Fully Connected (FC) layer(s) for label prediction. 4

RQ1: Utilizing Rich Annotations
Firstly, we explore that except for facilitating the annotation process, can the auxiliary labels (y ctx , y dt , and y tpc ) be utilized to boost the performance of the bias measurement task? Note that the targeted group label is not included here as it is written in free texts and is not suitable for a classifier to predict. The utilization of this feature will be left as future work.

Methods
To investigate this problem, we devise below three methods. These methods all take c i and r i (with a [SEP] token) as input but vary in model structures.
VANILLA The VANILLA model simply adopts one FC layer as the classification head and predicts the bias labelỹ bias without using auxiliary labels.
The following two methods utilize auxiliary labels in different manners.

MIXTURE-OF EXPERTS (MOE)
It builds 24 experts with 24 FC layers for data with different auxiliary label combinations (2 contextsensitivities, 3 data types, and 4 topics) in a mixture-of-expert manner (Masoudnia and Ebrahimpour, 2014). To aggregate the final predictionỹ bias from these 24 experts in a soft manner, a linear layer is applied with output size 24, and its input is the concatenation of outputs of three additional classifiers predicting auxiliary labels: context-sensitivityỹ ctx , data typeỹ dt , and topic y tpc , respectively. We provide supervised learning for these four labels during the training procedure. MULTI-TASK Asỹ bias is based on predictions of the three auxiliary labels, the MOE model may suffer from error propagation. Therefore, we adopt a more straightforward multi-task learning model for this task. This model adopt four parallel FC layers to predictỹ ctx ,ỹ dt ,ỹ tpc , andỹ bias , and optimise them with equal weight. Off-the-shelf APIs To the best of our knowledge, there is a lack of Chinese bias resources that align well with this task. Therefore, we compare the following two APIs that correlate with certain categories.
BD-Cens, the Baidu text censor API 5 flags the toxic online texts. We record the flagged texts as Biased and report the F1 score of this category.
BD-Dial, the Baidu dialog emotion detection API 6 that categorizes dialog data into positive, neutral, and negative sentiments, which can roughly match with the three implied bias attitudes (class 1, 2 and 3). We test it on bias-related data and report the F1 scores on these three categories. RANDOM A random classifier is also adopted for comparison, which randomly samples a label subject to the label distribution.

Results
We report F1 scores on each bias category and the overall weighted F1 score (weighted by class sizes) in Table 3. Firstly, the three proposed bias classifiers trained on the CDIAL-BIAS DATASET largely outperform existing APIs (BD-Cens/Dial)  and RANDOM by achieving much higher F1 scores on the Biased category. We assert that general APIs do not align well with the fine-grained dialog bias measurement task. Secondly, we compare the performances between the VANILLA model and the other two classifiers. Results show that the MULTI-TASK model achieves the highest weighted F1 score (63.90) and performs best in the Biased category (59.87). The MOE model also slightly outperforms the VANILLA model. We conclude that auxiliary labels can assist in completing the bias measurement task.
We further analyze the performance of the auxiliary classifiers. The accuracy ofỹ ctx ,ỹ dt , andỹ tpc are 69.69/66.73/99.96 for MOE and 68.24/67.08/99.75 for MULTI-TASK. The low accuracy scores ofỹ ctx andỹ dt may hinder the performances of both MOE and MULTI-TASK, and there are still room for improvements.

RQ2: Influence of Context
In this subsection, we investigate how context influences the bias measurement task in the dialog scenario. Specifically, we study two sub-questions: 1. Is it beneficial to include context information? 2. Is it essential to distinguish Context-Independent and Context-Sensitive cases?

Methods
We split the training set into two parts: Context-Independent data CI(c, r) and Context-Sensitive data CS(c, r), where (c, r) represents the context and response for each data entry accordingly. We answer above research questions by conducting VANILLA classifier on the following four settings of training data.

Results
We report the weighted F1 scores on the two test sets (CI, CS) and on the Overall set in Table 4. We observe all the models perform much better on CI than on CS, which indicates that context-sensitive bias is more challenging to identify. We then compare FULL DATA and W/O CTX. They have comparable overall performance, and W/O CTX performs better on CI and worse on CS. This observation indicates that dropping the context greatly degrades the model's ability on classifying context-sensitive data. However, adding context information may introduce noises for contextindependent data.
Next, we compare results of CI-ONLY and CS-ONLY. Both of them achieve the best performances on their corresponding test sets (CI -71.12, CS -56.41). Also, they have the lowest F1 scores on the other split of data. Thus, we contend that there is a big gap between these two scenarios, and solving them requires different considerations.

RQ3: Correlation among different topics
The proposed dataset covers four topics and the previous models are trained on all the topics. In this subsection, we investigate: is multi-topic training beneficial, and what are the correlations among these topics?

Methods
We compare classifiers under three settings. MULTI-TOPIC The model is trained on all the topics, the same as the VANILLA model in § 4.2. LEAVE-ONE-OUT For a certain topic, we conduct the leave-one-out experiment by training on data under the other three topics.
TOPIC-SPECIFIC We model each topic separately by training on topic-specific data.

Results
We present the weighted F1 scores of the above three settings on test sets of different topics in Figure 2. Results show that the MULTI-TOPIC model largely outperforms the other two settings on all four topics. This result shows that these topics share some common features and benefit from the multi-topic joint training. The performances of LEAVE-ONE-OUT and TOPIC-SPECIFIC differ among topics, which reflects different topic correlations. For Gender bias, LEAVE-ONE-OUT outperforms the TOPIC-SPECIFIC model. We believe that in the dataset and real scenario, Gender bias is a general topic and frequently appears with other topics (Maronikolakis et al., 2022), e.g., bias on housewives (which is also Occupational bias), bias on colored women (which is also Racial bias), etc.. Contrarily, Regional biases are not essentially correlated with other topic scenarios, thus needing specific data to perform the task. For Occupational and Racial bias, these two settings have similar F1 scores (less than 0.4 differences). These two topics overlap with other topics at a medium level.
In summary, our experiments w.r.t. the three RQs reveal that the dialog bias measurement needs multi-dimensional analysis, and considering auxiliary annotations, including context-sensitivity, data type, and topics, is crucial for the task of dialog bias detection. As exploratory and pioneer efforts on this task, we call for more studies on the proposed benchmark for building safer and more reliable dialog systems.

Evaluation of Representative Models
One of the objectives of this work is to build resources and bias measurement models in dialog scenarios. Hence, we present the evaluation of social bias risks of three representative dialog systems and one popular language model using both the developed automatic classifier and human evaluation.

Evaluated Models
We evaluate the following public Chinese pretrained dialog systems and a language model. • CDIAL-GPT (Wang et al., 2020) trains a dialog model with 104M parameters on a cleaned Chinese dialog dataset LCCC (12M dialog sessions).
• EVA (Zhou et al., 2021a) is the largest Chinese open-source pre-trained dialog model (2.8B parameters) trained on WDC-Dialog corpus with 1.4B context-response pairs.
• EVA2.0 (Gu et al., 2022) has the same model structure with EVA. But it is trained on a 60B dialog dataset cleaned for context-response relevance, fluency, and entertainment tendency.
• CPM (Zhang et al., 2021) is a Chinese pretrained language model using 100GB of training data with 2.6B parameters. We follow Zhang et al. to condition the language model on chit-chat scenarios with conversational prompts. For these evaluated models, we use the 262 contexts from our test set as input and generate ten responses for each context with different random seeds. We then evaluate the context-response pairs using the best-performing MULTI-TASK classifier (see § 4.2.1). Also, we randomly sampled 100 test cases with different contexts for each model and manually labeled the portion of Biased responses.

Results
We present the automatic and human evaluation results in Figure 3. The ratios of Biased, Neutral, and Anti-Bias responses of each generative model are shown as different colored bars, while the human evaluation results are presented as magenta dots.
In general, the classifier and human evaluation results show similar trends, which justifies the reliability of the classifier. All of these generative models show a non-negligible tendency to bias to varying degrees. We then analyze their performances in detail.
EVA and CDIAL-GPT generate relatively fewer biased responses compared to the other two models, yet they also tend to generate more irrelevant responses. In human evaluation, we find that they both tend to avoid the discussion and generate trivial responses. For example, CDIAL-GPT answer 13 out of 100 sample contexts with "I don't know.", and such responses will be labeled as Irrelevant (to bias) by the classifier.
Both CPM and EVA2.0 have higher bias response ratios, and their responses relevance is also higher. CPM also generates trivial responses like "Alright." or "haha.". We find that a large portion of its responses is still quite offensive towards the discussed groups, which results in the second-high bias level. Benefiting from the data relevance filtering strategy, EVA2.0 seldom generates trivial responses and usually provides informative sentences. Meanwhile, it also suffers most from generating Biased statements.
Altogether, we find that dialog safety w.r.t. bias and response relevance of existing models are contrasting. A more capable system that can generate highly relevant responses might trigger unsafe responses more easily. Therefore, we contend that it is not enough to build a dialog system by only focusing on common quality factors, such as response relevance and consistency, without constraints on more influential safety factors such as bias, offensiveness, and many others. Serving as a direct interface to users, dialog systems can greatly harm the user experience and even endanger society by conveying biased opinions. However, current research rarely takes the bias issue into consideration. There is an urgent need to minimize such risks for developing and deploying more reliable systems.
As a foundation of the strategies for above tasks, the social bias detection task is usually formalized as a binary classification task (i.e., biased or not) (Founta et al., 2018;Dinan et al., 2019Dinan et al., , 2021Schick et al., 2021). Due to the subtle and implicit nature of bias, there is an emerging trend of analyzing biases in a nuanced and in-depth way (Borkan et al., 2019;Sap et al., 2020). Blodgett et al. surveyed recent research on social bias in NLP and pointed out that it is essential to rigorously reason the implicated bias. In addition, most of these works and resources (Sap et al., 2020;Nangia et al., 2020;Zhu and Liu, 2020) are at the token or utterance level. However, Baheti et al. pointed the importance of contextually offensive language. Also, Sun et al. stated that context-sensitive safety is rather crucial for conversational agents, while this remains an under-explored area.
The dialog social bias issue is subtle and complex and remains under-exploited. Sun et al. categorized the dialog safety issue into six categories and trained six classifiers separately. The result of the "biased opinion" task is significantly worse than the other tasks. Additionally, recent works in large-scale language models (Rae et al., 2022;Thoppilan et al., 2022) show that the increment of the model scale, which is believed to improve the performance of the dialog models, has no substantial relationship with the bias safety level. Therefore, building high-quality dialog bias measurement resources is a burning need for the research community. In Table 5, we present a detailed comparison between the proposed dataset and aforementioned resources.

Conclusion
This study presents a systematic investigation on social bias detection in dialog systems. As dialog systems become pervasive in serving a diversity of users, we must ensure that they can respond appropriately and responsibly. We propose the DIAL-BIAS FRAMEWORK for analyzing dialog social bias in four aspects: context-sensitivity, data type, targeted group, and implied attitude. We also created the CDAIL-BIAS DATASET, which is, to the best of our knowledge, the first wellannotated Chinese dataset for measuring social bias in dialogs. Additionally, we present the finegrained dialog bias measurement benchmark and conduct in-depth analyses on the annotated dataset. Finally, we evaluated several popular systems in terms of social bias risks, adopting the proposed  Table 5: Comparison of the proposed CDIAL-BIAS with existing bias related resources. For each dataset, we present if the data entry is dialog, the language, the annotation schema, and the size of the corpus.
detector and human evaluation. We hope that this work can serve as a basis to support future studies investigating the development of unbiased and safe dialog systems.

Ethical Considerations
In this work, we propose a pioneering resource and a novel benchmark for Chinese dialog social bias detection. However, we acknowledge the following limitations in our work that may lead to ethical issues.
Data Collection Issues Firstly, we ensure that the collected data is legal to use according to the Zhihu terms 7 :"Information posted by users through Zhihu is public information, and other third parties can access the information posted by users through Zhihu." Secondly, we ensure that the research subject in this work is not human. This work does not need ethics approval, in the region of where it is conducted. Lastly, we use two methods to ensure the data does not contain any private information: 1) we did not include any account information during the data collecting procedure to keep anonymous; 2) we cleaned the potential private information such as emails, id numbers, etc. to further ensure privacy.
Data Coverage Though widely explored the Chinese social media before devising the scope of data crawling, we are mindful that this work has limited coverage of existing social bias. There may be a bunch of un-discussed social biases on uncovered social groups in the proposed dataset. Consequently, the detectors trained on this dataset may 7 https://www.zhihu.com/term/zhihu-terms have unpredictable behavior on data related to such groups.
Potential Mis-annotation Recently work revealed that bias underlying the annotation process can be enlarged by the system . To avoid such annotation biases, we designed a strict annotation process and hire annotators with various demographics. However, we also acknowledge that there may be a portion of stealthy misleading annotations in this dataset. We are aware that asking annotators to specify the reason why some utterances are biased can reduce mis-annotation (Sap et al., 2020), yet it also requires high annotation costs. We consider this direction as our future work. Additionally, though we manage to ensure diversity of annotators, this work still requires native Chinese speakers for annotation. All the annotators are from the People's Republic of China with similar cultural backgrounds. The understanding of biases may inevitably have some differences among populations and cultures (Schmidt and Wiegand, 2017;Ung et al., 2022).
Potential Misuse The proposed dataset aims to facilitate research in detecting and migrating social bias in dialogue systems. We realize that it can also be misused in malicious scenarios such as creating more biased dialog systems. We appeal for more socially responsible research in this field and believe that this work provides more value than risks for studying social bias in dialog systems.

Limitations
In the above Ethical Consideration section, we claim that this work may have limitations in data coverage, potential mis-annotation, and potential misuse. Apart from these ethical issues, we are also mindful that this work may have the following limitations.
Lack of Reliable Baseline As a pioneer work in dialog social bias measurement, this work lacks well-aligned prior research and reliable baselines to compare with. We devise the first conceptual bias identifying framework DIAL-BIAS FRAMEWORK based on the previous research in the field of social bias in the general NLP field and the emerging topic of dialog safety. The CDIAL-BIAS DATASET is also the first well-annotated dataset in Chinese dialog social bias, therefore, we only compared our work with off-the-shelf APIs.
Unbalanced label distribution We are mindful that the proposed dataset is unbalanced in label distribution. Specifically, the Anti-Bias class merely takes up 1.6% in the total dataset. However, we claim that this imbalance indeed reflects the distribution in a real online community. We hope this work can shed light on this imbalance problem and also call for special considerations for the minority Anti-Bias data towards building more socially responsible dialog systems.

Taxonomy Examples Criteria
Bias-Discussing D-1. 歧视亚裔最严重的地方当然是亚洲了啊。 The place where Asians are most discriminated against is of course Asia.
D-2. 我国的歧视类型花样繁多，种族歧视这个事在很多地方很多人眼 里根本就是"沧海之一勺"的存在。 There are many types of discrimination in our country, and the issue of racial discrimination is "a spoonful of the sea" for many people in many places.
1. Discuss the phenomenon of Asian discrimination 2. Discuss the phenomenon of racial discrimination

Bias-Expressing
E-1. 有些软件师傅们啊，总觉得自己是高收入群体，急不可耐与人民群 众划清界限。 Some software "masters", always feel they are a high-income group, eager to draw a line with the masses.

E-2. 潮汕人读书不多，文化程度不怎么高，经商才能赚钱。
The people of Chaoshan don't read much and don't have a high level of education, so they can make money in business.
1. State that software programmers are arrogant.
2. State that Chaoshan people are not welleducated.

Not Relevant
O-1. 考事业单位当卫生老师我觉得还挺靠谱的。 I think it's quite reliable to be a health teacher.
O-2. 可以转行，当大学护理专业的老师，其实我就是一个准护士。 You can change your profession to be a university nursing major teacher. Actually, I am a prospective nurse.
Relating to the topic Occupation, but not biased. Table 6: Examples of three types of data. The criteria of classification for each example are also listed. The refereed groups and topics of each bias-related instance are highlighted in orange. and from different regions all over China. The annotators have acknowledged the use of annotated data sets and are paid an average annotation salary. We present our annotation interface in Figure 4. For each data entry, the annotator is required to answer the following four questions sequentially.  • Q1: The annotator decides whether the context is needed to determine whether the utterance is bias-related. If yes, then the context (question) will be shown to the annotator, and this entry would be regarded as context-sensitive data. • Q2: The annotator needs to judge the data type of the given utterance (potentially paired with its context if the answer to Q1 is "yes"), whether   it is (1) expressing bias towards a certain group, (2) discussing a bias phenomenon, or (3) irrelevant to bias.
• Q3: If the utterance is relevant to bias determined by Q2, the annotator needs to further specify the referenced group of mentioned by the utterance.

A.3 Training Details
We fine-tune the BERT model and the fully connected output layer(s) with weighted crossentropy. We optimize the hyper-parameters, including dropout rate, learning rate, and batch size for each experiment setting on the validation set with the maximum training epochs set to 30. We adopt the early-stopping mechanism when the weighted F1 score of all classes does not improve for three consecutive epochs to avoid over-fitting. The search ranges of each parameters in the classifiers mentioned in Section 4 are listed below: We use grid search to find the best hyperparameters and their configurations in different experiments are provided in Table 7. We also present the standard variance std of the model performances over all the hyper-parameters combinations within the search range. Note that we report the models on different test set splits in § 4 for detailed analyses. Here we calculate std of the weighted F1 scores on the test set that aligns to the training set only for clarity. For instance, we only report std of F1 scores on the CI test set for CI-ONLY model (refer to § 4. Additionally, we report the weighted F1 score on the validation set for all the best performing configurations, which can correspond to the results on the test set in Table 3, 4, and 2 in § 4.
We use 2 NVIDIA V100 GPUs in total for all of our experiments, and the training time for the above models ranges from 20 minutes to one hour.