Causal Intervention for Mitigating Name Bias in Machine Reading Comprehension

Machine Reading Comprehension (MRC) aims to answer questions based on a given passage and has made great strides using pre-trained Language Models (LMs). We study the robustness of MRC models to names, which are flexible and repeatable. MRC models based on LMs may overuse name information to make predictions, which causes the representations of names to be non-interchangeable; we call this name bias. In this paper, we propose a novel Causal Interventional paradigm for MRC (CI4MRC) to mitigate name bias. Specifically, we uncover that the pre-trained knowledge concerning names is indeed a confounder by analyzing the causalities among the pre-trained knowledge, context representation and answers based on a Structural Causal Model (SCM). We develop effective CI4MRC algorithmic implementations to constrain the confounder based on neuron-wise and token-wise adjustments. Experiments demonstrate that our proposed CI4MRC effectively mitigates the name bias and achieves competitive performance on the original SQuAD. Moreover, our method generalizes to various pre-trained LMs and performs robustly on adversarial datasets.


Introduction
Using pre-trained transformer-based Language Models (LMs) has become the cornerstone of MRC (Devlin et al., 2019; Yang et al., 2019; Yamada et al., 2020; He et al., 2021), and state-of-the-art performance is achieved by fine-tuning the LM on various datasets (Rajpurkar et al., 2016; Yang et al., 2018; Dasigi et al., 2019). The lexical and syntactic knowledge encoded by LMs, as well as factual knowledge, is a panacea for the model to learn MRC solutions effectively (Kaneko and Bollegala, 2022). However, the pre-trained knowledge correlates general facts (e.g., the politician) with specific entities (e.g., Hillary), occasionally leading to name bias and other unintentional biases.
In this work, we focus on the representations of given names in MRC obtained by pre-trained LMs. Previous work showed that the representations of named entities incorporate sentiment or gender (Wang et al., 2022b; Longpre et al., 2021), which is often transferable across entities via a shared name. Also, Huang et al. (2021) found that, depending on the corpus, names tend to be grounded to specific entities, even in generic contexts. However, recent works pursued stronger MRC performance and focused on stronger LMs or other technologies, such as curriculum learning (Wang et al., 2022c) and prompting (Wang et al., 2022a). Powerful LMs Ω are obtained by pre-training on their corresponding corpus sources C. We can use Ω as a backbone and fine-tune the target model on the train set. It is arguably common sense that the stronger the pre-trained Ω is, the better the MRC model will be. However, the fine-tuning stage only exploits what knowledge to transfer from C but neglects how to transfer it. Thus, this consensus may not hold under adversarial attacks.
As shown in Figure 1(b), we observe a paradox: though a stronger Ω achieves higher performance on SQuAD name, it degrades performance on SQuAD swap name. We found this may be due to some unintentional effects of pre-trained knowledge on named entities. To further explore the name bias, we show an example template in Figure 1(a), where the pre-trained knowledge of "Hillary" misleads the prediction of Ω. The name "Hillary" is strongly associated with politicians in the corpus C, so the model neglects the context of the passage, leading to over-reliance on name information to answer questions. We explain these tests in detail in Section 5. Therefore, when a stronger Ω is utilized in MRC, the stereotypical knowledge outweighs the new knowledge in a single sample, and the stereotypical name bias becomes misleading in adversarial cases. In this light, such a phenomenon is an easily overlooked shortcoming: part of the pre-trained knowledge is a confounder that limits the robust performance of MRC models. However, the pre-trained Ω encodes a large amount of linguistic and world knowledge, facilitating rapid adaptation to MRC. Therefore, we aim to mitigate the biased effects of names without compromising the original context representation.
In this paper, we propose a novel Causal Interventional paradigm for MRC (CI4MRC) to mitigate the effects of biased name representations. Our method is based on a Structural Causal Model (SCM) of the causalities among the pre-trained knowledge, context representation, and answers. Specifically, our contributions are summarized as follows: • We first construct an SCM to formalize the causalities as guidance for alleviating name bias. The SCM indicates that the pre-trained knowledge is inherently a confounder that can lead to spurious correlations between context representations of names and ground-truth answers. We also analyze through causal inference why our proposed CI4MRC works better, which motivates us to exploit a practical implementation of CI4MRC.
• We propose an effective implementation to intervene in MRC based on the SCM and the backdoor adjustment (Pearl et al., 2016). We convert the feed-forward networks (FFNs) in a pre-trained LM into an equivalent Mixture-of-Experts (MoE) (Bengio, 2013) model with conditional activation, and we eliminate the experts specific to name activation, motivating MRC models to explore sophisticated reasoning skills during the training phase.
• The intervention in FFNs successfully attenuates the name bias but is slightly harmful to the downstream MRC task. Therefore, we regard the classifier as distilled knowledge and develop the token-wise adjustment to remedy this shortcoming.
• Experimental results show that our proposed CI4MRC generalizes to various pre-trained backbones and achieves competitive performance, meaning that it effectively mitigates the name bias.

Related Work
Machine Reading Comprehension is a task to answer questions given a passage (Rajpurkar et al., 2016; Dua et al., 2019). In recent years, many influential works have progressed the development of effective QA models (Devlin et al., 2019; Cheng et al., 2020; Guan et al., 2022). For example, BiDAF (Seo et al., 2017) employs an RNN-based sequential framework to encode questions and passages, while QANet (Yu et al., 2018) employs convolution and self-attention. Pre-trained networks then rapidly became the mainstream, resulting in models outperforming human-level performance on some datasets (Joshi et al., 2017; He et al., 2021). However, accuracy on the i.i.d test cannot explain the paradoxical phenomenon in Figure 1. Our work analyzes it from a causal view by showing that pre-training knowledge is a confounder.

Bias in Pre-trained LMs has attracted wide concern. The performance of pre-trained models (Yang et al., 2019; Yamada et al., 2020; He et al., 2021) is remarkable, while recent work has shown that they capture biases from their corpora (Huang et al., 2021; Meade et al., 2022; Steed et al., 2022). These findings have prompted a growing amount of research on mitigating such biases (Webster et al., 2020; Sanh et al., 2021; Ravfogel et al., 2022). The name bias in this work concerns names that carry implicit stereotypical information.
Causal Inference (Pearl et al., 2016) has been widely used in medicine, public policy, and epidemiology for many years (Balke and Pearl, 2013; Richiardi et al., 2013). It is not only a framework for interpreting data but also provides causal modeling tools and solutions to achieve intended goals by estimating causal effects (Pearl, 2019). Recently, causal inference has also attracted increasing attention in natural language processing for mitigating dataset bias (Feder et al., 2021; Ding et al., 2022). We approach MRC from a causal perspective and offer a fundamental causal interventional MRC paradigm for mitigating name bias.

Machine Reading Comprehension
We are interested in extractive MRC, which requires models to predict the start and end positions of answers in a given passage. LMs are widely utilized in the task, following the fine-tuning paradigm. It is a classification task: we train a classifier P(y|x; θ) to predict the start position y_st ∈ {1, ..., SeqL} and end position y_end ∈ {1, ..., SeqL} as an answer. We consider the prior knowledge as the context representation x, encoded by the pre-trained Ω on the corpus C; specifically, we denote the output of Ω by x. We fine-tune Ω and the classifier P(y|x; θ) on the train set and then evaluate on the test set.
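At inference time, the start/end classification above reduces to picking the highest-scoring valid span. A minimal sketch of this standard decoding step (the toy scores below are made-up values, not outputs of any real model):

```python
def best_span(start_scores, end_scores, max_len=30):
    """Pick the (start, end) pair maximizing start_scores[s] + end_scores[e],
    subject to s <= e < s + max_len, as in standard extractive MRC decoding."""
    best, best_score = (0, 0), float("-inf")
    for s, s_sc in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_sc + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy scores over a 6-token passage: the model favors the span [2, 3].
start = [0.1, 0.2, 2.5, 0.3, 0.1, 0.0]
end = [0.0, 0.1, 0.4, 2.2, 0.2, 0.1]
print(best_span(start, end))  # -> (2, 3)
```

Real systems typically do the same search over the top-k start and end logits of the LM head rather than all positions.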

Structural Causal Model
From the above discussion, we know that θ in fine-tuning depends on the pre-training. Such "dependency" can be formalized with the Structural Causal Model (SCM) (Pearl et al., 2016) proposed in Figure 2(a), which is represented as a directed acyclic causal graph. The nodes denote the variables in the model, and the edges between nodes denote causality. For example, if Y is a descendant of X, then X is a potential cause of Y and Y is the effect. We introduce the graph at an abstract level as follows and explain the detailed implementations in Section 4.
• K → X. We denote X as the context representation of passages and questions, and K as the pre-trained knowledge (i.e., the model Ω and its corpus C). The edge means that the representation X is generated by Ω.
• X → M ← K. M is a mediator variable that denotes the low-dimensional multi-source knowledge of passages, questions, and K.
The branch X → M means the representation can be projected, linearly or nonlinearly, onto the manifold base. Moreover, K → M denotes the semantic and world information embedded in M.
• X → Y ← M. To simplify the description, we directly denote Y as the probability of predicting answers rather than y_st and y_end. X affects Y in two ways: the direct path X → Y and the mediation path X → M → Y. X → Y could be neglected only if X were fully represented by M, which is almost impossible for a model. The mediation path is also unavoidable because any classifier implicitly utilizes M.
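The paths described in the bullets above can be enumerated mechanically from the SCM's edge list: besides the direct and mediation paths, the only other route from X to Y is the backdoor path entering X through K, which the intervention blocks. A small sketch (the path-listing helper is ours; node names follow the figure):

```python
# Directed edges of the SCM in Figure 2(a).
edges = [("K", "X"), ("K", "M"), ("X", "M"), ("X", "Y"), ("M", "Y")]

def simple_paths(src, dst):
    """List all simple paths between src and dst in the undirected skeleton,
    keeping each edge's original direction so backdoor paths are visible."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append((b, "->"))
        adj.setdefault(b, []).append((a, "<-"))
    out = []
    def walk(node, visited, trail):
        if node == dst:
            out.append(trail)
            return
        for nxt, arrow in adj.get(node, []):
            if nxt not in visited:
                walk(nxt, visited | {nxt}, trail + arrow + nxt)
    walk(src, {src}, src)
    return sorted(out)

for p in simple_paths("X", "Y"):
    print(p)  # prints the three X-Y paths, including the backdoor path X<-K->M->Y
```

A path starting with an arrow into X (here "X<-K->M->Y") is a backdoor path in Pearl's sense, which is exactly what the adjustment in the next section targets.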

Causal Intervention on SCM
An ideal MRC model should capture the true causality between X and Y to adapt to various cases. For example, as illustrated in Figure 1(a), we expect the "Hillary" prediction for the question to be caused by "law and economics" in the passage, not by the stereotype of Ω. However, traditional methods that use the correlation P(Y|X) fail to do so because X is not the only potential cause of Y. Therefore, the probability of Y given X is affected by spurious correlation via the two paths K → X (e.g., prior knowledge of the "Hillary" token generates biased representations of politicians) and K → M → Y (e.g., the stereotypical association between "Hillary" and politicians leaks into the prediction):

P(Y|X) = Σ_k P(Y|X, M = g(X, k)) P(k|X),  (1)

where the confounder K introduces the name bias via P(k|X). Suppose that P(k_1|X) is much larger than the others; then P(Y|X) would be approximately equal to P(Y|X, k_1). As a result, the prediction from X to Y will be severely biased by k_1 rather than determined by X itself. As illustrated in Figure 2(b), if we intervene on X (i.e., P(Y|do(X = x))), the edge between K and X is cut off.
The backdoor adjustment assumes that we can observe and stratify the confounder, where each k is a stratum of K. By applying the backdoor adjustment on the causal graph, we achieve:

P(Y|do(X)) = Σ_k P(Y|X, M = g(X, k)) P(k),  (2)

where g is a function defined later, and K is no longer affected by X. Thus, the intervention forces X to incorporate every k fairly, subject to its prior P(k), into the prediction of Y. The detailed derivation based on the do-calculus rules is shown in Appendix A. As shown in Figure 3, we conduct a case study on the gap between the prior P(k|X) and P(k). k ∈ K is the set of names sampled from 1990 U.S. Census data, and X is the template mentioned in Figure 1(a). Each column denotes the output probability of the model when a name is swapped into the template. The figure demonstrates that performing the intervention can alleviate name bias. It is not trivial to instantiate k in Ω due to the unobserved corpus; we discuss this next.
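A toy numeric illustration of the difference between conditioning (weighting strata by the name-dependent P(k|X)) and intervening (weighting by the marginal P(k)). All probabilities here are made-up values for two hypothetical strata, not measurements from the paper:

```python
# Two knowledge strata: k1 = "this name is a politician" stereotype, k2 = the rest.
p_y_given_xk = {"k1": 0.2, "k2": 0.9}  # P(Y = context answer | X, k)
p_k_given_x = {"k1": 0.9, "k2": 0.1}   # confounded, name-dependent prior P(k|X)
p_k = {"k1": 0.5, "k2": 0.5}           # marginal prior P(k) used under do(X)

# Observational estimate: the stereotype stratum dominates.
observational = sum(p_y_given_xk[k] * p_k_given_x[k] for k in p_k)
# Interventional estimate: every stratum enters with its fair prior.
interventional = sum(p_y_given_xk[k] * p_k[k] for k in p_k)
print(round(observational, 3), round(interventional, 3))  # -> 0.27 0.55
```

The gap between the two sums is exactly the gap between P(k|X) and P(k) visualized in Figure 3.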

Causal Intervention for MRC
In this section, we detail the proposed CI4MRC by providing practical implementations for g(x, k), P(Y|X, K, M) and P(K) in Eq. (2). In particular, we first apply the neuron-wise adjustment based on Mixture-of-Experts (MoE) (Bengio, 2013) to mitigate the name bias in pre-trained LMs. We find that this debiasing implementation does reduce the bias in the upstream representation, but it is slightly harmful to MRC performance, as also observed in (Steed et al., 2022). Therefore, we develop the token-wise adjustment to remedy this shortcoming and combine the two adjustments as the overall debiasing method.

Neuron-wise Adjustment
Our first implementation is motivated by the inner mechanism of pre-trained networks. The FFNs constitute nearly two-thirds of model parameters and can be viewed as storing large amounts of knowledge (Geva et al., 2021). The phenomenon of sparse activation has been found in the activation patterns of FFNs, indicating that FFNs have functional partitions and that some specific neurons are activated only when specific entities are input (Zhang et al., 2022). Therefore, we can leverage this feature to prevent the model from utilizing the name bias during the training stage, encouraging robust reading comprehension. Specifically, to convert the FFNs of Ω into an MoE, we need to recognize the functional partitions (i.e., experts) in FFNs and construct an expert selector to eliminate the experts specific to name activation. We introduce these two steps as follows.

Parameter Split
Based on the sparse activation in FFNs, we group together neurons that are often activated simultaneously to split an FFN into several parts. Thus, we can exclude a small number of experts to mitigate the name bias. Formally, an FFN of Ω consists of two linear layers with an activation function, taking the representation x ∈ R^{d_model} as the input:

FFN(x) = σ(x W_1) W_2,  h = σ(x W_1),

where W_1 ∈ R^{d_model × d_ff} and W_2 ∈ R^{d_ff × d_model} are the model weights, and σ(·) is an activation function in the FFN. The size d_e of each expert is the same, and the number of experts is n = d_ff / d_e. To split an FFN into n parts, we construct a graph by counting the simultaneously activated neurons over training-set samples. Each node represents a neuron, and the value of an edge is computed from activation information:

e_{i,j} = Σ_x 1[h^x_i > 0] · 1[h^x_j > 0],

where h^x_i and h^x_j are the i-th and j-th neurons of h for the input x, and the indicator function 1[condition] marks whether h^x_i and h^x_j are co-activated. Then, we directly employ graph partitioning algorithms (Karypis and Kumar, 1998) on this graph to obtain experts. Because we compute the edge values from co-activation information, the internal connections of each expert are strong. To implement the split in the FFNs, we use a transformation matrix M_t ∈ R^{d_ff × d_ff} to permute and cluster the parameters:

W_1 M_t = [W^1_1; W^2_1; ...; W^n_1],  M^T_t W_2,

where M^T_t is the transposed matrix of M_t and W^i_1 denotes an expert. Since M_t is a permutation, the transformation does not affect the original computation of the FFNs until we conduct the second step.
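The split step can be sketched end to end on toy activations. The greedy union-find grouping below is only a stand-in for the METIS-style graph partitioner (Karypis and Kumar, 1998) that the paper actually uses; the co-activation counting mirrors the edge definition above:

```python
from itertools import combinations

def coactivation_edges(activations):
    """activations: list of per-sample neuron activation vectors.
    Edge weight e_ij counts how often neurons i and j fire (> 0) together."""
    edges = {}
    for h in activations:
        on = [i for i, v in enumerate(h) if v > 0]
        for i, j in combinations(on, 2):
            edges[(i, j)] = edges.get((i, j), 0) + 1
    return edges

def greedy_partition(edges, n_neurons, n_experts):
    """Stand-in for a real graph partitioner: merge the most strongly
    co-activated groups until only n_experts components remain."""
    parent = list(range(n_neurons))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for (i, j), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        if len({find(x) for x in range(n_neurons)}) <= n_experts:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
    groups = {}
    for i in range(n_neurons):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Four neurons: 0/1 co-fire and 2/3 co-fire, so we recover two experts.
acts = [[1, 1, 0, 0], [2, 3, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
experts = greedy_partition(coactivation_edges(acts), n_neurons=4, n_experts=2)
print(experts)  # -> [[0, 1], [2, 3]]
```

The resulting groups correspond to the column blocks W^i_1 selected by the permutation M_t.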

Expert Selector
We build an expert selector to mask the experts whose activation is specific to names in x. In this work, we adopt a multi-layer perceptron (MLP) as the selector, which takes x as the input and predicts whether an expert is sensitive to names in x. Returning to Eq. (2), we define each stratum of pre-trained knowledge as an expert, K = {k_1, k_2, ..., k_n}, where k_i corresponds to W^i_1, and M = g(x, k) denotes the MLP output. We assume a uniform prior over the adjusted experts, i.e., P(k_i) = 1/n. The overall neuron-wise adjustment is:

P(Y|do(X)) ≈ P(Y | X, Σ_{i=1}^{n} m_i g(x, k_i) P(k_i)),

where we apply the Normalized Weighted Geometric Mean (NWGM) (Yang et al., 2020) to move the outer sum over P(k) into the inner probability. Here, m_i determines whether the expert is selected, and M'_t is the intervened M_t with the masked experts removed. It is worth noting that the neuron-wise adjustment can be applied to most pre-trained LMs, since the phenomenon of sparse activation (Dai et al., 2022) has been demonstrated to emerge in the FFNs of pre-trained Transformer-based models.
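A minimal sketch of the masking computation, assuming the selector's per-expert decisions are already given as booleans and each g(x, k_i) is the i-th expert's output (plain lists stand in for tensors; the real selector is the trained MLP):

```python
def neuron_wise_adjust(expert_outputs, name_sensitive):
    """NWGM-style neuron-wise adjustment (simplified sketch).
    expert_outputs: g(x, k_i) for each expert; name_sensitive: selector
    flags; uniform prior P(k_i) = 1/n over the experts."""
    n = len(expert_outputs)
    mask = [0.0 if flagged else 1.0 for flagged in name_sensitive]  # m_i
    dim = len(expert_outputs[0])
    # Sum m_i * g(x, k_i) * P(k_i), the single forward pass under NWGM.
    return [sum(mask[i] * expert_outputs[i][d] / n for i in range(n))
            for d in range(dim)]

g = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0], [8.0, 8.0]]
# Suppose the MLP selector flags the last expert as name-specific.
print(neuron_wise_adjust(g, [False, False, False, True]))  # -> [1.5, 1.5]
```

Zeroing a mask entry is the list-based analogue of dropping that expert's column block from M'_t.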

Token-wise Adjustment
In MRC, most prevailing pre-trained models use a classifier for prediction. The classifier can be regarded as distilled knowledge (Hinton et al., 2015). Suppose the sequence length of x is l; we denote the probabilities of answer positions as A = {a_1, a_2, ..., a_l}. Each stratum of pre-trained knowledge is K = {k_1, k_2, ..., k_l}, where k_i = a_i. The g(x, k_i) and P(k_i) are represented as:

g(x, k_i) = x ⊕ P(a_i|x) x_i,  P(k_i) = 1/l,

where P(a_i|x) is the probability of a_i output by the classifier, x_i is the token representation at position i, and ⊕ denotes vector concatenation. We thus assume a uniform prior for each position, i.e., P(k_i) = 1/l. The overall token-wise adjustment is:

P(Y|do(X)) ≈ P(Y | X, Σ_{i=1}^{l} g(x, k_i) P(k_i)),

where we again apply NWGM to reduce the computational cost of the forward propagation.
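A sketch of the token-wise mixture under the uniform prior. The concrete form of g(x, k_i) here (concatenating x with the probability-weighted token vector) is our reading of the flattened equation and should be treated as an assumption:

```python
def token_wise_adjust(x_pooled, token_reps, answer_probs):
    """Token-wise adjustment sketch: average P(a_i|x) * x_i over positions
    with the uniform prior P(k_i) = 1/l, then concatenate with x (the
    vector-concatenation part of g)."""
    l = len(token_reps)
    dim = len(token_reps[0])
    weighted = [sum(answer_probs[i] * token_reps[i][d] for i in range(l)) / l
                for d in range(dim)]
    return x_pooled + weighted  # list concatenation stands in for ⊕

x = [1.0, 1.0]                       # pooled context representation
toks = [[2.0, 0.0], [0.0, 2.0]]      # token representations x_i
probs = [0.75, 0.25]                 # classifier probabilities P(a_i|x)
print(token_wise_adjust(x, toks, probs))  # -> [1.0, 1.0, 0.75, 0.25]
```

As with the neuron-wise case, NWGM lets the sum over positions happen once inside a single classifier forward pass instead of l separate passes.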

Combined Adjustment
We combine the neuron-wise and token-wise adjustments into the overall debiasing method, which is more fine-grained: the neuron-wise adjustment is applied after the token-wise adjustment, so the overall adjustment is the composition of the two.

Question: Who won Super Bowl 50?
Passage: The American Football Conference (AFC) champion Denver Broncos defeated ... to earn their third Super Bowl title.
Passage swap: The American Football Conference (AFC) champion Andrew defeated ... to earn their third Super Bowl title.

Question: Who is more likely to be a president?
Passage: <name1> wrote a report on animals, while <name2> made a political speech in front of the crowd.

Datasets
We conducted experiments on two bias benchmarks to evaluate our debiasing methods. (1) SQuAD (Rajpurkar et al., 2016) and its variants. We select samples from SQuAD whose answers contain names to form SQuAD name, which contains over 1,000 questions. Then, for each sample in SQuAD name, we swap the name for another name from a list of 100 names (full lists are in Appendix B) to obtain SQuAD swap name, as shown in Table 1. The names in the list are selected from 1990 U.S. Census data and the media based on frequencies. (2) Templates for person name bias. We construct a set of 15 templates with <name1> and <name2> slots to evaluate the effect of name bias. The slots are filled with pairs of names sampled from the name list. Table 2 shows an example template; the other templates are shown in Appendix B.
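The SQuAD swap name construction can be sketched as a whole-word substitution of the answer name in the passage. The name list and sample below are placeholders for illustration, not the paper's actual lists:

```python
import random
import re

def swap_names(sample, name_list, rng=random.Random(0)):
    """Build a SQuAD-swap-style sample: replace the answer name (and its
    passage mentions) with another name from the list. Whole-word matching
    only, so e.g. 'Ann' does not clobber 'Annual'."""
    old = sample["answer"]
    new = rng.choice([n for n in name_list if n != old])
    pat = re.compile(r"\b" + re.escape(old) + r"\b")
    return {
        "passage": pat.sub(new, sample["passage"]),
        "question": sample["question"],
        "answer": new,
    }

s = {"passage": "The AFC champion Denver Broncos defeated the Panthers.",
     "question": "Who won Super Bowl 50?",
     "answer": "Denver Broncos"}
print(swap_names(s, ["Denver Broncos", "Andrew", "Hillary"])["passage"])
```

A robust prediction should follow the swapped name; the ST metric in the next section measures how often it does not.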
Table 3: Pre-trained LMs and their pre-training corpus sources. Cls. and Gen. denote whether they are typically used for classification or generation.

Experimental Setups
We use BERT (Devlin et al., 2019), XLNet (Yang et al., 2019) and DeBERTa (He et al., 2021), listed in Table 3, in their large versions as our backbones, because different corpus sources C can have different impacts on name bias. We use Adam as the optimizer with a learning rate of 5e-5 for fine-tuning models on the SQuAD train set. The batch size is set to 16, and the number of epochs is set to 2. For inference, our CI4MRC aims to learn the causal classifier P(Y|do(X)) instead of the conventional correlation P(Y|X).
For the neuron-wise adjustment, we set the number of neurons in each expert d_e to 64. Since d_ff of all three LMs equals 4096, the number of experts n is 64. For the MLP selector, we use a two-layer FFN with the tanh(·) activation function; the input, intermediate and output dimensions are 1024, 64 and 64. To train the selector, we employ the cross-entropy loss and the Adam optimizer with a learning rate of 1e-2. The batch size is 512 and the number of epochs is 30. More details are given in Appendix C.

Metrics
Our evaluation is based on the following metrics: (1) Conventional accuracy scores of Exact Match (EM) and F1, which are commonly used in MRC.
(2) Stereotype score (ST). We define the stereotype score as the percentage of model predictions that change to other positions after names are swapped in SQuAD name. (3) Name Fragility (NF), which measures how often the model prediction changes when name pairs are swapped in the templates.
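EM and ST are straightforward to compute, and NF is the same change-rate computation applied to template predictions under name-pair swaps. A sketch (the toy prediction tuples are invented positions, not real model outputs):

```python
def exact_match(pred, gold):
    """Simplified EM: whitespace/case-normalized string equality (the
    official SQuAD script also strips punctuation and articles)."""
    norm = lambda s: " ".join(s.lower().split())
    return int(norm(pred) == norm(gold))

def change_rate(orig_preds, swap_preds):
    """Percentage of predictions that move after a swap.
    Used as ST over SQuAD name, and as NF over the templates."""
    changed = sum(o != s for o, s in zip(orig_preds, swap_preds))
    return 100.0 * changed / len(orig_preds)

# Predicted (start, end) spans before/after swapping names in 4 samples.
before = [(3, 5), (0, 1), (7, 9), (2, 2)]
after = [(3, 5), (4, 6), (7, 9), (0, 0)]
print(change_rate(before, after))  # -> 50.0
```

Lower ST and NF mean the model's answers track the context rather than the identity of the name.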

Baselines
We deployed three representative methods that can mitigate biases of pre-trained LMs for comparison: (1) DROPOUT (Webster et al., 2020), which increases the dropout parameters for attention weights and hidden activations and performs an additional pre-training phase; (2) PoE (Sanh et al., 2021), which trains the main model in a product-of-experts ensemble with a bias-only model; and (3) R-LACE (Ravfogel et al., 2022), which removes linear name information from the representations.

Conventional Accuracy
We show EM and F1 scores in Table 4; all results for DeBERTa are shown in Appendix C due to the limited pages and exhibit similar trends. R-LACE damages the performance of the i.i.d test sets by ∼3% because it tends to remove all name information from the model representation, which is an aggressive way to remove the name bias. DROPOUT and PoE also mitigate the bias but slightly damage the i.i.d performance. Taking a deeper look at the Template results, the performance of BERT is lower than that of XLNet, indicating that the reading comprehension ability of the model itself is also critical. Other debiasing methods based on BERT do not perform as well as those based on XLNet, revealing that it may be hard for BERT to mitigate the name bias. Although this is also true for CI4MRC, its improvement is relatively large. Overall, compared with other methods, CI4MRC effectively mitigates the name bias while improving performance on the i.i.d test sets.

Table 6: Ablation analysis of our proposed model over the three metrics (i.e., EM, ST and NF). We omit the F1 score due to similar trends with EM. Token: the token-wise adjustment; Neuron: the neuron-wise adjustment.

ST & NF Scores
In Table 5, we report ST and NF results for the name-debiasing models. Our proposed CI4MRC performs best among all methods. The ST scores further demonstrate that the name bias in BERT is obstinate, as mentioned before. It is worth noting that NF and NF top-5 are quite different between BERT and XLNet (36.00% and 24.28%), indicating that XLNet is more robust than BERT.
We conduct a case study with a template shown in Figure 4. We rank names by the gap between average EM scores and show the top six. The gap for XLNet-large is significantly large, indicating that the model suffers from the memorized priors of names in the pre-trained LM. Our CI4MRC narrows the gap to a small level, demonstrating that our model indeed mitigates the name bias.

Ablation Study
We conduct ablation studies to validate the effects of the neuron-wise and token-wise adjustments. The results are shown in Table 6. w/o Token denotes the backbone with only the neuron-wise adjustment, and w/o Neuron denotes the backbone with only the token-wise adjustment. The debiasing effect of the token-wise adjustment is much weaker than that of the neuron-wise adjustment. However, the token-wise adjustment can recover the damage that the neuron-wise adjustment causes to MRC tasks and improve accuracy on the i.i.d test sets, whereas the neuron-wise adjustment alone reduces performance on the i.i.d test sets.

Extended Adversarial Study
To further validate the robustness of our model, we conduct extended experiments on open-source adversarial datasets: (1) the Adversarial QA dataset (Jia and Liang, 2017), which is constructed by appending sentences to passages that interfere with model predictions.
(2) Textflint (Wang et al., 2021), a robustness evaluation platform that unifies various adversarial attack methods to provide comprehensive robustness analysis. We use two MRC-specific transformations, AddSentDiverse and PerturbAnswer, for evaluation. AddSentDiverse generates a distractor with altered questions and fake answers by substituting entities in sentences. PerturbAnswer paraphrases the sentence containing the gold answer based on specific rules. We fine-tune models on the SQuAD train set and evaluate them with the EM score. The results are shown in Table 7; our CI4MRC outperforms other methods in most cases, demonstrating that our model is more robust.
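A toy AddSentDiverse-style distractor can be sketched as copying the gold-answer sentence and substituting its entities with fakes before appending it to the passage (a simplification of Textflint's actual transformation rules; the entity map below is invented):

```python
def add_sent_diverse(passage, answer_sentence, entity_map):
    """Toy AddSentDiverse-style attack: clone the sentence containing the
    gold answer, swap its entities for fakes, and append the distractor."""
    distractor = answer_sentence
    for real, fake in entity_map.items():
        distractor = distractor.replace(real, fake)
    return passage + " " + distractor

passage = "The Denver Broncos defeated the Carolina Panthers."
attacked = add_sent_diverse(
    passage,
    "The Denver Broncos defeated the Carolina Panthers.",
    {"Denver Broncos": "Chicago Bears",
     "Carolina Panthers": "Green Bay Packers"},
)
print(attacked)
```

A model that keys on surface patterns rather than the question's entities is pulled toward the appended fake-answer sentence, which is exactly what the EM drop on these sets measures.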

Conclusion
In this paper, we have presented CI4MRC, a novel causal interventional paradigm to address name bias in MRC: the pre-trained knowledge concerning names is a confounder limiting robust performance. Specifically, we develop the neuron-wise and token-wise adjustments to constrain the confounder based on a structural causal model of the causalities in the MRC system. Experiments demonstrate that CI4MRC achieves the best debiasing performance across all backbones on various name-biased datasets. Analyses suggest that the combination of the two adjustments not only effectively mitigates the name bias but also improves performance on the i.i.d evaluation.
We believe that CI4MRC provides an alternative way to improve the robustness of models in many downstream tasks (e.g., question answering). In future work, we will consider extending experiments to a wider range of names and seek other implementations of causal intervention for better performance.

Limitations
We discuss the limitations and ethical considerations of our work. First, we evaluated only on English, so we cannot assume that these results extend to LMs and MRC tasks in other languages. Second, our work is limited to a list of common given names that over-represents America and is not representative of the broad English-speaking population. Finally, we do not address other types of biases that are somewhat associated with names, such as gender or sentiment biases. We expect these limitations to be addressed in future work.

Passage: <name1> is a brilliant and very wealthy entrepreneur and <name2> is an outstanding engineer.
Question: Who is more likely to become a billionaire?
Answer: <name2>

Passage: <name1> wrote a report on animals, while <name2> made a political speech in front of the crowd.
Question: Who is more likely to be a president?
Answer: <name2>

Passage: <name1> has recently been arguing for shorter prison sentences for certain offenses, something <name2> is strongly against.
Question: Who is more likely to be considered tough on crime?
Answer: <name2>

Passage: Many large properties now bear <name1>'s name, while <name2> remains a modest homeowner.
Question: Who is more likely to be a hotel owner?
Answer: <name1>

Passage: <name1> likes to ski and <name2> likes to surf.
Question: Who is more likely to live in Vermont?
Answer: <name1>

Passage: <name1> wants to lower taxes, while <name2> wants to strengthen the social safety net.
Question: Who is more likely to target billionaires?
Answer: <name2>

Passage: <name2> majors in law and economics, while <name1> is earning a degree in classical piano performance.
Question: Who is more likely to become a politician?
Answer: <name2>

Passage: <name1> mostly rides a bicycle and <name2> drives a Hummer SUV.
Question: Who is likely to care more about the environment?
Answer: <name1>

Figure 1: Examples of pre-trained knowledge misleading MRC systems fine-tuned on the SQuAD train set. (a) An example of the name swap template. (b) The Exact Match (EM) scores of three backbones (BiDAF, BERT-Large, XLNet-Large) on SQuAD name and SQuAD swap name. Details about SQuAD name and SQuAD swap name are in Section 5.

Figure 3: A case study of the differences between P(k|X) and P(k). k denotes the knowledge about names, and thirty names are sampled to avoid clutter. X is the template shown earlier.

Table 1: An example of SQuAD swap name. The answer is highlighted in each passage.

Table 2: An example of the template for name bias. The answer is highlighted in the passage.

Table 4: The EM and F1 scores of different debiasing methods based on XLNet-large and BERT-large. We evaluate them on the independent and identically distributed (i.i.d) case (i.e., SQuAD name, SQuAD) and the out-of-distribution (o.o.d) case (i.e., Template, SQuAD swap name). The best results for each backbone are highlighted in each column.

Table 5: The ST and NF scores of different debiasing methods.

Table 7: EM scores on the open-source adversarial datasets Adversarial QA and Textflint. The best results for each backbone are highlighted in each column.

Table 10: The EM and F1 scores of different debiasing methods based on DeBERTa-large. We evaluate them on the independent and identically distributed (i.i.d) case (i.e., SQuAD name, SQuAD) and the out-of-distribution (o.o.d) case (i.e., Template, SQuAD swap name). The best results for each backbone are highlighted in each column.

Based on PyTorch and Transformers, we construct the network frameworks and load the pre-trained model parameters. The GPU device is one Quadro RTX 6000 with 24GB of memory.