Toward Deconfounding the Influence of Entity Demographics for Question Answering Accuracy

The goal of question answering (QA) is to answer any question. However, major QA datasets have skewed distributions over gender, profession, and nationality. Despite that skew, an analysis of model accuracy reveals little evidence that accuracy is lower for people based on gender or nationality; instead, there is more variation across professions (question topic). But QA's lack of representation could itself hide evidence of bias, necessitating QA datasets that better represent global diversity.


Introduction
Question answering (QA) systems have impressive recent victories-beating trivia masters (Ferrucci et al., 2010) and surpassing humans at reading comprehension (Najberg, 2018)-but these triumphs hold only if they generalize: QA systems should be able to answer questions even when those questions do not look like training examples. While other work (Section 4) focuses on demographic representation in NLP resources, our focus is how well QA models generalize across demographic subsets.
After mapping mentions to a knowledge base (Section 2), we show that existing QA datasets lack diversity in the gender and national origin of the people mentioned: English-language QA datasets mostly ask about US men from a few professions (Section 2.2). This is problematic because most English speakers (and users of English QA systems) are not from the US or UK. Moreover, multilingual QA datasets are often translated from English datasets (Lewis et al., 2020; Artetxe et al., 2019). However, no work has verified that QA systems generalize to infrequent demographic groups.
Section 3 investigates whether statistical tests reveal patterns in accuracy across demographic subgroups. Despite skewed distributions, accuracy is not correlated with gender or nationality, though it is with professional field. For instance, Natural Questions (Kwiatkowski et al., 2019, NQ) systems do well on entertainers but poorly on scientists, whom TriviaQA systems handle well. However, absence of evidence is not evidence of absence (Section 5), and existing QA datasets are not yet diverse enough to vet QA's generalization.

Focus on People
Many entities appear in examples (Table 1), but people form the majority in our QA tasks (except SQuAD); for NQ, we only consider questions with short answers. Existing work in AI fairness focuses on disparate impacts on people, and model behavior is especially prone to cause harm when people are involved; hence, our primary aim is to understand how the demographic characteristics of the people mentioned correlate with model correctness.
The people asked about in a question can be in the answer-"who founded Sikhism?" (A: Guru Nanak), in the question-"what did Clara Barton found?" (A: American Red Cross), or in the title of the source document-"what play featuring General Uzi premiered in Lagos in 2001?" (A: King Baabu) is answered by the page on Wole Soyinka. We search until we find an entity: first in the answer, then in the question if no entity is found in the answer, and finally in the document title.
Demographics are a natural way to categorize these entities, and we consider the high-coverage demographic characteristics from Wikidata. Given an entity, Wikidata has good coverage for all datasets: gender (> 99%), nationality (> 93%), and profession (> 94%). For each characteristic, we use the knowledge base to extract the specific value for a person (e.g., the value "poet" for the characteristic "profession"). However, the values defined by Wikidata have inconsistent granularity, so we collapse near-equivalent values (e.g., "writer", "author", "poet"; see Appendix A.1-A.2 for an exhaustive list). For questions with multiple values (where multiple entities appear in the answer, or a single entity has multiple values), we create a new value concatenating them. An 'others' value subsumes values with fewer than fifteen examples; people without a value become 'not found' for that characteristic.
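The bucketing rules above amount to a small normalization step. The sketch below illustrates them with a toy merge table; the real merge lists are in Appendix A.1-A.2, and counts is assumed to hold the frequency of each collapsed value in the dataset.

```python
# Toy merge table; the real lists are in Appendix A.1-A.2.
PROFESSION_MERGE = {"author": "writer", "poet": "writer", "novelist": "writer"}
MIN_COUNT = 15  # values rarer than this collapse into 'others'

def normalize_values(raw_values, counts):
    """Map one question's raw Wikidata values to an analysis bucket."""
    if not raw_values:
        return "not found"
    # collapse near-equivalent values, then concatenate multiples
    merged = sorted({PROFESSION_MERGE.get(v, v) for v in raw_values})
    value = "+".join(merged)
    return value if counts.get(value, 0) >= MIN_COUNT else "others"
```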
Three authors manually verify entity assignments by vetting fifty random questions from each dataset. For questions with at least one entity, inter-annotator agreement with CLOUD-NL's annotations is a near-perfect 96%; for questions where CLOUD-NL found no entity, agreement is 98%. Some errors are benign: an incorrect entity sometimes retains the correct demographic values (e.g., Elizabeth II instead of Elizabeth I). Other times, coarse-grained nationality ignores nuance, such as the distinction between Greece and Ancient Greece.

Who is in Questions?
Our demographic analysis reveals skews in all datasets, reflecting differences in task focus (Table 2). NQ is sourced from search queries and skews toward popular culture. QB nominally reflects an undergraduate curriculum and captures more "academic" knowledge. TriviaQA is popular trivia, and SQuAD reflects Wikipedia articles.
Across all datasets, men are asked about more than women, and the US is the subject of the majority of questions.

What Questions can QA Answer?
QA datasets have different representations of demographic characteristics; is this focus benign or do these differences carry through to model accuracy?
We analyze a SOTA system for each of our four tasks. For NQ and SQuAD, we use a fine-tuned BERT (Alberti et al., 2019) with curated training data (e.g., downsampling questions without answers and splitting documents into multiple training instances). For the open-domain TriviaQA task, we use ORQA (Lee et al., 2019), which uses BERT-based reader and retriever components. Finally, for QB, we use the competition winner from Wallace et al. (2019), a BERT-based reranker over a TF-IDF retriever. Accuracy (exact match) and average F1 are both common QA metrics (Rajpurkar et al., 2016); since the two are related and some statistical tests require binary scores, we focus on exact match.
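For concreteness, exact match can be computed with SQuAD-style answer normalization (Rajpurkar et al., 2016); the sketch below is a minimal version of that convention, not the exact evaluation script used for each task.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase; strip punctuation, articles, and extra whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers) -> int:
    """1 if the prediction matches any gold answer after normalization."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))
```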
Rather than focus on aggregate accuracy, we focus on the accuracy of demographic subsets (Figure 1). For instance, while 66.2% of QB questions about people are answered correctly, the number is lower for people from the Netherlands (55.6%) and higher for those from Ireland (87.5%). Unsurprisingly, accuracy is consistently low on the 'not_found' subset, where Wikidata lacks a person's demographic value.
Are the differences we observe across strata significant? We probe this in two ways: a χ² test (Plackett, 1983) to see whether trends exist, and logistic regression to explore those that do.

Figure 1: Accuracies split by demographic subsets in our QA datasets' dev fold for all three characteristics, compared to average accuracy (vertical line). For each dataset, we only consider examples that have a mention of a person-entity in either the answer, the question, or the document title. Individual plots correspond to a χ² test of whether demographic values and accuracy are independent (Section 3.1), with significant characteristics highlighted in red (p-value < 0.0167).

Do Demographic Values Affect Accuracy?
The χ² test is a non-parametric test of whether two variables are independent. To see whether accuracy and a characteristic are independent, we apply a χ² test to an n × 2 contingency table whose n rows give the frequency of that characteristic's subsets contingent on whether the model prediction is correct (Table 3). Rejecting the null with a Bonferroni correction (Holm, 1979; divide the p-value threshold by three, as we run multiple tests per dataset) suggests possible relationships: gender in NQ (p = 2.36 × 10⁻¹²), and professional field in NQ (p = 0.0142), QB (p = 2.34 × 10⁻⁷), and TriviaQA (p = 0.0092). However, we find no significant relationship between nationality and accuracy in any dataset. While χ² identifies which characteristics affect model accuracy, it does not characterize how. For instance, χ² indicates gender is significant in NQ, but is that because accuracy is higher for women, or because the presence of both genders in an example lowers accuracy?
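The test itself is a one-liner given the contingency table. A minimal sketch with invented counts, using the Bonferroni-corrected threshold of 0.05/3 ≈ 0.0167 from Figure 1:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: one per demographic value; columns: (correct, incorrect) counts.
table = np.array([
    [120, 80],  # e.g., male (illustrative counts, not the paper's)
    [45, 40],   # female
    [10, 12],   # multiple genders
])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, significant: {p < 0.05 / 3}")
```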

Exploration with Logistic Regression
Thus, we formulate a simple logistic regression: can an example's demographic values predict whether a model answers correctly? Logistic regression and related models are the workhorse for discovering and explaining relationships between variables in history (McCloskey and McCloskey, 1987) and education ( , 1999). Logistic regression is also a common tool in NLP: to find linguistic constructs that allow determiner omission (Kiss et al., 2010) or to understand how a scientific paper's attributes affect citations (Yogatama et al., 2011). Unlike model calibration (Niculescu-Mizil and Caruana, 2005), whose goal is to maximize prediction accuracy, the goal here is explanation. We define binary features for the demographic values of characteristics the χ² test found significant (thus we exclude SQuAD, the nationality characteristic, and the gender characteristic for all datasets but NQ). For instance, a question about Abidali Neemuchwala would have features for g_male and o_executive but zero for everything else. Real-valued features, multi_entities and multi_answers, capture the effect of multiple person-entities and multiple gold answers (scaled with the base-two logarithm).
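A minimal sketch of this encoding follows; the feature names mirror the paper's (g_male, o_executive, multi_entities, multi_answers), but the function itself is illustrative rather than the actual implementation.

```python
import math

def demographic_features(genders, professions, n_entities, n_answers):
    """Binary indicators per demographic value plus log2-scaled counts."""
    feats = {}
    for g in genders:          # e.g., ["male"] -> {"g_male": 1.0}
        feats[f"g_{g}"] = 1.0
    for o in professions:      # e.g., ["executive"] -> {"o_executive": 1.0}
        feats[f"o_{o}"] = 1.0
    feats["multi_entities"] = math.log2(n_entities)  # 0 for a single entity
    feats["multi_answers"] = math.log2(n_answers)    # 0 for a single answer
    return feats
```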
But demographics are not the only reason an answer may be difficult or easy. Following Sugawara et al. (2018), we incorporate features that capture a question's difficulty. For instance, questions that clearly hint at the answer type reduce ambiguity: t_who checks whether the token "who" appears at the start of the question, and t_what, t_when, and t_where capture other entity types. Questions are also easier if the evidence differs from the question by only a couple of words; thus, q_sim is the Jaccard similarity between question and evidence tokens. Finally, the binary feature e_train_count marks whether the person-entities occur more than twice in the training data.
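These difficulty features are likewise straightforward. The sketch below follows the definitions above (and the first-ten-tokens window from Appendix B.1); naive whitespace tokenization is a simplifying assumption.

```python
def difficulty_features(question, evidence_sentence, train_count):
    """t_wh* indicators, Jaccard q_sim, and a binary training-frequency flag."""
    tokens = question.lower().split()
    feats = {f"t_{wh}": float(wh in tokens[:10])
             for wh in ("who", "what", "when", "where")}
    q, e = set(tokens), set(evidence_sentence.lower().split())
    feats["q_sim"] = len(q & e) / len(q | e) if q | e else 0.0
    feats["e_train_count"] = float(train_count > 2)
    return feats
```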
We first drop features with negligible effect on accuracy using LASSO (regularization λ = 1), removing those with zero coefficients. For the remaining features, Wald statistics (Fahrmeir et al., 2007) estimate p-values. Although we initially include quadratic features, they are all eliminated during feature reduction; thus, we report only the linear features with at least minimal significance (p-value < 0.1).
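A sketch of this two-stage procedure, using scikit-learn for the LASSO step and statsmodels for Wald p-values; the synthetic X and y stand in for the real feature matrix and per-question correctness labels (note λ = 1 corresponds to C = 1 under scikit-learn's inverse-regularization parameterization).

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                          # stand-in feature matrix
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)   # stand-in correctness

# Stage 1: LASSO (L1) drops features whose coefficients shrink to zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
keep = np.flatnonzero(lasso.coef_[0] != 0.0)

# Stage 2: unregularized refit; summary() reports Wald z-stats and p-values.
model = sm.Logit(y, sm.add_constant(X[:, keep])).fit(disp=0)
print(model.summary())
```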

How do Properties Affect Accuracy?
Recall that logistic regression uses features to predict whether the QA system will answer correctly. Features associated with correct answers have positive weights (like those derived from Sugawara et al. (2018), q_sim and e_train_count), features associated with incorrect answers have negative weights, and features without effect are near zero. Among the t_wh* features, t_who significantly correlates with model correctness, especially in NQ and QB, where questions often ask directly about a person.
However, our goal is to see if, after accounting for obvious reasons a question could be easy, demographic properties can explain QA accuracy. The strongest effect is for professions (Table 4). For instance, while NQ and QB systems struggle on science questions, TriviaQA's does not. Science has roughly equivalent representation (Table 2), suggesting QB questions are harder.
While multi_answers and multi_entities mark harder NQ questions, multi_answers has a positive effect in TriviaQA: TriviaQA uses multiple answers for alternate formulations of the same answer (Appendix B.2.1, B.2.2), which aids machine reading, while multiple NQ answers are often a sign of ambiguity (Boyd-Graber and Börschinger, 2020; Si et al., 2021): "Who says that which we call a rose?" A: Juliet, A: William Shakespeare. For the male and female genders, NQ shows no statistically significant effect on accuracy; only questions about entities with multiple genders depress accuracy. Given the many findings of gender bias in NLU (Zhao et al., 2017; Webster et al., 2018; Zhao et al., 2018; Stanovsky et al., 2019), this is surprising. However, we caution against accepting this conclusion without further investigation, given the strong correlation of gender with professional field (Goulden et al., 2011), where we do see significant effects.
Taken together, the χ² and logistic regression analyses give us reason for optimism: although the data are skewed for all subsets, QA systems might well generalize from limited training data across gender and nationality.

Related Work
Language is a reflection of culture. Like other cultural artifacts-encyclopedias (Reagle and Rhue, 2011), and films (Sap et al., 2017)-QA has more men than women. Other artifacts like children's books have more gender balance but reflect other aspects of culture (Larrick, 1965).
The NLP literature is also grappling with demographic discrepancies. Standard coreference systems falter on gender-balanced corpora (Webster et al., 2018), and later work (2020) shows shortcomings in QA datasets and evaluations by analysing their out-of-domain generalization capabilities and ability to handle question variation. Joint models of vision and language suggest that biases come from language rather than vision (Ross et al., 2021). However, despite a range of mitigation techniques (Zhao et al., 2017, inter alia), none, to our knowledge, have been successfully applied to QA, especially from the demographic viewpoint.

Discussion and Conclusion
This paper delivers both good news and bad news. While datasets remain imperfect and reflect societal imperfections, for many demographic properties, we do not find strong evidence that QA suffers from this skew.
However, this is an absence of evidence rather than evidence of absence: these are skewed datasets in which fewer than a quarter of the questions are about women. It is difficult to make confident assessments on such small subsets; many demographic values were excluded because they appeared infrequently (or not at all). Improving the diversity of QA datasets can help us be more certain that QA systems generalize and reflect the diverse human experience. Considering such shortcomings, Rodriguez et al. (2021) advocate improving evaluation by focusing on the examples most important for ranking models; demographic properties could further refine such holistic evaluations.

Table 4: Influential features after filtering characteristics based on a χ² test (Figure 1). Highly influential features (p-value < 0.1), both positive (blue) and negative (red). A higher number of ⋆'s signals higher significance.
A broader analysis beyond person entities would be a natural extension of this work. Label propagation can expand the analysis beyond people: the Hershey-Chase experiment is associated with Alfred Hershey and Martha Chase, so it would-given its neighboring entities in the Wikipedia link graph-be 100% American, 50% male, and 50% female. Another direction for future work is accuracy under counterfactual perturbation: swapping real-world entities (in contrast with the nonce entities of Li et al. (2020)) for entities with different demographic values.
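To make the label-propagation idea concrete, here is a toy sketch of one propagation step over a hypothetical two-node neighborhood, reproducing the Hershey-Chase numbers above:

```python
# Hypothetical two-node neighborhood in the Wikipedia link graph.
neighbors = {"Hershey-Chase experiment": ["Alfred Hershey", "Martha Chase"]}
attrs = {"Alfred Hershey": {"american": 1.0, "male": 1.0, "female": 0.0},
         "Martha Chase":  {"american": 1.0, "male": 0.0, "female": 1.0}}

def propagate(entity):
    """One step: average the attribute vectors of neighboring people."""
    people = neighbors[entity]
    keys = attrs[people[0]]
    return {k: sum(attrs[p][k] for p in people) / len(people) for k in keys}

print(propagate("Hershey-Chase experiment"))
# {'american': 1.0, 'male': 0.5, 'female': 0.5}
```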
Nonetheless, imbalances remain, particularly for professional fields. The lack of representation in QA could make things seem better than they are because of Simpson's paradox (Blyth, 1972): gender and profession are not independent! For example, in NQ, our accuracy on women is higher in part because of the dataset's tilt toward entertainment, and we cannot say much about women scientists. We therefore caution against interpreting strong model performance on existing QA datasets as evidence that the task is 'solved'. Instead, future work must consider better dataset construction strategies and the robustness of accuracy metrics to different subsets of the available data, as well as to unseen examples.
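A toy numeric illustration of this paradox (all counts invented): a model can be less accurate on women within every profession yet look more accurate on women overall, because women in the dataset skew toward the profession the model handles well.

```python
# (correct, total) counts per (gender, profession); all numbers invented.
data = {("women", "entertainer"): (80, 100), ("women", "scientist"): (1, 5),
        ("men", "entertainer"): (17, 20),    ("men", "scientist"): (30, 85)}

def accuracy(select):
    correct = sum(c for k, (c, t) in data.items() if select(k))
    total = sum(t for k, (c, t) in data.items() if select(k))
    return correct / total

print(accuracy(lambda k: k[0] == "women"))  # ~0.77 overall for women
print(accuracy(lambda k: k[0] == "men"))    # ~0.45 overall for men
# Yet within each profession women fare worse:
# entertainers: 0.80 (women) vs 0.85 (men); scientists: 0.20 vs ~0.35
```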

Ethical Considerations
This work analyses demographic subsets across QA datasets based on gender, nationality, and profession. We believe the work makes a positive contribution to representation and diversity by pointing out the skewed distributions of existing QA datasets. To avoid noise being interpreted as signal given the lack of diversity in these datasets, we could not include various subgroups that we believe should have been part of this study: non-binary people, intersectional groups (e.g., women scientists in NQ), people indigenous to subnational regions, etc. We believe increasing the representation of all such groups in QA datasets would improve upon the status quo. We infer properties of mentions using Google Cloud-NL to link the entity in a QA example to an entry in the Wikidata knowledge base, attributing gender, profession, and nationality. We acknowledge that this process is not foolproof and is itself vulnerable to bias, although our small-scale accuracy evaluation did not reveal any concerning patterns.
All human annotations verifying the entity linking were provided by the authors, who were fairly compensated.

B Logistic Regression Analysis

This appendix lists the full set of features used in the logistic regression analysis after feature reduction, each with its coefficient, standard error, Wald statistic, and significance level, in Table 5. We also describe the templates and implementation details of the features used in our logistic regression analysis (Section 3.2) in Appendix B.1, and finally list randomly sampled examples from both the NQ and TriviaQA datasets in Appendix B.2 to show how the multi_answers feature has disparate effects on them.

B.1 Implementation of Logistic Regression features
• q_sim: For closed-domain QA tasks like NQ and SQuAD, this feature measures the (sim)ilarity between the (q)uestion text and the evidence sentence-the sentence from the evidence passage that contains the answer text-using Jaccard similarity over unigram tokens (Sugawara et al., 2018). Since we do not include SQuAD in our logistic regression analysis (Section 3.2), this feature is only relevant for NQ.
• e_train_count: This binary feature represents whether the distinct (e)ntities appearing in a QA example (through the approach described in Section 2) appear more than twice in the particular dataset's training fold. We avoid a logarithm here, as even the log frequency of some commonly occurring entities exceeds the expected feature value range.
• t_wh*: These features capture the expected entity type of the answer: t_who, t_what, t_where, t_when. Each binary feature captures whether the particular "wh*" word appears in the first ten (t)okens of the question text. (QB questions often start with "For 10 points, name this writer who...")
• multi_entities: With n the number of linked person-entities in an example (as described in Section 2), this feature is log₂(n); hence, it is 0 for an example with a single person-entity.
• multi_answers: With n the number of gold answers annotated in an example, this feature is log₂(n); hence, it is 0 for an example with a single answer.

Table 5: Influential features revealed through the logistic regression analysis (Section 3.2) over the demographic characteristics deemed significant by the χ² test (Figure 1). We report the highly influential features with significance p-value < 0.1, both positive (blue) and negative (red), and bold the highly significant ones (p-value < 0.05). The number of ⋆'s in the last column represents the significance level of the feature.