How Conservative are Language Models? Adapting to the Introduction of Gender-Neutral Pronouns

Gender-neutral pronouns have recently been introduced in many languages a) to include non-binary people and b) as a generic singular. Recent results from psycholinguistics suggest that gender-neutral pronouns (in Swedish) are not associated with human processing difficulties. This, we show, is in sharp contrast with automated processing. We show that gender-neutral pronouns in Danish, English, and Swedish are associated with higher perplexity, more dispersed attention patterns, and worse downstream performance. We argue that such conservativity in language models may limit widespread adoption of gender-neutral pronouns and must therefore be resolved.


Introduction
Many linguistic scholars have observed how technology in general has altered the course of language evolution (Kristiansen and Coupland, 2011; Abbasi, 2020), e.g., through the influence of social media conventions. Language technologies, in particular, have also been argued to have such effects, e.g., by reducing the pressure to acquire multiple languages.
Gender-neutral pronouns are not an entirely modern concept. In 1912, Ella Flagg Young, then superintendent of the Chicago public-school system, said the following to a room full of school principals: "The English language is in need of a personal pronoun of the third person, singular number, that will indicate both sexes and will thus eliminate our present awkwardness of speech." The use of gender-neutral pronouns has become much more popular in recent years (Gustafsson Sendén et al., 2021). In 2013, a gender-neutral pronoun was politically introduced in Swedish (Gustafsson Sendén et al., 2015). It can be used both for people identifying outside the gender dichotomy and as a generic pronoun where information about gender is either unavailable or irrelevant.
In a recent eye-tracking study, Vergoossen et al. (2020a) found no evidence that native speakers of Swedish find gender-neutral pronouns harder to process than gendered pronouns, contrary to an argument often brought up by opponents of gender-inclusive language (Speyer and Schleef, 2019; Vergoossen et al., 2020b). In combination with their increasing popularity, this suggests that gender-neutral pronouns have been or will be widely and fully adopted over time (Gustafsson Sendén et al., 2015). However, since language technology has the potential to alter the course of language evolution, we want to make sure that our NLP models do not become a bottleneck for this positive development.
Contribution. We extract stimuli from a Swedish eye-tracking study that has shown no increase in processing cost in humans for the gender-neutral pronoun hen compared to gendered pronouns. We translate those stimuli into English and Danish and compare model perplexity across gendered and gender-neutral pronouns for all three languages. Furthermore, we systematically investigate performance differences across pronouns in downstream tasks, namely natural language inference (NLI) and coreference resolution. Across the board, we find that NLP models, unlike humans, are challenged by gender-neutral pronouns, incurring significantly higher losses when gendered pronouns are replaced with their gender-neutral alternatives. We argue this is a problem the NLP community must take seriously.

Model perplexity and attention
In this section we introduce a Swedish eye-tracking study and explain how we adapt this study to investigate gender-neutral pronouns in language models. The study (Vergoossen et al., 2020a) tested whether the gender-neutral pronoun hen has a higher processing cost during pronoun resolution than gendered pronouns. Participants read sentence pairs in which the first sentence contained a noun referring to a person and the second sentence referred to that person with either a gendered pronoun or hen.
It has recently been shown that attention flow, in contrast to attention itself, correlates with human fixation patterns in task-specific reading (Anonymous, 2022). We applied a similar analysis pipeline here: we extracted all 384 sentence pairs and fed them into the uncased Swedish BERT model. We calculate perplexity values for each sentence pair over word probabilities as given by BERT with the formula proposed by Wang et al. (2019). Furthermore, we calculate attention flow propagated from layers 1, 6 and 12 (Abnar and Zuidema, 2020) and extract the attention flow values assigned to the pronoun with respect to the entity. Attention flow treats the attention matrices as a graph in which tokens are nodes and attention scores are edges between consecutive layers. The edge values, i.e., attention scores, define the maximal flow possible between a pair of nodes. We consider several parameters of human fixation that we assume might be influenced by a change in pronouns, in particular during pronoun resolution: first and total fixation time on the pronoun, and fixation time after the first fixation on the noun. For both attention flow and perplexity, however, we could not find any meaningful correlation with those parameters. One reason might be that the dataset only contains fixations for the two entities, i.e., pronoun and noun, which makes the data comparatively sparse and makes it impossible to extract complete reading patterns.
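The graph view of attention flow described above can be illustrated with a minimal sketch. It assumes one already-aggregated attention matrix per layer and leaves out any preprocessing of the original method (for instance averaging heads or accounting for residual connections). The function names and the toy matrices are illustrative and not taken from our pipeline.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow; capacity is a dict {u: {v: cap}}."""
    if source == sink:
        return float("inf")
    # Residual graph that also holds zero-capacity reverse edges.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0.0) + cap
            residual[v].setdefault(u, 0.0)
    total = 0.0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Bottleneck capacity along the path found by the search.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

def attention_flow(attentions, top_token, input_token):
    """attentions[l][i][j]: attention of token i in layer l+1 to token j in layer l.

    Nodes are (layer, position) pairs; attention scores are edge capacities.
    """
    capacity = {}
    for layer, matrix in enumerate(attentions):
        for i, row in enumerate(matrix):
            for j, weight in enumerate(row):
                capacity.setdefault((layer + 1, i), {})[(layer, j)] = weight
    return max_flow(capacity, (len(attentions), top_token), (0, input_token))

# Toy example with two layers and two tokens (made-up numbers).
toy_attentions = [
    [[0.5, 0.5], [0.2, 0.8]],  # layer 1 attending to the input tokens
    [[0.9, 0.1], [0.4, 0.6]],  # layer 2 attending to layer 1
]
flow = attention_flow(toy_attentions, top_token=0, input_token=1)
```

In this toy graph the flow from token 0 in the top layer to input token 1 is limited by the weakest edge on each path, which is what distinguishes attention flow from reading off a single attention weight.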

Language models and gender-neutral pronouns
We therefore focus on the model-based data alone in order to understand how well language models can deal with gender-neutral pronouns. For this, we consider perplexity values at the sentence level and calculate the rank-based Spearman correlation between perplexity and attention flow for the aforementioned layers. With this analysis, we can see a) whether gender-neutral pronouns cause a higher sentence perplexity, i.e., a higher surprisal, and b) whether a possibly higher surprisal is connected to higher attention flow values on the pronoun with respect to the entity. We furthermore translate the sentence pairs into English and Danish, where we use two sets of gender-neutral pronouns: 3rd person plural pronouns (henceforth they/de), which are used in both languages as gender-neutral pronouns (Miltersen, 2020), and neopronouns (xe for English (Hekanaho, 2020) and høn for Danish). We apply the same experiments to those translated datasets with uncased Danish BERT and uncased English BERT.
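The two quantities in this analysis can be sketched as follows. The perplexity function assumes the common form, the exponential of the mean negative log-probability of the tokens; the exact formula of Wang et al. (2019) is not reproduced here and may differ in detail. The per-token probabilities and the example values are made up for illustration.

```python
import math

def pseudo_perplexity(token_probs):
    """Exponential of the mean negative log-probability of the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = mean_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Made-up per-token probabilities for four sentence pairs.
sentence_token_probs = [[0.9, 0.8, 0.7], [0.6, 0.5, 0.9], [0.3, 0.4, 0.8], [0.2, 0.1, 0.6]]
perplexities = [pseudo_perplexity(p) for p in sentence_token_probs]
# Made-up attention flow values on the pronoun for the same pairs.
flows = [0.12, 0.20, 0.18, 0.35]
rho = spearman(perplexities, flows)
```

Because the correlation is computed on ranks, it only asks whether sentences with higher perplexity also tend to have higher attention flow on the pronoun, not whether the relation is linear.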

Results
We show results on perplexity and correlations in Table 1 for Danish, English and Swedish. Perplexity values for the datasets with gendered pronouns are set to 1, and we show the relative increase for gender-neutral pronouns within each language, since perplexity values have been shown not to be comparable across languages (Mielke et al., 2019; Roh et al., 2020). Perplexity scores for sentences with gender-neutral pronouns are significantly higher (Wilcoxon signed-rank test, p < .01 for all pair-wise comparisons). For the correlation between perplexity and attention flow on the Swedish sentence pairs, we see a clear development across layers: in the first layer there is no correlation (p > .05) for gender-neutral hen and a very low correlation for gendered pronouns, whereas in the other layers correlations for hen are even higher (ρ = 0.72) than for gendered pronouns (ρ = 0.65). This suggests that there is some development across layers that is stronger for hen than for gendered pronouns. Furthermore, we see a similar development of correlations across layers in English but a much weaker correlation for Danish.
Figure 1: Pair-wise cosine similarity between word representations of all pronouns and the Swedish word bok (book) as a baseline for different layers of BERT. We see that gender-neutral hen grows from being an outsider (similar to bok) in the 1st layer into the cluster of gendered 3rd person pronouns hon/han across layers.
To investigate those effects across layers further, we look at word embeddings for all Swedish pronouns from all 12 layers in BERT and compute pair-wise cosine similarity, including the Swedish word for book (bok) as a baseline for which we expect no specific relation to pronouns. In Figure 1, we see less similarity between hen and the other pronouns in the first layer. This changes for layers 6 and 12, where word representations seem to be more similar and the three 3rd person pronouns hen, han and hon get closer to each other. This is in line with the literature, where it has been found that some attention heads perform better on pronoun resolution than others.
In particular, middle and deeper layers have shown stronger attention weights between coreferential elements (Vaswani et al., 2017; Webster et al., 2018; Clark et al., 2019). Given that we do not consider individual heads or layers but the entire attention graph, it is not surprising that we also see those effects in the top layer, as has also been shown in the original paper (Abnar and Zuidema, 2020).
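The pair-wise comparison of word representations can be sketched as below. The three-dimensional vectors are made up and merely stand in for layer-wise BERT representations; the function names are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarity(vectors):
    """vectors: dict word -> embedding. Returns {(w1, w2): cosine} per unordered pair."""
    words = sorted(vectors)
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for i, a in enumerate(words)
        for b in words[i + 1:]
    }

# Made-up vectors standing in for the representations from one layer.
layer_vectors = {
    "hen": [0.9, 0.2, 0.1],
    "han": [0.8, 0.3, 0.1],
    "hon": [0.8, 0.2, 0.2],
    "bok": [0.1, 0.1, 0.9],
}
similarities = pairwise_similarity(layer_vectors)
```

Running the same computation on the representations from each layer gives one similarity matrix per layer, which is the kind of comparison summarised in Figure 1.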

Downstream Tasks
We also perform downstream task experiments on natural language inference and coreference resolution for both gendered and gender-neutral pronouns to investigate to what extent gender-neutral pronouns influence the performance.
Natural Language Inference
Natural language inference (NLI) is commonly framed as a classification task that tests a model's ability to understand entailment and contradiction (Bowman et al., 2015). Despite the high accuracies achieved by state-of-the-art models, we do not yet know whether they succeed in combating gender bias, especially in cross-lingual settings. We apply two multilingual models, mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), with cross-lingual fine-tuning, i.e., we fine-tune on English and also apply both models to Danish and Swedish. mBERT was fine-tuned on the English MNLI train split and evaluated on XNLI. For XLM-R, we apply a model that has been fine-tuned on both MNLI and ANLI (Nie et al., 2020). For English we test both models on the MNLI test split; for Danish and Swedish we test on the extended XNLI corpus (Singh et al., 2019), a manual translation of the first 15,000 sentences of the MNLI corpus (Williams et al., 2018) from English into 15 languages.

Coreference Resolution
We also run pronoun resolution experiments on the Winogender dataset (Rudinger et al., 2018), in which all 720 English sentences include an occupation, a participant and a pronoun. For each occupation, two similar sentences are composed, one where the pronoun refers to the occupation and one where it refers to the participant. Those sentences are then presented in versions with different pronouns (female, male, singular they). For our experiments, we compare performance for those pronouns and add a version for the gender-neutral pronoun xe. We run experiments with NeuralCoref 4.0 in spaCy. For Danish, we apply the recently published coreference model (Barrett et al., 2021) to both the corresponding test set from the Dacoref dataset and a gender-neutralized version in which we exchange the gendered pronouns hun/han for either høn or singular de.
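The pronoun exchange for the gender-neutralized Danish data can be sketched as a whole-word substitution. This is an illustrative simplification: it only covers the subject forms hun/han named above, ignores object and possessive forms, and does not disambiguate other uses of the same word forms. The function name is hypothetical.

```python
import re

def swap_pronouns(text, replacement, targets=("hun", "han")):
    """Replace whole-word occurrences of the target pronouns.

    The replacement is capitalised when the original word starts with an
    upper-case letter, e.g. at the beginning of a sentence.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in targets) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        word = match.group(0)
        return replacement.capitalize() if word[0].isupper() else replacement

    return pattern.sub(replace, text)

sentence = "Hun sagde, at han kom."
neopronoun_version = swap_pronouns(sentence, "høn")
singular_de_version = swap_pronouns(sentence, "de")
```

The word-boundary anchors keep words that merely contain a pronoun, such as hund, untouched.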

Results
Natural Language Inference
Accuracies for all languages and both models are displayed in Table 2. Overall, we see a very small drop in performance for the datasets with gender-neutral pronouns compared to the original sentences. For mBERT we see differences of 0 [...] is slightly higher with 0.21–4.71%. We see the biggest difference for the Danish pronoun høn in comparison to the original dataset.
Table 2: Accuracy [in %] on NLI for English, Danish and Swedish for both models, mBERT and XLM-R. Accuracies are calculated on the subset of sentences that contain relevant pronouns (924 for en and 2339 for da/sv). The first column for each language shows the accuracy on the original data; the second and third columns show accuracies for the respective gender-neutral pronouns. Note that the total number of label flips in both directions for different pronouns is higher than the performance difference for all pair-wise comparisons. A baseline analysis in which we exchanged punctuation ("." for "!") yields deviations from the original dataset similar to those from changing pronouns.
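The remark that label flips in both directions outnumber the net performance difference can be made concrete with a small sketch. It assumes that a flip is a change in the correctness of a prediction between the original and the pronoun-swapped input; the labels and the function name are made up for illustration.

```python
def compare_predictions(gold, original_preds, swapped_preds):
    """Accuracy on both versions plus correctness flips in each direction."""
    if not (len(gold) == len(original_preds) == len(swapped_preds)):
        raise ValueError("all three label lists must have the same length")
    correct_to_wrong = wrong_to_correct = 0
    correct_original = correct_swapped = 0
    for g, o, s in zip(gold, original_preds, swapped_preds):
        correct_original += o == g
        correct_swapped += s == g
        if o == g and s != g:
            correct_to_wrong += 1
        elif o != g and s == g:
            wrong_to_correct += 1
    return {
        "acc_original": correct_original / len(gold),
        "acc_swapped": correct_swapped / len(gold),
        "correct_to_wrong": correct_to_wrong,
        "wrong_to_correct": wrong_to_correct,
    }

# Made-up labels for four premise/hypothesis pairs.
gold = ["entailment", "neutral", "contradiction", "entailment"]
original = ["entailment", "neutral", "neutral", "contradiction"]
swapped = ["entailment", "contradiction", "contradiction", "contradiction"]
report = compare_predictions(gold, original, swapped)
```

In this toy case one prediction flips in each direction, so the accuracy is unchanged even though two of four predictions are affected by the pronoun exchange.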
Pronoun     she     he      they    xe
Acc. in %   42.92   43.75   27.92   0
Table 3 shows accuracies on the English Winogender corpus for all four pronouns. We see a clear drop in performance from gendered pronouns (she, he) to both gender-neutral pronouns (they, xe). For xe, the model was not able to perform coreference resolution at all. In most cases it was not even recognized as part of a cluster, and in the rare cases where it was, it was clustered with the wrong tokens. Note that, since this dataset is not labelled, we only check whether the pronoun has been clustered with the correct entity.
Results on the Danish Coref corpus, where we are able to perform a more extensive coreference resolution task, are displayed in Table 4. We were able to replicate the results of Barrett et al. (2021) (first column, orig.) and see small drops in performance for singular de and høn.
Table 4: Results for the Danish coreference resolution task. Pronouns in the original dataset (orig.) have been exchanged for singular de and gender-neutral høn.
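The evaluation criterion for the English data, whether the pronoun ends up in a cluster with the correct entity, can be sketched as follows. Clusters are represented here as plain sets of mention strings; this is a hypothetical representation chosen for illustration and not the output format of NeuralCoref. The example mentions are made up.

```python
def pronoun_resolved(clusters, pronoun, entity):
    """True if some predicted cluster contains both the pronoun and the entity."""
    return any(pronoun in cluster and entity in cluster for cluster in clusters)

def clustering_accuracy(examples):
    """examples: list of (clusters, pronoun, correct_entity). Returns accuracy in %."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(pronoun_resolved(c, p, e) for c, p, e in examples)
    return 100.0 * hits / len(examples)

# Made-up predictions for four sentences.
examples = [
    ([{"the technician", "she"}], "she", "the technician"),   # correct cluster
    ([{"the customer", "they"}], "they", "the technician"),   # wrong entity
    ([], "xe", "the technician"),                             # pronoun not clustered
    ([{"the customer", "the bill"}], "xe", "the customer"),   # pronoun not clustered
]
accuracy = clustering_accuracy(examples)
```

A pronoun that is never assigned to any cluster counts as an error under this criterion, which is the situation described for xe above.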

Related Work
Further eye-tracking studies have investigated the processing cost of both gender-neutral pronouns and the generic male pronoun. Irmen (2007) and Redl et al. (2021) find male biases when using generic male pronouns in Dutch and generic role nouns in German. Sanford and Filik (2007) found a clear processing cost when using singular they in English; however, their stimuli did not include any investigation of how (anti-)stereotypes influence this processing cost and are thus only partly comparable to other studies. English datasets have been proposed to investigate gender bias in pronoun resolution but have not reported on performance differences between gendered and gender-neutral pronouns (Rudinger et al., 2018; Zhao et al., 2018; Webster et al., 2018). Sun et al. (2021) propose a rewriting task where data is transferred from gendered to gender-neutral pronouns to train more inclusive language models. The extensive survey on gender bias in NLP recently published by Stanczak and Augenstein (2021) also discusses research beyond binary gender formulation and the lack thereof.

Discussion
With this paper we provide a first study on how well language models can handle gender-neutral pronouns in Danish, English and Swedish for various tasks. We observe an increase in perplexity for gender-neutral pronouns and correlations between sentence-level perplexity and attention flow on the pronoun that get stronger across layers, in particular for English and Swedish. This indicates that language models indeed struggle with the use of gender-neutral pronouns, even with singular they, which has been used as a gender-neutral pronoun for many years (Saguy and Williams, 2022). The reason for this most likely lies in the sparse representation of gender-neutral pronouns in the training data and the fact that language models, once they are trained and published, usually are not updated (Bender et al., 2021). At the same time, we observe that word representations of all Swedish 3rd person pronouns grow closer in middle and top layers (see Figure 1), which suggests that relevant information is also learned for gender-neutral hen.
For NLI, we only see a small drop in performance when exchanging gendered pronouns for gender-neutral pronouns, which is in the same range as a baseline analysis where we exchange punctuation ("!" for "."), except for Danish høn. We argue that classification in NLI probably does not rely heavily on individual pronouns in most cases. This is in stark contrast to pronoun resolution, where we see a very clear drop in performance for English when applying singular they in comparison to both female and male pronouns. Again, this is surprising, since in theory language models should have seen training samples in which singular they is used. The small drop in performance for Danish coreference resolution might be because this dataset does not solely focus on pronoun resolution; further investigation is needed here. We strongly argue that more needs to be done to adapt language models to more gender-inclusive language: initiatives like the rewriting task proposed by Sun et al. (2021) need to be implemented and extended.