MISGENDERED: Limits of Large Language Models in Understanding Pronouns

Content Warning: This paper contains examples of misgendering and erasure that could be offensive and potentially triggering.

Gender bias in language technologies has been widely studied, but research has mostly been restricted to a binary paradigm of gender. It is essential also to consider non-binary gender identities, as excluding them can cause further harm to an already marginalized group. In this paper, we comprehensively evaluate popular language models for their ability to correctly use English gender-neutral pronouns (e.g., singular they, them) and neo-pronouns (e.g., ze, xe, thon) that are used by individuals whose gender identity is not represented by binary pronouns. We introduce MISGENDERED, a framework for evaluating large language models' ability to correctly use preferred pronouns, consisting of (i) instances declaring an individual's pronoun, followed by a sentence with a missing pronoun, and (ii) an experimental setup for evaluating masked and auto-regressive language models using a unified method. When prompted out-of-the-box, language models perform poorly at correctly predicting neo-pronouns (averaging 7.6% accuracy) and gender-neutral pronouns (averaging 31.0% accuracy). This inability to generalize results from a lack of representation of non-binary pronouns in training data and memorized associations. Few-shot adaptation with explicit examples in the prompt improves the performance but plateaus at only 45.4% for neo-pronouns. We release the full dataset, code, and demo at https://tamannahossainkay.github.io/misgendered/.


Introduction
From document retrieval to virtual assistants, large language models (LLMs) (Zhang et al., 2022; Scao et al., 2022; Lewis et al., 2020) have become indispensable for various automated language processing tasks. Given their proliferation, it is vital that these LLMs are safe to use. Any biases in the model may perpetuate and amplify existing real-world harms toward already marginalized people.

Figure 1: Evaluation examples. Each instance begins with a declaration of an individual's preferred pronouns, followed by text where a [PRONOUN] is missing. Language models are evaluated for their ability to predict the pronoun accurately. The correct answer along with predictions from GPT-J are shown.
Efforts to address gender bias in natural language processing primarily focus on binary gender categories, female and male. They are aimed at either upstream bias, e.g., gendered associations in language models (Kirk et al., 2021; Dev et al., 2021a; Bolukbasi et al., 2016), or downstream bias, e.g., gendered information used for decision-making in tasks such as coreference resolution (Zhao et al., 2018), machine translation (Choubey et al., 2021; Stanovsky et al., 2019), etc. However, this is restrictive as it does not account for non-binary gender identities, which are becoming more commonplace to discuss openly. This can perpetuate harm against non-binary individuals through exclusion and marginalization (Dev et al., 2021b). This paper comprehensively evaluates popular language models' ability to use declared third-person personal pronouns using a framework, MISGENDERED. It consists of two parts: (i) instances declaring an individual's pronoun, followed by a sentence with a missing pronoun (§3.1), and (ii) an experimental setup for evaluating masked and auto-regressive language models using a unified method (§3.2). We create a template-based evaluation dataset for gendering individuals correctly given a set of their preferred pronouns. Each evaluation instance begins with an individual's name and an explicit declaration of their pronouns, followed by a sentence in which the model has to predict a missing [PRONOUN]. For instance (Fig. 1), 'Aamari's pronouns are xe/xem/xyr/xyrs/xemself. Aamari is undergoing a surgery. Please pray for [PRONOUN] quick recovery.' We evaluate language models on their ability to fill in [PRONOUN] correctly, here with the possessive-dependent pronoun, xyr.
Sentences in our evaluation cover 5 different pronoun forms: nominative, accusative, possessive-dependent, possessive-independent, and reflexive (e.g., they, them, their, theirs, and themself, respectively) for 11 sets of pronouns from 3 pronoun types: binary (e.g., he, she), gender-neutral (e.g., they, them), and neo-pronouns (e.g., xe, thon). We create 10 variations for each pronoun form and populate them with popular unisex, female, and male names, resulting in a total of 3.8 million instances.
Our evaluation shows that current language models are far from being able to handle gender-neutral and neo-pronouns. For direct prompting, we use models of varying sizes from six families comprising both auto-regressive and masked language models (§4.1). While most models are able to correctly use binary pronouns (average accuracy of 75.3%), all models struggle with neo-pronouns (average accuracy of 7.6%), and most with gender-neutral pronouns as well (average accuracy of 31.0%). This poor zero-shot performance could be due to the scarcity of representation of neo-pronouns and gender-neutral pronouns in pre-training corpora (§4.2). For example, there are 220× more occurrences of masculine pronoun tokens in C4 (Raffel et al., 2020), the pre-training corpus for T5 models, than of the xe neo-pronouns. We also notice some memorized associations between pronouns and the gender of names. Language models identify non-binary pronouns most accurately for unisex names, whereas the bottom-performing names are either masculine or feminine. Similarly, for binary pronouns, language models correctly predict masculine pronouns for masculine names with almost 3× more accuracy than for feminine names.
Although language models do not perform well at predicting neo-pronouns in a zero-shot setting, models with few-shot learning abilities are able to adapt slightly with a few examples (in-context learning achieves an accuracy of up to 45.4% for neo-pronouns). However, performance plateaus with more shots, and it is not clear how this method of prompting with examples can be used to mitigate bias in downstream applications. Future work should focus on further evaluation of language technologies on their understanding of non-binary pronouns and on mitigating biases. While we have made progress towards recognizing pronouns as an open class in NLP rather than a closed one, there is still much work to be done in this regard. Overarching limitations of our work are its adherence to a Western conceptualization of gender, as well as being confined to English. To facilitate further research, we release the full dataset, code base, and demo of our work at https://tamannahossainkay.github.io/misgendered/.

Background
In this section, we present the social context in which our work is situated. The contemporary Western discourse regarding gender differentiates between biological sex and gender identity. An individual's biological sex is assigned at birth and is associated with physical characteristics, such as chromosomes, reproductive organs, etc. (WHO, 2021; Prince, 2005). It can be binary (female or male) or non-binary, e.g., intersex with X or XXY genotypes (NIH, 2021), etc. On the other hand, gender identity is an individual's subjective experience of their own gender, which encompasses a diverse range of experiences and expressions (WHO, 2021; Prince, 2005), e.g., cisgender, transgender, non-binary, etc. Historically, there are several cultures where gender is understood as a spectrum; for example, the Bugis people of Indonesia recognize five genders (Davies, 2007). While there are nations that legally acknowledge gender exclusively as a binary (female or male) (EqualDex, 2022), an increasing number of jurisdictions recognize gender as a broader concept, including the USA (U.S. Dept of State, 2022; EqualDex, 2022).

Figure 2: We create a dataset to evaluate the ability of large language models to correctly 'gender' individuals. We manually write templates, each referring to an individual and containing a blank space for a pronoun to be filled in. We populate the templates with names (unisex, female, and male) and pronouns (binary, gender-neutral, and non-binary), and declare two to five pronoun forms for each individual, either explicitly or parenthetically. We then use masked and auto-regressive LMs to predict missing pronouns in each instance utilizing a unified constrained decoding method.
Exclusively binary female-male third-person personal pronouns are insufficient in such a diverse and dynamic landscape of gender. Rather, expanding pronouns to include neo-pronouns, such as singular they, thon, ze, etc., is essential (Vance Jr et al., 2014; Markman, 2011). Spaces inclusive of LGBTQIA+ persons encourage everyone to declare what pronouns to use to refer to them (NIH, 2020, 2022). Pronoun declarations often include at least two pronoun forms, such as nominative and accusative (e.g., they/them, she/her), but can consist of all five pronoun forms (e.g., they/them/their/theirs/themself). Misgendering, i.e., addressing individuals using gendered terms that are not aligned with their gender identity, is associated with a variety of harms (Dev et al., 2021b).
Note that while an expanding view of gender identity creates a corresponding need for a wider range of pronouns, we cannot infer an individual's gender identity from their preferred pronouns. For instance, the use of binary pronouns, such as she or he, does not necessarily indicate a binary gender identity, and similarly, the use of neo-pronouns, such as xe, does not imply an identity outside of the female-male binary. In this paper, we aim to establish a paradigm for evaluating gender bias in NLP which takes into account the growing use of non-binary pronouns. We evaluate language models for one type of misgendering, which is using incorrect pronouns for individuals.

MISGENDERED Framework
The MISGENDERED framework for evaluating the pronoun usage abilities of language models consists of (i) instances specifying an individual's pronoun, succeeded by a sentence missing a pronoun, and (ii) a unified method for evaluating masked and auto-regressive language models.

Dataset Construction
We evaluate existing language models to assess their ability to understand and correctly use third-person personal pronouns (Figure 2). To do this, we create a dataset designed specifically for evaluating the correct gendering of individuals given a set of their pronouns. To gender a person correctly is to use the pronouns they prefer to refer to them. Each instance in the evaluation dataset consists of a first name and preferred pronouns at the start, followed by a manually crafted template that has a blank space for a missing [PRONOUN]. It is important to note that we only use preferred pronouns from a single pronoun group (e.g., they/them, xe/xem/xyr) and do not consider cases where an individual uses multiple sets of pronouns (e.g., they/she). All templates are shown in Appendix A. Popular US first names and pronouns are used to populate each template. We do not use any private or individually identifiable information.
We use unisex, female, and male names per US Social Security data over the past 100 years. This limits our analysis to English and American names assigned at birth. We take a sample of 300 names from the unisex names compiled by Flowers (2015). These are names that are least statistically associated with being female or male in the USA. For female and male names, on the other hand, we take the top 100 names that are most statistically associated with being female or male, respectively (Social Security, 2022). We manually construct ten templates for each pronoun form with CheckList (Ribeiro et al., 2020) in the loop. Evaluation instances are then completed by using sets of binary (masculine and feminine), gender-neutral (singular they), and neo-pronouns. For neo-pronouns, we use a list compiled by Lauscher et al. (2022). We do not use nounself, emojiself, numberself, or nameself pronouns from their compilation as they are currently rare in usage. If there are variations in forms of the same neo-pronoun group, then we only use one of them (e.g., for ve/vi, ver/vir, vis, vers/virs, verself/virself, we only use vi, vir, vis, virs, and virself). Neither Lauscher et al. (2022) nor our list of non-binary pronouns (shown in Table 1) is exhaustive, as pronouns are continually evolving. Each row of Table 1 constitutes one possible choice of preferred pronouns and will be referred to as a pronoun group from here onwards; each pronoun group will be referred to by its nominative form for short, e.g., the non-binary pronoun group {xe, xem, xyr, xyrs, xemself} will be referred to as xe.

Table 1: Pronouns (Lauscher et al., 2022) we use in this paper for evaluating the ability of language models to correctly gender individuals. Each row of this table consists of a pronoun group, with each column specifying the pronoun for each form of that group.
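The template-filling procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual code: the template string, the name, and the explicit-declaration format are simplified stand-ins, and only three of the eleven pronoun groups are shown.

```python
# Five pronoun forms, in the column order of Table 1.
FORMS = ("nominative", "accusative", "possessive-dependent",
         "possessive-independent", "reflexive")

# A few of the pronoun groups from Table 1 (keyed by nominative form).
PRONOUN_GROUPS = {
    "he":   ("he", "him", "his", "his", "himself"),
    "they": ("they", "them", "their", "theirs", "themself"),
    "xe":   ("xe", "xem", "xyr", "xyrs", "xemself"),
}

def make_instance(name, group, form, template):
    """Build one evaluation instance: an explicit pronoun declaration,
    then a template with a [PRONOUN] slot; the gold answer is the
    group's pronoun of the requested form."""
    pronouns = PRONOUN_GROUPS[group]
    declaration = f"{name}'s pronouns are {'/'.join(pronouns)}."
    text = declaration + " " + template.format(name=name)
    gold = pronouns[FORMS.index(form)]
    return {"text": text, "form": form, "gold": gold}

inst = make_instance(
    "Aamari", "xe", "possessive-dependent",
    "{name} is undergoing a surgery. Please pray for [PRONOUN] quick recovery.",
)
print(inst["gold"])  # xyr
```

Populating every template with every name and pronoun group in this fashion yields the full evaluation set.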

Evaluation Setup
Using the evaluation dataset we created, we test popular language models by direct prompting and in-context learning.

Constrained Decoding
For both masked and auto-regressive language models, we perform constrained decoding to predict the most likely pronoun out of all pronouns of the same form, using a uniform framework for making predictions from both kinds of models. Let F be the set of pronoun forms (|F| = 5; columns in Table 1), and P be the set of pronoun groups (|P| = 11; rows in Table 1). Let x be an evaluation instance with gold pronoun p*_f such that p* ∈ P and f ∈ F. Each instance has |P| inputs, one per candidate pronoun of the required form, e.g., {'Aamari needs your history book. Could you lend it to him?', . . ., 'Aamari needs your history book. Could you lend it to zir?'}. Both inputs and labels are constructed following the pre-training design of each model. The predicted output of each model is then computed using its loss function, L:

y = arg min_{p ∈ P} L(x(p_f), y(p_f))

A detailed example evaluation with model inputs, labels, and output is illustrated in Appendix B.
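The arg-min rule above is model-agnostic, so it can be sketched with the loss function abstracted away. In the sketch below, `loss_fn` is a toy stand-in for a language model's loss on a completed sentence; in practice it would be the masked- or causal-LM loss under each model's own input/label formatting, as described in the text.

```python
def constrained_decode(text, candidates, loss_fn):
    """Fill [PRONOUN] with each candidate pronoun of the required form
    and return the candidate whose completed sentence has the lowest
    loss: y = argmin_{p in P} L(x(p_f), y(p_f))."""
    scored = {p: loss_fn(text.replace("[PRONOUN]", p)) for p in candidates}
    return min(scored, key=scored.get)

# Toy loss: pretend the model assigns lower loss to the sentence ending
# in "him?" — a stand-in for a real LM's scores.
toy_loss = lambda s: 0.0 if s.endswith("him?") else 1.0

pred = constrained_decode(
    "Aamari needs your history book. Could you lend it to [PRONOUN]?",
    ["him", "her", "them", "xem", "zir"],
    toy_loss,
)
print(pred)  # him
```

Because every candidate shares the same surrounding text, differences in loss isolate the model's preference over the pronouns themselves.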

Experiments
Direct Prompting We directly prompt language models out of the box to test their ability to correctly predict declared pronouns. We use instances from the evaluation dataset (§3.1) and use a unified constrained decoding mechanism to get predictions from both masked and auto-regressive language models (§3.2.1). We use models of varying sizes from the BART (Lewis et al., 2020), T5 (Raffel et al., 2020), GPT-2 (Radford et al., 2019), GPT-J (Wang and Komatsuzaki, 2021), OPT (Zhang et al., 2022), and BLOOM (Scao et al., 2022) families, using the implementations from the HuggingFace library. The specific models along with their parameter counts are shown in Table 3. All computations are performed on a standard academic laboratory cluster.

We study the different ways of declaring preferred pronouns. We use two declaration types and seven combinations of declared forms. • Declaration Type: We declare preferred pronouns for individuals using two formats, explicit and parenthetical. In the first case, pronouns are explicitly declared as '[Name]'s pronouns are' followed by their preferred pronouns. In the second case, pronouns are declared in parentheses after the first time a person's name is used in a sentence. An example of each declaration type is shown in Figure 2. • Declaration Number: We vary the number of pronouns declared between two and five. The pronoun forms that are declared for each declaration number are shown in Table 2.

Table 2: Declared pronoun forms for each declaration number.
Dec. # | Pronouns Declared
2 | Nom., Acc.
3 | Nom., Acc., Pos. Ind.
3 | Nom., Acc., Pos. Dep.
4 | Nom., Acc., Pos. Ind., Ref.
4 | Nom., Acc., Pos. Dep., Ref.
5 | Nom., Acc., Pos. Dep., Pos. Ind., Ref.
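The factors above multiply out to the roughly 3.8 million instances quoted in §3.1. A quick sanity check of that count, assuming 10 templates per pronoun form, 500 names, both declaration types, and seven declaration-form combinations:

```python
# Dataset-size arithmetic for the MISGENDERED evaluation set.
templates_per_form = 10    # ten manually written templates per form (§3.1)
forms = 5                  # nominative, accusative, pos.-dep., pos.-ind., reflexive
pronoun_groups = 11        # rows of Table 1
names = 300 + 100 + 100    # unisex + female + male names
declaration_types = 2      # explicit, parenthetical
declaration_numbers = 7    # combinations of declared forms

total = (templates_per_form * forms * pronoun_groups
         * names * declaration_types * declaration_numbers)
print(total)  # 3850000, i.e. the "3.8 million instances" of §3.1
```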
Explaining Zero-Shot Observations To better understand the zero-shot performance results, we check two things. First, we look at the prevalence of pronoun tokens in the pre-training corpora of a few language models. Using the Elastic Search indices of C4 (the pre-training corpus for T5) (Raffel et al., 2020) and the Pile (the pre-training corpus for GPT-J) (Gao et al., 2020), we count the number of documents in each corpus that contain tokens for each pronoun in Table 1. Second, we check, for each pronoun type, whether there is a difference in performance based on the gender association of the name. Differences in performance would indicate memorization of name-pronoun relationships from the pre-training corpora of the language models.
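The document-count statistic can be illustrated with a small sketch. The real analysis queries Elasticsearch indices of C4 and the Pile; here a three-document toy corpus stands in, and tokenization is a simple lowercase word split rather than each model's own tokenizer.

```python
import re

def doc_counts(corpus, pronouns):
    """For each pronoun, count the documents containing it as a
    standalone (lowercased) token."""
    counts = {p: 0 for p in pronouns}
    for doc in corpus:
        tokens = set(re.findall(r"[a-z']+", doc.lower()))
        for p in pronouns:
            if p in tokens:
                counts[p] += 1
    return counts

corpus = [  # toy stand-in for a pre-training corpus
    "He said he would call them later.",
    "They left early; she stayed.",
    "Xe finished xyr thesis yesterday.",
]
print(doc_counts(corpus, ["he", "they", "xe"]))
```

Note that, as the paper observes, a raw token match can overcount: a document containing "they" in its plural sense, or "xe" as a non-pronoun string, still registers a hit.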
In-Context Learning In-context learning involves including training examples in the prompt, which is fed to the model along with the instance to be evaluated. This allows the model to adapt to new tasks without the need for any parameter updates. We experiment with 2, 4, 6, 10, and 20-shot settings using GPT-J-6B and OPT-6.7b models. These experiments are only conducted using explicit declarations of all five pronoun forms, as this was best for neo-pronouns. We select the examples given in the prompt by randomly sampling templates, names, and pronouns that are not included in the specific instance being evaluated.
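The k-shot prompt assembly described above can be sketched as follows: k solved examples, with the correct pronoun filled in, are prepended to the evaluation instance. The example sentences below are hypothetical stand-ins, not the paper's templates.

```python
import random

def build_prompt(examples, query, k, seed=0):
    """Sample k (text, gold_pronoun) demonstrations, fill in their
    [PRONOUN] slots, and prepend them to the query instance."""
    rng = random.Random(seed)
    shots = rng.sample(examples, k)
    demos = "\n\n".join(text.replace("[PRONOUN]", gold)
                        for text, gold in shots)
    return demos + "\n\n" + query

examples = [  # hypothetical solved instances
    ("Kai's pronouns are ze/zir. Kai lost [PRONOUN] keys.", "zir"),
    ("Riley's pronouns are xe/xem. I met [PRONOUN] today.", "xem"),
    ("Sam's pronouns are they/them. Call [PRONOUN] back.", "them"),
]
query = "Aamari's pronouns are xe/xem/xyr. Please pray for [PRONOUN] recovery."
prompt = build_prompt(examples, query, k=2)
```

Constrained decoding is then run on `prompt` exactly as in the zero-shot setting; only the query's [PRONOUN] slot remains to be filled.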

Results
We test popular language models on their ability to correctly use declared pronouns when directly prompted using our evaluation dataset (§3.1). We conduct a thorough analysis of how performance varies based on how pronouns were declared, the size of the models used, the form of the pronouns, and individual pronoun sets. We also illustrate the effect of using in-context learning, i.e., by providing models with examples of correct declared pronoun usage within the input prompts.

Direct Prompting
Average accuracy for correctly gendering instances in our evaluation dataset (§3.1) by pronoun type across all zero-shot experiments is shown in Figure 4. On average, language models perform poorly at predicting gender-neutral pronouns (31% accuracy), and much worse at predicting neo-pronouns correctly (7.6% accuracy).

Effect of declaration
When experiments are aggregated by declaration type (Fig. 5), we see that declaring pronouns explicitly is slightly better for correctly predicting neo-pronouns (from 6% accuracy to 9%). However, the opposite is true for singular they and binary pronouns, which both perform better with parenthetical declarations. Declaring more pronoun forms improves performance for neo-pronouns (Table 6). On the other hand, the number of forms declared does not have much of an effect on predicting binary pronouns, and for singular they, increasing the number of declared forms slightly decreases performance.
Effect of model size Our experiments do not show a consistent association between performance and model size (Fig. 3). However, some model families have consistent scaling patterns for specific pronoun types. OPT's performance for gender-neutral pronouns increases sharply with size: OPT-350m has an accuracy of 21.2%, whereas the model with 6.7b parameters has an accuracy of 94.2%. OPT also shows moderate gains with scale for neo-pronouns. On the other hand, our analysis indicates that the performance of BLOOM for neutral pronouns exhibits a negative correlation with size, whereas it demonstrates a positive correlation for binary pronouns, and remains relatively stable for neo-pronouns.
Effect of pronouns and pronoun forms As displayed in Table 7, the overall accuracies for masculine and feminine binary pronouns are similar, at 74.7% and 75.8% respectively. However, the performance for neutral pronouns is nearly 2.5 times lower, at an accuracy of 31.0%, with an even lower performance for neo-pronouns. Amongst the neo-pronouns, thon exhibits the highest accuracy at 18.5%, followed by ze at 12.9%. As demonstrated in Table 8, there seems to be an inverse correlation between the performance of binary and neo-pronouns with respect to pronoun forms. Specifically, the nominative form exhibits the highest accuracy for binary pronouns (78.5%) but the lowest for neo-pronouns (3.0%). Conversely, the possessive-independent form presents the highest accuracy for non-binary pronouns (12.2%) but the lowest for binary pronouns (60.0%).
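Breakdowns like those in Tables 7 and 8 are straightforward grouped-accuracy computations. A minimal sketch, with a toy list of prediction records standing in for the real evaluation outputs:

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Group prediction records by `key` (e.g. pronoun form or pronoun
    group) and compute the mean accuracy within each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["pred"] == r["gold"])
    return {k: hits[k] / totals[k] for k in totals}

records = [  # toy prediction records
    {"form": "nominative", "gold": "xe", "pred": "he"},
    {"form": "nominative", "gold": "he", "pred": "he"},
    {"form": "possessive-independent", "gold": "xyrs", "pred": "xyrs"},
]
print(accuracy_by(records, "form"))
# {'nominative': 0.5, 'possessive-independent': 1.0}
```

Grouping the same records by pronoun group instead of form yields the per-pronoun numbers of Table 7.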

Explaining Direct Prompting Results
Name association with pronouns We notice an association between the performance of pronouns and names. For neo-pronouns, the names with the highest performance are unisex ones (Table 9). The top 10 names mostly consist of ones that are also names of locations or corporations. The lowest-performing names, on the other hand, are mostly binary-gendered names (Table 9). This indicates some memorization of pronoun and name association from pre-training corpora (with the caveat that these statistics are based on the distribution of name and gender in the USA). We also notice an association between binary pronouns and names. The predictive accuracy for masculine pronouns is much higher when associated with male names, with accuracy 2.8 times greater than when associated with female names (Table 10). Likewise, the performance for feminine pronouns is 2.2 times higher when associated with female names rather than male ones. These findings suggest that the models may have memorized the association of certain names with specific pronouns from their training corpora.

Table 7: Direct prompting performance for each pronoun. Among neo-pronouns, thon is most often predicted correctly by language models, followed by xe. Models are better at correctly using they, but far from as accurately as they are able to utilize binary pronouns.

Table 8: Direct prompting performance by pronoun form. There is some variation in direct prompting performance by pronoun form. Models are best at predicting possessive-independent forms for non-binary pronouns, but it is the worst form for binary pronouns.

Table 9: Top and bottom 10 names for neo-pronouns. The names for which models best predict non-binary pronouns are all unisex, whereas the bottom ones are mostly gendered names, suggesting memorized association between pronouns and names.
Corpus counts of pronouns We compute unigram counts for two pre-training corpora, C4 and the Pile. In both cases, neo-pronouns are substantially rarer than binary pronouns (Table 11). Further, even the documents that contain non-binary pronoun tokens often do not use them semantically as pronouns (see Table 12 for examples). This means that language models pre-trained on these corpora would not have instances in the data to learn the usage of non-binary pronouns. Though the counts for they are high, the top retrieved cases reflect the plural usage of they. These trends are consistent with the text generally available on the web; see OpenWebText (Gokaslan et al., 2019) (Table 11).
Notably, in all three corpora, masculine pronouns are more prevalent than feminine ones.

In-Context Learning
Both GPT-J-6B and OPT-6.7b perform better for non-binary pronouns as more examples are provided (up to 6; Table 13). However, this performance does not keep improving, and we see lower performance at 20 shots. Similar k-shot behavior, where performance decreases with high values of k, has been noted in GPT-3 and OPT on RTE (Brown et al., 2020; Zhang et al., 2022). There can also generally be high variance in few-shot performance even with a fixed number of samples (Lu et al., 2021). For the pronoun they, we see different trends from each model. For GPT-J, similar to non-binary pronouns, performance improves as more examples are provided, up to 6 shots. On the other hand, for OPT-6.7b, there is a large drop in performance from the zero-shot to the few-shot setting.

Table 10: Binary and gender-neutral pronoun performance breakdown by gender association of individual names. Models are able to predict feminine pronouns much more accurately for individuals with feminine names than masculine ones. Similarly, they are better able to predict masculine pronouns for masculine names rather than feminine ones.

Related Work
There has been extensive work to understand and mitigate gender bias in language technologies (Bolukbasi et al., 2016; Zhao et al., 2018; Kurita et al., 2019). However, this has mostly been restricted to a binary view of gender. Recently some work has been done to explore gender bias in a non-binary paradigm. For instance, Dev et al. (2021b) discuss ways in which gender-exclusivity in NLP can harm non-binary individuals. Ovalle et al. (2023) design an Open Language Generation (OLG) evaluation focused on the experiences of transgender and non-binary individuals and the everyday sources of stress and marginalization they face. Brandl et al. (2022) show that gender-neutral pronouns in Danish, English, and Swedish are associated with higher perplexities in language models. Cao and Daumé III (2020) create specialized datasets for coreference resolution with neo-pronouns, while Lauscher et al. (2022) provide desiderata for modelling pronouns in language technologies. However, these studies only focus on a few neo-pronouns (e.g., xe and ze), and only Dev et al. (2021b) and Brandl et al. (2022) evaluate misgendering, but on only a few language models and in zero-shot settings. We are the first to comprehensively evaluate large language models on a wide range of pronouns and pronoun forms.

Conclusion
In this work, we show that current language models heavily misgender individuals who do not use feminine or masculine personal pronouns (e.g., he, she). Despite being provided with explicitly declared pronouns, these models do not use the correct neo-pronouns and struggle even with gender-neutral pronouns like they. Our analysis suggests the poor performance may be due to the scarcity of neo-pronouns in the pre-training corpora and memorized associations between pronouns and names. When prompted with a few explicit examples of pronoun use, the language models do improve, suggesting some ability to adapt to new word use. Nevertheless, it is unclear how few-shot prompting of pronoun use can mitigate bias and exclusion harms in practice in real-world downstream applications of language models. We hope researchers will expand upon our work to evaluate language technologies on their abilities to understand non-binary identities and mitigate their biases. To facilitate further research in this area, we release the full dataset, code, and demo at https://tamannahossainkay.github.io/misgendered/. While evaluation of misgendering is a crucial first step, future work should aim to go beyond evaluation and focus on developing techniques to correct it. Misgendering can be present in both human-written and model-generated content, especially towards non-binary and transgender individuals. Hence, it is crucial to advance efforts toward detecting misgendering and implementing corrective measures. Individuals who often fall victim to misgendering, such as non-binary and transgender people, should be empowered and given central roles in shaping the work on these topics.

Acknowledgments
This work was funded in part by Hasso Plattner Institute (HPI) through the UCI-HPI fellowship, and in part by NSF awards IIS-2046873, IIS-2040989, and CNS-1925741.
Limitations

This paper evaluates language models for their ability to use gender-neutral pronouns and neo-pronouns using a template-based dataset, MISGENDERED. While this approach is helpful in assessing bias, the measurements can be sensitive to the choice of templates (Delobelle et al., 2022; Seshadri et al., 2022; Alnegheimish et al., 2022; Selvam et al., 2022). Consequently, our findings should not be considered the definitive verdict on the phenomenon of misgendering by language models. There are other limitations to our work that should be considered as well. We only conduct an upstream evaluation on language models and do not assess downstream applications. Our evaluation is also limited to a Western conception of gender and restricted to English only. We only consider names and genders assigned at birth in the United States. Subsequent changes in names or genders are not taken into account in our analysis. Furthermore, our work does not take into account individuals who use multiple sets of pronouns, such as she/they combinations (Them, 2021), nor does it consider the full range of non-binary pronouns, as the list continues to expand (Lauscher et al., 2022). However, additional names (rare, self-created, or non-Western) and neo-pronouns can be directly used with our framework to further evaluate LLMs. We release our full code and dataset to make this easier. Lastly, there are larger models that were not evaluated due to limitations in our computational budget. Further research needs to be done to address these limitations for a complete assessment of accurate preferred pronoun usage by language models.

Ethics Statement
Evaluations of gender bias in language technologies need a holistic outlook, such that they evaluate the harms of stereotyping, erasure of identities, misgendering, dead-naming, and more. Our work attempts to address one specific type of misgendering harm and builds a framework that estimates the extent of misgendering propagated by a model under specific settings. We hope our framework enables model evaluations that are not exclusionary of gender identities. However, the absence of measured misgendering by this paradigm is not evidence of no misgendering or other gender harms at all. For responsible model deployment, it is imperative that they be appropriately scrutinized based on the context of usage.

A Templates
Templates used to create the dataset in the MISGENDERED framework are shown in Table 14.

B Constrained Decoding Example
We evaluate models using a constrained decoding setup. Models make predictions by selecting the most probable pronoun from a set of pronouns that share the same form. The inputs and labels are formatted in a way that allows us to determine the pronoun with the highest probability or the lowest loss for each individual instance. An example of constrained decoding is shown in Table 4.

C Data and Code
To facilitate further work on misgendering by language models, we release the full dataset, code base, and demo of our work at https://tamannahossainkay.github.io/misgendered. The entirety of the dataset curation and evaluation was conducted at the University of California, Irvine.

Evaluation Instance
Text, x = 'Aamari needs your history book. Could you lend it to [PRONOUN]?'
Pronoun form, f = accusative
Pronoun group, p* = xe
Answer, p*_f = xem
Model output, y = arg min_{p ∈ P} L(x(p_f), y(p_f)) = xem

Figure 4: Constrained Decoding. We evaluate models in a constrained setting. For each evaluation instance, models predict the most likely pronoun out of all the pronouns of the same form. Model-specific formatting of inputs and labels is used to compute the pronoun with the highest probability, or lowest loss, for each instance.