NeuTral Rewriter: A Rule-Based and Neural Approach to Automatic Rewriting into Gender-Neutral Alternatives

Recent years have seen an increasing need for gender-neutral and inclusive language. Within the field of NLP, there are various mono- and bilingual use cases where gender-inclusive language is appropriate, if not preferred, e.g. due to ambiguity or uncertainty regarding the gender of referents. In this work, we present a rule-based and a neural approach to gender-neutral rewriting for English, along with manually curated benchmarks of synthetic (WinoBias+) and natural (OpenSubtitles and Reddit) data. A detailed manual and automatic evaluation shows that our NeuTral Rewriter, trained on data generated by the rule-based approach, obtains word error rates (WER) below 0.18% on synthetic, in-domain and out-of-domain test sets.


Introduction
Recent years have seen an increasing need for gender-neutral and inclusive language. This need is reflected, among others, by a surge in the use of singular they,¹ currently endorsed as part of APA style as the generic and gender-neutral pronoun.² Within the field of Natural Language Processing (NLP), there are various monolingual and bilingual use cases where gender-neutral and inclusive language is appropriate, if not preferred, e.g. due to ambiguity in terms of the gender of referents. Section 3 provides a short outline of potential NLP use cases.
To support these use cases, we present a rule-based and a neural approach to gender-neutral rewriting along with manually curated benchmarks, both of which we provide open-access/source.³ First, a rule-based rewriter is implemented leveraging hand-written rules and an automatic error-correction tool. Next, a neural rewriter is trained on output generated by the rule-based rewriter to remove the need for extensive pre-processing and the reliance on computationally expensive tools such as dependency parsers. Our manual and automatic evaluation shows that the neural rewriter clearly improves over the rule-based approach, with word error rates (WER) below 0.18% on synthetic, in-domain and out-of-domain test sets.

1. The pronoun 'they' was announced word of the year 2019 by Merriam-Webster: https://www.nytimes.com/2019/12/10/us/merriam-webster-they-word-year.html
2. https://apastyle.apa.org/
3. https://github.com/anonymous-until-publication/NeuTralRewriter
The main contributions of our work can be summarized as follows: (i) WinoBias+, an open-source, manually curated extension of WinoBias (Zhao et al., 2018a) providing neutral alternatives for 3,167 sentences, as well as a manually curated set of 1,000 natural sentences (domains: Reddit, OpenSubtitles); (ii) open-source code for rule-based and neural neutral rewriters which can convert (binary) gendered English sentences into their gender-neutral counterparts; (iii) a detailed manual and automatic evaluation of the errors made by the rule-based and neural rewriters on synthetic and natural data.

Related Work
Recent years have seen an increase in research on gender and gender-bias mitigation in NLP. While a relatively large body of research has focused on debiasing word embeddings (e.g., Bolukbasi et al., 2016; Font and Costa-jussà, 2019; Zhao et al., 2018c), our work relates to the generation of gender variants. We broadly distinguish between: (i) approaches that incorporate additional (meta-)information during training/testing, allowing for controlled generation of gender alternatives, and (ii) approaches that focus on gender rewriting. This synopsis focuses specifically on research related to the gender of human referents.
Within the field of Machine Translation (MT), Vanmassenhove and Hardmeier (2018) (2020) present gender-aware reinflection models for Arabic. Using an Arabic sentence and a target gender, the desired gender alternative is generated by re-inflecting the input.
It is worth noting that all the previously described approaches focus on generating binary (female/male) gendered alternatives or translations, while our work focuses on generating gender-neutral alternatives. As such, the work most closely related to ours is Sun et al. (2021), which is contemporaneous with our submission.⁴ Sun et al. (2021) present a rule-based and a neural rewriter for the generation of gender-neutral singular they sentences, as well as an evaluation benchmark⁵ of 500 parallel sentences (gendered and gender-neutral) from five domains (Twitter, Reddit, movie quotes, jokes). Their rule-based and neural rewriters generate gender-neutral sentences with error rates below 1% (0.63% and 0.99%, respectively). In terms of resources, compared to Sun et al. (2021), we provide larger synthetic and natural benchmarks. In terms of performance, although a direct comparison is complicated by the lack of a publicly available benchmark, our models seemingly perform better, with error rates of 0.52% (rule-based) and 0.02% (neural) on the most comparable benchmark, i.e. the Reddit data.
4. Currently an arXiv pre-print.
5. We contacted the authors to obtain their benchmark for comparison, as it is currently not open-source, but have not been able to obtain it yet. We will nevertheless attempt to compare our results to theirs to the best of our ability.

Use Cases
Generating neutral alternatives for gendered sentences has applications in various monolingual language-generation tasks (e.g. automatic responses), where (i) one does not want to assume the gender of the referents, or (ii) one wants to present the user with various options. Similarly, in a bilingual setting, more specifically MT, a neutral rewriter allows for the generation of gender-neutral alternatives for genderless and gender-neutral source languages (Hungarian, Turkish, Persian, Swahili...) or null-subject source languages (Spanish, Chinese, Arabic, Bulgarian...). For illustration, Examples (1) and (2) demonstrate how gender-neutral alternatives can be useful in bilingual settings. Example (1) features a sentence in Armenian using an epicene (gender-neutral) pronoun which can be translated as 'he', 'she' or singular 'they'.
(1) HY:
    EN: He/She/They opened the door.⁶

Similarly, Example (2) illustrates the possible translations of a null-subject sentence in Spanish, which can be translated as "works in a company".

(2) EN: He/She works in a company.⁷
    EN: They work in a company.
As a pre-processing step, rewriting into neutral alternatives could be useful to debias training data and thereby its embeddings (see a.o., Bolukbasi et al., 2016; Li et al., 2018; Gonen and Goldberg, 2019) and/or to obfuscate sensitive 'gender' features from real user data facing automatic profiling systems (Reddy and Knight, 2016; Shetty et al., 2018; Emmery et al., 2021).

Datasets
All data is preprocessed using the Moses (de)tokenizer (Koehn et al., 2007). Training (Reddit) and test sets (WinoBias+, OpenSubtitles, Reddit) contain a balanced amount of the eight (binary) target pronouns/determiners: he, she, her, hers, his, him, himself and herself.⁸

6. The translation in bold is the only one provided by Bing and Google Translate, consulted on May 4, 2021.
7. The translation in bold is the only one provided by Bing, Google Translate and DeepL, consulted on May 4, 2021.
8. For a set containing X sentences, we extracted at least X/8 sentences containing each form; a completely uniform distribution was not achievable due to the fact that multiple pronouns/determiners can be present in a single sentence.

Reddit A set of 2,259,386 sentences (containing a total of 3M pronouns/determiners) was randomly sampled from Pushshift's Reddit snapshots (Baumgartner et al., 2020, including all subreddits) for the period of July-December 2019. This set was later used to train our neural rewriter. Another set of 1,693 sentences (containing a total of 2K pronouns/determiners), extracted from Reddit in the same way, was later used as a development set. There is no overlap between the two sets.
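The balanced extraction over the eight target forms can be sketched as follows. This is a minimal illustration only: the regex matching, the greedy selection strategy and the function names are our own simplifying assumptions, not the script used to build the actual sets.

```python
import re

# The eight binary target forms the sets are balanced over.
TARGET_FORMS = ["he", "she", "her", "hers", "his", "him", "himself", "herself"]

def contains_form(sentence, form):
    """Check for the form as a whole word, case-insensitive."""
    return re.search(r"\b" + form + r"\b", sentence, flags=re.IGNORECASE) is not None

def balanced_sample(sentences, n):
    """Greedily pick up to n sentences so each form reaches at least n/8 hits.

    A perfectly uniform split is not guaranteed: one sentence may
    contain several target forms at once.
    """
    per_form = n // len(TARGET_FORMS)
    picked, counts = [], {f: 0 for f in TARGET_FORMS}
    for sent in sentences:
        if len(picked) >= n:
            break
        hits = [f for f in TARGET_FORMS if contains_form(sent, f)]
        # Keep the sentence only if it helps an under-represented form.
        if any(counts[f] < per_form for f in hits):
            picked.append(sent)
            for f in hits:
                counts[f] += 1
    return picked
```

As in the paper's footnote, a sentence such as "He did it himself." counts towards both 'he' and 'himself', which is why an exactly uniform distribution cannot be enforced.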
WinoBias+ An extension of the WinoBias benchmark, providing (manual) neutral alternatives for its 3,167 synthetic sentences, as well as corrections (e.g. of ungrammatical sentences⁹) of the original dataset.
OpenSubtitles, Reddit test Additional sets of 1,000 (manually corrected) parallel sentences (500 for each set). The entire cleaned and extended version of the corpus (WinoBias+), the OpenSubtitles (Lison and Tiedemann, 2016) benchmark and the Reddit benchmark are made publicly available under a CC BY-SA 4.0¹⁰ license.¹¹

Rule-Based Rewriter
The rule-based rewriter (RBR) consists of two main components: (i) a rule-based pronoun rewriter, and (ii) an error-correction language model. Table 1 gives an overview of the binary forms and their gender-neutral alternatives. While most mappings are one-to-one, 'her' can be either a pronoun (e.g. 'I gave it to her.' → 'I gave it to them.') or a possessive determiner (e.g. 'It is her book.' → 'It is their book.'), and 'his' can be either a possessive determiner ('It is his book.' → 'It is their book.') or an independent possessive pronoun ('The book is his.' → 'The book is theirs.'). To disambiguate these forms, the POS tagger and dependency parser from Stanza (Qi et al., 2020) were used.¹²
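The disambiguation logic behind Table 1 can be illustrated with a toy lookup over POS-tagged tokens. This is a sketch under simplified assumptions: the coarse tag labels (PRON/DET/POSS) and the `neutralize` helper are ours, whereas the actual RBR relies on Stanza's tagger and dependency parser.

```python
# Mapping from (binary form, coarse POS tag) to its neutral alternative.
# PRON = pronoun, DET = possessive determiner, POSS = independent possessive.
NEUTRAL = {
    ("he", "PRON"): "they",
    ("she", "PRON"): "they",
    ("him", "PRON"): "them",
    ("her", "PRON"): "them",       # 'I gave it to her.'  -> 'them'
    ("her", "DET"): "their",       # 'It is her book.'    -> 'their'
    ("hers", "POSS"): "theirs",
    ("his", "DET"): "their",       # 'It is his book.'    -> 'their'
    ("his", "POSS"): "theirs",     # 'The book is his.'   -> 'theirs'
    ("himself", "PRON"): "themselves",
    ("herself", "PRON"): "themselves",
}

def neutralize(tagged_tokens):
    """Rewrite a list of (token, tag) pairs into gender-neutral tokens."""
    out = []
    for tok, tag in tagged_tokens:
        repl = NEUTRAL.get((tok.lower(), tag))
        if repl is not None:
            # Preserve sentence-initial capitalization.
            repl = repl.capitalize() if tok[0].isupper() else repl
            out.append(repl)
        else:
            out.append(tok)
    return out
```

For instance, `neutralize([("It", "PRON"), ("is", "VERB"), ("her", "DET"), ("book", "NOUN")])` yields `["It", "is", "their", "book"]`. Note that replacing 'he'/'she' with 'they' may break subject-verb agreement; the RBR handles that in a separate correction step.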

Subject-Verb Agreement Correction
The nominative pronouns (he and she) can be replaced by they. However, if they are in agreement with a simple present tense verb (or the verb 'to be'), the 3rd person form/ending should be replaced by a plural one (see Table 2). To address this, we used a Python wrapper for LanguageTool, an open-source grammar, style and spell corrector.¹⁶ We limited the correction to grammar mistakes to avoid additional changes (e.g. insertion of commas, different word choices, removal of whitespace...).
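A minimal sketch of the agreement fix described above. The actual system delegates this step to LanguageTool; the tiny irregular-verb map and the naive 'strip the final -s' heuristic below are illustrative assumptions that would misfire on many real sentences, which is precisely why a full grammar corrector is used instead.

```python
# Hand-written map for a few irregular 3rd-person forms (cf. Table 2).
THIRD_TO_PLURAL = {"is": "are", "was": "were", "has": "have", "does": "do", "goes": "go"}

def fix_agreement(tokens):
    """After rewriting 'he'/'she' to 'they', pluralize the verb that follows."""
    out = list(tokens)
    for i, tok in enumerate(out):
        if tok.lower() in ("he", "she"):
            out[i] = "They" if tok[0].isupper() else "they"
            if i + 1 < len(out):
                nxt = out[i + 1].lower()
                if nxt in THIRD_TO_PLURAL:
                    out[i + 1] = THIRD_TO_PLURAL[nxt]
                elif nxt.endswith("s") and nxt.isalpha():
                    # Regular simple-present verb: 'walks' -> 'walk'.
                    out[i + 1] = out[i + 1][:-1]
    return out
```

For example, `fix_agreement(["He", "walks", "home"])` yields `["They", "walk", "home"]`. Long-distance agreement (subject and verb separated by intervening material) is exactly the case this adjacency heuristic misses, and the error category in which the RBR struggles most.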

Neural Rewriter
We trained a Transformer model (Vaswani et al., 2017) on the parallel gendered-to-neutral corpus generated by the rule-based rewriter; the full architecture and training settings are listed in Appendix A.3.

Results & Discussion
Both rewriters were (manually) evaluated on synthetic (WinoBias+) and natural (Reddit, OpenSubs) evaluation benchmarks. Table 3 presents a detailed overview of the errors per test set for the Rule-Based and Neural approach. An overview and explanation of all error labels can be found in the Appendix.

Manual Evaluation
Rule-Based Approach The errors can be broadly divided into "language model" (LM), "postag" (POS) and "other" errors. WinoBias+ consists of 3,167 sentences; only 21 of these synthetic sentences were rewritten incorrectly. Issues arose either due to incorrect disambiguation ('her' → 'them' (pronoun) instead of 'their' (determiner)) or due to incorrect subject-verb agreement (SVA).
The RBR struggled more with the noisy, often ungrammatical, natural data from OpenSubtitles and Reddit. The main issues observed are incorrect SVA, additional corrections by LanguageTool unrelated to gender neutrality (e.g. cause → because), and incorrect disambiguation of "'s".¹⁷

17. e.g. He's worked. → They are worked. instead of They have worked.

Neural Approach Interestingly, and in contrast with the findings described in Sun et al. (2021), our neural model, trained on the rule-based generated training data, outperforms the rule-based approach.
The error analysis reveals that the neural model resolves many of the long-distance SVA issues, the disambiguation of "'s", and errors that occurred due to incorrect postags.
No errors were made on the synthetic WinoBias+ data. Errors on the in-domain Reddit data were due to the removal of additional spaces (4 errors) or an unknown character/emoji (2 errors). On the out-of-domain OpenSubtitles set, we noted 8 errors, the majority of which were due to incorrect SVA (5 errors).

Automatic Evaluation
For comparison, we employed the same metric as Sun et al. (2021): WER. A combination of the baseline WER (indicating the number of changes needed to turn the original sentences into their gender-neutral alternatives) and the WER computed between the correct neutral forms and the automatically generated forms provides insight into the performance of both approaches.
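WER here is the standard word-level Levenshtein distance normalized by reference length. A minimal implementation (ours, for intuition only, not Sun et al.'s exact scoring script):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```

The baseline WER compares the gendered input against the neutral reference (e.g. `wer("they are here", "he is here")` is 2/3, two substitutions over three reference words), while the system WER compares the rewriter's output against that same reference and is ideally 0.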
Given that Sun et al. (2021) use an evaluation benchmark of 500 sentences consisting of Twitter, Reddit, jokes and movie-quote data, their performance is probably most comparable to the scores we obtained on the Reddit set. Like the manual evaluation, and in contrast with Sun et al. (2021), the automatic evaluation (Table 4) confirms that our neural approach is able to generalize over the rule-based generated data, outperforming it with error rates below 0.18% (0.00% (WB+), 0.18% (OpenSubtitles) and 0.02% (Reddit)). Furthermore, these error rates are all substantially lower than those reported by Sun et al. (2021). We hypothesize this is due to the better performance of the RBR (confirmed as well by the automatic/manual evaluation), leading to better source (gendered) to target (neutral) training data for the NMT model.
We note that WER does not take into account the removal of superfluous spaces (e.g. before the first character of a sentence, or double spaces instead of a single one). We only observed the removal of such spaces by the neural rewriter on the Reddit data (see the detailed manual analysis presented in Table 3).

Conclusion
This paper presents a rule-based and a neural gender-neutral rewriter for English. First, the rule-based approach was implemented, leveraging hand-written rules and an automatic error-correction tool. Using the RBR, we generated a parallel gendered-to-neutral corpus on which an NMT system was trained. The NMT model removes the need for computationally expensive pre-processing steps and, according to the manual and automatic evaluation, outperforms the RBR on synthetic, in-domain and out-of-domain benchmarks. Along with our open-access/source code, we also provide three manually curated benchmarks for neutral rewriting.
For now, the neutral rewriter is limited to English, using 'singular they' and recommendations for gender-neutral writing specific to the English language. It is, in theory, possible to extend this approach (or a similar one) to other languages. However, so far, few languages have a crystallized approach when it comes to gender-neutral pronouns and gender-neutral word endings.
In future work, we intend to explore potential applications of the neutral rewriters (e.g. gender debiasing of corpora). We furthermore plan to extend our work to gender-neutral rewriting targeting specific referents within a sentence to accommodate the gender preferences of individual referents.

Ethics statement
Neutral Rewriter Application The Neutral Rewriter is intended to provide gender-neutral alternatives and increase the inclusiveness of NLP/MT applications. The rewriter can furthermore be used as a preprocessing step to obfuscate a potentially sensitive gender attribute from training data.
At this stage, the rewriter works at the sentence level and does not allow for rewriting pronouns or determiners of specific referents. We followed the guidelines of the European Parliament for gender-neutral language and provide an option to change gendered animate nouns, unnecessary feminine forms of animate nouns and generic uses of the word 'man', based on non-exhaustive word lists.
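The word-list option can be sketched as a simple dictionary replacement. The list below is a tiny illustrative sample of our own, not the (non-exhaustive) lists shipped with the rewriter:

```python
# Illustrative sample only; the rewriter's actual word lists differ and
# follow the European Parliament guidelines on gender-neutral language.
NEUTRAL_NOUNS = {
    "chairman": "chairperson",
    "anchorman": "anchor",
    "actress": "actor",
    "waitress": "waiter",
    "mankind": "humankind",
}

def neutralize_nouns(sentence):
    """Replace gendered animate nouns with neutral alternatives."""
    out = []
    for tok in sentence.split():
        # Split off trailing punctuation so 'chairman,' still matches.
        word = tok.rstrip(".,!?")
        tail = tok[len(word):]
        repl = NEUTRAL_NOUNS.get(word.lower())
        if repl is not None:
            repl = repl.capitalize() if word[0].isupper() else repl
            out.append(repl + tail)
        else:
            out.append(tok)
    return " ".join(out)
```

For example, `neutralize_nouns("The chairman spoke.")` yields `"The chairperson spoke."`.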
Datasets We present three openly available English benchmarks: (i) WinoBias+, (ii) OpenSubtitles and (iii) Reddit. (i) WinoBias+ consists of a curated and extended version of the synthetic WinoBias (Zhao et al., 2018b) dataset, distributed under the MIT License.¹⁸ (ii) The open-source OpenSubtitles (Lison and Tiedemann, 2016)¹⁹ data was used to randomly sample a subset for the OpenSubtitles benchmark. OpenSubtitles is distributed under a Creative Commons license.²⁰ (iii) The Reddit dataset was collected through third-party snapshots of Reddit's publicly available API at https://pushshift.io. It is subject to Reddit's own User Agreement and Privacy Policy, which covers the free and public sharing of user data.²¹ The neutral alternatives for the three benchmarks were manually created by a linguist. The curation rationale behind the selected datasets is summarized as follows. WinoBias was selected as it is one of the few benchmarks for gender bias in NLP; we extended it with gender-neutral alternatives. The natural Reddit and OpenSubtitles datasets allowed us to verify the robustness of the rewriters on noisier and more diverse data. The OpenSubtitles and Reddit datasets contain variety in terms of language and English social dialects. Training and test sets contain a balanced amount of the eight (binary) target pronouns/determiners: for a set containing X sentences, we extracted at least X/8 sentences containing each form; a completely uniform distribution was not achievable because multiple pronouns/determiners can be present in a single sentence.
Carbon statement The neural model presented in this work has an ecological footprint equivalent to 1.68kg of CO2 emissions. 22 The training time, consumption and carbon emission can be found in

A.1 Advanced Rewriter
The advanced rewriter includes rewriting of gender-marked job titles (chairman, anchorman...), rewriting of unnecessary feminine forms (actress, comedienne, waitress...), avoidance of constructions using a generic form of 'man' ('average man', 'man and wife'...), and rewriting of titles ('Mrs' and 'Miss').

A.2 Error Analysis

As explained in the paper, errors are divided into language model (LM) errors, postag errors (POS) and other errors (OTHER). Within these three error classes, we identified multiple subclasses. An explanation of the labels used in our error analysis and paper can be found in Table 9.

SVA: Failure to make correct subject-verb agreement, usually due to long-distance dependencies.
POS: Wrong form of 'they' produced by the rewriter due to an incorrect postag.
POS (source): Wrong form of 'they' produced by the rewriter due to an incorrect postag related to an ungrammatical/incorrect source sentence.
OTHER rule: Some forms, such as 'hisn's', are not standard language and are not covered by our rules; similarly, written forms such as 'hes' for 'he's' are not corrected by the rewriter.
OTHER ungram.: Ungrammatical input sentence leading to an ungrammatical output.
OTHER UNK: The neural rewriter outputs <UNK> for unknown characters (in our case "?", "!", "...", and emojis/special characters that did not appear in our Reddit training data).

A.3 Neural Rewriter
Our neural model is trained with the following options: transformer-iwslt-en-de architecture with 4 attention heads and encoder and decoder embedding dimensions equal to 512, encoder and decoder FFN embedding dimensions equal to 1024, the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.005 and an inverse square-root schedule with 4,000 warm-up steps, early stopping based on the improvement on the validation set with patience 5, dropout of 0.3, joint byte-pair encoding (Sennrich et al., 2016) with 32,000 merge operations, and token-based batches with a maximum size of 4,096 tokens. For ease of replicability, we provide our complete preprocessing and training scripts in the Appendix.