How to Split: the Effect of Word Segmentation on Gender Bias in Speech Translation

Having recognized gender bias as a major issue affecting current translation technologies, researchers have primarily attempted to mitigate it by working on the data front. However, whether algorithmic aspects also contribute to exacerbating unwanted outputs remains so far under-investigated. In this work, we bring the analysis of gender bias in automatic translation to a seemingly neutral yet critical component: word segmentation. Can segmenting methods influence the ability to translate gender? Do certain segmentation approaches penalize the representation of feminine linguistic markings? We address these questions by comparing 5 existing segmentation strategies on the target side of speech translation systems. Our results on two language pairs (English-Italian/French) show that state-of-the-art sub-word splitting (BPE) comes at the cost of higher gender bias. In light of this finding, we propose a combined approach that preserves the overall translation quality of BPE, while leveraging the higher ability of character-based segmentation to properly translate gender.

Bias Statement. We study the effect of segmentation methods on the ability of speech translation (ST) systems to translate masculine and feminine forms referring to human entities. In this area, structural linguistic properties interact with the perception and representation of individuals (Gygax et al., 2019; Corbett, 2013; Stahlberg et al., 2007). Thus, we believe that gendered linguistic expressions are relevant: they are used to communicate about the self and others, and through them the sociocultural and political reality of gender is negotiated (Hellinger and Motschenbacher, 2015).
Accordingly, we consider a model that systematically and disproportionately favours masculine over feminine forms to be biased, as it fails to properly recognize women. From a technical perspective, such behaviour deteriorates models' performance. Most importantly, however, from a human-centred view, real-world harms are at stake (Crawford, 2017), as translation technologies are unequally beneficial across gender groups and reduce feminine visibility, thus contributing to the misrepresentation of an already socially disadvantaged group. This work is motivated by the intent to shed light on whether issues in the generation of feminine forms are also a by-product of current algorithms and techniques. In our view, architectural improvements of ST systems should also account for the trade-offs between overall translation quality and gender representation: our proposal of a model that combines two segmentation techniques is a step towards this goal.
Note that technical mitigation approaches should be integrated with the long-term multidisciplinary commitment (Criado-Perez, 2019; Benjamin, 2019; D'Ignazio and Klein, 2020) necessary to radically address bias in our community. Also, we recognize the limits of working on binary gender, as we further discuss in the ethics section (§8).

Introduction
The widespread use of language technologies has motivated growing interest in their social impact (Hovy and Spruit, 2016; Blodgett et al., 2020), with gender bias representing a major cause of concern (Costa-jussà, 2019; Sun et al., 2019). As regards translation tools, focused evaluations have exposed that speech translation (ST) and machine translation (MT) models do in fact overproduce masculine references in their outputs (Cho et al., 2019), except for feminine associations perpetuating traditional gender roles and stereotypes (Prates et al., 2020; Stanovsky et al., 2019). In this context, most works identified data as the primary source of gender asymmetries. Accordingly, many pointed out the misrepresentation of gender groups in datasets (Garnerin et al., 2019; Vanmassenhove et al., 2018), focusing on the development of data-centred mitigating techniques (Zmigrod et al., 2019). Although data are not the only factor contributing to the generation of bias (Shah et al., 2020; Savoldi et al., 2021), only a few inquiries have devoted attention to other technical components that exacerbate the problem (Vanmassenhove et al., 2019) or to architectural changes that can contribute to its mitigation (Costa-jussà et al., 2020b). From an algorithmic perspective, Roberts et al. (2020) additionally expose how "taken-for-granted" approaches may yield high overall translation quality in terms of BLEU scores, but are actually detrimental when it comes to gender bias.
Along this line, we focus on ST systems and inspect a core aspect of neural models: word segmentation. Byte-Pair Encoding (BPE) (Sennrich et al., 2016) represents the de-facto standard and has recently been shown to yield better results than character-based segmentation in ST (Di Gangi et al., 2020). But does this hold true for gender translation as well? If not, why?
Languages like French and Italian often exhibit comparatively complex feminine forms, derived from the masculine ones by means of an additional suffix (e.g. en: professor, fr: professeur M vs. professeure F). Additionally, women and their referential linguistic expressions of gender are typically under-represented in existing corpora. In light of the above, purely statistical segmentation methods could be unfavourable for gender translation, as they can break the morphological structure of words and thus lose relevant linguistic information (Ataman et al., 2017). Indeed, as BPE merges the character sequences that co-occur most frequently, rarer or more complex feminine-marked words may result in less compact sequences of tokens (e.g. en: described, it: des@@critto M vs. des@@crit@@ta F). Due to such typological and distributional conditions, may certain splitting methods render feminine gender less probable and hinder its prediction?
We address such questions by implementing different families of segmentation approaches employed on the decoder side of ST models built on the same training data. By comparing the resulting models both in terms of overall translation quality and gender accuracy, we explore whether an aspect so far considered irrelevant, word segmentation, can actually affect gender translation. As such, (1) we perform the first comprehensive analysis of the results obtained by 5 popular segmentation techniques for two language directions (en-fr and en-it) in ST.
(2) We find that the target segmentation method is indeed an important factor in models' gender bias. Our experiments consistently show that BPE leads to the highest BLEU scores, while character-based models are the best at translating gender. Preliminary analyses suggest that the isolation of the morphemes encoding gender can be a key factor for gender translation.
(3) Finally, we propose a multi-decoder architecture that combines the overall translation quality of BPE with the higher gender translation ability of character-based segmentation.

Background
Gender bias. Recent years have seen a surge of studies dedicated to gender bias in MT (Gonen and Webster, 2020; Rescigno et al., 2020) and ST (Costa-jussà et al., 2020a). The primary source of such gender imbalance and adverse outputs has been identified in the training data, which reflect the under-participation of women, e.g. in the media (Madaan et al., 2018), as well as sexist language and the overgeneralization of gender categories (Devinney et al., 2020). Hence, preventive initiatives concerning data documentation have emerged (Bender and Friedman, 2018), and several mitigating strategies have been proposed, either by training models on ad-hoc gender-balanced datasets (Costa-jussà and de Jorge, 2020), or by enriching data with additional gender information (Moryossef et al., 2019; Vanmassenhove et al., 2018; Elaraby and Zahran, 2019; Stafanovičs et al., 2020).
Comparatively, very little work has tried to identify factors contributing to gender bias beyond data. Among those, Vanmassenhove et al. (2019) ascribe the loss of less frequent feminine forms in both phrase-based and neural MT to an algorithmic bias. Closer to our intent, two recent works pinpoint the impact of models' components and inner mechanisms. Costa-jussà et al. (2020b) investigate the role of different architectural designs in multilingual MT, showing that language-specific encoder-decoders (Escolano et al., 2019) translate gender better than shared models (Johnson et al., 2017), as the former retain more gender information in the source embeddings and more diversity in the attention. Roberts et al. (2020), on the other hand, show that the adoption of beam search instead of sampling, although beneficial in terms of BLEU scores, has an impact on gender bias. Indeed, it leads models to an extreme operating point that exhibits zero variability, in which they tend to generate the more frequent (masculine) pronouns. Such studies therefore expose largely unconsidered aspects as factors contributing to gender bias in automatic translation, identifying future research directions for the needed countermeasures.
To the best of our knowledge, no prior work has investigated whether the same may hold for segmentation methods. Rather, prior work in ST compared the gender translation performance of cascade and direct systems using different segmentation algorithms, disregarding their possible impact on the final results.
Segmentation. Although early attempts in neural MT employed word-level sequences (Sutskever et al., 2014; Bahdanau et al., 2015), the need for open-vocabulary systems able to translate rare/unseen words led to the definition of several word segmentation techniques. Currently, the statistically motivated approach based on byte-pair encoding (BPE) by Sennrich et al. (2016) represents the de facto standard in MT. Recently, its superiority to character-level segmentation (Costa-jussà and Fonollosa, 2016; Chung et al., 2016) has also been shown in the context of ST (Di Gangi et al., 2020). However, depending on the languages involved in the translation task, the data conditions, and the linguistic properties taken into account, the greedy BPE procedure can be suboptimal. By breaking the surface of words into plausible semantic units, linguistically motivated segmentations (Smit et al., 2014; Ataman et al., 2017) were proven more effective for low-resource and morphologically rich languages (e.g. agglutinative languages like Turkish), which often have a high level of sparsity in the lexical distribution due to their numerous derivational and inflectional variants. Moreover, fine-grained analyses comparing the grammaticality of character-, morpheme- and BPE-based models exhibited different capabilities. Sennrich (2017) and Ataman et al. (2019) show the syntactic advantage of BPE in managing several agreement phenomena in German, a language that requires resolving long-range dependencies. In contrast, Belinkov et al. (2020) demonstrate that while subword units better capture semantic information, character-level representations perform best at generalizing morphology, thus being more robust in handling unknown and low-frequency words. Indeed, using different atomic units does affect models' ability to handle specific linguistic phenomena. However, whether low gender translation accuracy can, to a certain extent, be considered a by-product of certain compression algorithms is still unknown.

Language Data
As just discussed, the effect of segmentation strategies can vary depending on language typology (Ponti et al., 2019) and data conditions. To inspect the interaction between word segmentation and gender expressions, we thus first clarify the properties of grammatical gender in the two languages of our interest: French and Italian. Then, we verify their representation in the datasets used for our experiments.

Languages and Gender
The extent to which information about the gender of referents is grammatically encoded varies across languages (Hellinger and Motschenbacher, 2015; Gygax et al., 2019). Unlike English, whose gender distinction is chiefly displayed via pronouns (e.g. he/she), fully grammatical gendered languages like French and Italian systematically articulate such semantic distinction on several parts of speech (gender agreement) (Hockett, 1958; Corbett, 1991). Accordingly, many lexical items exist in both feminine and masculine variants, overtly marked through morphology (e.g. en: the tired kid sat down; it: il bimbo stanco si è seduto M vs. la bimba stanca si è seduta F). As the example shows, the word forms are distinguished by two morphemes (-o, -a), which respectively represent the most common inflections for Italian masculine and feminine markings. In French, the morphological mechanism is slightly different (Schafroth, 2003), as it relies on an additive suffixation on top of masculine words to express feminine gender (e.g. en: an expert is gone, fr: un expert est allé M vs. une experte est allée F). Hence, feminine French forms require an additional morpheme. Similarly, another productive strategy, typical for a set of personal nouns, is the derivation of feminine words via specific affixes for both French (e.g. -eure, -euse) and Italian (-essa, -ina, -trice) (Schafroth, 2003; Chini, 1995), whose residual evidence is still found in some English forms (e.g. heroine, actress) (Umera-Okeke, 2012).
In light of the above, translating gender from English into French and Italian poses several challenges for automatic models. First, gender translation does not allow for a one-to-one mapping between source and target words. Second, the richer morphology of the target languages increases the number of variants and thus data sparsity. The question, then, is whether, and to what extent, statistical word segmentation treats the less frequent variants differently. Also, considering the morphological complexity of some feminine forms, we speculate that linguistically unaware splitting may disadvantage their translation. To test these hypotheses, below we explore whether such conditions are represented in the ST datasets used in our study.

Gender in Used Datasets
MuST-SHE is a gender-sensitive benchmark available for both en-fr and en-it (1,113 and 1,096 sentences, respectively). Built on naturally occurring instances of gender phenomena retrieved from the TED-based MuST-C corpus, it allows evaluating gender translation on qualitatively differentiated and balanced masculine/feminine forms. An important feature of MuST-SHE is that, for each reference translation, an almost identical "wrong" reference is created by swapping each annotated gender-marked word into its opposite gender. By means of these wrong references, for each target language we can identify ∼2,000 pairs of gender forms (e.g. en: tired, fr: fatiguée vs. fatigué), which we compare in terms of i) length, and ii) frequency in the MuST-C training set.
As regards frequency, we find that, for both language pairs, the feminine variants are less frequent than their masculine counterparts in over 86% of the cases. Among the exceptions, we find words that are almost gender-exclusive (e.g. pregnant) and some problematic or socially connoted activities (e.g. raped, nurses). Looking at word length, 15% of Italian feminine forms turn out to be longer than the masculine ones, whereas in French this percentage amounts to almost 95%. These scores confirm that MuST-SHE reflects the typological features described in §3.1.
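The pair-level comparison described above can be sketched with a few lines of code. The word pairs and counts below are toy values for illustration only, not the actual MuST-C statistics:

```python
# Sketch of the pair-level analysis: given (feminine, masculine) word
# pairs and training-set counts, measure how often the feminine form
# is rarer and how often it is longer. Toy data, not real statistics.
pairs = [("fatiguée", "fatigué"), ("allée", "allé"), ("experte", "expert")]
freq = {"fatigué": 120, "fatiguée": 35, "allé": 300, "allée": 80,
        "expert": 90, "experte": 12}

# Fraction of pairs where the feminine form is less frequent / longer.
rarer = sum(freq[f] < freq[m] for f, m in pairs) / len(pairs)
longer = sum(len(f) > len(m) for f, m in pairs) / len(pairs)
print(f"feminine rarer: {rarer:.0%}, feminine longer: {longer:.0%}")
```

With French pairs built by additive suffixation, as here, the feminine form is longer in every pair; on Italian inflectional pairs (-o vs. -a) the length ratio would be far lower.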

Experiments
All the direct ST systems used in our experiments are built in the same fashion within a controlled environment, so as to keep the effect of different word segmentations as the only variable. Accordingly, we train them on the MuST-C corpus, which contains 492 hours of speech for en-fr and 465 for en-it. Concerning the architecture, our models are based on Transformer (Vaswani et al., 2017). For the sake of reproducibility, we provide extensive details about the ST models and hyper-parameter choices in the Appendix (§A).

Segmentation Techniques
To allow for a comprehensive comparison of word segmentation's impact on gender bias in ST, we identified three substantially different categories of splitting techniques. For each of them, we hereby present the candidates selected for our experiments.
Character Segmentation. Dissecting words at their maximal level of granularity, character-based solutions were first proposed by Ling et al. (2015) and Costa-jussà and Fonollosa (2016). This technique is simple and particularly effective at generalizing over unseen words. On the other hand, the length of the resulting sequences increases the memory footprint, and slows down both training and inference. We perform our segmentation by appending "@@" to all characters but the last of each word.
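As a minimal sketch of this scheme (the function names are ours; the "@@" continuation marker follows the convention used in the examples above):

```python
def char_segment(word):
    """Split a word into characters, appending the '@@' continuation
    marker to every character except the last one."""
    if len(word) <= 1:
        return [word]
    return [c + "@@" for c in word[:-1]] + [word[-1]]

def segment_sentence(sentence):
    # Segment each whitespace-separated word independently.
    return [tok for word in sentence.split() for tok in char_segment(word)]

print(segment_sentence("descritta"))
# ['d@@', 'e@@', 's@@', 'c@@', 'r@@', 'i@@', 't@@', 't@@', 'a']
```

Detokenization simply strips the markers and re-joins characters, so the mapping is lossless; the cost is sequences roughly 5-6x longer than word-level ones.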
Statistical Segmentation. This family comprises data-driven algorithms that generate statistically significant subword units. The most popular one is BPE (Sennrich et al., 2016), which proceeds by merging the most frequently co-occurring characters or character sequences. Recently, He et al. (2020) introduced the Dynamic Programming Encoding (DPE) algorithm, which performs competitively and was claimed to incidentally produce more linguistically plausible subwords than BPE. DPE is obtained by training a mixed character-subword model. As such, the computational cost of a DPE-based ST model is around twice that of a BPE-based one. We trained the DPE segmentation on the transcripts and the target translations of the MuST-C training set, using the same settings as the original paper.

Morphological Segmentation. A third possibility is a linguistically guided tokenization that follows morpheme boundaries. Among the unsupervised approaches, one of the most widespread tools is Morfessor (Creutz and Lagus, 2005), which was extended by Ataman et al. (2017) to control the size of the output vocabulary, giving birth to the LMVR segmentation method. These techniques have outperformed other approaches when dealing with low-resource and/or morphologically rich languages (Ataman and Federico, 2018), but they have not proved equally effective, and are less widely adopted, for other language types.

For a fair comparison, we chose the optimal vocabulary size for each method (when applicable). Following previous work, we employed 8k merge rules for BPE and DPE, since the latter requires an initial BPE segmentation. In LMVR, instead, the desired target dimension is only an upper bound on the vocabulary size. We tested 32k and 16k, but we only report the results with 32k, as it proved to be the best configuration both in terms of translation quality and gender accuracy. Finally, character-level segmentation and Morfessor do not allow the vocabulary size to be set in advance.
Table 1 shows the size of the resulting dictionaries (here "tokens" refers to the number of words in the corpus, not to the units resulting from subword tokenization). For DPE, see https://github.com/xlhex/dpe; for LMVR, we used the parameters and commands suggested in https://github.com/d-ataman/lmvr/blob/master/examples/example-train-segment.sh.
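To make the interaction between frequency and splitting concrete, the following toy sketch implements the core BPE merge loop. It is a simplification of Sennrich et al. (2016), with no end-of-word marker and frequency ties broken arbitrarily. On a skewed toy corpus, the frequent masculine form ends up as a single token while the rarer feminine variant stays split, mirroring the des@@critto vs. des@@crit@@ta example from the introduction:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent
    symbol pair, starting from character sequences."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for seq, freq in vocab.items():
            for pair in zip(seq, seq[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment the vocabulary with the new merge applied.
        new_vocab = Counter()
        for seq, freq in vocab.items():
            out, i = [], 0
            while i < len(seq):
                if i < len(seq) - 1 and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Apply learned merges, in order, to a new word."""
    seq = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(seq):
            if i < len(seq) - 1 and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

# The frequent masculine form drives the merges; the rarer feminine
# variant ends up split into more tokens.
corpus = ["descritto"] * 10 + ["descritta"] * 2
merges = learn_bpe(corpus, 8)
```

With 8 merges on this corpus, the shared stem is collapsed first (all its pairs occur 12 times) and the final merge attaches the masculine -o (10 occurrences) rather than the feminine -a (2 occurrences), so only the feminine form remains split.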

Evaluation
We are interested in measuring both i) the overall translation quality obtained by different segmentation techniques, and ii) the correct generation of gender forms. We evaluate translation quality on both the MuST-C tst-COMMON set (2,574 sentences for en-it and 2,632 for en-fr) and MuST-SHE (§3.2), using SacreBLEU (Post, 2018). For fine-grained analysis of gender translation, we rely on gender accuracy. We differentiate between two categories of phenomena represented in MuST-SHE. Category (1) contains first-person references (e.g. I'm a student) to be translated according to the speakers' preferred linguistic expression of gender. In this context, ST models can leverage speakers' vocal characteristics as a gender cue to infer gender translation. Gender phenomena of Category (2), instead, shall be translated in concordance with other gender information in the sentence (e.g. she/he is a student).

Table 2 shows the overall translation quality of ST systems trained with distinct segmentation techniques. BPE proves as competitive as LMVR for both language pairs. On average, it also exhibits only a small gap (0.2 BLEU) with DPE on en-it, while it achieves the best performance on en-fr. The disparities are small though: they range within 0.5 BLEU, apart from Char standing ∼1 BLEU below. Compared to the scores reported by Di Gangi et al. (2020), however, the Char gap is smaller. As our results are considerably higher than theirs, we believe that the reason for such differences lies in a sub-optimal fine-tuning of their hyper-parameters. Overall, in light of the trade-off between computational cost (LMVR and DPE require a dedicated training phase for data segmentation) and average performance (BPE achieves winning scores on en-fr and competitive ones on en-it), we hold BPE as the best segmentation strategy in terms of general translation quality for direct ST.
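As an illustration of the kind of contrastive evaluation that MuST-SHE's coupled references enable, gender accuracy can be approximated as below. This is only a sketch of the idea, not the official scorer of the cited work: for each annotated pair of correct/wrong gender-marked forms, it checks which variant the system produced:

```python
def gender_accuracy(outputs, annotations):
    """Hedged sketch of a MuST-SHE-style gender accuracy score.
    outputs: list of system translations (strings).
    annotations: per-sentence lists of (correct_form, wrong_form) pairs.
    Returns correct / (correct + wrong) over the measured pairs."""
    correct = wrong = 0
    for out, sentence_pairs in zip(outputs, annotations):
        tokens = out.split()
        for good, bad in sentence_pairs:
            if good in tokens:
                correct += 1
            elif bad in tokens:
                wrong += 1
            # Pairs where neither form appears are left out of the score.
    measured = correct + wrong
    return correct / measured if measured else 0.0

outs = ["elle est fatiguée", "il est allé seul"]
ann = [[("fatiguée", "fatigué")], [("allée", "allé")]]
print(gender_accuracy(outs, ann))
```

In this toy case the first sentence produces the correct feminine form while the second produces the wrong masculine one, so the score is 0.5; the real scorer additionally handles term coverage and tokenization details.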

Comparison of Segmentation Methods
Turning to gender translation, the gender accuracy scores presented in Table 3 show that all ST models are clearly biased, with masculine forms (M) disproportionately produced across language pairs and categories. However, we intend to pinpoint the relative gains and losses among segmentation methods. Focusing on overall accuracy (ALL), we see that Char, despite its lowest performance in terms of BLEU score, emerges as the most favourable segmentation for gender translation. For French, however, DPE is only slightly behind. Surprisingly, the morphological methods do not outperform the statistical ones. The greatest variations are detected for feminine forms of Category 1 (1F), where none of the segmentation techniques reaches 50% accuracy, meaning that they are all worse than a random choice when the speaker should be addressed by feminine expressions. Char comes close to this threshold, while the others (apart from DPE in French) are significantly lower.
These results illustrate that target segmentation is a relevant parameter for gender translation. In particular, they suggest that Char segmentation improves models' ability to learn correlations between the received input and gender forms in the reference translations. Although in this experiment models rely only on speakers' vocal characteristics to infer gender, which we discourage as a cue for gender translation in real-world deployment (see §8), such ability shows a potential advantage for Char, which could be better redirected toward learning correlations with reliable gender meta-information included in the input. For instance, in a scenario in which meta-information (e.g. a gender tag) is added to the input to support gender translation, a Char model might better exploit this information. Lastly, our evaluation reveals that, unlike in previous ST studies, a proper comparison of models' gender translation capabilities requires adopting the same segmentation. Our questions then become: What makes Char segmentation less biased? What are the tokenization features determining a better/worse ability to generate the correct gender forms?
Lexical diversity. We posit that the limited generation of feminine forms can be framed as an issue of data sparsity, and that the advantage of Char-based segmentation ensues from its ability to handle less frequent and unseen words (Belinkov et al., 2020). Along these lines, Vanmassenhove et al. (2018) and Roberts et al. (2020) link the loss of linguistic diversity (i.e. the range of lexical items used in a text) with the overfitted distribution of masculine references in MT outputs.
To explore this hypothesis, we compare the lexical diversity (LD) of our models' translations and of the MuST-SHE references. To this aim, we rely on the Type/Token Ratio (TTR; Chotlos, 1944; Templin, 1957) and on the more robust Moving-Average TTR (MATTR; Covington and McFall, 2010). As we can see in Table 4, character-based models exhibit the highest LD (the only exception is DPE with the less reliable TTR metric on en-it). However, we cannot corroborate the hypothesis formulated in the above-cited studies, as LD scores do not strictly correlate with gender accuracy (Table 3). For instance, LMVR is the second-best in terms of gender accuracy on en-it, but shows a very low lexical diversity (the worst according to MATTR and second-worst according to TTR).

Sequence length. As discussed in §3.2, feminine forms are, though to different extents in the two languages, longer and less frequent than their masculine counterparts. In light of such conditions, we expected the statistically driven BPE segmentation to leave feminine forms unmerged at a higher rate, and thus to add uncertainty to their generation. To verify whether this is actually the case, which would explain BPE's lower gender accuracy, we check whether the number of tokens (characters or subwords) of a segmented feminine word is higher than that of the corresponding masculine form. We exploit the coupled "wrong" and "correct" references available in MuST-SHE, and compute the average percentage of additional tokens found in the feminine segmented sentences over the masculine ones. Results are reported in Table 5. At first glance, we observe opposite trends: BPE segmentation leads to the highest increment of tokens for feminine words in Italian, but to the lowest one in French. Also, DPE exhibits the highest increment in French, whereas it actually performs slightly better than Char on feminine gender translation (see Table 3). Hence, even the increase in sequence length does not seem to be an issue on its own for gender translation.
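The lexical diversity metrics used in this analysis can be sketched as follows (the window size below is an arbitrary choice for illustration; the actual evaluation settings may differ):

```python
def ttr(tokens):
    """Type/Token Ratio: distinct words over total words."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=50):
    """Moving-Average TTR (Covington and McFall, 2010): average TTR over
    all contiguous windows of a fixed size, which removes plain TTR's
    dependence on text length."""
    if len(tokens) <= window:
        return ttr(tokens)
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)

# A highly repetitive toy "translation": plain TTR collapses with length,
# while MATTR reflects local diversity regardless of total length.
text = ("il bimbo stanco si è seduto " * 20).split()
print(f"TTR={ttr(text):.3f}  MATTR={mattr(text, window=12):.3f}")
```

On this repetitive 120-token text, TTR drops to 0.05 simply because the text is long, whereas MATTR stays at 0.5 (every 12-token window contains 6 distinct words), which is why the text above calls MATTR the more robust of the two.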
Nonetheless, these apparently contradictory results encourage our last exploration: How are gender forms actually split?
Gender isolation. By means of further manual analysis of 50 output sentences for each of the 6 systems, we examine whether longer token sequences for feminine words can be explained in light of the different characteristics and gender production mechanisms of the two target languages (§3.1). Table 6 reports selected instances of coupled feminine/masculine segmented words, with their respective frequencies in the MuST-C training set.
Starting with Italian, we find that the BPE sequence length increment indeed ensues from a greedy splitting that, as we can see from examples (a) and (c), ignores meaningful affix boundaries for both same-length and different-length gender pairs, respectively. Conversely, on the French set, with 95% of feminine words longer than their masculine counterparts, BPE's low increment is precisely due to its loss of semantic units. For instance, as shown in (e), BPE neither preserves the verb root (adopt) nor isolates the additional token (-e) responsible for the feminine form, thus resulting in two words with the same sequence length (2 tokens). Instead DPE, which achieved the highest accuracy results for en-fr feminine translation (Table 3), treats the additional feminine character as a token per se (f). Note that, as the coupled references only vary for gender-marked words, we can isolate the difference relative to gender tokens.
Based on such patterns, our intuition is that properly splitting the morpheme encoding gender information into a distinct token favours gender translation, as models learn to productively generalize over it. Considering the high increment of DPE tokens for Italian in spite of the limited number of longer feminine forms (15%), our analysis confirms that DPE is unlikely to isolate gender morphemes in the en-it language pair. As a matter of fact, it produces the same kind of coarse splitting as BPE (see (b) and (d)).
Finally, we observe that the two morphological techniques are not equally valid. Morfessor occasionally generates morphologically incorrect subwords for feminine forms by breaking the word stem (see example (g), where the correct stem is sicur). Such behaviour also explains Morfessor's higher token increment with respect to LMVR. Although LMVR (examples (h) and (i)) produces linguistically valid suffixes, it often condenses other grammatical categories (e.g. tense and number) together with gender. As suggested above, if the pinpointed split of morpheme-encoded gender is a key factor for gender translation, LMVR's lower level of granularity explains its reduced gender accuracy. Working on character sequences, instead, always attains the isolation of the gender unit.

Table 8: Gender accuracy (%) for MuST-SHE Overall (ALL), Category 1 and 2 on en-fr and en-it.
Beyond the Quality-Gender Trade-off

Informed by our experiments and analysis (§5), we conclude this study by proposing a model that combines the overall translation quality of BPE with Char's ability to translate gender. To this aim, we train a multi-decoder model that exploits both segmentations to draw on their respective advantages.
In the context of ST, several multi-decoder architectures have been proposed, usually to jointly produce both transcripts and translations with a single model. Among those in which both decoders access the encoder output, we consider the best-performing architectures according to Sperber et al. (2020): i) Multitask direct, a model with one encoder and two decoders that both exclusively attend to the encoder output, as proposed by Weiss et al. (2017), and ii) the Triangle model (Anastasopoulos and Chiang, 2018), in which the second decoder attends to the output of both the encoder and the first decoder.
For the Triangle model, we used a first BPE-based decoder and a second Char-based decoder. With this order, we aimed to enrich the high-quality BPE translation with a gender-oriented refinement performed by the Char-based decoder. However, the results were negative: the second decoder seems to rely excessively on the output of the first one, thus suffering from severe exposure bias (Ranzato et al., 2016) at inference time. Hence, we do not report the results of these experiments.
The Multitask direct model, instead, has one BPE-based and one Char-based decoder. The system requires a training time increase of only 10% and 20% compared to, respectively, the Char and BPE models. At inference time, instead, running time and model size are the same as those of a BPE model. We report the overall translation quality (Table 7) and gender accuracy (Table 8) of the BPE output (BPE&Char). Starting with gender accuracy, the Multitask model's overall gender translation ability (ALL) is still lower than, although very close to, that of the Char-based model. Nevertheless, feminine translation improvements are present on Category 2F for en-fr and, with a larger gain, on 1F for en-it. We believe that the presence of the Char-based decoder helps capture gender information in the encoder output, which is then also exploited by the BPE-based decoder. As the encoder outputs are richer, overall translation quality is also slightly improved (Table 7). This finding is in line with other work (Costa-jussà et al., 2020b), which showed a close relation between gender accuracy and the amount of gender information retained in the intermediate representations (encoder outputs).
Overall, following these considerations, we posit that target segmentation can directly influence the gender information captured in the encoder output. Indeed, since the Char and BPE decoders do not interact with each other in the Multitask model, the gender accuracy gains of the BPE decoder cannot be attributed to the segmentation method itself being better at rendering into the translation the gender information present in the encoder output; they must instead stem from a richer encoder output.
Our results pave the way for future research on the creation of richer encoder outputs, disclosing the importance of target segmentation in extracting gender-related knowledge. With this work, we have taken a step forward in ST for English-French and English-Italian, while pointing at plenty of new ground to cover concerning how to split for different language typologies. As the motivations of this inquiry clearly concern MT as well, we invite novel studies to start from our findings and explore how they apply in that setting, as well as their combination with other bias mitigation strategies.

Conclusion
As the old IT saying goes: garbage in, garbage out. This assumption underlies most current attempts to address gender bias in language technologies. Instead, in this work we explored whether technical choices can exacerbate gender bias, focusing on the influence of word segmentation on gender translation in ST. To this aim, we compared several word segmentation approaches on the target side of ST systems for English-French and English-Italian, in light of the linguistic gender features of the two target languages. Our results show that tokenization does affect gender translation, and that the higher BLEU scores of state-of-the-art BPE-based models come at the cost of lower gender accuracy. Moreover, a first analysis of the behaviour of segmentation techniques suggests that improved generation of gender forms is linked to the proper isolation of the morpheme that encodes gender information, a property attained by character-level splitting. Finally, we proposed a multi-decoder approach that leverages the qualities of both BPE and character splitting, improving both gender accuracy and BLEU score while keeping computational costs under control.

Acknowledgments
This work is part of the "End-to-end Spoken Language Translation in Rich Data Conditions" project, 16 which is financially supported by an Amazon AWS ML Grant. The authors also wish to thank Duygu Ataman for the insightful discussions on this work.

Ethics statement 17
In compliance with ACL norms of ethics, we wish to elaborate on i) the characteristics of the dataset used in our experiments, ii) the study of gender as a variable, and iii) the harms potentially arising from real-world deployment of direct ST technology.
As already stated, in our experiments we rely on the training data from the TED-based MuST-C corpus 18 and its derived evaluation benchmark, MuST-SHE. Although precise information about the various sociodemographic groups represented in the data is not fully available, based on an impressionistic overview and prior knowledge about the nature of TED talks, we expect the speakers to be almost exclusively adults (over 20) with different geographical backgrounds. Such data are thus likely to allow for modeling a range of English varieties of both native and non-native speakers.
16 https://ict.fbk.eu/units-hlt-mt-e2eslt/
17 Extra space after the 8th page allowed for ethical considerations; see https://2021.aclweb.org/calls/papers/
18 https://ict.fbk.eu/must-c/

As regards gender, from the data statements (Bender and Friedman, 2018) of the corpora used, we know that MuST-C training data are manually annotated with speakers' gender information 19 based on the personal pronouns found in their publicly available personal TED profiles. As reported on its release page, 20 the same annotation process applies to MuST-SHE as well, with the additional check that the indicated (English) linguistic gender forms are rendered in the gold-standard translations. Hence, information about speakers' preferred linguistic expressions of gender is transparently validated and disclosed. Overall, MuST-C exhibits a gender imbalance: 70% vs. 30% of the speakers are referred to by means of the he/she pronoun, respectively. In MuST-SHE, instead, they are equally distributed, allowing for a proper cross-gender comparison.
Accordingly, when working on the evaluation of speaker-related gender translation for MuST-SHE category (1), we proceed by solely focusing on the rendering of their linguistic gender expressions. Following the guidelines of Larson (2017), no assumptions about speakers' self-determined identity (GLAAD, 2007), which cannot be directly mapped from pronoun usage (Cao and Daumé III, 2020; Ackerman, 2019), have been made. Unfortunately, our experiments only account for the binary linguistic forms represented in the data used. To the best of our knowledge, ST natural language corpora going beyond binarism do not yet exist, 21 also because gender-neutralization strategies are still object of debate and challenging to fully implement in languages with grammatical gender (Gabriel et al., 2018; Lessinger, 2020). Nonetheless, we support the rise of alternative neutral expressions for both languages (Shroy, 2016; Gheno, 2019) and point towards the development of non-binary inclusive technology.
Lastly, we endorse the point made by . Namely, direct ST systems leveraging the speaker's vocal biometric features as a gender cue can bring real-world dangers, such as the categorization of individuals by means of biological essentialist frameworks (Zimman, 2020). This is particularly harmful to transgender individuals, as it can lead to misgendering (Stryker, 2008) and diminish their personal identity. More generally, it can reduce gender to stereotypical expectations about how masculine or feminine voices should sound. Note that we do not advocate for the deployment of ST technologies as-is. Rather, we experimented with unmodified models for the sake of hypothesis testing without adding variability. However, our results suggest that, if certain word segmentation techniques better capture correlations from the received input, such capability could be exploited to redirect ST attention away from speakers' vocal characteristics by means of other provided information.