Requirements and Motivations of Low-Resource Speech Synthesis for Language Revitalization

This paper describes the motivation and development of speech synthesis systems for the purposes of language revitalization. By building speech synthesis systems for three Indigenous languages spoken in Canada, Kanien’kéha, Gitksan & SENĆOŦEN, we re-evaluate the question of how much data is required to build low-resource speech synthesis systems featuring state-of-the-art neural models. For example, preliminary results with English data show that a FastSpeech2 model trained with 1 hour of training data can produce speech with comparable naturalness to a Tacotron2 model trained with 10 hours of data. Finally, we motivate future research in evaluation and classroom integration in the field of speech synthesis for language revitalization.


Introduction
There are approximately 70 Indigenous languages spoken in Canada, from 10 distinct language families (Rice, 2008). As a consequence of the residential school system and other policies of cultural suppression, the majority of these languages now have fewer than 500 fluent speakers remaining, most of them elderly. Despite this, interest from students and parents in Indigenous language education continues to grow (Statistics Canada, 2016); we have heard from teachers that they are overwhelmed with interest from potential students, and the growing trend towards online education means many students who have not previously had access to language classes now do.
Supporting these growing cohorts of students comes with unique challenges for languages with few fluent first-language speakers. A particular concern of teachers is to provide their students with opportunities to hear the language outside of class. Text-to-speech (TTS) synthesis technology shows potential for supplementing text-based language learning tools with audio when the domain is too large to be recorded directly, or as an interim solution pending recordings from first-language speakers.
Development of TTS systems in this context faces several challenges. Most notable is the usual assumption that neural speech synthesis models require at least tens of hours of audio recordings with corresponding text transcripts to be trained adequately. Such a data requirement is far beyond what is available for the languages we are concerned with, and is difficult to meet given the limited time of the relatively small number of speakers of these languages. The limited availability of Indigenous language speakers also hinders the subjective evaluation methods often used in TTS studies, where naturalness of synthetic speech samples is judged by speakers of the language in question.
In this paper, we re-evaluate some of these challenges for applying TTS in the low-resource context of language revitalization. We build TTS systems for three Indigenous languages of Canada, with training data ranging from 25 minutes to 3.5 hours, and confirm that we can produce acceptable speech as judged by language teachers and learners. Outputs from these systems could be suitable for use in some classroom applications, for example a speaking verb conjugator.

Language Revitalization
It is no secret that the majority of the world's languages are in crisis, and in many cases this crisis is even more urgent than conservation biologists' dire predictions for flora and fauna (Sutherland, 2003). However, the 'doom and gloom' rhetoric that often follows endangered languages over-represents vulnerability and under-represents the enduring strength of Indigenous communities who have refused to stop speaking their languages despite over a century of colonial policies against their use (Pine and Turin, 2017). Continuing to speak Indigenous languages is often seen as a political act of anti-colonial resistance. As such, the goals of any given language revitalization effort extend far beyond memorizing verb paradigms to broader goals of nationhood and self-determination (Pitawanakwat, 2009; McCarty, 2018). Language revitalization programs can also have immediate and important impacts on factors including community health and wellness (Whalen et al., 2016; Oster et al., 2014).
There is a growing international consensus on the importance of linguistic diversity, from the Truth & Reconciliation Commission of Canada (TRC) report in 2015, which issued nine calls to action related to language, to 2019 being declared an International Year of Indigenous Languages by the UN, and 2022-2032 being declared an International Decade of Indigenous Languages. From 1996 to 2016, the number of speakers of Indigenous languages increased by 8% (Statistics Canada, 2016). These efforts have been successful despite a lack of support from digital technologies. While opportunities may exist for technology to assist and support language revitalization efforts, these technologies must be developed in a way that does not further marginalize communities (Brinklow et al., 2019; Bird, 2020).

Why TTS for Language Revitalization?
Our interest in speech synthesis for language revitalization was sparked during user evaluations of Kawennón:nis (lit. 'it makes words'), a Kanien'kéha verb conjugator (Kazantseva et al., 2018) developed in collaboration between the National Research Council Canada and the Onkwawenna Kentyohkwa adult immersion program in Six Nations of the Grand River in Ontario, Canada. Kawennón:nis models a pedagogically important subset of verb conjugations in XFST (Beesley and Karttunen, 2003), and currently produces 247,450 unique conjugations.
The pronominal system is largely responsible for much of this productivity, since in transitive paradigms, agent/patient pairs are fused, as illustrated in Figure 1.
(1) Senòn:wes (you.to.it-like-habitual) 'You like it.'
(2) Takenòn:wes (you.to.me-like-habitual) 'You like me.'
Figure 1: An example of fusional morphology of agent/patient pairs in Kanien'kéha transitive verb paradigms (from Kazantseva et al., 2018).

In user evaluations of Kawennón:nis, students often asked whether it was possible to add audio to the tool, to model the pronunciation of unfamiliar words. Assuming a rate of 200 forms/hr for 4 hours per day, 5 days per week, recording the current domain would take a teacher out of the classroom for approximately a year. Considering Kawennón:nis is anticipated to have over 1,000,000 unique forms by the time the grammar modelling work is finished, recording audio manually becomes infeasible. The research question that then emerged was 'what is the smallest amount of data needed in order to generate audio for all verb forms in Kawennón:nis?' Beyond Kawennón:nis, we anticipate that there are many similar language revitalization projects that would want to add supplementary audio to other text-based pedagogical tools.
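The arithmetic behind this estimate can be made explicit. A short sketch, using only the rates assumed above (200 forms/hour, 4 hours/day, 5 days/week):

```python
# Estimate how long it would take a teacher to record every form by hand,
# using the rates assumed above: 200 forms/hour, 4 hours/day, 5 days/week.
def recording_weeks(n_forms, forms_per_hour=200, hours_per_day=4, days_per_week=5):
    forms_per_week = forms_per_hour * hours_per_day * days_per_week
    return n_forms / forms_per_week

weeks_now = recording_weeks(247_450)       # current Kawennón:nis coverage
weeks_future = recording_weeks(1_000_000)  # anticipated coverage

print(f"{weeks_now:.0f} weeks")     # ~62 weeks, i.e. over a year of class time
print(f"{weeks_future:.0f} weeks")  # ~250 weeks, roughly five years
```

Even under these optimistic assumptions, manual recording does not scale to the anticipated domain.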

Speech Synthesis
The last few years have shown an explosion in research into purely neural network-based approaches to speech synthesis. Similar to their HMM/GMM predecessors, neural pipelines typically consist of both a network predicting the acoustic properties of a sequence of text and a vocoder. The feature prediction network must be trained using parallel speech/text data where the input is typically a sequence of characters or phones that make up an utterance, and the output is a sequence of fixed-width frames of acoustic features. In most cases the predictions from the TTS model are log Mel-spectral features and a vocoder is used to generate the waveform from these acoustic features.
Much of the previous work on low-resource speech synthesis has focused on transfer learning; that is, 'pre-training' a network using data from a language that has more data, and then 'fine-tuning' using data from the low-resource language. One of the problems with this approach is that the input space often differs between languages. As the inputs to these systems are sequences of characters or phones, and as these sequences are typically one-hot encoded, it can be difficult to devise a principled method for transferring weights from the source language network to the target if there is a difference between the character or phone inventories of the two languages. Various strategies have emerged for normalizing the input space. For example, Demirsahin et al. (2018) propose a unified inventory for regional multilingual training of South Asian languages, while Tu et al. (2019) compare various methods to create mappings between source and target input spaces. Another proposal is to normalize the input space between source and target languages by replacing one-hot encodings of text with multi-hot phonological feature encodings (Gutkin et al., 2018; Wells and Richmond, 2021).
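The difference between the two encodings can be sketched as follows. The phone inventory and feature set here are toy illustrations, not the inventories used by the cited work:

```python
# One-hot encoding ties each input dimension to a single phone, so weights
# learned for a source-language phone have no principled mapping to a target
# phone missing from the source inventory.
phone_inventory = ["a", "i", "k", "n", "s"]

def one_hot(phone):
    return [1 if p == phone else 0 for p in phone_inventory]

# Multi-hot phonological features describe phones by shared articulatory
# properties, so unseen target-language phones still land in a meaningful
# region of the input space. This toy feature set is illustrative only.
features = ["syllabic", "high", "nasal", "voiced", "continuant"]
phone_features = {
    "a": {"syllabic", "voiced", "continuant"},
    "i": {"syllabic", "high", "voiced", "continuant"},
    "k": set(),
    "n": {"nasal", "voiced"},
    "s": {"continuant"},
}

def multi_hot(phone):
    return [1 if f in phone_features[phone] else 0 for f in features]

print(one_hot("i"))    # [0, 1, 0, 0, 0]
print(multi_hot("i"))  # [1, 1, 0, 1, 1]
```

A target-language phone absent from the source inventory would get an all-new one-hot dimension, but its feature vector would still overlap with acoustically similar source phones.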

Speech Synthesis for Indigenous Languages in Canada
There is extremely little published work on speech synthesis for Indigenous languages in Canada (and North America generally). A statistical parametric speech synthesizer using Simple4All was recently developed for Plains Cree (Harrigan et al., 2019; Clark, 2014). Although the work was unpublished, two high-school students created a statistical parametric speech synthesizer for Kanien'kéha by adapting eSpeak (Duddington and Dunn, 2007). We know of no other attempts to create speech synthesis systems for Indigenous languages in Canada. Elsewhere in North America, a Tacotron2 system has been built for Cherokee (Conrad, 2020), and some early work on concatenative systems for Navajo was discussed in a technical report (Whitman et al., 1997), as well as for Rarámuri (Urrea et al., 2009).

Indigenous Language Data
Although the term 'low resource' is used to describe a wide swath of languages, most Indigenous languages in Canada would be considered 'low-resource' in multiple senses of the word, having both a low amount of available data (annotated or unannotated) and a relatively low number of speakers. Most Indigenous languages lack transcribed audio corpora, and fewer still have such data recorded in a studio context. Due to the limited number of speakers, creating these resources is non-trivial: there are limited amounts of text from which a speaker could read, and there are few people available who are sufficiently literate in the languages to transcribe recorded audio. Re-focusing speakers' limited time to these tasks presents a significant opportunity cost; they are often already over-worked and over-burdened in under-funded and under-resourced language teaching projects.

As mentioned in §2.1, language technology projects that aim to assist language revitalization and reclamation efforts must be centered around the primary goals of those efforts and ensure that the means of developing the technology do not distract from or work against the broader sociopolitical goals. A primary stress point for many natural language processing projects involving Indigenous communities surrounds issues of data sovereignty. It is important that communities direct the development of these tools, and maintain control, ownership, and distribution rights for their data, as well as for the resulting speech synthesis models (Keegan, 2019; Brinklow, 2021). In keeping with this, the datasets described in this paper are not being released publicly at this time.

To test the feasibility of developing speech synthesis systems for Indigenous languages, we trained models for three unrelated Indigenous languages: Kanien'kéha (§3.1), Gitksan (§3.2), and SENĆOŦEN (§3.3).

Kanien'kéha
Kanien'kéha (a.k.a. Mohawk) is an Iroquoian language spoken by roughly 2,350 people in southern Ontario, Quebec, and northern New York state (Statistics Canada, 2016). In 1979 the first immersion school of any Indigenous language in Canada was opened for Kanien'kéha, and many other very successful programs have been started since, including the Onkwawenna Kentyohkwa adult immersion program in 1999 (Gomashie, 2019).
In the late 1990s, a team of five Kanien'kéha translators worked with the Canadian Bible Society to translate and record parts of the Bible; one of the speakers on these recordings, Satewas, is still living. Translation runs in Satewas's family, with his great-grandfather also working on Bible translations in the 19th century. Later, a team of four speakers and learners, including this paper's third author, aligned the text and audio at the utterance level using Praat (Boersma and van Heuven, 2001) and ELAN (Brugman and Russel, 2004).
While a total of 24 hours of audio were recorded, members of the Kanien'kéha-speaking community told us it would be inappropriate to use the voices of speakers who had passed away, leaving only recordings of Satewas's voice. Using a GMM-based speaker ID system (Kumar, 2017), we removed utterances by these speakers, then removed utterances that were outliers in duration (less than 0.4s or greater than 11s) and speaking rate (less than 4 phones per second or greater than 15), recordings with an unknown phase effect present, and utterances containing non-Kanien'kéha characters (e.g. proper names like 'Euphrades'). Handling utterances with non-Kanien'kéha characters would have required grapheme-to-phoneme prediction capable of dealing with multilingual text and code-switching, which we did not have available. The resulting speech corpus comprised 3.46 hours of speech.
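The outlier filtering described above amounts to a simple predicate over utterance metadata. A minimal sketch with hypothetical utterance records (the real pipeline also removed other speakers, phase-affected recordings, and non-Kanien'kéha text):

```python
# Filter utterances by the outlier thresholds described above: duration
# between 0.4s and 11s, speaking rate between 4 and 15 phones per second.
def keep_utterance(duration_s, n_phones,
                   min_dur=0.4, max_dur=11.0, min_rate=4.0, max_rate=15.0):
    if not (min_dur <= duration_s <= max_dur):
        return False
    rate = n_phones / duration_s
    return min_rate <= rate <= max_rate

# Hypothetical corpus entries, for illustration only.
corpus = [
    {"id": "utt1", "duration": 2.5, "phones": 25},  # 10 phones/s -> keep
    {"id": "utt2", "duration": 0.3, "phones": 4},   # too short   -> drop
    {"id": "utt3", "duration": 3.0, "phones": 60},  # 20 phones/s -> drop
]
kept = [u["id"] for u in corpus if keep_utterance(u["duration"], u["phones"])]
print(kept)  # ['utt1']
```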

Gitksan
Gitksan is one of four languages belonging to the Tsimshianic language family spoken along the Skeena river and its surrounding tributaries in the area colonially known as northern British Columbia. Traditional Gitksan territory spans some 33,000 square kilometers and is home to almost 10,000 people, with approximately 10% of the population continuing to speak the language fluently (First Peoples' Cultural Council, 2018).
As there were no studio-quality recordings of the Gitksan language publicly available, and as an intermediate speaker of the language, the first author recorded a sample set himself. In total, he recorded 35.46 minutes of audio reading isolated sentences from published and unpublished stories (Forbes et al., 2017).

SENĆOŦEN
The SENĆOŦEN language is spoken by the W̱SÁNEĆ people on the southern part of the island colonially known as Vancouver Island. It belongs to the Coastal branch of the Salish language family. The W̱SÁNEĆ community runs a world-famous language revitalization program, and uses an orthography developed by the late SENĆOŦEN speaker and W̱SÁNEĆ elder Dave Elliott. While the community of approximately 3,500 has fewer than 10 fluent speakers, there are hundreds of learners, many of whom have been enrolled in years of immersion education in the language (First Peoples' Cultural Council, 2018).
As there were no studio-quality recordings of the SENĆOŦEN language publicly available, we recorded 25.92 minutes of the language with PENÁĆ David Underwood reading two stories originally spoken by elder Chris Paul.

Research Questions
Given the motivation and context for language revitalization-based speech synthesis, a number of research questions follow. Namely: how much data is required to build a system of reasonable pedagogical quality? How do we evaluate such a system? And how is the resulting system best integrated into the classroom? In §4.1, we discuss the difficulty of evaluating TTS systems in low-resource settings. We then discuss preliminary results for English and Indigenous language TTS which show that acceptable speech quality can be achieved with much less training data than usually considered for neural speech synthesis (§4.2). Finally, we suggest possible directions for pedagogical integration in §4.4.

Low-Resource Evaluation
One of the most significant challenges in researching speech synthesis for languages with few speakers is evaluating the models. For some Indigenous languages in Canada, the total number of speakers of the language is less than the number typically required for statistical significance in a listening test (Wester et al., 2015). While the number of speakers in these conditions is sub-optimal for statistical analysis, we have been told by the communities we work with that the positive assessment of a few widely respected and community-engaged language speakers would be practically sufficient to assess the pedagogical value of speech models in language revitalization contexts. For the experiments described in this paper, we ran listening tests for both Kanien'kéha and Gitksan with speakers, teachers, and learners, but were not able to run any such tests for SENĆOŦEN due to very few speakers with already busy schedules.
While some objective metrics do exist, such as Mel cepstral distortion (MCD, Kubichek, 1993), we do not believe they should be considered reliable proxies for listening tests. Future research on speech synthesis for languages with few speakers should prioritize efficient and effective means of evaluating results. In many cases, including in the experiment described in §4.2, artificial data constraints can be placed on a language with more data, like English, to simulate a low-resource scenario. While this technique can be insightful and it is tempting to draw universal conclusions, English is linguistically very different from many of the other languages spoken in the world. Accordingly, we should be cautious not to assume that results from these types of experiments will necessarily transfer or extend to genuinely low-resource languages.
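For reference, MCD between aligned mel-cepstral sequences can be computed as follows. This is a minimal sketch: frame alignment (e.g. by dynamic time warping) is assumed to have been done already, and the toy coefficient values are illustrative:

```python
import math

# Mel cepstral distortion (Kubichek, 1993) between two time-aligned frames of
# mel-cepstral coefficients, excluding the 0th (energy) coefficient as is
# conventional.
def mcd_frame(ref, syn):
    sq = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)

# Average frame-level MCD over an aligned pair of sequences.
def mcd(ref_frames, syn_frames):
    return sum(mcd_frame(r, s) for r, s in zip(ref_frames, syn_frames)) / len(ref_frames)

ref = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
syn = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
print(mcd(ref, syn))  # 0.0 for identical sequences
```

Lower values indicate closer spectral match, but as noted above, a low MCD does not guarantee that listeners would judge the speech natural.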

How much data do you really need?
The first question to answer is whether our Indigenous language corpora ranging from 25 minutes to 3.46 hours of speech are sufficient for building neural speech synthesizers. Due to the prominence of Tacotron2 (Shen et al., 2018), it seems that many people have assumed that the data requirements for training any neural speech synthesizer of similar quality must be the same as the requirements for this particular model. As a result, some researchers still choose to implement either concatenative or HMM/GMM-based statistical parametric speech synthesis systems in low-resource situations based on the assumption that a "sufficiently large corpus [for neural TTS] is unavailable" (James et al., 2020, p. 298). We argue that attention-based models such as Tacotron2 should not be used as a benchmark for data requirements among all neural TTS methods, as they are notoriously difficult to train and unnecessarily inflate training data requirements.

Replacing attention-based weak duration models
Tacotron2 is an autoregressive model, meaning it predicts the speech parameters ŷ_t from both the input sequence of text x and the previous speech parameters y_1, ..., y_{t−1}. Typically, the model is trained with 'teacher forcing', where the autoregressive frame passed as input for predicting ŷ_t is taken from the ground-truth acoustic features y_{t−1} rather than from the prediction network's output for the previous frame ŷ_{t−1}. As has been discussed in prior work, such a system might learn to copy the teacher-forcing input or disregard the text entirely, which could still optimize Tacotron2's root mean square error over predicted acoustic features, but result in an untrained or degenerate attention network which is unable to properly generalize to new inputs at inference time, when the teacher-forcing input is unavailable. Attention failures represent a characteristic class of errors for models such as Tacotron2, for example skipping or repeating words from the input text (Valentini-Botinhao and King, 2021).
There have been many proposals to improve training of the attention network, for example by guiding the attention or using a CTC loss function to respect the monotonic alignment between text inputs and speech outputs (Tachibana et al., 2018; Liu et al., 2019; Zheng et al., 2019; Gölge, 2020). As has also been noted, increasing the so-called 'reduction factor' (which applies dropout to the autoregressive frames) can help the model learn to rely more on the attention network than on the teacher-forcing inputs, but possibly at the risk of compromising synthesis quality. FastSpeech2 (Ren et al., 2021), and similar systems like FastPitch (Łańcucki, 2021), present an alternative to Tacotron2-style attentive, autoregressive systems, with similar listening test results and without the characteristic errors related to attention. Instead of modelling duration using attention, they include an explicit duration prediction module trained on phone duration targets extracted from the training data. For the original FastSpeech, target phone durations were derived from the attention weights of a pre-trained Tacotron2 system (Ren et al., 2019). In low-resource settings, however, there might not be sufficient data to train an initial Tacotron2 in the target language in the first place. For FastSpeech2, phone duration targets are instead extracted using the Montreal Forced Aligner (MFA, McAuliffe et al., 2017), trained on the same data as used for TTS model training. We have found MFA can provide suitable alignments for our target languages, even with alignment models trained on only limited data.
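The conversion from aligner output to duration targets can be sketched as follows. The hop size, sample rate, phone labels, and interval times below are illustrative values, not those of any particular system:

```python
# Convert phone intervals from a forced aligner (e.g. MFA TextGrid output)
# into per-phone duration targets in acoustic frames, as used to supervise an
# explicit duration predictor. Intervals are (phone, start_s, end_s) tuples.
def intervals_to_frame_durations(intervals, sample_rate=22050, hop_length=256):
    frames_per_second = sample_rate / hop_length
    durations, prev_end = [], 0
    for phone, start_s, end_s in intervals:
        # Round cumulative boundaries so rounding errors do not accumulate
        # and the durations sum exactly to the utterance's frame count.
        end_frame = round(end_s * frames_per_second)
        durations.append((phone, end_frame - prev_end))
        prev_end = end_frame
    return durations

# Hypothetical alignment for a short utterance.
intervals = [("s", 0.00, 0.08), ("e", 0.08, 0.21), ("n", 0.21, 0.30)]
print(intervals_to_frame_durations(intervals))  # [('s', 7), ('e', 11), ('n', 8)]
```

Rounding cumulative boundaries rather than individual durations keeps the targets consistent with the total number of acoustic frames.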
Faster convergence of text-acoustic feature alignments has been found to speed up overall encoder-decoder TTS model training, as stable alignments provide a solid foundation for further training of the decoder. Badlani et al. (2021) show this by adding a jointly-learned alignment framework to a Tacotron2 architecture, reducing time to convergence. In contrast, they found that replacing MFA duration targets with their jointly-learned alignments in FastSpeech2 training offers no benefit: forced-alignment targets already provide enough information for more time-efficient training compared to an attention-based Tacotron2 system. Relieving the model of the burden of learning an internal alignment also opens the door to more data-efficient training. For example, Perez-Gonzalez-de-Martos et al. (2021) submitted a non-attentive model trained from forced alignments to the Blizzard Challenge 2021, where their system was found to be among the most natural and intelligible in subjective listening tests despite using only 5 hours of speech; all other submitted systems included additional training data, often in significant amounts (up to 100 hours total).

Experimental Comparison of Data Requirements for Neural TTS
To investigate the effects of differing amounts of data on the attention network, and in preparation for training systems with our limited Indigenous language data sets, we trained five Tacotron2 models on incremental partitions of the LJ Speech corpus of American English (Ito and Johnson, 2017). We used the NVIDIA implementation (https://github.com/NVIDIA/tacotron2) with default hyperparameters, apart from a reduced batch size of 32 to fit the memory capacity of our GPU resources. We artificially constrained the training data such that the first model saw only the first hour of data from the shuffled corpus, the second model that same first hour plus another two hours (3 total), etc., so that the five models were trained on 1, 3, 5, 10 and 24 (full corpus) hours of speech. The models were trained for 100k steps and, as seen in Figure 2, with up to 5 hours of data the attention mechanism does not learn properly, resulting in degenerate outputs. For comparison, we trained seven FastSpeech2 models with batch size 16 for 200k steps on 15 and 30 minute, and 1, 3, 5, 10 and 24 hour incremental partitions of LJ Speech. Our model is based on an open-source implementation (Chien, 2021), which adds learnable speaker embeddings and a decoder postnet to the original model, as well as predicting pitch and energy values at the phone rather than frame level. We also added learnable language embeddings for supplementary experiments in cross-lingual fine-tuning; while not reported in this paper, we refer the interested reader to Pine (2021) for discussion of these experiments. Motivated by concerns of efficiency in model training and inference, and the possibility of overfitting a large model to limited amounts of data, we further modified the base architecture to match the LightSpeech model presented in Luo et al. (2021).
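The incremental partitioning scheme described above can be sketched as a prefix of a single shuffled corpus, so that each smaller partition is a strict subset of the larger ones. Utterance IDs and durations here are fabricated for illustration:

```python
# Take utterances from the front of a shuffled corpus until a duration budget
# (in hours) is met. Utterances are (id, duration_seconds) pairs.
def partition(utterances, budget_hours):
    subset, total_s = [], 0.0
    for utt_id, dur_s in utterances:
        if total_s >= budget_hours * 3600:
            break
        subset.append(utt_id)
        total_s += dur_s
    return subset

# Because each call takes a prefix of the same shuffled list, smaller
# partitions are nested inside larger ones. Hypothetical 5-hour corpus:
utterances = [(f"utt{i}", 6.0) for i in range(3000)]  # 3000 x 6s = 5h
one_hour = partition(utterances, 1)
three_hours = partition(utterances, 3)
print(len(one_hour), len(three_hours))  # 600 1800
assert set(one_hour) <= set(three_hours)
```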
We removed the energy adaptor, replaced the convolutional layers in the encoder, decoder and remaining variance predictors with depthwise separable convolutions (Kaiser et al., 2018), and matched encoder and decoder convolutional kernel sizes with Luo et al. (2021). This reduced the number of model parameters from 35M to 11.6M without noticeable change in voice quality, and sped up training by 33% on GPU or 64% on CPU. For additional discussion of the accessibility benefits of these changes with respect to Indigenous language communities, see Appendix A.
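The parameter savings from depthwise separable convolutions can be illustrated with a quick count; the channel and kernel sizes below are illustrative, not the actual FastSpeech2/LightSpeech configuration:

```python
# Parameter counts (ignoring biases) for a standard 1-D convolution versus a
# depthwise separable one. A depthwise separable convolution factors the
# standard convolution into a per-channel k-tap filter plus a 1x1 pointwise
# convolution that mixes channels.
def conv1d_params(c_in, c_out, k):
    return k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    depthwise = k * c_in      # one k-tap filter per input channel
    pointwise = c_in * c_out  # 1x1 conv to mix channels
    return depthwise + pointwise

# Illustrative layer sizes only.
c_in = c_out = 256
k = 9
std = conv1d_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, f"{sep / std:.1%}")  # 589824 67840 11.5%
```

For wide layers and large kernels the separable variant needs roughly 1/k of the parameters, which is the kind of reduction that made the 35M-to-11.6M shrinkage possible.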

Results
We conducted a short (10-15 minute) listening test to compare the two Tacotron2 models that trained properly (10h, full) against the seven FastSpeech2 models. We recruited 30 participants through Prolific, and presented each with four MUSHRA-style questions where they were asked to rank the 9 voices along with a hidden natural speech reference (ITU-R, 2003). MUSHRA-style questions were used as a practical way to evaluate this large number of models. While it only took 30 minutes to recruit 30 participants using Prolific, the quality of responses was quite varied. We rejected two outright as they seemingly did not listen to the stimuli and left the same rankings for every voice. Even so, there was a lot of variation in responses from the remaining participants, as seen in Figure 3. We tested for significant differences between pairs of voices using Bonferroni-corrected Wilcoxon signed rank tests. Pairwise test results are summarized in the heat map of their p-values in Figure 4.
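The Bonferroni correction applied to these pairwise tests amounts to multiplying each raw p-value by the number of comparisons (capped at 1.0) before comparing against the significance level. A sketch with hypothetical p-values, not our experimental results:

```python
from math import comb

# Bonferroni correction: with m comparisons, each raw p-value is scaled by m
# (capped at 1.0) so the family-wise error rate stays at the nominal level.
def bonferroni(p_values):
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# 10 voices (9 synthetic + the natural reference) give C(10, 2) = 45 pairs.
n_pairs = comb(10, 2)

# Three hypothetical raw p-values among 45 comparisons.
raw = [0.0004, 0.0100, 0.0400]
adjusted = [round(p, 4) for p in bonferroni(raw + [1.0] * (n_pairs - len(raw)))[:3]]
print(n_pairs, adjusted)  # 45 [0.018, 0.45, 1.0]
```

Note how a raw p-value of 0.04, nominally 'significant', survives nowhere near the corrected threshold once 45 comparisons are accounted for.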
In the results from the pairwise analysis, we can see that natural speech is rated as significantly more natural than all synthetic speech samples. Naturalness ratings for the FastSpeech2 voices trained on 15m and 30m of data are significantly lower than all other voices, and significantly different from each other. The results for the remaining voices (see Figure 3) are not significantly different from each other. This is a relevant and important finding for low-resource speech synthesis because it shows that a FastSpeech2 voice built with 3 hours of data can achieve subjective naturalness ratings which are not significantly different from a Tacotron2 voice built with 24 hours of data. Similarly, the results of the listening test for our FastSpeech2 voice built with 1 hour of data are not significantly different from our Tacotron2 voice built with 10 hours of data. Additionally, while all the FastSpeech2 voices were intelligible, all Tacotron2 models trained with less than 10 hours of data produced unintelligible speech.

Indigenous Language Experiments
Despite the difficulty in evaluation (§4.1), we built and evaluated a number of TTS systems for the Indigenous languages described in §3. We had a baseline concatenative model available for Kanien'kéha that we had previously built using Festival and Multisyn (Taylor et al., 1998; Clark et al., 2007). Additionally, we trained cold-start FastSpeech2 models for each language, as well as models fine-tuned for 25k steps from a multilingual, multispeaker FastSpeech2 model pre-trained on a combination of VCTK (Yamagishi et al., 2019), Kanien'kéha, and Gitksan recordings. A rule-based mapping from orthography to pronunciation form was developed for each language using the 'g2p' Python library in order to perform alignment and synthesis at the phone level instead of the character level (Pine et al., Under Review).
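A rule-based orthography-to-pronunciation mapping of this kind can be sketched as a longest-match rewrite over an ordered rule table. The rules below are invented for illustration and are not the actual Kanien'kéha, Gitksan, or SENĆOŦEN mappings built with the 'g2p' library:

```python
# A minimal longest-match grapheme-to-phoneme mapper. The rule table is
# invented for illustration; real mappings were written per-language.
RULES = {  # orthographic substring -> phone(s)
    "ts": ["ts"],
    "on": ["ũ"],   # a nasal-vowel digraph (illustrative)
    "t": ["t"],
    "s": ["s"],
    "o": ["o"],
    "n": ["n"],
    "a": ["a"],
}
MAX_RULE_LEN = max(len(k) for k in RULES)

def g2p(word):
    phones, i = [], 0
    while i < len(word):
        # Try the longest matching rule first so digraphs beat single letters.
        for length in range(min(MAX_RULE_LEN, len(word) - i), 0, -1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.extend(RULES[chunk])
                i += length
                break
        else:
            i += 1  # skip characters with no rule (e.g. punctuation)
    return phones

print(g2p("tsona"))  # ['ts', 'ũ', 'a']
```

Longest-match ordering matters: without it, 'ts' would be read as 't' followed by 's' and 'on' as two separate phones.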

Results
We carried out listening test evaluations of Gitksan and Kanien'kéha models. Participants were recruited by contacting teachers, learners and linguists with at least some familiarity with the languages.
For the Kanien'kéha listening test, 6 participants were asked to answer 20 A/B questions comparing synthesized utterances from the various models. We used A/B tests for more targeted comparisons between different systems, namely cold-start vs. fine-tuned and neural vs. concatenative. Results showed that 72.2% of A/B responses from participants preferred our FastSpeech2 model over our baseline concatenative model. In addition, 81.7% of A/B responses from participants preferred the cold-start model to the one fine-tuned from the multilingual, multispeaker model, suggesting that the transfer learning approach discussed in §2.3 might not be necessary for models with explicit durations such as FastSpeech2, since they are relieved of the burden of learning an implicit model of duration through attention from limited data.
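When preference counts are available, an exact binomial test against the no-preference null can quantify A/B results like these. The counts below are hypothetical, not our study's:

```python
from math import comb

# Exact one-sided binomial test: the probability of observing at least k
# preferences for system A out of n responses if listeners actually had no
# preference (p = 0.5).
def binom_test_one_sided(k, n, p=0.5):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# e.g. 42 of 60 hypothetical responses (70%) prefer system A.
pval = binom_test_one_sided(42, 60)
print(f"{pval:.4f}")
```

With only a handful of eligible listeners, such tests are underpowered, which is part of the evaluation difficulty discussed in §4.1.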
For the Gitksan listening test, we did not build a concatenative model as with Kanien'kéha and so we were not comparing different models, but rather just gathering opinions on the quality of the cold-start FastSpeech2 model. Accordingly, 10 MOS-style questions were presented to 12 participants for both natural utterances and samples from our FastSpeech2 model. The model received a 3.56 ± 0.26 MOS compared with a MOS for the reference recordings of 4.63 ± 0.19 as shown in Figure 5. While both Kanien'kéha and Gitksan results seem to corroborate our belief that these models should be of reasonable quality despite limited training data, it is difficult to make any conclusive statement given the low number of eligible participants available for evaluation.
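MOS figures like those above are conventionally reported as the mean rating with a 95% confidence interval. A sketch under a normal approximation (mean ± 1.96 standard errors), with hypothetical ratings rather than our actual Gitksan responses:

```python
from math import sqrt

# Mean opinion score with a 95% confidence interval under a normal
# approximation. Ratings are integers on a 1-5 scale.
def mos_with_ci(ratings):
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = 1.96 * sqrt(var / n)
    return mean, half_width

# Hypothetical ratings from 10 listeners.
ratings = [4, 3, 4, 5, 3, 4, 4, 3, 4, 5]
mean, hw = mos_with_ci(ratings)
print(f"{mean:.2f} ± {hw:.2f}")  # 3.90 ± 0.46
```

With small panels, a t-distribution critical value would widen the interval somewhat; the 1.96 factor here is the large-sample convention.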
As the main goal of our efforts here is to eventually integrate our speech synthesis systems into a pedagogical setting, we also asked the 18 people who participated across the Kanien'kéha and Gitksan listening tests directly whether they approved of the synthesis quality. As seen in Figure 6, participant responses were generally positive; full responses are reported in Appendix B.

Figure 6: Responses from a qualitative survey asking participants "Would you be comfortable with any of the voices you heard being played online, say for a digital dictionary or verb conjugator if no other recording existed?". No participants responded "no".

Integrating TTS in the Classroom
The goal of adding supplementary audio to a reference tool like Kawennón:nis can be met straightforwardly by linking entries in the verb conjugator to pre-generated audio for the domain, served from a static server. This implementation also limits the potential for out-of-domain utterances that might be deemed inappropriate, which is an ethical concern in communities with low numbers of speakers, where the identity of the 'model' speaker is easily determined.
However, the ability to synthesize novel utterances could be pedagogically useful. Students often come into contact with words or sentences which do not have audio, and teachers often have to prepare new thematic word lists or vocabulary lessons that could benefit from a more general-purpose speech synthesis solution. In those cases, with community and speaker input, we might consider what controls would be necessary for the users of this technology. One potential solution is the variance adaptor architecture present in FastSpeech2, allowing for phone-level control of duration, pitch and energy; an engaging demonstration of a graphical user interface for the corresponding controls in a FastPitch model is also available. We would like to focus further efforts on designing a user interface for speech synthesis systems that satisfies ethical concerns while prioritizing language pedagogy as the fundamental use case.
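The essence of such phone-level control can be sketched as scaling or offsetting the variance adaptor's predicted values before decoding. The interface and values below are illustrative, not the actual FastSpeech2 implementation:

```python
# Sketch of phone-level prosody control in a FastSpeech2-style variance
# adaptor: per-phone predictions are scaled or offset by user-supplied
# controls before the decoder runs. Interface and values are illustrative.
def apply_controls(phones, durations, pitches, dur_scale=None, pitch_shift=None):
    dur_scale = dur_scale or {}
    pitch_shift = pitch_shift or {}
    out_d = [max(1, round(d * dur_scale.get(p, 1.0)))
             for p, d in zip(phones, durations)]
    out_p = [f + pitch_shift.get(p, 0.0) for p, f in zip(phones, pitches)]
    return out_d, out_p

phones = ["s", "e", "n"]
durations = [7, 11, 8]     # frames predicted by the duration module
pitches = [0.0, 1.2, 0.4]  # normalized pitch targets
# Lengthen the vowel and raise its pitch, e.g. for teacher-style emphasis.
d, p = apply_controls(phones, durations, pitches,
                      dur_scale={"e": 1.5}, pitch_shift={"e": 0.5})
print(d, p)
```

A classroom-facing interface would expose such controls through sliders or presets rather than raw scaling factors.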
In addition to fine-grained prosodic controls, we would like to explore the synthesis of hyperarticulated speech, as often used by language teachers when modelling the pronunciation of unfamiliar words or sounds for students. This style of speech typically involves adjustment beyond the parameters of pitch, duration, and energy, and is characterized by more careful enunciation of individual phones than is found in normal speech. This problem has parallels to the synthesis of Lombard speech, as used to improve intelligibility by speakers who find themselves in noisy environments.

Conclusion
In this paper, we presented the first neural speech synthesis systems for Indigenous languages spoken in Canada. Subjective listening tests showed encouraging results for the naturalness and acceptability of voices for two languages, Kanien'kéha and Gitksan, despite limited training data availability (3.5 hours and 35 minutes, respectively).
More extensive evaluation on English shows that the FastSpeech2 architecture can produce speech of similar quality to a Tacotron2 system using a fraction of the amount of speech usually considered necessary for neural speech synthesis. Notably, a FastSpeech2 voice trained on 1 hour of English speech achieved subjective naturalness ratings not significantly different from a Tacotron2 voice trained on 10 hours of data, while a 3-hour FastSpeech2 system showed no significant difference from a 24-hour Tacotron2 voice.
We attribute these results to the fact that FastSpeech2 learns input token durations from forced alignments, rather than jointly learning to align linguistic inputs to acoustic features alongside the acoustic feature prediction task, as in attention-based models such as Tacotron2.

A Compute, Accessibility, & Environmental Impact
For reasons of environmental impact and accessibility, reducing the amount of computation required for both training and inference is important for any neural speech synthesis system, particularly so for Indigenous languages.

A.1 Accessibility, Training & Inference Speed
While language revitalization efforts are mostly encouraging about integrating new technologies into curricula, there is a growing awareness of the potential harms. Beyond assessing the benefits and risks of introducing a new technology into language revitalization efforts, communities are concerned with the way the technology is researched and developed, as this process has the ability to empower or disempower language communities in equal measure (Alia, 2009; Brinklow et al., 2019). The current model for developing speech synthesis systems is not very equitable: models need to be run on GPUs by people with specialized training. For Indigenous communities to create speech synthesis tools for their languages, they should not be required to hand over their language data to a large government or corporate organization. A pre-training and fine-tuning pipeline could be attractive for this reason; communities could fine-tune their own models on a laptop if a multilingual, multi-speaker model were pre-trained on GPUs at a larger institution. Reducing the computational requirements for training and inference of these models could help ensure language communities have greater control over the process of development of these systems, less dependence on governmental organizations or corporations, and more sovereignty over their data (Keegan, 2019).

Strubell et al. (2019) present an argument for equitable access to computational resources for NLP research; put another way, we might say that systems which require less compute are more accessible. Reducing the number of parameters in a neural TTS model should translate to increased efficiency, and might make the model less prone to overfitting when training on limited amounts of data. As discussed in §4.2.2, we modified the base implementation of FastSpeech2 from Chien (2021), closely following the lightweight alternative discovered through neural architecture search in Luo et al. (2021).
These changes reduced the model from Chien (2021) from 35M to 11.6M parameters, reduced the stored model size from 417 MB to 135 MB, and significantly improved training and inference times, as summarized in Table 1. We saw a 33% improvement in average batch processing times on the GPU during training, and 64% on the CPU, which may be even more relevant for Indigenous language communities with limited computational resources. During inference, we saw a 15% speed-up on GPU and 57% on CPU.
Results were timed by running the model for 300 repetitions and taking the mean. The GPU (Tesla V100-SXM2 16GB) was warmed up for 10 repetitions before timing started, and PyTorch's built-in GPU synchronization method was used to synchronize timing (which occurs on the CPU) with the training or inference running on the GPU. CPU tests were performed on an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz with 4 cores and 16GB memory reserved. All timings used a batch size of 16.
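The timing methodology above can be sketched as follows, using a stand-in workload in pure Python. In the actual experiments the workload is a model's training or inference step, and on GPU each repetition is bracketed with `torch.cuda.synchronize()` so that the CPU-side timer does not stop before the asynchronous GPU kernels finish; the helper name and stand-in workload here are illustrative.

```python
# Sketch of a warmup-then-measure benchmark loop: run the workload for a
# number of untimed warmup repetitions, then time many repetitions and
# report the mean.

import time
import statistics

def benchmark(step_fn, warmup=10, repetitions=300):
    """Run step_fn with warmup, then return mean seconds per repetition."""
    for _ in range(warmup):           # untimed warmup repetitions
        step_fn()
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        step_fn()
        # for GPU workloads, torch.cuda.synchronize() would go here,
        # before reading the timer, so queued kernels are included
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)

# Stand-in "batch" of work in place of a real training/inference step:
mean_seconds = benchmark(lambda: sum(range(1000)))
```

The same loop, pointed at a training step or an inference call, reproduces the batch-timing numbers reported above.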

A.2 CO2 Consumption
Strubell et al. (2019) also argue that NLP researchers have a responsibility to disclose the environmental footprint of their research, in order for the community to effectively evaluate any gains and to allow for a more equitable and reproducible field. All experiments for this paper requiring a GPU were run on the Canadian General Purpose Science Cluster (GPSC) in Dorval, Quebec, each on a single Tesla V100-SXM2 16GB GPU. Strubell et al. (2019) provide the following equation for estimating power consumption:

$$p_t = \frac{1.58\,t\,(p_c + p_r + g \cdot p_g)}{1000} \quad (1)$$

where $t$ is time, $p_t$ is total power for training, $p_c$ is the average draw of CPU sockets, $p_r$ is the average DRAM memory draw, $g$ is the number of GPUs used in training, and $p_g$ is the average draw from GPUs. In our case, we estimate $t$ to be 1,541.98 hours after summing the time for experiments based on their log files. Note this estimate is based on the total number of hours spent running experiments from the M.Sc. dissertation this paper draws its experiments from; additional models were trained for experiments not discussed in this paper, so this is a generous overestimation of $t$. We take $p_c$ to be 75 watts, $p_r$ 6 watts, $g$ 1, and $p_g$ 250 watts. Grams of CO2 are then estimated as CO2 $= 34.5\,p_t$, since the average carbon footprint of electricity distributed in Quebec is estimated at 34.5 g CO2eq/kWh (Levasseur et al., 2021). This results in a total equivalent carbon consumption of 27,821.65 grams, roughly equivalent to driving a single passenger gas-powered vehicle for 110 kilometres according to the average rate of 404 grams/mile (EPA, 2019). This is a comparatively low CO2 consumption for over 1,500 GPU hours, largely due to the low CO2/kWh output of Quebec electricity when compared with the 2019 USA average of 400 g CO2eq/kWh (EPA, 2019).
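The estimate above can be reproduced directly by plugging the reported values into Equation (1); the mile-to-kilometre conversion factor is the only value not stated in the text.

```python
# Reproducing the CO2 estimate from Equation (1) with the reported values.

t = 1541.98    # GPU hours, summed from experiment log files
p_c = 75       # average CPU socket draw (watts)
p_r = 6        # average DRAM draw (watts)
g = 1          # number of GPUs
p_g = 250      # average GPU draw (watts)

p_t = 1.58 * t * (p_c + p_r + g * p_g) / 1000   # total power in kWh
co2_grams = 34.5 * p_t                          # Quebec grid: 34.5 g CO2eq/kWh

grams_per_mile = 404
km = co2_grams / grams_per_mile * 1.609344      # miles driven -> kilometres

print(round(co2_grams, 2))   # 27821.65 g
print(round(km, 1))          # 110.8 km, i.e. roughly 110 km
```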
However, CO2 equivalents are just a proxy for environmental impact and should not be understood to comprehensively account for social and environmental impact. Hydroelectric dam projects in Quebec, like the ones powering the GPSC, have a sordid and complex history in the province. Innu Nation Grand Chief Mary Ann Nui spoke to this when she commented that "over the past 50 years, vast areas of our ancestral lands were destroyed by the Churchill Falls hydroelectric project, people lost their land, their livelihoods, their travel routes, and their personal belongings when the area where the project is located was flooded. Our ancestral burial sites are under water, our way of life was disrupted forever. Innu of Labrador weren't informed or consulted about that project" (Innu-Atikamekw-Anishnabeg Coalition, 2020).

B Qualitative Results
Question: "Would you be comfortable with any of the voices you heard being played online, say for a digital dictionary or verb conjugator if no other recording existed?"

Kanien'kéha responses:
• Yes.
• yes
• Yes
• Out of the two voices I hear, the first was clearer to understand
• Yes, voices sounds really good!
• yes
Gitksan responses:
• yes
• Yes, but the ones that have the most whistling or buzzing would be annoying.
• maybe?? I think for a talking dictionary people do want to hear original pronunciations, but it could be a useful interim solution or a way to do short phrases!
• Assuming there is a single control for the last section of the survey/test, then some of the synthesised voices actually sound really good and I would be comfortable hearing those in an online dictionary where audio didn't exist for a particular word or phrase.
• yes
• The ones with higher ratings for sure, some of the lower ratings were just about the sound quality because that hampered hearing the speech quality. So I may have confounded the results with that, but point remains that it is always good to try to avoid poor audio recordings for online dictionaries
• Maybe/yes
• only ones rated fair or above fair
• Absolutely yes
• yes, as long as they were identified as synthesized