Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

“Transcription bottlenecks”, created by a shortage of effective human transcribers (i.e., a transcriber shortage), are one of the main challenges to endangered language (EL) documentation. Automatic speech recognition (ASR) has been suggested as a tool to overcome such bottlenecks. Following this suggestion, we investigated the effectiveness for EL documentation of end-to-end ASR, which, unlike Hidden Markov Model ASR systems, eschews linguistic resources but is more dependent on large-data settings. We open-source a Yoloxóchitl Mixtec EL corpus. First, we review our method for building an end-to-end ASR system in a way that is reproducible by the ASR community. We then propose a novice transcription correction task and demonstrate how ASR systems and novice transcribers can work together to improve EL documentation. We believe this combined methodology would mitigate the transcription bottleneck and the transcriber shortage that hinder EL documentation.


Introduction
Linguists have warned that half of the world's 7,000 languages could disappear by the end of the 21st century. Consequently, a concern with endangered language documentation has emerged from the convergence of interests of two major groups: (1) native speakers who wish to document their language and cultural knowledge for future generations; and (2) linguists who wish to document endangered languages to explore linguistic structures that may soon disappear. Endangered language (EL) documentation aims to address these concerns by developing and archiving corpora, lexicons, and grammars (Lehmann, 1999). There are two major challenges: (a) Transcription Bottleneck: The creation of EL resources through documentation is extremely challenging, primarily because the traditional method of preserving primary data is not simply audio recordings but also time-coded transcriptions. In a best-case scenario, texts are presented in interlinear format with aligned parses and glosses along with a free translation (Anastasopoulos and Chiang, 2017). But interlinear transcriptions are difficult to produce in meaningful quantities: (1) ELs often lack a standardized orthography (if they are written at all); (2) invariably, few speakers can accurately transcribe recordings. Even a highly skilled native speaker or linguist requires a minimum of 30 to 50 hours simply to transcribe one hour of recording (Michaud et al., 2014; Zahrer et al., 2020). Additional time is needed for parsing, glossing, and translation. This creates what has been called a "transcription bottleneck", a situation in which expert transcribers cannot keep up with the amount of material recorded for documentation.
(b) Transcriber Shortage: It is generally understood that any viable solution to the transcription bottleneck must involve native-speaker transcribers. Yet usually few, if any, native speakers have the skills (or time) to transcribe their language. Training new transcribers is one solution, but it is time-consuming, especially for languages with complicated phonology and morphology. The situation is distinct for major languages, for which transcription can be crowd-sourced to speakers with little need for specialized training (Das and Hasegawa-Johnson, 2016). In Yoloxóchitl Mixtec (YM; Glottocode=yolo1241, ISO 639-3=xty), the focus of this study, training is time-consuming: after a year of part-time transcription training, a proficient native speaker, Esteban Guadalupe Sierra, still has problems with certain phones, particularly tones and glottal stops. Documentation requires accurate transcriptions, a goal still beyond even the capability of an enthusiastic speaker with many months of training.
As noted, ASR has been proposed as a way to mitigate the transcription bottleneck and create increasingly extensive EL corpora. Previous studies first investigated HMM-based ASR for EL documentation (Ćavar et al., 2016; Mitra et al., 2016; Jimerson and Prud'hommeaux, 2018; Cruz and Waring, 2019; Thai et al., 2020; Zahrer et al., 2020; Gupta and Boulianne, 2020a). Along with HMM-based ASR, natural language processing and semi-supervised learning have been suggested as ways to produce morphological and syntactic analyses. As HMM-based systems have become more precise, they have been increasingly promoted as a mechanism to bypass the transcription bottleneck. However, the ASR context for ELs is quite distinct from that of major languages. Endangered languages seldom have sufficient existing lexicons to train an HMM system and invariably suffer from a dearth of skilled transcribers to create these necessary resources (Gupta and Boulianne, 2020b).
As we have confirmed in the present study, end-to-end ASR systems have shown results comparable to or better than conventional HMM-based methods (Graves and Jaitly, 2014; Chiu et al., 2018; Pham et al., 2019; Karita et al., 2019a). As end-to-end systems directly predict textual units from acoustic information, they save much of the effort of lexicon construction. Nevertheless, end-to-end ASR systems still suffer from limitations in training data. Attempts with resource-scarce languages show relatively high character error rates (CER) and word error rates (WER) (Thai et al., 2020; Matsuura et al., 2020; Hjortnaes et al., 2020). It has nevertheless become possible to use ASR with ELs to reduce significantly, though not eliminate, the need for human input and annotation to create acceptable ("archival-quality") transcriptions.
This Work: This work represents end-to-end ASR efforts on Yoloxóchitl Mixtec (YM), an endangered language spoken in Guerrero, Mexico. The YMC corpus comprises two sub-corpora. 1 The first ("YMC-EXP", the expert-transcribed corpus) includes 100 hours of transcribed speech that has been carefully checked for accuracy. We built an ESPNet (Watanabe et al., 2018) recipe that shows the whole process of constructing an end-to-end ASR system using the YMC-EXP corpus. 2 The second corpus ("YMC-NT", the native-trainee corpus) includes 8+ hours of additional recordings not included in the YMC-EXP corpus. This second corpus contains novice transcriptions with subsequent expert corrections, which have allowed us to evaluate the skill level of the novice. Both the YMC-EXP and YMC-NT corpora are publicly available at OpenSLR under a CC BY-NC-SA 3.0 license. 3

1 Specifically, we used material from the community of Yoloxóchitl (YMC), one of four communities in which YM is spoken.

The contributions of our research are:

• A new Yoloxóchitl Mixtec corpus to support ASR efforts in EL documentation.
• A reproducible workflow to build an end-to-end ASR system for EL documentation.
• A comparative study between HMM-based ASR and end-to-end ASR, demonstrating the feasibility of the latter. To test the framework's generalizability, we also experiment with another EL: Highland Puebla Nahuatl (Glottocode=high1278; ISO 639-3=azz).
• An in-depth analysis of errors in novice transcription and ASR. Considering the discrepancies in error types, we propose Novice Transcription Correction (NTC) as a task for the EL documentation community. A rule-based method and a voting-based method are proposed. 4 In clean speech, the best system reduces the relative word error rate of the novice transcription by 38.9%.

Corpus Description
In this section, we first introduce the linguistic specifics of YM and YMC. Then we discuss the recording settings. Since YM is a spoken language without a standardized written form, we next explain the transcription style designed for this language. Finally, we describe the corpus partition and some statistics on corpus size.

Linguistic Specifics for Yoloxóchitl Mixtec
Yoloxóchitl Mixtec is an endangered, relatively low-resource Mixtecan language. It is mainly spoken in the municipality of San Luis Acatlán, state of Guerrero, Mexico. It is one of some 50 languages in the Mixtec language family, which is part of a larger unit, Otomanguean, that Suárez (1983) considers "a 'hyper-family' or 'stock'." Mixtec languages (spoken in Oaxaca, Guerrero, and Puebla) are highly varied, resulting from approximately 2,000 years of diversification. YM is spoken in four communities: Yoloxóchitl, Cuanacaxtitlan, Arroyo Cumiapa, and Buena Vista. Mutual intelligibility among the four YM communities is high despite significant differences in phonology, morphology, and syntax. All villages have a simple segmental inventory but significant though still undocumented variation in tonal phonology. YMC (referring only to the Mixtec of the community of Yoloxóchitl [16.81602, -98.68597]) manifests 28 distinct tonal patterns on 1,451 identified bimoraic lexical stems. The tonal patterns carry a significant functional load with regard to the lexicon and inflection. For example, 24 distinct tonal patterns on the bimoraic segmental sequence [nama] yield 30 words (including six homophones). This ample tonal inventory presents challenges both to a native speaker learning to write and to an ASR system learning to recognize. Notably, it also introduces difficulties in constructing a language lexicon for training HMM-based systems.

Recording Settings
There are two corpora used in this study. The first (YMC-EXP) was used for ASR training. The second (YMC-NT) was used to train the novice speaker (i.e., to set up a curriculum for him to learn how to transcribe) and for Novice Transcription Correction. The YMC-EXP corpus comprises expert transcriptions used as the gold-standard reference for ASR development. The YMC-NT corpus has paired novice and expert transcriptions, as it was used to train and evaluate the novice writer.
The corpus used for ASR development comprises mostly conversational speech in two-channel recordings (split into single channels for training). Each conversation involves two speakers, each fitted with a separate head-worn microphone (usually a Shure SM10a). Over two dozen speakers (mostly male) contributed to the corpus. The topics and their distribution were varied (plants, animals, hunting and fishing, food preparation, ritual speech). The YMC-NT corpus comprises single-channel field recordings made with a Zoom H4n at the moment plants were collected during ethnobotanical research. Speakers were interviewed one after another; there is no overlap. However, the recordings often captured background sounds (crickets, birds) that we expected would harm ASR accuracy more than they appear to have done. The topic was always a discussion of plant knowledge (a theme of only 9% of the YMC-EXP corpus). As expected, there were many out-of-vocabulary (OOV) words (e.g., plant names not recorded elsewhere) in the YMC-NT corpus. 5

Corpus Transcription
(a) Transcription Level: The YMC-EXP corpus presently has two levels of transcription: (1) a practical orthography that represents underlying forms; (2) surface forms. The underlying form marks prefixes (separated from the stem by a hyphen), enclitics (separated by an = sign), and tone elision (with the elided tones in parentheses). All these "breaks" and phonological processes disappear in the surface form. For example, the underlying be3e3=an4 (house=3sgFem; 'her house') surfaces as be3ã4, and be3e(3)=2 ('my house') surfaces as be3e2. Another example is the completive prefix ni1-, which is separated from the stem as in ni1-xi3xi(3)=2 (completive-eat-1sgS; 'I ate'). The surface form would be written nĩ1xi3xi2. Again, processes such as nasalization, vowel harmony, palatalization, and labialization are not represented in the practical (underlying) orthography but are generated in the surface forms. The only phonological process encoded in the underlying orthography is tone elision, for which parentheses are used.
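The boundary-stripping and tone-elision behavior described above can be sketched in a few lines. This is only a toy illustration under our own simplifications: it handles hyphens, enclitic = signs, and parenthesized elided tones, while nasalization, vowel harmony, palatalization, and labialization are not modeled.

```python
import re

def underlying_to_surface_sketch(underlying: str) -> str:
    """Toy illustration (not the project's actual rules): derive a surface
    form from an underlying YM form by applying tone elision and removing
    morpheme boundaries. Tones are written as digits; elided tones appear
    in parentheses, e.g. 'be3e(3)=2'."""
    s = re.sub(r"\(\d+\)", "", underlying)   # tone elision: drop parenthesized tones
    s = s.replace("-", "").replace("=", "")  # remove prefix/enclitic boundary marks
    return s

# underlying 'be3e(3)=2' ('my house') -> surface 'be3e2'
```

Note that ni1-xi3xi(3)=2 comes out as ni1xi3xi2; the attested surface form nĩ1xi3xi2 additionally shows nasalization, which this sketch does not attempt to model.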
The practical, underlying orthography mentioned above was chosen as the default system for ASR training for three reasons: (1) it is easier than a surface representation for native speakers to write; (2) it represents morphological boundaries and thus serves to teach native speakers the morphology of their language; and (3) for a researcher interested in generating concordances for a corpus-based lexicographic project, it is much easier to discover the root for 'house' in be3e3=an4 and be3e(3)=2 than in the surface forms be3ã4 and be3e2.
(b) "Code-Switching" in YMC: Endangered, colonialized Indigenous languages often manifest extensive lexical input from a dominant Western language, and speakers often talk with "code-switching" (for lack of a better term). Yoloxóchitl Mixtec is no exception. Amith considered how best to write such forms and decided that Spanish-origin words would be written in Spanish and without tone when their phonology and meaning are close to those of Spanish. So Spanish docena appears over a dozen times in the corpus and is written tucena; it always has the meaning of 'dozen'. All month and day names are also written without tones. Note, however, that Spanish camposanto ('cemetery') is also found in the corpus and is pronounced pa3san4tu2. The decision was made to write this with tone markings as it differs significantly in pronunciation from the Spanish source word. In effect, words like pa3san4tu2 are considered loans into YM and are treated orthographically as Mixtec. Words such as tucena are considered "code-switching" and written without tones.
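Because "code-switching" tokens are written without tone numbers while Mixtec words (including loans such as pa3san4tu2) carry them, a simple orthographic heuristic can flag candidate code-switched words. This is our own illustrative assumption, not the corpus's actual tagging procedure.

```python
def is_codeswitch_candidate(word: str) -> bool:
    """Heuristic sketch (an assumption, not the corpus's tagging scheme):
    a word written with no tone digits is a "code-switching" candidate,
    since Mixtec words and loans like pa3san4tu2 carry tone numbers."""
    return not any(ch.isdigit() for ch in word)

# 'tucena' (Spanish docena, written toneless) is flagged; 'pa3san4tu2' is not.
```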
(c) Transcription Process: The initial time-aligned transcriptions were made in Transcriber (Barras et al., 1998). However, given that Transcriber cannot handle multiple tiers (e.g., transcription and translation, or underlying and surface orthographies), the Transcriber transcriptions were then imported into ELAN (Wittenburg et al., 2006) for further processing (e.g., correction, surface-form generation, translation).

Corpus Size and Partition
Though endangered, YMC does not suffer from the same level of resource limitations that affect most ASR work with ELs (Ćavar et al., 2016; Thai et al., 2020). The YMC-EXP corpus, developed over more than ten years, provided 100 hours of speech for the ASR training, validation, and test sets. There are 505 recordings from 34 speakers in the YMC-EXP corpus, and the transcriptions for YMC-EXP were all carefully proofed by an expert native-speaker linguist. As shown in Table 1, we offer a train-valid-test split with no overlap in content between the sets. The partition balances speakers and the relative size of each part. As introduced in Section 2.2, the YMC-NT corpus has both expert and novice transcriptions. It includes only three speakers, for a total of 8.36 hours. The recordings of two consultants are relatively clean and free of background noise, while the speech of the third is frequently affected by background noise. This seems coincidental, as all three were recorded together, one after the other in random order. Given this situation, we split the corpus into three sets: clean-dev (speaker EGS), clean-test (speaker CTB), and noise-test (speaker FEF; see Table 1).
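A speaker-aware partition of the kind described can be sketched by assigning whole speakers greedily to whichever split is furthest below its duration target. The actual YMC-EXP split was curated by hand; the following is only one illustrative, reproducible approach.

```python
import random
from collections import defaultdict

def speaker_disjoint_split(utts, ratios=(0.8, 0.1, 0.1), seed=0):
    """Illustrative speaker-disjoint train/valid/test partition (the actual
    YMC-EXP split was curated by hand). `utts` maps utterance id ->
    (speaker id, duration in seconds). Whole speakers are assigned
    greedily to the split currently furthest below its duration target,
    so no speaker appears in two splits."""
    by_spk = defaultdict(list)
    for utt, (spk, dur) in utts.items():
        by_spk[spk].append((utt, dur))
    speakers = sorted(by_spk)
    random.Random(seed).shuffle(speakers)
    total = sum(d for _, d in utts.values())
    targets = [r * total for r in ratios]
    splits, filled = ([], [], []), [0.0, 0.0, 0.0]
    for spk in speakers:
        # pick the split with the largest remaining duration deficit
        i = max(range(3), key=lambda k: targets[k] - filled[k])
        for utt, dur in by_spk[spk]:
            splits[i].append(utt)
            filled[i] += dur
    return splits  # (train, valid, test) utterance ids, speaker-disjoint
```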
The "code-switching" discussed in Section 2.3 (b) introduces different phonological representations and makes it difficult to train an HMM-based model using a language lexicon. Therefore, previous work (Mitra et al., 2016) using an HMM-based system for YMC did not consider phrases with "code-switching". To compare our model with their results, we used the same experimental corpus in our evaluation. Their corpus (YMC-EXP(-CS)), shown in Table 1, is a subset of YMC-EXP; the YMC-EXP(-CS) corpus does not contain "code-switching" phrases, i.e., phrases with words that were tagged as Spanish origin and transcribed without tone.

ASR Experiments
End-to-End ASR

As ESPNet (Watanabe et al., 2018) is widely used in open-source end-to-end ASR research, our end-to-end ASR systems are all constructed using ESPNet. For the encoder, we employed the conformer structure (Gulati et al., 2020), while for the decoder we used the transformer structure to condition on the full context, following Karita et al. (2019b). The conformer architecture is a state-of-the-art development of previous transformer-based encoding methods (Karita et al., 2019a; Guo et al., 2020). A comparison between the conformer and transformer encoders shows the value of applying state-of-the-art end-to-end ASR to ELs.

Experiments and Results
As discussed above, our end-to-end model applied an encoder-decoder architecture with a conformer encoder and a transformer decoder. The architecture of the model follows Gulati et al. (2020) while its configuration follows the aishell conformer recipe from ESPNet. 6 The experiment is reproducible using ESPNet.
As the end-to-end models are based on word pieces, we adopted CER and WER as evaluation metrics; they demonstrate system performance at different levels of granularity. But because the HMM-based systems decode with a word-based lexicon, we use only the WER metric when comparing against HMM systems. To examine the model thoroughly, we conducted several comparative experiments, discussed below.
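For reference, both metrics reduce to a Levenshtein edit distance computed over different units: characters for CER and whitespace-separated words for WER. A minimal implementation:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over a sequence of units (characters for CER,
    whitespace-split words for WER), using a single rolling row."""
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                           # deletion
                       d[j - 1] + 1,                       # insertion
                       prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return d[-1]

def error_rate(ref_units, hyp_units):
    # CER/WER = edits needed to turn the hypothesis into the reference,
    # divided by the reference length.
    return edit_distance(ref_units, hyp_units) / len(ref_units)

ref, hyp = "be3e3=an4 i3ta2", "be3e3=an4 i3ta3"
wer = error_rate(ref.split(), hyp.split())  # word level: 1 of 2 words wrong -> 0.5
cer = error_rate(list(ref), list(hyp))      # character level: 1 of 15 chars wrong
```

Because a single wrong tone digit flips an entire word, WER is typically much higher than CER on the same output, which is why we report both.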

(a) Comparison with HMM-based Methods:
We first compared our end-to-end method with the Deep Neural Network-Hidden Markov Model (DNN-HMM) methods proposed in Mitra et al. (2016). In that work, Gammatone filterbank (GFB), articulatory, and pitch features are used for the DNN-HMM model; a baseline DNN-HMM model uses Mel filterbank (MFB) features. In recent unpublished work, Kwon and Kathol developed a state-of-the-art CNN-HMM ASR model 7 for YMC based on the lattice-free Maximum Mutual Information (LF-MMI) approach, also known as the "chain model" (Povey et al., 2016). The experimental data for the above HMM-based models is the YMC-EXP(-CS) corpus discussed in Section 2.4. For the comparison, our end-to-end model adopted the same partition to ensure fair comparability with their results. Table 2 shows the comparison between the DNN-HMM systems and our end-to-end system on YMC-EXP(-CS). It indicates that, even without an external language lexicon, the end-to-end system significantly outperforms both the DNN-HMM baseline models and the state-of-the-art CNN-HMM model.
In Section 2.3 (b), we note that "code-switching" is invariably present in EL speech (e.g., YMC). Thus, ASR models built on "code-switching"-free corpora (like YMC-EXP[-CS]) are not practical for real-world usage. However, a language lexicon is available only for the YMC-EXP(-CS) corpus, so we cannot conduct HMM-based experiments with either the YMC-EXP or YMC-NT corpora.

6 See the Appendix for details about the model configuration.

7 See the Appendix for details about the model configuration.

(b) Comparison with Different End-to-End ASR Architectures:
We also conducted experiments comparing models with different encoders and decoders on the YMC-EXP corpus. For the Recurrent Neural Network-based (E2E-RNN) model, we followed the best hyper-parameter configuration discussed in Zeyer et al. (2018). For the Transformer-based (E2E-Transformer) model, the configuration from Karita et al. (2019b) was adopted. Both models shared the same data preparation process as the E2E-Conformer model. Table 3 compares the different end-to-end ASR architectures on the YMC-EXP corpus. 8 The E2E-Conformer obtained the best results, with significant WER improvements over the E2E-RNN and E2E-Transformer models. The E2E-Conformer's WER on YMC-EXP(-CS) is slightly lower than that obtained on the whole YMC-EXP corpus, despite the significantly smaller training set of YMC-EXP(-CS). Since the subset excludes Spanish words, "code-switching" may well be a problem to consider in ASR for endangered languages such as YM.

(c) Comparison between Transcription Levels:

In addition to comparing model architectures, we compared the impact of transcription levels on the ASR model. E2E-Conformer models with the same configurations were trained on both the surface and the underlying transcription forms discussed in Section 2.3. We also trained separate RNN language models for fusion, and unigram language models to extract word pieces, for each transcription level. Table 4 shows the E2E-Conformer results for both underlying and surface transcription levels. As introduced in Section 2.3, the surface form realizes several morphological and phonological processes that the underlying practical form leaves unexpressed. The results indicate that the end-to-end system is able to infer those morphological and phonological processes automatically and maintain a consistently low error rate.

(d) Comparison with Different Corpus Sizes:
As introduced in Section 1, most ELs are considered low-resource for ASR purposes. To measure the impact of resource availability on ASR accuracy, we trained the E2E-Conformer model on 10-, 20-, and 50-hour subsets of YMC-EXP. Table 5 shows the E2E-Conformer's performance on different amounts of training data. As corpus size increases, WER decreases significantly, and the model apparently still has the capacity to improve with more data. The results also indicate that our system can reach reasonable performance with 50 hours of data, a useful guideline when collecting a new EL corpus.
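Drawing the 10-, 20-, and 50-hour subsets amounts to sampling utterances until a target duration is reached. The paper does not specify how its subsets were selected; the following is one simple, reproducible approach.

```python
import random

def duration_subset(utts, hours, seed=0):
    """Sample utterances until roughly `hours` of audio are collected
    (illustrative; the paper does not state how its subsets were drawn).
    `utts` maps utterance id -> duration in seconds."""
    ids = sorted(utts)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the subset reproducible
    picked, total = [], 0.0
    for u in ids:
        if total >= hours * 3600:
            break
        picked.append(u)
        total += utts[u]
    return picked
```

Nesting the subsets (so the 10-hour set is contained in the 20-hour set, and so on) makes the learning curve in Table 5 attributable to data quantity alone; the fixed seed above gives that property for free.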
(e) Framework Generalizability: To test the end-to-end ASR systems' ability to generalize, we conducted the same end-to-end training and test procedures on another endangered language: Highland Puebla Nahuatl (high1278; azz). This corpus is also open access under the same CC license. 9 It comprises 954 recordings totaling 185 hours and 22 minutes, including 120 hours of data transcribed in ELAN; the remaining 65 hours, still only in Transcriber format, were not used for ASR training. 10 Table 6 shows the performance of three different end-to-end ASR architectures on Highland Puebla Nahuatl. For this language, the E2E-Conformer again performs better than the other models. Table 7 shows the E2E-Conformer's performance on different amounts of training data for Highland Puebla Nahuatl. We again observe that 50 hours is a reasonable corpus size for an EL, consistent with the experiments in Table 5. These experiments indicate that end-to-end ASR systems can be applied consistently across ELs.

Novice Transcription Correction
Finally, this paper presents novice transcription correction (NTC) as a task for EL documentation. That is, in this experiment we explore not only the possibility of using ASR to enhance the accuracy of a YM novice transcription, but also that of combining novice transcription and ASR to achieve accuracy surpassing that of either component alone. Below, we first analyze patterns manifested in novice transcriptions. Next, we introduce two baselines that fuse ASR hypotheses with the novice transcription for the NTC task.

Novice Transcription Error
As mentioned in Section 1, transcriber shortages have been a severe challenge for EL documentation. Before 2019, only the native speaker linguist, Rey Castillo García, could accurately transcribe the segments and tones of YMC. To mitigate the YMC transcriber shortage, in 2019 Castillo began to train another speaker, Esteban Guadalupe Sierra. First, a computer course was designed to incrementally teach Guadalupe segmental and tonal phonology.
In the next stage, he was given YMC-NT corpus recordings to transcribe. Compared to the paired expert transcription, the novice achieved a CER of 6.0% on clean-dev (defined in Table 1). However, it is not feasible to spend many months training speakers with no literacy skills to acquire the transcription proficiency achieved by Guadalupe in our project. Moreover, even with a 6.0% CER, there are still enough errors to require significant annotation and correction by the expert, Castillo. The state-of-the-art ASR system (the E2E-Conformer shown in Table 3) gets an 8.2% CER on the clean-dev set, more errors than the novice. So for YMC, ASR is still not a good enough substitute for a proficient novice. As Amith and Castillo worked with the novice, they observed recurring error types, which they worked to correct by giving the novice exercises focused on these transcription shortcomings. The end-to-end ASR, however, demonstrated a different pattern of errors. For example, it developed a fair understanding of the rules for tone elision, marked by parentheses around the elided tones. Rather than over-specify the NTC correction algorithm, we first analyzed the error-type distribution on the clean-dev set of the YMC-NT corpus, as shown in Table 8.
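An error-type distribution of this kind can be approximated by classifying the mismatched characters between expert and novice transcriptions. The categories below (tone digits, glottal stops, boundary symbols, other segments) are our own rough proxy, not the exact taxonomy of Table 8.

```python
import difflib

def error_profile(expert: str, novice: str) -> dict:
    """Rough sketch of an error-type tally between an expert reference and
    a novice transcription. Categories are our own assumption."""
    counts = {"tone": 0, "glottal": 0, "boundary": 0, "segment": 0}
    sm = difflib.SequenceMatcher(a=expert, b=novice)
    for op, a0, a1, b0, b1 in sm.get_opcodes():
        if op == "equal":
            continue
        # expert side for substitutions/deletions, novice side for insertions
        chars = expert[a0:a1] if op in ("replace", "delete") else novice[b0:b1]
        for ch in chars:
            if ch.isdigit():
                counts["tone"] += 1       # wrong or missing tone number
            elif ch == "'":
                counts["glottal"] += 1    # glottal stop errors
            elif ch in "-=()":
                counts["boundary"] += 1   # prefix/enclitic/elision marks
            else:
                counts["segment"] += 1    # consonant/vowel errors
    return counts
```

Aggregating these counts over a development set gives a per-category error distribution from which correction rules can be prioritized.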

Novice-ASR Fusion
A comparison of the error types in each transcription (novice and ASR) demonstrated consistent patterns and led us to hypothesize that a fusion system might automatically correct many of these errors. Two baseline methods are examined for the fusion: a voting-based system (Fiscus, 1997) and a rule-based system.
The voting-based system follows the definition of Fiscus (1997), combining hypotheses from different ASR models with the novice transcription.
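A much-simplified sketch of such a voting scheme aligns every hypothesis against one backbone hypothesis and takes a per-slot majority vote. Real ROVER builds a word transition network with NULL transition arcs, which this sketch omits.

```python
from collections import Counter
import difflib

def rover_vote(backbone, others):
    """Simplified ROVER-style voting (a sketch; real ROVER builds a word
    transition network with NULL arcs, Fiscus 1997). Hypotheses are word
    lists; `backbone` anchors the alignment, and each of its slots is
    decided by majority vote over the words aligned to it."""
    candidates = [[w] for w in backbone]
    for hyp in others:
        sm = difflib.SequenceMatcher(a=backbone, b=hyp)
        for op, a0, a1, b0, b1 in sm.get_opcodes():
            if op in ("equal", "replace"):
                # collect aligned words into the backbone's candidate slots
                for k in range(min(a1 - a0, b1 - b0)):
                    candidates[a0 + k].append(hyp[b0 + k])
    return [Counter(c).most_common(1)[0][0] for c in candidates]
```

If two ASR hypotheses agree on a word where the backbone (e.g., the novice transcription) differs, the ASR reading wins the vote; otherwise the backbone word stands.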
The framework of rule-based fusion is shown in Figure 1. The rules are defined over different linguistic units: words, syllables, and characters. They assume a hierarchical alignment between the novice transcription and the ASR hypotheses, and are applied from the word level to the syllable level to the character level. The rules were developed from continual evaluation of the novice's progress; they would therefore differ for other novice transcribers, but could be discovered in the same way.

Syllable Rules: If a novice syllable is tone initial, use the corresponding ASR syllable. If the novice and the ASR have identical segments but different tones, use the ASR tones. When an ASR syllable has CVV or CV'V and its corresponding novice syllable has CV, 11 use the ASR syllable (CVV or CV'V). If the tone from either transcription system follows a consonant (except a stem-final n), use the other system's transcription.

Character Rules: If the ASR hypothesis has a hyphen, equals sign, parentheses, or glottal stop that is absent from the novice transcription, always trust the ASR and maintain these symbols in the final transcription.

We apply the edit distance (Wagner and Fischer, 1974) to find the alignment between the ASR model hypothesis {C_1, ..., C_n} and the novice transcription {C'_1, ..., C'_m}. The losses L_I, L_D, and L_S enter the dynamic program as the insertion, deletion, and substitution losses, respectively. In the naive setting, L_I and L_D are both set to 1, and L_S is set to 1 if C_i differs from C'_j and 0 otherwise. This setting is computationally efficient, but it does not consider the degree of mismatch between C_i and C'_j. We therefore adopt a hierarchical dynamic alignment: the character alignment follows the naive setting, while L_S(C_i, C'_j) for syllable alignment is defined as the character-level edit distance between syllables C_i and C'_j, normalized by syllable length (where |C_i| is the length of syllable C_i):

L_S(C_i, C'_j) = EditDistance(C_i, C'_j) / max(|C_i|, |C'_j|)
Similarly, L_S(C_i, C'_j) for word alignment is defined analogously, based on the syllable-level alignment.
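The hierarchy can be implemented with a single edit-distance routine whose substitution cost is pluggable: unit cost at the character level, normalized character-level distance at the syllable level, and normalized syllable-level distance at the word level. A sketch (normalizing by the longer unit is our assumption):

```python
def edit_distance(a, b, sub_cost=lambda x, y: float(x != y)):
    """Edit distance with unit insertion/deletion losses (L_I = L_D = 1)
    and a pluggable substitution loss L_S."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1,  # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]

def syllable_sub_cost(s1, s2):
    # L_S at the syllable level: character edit distance normalized by
    # syllable length (normalizing by the longer syllable is our assumption).
    return edit_distance(s1, s2) / max(len(s1), len(s2), 1)

def word_sub_cost(w1, w2):
    # L_S at the word level, built on the syllable alignment the same way;
    # words are represented as lists of syllables.
    return edit_distance(w1, w2, syllable_sub_cost) / max(len(w1), len(w2), 1)
```

With these costs, a syllable pair differing only in one tone digit receives a small fractional substitution loss rather than a full unit, so the alignment prefers matching near-identical syllables over inserting and deleting them.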

Experimental Settings
The novice transcription, the E2E-Transformer model, and the E2E-Conformer model were considered as baselines for the NTC task. To evaluate the system under reduced training data, we also report results for an E2E-Conformer trained on a 50-hour subset. For the end-to-end models, we adopted the trained models from Section 3 with the same decoding set-ups. To test the effectiveness of the hierarchical dynamic alignment, we evaluated two fusion systems, Fusion1 and Fusion2. The Fusion1 system uses the naive edit-distance setting, while the Fusion2 system adopts the hierarchical dynamic alignment. Both fusion systems apply the rules defined in Section 4.2. Two configurations of the voting-based method were tested: "ROVER" combines three hypotheses (the E2E-Transformer, the E2E-Conformer, and the novice), while "ROVER-Fusion2" combines the Fusion2 output with the above three.

Results
As shown in Table 9, the voting-based and rule-based methods all significantly reduce the novice's errors for clean speech. 12 However, on the noise-test set, the novice transcription is the most robust. Overall, the ROVER system (model I) has a lower WER, while the ROVER-Fusion2 system (model J) reaches a lower CER. Model J significantly reduces specific errors, including tone errors (25%), enclitic errors (50%), and parentheses errors (87.5%). In addition, models D, F, and H indicate that the system can still reduce clean-environment novice errors using ASR models trained on a 50-hour subset of the YMC-EXP corpus.
As discussed in Section 4, novice and ASR transcriptions manifest distinct error patterns and can thus complement each other. Table 9 shows that our proposed rule-based and voting-based fusion methods can substantially reduce the errors made by the novice transcriber, suggesting that such fusion methods can help mitigate the transcriber shortage. However, we should note that noisy recording conditions negatively affect a fusion approach, as ASR does poorly under such conditions (>23% CER); in those conditions the novice transcription alone (<8.5% CER) is much more accurate and should be relied on by itself.

Conclusion and Future Work
This work presents an open-source endangered language corpus of Yoloxóchitl Mixtec and a comparative, reproducible study of various approaches to end-to-end ASR. We demonstrate that end-to-end approaches are feasible and yield results comparable to conventional HMM approaches, which require resources, such as language lexicons, that are unnecessary for end-to-end ASR. Additionally, we propose novice transcription correction as a potential task for ASR in EL documentation. We examine two methods for this task. The first is a rule-based approach that uses hierarchical dynamic alignment and linguistic rules to perform novice-ASR hybridization. The second is a voting-based method that combines hypotheses from the novice and end-to-end ASR systems. Empirical studies on the YMC-NT corpus indicate that both methods significantly reduce the CER/WER of the novice transcription for clean speech.
The above discussion suggests that a useful approach to EL documentation using both human and computational (ASR) resources might focus on training each system (human and ASR) for particular transcription tasks. If we know from the start that ASR will be used to correct novice transcriptions in areas of difficulty, we could train an ASR system to maximize accuracy in those areas that challenge novice learning.
The lexicon is phone-based. The transcriptions are mapped to surface representations and then to phones (a total of 197 phones, as each tone on a given vowel is a different phone). There are 22,465 total entries in the lexicon. The chain model is trained with a sequence-level objective function and operates with an output frame rate of 30 ms, three times longer than the previous standard. The longer frame rate increases decoding speed, which in turn makes it possible to operate with a significantly deeper DNN architecture for acoustic modeling. The best results were achieved with a neural network based on the ResNet architecture (Szegedy et al., 2017). This consists of an initial layer for Linear Discriminant Analysis (LDA) transformation and subsequent alternating 160-dimensional bottleneck layers, adding up to 45 layers in total. The DNN acoustic model is then compiled with a 4-gram language model into a weighted finite-state transducer for word-sequence decoding.