Avengers, Ensemble! Benefits of ensembling in grapheme-to-phoneme prediction

We describe three baseline-beating systems for the high-resource English-only sub-task of the SIGMORPHON 2021 Shared Task 1: a small ensemble that Dialpad's speech recognition team uses internally, a well-known off-the-shelf model, and a larger ensemble model comprising these and others. We additionally discuss the challenges related to the provided data, along with the processing steps we took.


Introduction
The transduction of sequences of graphemes to phones or phonemes, that is, from characters used in orthographic representations to characters used to represent minimal units of speech, is a core component of many tasks in speech science and technology. This grapheme-to-phoneme conversion (or g2p) may be used, e.g., to automate or scale the creation of digital lexicons or pronunciation dictionaries, which are crucial to FST-based approaches to automatic speech recognition (ASR) and synthesis (Mohri et al., 2002).
The SIGMORPHON 2021 Workshop included a Shared Task on g2p conversion, comprising three sub-tasks. The low- and medium-resource tasks were multilingual, while the high-resource task was English-only. This paper provides an overview of the three baseline-beating systems submitted by the Dialpad team for the high-resource sub-task.

Sub-task 1: high-resource, English-only
The organizers provided 41,680 lines of data in total: 33,344 for training, and 4,168 each for development and test. The data consists of word/pronunciation pairs (word-pron pairs, henceforth), where words are sequences of graphemes and pronunciations are sequences of characters from the International Phonetic Alphabet (International Phonetic Association, 1999). The data was derived from the English portion of the WikiPron database (Lee et al., 2020), a massively multilingual resource of word-pron pairs extracted from Wiktionary and subject to some manual QA and post-processing.
The baseline model provided was the 2nd-place finisher from the 2020 g2p shared task: an ensembled neural transition model that operates over edit actions and is trained via imitation learning (Makarov and Clematide, 2020).
Evaluation scripts were provided to compute word error rate (WER), the percentage of words for which the output sequence does not match the gold label.
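This exact-match WER can be sketched in a few lines of Python (the function name and data layout here are ours, not those of the organizers' evaluation script):

```python
def wer(predictions, gold):
    """predictions, gold: parallel lists of phone sequences.
    Returns the percentage of words whose predicted sequence
    does not exactly match the gold transcription."""
    assert len(predictions) == len(gold)
    errors = sum(1 for p, g in zip(predictions, gold) if p != g)
    return 100.0 * errors / len(gold)

# One mismatch out of two words -> WER of 50.0
print(wer([["k", "ɑ", "t"], ["k", "ɔ", "t"]],
          [["k", "ɑ", "t"], ["k", "ɑ", "t"]]))  # 50.0
```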
Notwithstanding the baseline's strong prior performance and the amount of data available, the task proved to be challenging; the baseline system achieved development and test set WERs of 45.13 and 41.94, respectively. We discuss possible reasons for this below.

Data-related challenges
Wiktionary is an open, collaborative, public effort to create a free dictionary in multiple languages. Anyone can create an account and add or amend words, pronunciations, etymological information, etc. As with most user-generated content, this is a noisy method of data creation and annotation.
Even setting aside the theory-laden question of when or whether a given word should be counted as English, the open nature of Wiktionary means that speakers of different variants or dialects of English may submit varying or conflicting pronunciations for sets of words. For example, some transcriptions indicate that the users who input them have the cot/caught merger while others do not: in the training data, "cot" is transcribed /k ɑ t/ and "caught" is transcribed /k ɔ t/, indicating a split, but "aughts" is transcribed /ɑ t s/, indicating a merger. There is also variation in the narrowness of transcription. For example, some transcriptions include aspiration on stressed-syllable-initial stops while others do not; cf. "kill" /kʰ ɪ l/ and "killer" /k ɪ l ɚ/.
Typically the set of English phonemes is taken to number somewhere between 38 and 45, depending on variant/dialect (McMahon, 2002). In exploring the training data, we found a total of 124 symbols in the training set transcriptions, many of which appeared in only a small number (1-5) of transcriptions. To reduce the effect of this long tail of infrequent symbols, we normalized the training set.
The main source of symbols in the long tail was variation in the broadness of transcription: vowels were sometimes but not always transcribed with nasalization before a nasal consonant, aspiration on word-initial voiceless stops was inconsistently indicated, phonetic length was occasionally indicated, etc. There were also some cases of erroneous transcription, which we uncovered by looking at the lowest-frequency phones and the word-pron pairs in which they appeared. For instance, the IPA /j/ was transcribed as /y/ twice, the voiced alveolar approximant /ɹ/ was mistranscribed as the trill /r/ over 200 times, and we found a handful of cases where a phone was transcribed with a Unicode symbol not used in the IPA at all.
In most of these cases, the rare variant was at least two orders of magnitude less frequent than the common variant of the symbol. There was, however, one class of sounds where the variation was less dramatically skewed: the consonants /m/, /n/, and /l/ appeared in unstressed syllables following schwa (/əm/, /ən/, /əl/) roughly one order of magnitude more frequently than their syllabic counterparts (/m̩/, /n̩/, /l̩/), and we opted not to normalize these. Normalizing the syllabic variants would have resulted in more consistent g2p output, but it would likely also have penalized our performance on the uncleaned test set. In the end, our training data contained 47 phones (plus end-of-sequence and UNK symbols for some models).
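As a concrete illustration, a normalization pass along these lines might look as follows. The mapping table below covers only the examples discussed above (erroneous /y/ and /r/, and stripped diacritics); it is a sketch, not our full normalization procedure:

```python
# Illustrative normalization of long-tail transcription variants.
# SYMBOL_MAP and STRIP_DIACRITICS are examples only, not our full tables.

SYMBOL_MAP = {"y": "j", "r": "ɹ"}            # erroneous symbols -> intended IPA
STRIP_DIACRITICS = {"ʰ", "ː", "\u0303"}      # aspiration, length, nasalization

def normalize(pron):
    """pron: list of phone symbols; returns the normalized list."""
    out = []
    for phone in pron:
        # Drop broad-transcription diacritics, then remap known errors.
        base = "".join(ch for ch in phone if ch not in STRIP_DIACRITICS)
        out.append(SYMBOL_MAP.get(base, base))
    return out

print(normalize(["kʰ", "ɪ", "l"]))  # ['k', 'ɪ', 'l']
print(normalize(["r", "æ", "t"]))   # ['ɹ', 'æ', 't']
```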

Models
We trained and evaluated several models for this task, including publicly available, in-house, and custom-developed systems, along with various ensembling permutations. In the end, we submitted three sets of baseline-beating results. The organizers assigned sequential identifiers to multiple submissions (e.g., Dialpad-N); we include these in the discussion of our entries below for ease of subsequent reference.

The Dialpad model (Dialpad-2)
Dialpad uses a g2p system internally for scalable generation of novel lexicon additions. We were motivated to enter this shared task as a means of assessing potential areas of improvement for our system; in order to do so we needed to assess our own performance as a baseline.
This model is a simple majority-vote ensemble of 3 existing publicly available g2p systems: Phonetisaurus (Novak et al., 2012), a WFST-based model; Sequitur (Bisani and Ney, 2008), a joint-sequence model trained via EM; and a neural sequence-to-sequence model developed at CMU as part of the CMUSphinx toolkit (see subsection 3.2). As Dialpad uses a proprietary lexicon and phoneset internally, we retrained all three models on the cleaned version of the shared task training data, retaining default hyperparameters and architectures.
In the end, this ensemble achieved a test set WER of 41.72, narrowly beating the baseline (results are discussed in more depth in Section 4).

A strong standalone model: CMUSphinx
CMUSphinx is a set of open systems and tools for speech science developed at Carnegie Mellon University, including a g2p system (https://github.com/cmusphinx/g2p-seq2seq). It is a neural sequence-to-sequence model (Sutskever et al., 2014) that is Transformer-based (Vaswani et al., 2017), written in TensorFlow (Abadi et al., 2015). A pre-trained 3-layer model is available for download, but it is trained on a dictionary that uses ARPABET, a substantially different phoneset from the IPA used in this challenge. For this reason we retrained a model from scratch on the cleaned version of the training data. This model achieved a test set WER of 41.58, again narrowly beating the baseline. Interestingly, it outperformed the Dialpad model that incorporates it, suggesting that Phonetisaurus and Sequitur add more noise than signal to predicted outputs, to say nothing of their increased computational resources and training time. More generally, this points to the CMUSphinx seq2seq model as a simple and strong baseline against which future g2p research should be assessed.

A large ensemble (Dialpad-1)
In the interest of seeing what results could be achieved via further naive ensembling, our final submission was a large ensemble comprising two variations on the baseline model, the Dialpad-2 ensemble discussed above, and two additional seq2seq models, one using LSTMs and the other Transformer-based. The latter additionally incorporated a sub-word extraction method designed to bias a model's input-output mapping toward "good" grapheme-phoneme correspondences.
The method of ensembling for this model is word-level majority voting. We select the most common prediction when there is a majority prediction (i.e., one prediction has more votes than all of the others). If there is a tie, we pick the prediction generated by the best standalone model, with respect to each model's performance on the development set.
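This voting scheme can be sketched as follows (function and variable names are ours):

```python
from collections import Counter

def ensemble_predict(predictions, model_ranking):
    """predictions: dict mapping model name -> predicted phone sequence
    (as a tuple). model_ranking: model names ordered by dev-set WER,
    best first. Returns the majority prediction, breaking ties in
    favour of the best standalone model."""
    counts = Counter(predictions.values())
    best_count = counts.most_common(1)[0][1]
    tied = {pred for pred, c in counts.items() if c == best_count}
    if len(tied) == 1:
        return tied.pop()
    # Tie: fall back to the best-ranked model whose prediction is tied.
    for model in model_ranking:
        if predictions[model] in tied:
            return predictions[model]

preds = {"phonetisaurus": ("k", "æ", "t"),
         "sequitur": ("k", "æ", "t"),
         "cmusphinx": ("k", "ɑ", "t")}
print(ensemble_predict(preds, ["cmusphinx", "phonetisaurus", "sequitur"]))
# -> ('k', 'æ', 't'): the majority wins even over the best single model
```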
This collection of models achieved a test set WER of 37.43, a 10.75% relative reduction in WER over the baseline model. As shown in Table 1, although a majority of the component models did not outperform the baseline, there was sufficient agreement across different examples that a simple majority voting scheme was able to leverage the models' varying strengths effectively. We discuss the components and their individual performance below and in Section 4.

Baseline variations
The "foundation" of our ensemble was the default baseline model (Makarov and Clematide, 2018), which we trained using the raw data and default settings in order to reproduce the baseline performance published by the organizers. We included this in order to individually assess the effect of additional models on overall performance.
In addition to this default base, we added a larger version of the same model, for which we increased the number of encoder and decoder layers from 1 to 3, and the hidden dimensions from 200 to 400.

biLSTM+attention seq2seq
We conducted experiments with an RNN seq2seq model, comprising a biLSTM encoder, LSTM decoder, and dot-product attention. We conducted several rounds of hyperparameter optimization over layer sizes, optimizer, and learning rate. Although none of these models outperformed the baseline, a small network proved to be efficiently trainable (2 CPU-hours) and improved the ensemble results, so it was included.
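For concreteness, the dot-product attention step used in such a model can be sketched in plain Python; this is a didactic sketch of the mechanism, not our actual training code:

```python
import math

def dot_product_attention(decoder_state, encoder_states):
    """decoder_state: vector (list of floats); encoder_states: list of
    encoder hidden-state vectors. Returns the attention-weighted
    context vector that the decoder consumes at each step."""
    # Unnormalized alignment scores: dot product with each encoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Context vector: weighted sum of encoder states.
    dim = len(encoder_states[0])
    return [sum(w * h[i] for w, h in zip(weights, encoder_states))
            for i in range(dim)]
```

With a zero decoder state all scores are equal, so the context vector is simply the average of the encoder states.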

PAS2P: Pronunciation-assisted sub-words to phonemes
Sub-word segmentation is widely used in ASR and neural machine translation tasks, as it reduces the cardinality of the search space relative to word-based models and mitigates the issue of OOVs. The use of sub-words for g2p tasks has also been explored; e.g., Reddy and Goldsmith (2010) develop an MDL-based approach to extracting sub-word units for g2p.
Recently, a pronunciation-assisted sub-word model (PASM) (Xu et al., 2019) was shown to improve the performance of ASR models. We experimented with pronunciation-assisted sub-words to phonemes (PAS2P), leveraging the training data and a reparameterization of the IBM Model 2 aligner (Brown et al., 1993) dubbed fast_align (Dyer et al., 2013; https://github.com/clab/fast_align). The alignment model is used to find an alignment of sequences of graphemes to their corresponding phonemes. We follow a process similar to that of Xu et al. (2019) to find consistent grapheme-phoneme pairs and to refine those pairs for the PASM model. We also collect grapheme sequence statistics and marginalize them by summing the counts of each type of grapheme sequence over all possible types of phoneme sequences. These counts are the weights of each sub-word sequence.
Given a word and the weights for each sub-word, the segmentation process is a search problem over all possible sub-word segmentations of that word. We solve this search problem by building weighted FSTs (using Pynini; Gorman, 2016) of a given word and the sub-word vocabulary, and finding the best path through this lattice. For example, the word "thoughtfulness" would be segmented by PASM as "th_ough_t_f_u_l_n_e_ss", and this would be used as the input to the PAS2P model rather than the full sequence of individual graphemes.
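The best-path search can be illustrated with a simple dynamic program over a weighted sub-word vocabulary. The weights below are invented for illustration, and our implementation uses weighted FSTs rather than this explicit DP, but the result is the same kind of maximum-weight segmentation:

```python
def segment(word, weights):
    """weights: dict mapping sub-word -> score (higher is better,
    e.g. log-counts). Returns the highest-scoring segmentation of
    `word` into sub-words: the best path through the lattice."""
    n = len(word)
    best = [None] * (n + 1)      # best[(prefix length)] = (score, segmentation)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            sub = word[j:i]
            if sub in weights and best[j] is not None:
                cand = (best[j][0] + weights[sub], best[j][1] + [sub])
                if best[i] is None or cand[0] > best[i][0]:
                    best[i] = cand
    return best[n][1] if best[n] else None

# Illustrative weights favouring the multi-grapheme units "th" and "ough".
weights = {"th": 5.0, "ough": 6.0, "t": 1.0, "h": 1.0,
           "o": 1.0, "u": 1.0, "g": 1.0}
print("_".join(segment("thought", weights)))  # th_ough_t
```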
Finally, the PAS2P transducer is a Transformer-based sequence-to-sequence model trained using the ESPnet end-to-end speech processing toolkit (Watanabe et al., 2018), with pronunciation-assisted sub-words as inputs and phones as outputs. The model has 6 encoder and decoder layers with 2048 units, and 4 attention heads with 256 units. We use dropout with a probability of 0.1 and label smoothing with a weight of 0.1 to regularize the model. This model achieved WERs of 44.84 and 43.40 on the development and test sets, respectively.

Results
Our main results are shown in Table 1, where we show both development and test set WER for each individual model, in addition to the submitted ensembles. Notably, many of the ensemble components do not beat the baseline WER, but nonetheless serve to improve the ensembled models.

Additional experiments
We experimented with different ensembles and found that incorporating models with different architectures generally improves overall performance. In the standalone results, only the top three models beat the baseline WER, but adding additional models with higher WER than the baseline continues to reduce overall WER. Table 2 shows the effect of this progressive ensembling, from our top-3 models to our top-7 (i.e. the ensemble for the Dialpad-1 model).

Edit distance-based voting
In addition to varying our ensemble sizes and components, we investigated a different ensemble voting scheme, in which ties are broken using edit distance when there is no 1-best majority option. That is, in the event of a tie, instead of selecting the prediction made by the best standalone model (our usual tie-breaking method), we select the prediction that minimizes the edit distance to all other predictions that have the same number of votes. The idea of this method is to maximize sub-word-level agreement. Although this method did not show clear improvements on the development set, we found after submission that it narrowly but consistently outperformed the top-N ensembles on the test set (see Table 3).
Among the errors our systems made, our models predicted, for example, that "acres" (/e ɪ k ɚ z/) rhymes with "degrees", and that "beret" has a /t/ sound in it. In each of these cases, there were either not enough samples in the training set to reliably learn the relevant grapheme-phoneme correspondence, or else a conflicting (but correct) correspondence was over-represented in the training data.
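The edit-distance tie-breaking described in this section can be sketched as follows (function names are ours):

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (a[i - 1] != b[j - 1]))  # (mis)match
            prev = cur
    return dp[n]

def break_tie(tied):
    """tied: list of tied phone-sequence predictions; returns the one
    minimizing total edit distance to the other tied predictions,
    i.e. the prediction with the most sub-word-level agreement."""
    return min(tied, key=lambda p: sum(edit_distance(p, q) for q in tied))

print(break_tie([("k", "æ", "t"), ("k", "ɑ", "t"), ("k", "ɑ", "p")]))
# -> ('k', 'ɑ', 't'), the "centroid" of the tied predictions
```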

Conclusion
We presented and discussed three g2p systems submitted for the SIGMORPHON 2021 English-only shared sub-task. In addition to finding a strong off-the-shelf contender, we showed that naive ensembling remains a strong strategy in supervised learning tasks such as g2p, and that simple majority-voting schemes can often leverage the respective strengths of sub-optimal component models, especially when diverse architectures are combined. We also provided further evidence for the usefulness of linguistically informed sub-word modeling as an input transformation for speech-related tasks. Finally, we discussed additional experiments whose results were not submitted, indicating the benefit of exploring top-N model vs. ensemble trade-offs, and demonstrating the potential benefit of an edit-distance-based tie-breaking method for ensemble voting.
Future work includes further search for the optimal trade-off between ensemble size and performance, as well as additional exploration of the edit-distance voting scheme, and more sophisticated ensembling/voting methods, e.g. majority voting at the phone level on aligned outputs.