Linguistic Knowledge in Multilingual Grapheme-to-Phoneme Conversion

This paper documents the UBC Linguistics team’s approach to the SIGMORPHON 2021 Grapheme-to-Phoneme Shared Task, concentrating on the low-resource setting. Our systems expand the baseline model with simple modifications informed by syllable structure and error analysis. In-depth investigation of test-set predictions shows that our best model rectifies a significant number of mistakes compared to the baseline prediction, besting all other submissions. Our results validate the view that careful error analysis in conjunction with linguistic knowledge can lead to more effective computational modeling.


Introduction
With speech technologies becoming ever more prevalent, grapheme-to-phoneme (G2P) conversion is an important part of the pipeline. G2P conversion refers to mapping a sequence of orthographic representations in some language to a sequence of phonetic symbols, often transcribed in the International Phonetic Alphabet (IPA). This is often an early step in tasks such as text-to-speech, where the pronunciation must be determined before any speech is produced. An example of such a G2P conversion, in Amharic, is illustrated below: 'Amharic'
For the second year, one of the SIGMORPHON shared tasks concentrates on G2P. This year, the task is further broken into three subtasks of varying data levels: high-resource (33K training instances), medium-resource (8K training instances), and low-resource (800 training instances). Our focus is on the low-resource subtask. The language data and associated constraints in the low-resource setting will be summarized in Section 3.1; the reader interested in the other two subtasks is referred to Ashby et al. (this volume) for an overview.
In this paper, we describe our methodology and approaches to the low-resource setting, including insights that informed our methods. We conclude with an extensive error analysis of the effectiveness of our approach. This paper is structured as follows: Section 2 overviews previous work on G2P conversion. Section 3 gives a description of the data in the low-resource subtask, evaluation metric, and baseline results, along with the baseline model architecture. Section 4 introduces our approaches as well as the motivation behind them. We present our results in Section 5 and associated error analyses in Section 6. Finally, Section 7 concludes our paper.

Previous Work on G2P conversion
The techniques for performing G2P conversion have long been coupled with contemporary machine learning advances. Early paradigms utilize joint sequence models that rely on the alignment between grapheme and phoneme, usually with variants of the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The resulting sequences of graphones (i.e., joint grapheme-phoneme tokens) are then modeled with n-gram models or Hidden Markov Models (e.g., Jiampojamarn et al., 2007; Bisani and Ney, 2008; Jiampojamarn and Kondrak, 2010). A variant of this paradigm includes weighted finite-state transducers trained on such graphone sequences (Novak et al., 2012, 2015). Since the rise of neural network techniques, neural methods have dominated the scene. For example, bidirectional long short-term memory (LSTM) networks using a connectionist temporal classification layer produce comparable results to earlier n-gram models (Rao et al., 2015). By incorporating alignment information into the model, the ceiling set by n-gram models has since been broken (Yao and Zweig, 2015). Attention further improved the performance, as attentional encoder-decoders (Toshniwal and Livescu, 2016) learned to focus on specific input sequences. As attention became "all that was needed" (Vaswani et al., 2017), transformer-based architectures have begun looming large (e.g., Yolchuyeva et al., 2019).
Recent years have also seen works that capitalize on multilingual data to train a single model with grapheme-phoneme pairs from multiple languages. For example, various systems from last year's shared task submissions learned from a multilingual signal (e.g., ElSaadany and Suter, 2020; Peters and Martins, 2020; Vesik et al., 2020).

The Low-resource Subtask
This section provides relevant information concerning the low-resource subtask.

Task Data
The provided data in the low-resource subtask come from ten languages: Adyghe (ady; in the Cyrillic script), Modern Greek (gre; in the Greek alphabet), Icelandic (ice), Italian (ita), Khmer (khm; in the Khmer script, which is an alphasyllabary system), Latvian (lav), Maltese transliterated into the Latin script (mlt_latn), Romanian (rum), Slovene (slv), and the South Wales dialect of Welsh (wel_sw). The data are extracted from Wiktionary using WikiPron (Lee et al., 2020), and filtered and downsampled with proprietary techniques, resulting in each language having 1,000 labeled grapheme-phoneme pairs, split into a training set of 800 pairs, a development set of 100 pairs, and a blind test set of 100 pairs.

The Evaluation Metric
This year, the evaluation metric is the word error rate (WER), which is simply the percentage of words for which the predicted transcription sequence differs from the ground-truth transcription. Systems are ranked by their macro-averaged WER over all languages, with lower scores indicating better systems. We also adopted this metric when evaluating our models on the development sets.
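As a minimal sketch of the metric just described, the snippet below computes a per-language WER and the macro-average across languages. The function names and the toy data are illustrative assumptions, not part of the official evaluation script.

```python
from typing import Dict, List, Tuple

# Each pair is (predicted transcription, ground-truth transcription).
Pairs = List[Tuple[str, str]]


def word_error_rate(pairs: Pairs) -> float:
    """Percentage of words whose prediction differs from the ground truth."""
    if not pairs:
        raise ValueError("need at least one (predicted, gold) pair")
    errors = sum(1 for predicted, gold in pairs if predicted != gold)
    return 100.0 * errors / len(pairs)


def macro_average_wer(per_language: Dict[str, Pairs]) -> float:
    """Unweighted mean of the per-language WERs."""
    scores = [word_error_rate(pairs) for pairs in per_language.values()]
    return sum(scores) / len(scores)


# Toy data (hypothetical transcriptions, only to exercise the functions).
toy = {
    "ice": [("a b", "a b"), ("c d", "c e")],
    "ita": [("x", "x"), ("y", "y"), ("z", "z"), ("w", "v")],
}
ice_wer = word_error_rate(toy["ice"])
macro_wer = macro_average_wer(toy)
```

Because the macro-average weights every language equally, a single difficult language affects the ranking as much as an easy one.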

Baselines
The official baselines for individual languages are based on an ensembled neural transducer trained with the imitation learning (IL) paradigm (Makarov and Clematide, 2018a). The baseline WERs are tabulated in Table 3. In what follows, we overview this baseline neural-transducer system, as our models are built on top of this system. The detailed formal description of the baseline system can be found in Makarov and Clematide (2018a,b,c, 2020).
The neural transducer in question defines a conditional distribution over edit actions, such as copy, deletion, insertion, and substitution:
$$p(y, a \mid x) = \prod_{j=1}^{|a|} p(a_j \mid a_{<j}, x),$$
where $x$ denotes an input sequence of graphemes, and $a = a_1 \ldots a_{|a|}$ stands for a sequence of edit actions. Note that the output sequence $y$ is missing from the conditional probability on the right-hand side as it can be deterministically computed from $x$ and $a$. The model is implemented with an LSTM decoder, coupled with a bidirectional LSTM encoder.
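To make concrete why the output is fully determined by the input and the edit sequence, here is a small sketch that replays edit actions over an input. The action encoding (a `(kind, symbol)` tuple) and the function name `apply_edits` are assumptions made for illustration; the baseline's actual action inventory and implementation may differ.

```python
from typing import List, Tuple

# An action is (kind, symbol). The symbol is only used by "insert" and
# "substitute"; for "copy" and "delete" it is ignored.
Action = Tuple[str, str]


def apply_edits(x: List[str], actions: List[Action]) -> List[str]:
    """Deterministically compute the output y from input x and edits a."""
    y: List[str] = []
    i = 0  # position of the next unread input grapheme
    for kind, symbol in actions:
        if kind == "insert":
            y.append(symbol)  # emits a symbol without reading the input
            continue
        if kind not in ("copy", "delete", "substitute"):
            raise ValueError(f"unknown edit action: {kind}")
        if i >= len(x):
            raise ValueError("edit sequence reads past the end of the input")
        if kind == "copy":
            y.append(x[i])
        elif kind == "substitute":
            y.append(symbol)
        # "delete" reads one grapheme and emits nothing
        i += 1
    if i != len(x):
        raise ValueError("edit sequence did not consume the whole input")
    return y


# Toy edit sequence for the Icelandic word "traust".
graphemes = list("traust")
edits = [("substitute", "tʰ"), ("copy", ""), ("substitute", "ø"),
         ("substitute", "y"), ("copy", ""), ("copy", "")]
phonemes = apply_edits(graphemes, edits)
```

Since each action either reads one input symbol, emits one output symbol, or both, replaying the sequence leaves no freedom in the result, which is why only the actions need to be modeled.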
The model is trained with IL and therefore demands an expert policy, which contains demonstrations of how the task can be optimally solved given any configuration. Cast as IL, the mapping from graphemes to phonemes can be understood as following an optimal path dictated by the expert policy that gradually turns input orthographic symbols to output IPA characters. To acquire the expert policy, a Stochastic Edit Distance (Ristad and Yianilos, 1998) model trained with the EM algorithm is employed to find an edit sequence consisting of four types of edits: copy, deletion, insertion, and substitution. During training time, the expert policy is queried to identify the next optimal edit, that is, the edit that minimizes an objective with two terms: the first term is the Levenshtein distance between the target sequence y and the predicted sequence ŷ, and the second term measures the cost of editing x to ŷ.
The baseline is run with default hyperparameter values, which include ten different initial seeds and a beam of size 4 during inference. The predictions of these individual models are ensembled using a majority vote. Early efforts to modify the ensemble to incorporate system confidence showed that a majority ensemble was sufficient.
This model has proved to be competitive, judging from its performance on the previous year's G2P shared task. We therefore decided to use it as the foundation to construct our systems.
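The majority-vote ensembling mentioned above can be sketched as follows. The tie-breaking rule (keep the earliest prediction among the most frequent ones) is an assumption for illustration; the source does not say how the baseline resolves ties.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[str]) -> str:
    """Return the transcription predicted by the most ensemble members."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    top = max(counts.values())
    # Assumed tie-break: the first prediction (in model order) that
    # reaches the highest count wins.
    for prediction in predictions:
        if counts[prediction] == top:
            return prediction
    raise AssertionError("unreachable")


# Ten hypothetical predictions, one per random seed.
seed_outputs = ["t r a"] * 6 + ["t r o"] * 3 + ["d r a"]
ensembled = majority_vote(seed_outputs)
```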

Our Approaches
This section lays out our attempted approaches. We investigate two alternatives, both linguistic in nature. The first is inspired by a universal linguistic structure, the syllable, and the other by the error patterns discerned from the baseline predictions on the development data.

System 1: Augmenting Data with Unsupervised Syllable Boundaries
Our first approach originates from the observation that, in natural languages, a sequence of sounds does not just assume a flat structure. Neighboring sounds group to form units, such as the onset, nucleus, and coda. In turn, these units can further project to a syllable (see Figure 1 for an example of such projection). Syllables are useful structural units in describing various linguistic phenomena and indeed in predicting the pronunciation of a word in some languages (e.g., Treiman, 1994).
To identify syllable boundaries in the input sequence, we adopted a simple heuristic, the specific steps of which are listed below:
2. Find vowels in the input: Next we align the grapheme sequence with the phoneme sequence using an unsupervised many-to-many aligner (Jiampojamarn et al., 2007; Jiampojamarn and Kondrak, 2010). By identifying graphemes that are aligned to phonemic vowels, we can identify vowels in the input. Using the Icelandic example again, the aligner produces a one-to-one mapping: t → tʰ, r → r, a → ø, u → y, s → s, and t → t. We therefore assume that the input characters a and u represent two vowels. Note that this step is often redundant for input sequences based on the Latin script but is useful in identifying vowel symbols in other scripts.

3. Find valid onsets and codas: A key step in syllabification is to identify which sequences of consonants can form an onset or a coda. Without resorting to linguistic knowledge, one way to identify valid onsets and codas is to look at the two ends of a word: consonant sequences appearing word-initially before the first vowel are valid onsets, and consonant sequences after the final vowel are valid codas. Looping through each input sequence in the training data gives us a list of valid onsets and codas. In the Icelandic example traust, the initial tr sequence must be a valid onset, and the final st sequence a valid coda.
4. Break word-medial consonant sequences into an onset and a coda: Unfortunately, identifying onsets and codas among word-medial consonant sequences is not as straightforward. For example, how do we know whether the sequence in the input VngstrV (V for a vowel character) should be parsed as Vng.strV, as Vn.gstrV, or even as V.ngstrV? To tackle this problem, we use the valid onset and coda lists gathered from the previous step: we split the consonant sequence into two parts, and we choose the split where the first part is a valid coda and the second part a valid onset. For instance, suppose we have an onset list {str, tr} and a coda list {ng, st}. This implies that we only have a single valid split, Vng.strV, so ng is treated as the coda for the previous syllable and str as the onset for the following syllable. In the case where more than one split is acceptable, we favor the split that produces a more complex onset, based on the linguistic heuristic that natural languages tend to tolerate more complex onsets than codas. For example, Vng.strV > Vngs.trV. In the situation where none of the splits produces a concatenation of a valid coda and onset, we adopt the following heuristic:
• If there is only one medial consonant (such as in the case where the consonant can only occur word-internally but not in the onset or coda position), this consonant is classified as the onset for the following syllable.
• If there is more than one consonant, the first consonant is classified as the coda and attached to the previous syllable, while the rest are classified as the onset of the following syllable.
Of course, this procedure is not free of errors (e.g., some languages have onsets that are only allowed word-medially, so word-initial onsets will naturally not include them), but overall it gives reasonable results.

5. Form syllables: The last step is to put together consonant and vowel characters to form syllables. The simplest approach is to allow each vowel character to be projected as a nucleus and distribute onsets and codas around these nuclei to build syllables. If there are four vowels in the input, there are likewise four syllables. There is one important caveat, however. When there are two or more consecutive vowel characters, some languages prefer to merge them into a single vowel/nucleus in their pronunciation (e.g., Greek και → [ce]) while other languages simply default to vowel hiatuses/two side-by-side nuclei (e.g., Italian badia → [badia]); indeed, both are common cross-linguistically. We again rely on the alignment results in the second step to select the vowel segmentation strategy for individual languages.
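The onset and coda steps of the heuristic can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function names are hypothetical, and the treatment of empty parts (a whole cluster may serve as an onset or as a coda if it is attested as one) is an assumption the text does not spell out.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Cluster = Tuple[str, ...]


def collect_onsets_codas(words: Iterable[Sequence[str]],
                         vowels: Set[str]) -> Tuple[Set[Cluster], Set[Cluster]]:
    """Harvest word-initial clusters as onsets and word-final ones as codas."""
    onsets: Set[Cluster] = set()
    codas: Set[Cluster] = set()
    for word in words:
        vowel_positions = [i for i, ch in enumerate(word) if ch in vowels]
        if not vowel_positions:
            continue  # no vowel found, nothing to learn from this word
        first, last = vowel_positions[0], vowel_positions[-1]
        if first > 0:
            onsets.add(tuple(word[:first]))
        if last < len(word) - 1:
            codas.add(tuple(word[last + 1:]))
    return onsets, codas


def split_medial(cluster: Sequence[str], onsets: Set[Cluster],
                 codas: Set[Cluster]) -> Tuple[Cluster, Cluster]:
    """Split a word-medial consonant cluster into (coda, onset)."""
    cluster = tuple(cluster)
    # Shortest coda first, so that among valid splits the one with the
    # most complex onset is chosen.
    for k in range(len(cluster) + 1):
        coda, onset = cluster[:k], cluster[k:]
        coda_ok = not coda or coda in codas      # assumption: empty coda allowed
        onset_ok = not onset or onset in onsets  # assumption: empty onset allowed
        if coda_ok and onset_ok:
            return coda, onset
    # Fallback when no split is valid.
    if len(cluster) == 1:
        return (), cluster            # lone consonant becomes an onset
    return cluster[:1], cluster[1:]   # first consonant is the coda


# The Icelandic word "traust" yields the onset "tr" and the coda "st".
learned_onsets, learned_codas = collect_onsets_codas([list("traust")], {"a", "u"})

# The toy lists from the text: onsets {str, tr}, codas {ng, st}.
toy_onsets = {("s", "t", "r"), ("t", "r")}
toy_codas = {("n", "g"), ("s", "t")}
coda, onset = split_medial(list("ngstr"), toy_onsets, toy_codas)
```

Iterating from the shortest coda upward is one simple way to encode the preference for complex onsets without scoring the candidate splits explicitly.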
After identifying the syllables that compose each word, we augment the input sequences with syllable boundaries. We use four labels to distinguish different types of syllable boundaries: <cc>, <cv>, <vc>, and <vv>, depending on the classes of sound the segments straddling the syllable boundary belong to. For instance, the input sequence b í l a v e r k s t æ ð i in Icelandic will be augmented to be b í <vc> l a <vc> v e r k <cc> s t æ <vc> ð i. We applied the same syllabification algorithm to all languages to generate new input sequences, with the exception of Khmer, as the Khmer script does not permit a straightforward linear mapping between input and output sequences, which is crucial for the vowel identification step. We then used these syllabified input sequences, along with their target transcriptions, as the training data for the baseline model.
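The boundary labeling can be illustrated with a short sketch. It assumes the syllables have already been formed and that the vowel characters are known; the function name is hypothetical.

```python
from typing import List, Sequence, Set


def add_boundary_labels(syllables: Sequence[Sequence[str]],
                        vowels: Set[str]) -> List[str]:
    """Join syllables, inserting a label such as <vc> at each boundary.

    The label combines the class (v or c) of the character before the
    boundary with the class of the character after it.
    """
    if any(len(syllable) == 0 for syllable in syllables):
        raise ValueError("syllables must be non-empty")
    output: List[str] = []
    for index, syllable in enumerate(syllables):
        if index > 0:
            left = "v" if syllables[index - 1][-1] in vowels else "c"
            right = "v" if syllable[0] in vowels else "c"
            output.append(f"<{left}{right}>")
        output.extend(syllable)
    return output


# The Icelandic example from the text, pre-split into syllables.
icelandic = [["b", "í"], ["l", "a"], ["v", "e", "r", "k"],
             ["s", "t", "æ"], ["ð", "i"]]
icelandic_vowels = {"í", "a", "e", "æ", "i"}
augmented = " ".join(add_boundary_labels(icelandic, icelandic_vowels))
```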

System 2: Penalizing Vowel and Diacritic Errors
Our second approach focuses on the training objective of the baseline model, and is driven by the errors we observed in the baseline predictions. Specifically, we noticed that the majority of errors for the languages with a high WER (Khmer, Latvian, and Slovene) concerned vowels, some examples of which are given in Table 1.
[Table 1: examples of vowel errors in the baseline predictions for Khmer, Latvian, and Slovene.]
To address these errors, we supplement the training loss with additional penalties. Each incorrectly predicted vowel incurs this penalty. The penalty acts as a regularizer that forces the model to expend more effort on learning vowels. This modification is in the same spirit as the softmax-margin objective of Gimpel and Smith (2010), which penalizes high-cost outputs more heavily, but our approach is even simpler: we merely supplement the loss with additional penalties for vowels and diacritics. We tuned the vowel and diacritic penalties using a grid search on the development data, varying each from 0 to 0.5 in steps of 0.1. In the case of ties, we chose the higher values, as the penalties generally worked better at higher values. The final values used to generate predictions for the test data are listed in Table 2. We also note that the vowel penalty had significantly more impact than the diacritic penalty.
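The penalty search described above can be sketched as a small grid search. The callable `evaluate_wer` is a hypothetical stand-in for training the model with the given penalties and scoring it on the development set. The exact tie-breaking order (higher vowel penalty first, then higher diacritic penalty) is an assumption; the text only says that ties were resolved toward higher values.

```python
from itertools import product
from typing import Callable, Optional, Tuple


def grid_search(evaluate_wer: Callable[[float, float], float]) -> Tuple[float, float]:
    """Pick (vowel_penalty, diacritic_penalty) with the lowest dev WER."""
    grid = [round(0.1 * step, 1) for step in range(6)]  # 0.0, 0.1, ..., 0.5
    best_key: Optional[Tuple[float, float, float]] = None
    best_setting = (0.0, 0.0)
    for vowel_penalty, diacritic_penalty in product(grid, grid):
        wer = evaluate_wer(vowel_penalty, diacritic_penalty)
        # Lower WER wins; negated penalties make ties favor higher values.
        key = (wer, -vowel_penalty, -diacritic_penalty)
        if best_key is None or key < best_key:
            best_key = key
            best_setting = (vowel_penalty, diacritic_penalty)
    return best_setting


# Hypothetical dev-set scorer: every setting with a vowel penalty of at
# least 0.3 ties at the same WER, so the highest values are selected.
def toy_scorer(vowel_penalty: float, diacritic_penalty: float) -> float:
    return 20.0 if vowel_penalty >= 0.3 else 30.0


chosen = grid_search(toy_scorer)
```

With six values per penalty, the search covers 36 settings per language, each of which would require retraining the model in a real run.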

Results
The performances of our systems, measured in WER, are juxtaposed with the official baseline results in Table 3. We first note that the baseline was particularly strong: gains were difficult to achieve for most languages. Our first system (Syl), which is based on syllabic information, unfortunately does not outperform the baseline. Our second system (VP), which includes additional penalties for vowels and diacritics, however, does outperform the baselines in several languages. Furthermore, its macro-averaged WER is lower than not only that of the baseline but also those of all other submitted systems.
It seems that extra syllable information does not help with predictions in this particular setting. It might be the case that additional syllable boundaries increase input variability without providing much useful information with the current neural-transducer architecture. Alternatively, information about syllable boundary locations might be redundant for this set of languages. Finally, it is possible that the unsupervised nature of our syllable annotation was too noisy to aid the model. We leave these speculations as research questions for future endeavors and restrict the subsequent error analyses and discussion to the results from our vowel-penalty system. [5]
Figure 2: Distributions of error types in test-set predictions across languages. Error types are distinguished based on whether an error involves only consonants, only vowels, or both. For example, C-V means that the error is caused by a ground-truth consonant being replaced by a vowel in the prediction. C-ε means that it is a deletion error where the ground-truth consonant is missing in the prediction, while ε-C represents an insertion error where a consonant is wrongly added.

Error Analyses
In this section, we provide detailed error analyses on the test-set predictions from our best system. The goals of these analyses are twofold: (i) to examine the aspects in which this model outperforms the baseline and to what extent, and (ii) to get a better understanding of the nature of errors made by the system, as we believe that insights and improvements can be derived from a good grasp of error patterns. We analyzed the mismatches between predicted sequences and ground-truth sequences at the segmental level. For this purpose, we again utilized many-to-many alignment (Jiampojamarn et al., 2007; Jiampojamarn and Kondrak, 2010), but this time between a predicted sequence and the corresponding ground-truth sequence. [6] We classified each error along the aligned sequence into one of three kinds:
• Those involving erroneous vowel insertions (e.g., ε → [ə]), deletions (e.g., [ə] → ε), or substitutions (e.g., [ə] → [a]).
• In the same vein, those involving erroneous consonant insertions (e.g., ε → [ʔ]), deletions, or substitutions.
• Those involving both a consonant and a vowel, where one is replaced by the other (e.g., C-V in Figure 2).
[5] … boundaries does not improve the results, it is unlikely that marking constituent boundaries, which adds more variability to the input, will result in better performance, though we did not test this hypothesis.
[6] The parameters used are: allowing deletion of input grapheme strings, maximum length of aligned grapheme and phoneme substring being one, and a training threshold of 0.001.
The frequency of each error type made by the baseline model and our system for each individual language is plotted in Figure 2. Some patterns are immediately clear. First, both systems have a similar pattern in terms of the distribution of error types across languages, although ours makes fewer errors on average. Second, both systems err on different elements, depending on the language. For instance, while Adyghe (ady) and Khmer (khm) have a more balanced distribution between consonant and vowel errors, Slovene (slv) and Welsh (wel_sw) are dominated by vowel errors. Third, the improvements gained in our system seem to come mostly from reduction in vowel errors, as is evident in the case of Khmer, Latvian (lav), and, to a lesser extent, Slovene.
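The error typology can be sketched as a simple counting routine over aligned segment pairs. The input format (a list of `(ground_truth, predicted)` pairs with `None` for the empty symbol) and the function names are assumptions for illustration; the alignment itself is taken as given.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

EMPTY = "ε"  # label used for an empty (inserted or deleted) position

AlignedPair = Tuple[Optional[str], Optional[str]]  # (ground truth, predicted)


def segment_class(segment: Optional[str], vowels: Set[str]) -> str:
    """Map a segment to V, C, or the empty label."""
    if segment is None:
        return EMPTY
    return "V" if segment in vowels else "C"


def count_error_types(aligned: List[AlignedPair], vowels: Set[str]) -> Counter:
    """Count mismatches by type, labeled ground-truth class first."""
    counts: Counter = Counter()
    for gold, predicted in aligned:
        if gold == predicted:
            continue  # correctly predicted segment, not an error
        label = f"{segment_class(gold, vowels)}-{segment_class(predicted, vowels)}"
        counts[label] += 1
    return counts


# Hypothetical aligned pairs covering each kind of error.
toy_alignment: List[AlignedPair] = [
    ("t", "t"),    # match
    ("ə", "a"),    # vowel substitution
    ("k", None),   # consonant deletion
    (None, "ʔ"),   # consonant insertion
    ("s", "i"),    # consonant replaced by a vowel
]
error_counts = count_error_types(toy_alignment, {"ə", "a", "i"})
```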
The final observation is backed up if we zoom in on the errors in these three languages, which we visualize in Figure 3. Many incorrect vowels generated by the baseline model are now correctly predicted. We note that there are also cases, though less common, where the baseline model gives the right prediction, but ours does not. It should be pointed out that, although our system shows improvement over the baseline, there is still plenty of room for improvement in many languages, and our system still produces incorrect vowels in many instances.
Figure 3: … Here we only visualize the cases where either the baseline model gives the right vowel but our system does not, or vice versa. We do not include cases where both the baseline model and our system predict the correct vowel, or both predict an incorrect vowel, to avoid cluttering the view. Each baseline-ground-truth-ours line represents a set of aligned vowels in the same word; the horizontal line segment between a system and the ground-truth means that the prediction from the system agrees with the ground-truth. Color hues are used to distinguish cases where the prediction from the baseline is correct versus those where the prediction from our second system is correct. Shaded areas on the plots enclose vowels of similar vowel quality.
Finally, we look at several languages which still resulted in high WER on the test set: ady, gre, ita, khm, lav, and slv. We use confusion matrices to identify clusters of commonly confused phonemes. This analysis again relies on the alignment between the ground-truth sequence and the corresponding predicted sequence to characterize error distributions. The results from this analysis are shown in Figure 4, and some interesting patterns are discussed below. Figure 2 suggests that Khmer has an equal share of consonant and vowel errors, and the heat maps in Figure 4 reveal that these errors do not seem to follow a certain pattern. However, a different picture emerges with Latvian and Slovene. For both languages, Figure 2 indicates the dominance of errors tied to vowels; consonant errors account for a relatively small proportion of errors. This observation is borne out in Figure 4, with the consonant heat maps for the two languages displaying a clear diagonal stripe, and the vowel heat maps showing much more off-diagonal signals. What is more interesting is that the vowel errors in fact form clusters, as highlighted by white squares on the heat maps. The general pattern is that confusion only arises within a cluster where vowels are of similar quality but differ in terms of length or pitch accent. For example, while [i:] might be incorrectly predicted as [i], our model does not confuse it with, say, [u]. The challenges these languages present to the models are therefore largely suprasegmental: vowel length and pitch accent, both of which are lexicalized and not explicitly marked in the orthography. For the other three languages, their errors also show distinct patterns: for Adyghe, consonants differing only in secondary features can get confused; in Greek, many errors can be attributed to the mixing of [r] and [ɾ]; in Italian, front and back mid vowels can trick our model.
We hope that our detailed error analyses not only show that these errors "make linguistic sense," and therefore attest to the power of the model, but also point out a pathway along which future modeling can be improved.
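A row-normalized confusion matrix of the kind discussed above can be computed from aligned segment pairs as in the sketch below. The input format and the function name are assumptions; each row corresponds to a predicted segment and holds the proportion of times it was aligned with each ground-truth segment.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def confusion_proportions(
        aligned: List[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Build {predicted: {ground_truth: proportion}} from aligned pairs.

    Each pair is (ground-truth segment, predicted segment); proportions
    in every row sum to one.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for gold, predicted in aligned:
        counts[predicted][gold] += 1
    matrix: Dict[str, Dict[str, float]] = {}
    for predicted, row in counts.items():
        total = sum(row.values())
        matrix[predicted] = {gold: n / total for gold, n in row.items()}
    return matrix


# Hypothetical pairs: a long [i:] is predicted as short [i] once.
toy_pairs = [("i:", "i"), ("i", "i"), ("i", "i"), ("u", "u")]
confusions = confusion_proportions(toy_pairs)
```

Off-diagonal mass that stays within a group of similar vowels, as with [i:] and [i] here, is the kind of cluster the heat maps highlight.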

Conclusion
This paper presented the approaches adopted by the UBC Linguistics team to tackle the SIGMORPHON 2021 Grapheme-to-Phoneme Conversion challenge in the low-resource setting. Our submissions build upon the baseline model with modifications inspired by syllable structure and vowel error patterns. While the first modification does not result in more accurate predictions, the second modification does lead to sizable improvements over the baseline results. Subsequent error analyses reveal that the modified model indeed reduces erroneous vowel predictions for languages whose errors are dominated by vowel mismatches. Our approaches also demonstrate that patterns uncovered from careful error analyses can inform the directions for potential improvements.
Figure 4: Confusion matrices of vowel and consonant predictions by our second system (VP) for languages with the test WER > 20%. Each row represents a predicted segment, with colors across columns indicating the proportion of times the predicted segment matches individual ground-truth segments. A gray row means the segment in question is absent from all predicted phoneme sequences but is present in at least one ground-truth sequence. The diagonal elements represent the number of times for which the predicted segment matches the target segment, while off-diagonal elements are those that are mis-predicted by the system. White squares are added to highlight segment groups where mismatches are common.