Results of the Second SIGMORPHON Shared Task on Multilingual Grapheme-to-Phoneme Conversion

Grapheme-to-phoneme conversion is an important component in many speech technologies, but until recently there were no multilingual benchmarks for this task. The second iteration of the SIGMORPHON shared task on multilingual grapheme-to-phoneme conversion features many improvements from the previous year’s task (Gorman et al. 2020), including additional languages, a stronger baseline, three subtasks varying the amount of available resources, extensive quality assurance procedures, and automated error analyses. Four teams submitted a total of thirteen systems, at best achieving relative reductions of word error rate of 11% in the high-resource subtask and 4% in the low-resource subtask.


Introduction
Many speech technologies demand mappings between written words and their pronunciations. In open-vocabulary systems, as well as certain resource-constrained embedded systems, it is insufficient to simply list all possible pronunciations; these mappings must generalize to rare or unseen words as well. Therefore, the mapping must be expressed as a mapping from a sequence of orthographic characters (graphemes) to a sequence of sounds (phones or phonemes). 1 The earliest work on grapheme-to-phoneme conversion (G2P), as this task is known, used ordered rewrite rules. However, such systems are often brittle and the linguistic expertise needed to build, test, and maintain rule-based systems is often in short supply. Furthermore, rule-based systems are outperformed by modern neural sequence-to-sequence models (e.g., Rao et al. 2015, Yao and Zweig 2015, van Esch et al. 2016). With the possible exception of van Esch et al. (2016), who evaluate against a proprietary database of 20 languages and dialects, virtually all of the prior published research on grapheme-to-phoneme conversion evaluates only on English, for which several free and low-cost pronunciation dictionaries are available. The 2020 SIGMORPHON Shared Task on Multilingual Grapheme-to-Phoneme Conversion represented a first attempt to construct a multilingual benchmark for grapheme-to-phoneme conversion. The 2020 shared task targeted fifteen languages and received 23 submissions from nine teams. The second iteration of this shared task attempts to further refine this benchmark by introducing additional languages, a much stronger baseline model, new quality assurance procedures for the data, and automated error analysis techniques. Furthermore, in response to suggestions from participants in the 2020 shared task, the task has been divided into high-, medium-, and low-resource subtasks.

1 We note that referring to elements of transcriptions as phonemes implies an ontological commitment which may or may not be justified; see Lee et al. 2020 (fn. 4) for discussion. Therefore, we use the term phone to refer to symbols used to transcribe pronunciations.

Data
As in the previous year's shared task, all data was drawn from WikiPron (Lee et al. 2020), a massively multilingual pronunciation database extracted from the online dictionary Wiktionary. Depending on the language and script, Wiktionary pronunciations are manually entered by human volunteers working from language-specific pronunciation guidelines and/or generated from the graphemic form via language-specific server-side scripting.
WikiPron scrapes these pronunciations from Wiktionary, optionally applying case-folding to the graphemic form, removing any stress and syllable boundaries, and segmenting the pronunciation, encoded in the International Phonetic Alphabet, using the Python library segments (Moran and Cysouw 2018). In all, 21 WikiPron languages were selected for the three subtasks, including seven new languages and fourteen of the fifteen languages used in the 2020 shared task. In several cases, multiple scripts or dialects are available for a given language. For instance, WikiPron has both Latin and Cyrillic entries for Serbo-Croatian, and three different dialects of Vietnamese. In such cases, the largest data set of the available scripts and/or dialects is chosen. Furthermore, WikiPron distinguishes between "broad" transcriptions delimited by forward slashes (/) and "narrow" transcriptions delimited by square brackets ([ and ]). Once again, the larger of the two data sets is the one used for this task.

Quality assurance
During the previous year's shared task we became aware of several consistency issues with the shared task data. This led us to develop quality assurance procedures for WikiPron and the "upstream" Wiktionary data. For a few languages, we worked with Wiktionary editors who automatically enforced upstream consistency via "bots", i.e., scripts which automatically edit Wiktionary entries. We also improved WikiPron's routines for extracting pronunciation data from Wiktionary. In some cases (e.g., Vietnamese), this required the creation of language-specific extraction routines.
In early versions of WikiPron, users had limited means to separate out entries for languages written in multiple scripts. We therefore added an automated script detection system which ensures that entries for the many languages written with multiple scripts (including shared task languages Maltese, Japanese, and Serbo-Croatian) are sorted according to script.
We noticed that the WikiPron data includes many hyper-foreign pronunciations with non-native phones. For example, the English data includes a broad pronunciation of Bach (the surname of a family of composers) as /bɑːx/ with a velar fricative /x/, a segment which is common in German but absent in modern English. Furthermore, unexpected phones may represent simple human error. Therefore, we wished to exclude pronunciations which include any non-native segments. This was accomplished by creating phonelists which enumerate native phones for a given language. Separate phonelists may be provided for broad and narrow transcriptions of the same language. During data ingestion, if a pronunciation contained any segment not present on the phonelist, the entry was discarded. Phonelist filtration was used for all languages in the medium- and low-resource subtasks, described below.
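Phonelist filtration can be illustrated with a minimal Python sketch. It assumes that pronunciations are stored as space-separated phone segments; the function name `filter_by_phonelist` and the toy phonelist are hypothetical and are not taken from the WikiPron code base.

```python
def filter_by_phonelist(entries, phonelist):
    """Keep only (word, pronunciation) pairs whose segments are all on the phonelist.

    Pronunciations are assumed to be space-separated phone segments.
    """
    allowed = set(phonelist)
    kept = []
    for word, pron in entries:
        if all(segment in allowed for segment in pron.split()):
            kept.append((word, pron))
    return kept


# Toy English phonelist (illustrative only); /x/ is deliberately absent.
toy_phonelist = {"b", "k", "æ", "ɑː"}
toy_entries = [("bach", "b ɑː x"), ("back", "b æ k")]
print(filter_by_phonelist(toy_entries, toy_phonelist))
# [('back', 'b æ k')]
```

Discarding the whole entry, rather than repairing the offending segment, mirrors the description above: an unexpected phone may signal either a hyper-foreign pronunciation or a transcription error, and the filter does not try to tell them apart.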

Task definition
In this task, participants were provided with a collection of words and their pronunciations, and then scored on their ability to predict the pronunciation of a set of unseen words.

Subtasks
In the previous year's shared task, each language's data consisted of 4,500 examples, sampled from WikiPron, split randomly into 80% training examples, 10% development examples, and 10% test examples. As part of their system development, two teams in the 2020 shared task (Hauer et al. 2020, Yu et al. 2020) down-sampled these data to simulate a lower-resource setting, and one participant expressed concern whether the methods used in the shared task would generalize effectively to high-resource scenarios like the large English data sets traditionally used to evaluate grapheme-to-phoneme systems. This motivated a division of the data into three subtasks, varying the amount of data provided, as described below.

High-resource subtask
The first subtask consists of a roughly 41,000-word sample of Mainstream American English (eng_us). Participating teams were permitted to use any and all external resources to develop their systems except for Wiktionary or WikiPron. It was anticipated that participants would exploit other freely available American English pronunciation dictionaries.

Medium-resource subtask
The second subtask represents a medium-resource task. For each of the ten target languages, a sample of 10,000 words was used. Teams participating in this subtask were permitted to use UniMorph paradigms (Kirov et al. 2018) to lemmatize or to look up morphological features, but were not permitted to use any other external resources. The languages for this subtask are listed and exemplified in Table 1.

Low-resource subtask
The third subtask is designed to simulate a low-resource setting and consists of 1,000 words from each of ten languages. Teams were not permitted to use any external resources for this subtask. The languages for this subtask are shown in Table 2.

Data preparation
The procedures for sampling and splitting the data are similar to those used in the previous year's shared task; see Gorman et al. 2020, §3. For each of the three subtasks, the data for each language are first randomly downsampled according to their frequencies in the Wortschatz (Goldhahn et al. 2012) norms. Words containing fewer than two Unicode characters or fewer than two phone segments are excluded, as are words with multiple pronunciations. The resulting data are randomly split into 80% training data, 10% development data, and 10% test data. As in the previous year's shared task, these splits are constrained so that inflectional variants of any given lemma, according to the UniMorph (Kirov et al. 2018) paradigms, can occur in at most one of the three shards. Training and development data was made available at the start of the task. The test words were also made available at the start of the task; test pronunciations were withheld until the end of the task. Some additional processing is required for certain languages, as described below.
English The Wiktionary American English pronunciations exhibit a large number of inconsistencies. These pronunciations were validated by automatically comparing them with entries in the CALLHOME American English Lexicon (Kingsbury et al. 1997), which provides broad ARPAbet transcriptions of Mainstream American English. Furthermore, a script was used to standardize use of vowel length and enforce consistent use of tie bars with affricates (e.g., /tʃ/ → /t͡ʃ/). However, we note that Gautam et al. (2021: §2.1) report several residual quality issues with this data.
Bulgarian Bulgarian Wiktionary transcriptions make inconsistent use of tie bars on affricates; for example, ц is transcribed as both /ts, t͡s/. Furthermore, the broad transcriptions sometimes contain allophones of the consonants /t, d, l/ (Ternes and Vladimirova-Buhtz 1990); e.g., л is transcribed as both /l, ɫ/. A script was used to enforce a consistent broad transcription.
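The tie-bar normalization described for English and Bulgarian can be sketched as follows. This is an illustration under stated assumptions, not the script used for the task: pronunciations are assumed to be space-separated segments, the affricate inventory passed to the hypothetical function `enforce_tie_bars` is a toy list, and every matching stop-fricative sequence is merged, which a real script might not want for clusters that span a morpheme boundary.

```python
TIE_BAR = "\u0361"  # COMBINING DOUBLE INVERTED BREVE, the IPA tie bar


def enforce_tie_bars(pron, affricates=(("t", "s"), ("t", "ʃ"), ("d", "ʒ"))):
    """Merge adjacent stop + fricative segments into a single tied affricate.

    `pron` is a space-separated segment string; `affricates` is an
    illustrative inventory, not a language-specific one.
    """
    segments = pron.split()
    merged = []
    i = 0
    while i < len(segments):
        pair = tuple(segments[i:i + 2])
        if len(pair) == 2 and pair in affricates:
            merged.append(pair[0] + TIE_BAR + pair[1])
            i += 2
        else:
            merged.append(segments[i])
            i += 1
    return " ".join(merged)


normalized = enforce_tie_bars("t s a r")
# Already-tied affricates are single segments, so a second pass changes nothing.
assert enforce_tie_bars(normalized) == normalized
```

Because a tied affricate is a single segment after normalization, applying the function twice gives the same result as applying it once, which makes it safe to run over data that mixes both conventions.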

Maltese
In the Latin-script Maltese data, Wiktionary has multiple transcriptions of the digraph għ, which in the contemporary language indicates lengthening of an adjacent vowel, except word-finally where it is read as [ħ] (Hoberman 2007:278f.). Rather than excluding multiple pronunciations, a script was used to eliminate pronunciations which contain archaic readings of this digraph, e.g., as pharyngealization or as [ɣ].
Welsh WikiPron's transcriptions of the Southern dialect of Welsh include the effects of variable processes of monophthongization and deletion (Hannahs 2013:18-25). Once again, rather than excluding multiple pronunciations, a script was used to select the "longer" pronunciation (naturally, the pronunciation without variable monophthongization or deletion) of Welsh words with multiple pronunciations.
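The general filtering and splitting steps described at the start of this section can be sketched in Python. This is a rough approximation under stated assumptions: the frequency-based downsampling step is omitted, the `lemma_of` dictionary stands in for a UniMorph lookup, and the function names are hypothetical. The shards come out at approximately 80/10/10 because whole lemma groups are assigned together.

```python
import random
from collections import defaultdict


def filter_entries(entries):
    """Drop words with multiple pronunciations and entries that are too short."""
    prons = defaultdict(set)
    for word, pron in entries:
        prons[word].add(pron)
    kept = []
    for word, variants in prons.items():
        if len(variants) != 1:
            continue  # word has multiple pronunciations
        pron = next(iter(variants))
        # len(word) counts Unicode code points; segments are space-separated.
        if len(word) < 2 or len(pron.split()) < 2:
            continue
        kept.append((word, pron))
    return kept


def split_by_lemma(entries, lemma_of, seed=0):
    """Split roughly 80/10/10 so that each lemma lands in exactly one shard."""
    groups = defaultdict(list)
    for word, pron in entries:
        # Words without a known lemma are treated as their own lemma.
        groups[lemma_of.get(word, word)].append((word, pron))
    lemmas = sorted(groups)
    random.Random(seed).shuffle(lemmas)
    total = len(entries)
    train, dev, test = [], [], []
    for lemma in lemmas:
        if len(train) < 0.8 * total:
            shard = train
        elif len(dev) < 0.1 * total:
            shard = dev
        else:
            shard = test
        shard.extend(groups[lemma])
    return train, dev, test
```

Grouping by lemma before shuffling is what enforces the constraint: an inflectional variant can never leak from the training shard into the development or test shard, because the whole paradigm is moved as one unit.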

Evaluation
The primary metric for this task was word error rate (WER), the percentage of words for which the hypothesized transcription sequence is not identical to the gold reference transcription. As the medium- and low-resource subtasks involve multiple languages, macro-averaged WER was used for system ranking. Participants were provided with two evaluation scripts: one which computes WER for a single language, and one which also computes macro-averaged WER across two or more languages. The 2020 shared task also reported another metric, phone error rate (PER), but this was found to be highly correlated with WER and therefore has been omitted here.
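The two metrics can be written down in a few lines of Python. This is a minimal sketch, not the evaluation scripts distributed to participants; the function names and the input layout (parallel lists of transcription strings) are assumptions made for illustration.

```python
def word_error_rate(gold, hypo):
    """Percentage of words whose hypothesis differs from the gold transcription."""
    if len(gold) != len(hypo):
        raise ValueError("gold and hypothesis lists must have the same length")
    errors = sum(g != h for g, h in zip(gold, hypo))
    return 100.0 * errors / len(gold)


def macro_average_wer(per_language):
    """Unweighted mean of per-language WERs.

    `per_language` maps a language code to a (gold, hypo) pair of lists.
    """
    rates = [word_error_rate(gold, hypo) for gold, hypo in per_language.values()]
    return sum(rates) / len(rates)


toy = {
    "xx1": (["a b", "c d"], ["a b", "c d"]),  # no errors
    "xx2": (["a b", "c d"], ["a b", "c e"]),  # one error in two words
}
print(macro_average_wer(toy))  # 25.0
```

Macro-averaging gives every language equal weight regardless of test set size; in this task the test sets within a subtask have the same size, so the macro- and micro-averages coincide, but the macro-average remains well defined if they do not.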

Baseline
The 2020 shared task included three baselines: a WFST-based pair n-gram model, a bidirectional LSTM encoder-decoder network, and a transformer. All models were tuned to minimize per-language development-set WER using a limited-budget grid search. Best results overall were obtained by the bidirectional LSTM. Despite the extensive GPU resources required to execute a per-language grid search, the best baseline was handily outperformed by nearly all submissions. This led us to seek a simpler, stronger, and less computationally demanding baseline for this year's shared task.
The baseline for the 2021 shared task is a neural transducer system using an imitation learning paradigm (Makarov and Clematide 2018). A variant of this system (Makarov and Clematide 2020) was the second-best system in the 2020 shared task. 5 Alignments are computed using ten iterations of expectation maximization, and the imitation learning policy is trained for up to sixty epochs (with a patience of twelve) using the Adadelta optimizer. A beam of size four is used for prediction. Final predictions are produced by a majority-vote ten-component ensemble. Internal processing is performed using the decomposed Unicode normalization form (NFD), but predictions are converted back to the composed form (NFC). An implementation of the baseline was provided during the task and participating teams were encouraged to adapt it for their submissions.

5 The baseline was implemented using the DyNet neural network toolkit (Neubig et al. 2017). In contrast to the previous year's baseline, the imitation learning system does not require a GPU for efficient training; it runs effectively on CPU and can exploit multiple CPU cores if present. Training, ensembling, and evaluation for all three subtasks took roughly 72 hours of wall-clock time on a commodity desktop PC.

Table 1 (fragment; remaining rows not recoverable): example entries for medium-resource languages.
Armenian (Eastern)    arm_e        համադրություն    h ɑ m ɑ d ә ɾ u tʰ j u n
Bulgarian             bul          обоснованият     o̝ b o̝ s n o v a n i j ә t̪
Dutch                 dut          konijn           k oː n ɛ i̯ n
French                fre          joindre          ʒ w ɛ̃ d ʁ
Georgian              geo          მოუქნელად        m ɔ u kʰ n ɛ l ɑ d
Vietnamese (Hanoi)    vie_hanoi    ngừng bắn        ŋ ɨ ŋ ˨˩ ʔ ɓ a n ˧˦
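Two of the baseline's prediction-time steps, Unicode normalization and majority-vote ensembling, can be sketched with the Python standard library. This is not the baseline's own code: the components below are arbitrary callables standing in for trained models, and the tie-breaking rule (first prediction seen wins) is an assumption, since the tie-breaking behavior of the actual ensemble is not described here.

```python
import unicodedata
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent prediction; ties go to the one seen first."""
    return Counter(predictions).most_common(1)[0][0]


def predict_with_ensemble(word, components):
    """Decompose the input (NFD), collect one vote per component, recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", word)
    votes = [component(decomposed) for component in components]
    return unicodedata.normalize("NFC", majority_vote(votes))


# Stand-in "models": each maps a grapheme string to a pronunciation string.
toy_components = [lambda w: "k a", lambda w: "k ɑ", lambda w: "k a"]
print(predict_with_ensemble("ka", toy_components))  # k a
```

Working in NFD internally means that a base character and its diacritics are seen as separate symbols, while converting back to NFC keeps the output comparable with gold data stored in composed form.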

Submissions
Below we provide brief descriptions of submissions to the shared task; more detailed descriptions, as well as various exploratory analyses and post-submission experiments, can be found in the system papers later in this volume.

AZ
(2021) produced a single submission to the low-resource subtask. The model is inspired by the previous year's bidirectional LSTM baseline but also employs several data augmentation strategies. First, much of the development data is used for training rather than for validation. Secondly, new training examples are generated using substrings of other training examples. Finally, the AZ model is trained simultaneously on all languages, a method used in some of the previous year's shared task submissions (e.g., Peters and Martins 2020, Vesik et al. 2020).

CLUZH
Clematide and Makarov (2021) produced four submissions to the medium-resource subtask and three to the low-resource subtask. All seven submissions are variations on the imitation learning baseline model (section 6). They experiment with processing individual IPA Unicode characters instead of entire IPA "segments" (e.g., CLUZH-1, CLUZH-5, and CLUZH-6), and larger ensembles (e.g., CLUZH-3). They also experiment with input dropout, mogrifier LSTMs, and adaptive batch sizes, among other features.

Dialpad
Gautam et al. (2021) submitted three systems to the high-resource subtask. The Dialpad-1 submission is a large ensemble of seven different sequence models. Dialpad-2 is a smaller ensemble of three models. Dialpad-3 is a single transformer model implemented as part of CMU Sphinx. Gautam et al. also experiment with subword modeling techniques.

UBC
(2021) submitted two systems for the low-resource subtask, both variations on the baseline model. The UBC-1 submission hypothesizes that, as previously reported by van Esch et al. (2016), inserting explicit syllable boundaries into the phone sequences enhances grapheme-to-phoneme performance. They generate syllable boundaries using an automated onset maximization heuristic. The UBC-2 submission takes a different approach: it assigns additional language-specific penalties for mis-predicted vowels and diacritic characters such as the length mark /ː/.

Results
Multiple submissions to the high- and low-resource subtasks outperformed the baseline; however, no submission to the medium-resource subtask exceeded the baseline. The best results for each language are shown in Table 3.

Subtasks
High-resource subtask
The Dialpad team submitted three systems for the high-resource subtask, all of which outperformed the baseline. Results for this subtask are shown in Table 4. The best submission overall, Dialpad-1, a seven-component ensemble, achieved an impressive 4.5% absolute (11% relative) reduction in WER over the baseline.

Medium-resource subtask
The CLUZH team submitted four systems for the medium-resource subtask. All of these systems are variants of the baseline model. The results are shown in Table 5; note that the individual language results are expressed as three-digit percentages since there are 1,000 test examples each. While several of the CLUZH systems outperform the baseline on individual languages, including Armenian, French, Hungarian, Japanese, Korean, and Vietnamese, the baseline achieves the best macro-averaged WER.

Low-resource subtask
Three teams (AZ, CLUZH, and UBC) submitted a total of six systems to the low-resource subtask. Results for this subtask are shown in Table 6; note that the results are expressed as two-digit percentages since there are 100 test examples for each language. Three submissions outperformed the baseline. The best-performing submission was UBC-2, an adaptation of the baseline which assigns higher penalties for mis-predicted vowels and diacritic characters. It achieved a 1.0% absolute (4% relative) reduction in WER over the baseline.
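The relationship between absolute and relative reductions can be checked with a short calculation. The sketch below uses the low-resource baseline macro-averaged WER of 25.1 reported in the discussion; the function name is hypothetical.

```python
def relative_reduction(baseline_wer, absolute_reduction):
    """Relative WER reduction (in percent) implied by an absolute reduction."""
    return 100.0 * absolute_reduction / baseline_wer


# Low-resource subtask: 1.0 point absolute reduction from a baseline WER of 25.1.
print(round(relative_reduction(25.1, 1.0)))  # 4
```

The same absolute gain translates into a smaller relative gain when the baseline error rate is higher, which is why both figures are reported.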

Error analysis
Error analysis can help identify strengths and weaknesses of existing models, suggesting future improvements and guiding the construction of ensemble models. Prior experience using gold crowd-sourced data extracted from Wiktionary suggests that a non-trivial portion of errors made by top systems are due to errors in the gold data itself. For example, Gorman et al. (2019) report that a substantial portion of the prediction errors made by the top two systems in the 2017 CoNLL-SIGMORPHON Shared Task on Morphological Reinflection (Cotterell et al. 2017) are due to target errors, i.e., errors in the gold data. Therefore we conducted an automatic error analysis for four target languages. It was hoped that this analysis would also help identify (and quantify) target errors in the test data.
Two forms of error analysis were employed here. First, after Makarov and Clematide (2020), the most frequent error types in each language are shown in Table 7. From this table it is clear that many errors can be attributed to the ambiguity of a language's writing system. For example, in both Serbo-Croatian and Slovenian the most common errors involve the confusion or omission of suprasegmental information such as pitch accent and vowel length, neither of which is represented in the orthography. Likewise, in French and Italian the most frequent errors confuse vowel sounds represented by the same graphemes. Many errors may also be attributable to problems with the target data. For example, the two most frequent errors for English are predicting [ɪ] instead of [ә], and predicting [ɑ] instead of [ɔ]. Impressionistically, the former is due in part to inconsistent transcription of the -ed and -es suffixes, whereas the latter may reflect inconsistent transcription of the low back merger.
The second error analysis technique used here is an adaptation of a quality assurance technique proposed by Jansche (2014). For each language targeted by the error analysis, a finite-state covering grammar is constructed by manually listing all pairs of permissible grapheme-phone mappings for that language. Let $C$ be the set of all such $(g, p)$ pairs. Then, the covering grammar $\gamma$ is the rational relation given by the closure over $C$, thus $\gamma = C^*$. Covering grammars were constructed for three medium-resource languages and four of the low-resource languages. A fragment of the Bulgarian covering grammar, showing readings of the characters б, ф, and ю, is presented in Table 8. Let $G$ be the graphemic form of a word and let $P$ and $\hat{P}$ be the corresponding gold and hypothesis pronunciations for that word. For error analysis we are naturally interested in cases where $P \neq \hat{P}$, i.e., those cases where the gold and hypothesis pronunciations do not match, since these are exactly the cases which contribute to word error rate. Then, $\mathcal{P} = \pi_o(G \circ \gamma)$ is a finite-state lattice representing the set of all "possible" pronunciations of $G$ admitted by the covering grammar.
When $P \neq \hat{P}$ but $P \in \mathcal{P}$ (that is, when the gold pronunciation is one of the possible pronunciations) we refer to such errors as model deficiencies, since this condition suggests that the system in question has failed to guess one of several possible pronunciations of the current word. In many cases this reflects genuine ambiguities in the orthography itself. For example, in Italian, e is used to write both the phonemes /e, ɛ/ and o is similarly read as /o, ɔ/ (Rogers and d'Arcangeli 2004). There are few if any orthographic clues to which mid-vowel phoneme is intended, and all submissions incorrectly predicted that the o in nome 'name' is read as [ɔ] rather than [o]. Similar issues arise in Icelandic and French. The preceding examples both represent global ambiguities, but model deficiencies may also occur when the system has failed to disambiguate a local ambiguity. One example of this can be found in French: the verbal third-person plural suffix -ent is silent whereas the non-suffixal word-final ent is normally read as [ɑ]. Morphological information was not provided to the covering grammar, but it could easily be exploited by grapheme-to-phoneme models.
[Table fragment: Macro-average 10.6 11.4 10.9 11.1 10.8]
Another condition of interest is when $P \neq \hat{P}$ but $P \notin \mathcal{P}$. We refer to such errors as coverage deficiencies, since they arise when the gold pronunciation is not one permitted by the covering grammar. While coverage deficiencies may result from actual deficiencies in the covering grammar itself, they more often arise when a word does not follow the normal orthographic principles of its language. For instance, Italian has borrowed the English loanword weekend [wikɛnd] 'id.' but has not yet adapted it to Italian orthographic principles. Finally, coverage deficiencies may indicate target errors, inconsistencies in the gold data itself. For example, in the Italian data, the tie bars used to indicate affricates are not always present, and many apparent errors are the result of gold pronunciations which omit a tie bar.
WER and model deficiency rate (MDR) are shown for select systems and three languages from the medium-resource subtask in Table 9, and Table 10 shows similar statistics for four low-resource languages. Note that by construction, one can obtain the coverage deficiency rate simply by subtracting MDR from WER. By comparing WER and MDR one can see that the overwhelming majority of errors in these seven languages are model deficiencies, most naturally arising from genuine ambiguities in orthography rather than target errors (i.e., data inconsistencies).
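The distinction between model deficiencies and coverage deficiencies can be illustrated with a brute-force Python stand-in for the covering grammar. The analysis above uses finite-state machinery; the sketch below instead enumerates every pronunciation licensed by a list of grapheme-phone pairs, which is only practical for toy inputs. The Italian pair list is a hypothetical fragment written for this example, not the covering grammar used in the analysis, and all function names are assumptions.

```python
def possible_pronunciations(word, pairs):
    """All pronunciations of `word` licensed by the closure over `pairs`.

    `pairs` is a list of (grapheme, phones) tuples with non-empty graphemes;
    `phones` is a tuple of segments and may be empty for silent graphemes.
    """
    # partial[i] holds every phone tuple that spells out word[:i].
    partial = [set() for _ in range(len(word) + 1)]
    partial[0].add(())
    for i in range(len(word)):
        for grapheme, phones in pairs:
            if word.startswith(grapheme, i):
                for prefix in partial[i]:
                    partial[i + len(grapheme)].add(prefix + tuple(phones))
    return {" ".join(pron) for pron in partial[len(word)]}


def classify_error(word, gold, hypo, pairs):
    """Label one test example as correct, model deficiency, or coverage deficiency."""
    if gold == hypo:
        return "correct"
    if gold in possible_pronunciations(word, pairs):
        return "model deficiency"
    return "coverage deficiency"


def error_rates(examples, pairs):
    """Return (WER, MDR, coverage deficiency rate) as percentages."""
    labels = [classify_error(w, g, h, pairs) for w, g, h in examples]
    total = len(labels)
    wer = 100.0 * sum(label != "correct" for label in labels) / total
    mdr = 100.0 * labels.count("model deficiency") / total
    return wer, mdr, wer - mdr


# Hypothetical fragment: e and o are each ambiguous between two mid vowels.
toy_pairs = [("n", ("n",)), ("m", ("m",)), ("e", ("e",)), ("e", ("ɛ",)),
             ("o", ("o",)), ("o", ("ɔ",))]
print(classify_error("nome", "n o m e", "n ɔ m e", toy_pairs))
# model deficiency
```

A word such as weekend falls outside this toy grammar entirely (there is no pair for w), so any mismatch on it is labeled a coverage deficiency, and the coverage deficiency rate falls out as WER minus MDR.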
To facilitate ensemble construction and further error analysis, we release all submissions' test set predictions to the research community.

Discussion
We once again see an enormous difference in language difficulty. One of the languages with the largest amount of data, English, also has one of the highest WERs. In contrast, the baseline and all four submissions to the medium-resource subtask achieve perfect performance on Georgian. This is a substantial change from the previous year's shared task: with a sample roughly half the size of this year's task, the best system (Yu et al. 2020   2020:47). This enormous improvement likely reflects quality assurance work on this language, but we did not anticipate reaching ceiling performance. Insofar as the above quality assurance and error analysis techniques prove effective and generalizable, we may soon be able to ask what makes a language hard to pronounce (cf. Gorman et al. 2020:45f.). As mentioned above, the data here are a mixture of broad and narrow transcriptions. At first glance, this might explain some of the variation in language difficulty; for example, it is easy to imagine that the additional details in narrow transcriptions make them more difficult to predict. However, for many languages, only one of the two levels of transcription is available at scale, and in other languages, divergence between broad and narrow transcriptions is impressionistically quite minor. Nevertheless, this impression ought to be quantified.
While we responded to community demand for lower- and higher-resource subtasks, only one team each submitted to the high- and medium-resource subtasks. It was surprising that none of the medium-resource submissions were able to consistently outperform the baseline model across the ten target languages. Clearly, this year's baseline is much stronger than the previous year's.
Participants in the high- and medium-resource subtasks were permitted to make use of lemmas and morphological tags from UniMorph as additional features. However, no team made use of these resources. Some prior work (e.g., Demberg et al. 2007) has found morphological tags highly useful, and error analysis (§8.2) suggests this information would make an impact in French.
There is a large performance gap between the medium-resource and low-resource subtasks. For instance, the baseline achieves a WER of 10.6 in the medium-resource scenario and a WER of 25.1 in the low-resource scenario. It seems that current models are unable to reach peak performance with the 800 training examples provided in the low-resource subtask. Further work is needed to develop more efficient models and data augmentation strategies for low-resource scenarios. In our opinion, this scenario is the most important one for speech technology, since speech resources, including pronunciation data, are scarce for the vast majority of the world's written languages.

Conclusions
The second iteration of the shared task on multilingual grapheme-to-phoneme conversion features many improvements on the previous year's task, most of all in data quality. Four teams submitted thirteen systems, achieving substantial reductions in both absolute and relative error over the baseline in two of three subtasks. We hope the code and data, released under permissive licenses, will be used to benchmark grapheme-to-phoneme conversion and sequence-to-sequence modeling techniques more generally.