Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation



Introduction
Text normalization refers to a range of tasks that consist in replacing non-standard spellings with their standard equivalents. This procedure is beneficial for many downstream NLP tasks, since it increases data homogeneity and thus reduces the impact of unknown word forms. Establishing identity between different variants of the same form is also crucial for information retrieval tasks, and in particular for building efficient corpus querying systems. Furthermore, it can facilitate building applications for a wider audience, such as spelling and grammar checkers.
Important progress has been made on normalization of historical texts (e.g., Tang et al., 2018; Bollmann, 2019) and user-generated content (UGC) in social media (e.g., van der Goot et al., 2021). However, the existing work on dialect normalization (e.g., Scherrer and Ljubešić, 2016; Abe et al., 2018; Partanen et al., 2019) remains fragmented: it typically focuses on a single language and uses different models, experimental setups and evaluation metrics, making direct comparisons difficult.
In this paper, we aim to establish dialect-to-standard normalization as a distinct task alongside historical text normalization and UGC normalization. We make the following contributions:
• We compile a multilingual dataset from existing sources and make it available in a unified format to facilitate cross-lingual comparisons. The dataset covers Finnish, Norwegian, Swiss German and Slovene. These languages come from three different language branches and have different morphological systems. For the two largest datasets (Finnish and Norwegian), we provide different data splits corresponding to different use cases for dialect normalization.
• We test a wide range of sequence-to-sequence models that have performed well in other normalization tasks: statistical machine translation, RNN-based and Transformer-based neural machine translation, and byT5, a pre-trained byte-based multilingual Transformer. We compare character and subword tokenizations as well as full-sentence contexts and sliding windows of three words. We evaluate the models on word accuracy, but also provide character error rates and error reduction rates to facilitate comparison with previous work. Finally, we provide an error analysis on two of the Finnish data splits.

Related Work

Historical text normalization
Historical text normalization consists in modernizing the spelling of the text such that it conforms to current orthographic conventions. Pettersson et al. (2014) evaluate three different normalization methods in a multilingual setup: a simple filtering model, an approach based on Levenshtein distance, and an approach using character-level statistical machine translation (CSMT). They find that CSMT is the overall most promising approach. Scherrer and Erjavec (2016) use CSMT in supervised and unsupervised settings to normalize historical Slovene data. Tang et al. (2018) and Bollmann (2019) provide multilingual comparisons of neural and statistical MT approaches, whereas Bawden et al. (2022) evaluate different normalization methods on historical French. In most settings, SMT outperformed neural models. In several settings, BPE-based subword segmentation led to better results than character-level segmentation.

Normalization of user-generated content
UGC, typically found on social media, contains various non-standard elements such as slang, abbreviations, creative spellings and typos. De Clercq et al. (2013) present various experiments on normalizing Dutch tweets and SMS messages and show that a combination of character-level and word-level SMT models yields the best results. Matos Veliz et al. (2019) follow up on this work and show that data augmentation techniques are crucial for obtaining competitive results with NMT models. The MoNoise model (van der Goot, 2019) significantly improved the state of the art in UGC normalization. It contains several modules, such as a spellchecker, an n-gram language model and domain-specific word embeddings, that provide normalization candidates.
The first multilingual, homogeneous dataset for UGC normalization was published in the context of the MultiLexNorm shared task in 2021 (van der Goot et al., 2021). The results of the shared task also supported the usefulness of normalization for downstream tasks such as PoS tagging and parsing. The best-performing submission (Samuel and Straka, 2021) proposed to fine-tune byT5, a byte-level pre-trained model (Xue et al., 2022), in such a way that normalizations are produced one word at a time.

Dialect-to-standard normalization
There has been comparatively less research on dialect normalization. Scherrer and Ljubešić (2016) apply CSMT to Swiss German. They create models for normalizing individual words and entire sentences and show that the larger context provided by the latter is beneficial for the normalization of ambiguous word forms. Lusetti et al. (2018) work on a different Swiss German dataset and show that neural encoder-decoder models can outperform CSMT when additional target-side language models are included. Abe et al. (2018) work on the NINJAL corpus (Kokushokankokai Inc, 1980) and propose to translate Japanese dialects into standard Japanese using a multilingual (or rather, multi-dialectal) LSTM encoder-decoder model. The transduction is done on the level of the bunsetsu, the base phrase in Japanese, which corresponds to a content word potentially followed by a string of functional words. Partanen et al. (2019) compare LSTM-based and Transformer-based character-level NMT models for normalizing Finnish. The authors use contexts of one word, three words, or the entire sentence, with the best results achieved using three words. In contrast, Hämäläinen et al. (2020) report the best results when using individual words for normalizing Swedish dialects spoken in Finland. Hämäläinen et al. (2022) use generated dialectal Finnish sentences to normalize Estonian dialects.
Machine translation from Arabic dialects to Modern Standard Arabic (MSA) can also be considered a dialect normalization task, although Arabic dialects differ from the standard variety to a greater extent than the languages used in our work, so that normalization also involves lexical replacements and reorderings. The 2023 NADI shared task (Abdul-Mageed et al., 2023) includes a subtask on Arabic dialect translation. The MADAR corpus (Bouamor et al., 2018) is a popular resource for Arabic dialect translation and covers 25 dialects. Additionally, Eryani et al. (2020) describe the creation of a normalization corpus for five Arabic dialects, but do not report any experiments on automatic normalization. Zhao and Chodroff (2022) similarly report on the compilation of a corpus of Mandarin dialects, but they focus on acoustic-phonetic analysis.
Each of these works focuses on dialects of a single language and uses different models and experimental setups, as well as different evaluation metrics (including BLEU, word error rate and accuracy). All of these factors make meaningful comparisons between different approaches difficult.

Datasets
We propose a multilingual dataset that covers Finnish, Norwegian, Swiss German and Slovene.
The dataset is compiled from existing dialect corpora, which are presented in detail below. The languages originate from two language families (Uralic and Indo-European) and three branches (Finnic, Germanic, Balto-Slavic). All languages are written in the Latin script, which enables the comparison of language structure rather than differences in script. Some quantitative information about the individual corpora is given in Table 1, and Table 2 provides example sentences.

Finnish
The Samples of Spoken Finnish corpus (Suomen kielen näytteitä, hereafter SKN) (Institute for the Languages of Finland, 2021) consists of 99 interviews conducted mostly in the 1960s. It includes data from 50 Finnish-speaking locations, with two speakers per location (with one exception). The interviews have been transcribed phonetically on two levels of granularity (detailed and simplified) and normalized manually by linguists. We only consider the utterances of the interviewed dialect speakers, not those of the interviewers. Although the detailed transcriptions were used for the normalization experiments in Partanen et al. (2019), we use the simplified transcriptions here to make the annotations more consistent with the other languages. The simplified transcriptions do not make certain phonetic distinctions and share the alphabet with the normalized text.

Norwegian
The Norwegian Dialect Corpus (Johannessen et al., 2009, hereafter NDC) was built as part of a larger initiative to collect dialect syntax data for the North Germanic languages. The recordings were made between 2006 and 2010, and typically four speakers per location were recorded. Each speaker appears in an interview with a researcher and in an informal conversation with another speaker. We concatenate all utterances of a speaker regardless of the context in which they appear. The recordings were transcribed phonetically and thereafter normalized to Norwegian Bokmål. The normalization was first done with an automatic tool developed specifically for this corpus, and its output was manually corrected afterwards. The publicly available phonetic and orthographic transcriptions are not well aligned; we automatically re-aligned them at the utterance and word levels.

Swiss German
The ArchiMob corpus of Swiss German (Samardžić et al., 2016; Scherrer et al., 2019) consists of oral history interviews conducted between 1999 and 2001. The corpus contains 43 phonetically transcribed interviews, but only six of them were normalized manually. We use the interviewees' utterances from these six documents for our experiments. The selected texts originate from five dialect areas, covering approximately one third of the German-speaking part of Switzerland.

Slovene
Our Slovene dataset is based on the GOS corpus of spoken Slovene (Verdonik et al., 2013). The original corpus contains 115 hours of recordings. Two transcription layers are included: a manually transcribed phonetic layer and a semi-automatically normalized layer (with manual validation).
Since the degree of non-standardness in the full corpus was relatively low (16%), we select a subset of the data for our experiments. We retain speakers whose productions contain at least 30% non-standard tokens and who have produced at least 1000 words. This results in a set of 36 speakers from 10 dialect regions.

Preprocessing
To ensure the datasets are comparable, we have applied several preprocessing steps: removing punctuation and pause markers, substituting anonymized name tags with X, and excluding utterances consisting only of filler words. The Slovene data includes some utterances in Italian and German that are normalized to the corresponding standard language; these have been excluded from the data.
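The following is a rough sketch of such a preprocessing step; it is our own illustration, and the punctuation pattern, the anonymization tag format and the filler-word list are hypothetical placeholders rather than the actual conventions of the four corpora.

```python
import re

FILLER_WORDS = {"ee", "mm", "öö"}          # hypothetical filler tokens
NAME_TAG = re.compile(r"^\[name.*\]$")     # hypothetical anonymization tag format

def preprocess_utterance(tokens):
    """Clean one utterance (a list of tokens) along the lines described above."""
    cleaned = []
    for tok in tokens:
        if NAME_TAG.match(tok):
            cleaned.append("X")                # substitute anonymized name tags
            continue
        tok = re.sub(r"[^\w\s'-]", "", tok)    # strip punctuation and pause markers
        if tok:
            cleaned.append(tok)
    # Exclude utterances consisting only of filler words.
    if cleaned and all(t.lower() in FILLER_WORDS for t in cleaned):
        return []
    return cleaned
```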

Data splits
The Finnish and Norwegian datasets contain multiple speakers per location, which makes it possible to test the generalization capabilities of the models in different scenarios. We create three different data splits:
1. Normalizing unseen sentences of seen speakers. We divide each speaker's data in such a way that 80% of sentences are used for training, 10% for development and 10% for testing. The sentences are selected randomly.
2. Normalizing unseen speakers of seen dialects. We pick speakers from selected locations for the development and test sets, while the rest of the speakers are used for training. For each location, at least one speaker is present in the training set. In other words, 80% of speakers are used for training, 10% for development and 10% for testing.
3. Normalizing unseen dialects. All speakers from a given location are assigned to either the training, development or test set. In other words, 80% of locations are used for training, 10% for development and 10% for testing.
While previous work (Scherrer and Ljubešić, 2016; Partanen et al., 2019) mostly relies on split 1, this setup potentially overestimates the models' normalization capabilities: in a given conversation, utterances, phrases and words are often repeated, so that similar structures can occur in the training and test sets. We argue that splits 2 and 3 more realistically reflect dialectological fieldwork, where new texts are gradually added to the collection and need to be normalized.
The different data splits are visualized in Figure 1. For the smaller and less geographically diverse Swiss German and Slovene datasets, we only use split 1. For each split, we create three folds with random divisions into training, development and test sets. A simplified sketch of the splitting procedure is given below.
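The sketch below shows one possible implementation of splits 1 and 3 (split 2 is analogous, but operates at the speaker level and additionally keeps at least one speaker per location in the training set). It is our own simplification, assuming the corpus is available as a dict mapping (location, speaker) pairs to lists of sentence pairs; it is not the released data-preparation code.

```python
import random

def split_items(items, seed=0):
    """Shuffle deterministically and divide into 80% / 10% / 10%."""
    items = sorted(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    return items[:int(0.8 * n)], items[int(0.8 * n):int(0.9 * n)], items[int(0.9 * n):]

def split1(data):
    """Split 1: unseen sentences of seen speakers (80/10/10 per speaker)."""
    train, dev, test = [], [], []
    for sentences in data.values():
        tr, de, te = split_items(sentences)
        train += tr
        dev += de
        test += te
    return train, dev, test

def split3(data):
    """Split 3: unseen dialects (all speakers of a location go to the same set)."""
    tr_loc, de_loc, te_loc = (set(s) for s in split_items({loc for loc, _ in data}))
    def sentences_in(locations):
        return [s for (loc, _), sents in data.items() if loc in locations for s in sents]
    return sentences_in(tr_loc), sentences_in(de_loc), sentences_in(te_loc)
```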

Tokenization and context sizes
Text normalization is generally viewed as a character transduction problem (Wu et al., 2021), and it therefore seems most natural to use single characters as token units. Character tokenization has also been shown to work well for normalization tasks in various recent studies.
Most other character transduction problems, such as transliteration or morphological inflection, are modelled out of context, i.e., one word at a time. This assumption does not hold for text normalization: as shown in Table 1, between 5 and 10% of word types have more than one possible normalization, and disambiguating these requires access to the sentential context. Moreover, depending on the annotation scheme, there are sandhi phenomena at word boundaries (cf. the SKN example in Table 2: syänys instead of syänyt because of assimilation with the following s) that cannot be taken into account by models operating word by word. Thus, the most obvious strategy for text normalization is to consider entire sentences and break them up into single-character tokens.
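For illustration, this share of ambiguous word types can be estimated from a word-aligned corpus with a small helper like the following; this is our own sketch, not the analysis code behind Table 1.

```python
from collections import defaultdict

def ambiguous_type_ratio(token_pairs):
    """token_pairs: iterable of (dialect_word, normalized_word) token pairs."""
    normalizations = defaultdict(set)
    for src, tgt in token_pairs:
        normalizations[src].add(tgt)
    # Word types with more than one attested normalization are ambiguous.
    ambiguous = sum(len(targets) > 1 for targets in normalizations.values())
    return ambiguous / len(normalizations)
```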
This strategy of combining long contexts with short tokens leads to rather long token sequences, and NMT approaches in particular have been shown to underperform in such scenarios (Partanen et al., 2019). We include two alternative ways of addressing this issue: (1) shortening the instances from full sentences to sliding windows of three consecutive words, and (2) lengthening the tokens using subword segmentation.
Sliding windows. Partanen et al. (2019) propose to split each sentence into non-overlapping chunks of three consecutive words. We adapt this approach and use overlapping chunks of three words to ensure that the model always has access to exactly one context word on the left and one on the right. At prediction time, only the word in the middle of each chunk is considered.
The preprocessing needed to create input for both entire-sentence and sliding-window models is illustrated in Appendix D.
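The sketch below gives a rough impression of that preprocessing; it is our own illustration, not the released scripts. The word-boundary symbol _ and the sentence-boundary markers ^ and $ follow the Appendix D example, while the helper names are ours.

```python
def char_tokenize(words, boundary="_"):
    """['ùnd', 'dère'] -> '_ ù n d _ d è r e _' (space-separated character tokens)."""
    text = boundary + boundary.join(w.replace(" ", boundary) for w in words) + boundary
    return " ".join(text)

def sliding_windows(src_words, tgt_words, size=3):
    """Yield character-level (source, target) instances of `size` consecutive words.

    The two word lists are assumed to be aligned one-to-one; a single dialect
    word may map to a multi-word normalization (e.g. 'hani' -> 'habe ich').
    """
    assert len(src_words) == len(tgt_words)
    src = ["^"] + list(src_words) + ["$"]   # sentence-boundary markers
    tgt = ["^"] + list(tgt_words) + ["$"]
    for i in range(len(src) - size + 1):
        yield char_tokenize(src[i:i + size]), char_tokenize(tgt[i:i + size])

# Full-sentence models would instead use char_tokenize(src_words) and
# char_tokenize(tgt_words) once per sentence; at prediction time, only the
# middle word of each sliding-window output is kept.
```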
Subword segmentation. Tang et al. (2018) and Bawden et al. (2022) found that subword segmentation can outperform character-level segmentation on the task of historical text normalization. We follow this work and experiment with subword segmentation as well. Several segmentation schemes have been proposed for general machine translation, e.g. byte-pair encoding (BPE; Sennrich et al., 2016) or the unigram model (Kudo, 2018). Kanjirangat et al. (2023) found the unigram model to perform better than BPE on texts with inconsistent writing. This is the case for our speech transcriptions, and we thus opt for the unigram model. We train our segmentation models with the SentencePiece library (Kudo and Richardson, 2018) and optimize the vocabulary size separately for each dataset.
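As an example, such a segmentation model can be trained with SentencePiece's Python API roughly as follows; the file names and the vocabulary size are placeholders rather than values used in the paper.

```python
import sentencepiece as spm

# Train a unigram segmentation model on the source (dialect) side of the training
# set; an analogous model would be trained on the normalized target side.
spm.SentencePieceTrainer.train(
    input="train.src",            # one transcribed sentence per line (placeholder path)
    model_prefix="unigram_src",
    model_type="unigram",
    vocab_size=500,               # in practice tuned separately for each dataset
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="unigram_src.model")
print(sp.encode("ùnd dère hani a e nèèmaschine gliferet", out_type=str))
```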

Evaluation
We evaluate the models on word-level accuracy, i.e., the percentage of correctly normalized words. Since the reference normalizations are tied to the words in the source sentences, and since the models' output can differ in length from the source sentence, we need to re-align the model output with the reference normalization. We apply Levenshtein alignment to the entire sequence pair and split the system output at the characters aligned with word boundaries of the input (see Appendix D for an illustration).
Word-level accuracy lacks granularity and does not distinguish between normalizations that are only one character off and normalizations that are completely wrong. We therefore also report the character error rate (CER) as a more fine-grained metric. CER is defined as the Levenshtein distance between the system output and the reference, normalized by the length of the reference. Another advantage of CER is that it can be computed directly on sentence pairs without re-alignment. Following van der Goot et al. (2021), we also report error reduction rates in Appendix C.
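As a reference point, sentence-level CER can be computed in a few lines of Python; this is a minimal sketch rather than the evaluation script used in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution / match
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by the reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)
```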
We compare the systems to two baselines. The leave-as-is (LAI) baseline corresponds to the percentage of words that do not need to be modified. The most frequent replacement (MFR) baseline translates each word to its most frequent replacement seen in the training data and falls back to copying the input for unseen words.
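The two baselines can be sketched as follows (our own illustration; the data structures are assumptions):

```python
from collections import Counter, defaultdict

def train_mfr(training_pairs):
    """training_pairs: iterable of (dialect_word, normalized_word) token pairs."""
    counts = defaultdict(Counter)
    for src, tgt in training_pairs:
        counts[src][tgt] += 1
    # Map every seen dialect word to its most frequent normalization.
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}

def mfr_normalize(word, table):
    return table.get(word, word)          # fall back to copying unseen words

def lai_accuracy(test_pairs):
    """Share of test words that need no modification (the LAI baseline)."""
    test_pairs = list(test_pairs)
    return sum(src == tgt for src, tgt in test_pairs) / len(test_pairs)
```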

Methods and Tools
We use both statistical and neural machine translation tools trained from scratch, as well as a pre-trained multilingual model. We train (or fine-tune) the models for each of the four languages separately. Our hyperparameter choices largely follow recent related work on text normalization. The main characteristics of the models are summarized below, and a detailed description of the hyperparameters is given in Appendix A. The methods used in the experiments are:
SMT. Our statistical machine translation method corresponds mostly to the one implemented in the CSMTiser tool. It uses the Moses SMT toolkit (Koehn et al., 2007) with a 10-gram KenLM language model trained on the training sets. Scherrer (2023) found eflomal (Östling and Tiedemann, 2016) to produce better character alignments than the more commonly used GIZA++, and we adopt this method. Minimum error rate training (MERT) is used for tuning the model weights, with WER (word error rate, which effectively becomes character error rate in a character-level model) as the objective.
RNN-based NMT. This model uses a bidirectional LSTM encoder and a unidirectional LSTM decoder with two hidden layers each. The attention mechanism is reused for copy attention.
TF-based NMT. This model has 6 Transformer layers in the encoder and the decoder, with 8 heads each. We found in preliminary experiments that position representation clipping was beneficial to the results. All NMT models are trained with the OpenNMT-py toolkit (Klein et al., 2017).

Results and Discussion
We evaluate the models presented in Section 5 with the metrics described in Section 4.3. We run the models on the three folds of each data split, and present the average scores and standard deviations for each metric.
The word-level accuracies are presented in Table 3 and the character error rates in Table 4. We also provide accuracy scores for the development sets in Appendix C. Transformer-based methods appear to be the most robust: the Transformer trained on sliding windows is best for SKN, ArchiMob and GOS, while the Transformer-based byT5 produces the best results for Norwegian. For all other corpora, byT5 is a close second in accuracy and is thus the best alternative for entire sentences.
While the CSMT and RNN-based methods do not yield the best score for any dataset, they still perform reasonably well. For the NMT models, using the sliding window instead of full sentences improves the results for all corpora but NDC. Regarding tokenization, subwords improve the performance on the large corpora (SKN and NDC) but worsen it on the small corpora (ArchiMob and GOS). For all datasets, the best results are obtained at the character (or byte) level.
We expected rising levels of difficulty from data split 1 to data split 3, but neither the baselines nor the model outputs confirmed our expectations. The differences between splits are very small, and most models achieve their lowest results on split 2.
The character error rates presented in Table 4 follow the same pattern as the word accuracies when it comes to the best models. However, for SKN, byT5 drops below the sliding-window RNN and SMT models on this metric. Some of the Finnish byT5 models generate much shorter predictions than the other models, but it remains to be investigated why this occurs and why it only affects some training runs. Table 4 also highlights the poor performance and large standard deviation of the sentence-level RNN on GOS. This is in line with earlier findings about neural models' tendency to overmodify the predictions (Bawden et al., 2022).

Comparison with previous work
Partanen et al. (2019) worked on the normalization of the Finnish SKN dataset, reporting word error rate (WER) as their main metric. Although they use the detailed SKN transcriptions instead of the simplified ones, their results are roughly comparable with our SKN1 data split (see Table 5). While they were not able to successfully train sentence-level models, our parameterization closes the gap to the chunk models. Their best reported word error rates are, however, slightly lower than ours. Scherrer and Ljubešić (2016) presented normalization experiments on the ArchiMob corpus. Our results are largely comparable to theirs; a detailed comparison is provided in Appendix B.

Error analysis
We examine the effects of the different data splits by looking at the output of the sliding-window Transformer on splits 1 and 3 of the Finnish SKN corpus. As a reminder, the test set in split 1 contains unseen sentences from seen texts (and therefore seen dialects), whereas in split 3 it comes from unseen locations. It can be expected (1) that the model trained on SKN1 performs better on dialect-specific phenomena, such as normalizations involving diphthongs, consonant grade and inflection marks, and (2) that the two models behave similarly on phenomena that are not dialect-specific, such as capitalization and proper names.
We analyze the model outputs on the sentences that appear in both test sets and focus on words for which at least one model produced an erroneous normalization. We identify 382 such cases. On this set of words, the SKN3 model produces a much higher number of errors (318) than the SKN1 model (190). The higher number of errors on SKN3 is coherent with the intuition that this split is more difficult, but the nature of the errors produced by the two models does not fully conform to our expectations (see Table 6). In absolute terms, the model trained on SKN3 does produce more inflection and consonant grade errors, but fewer diphthong errors. In relative terms, the SKN3 model produces a lower percentage of inflection errors than its counterpart (40% vs. 46%). This seems to indicate that split 3 does not preclude the model from learning dialect-related patterns. We hypothesize that this is because the training set contains material from the same dialect area as the test set (although not from the exact same location).
To identify the critical point beyond which cross-lectal model performance would be clearly affected, it could be useful to introduce a fourth data split which excludes larger dialect areas from the training set and tests on unseen dialects.

Comparison of normalization tasks
As mentioned above, dialect-to-standard normalization shares fundamental properties with historical text normalization and UGC normalization. Here, we compare the respective difficulties of these three tasks (see Table 7). Dialect normalization has the lowest LAI rates on average and thus requires the most changes of the three tasks. The models perform roughly equally well on historical and dialectal normalization, whereas UGC normalization seems to be a more difficult task.

Conclusions
In this paper, we present the dialect-to-standard normalization task as a distinct task alongside historical text normalization and UGC normalization. We introduce a dialect normalization dataset containing four languages from three different language branches, and use it to evaluate various statistical, neural and pre-trained neural sequence-to-sequence models.
In our base setup with models trained on entire sentences with character (or byte) segmentation, the pre-trained byT5 model performs best for all languages and data splits. Moving from character segmentation to subword segmentation increases the accuracies for the large datasets (SKN and NDC), but not enough to surpass byT5. In contrast, the sliding window approach outperforms byT5 on all languages except Norwegian. The superior performance of byT5 on Norwegian cannot be directly explained by the amount of training data, but it is likely that the closely related languages Swedish and Danish enhance its performance. A further analysis of character error rates shows that the neural models sometimes produce very poor predictions, which is not visible when using accuracy as a metric.
In this work, we have evaluated the most common and most popular model architectures, but it would be interesting to test model architectures specifically designed for character transduction tasks, e.g. models that put a monotonicity constraint on the attention mechanism (Wu and Cotterell, 2019; Rios et al., 2021). We defer this to future work.
Another point to be investigated in future work is data efficiency. Our training sets are relatively large in comparison with other character transduction tasks, and it would be useful to see how much the data requirements can be reduced without significantly affecting the normalization accuracy.

Limitations
We see the following limitations of our work:
• The proposed multilingual dataset is biased towards European languages and European dialectal practices. It may therefore not generalize well to the types of dialectal variation present in other parts of the world and to transcriptions in non-Latin scripts. In particular, there is an extensive amount of research on the normalization of Arabic and Japanese dialects (e.g., Abe et al., 2018; Eryani et al., 2020). We address some of these issues in Scherrer et al. (2023).
• We voluntarily restrict our dataset to "clean" corpora, i.e., interviews transcribed and normalized by trained experts. This contrasts with other data collections specifically aimed at extracting dialectal content from social media (e.g., Ueberwasser and Stark, 2017; Mubarak, 2018; Barnes et al., 2021; Kuparinen, 2023). Such datasets compound the features and challenges of both dialect-to-standard normalization and UGC normalization.
• We did not perform extensive hyperparameter tuning in our experiments, but rather used settings that have performed well in other normalization tasks. It is therefore conceivable that the performance of the NMT models in particular could be improved. Furthermore, specific model architectures for character transduction tasks have been proposed, e.g. constraining the attention to be monotonic (Wu and Cotterell, 2019; Rios et al., 2021). We did not include such architectures in our experiments since they generally only showed marginal improvements.

Ethics statement
All our experiments are based on publicly available datasets that were costly to produce. It is important to ensure that these are appropriately acknowledged. Anybody wishing to use our dataset will also need to cite the publications of the original datasets. Details are given on the resource download page: https://github.com/Helsinki-NLP/dialect-to-standard.
The datasets have been anonymized where necessary. Text normalization is explicitly mentioned as a possible research task in the literature presenting the ArchiMob corpus (Samardžić et al., 2016; Scherrer and Ljubešić, 2016), and the SKN dataset has been previously used to evaluate normalization models. We are not aware of any malicious or harmful uses of the proposed dialect-to-standard normalization models.

A Experimental details
We trained all neural models on a single NVIDIA V100 GPU. The SMT models were trained on a Xeon Gold 6230 CPU. Table 8 presents the average training time and number of parameters for a single fold of the largest of our datasets, the Norwegian NDC.

C Additional results
The error reduction rate (ERR) follows van der Goot (2019) and represents, roughly speaking, the improvement of a model relative to the LAI baseline. It thus makes it easier to compare models across datasets, which may not be the case with accuracy due to different LAI values. ERR is defined as follows:

$$\mathrm{ERR} = \frac{\mathrm{accuracy} - \mathrm{LAI}}{1.0 - \mathrm{LAI}}$$

By and large, these results follow the same pattern as the accuracy and CER scores reported in Section 6.
Table 11 shows the accuracies on the development sets. They are comparable with the test set results.

Figure 1: The three data splits (SKN1 on the left, SKN2 in the center, SKN3 on the right) visualized on a subset of the Finnish dataset. Each dot represents one speaker. There are generally two speakers per location in SKN.

[Figure 2 content: the example sentence in verticalized format (one word per line), glossed 'and I also delivered a sewing machine to her'; the full-sentence training instance pairing the character-tokenized original (ùnd dère hani a e nèèmaschine gliferet) with its normalization (und dieser habe ich auch eine nähmaschine geliefert); and the sliding-window training instances, i.e., overlapping three-word chunks of the same sentence pair, one instance per source word in the verticalized data.]

Figure 2: Data preprocessing for full sentence and sliding window models, illustrated on an example of the Swiss German ArchiMob corpus.

Table 2: Normalization examples of the four languages. The top row presents the original phonetic transcription, the middle row the normalized version, and the bottom row provides an English gloss.

Table 3: Word-level accuracy (↑). We report averages and standard deviations over the three folds of each data split.

Table 6: Comparison of error types between the models trained on SKN1 and SKN3.

Table 7: LAI ranges and ERR ranges of the best systems reported in Bollmann (2019) and van der Goot et al. (2021).

Table 8: Training runtime (average) and number of parameters for a single character-level NDC model.