Dialect Transfer for Swiss German Speech Translation

This paper investigates the challenges in building Swiss German speech translation systems, specifically focusing on the impact of dialect diversity and differences between Swiss German and Standard German. Swiss German is a spoken language with no formal writing system; it comprises many diverse dialects and is a low-resource language with only around 5 million speakers. The study is guided by two key research questions: how does the inclusion and exclusion of dialects during the training of speech translation models for Swiss German impact the performance on specific dialects, and how do the differences between Swiss German and Standard German impact the performance of the systems? We show that dialect diversity and linguistic differences pose significant challenges to Swiss German speech translation, which is in line with linguistic hypotheses derived from empirical investigations.


Introduction
There are three main challenges when building Swiss German speech-to-text systems. First, Swiss German is a spoken language with no formal writing system. Thus, the task is formulated as the translation of Swiss German audio into Standard German text (Plüss et al., 2021) without access to an intermediate textual representation of the source language. This leads to difficulties where Swiss German and Standard German differ, e.g. in the usage of tenses (where Swiss German does not use preterite) or lexical items that are distinct between the two languages. The second challenge is that Swiss German consists of many dialects that tend to differ. This issue makes training a speech translation system from Swiss German to Standard German a task of translating many dialects into a single language. The third challenge is that Swiss German is a low-resource language with only around 5 million speakers. Furthermore, the fact that some dialects differ significantly yields even fewer speakers for each dialect (e.g., Valais only has about 80K speakers).
These three challenges motivate the investigation of the following two questions:
• How does the inclusion and exclusion of dialects during the training of speech translation (ST) models for Swiss German impact the performance on specific dialects?
• Do the differences between Swiss German and Standard German negatively impact the performance of the ST systems?
The first question investigates the issue of the diversity of dialects in a low-resource setting, while the second question investigates the differences between Swiss German and Standard German.
Contributions. To answer these questions, we first review the Swiss German dialect landscape and the differences to Standard German. In particular, we devise a set of hypotheses from the literature stating which dialects are expected to differ from each other and which Swiss German phenomena are expected to impact the performance. We then investigate empirically whether the hypotheses match the results of training ST models in various settings.
Our findings show that the empirical results follow the linguistic investigations. That is, there are dialects that we expect to differ significantly from others, which is confirmed by our empirical results. Furthermore, the differences between Swiss German and Standard German impact the performance, where the past tense has the highest impact.
2 Swiss German Dialects

2.1 Linguistic background
"Swiss German" commonly refers to the dialects spoken in German-speaking Switzerland. All Swiss German dialects are Alemannic dialects¹ belonging to the High German languages within the West Germanic language family. The Alemannic dialects in Switzerland can be further split into High Alemannic and Highest Alemannic variants².
The sociolinguistic situation in German-speaking Switzerland is particular: unlike in other linguistic areas, high prestige is associated with dialects (Ender and Kaiser, 2009), which are considered important markers of regional identity. This means that Swiss German dialects are spoken in most everyday situations, including formal ones. They are, however, only written in informal contexts, such as social media/text messages, and sometimes also for advertisement purposes. On the other hand, formal correspondence, laws, and newspapers are always written in Standard German. The Standard German of Switzerland differs from the varieties in Germany and Austria and is therefore often referred to as "Swiss Standard German."

Differences between Swiss German and Standard German
Alemannic dialects (AL), including Swiss German, differ from Standard German (DE) in several ways such as:
Phonology: Middle German monophthongs and diphthongs are preserved in Swiss German (AL "huus" vs DE "Haus" (house) and AL "[liəb]" vs DE "[liːb]" (dear)) and most dialects have completed the High German consonant shift (AL "[ˈxaʃtə]" vs DE "[ˈkastn]" (box)). Stress is placed on the first syllable of a word in Swiss German more often than in DE.
Grammar: AL nouns have no genitive case but rather use possessive constructions ("s huus vom buur" - "the house of the farmer" and "im buur sis huus" - "the farmer his house") and the accusative and nominative cases are conflated (except for personal pronouns). The verbal system has no preterite tense; perfect constructs are used instead. Relative clauses always use the particle "wo", and some verbs reduplicate in present infinite form when forming a complex predicate with another verb (e.g. "du lohsch mi lo ässe" - "you let me [let] eat").
Lexis: There are a rather large number of Swiss German / Alemannic vocabulary items which are not intelligible to speakers of Standard German. In Swiss German, some of these originate from French (e.g., "trottoir" for pavement). Further, a highly typical Swiss German word formation process is "-li" suffixation (e.g., "Hündli" - little dog).
¹ Except for the dialect of Samnaun in Grisons, which is a variant of the Bavarian language.
² The traditional dialect of Basel City has many features of Low Alemannic German, but due to internal migration, these have mostly leveled and are now close to High Alemannic.
In this work, we will investigate the impact of two differences between Swiss German and Standard German: first, preterite tense, which does not exist in Swiss German (Section 6.1), and second, Swiss German vocabulary items that are expected to notably differ from Standard German (Section 6.2).

Dialect regions and dialect differences
The STT4SG-350 speech translation corpus (Plüss et al., 2023a), which we use for the experiments reported in this paper, defines seven dialect regions. They correspond to Swiss cantons as follows:
• Basel (BS): Basel-City, Basel-Country and parts of Aargau
Figure 1 visualizes these seven dialect regions. We will investigate if similarities between the dialect regions impact ST performance. To quantify the differences between the dialects, we use dialectometric data from the Dialect Atlas of German-Speaking Switzerland (DAGSS; Hotzenköcherle et al., 1962-1997) presented by Scherrer and Stoeckle (2016). The digitised DAGSS data set, version 3 (DDAGSS), consists of 289 linguistic features (107 phonological, 118 morphosyntactic, and 64 lexical features) collected from local respondents in a field survey across 565 Swiss locations.
We use the DDAGSS features to calculate linguistic distance indices based on the Relative Identity Value (RIV) metric (Goebl, 2010). To apply RIV to the STT4SG-350 data, we match the DDAGSS survey sites to our seven dialect regions, then calculate RIV for all site pairs⁴ and average this for all region pairs. The result when using all DDAGSS features⁵ is shown in Figure 2. From the pairwise distances, we derive the following hypotheses:
• VS-hyp. VS is the most distant dialect from other dialect regions, which corresponds well with local perceptions. Thus, we expect that systems that are trained only on VS perform badly on other dialects, and systems not trained on VS will not perform well on VS.
⁴ Specifically, we use the following process to calculate RIV: for two survey sites, we first identify all features for which they both have a response (total). We then count all features where the responses are not identical (different). We finally calculate the distance metric as RIV = different / total.
⁵ In Appendix C, we display the distance matrices per feature group.
(ASR) model to different English accents leads to faster training and higher accuracy than a model trained from scratch. Transfer learning from English ASR to different languages such as German, Spanish, and Russian also leads to a higher accuracy than training from scratch (Luo et al., 2021). Woldemariam (2020) studied transfer learning from English and Mandarin for low-resource speech recognition in the case of Amharic. For both pre-training languages, fine-tuning ASR for Amharic led to an absolute WER reduction of 14.22% and 10.25%, respectively. In our work, we extend the research on transfer learning for different dialects from the ASR set to a speech translation task.
Swiss German Speech Translation. Recent high quality datasets such as SwissDial (Dogan-Schönberger et al., 2021), SPC (Plüss et al., 2021), and SDS-200 (Plüss et al., 2022) enabled notable advancements in Swiss German Speech Translation (ST). These datasets successfully cover a wide range of Swiss German dialects. However, we are not aware of any comprehensive investigation into the interactions between these dialects in ASR or ST.
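As an illustration, the RIV-based region distance described above (site-level RIV = different / total, averaged over all site pairs of two regions) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used for Figure 2: the function names, the dictionary layout of a survey site, and the toy feature values are all hypothetical.

```python
from itertools import product
from statistics import mean


def riv_distance(site_a, site_b):
    """Site-level distance: RIV = different / total.

    Each site is a dict {feature_name: response}. `total` counts the features
    for which both sites have a response, `different` counts those whose
    responses are not identical. Returns None if no feature is shared.
    """
    shared = [
        f for f in site_a
        if f in site_b and site_a[f] is not None and site_b[f] is not None
    ]
    if not shared:
        return None
    different = sum(1 for f in shared if site_a[f] != site_b[f])
    return different / len(shared)


def region_distance(sites_a, sites_b):
    """Average the site-level RIV over all site pairs of two different regions."""
    distances = [riv_distance(a, b) for a, b in product(sites_a, sites_b)]
    distances = [d for d in distances if d is not None]
    return mean(distances) if distances else None


# Hypothetical toy data: two survey sites in one region, one site in another.
region_x = [{"f1": "a", "f2": "x", "f3": "p"}, {"f1": "a", "f2": "y"}]
region_y = [{"f1": "b", "f2": "x", "f3": "p"}]

print(region_distance(region_x, region_y))
```

In the toy data, the first site pair differs in one of three shared features and the second pair in two of two, so the region distance is the mean of 1/3 and 1.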

Data
For this investigation, we rely on the STT4SG-350 corpus (Plüss et al., 2023a), which consists of 343 hours of speech translation data. That is, it contains audio in Swiss German and the corresponding Standard German sentence. The sentences were presented to speakers who were asked to translate them to their dialect (without writing down the translation) and record them. One effect of this setting is that speakers may use synonyms and different tenses, word order, etc. when saying the sentence aloud in their dialect, leading to a deviation from the reference sentence. The corpus consists of a train and test set. The particular setting of the test set is that it contains 5 hours of audio for each of the seven dialect regions (cp. Section 2.3) based on the same set of sentences. This allows for conclusive comparisons of dialect transfer effects. STT4SG-350 also provides a balanced training split with 34 hours for each dialect region (see Table 1). In the train set, the sentences are diverse across the various regions. Furthermore, STT4SG-350 is also balanced with respect to gender and age, achieving an almost uniform distribution across these dimensions.

Models
For the subsequent experiments, we used three models to ensure that the results generalize to various architectures.
XLS-R. The first model is a pre-trained XLS-R-300M model (Babu et al., 2021). We initially conducted two different XLS-R-300M experiments, with CTC decoding and with a sequence-to-sequence architecture. The experiments reported here used the CTC architecture, as it performs better, and the sequence-to-sequence results do not provide any additional insights. A comparison of CTC and sequence-to-sequence results is reported in Appendix B.
Trafo. The second model is a randomly initialized transformer model implemented in the FAIRSEQ S2T library (Ott et al., 2019) that replicates the model of Plüss et al. (2022), which we trained from scratch on our data.
Whisper. The third model is Whisper small (Radford et al., 2022), a pre-trained transformer-based sequence-to-sequence model.
For the investigation, we relied on medium-sized models: the XLS-R model has 317M parameters, the Trafo model 72M parameters, and Whisper small 244M parameters. We avoid larger models as they make the experiments prohibitively time- and cost-intensive.

Transfer Experiments
We are interested in the interplay of the dialects: which dialects benefit from other dialects being in the training data, and are there dialects where specialized fine-tuning is more adequate? For this, we run two experiments: 1) Leave-One-Out (LOO), where we train the models on all dialects except one and measure the impact on the performance of the left-out dialect as well as on the other dialects, 2) Single-Dialect (SD), where we fine-tune the models on a single dialect and measure the impact on the performance of the various dialects.
For both experiments, we compare the results to the All-dialects setting, which consists in training the model on the full STT4SG-350 corpus, i.e., on all dialects. For all experiments, we report the BLEU score (Papineni et al., 2002), which is calculated using the script provided by Plüss et al. (2023b). To analyse the difference between All-dialects and the different settings, we divide the BLEU score achieved by the specific experiment by the BLEU score achieved by the All-dialects setting (retainment ratio). We visualise this performance retainment using heatmaps, which show the share of performance retained on the j-th dialect (column) when leaving out the i-th dialect (row). Thus, a value of 0.9 is to be read as the model achieving 90% of the BLEU score that All-dialects achieved. The average at the end of each row indicates the average retainment of the other dialects (i.e., excluding the value of the diagonal), which summarizes the influence of one dialect on other dialects. The absolute values are provided in Table 2.
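The retainment ratio and the row average of the heatmaps can be sketched as below. This is a minimal illustration under assumptions: the function names are hypothetical, and the dialect labels "A", "B", "C" and all BLEU values are made up for the example, not taken from Table 2.

```python
def retainment_matrix(all_dialects_bleu, experiment_bleu):
    """Retainment ratio per (row, column) cell.

    all_dialects_bleu: {dialect: BLEU of the All-dialects model on that dialect}
    experiment_bleu:   {row_dialect: {dialect: BLEU of the row's model}}
    A cell value of 0.9 means the row's model reaches 90% of the
    All-dialects BLEU score on the column dialect.
    """
    return {
        row: {d: scores[d] / all_dialects_bleu[d] for d in all_dialects_bleu}
        for row, scores in experiment_bleu.items()
    }


def row_average(matrix, row):
    """Average retainment on the other dialects (the diagonal is excluded)."""
    others = [ratio for dialect, ratio in matrix[row].items() if dialect != row]
    return sum(others) / len(others)


# Hypothetical BLEU scores for three placeholder dialects.
all_bleu = {"A": 60.0, "B": 50.0, "C": 80.0}
loo_bleu = {
    "A": {"A": 48.0, "B": 50.0, "C": 76.0},  # model trained without A
    "B": {"A": 57.0, "B": 45.0, "C": 80.0},  # model trained without B
}

ratios = retainment_matrix(all_bleu, loo_bleu)
print(ratios["A"]["A"], row_average(ratios, "A"))
```

Here the model trained without A retains 80% of the All-dialects score on A itself, while the other two dialects retain 97.5% on average.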

Leave-One-Out
The leave-one-out (LOO) experiment fine-tunes the models with the data of all dialects while leaving one dialect out. This tests the impact one dialect has on the others through ablation. For example, when BS is left out, the BLEU score on the BS test set is 67.7 (Table 2). It is apparent (and expected) that leaving out a dialect leads to lower performance on the left-out dialect compared to using the full dataset. For a deeper analysis of the dependence of the dialects, Figure 3 shows the retainment ratios.
Table 2: BLEU scores of all experiments. Each experiment is a tuple of model and experiment type. There are three types: All-dialects, Leave-One-Out, and Single Dialect. For each setting, we report the score achieved on each dialect separately, and the average score achieved on the full test set.
We make the following observations:
VS needs in-dialect data. The strongest reduction is measured when leaving out the VS dialect. The drop is consistent across all three models. For XLS-R, the BLEU score only retains 87% of the performance. Trafo only retains 78% of the original BLEU score, and Whisper retains 83% of the All-dialects score. This drop was expected, as the VS dialect is the one that differs the most from all the other dialects (cp. Figure 2).
ZH and CS do not require much in-dialect data. Other dialects, such as ZH or CS, experience a smaller reduction. In fact, with XLS-R, both dialects almost completely retain their performance, and with Trafo, they retain 92% and 94% of the performance on average, respectively. This finding matches the CS/ZH-central-hyp hypothesis since ZH and CS benefit the most from other dialects.
Mutual dependence. We note that when one dialect is left out with XLS-R, the other dialects do not suffer a performance loss. In some cases, the performance even exceeds the All-dialects setting, with ratios above 1. This is due to XLS-R's extensive pre-training, which allows it to adapt to new languages with relatively small amounts of data. Thus, the results with Trafo are more revealing.
When VS is left out, the other dialects suffer almost no deterioration, indicating the special status of this dialect, which is in line with the VS-hyp hypothesis. For the other dialects, omitting them leads to a greater performance loss on the other dialects. The largest overall drop in performance occurs when BS or ZH are left out, where the average retainment ratio of the other dialects is at 92%. For BS, this drop is not expected, as BS does not have as high pairwise similarities as ZH or CS. For ZH, the decrease is consistent with the centrality hypothesis of ZH. However, there are no pairs of dialects where omitting them leads to a loss of more than 10% of the original performance.
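Constructing the LOO training sets amounts to filtering the training samples by dialect region. The following is a small sketch under assumptions: the sample layout (a dict with `id` and `dialect` keys) and the function name are hypothetical, and only the region codes come from the text.

```python
# Region codes as used in this paper.
DIALECTS = ["BS", "BE", "CS", "ES", "GR", "VS", "ZH"]


def leave_one_out_splits(samples, dialects=DIALECTS):
    """Yield (left_out_dialect, training_subset) for every dialect region.

    Each training subset contains all samples except those of the
    left-out dialect.
    """
    for left_out in dialects:
        yield left_out, [s for s in samples if s["dialect"] != left_out]


# Hypothetical toy corpus with one sample per region.
toy_corpus = [{"id": i, "dialect": d} for i, d in enumerate(DIALECTS)]

splits = dict(leave_one_out_splits(toy_corpus))
print(len(splits["VS"]))
```

The Single-Dialect setting is the complement of this filter: it keeps only the samples of one region.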

Single Dialect Fine-Tuning
The single dialect fine-tuning (SD) experiment fine-tunes the models on a single dialect and measures the effect on all dialects. Overall, the experiments confirm the hypotheses that we derived from the dialect distances based on the DDAGSS data. Most notably, the difference of VS to all the other dialects is as pronounced as the hypotheses predicted. Also, the status of ZH as a central dialect is confirmed by our experiments. On the model side, we note that both XLS-R and Whisper have a strong pre-training, where leaving out one dialect does not hurt performance. On the other hand, training only on a single dialect leads to subpar performance on samples of a different dialect.

Swiss German differences to Standard German
We investigate two main differences between Swiss German and Standard German (cp. Section 2.2). First, the usage of the past tense, i.e., samples where the Standard German text includes the preterite (simple past), which is not used in Swiss German. Second, the usage of words which are different in Swiss German compared to Standard German.

Preterite
We expect to see a mismatch between the transcript and the reference in those samples where the Standard German reference contains the simple past tense. We applied spaCy to find samples in the test set where the preterite tense is used in the Standard German text. There are 5908 samples containing preterite tense (23.4%). We used the All-dialects model to compute the BLEU scores separately for the samples with past tense and those without past tense. Figure 5 shows the results. Samples with preterite tense in the Standard German text perform significantly worse than the ones without (mean BLEU of 0.625 vs. 0.722).
Qualitative Analysis. A qualitative analysis revealed two types of errors among the samples containing preterite tense in the target. The first category of mistakes is those that could not be traced back to the use of past tense, i.e., those where other mistakes yielded a lower BLEU score.
In the second category, the mistakes are due to the past tense. There, we noticed that the generated text often did not match the tense of the reference. Thus, following the idea of paraphrasing to extend the set of references (Paonessa et al., 2023), we extended the set of references for all the past tense samples by translating them into the past perfect form. For this, we used ChatGPT. In Figure 5, the impact is shown (Past Tense + Reference). Extending the references yields an improvement of almost 5 points in BLEU score, i.e., there are 1197 out of 5908 samples where the BLEU score improves when adding the rewritten reference.
This result shows that measuring the correctness of samples with past tense is not trivial since there are two cases to consider: 1) the transcript is correct but uses the past perfect form, which does not match the reference; 2) the transcript is wrong due to difficulties handling the past tense. In Table 3, we show some examples. In the first example, the hypothesis uses the past perfect form correctly, while the target uses the past perfect. In the second example, which is in the passive voice, the hypothesis uses the present perfect but omits the participle. In the last example, the hypothesis uses the simple past while the target uses the past perfect. These examples illustrate the difficulty of measuring the effects of past tense.
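The effect of adding a rewritten reference can be illustrated with a small sketch. It rests on several assumptions: a hypothesis is scored against each reference separately and the best score is kept (the BLEU script used in this paper may handle multiple references differently), the scorer is a toy unigram F1 that merely stands in for sentence-level BLEU, and the German sentences and all function names are made up for the example.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Toy sentence-level score (unigram F1). A stand-in, NOT BLEU."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_reference_score(hypothesis, references, score_fn=unigram_f1):
    """Score against every reference and keep the best score (assumption)."""
    return max(score_fn(hypothesis, ref) for ref in references)


def count_improved(samples, score_fn=unigram_f1):
    """Count samples whose score improves once the rewritten reference is added."""
    improved = 0
    for s in samples:
        base = score_fn(s["hypothesis"], s["reference"])
        extended = best_reference_score(
            s["hypothesis"], [s["reference"], s["rewritten"]], score_fn
        )
        if extended > base:
            improved += 1
    return improved


# Hypothetical samples: preterite reference, rewritten reference, system output.
toy_samples = [
    {"reference": "er ging nach hause",
     "rewritten": "er ist nach hause gegangen",
     "hypothesis": "er ist nach hause gegangen"},
    {"reference": "sie kam zu spät",
     "rewritten": "sie ist zu spät gekommen",
     "hypothesis": "sie kam zu spät"},
]

print(count_improved(toy_samples))
```

Only the first toy sample improves: its hypothesis matches the rewritten reference exactly, whereas the second hypothesis already matches the original reference.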

Vocabulary
We collect lists of words where Swiss German and Standard German are likely to differ.We then measure the differences in BLEU scores between samples that contain such words and those that do not.To our knowledge, there is no inventory of Standard German words which are realised differently in Swiss German dialects.Therefore, we use three data sources as a proxy and apply (semi-)manual filtering to each.
2) The Wikipedia entry on Helvetisms contains 503 entries representing lexical items specific to Swiss (Standard) German. We remove proverbs and obsolete expressions and then manually filter the remaining entries: we try to intuit whether a given Standard German word is usually realised in Swiss German with a different lexeme (because the Standard German lexeme does not exist or is rare/dispreferred). This is a subjective process that is limited by the evaluator's knowledge and perception of the different dialects. After filtering, 262 words remain.
3) The GSWNorm 2022 Shared Task (von Däniken et al., 2022) presented a data set of 7284 Swiss German sentences (106K tokens / 20K types) with word-level normalisation into Standard German. In order to reduce the set of types to be filtered manually, we apply a heuristic: we keep only those pairs where the first letters of the two words are not identical, leading to 2569 candidate types. We then filter these in the same way as Wikipedia Helvetisms, resulting in a final list of 267 items.
We combine these three word lists and after eliminating duplicates between the lists, 522 vocabulary items remain. There are 2975 samples (12.1%) that contain at least one special vocabulary item.
Results. Figure 6 shows the difference in BLEU score between those samples without special vocabulary (No Special), the samples with special vocabulary (Special), and All-dialects (All). The difference in BLEU score is large (66.13 vs. 70.39). However, as with the Past Tense, the source of the BLEU difference is unclear, i.e., we do not know whether the difference stems from the transcript being wrong or the hypothesis using a synonym not covered by the reference (such synonyms may be introduced during translation and recording, see Section 4.1). To illustrate this issue, Table 4 shows examples of mismatches between the transcript and the target. In most examples, the Swiss version of the word was used. For instance, Swiss German "wie schaut es aus", vs. Standard German "wie sieht es aus". Thus, a fair amount of the mismatch in terms of BLEU score can likely be attributed to using the Standard German version in the target.
Figure 6: Comparison of distributions of sentence level BLEU scores for sentences containing words of special vocabulary. The differences between "No Special" and "Special" are significant according to Welch's t-test (p=5.3e-13).
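The first-letter heuristic used to pre-filter the GSWNorm word pairs, and the subsequent flagging of samples that contain a special vocabulary item, can be sketched as follows. This is a sketch under assumptions: the function names are hypothetical, the comparison is case-insensitive, the sample matching is done on whitespace tokens without lemmatization, and the word pairs are illustrative rather than taken from the actual word lists.

```python
def differs_in_first_letter(swiss_word, standard_word):
    """Heuristic: keep a pair only if the first letters are not identical."""
    return swiss_word[:1].lower() != standard_word[:1].lower()


def candidate_pairs(pairs):
    """Reduce (Swiss German, Standard German) pairs to candidates for manual filtering."""
    return [(gsw, de) for gsw, de in pairs if differs_in_first_letter(gsw, de)]


def contains_special_vocabulary(sentence, word_list):
    """Flag a sentence that contains at least one word from the word list."""
    return any(token in word_list for token in sentence.lower().split())


# Illustrative word pairs (not from the GSWNorm data).
pairs = [
    ("huus", "haus"),          # same first letter: dropped by the heuristic
    ("trottoir", "gehsteig"),  # different first letter: kept
    ("velo", "fahrrad"),       # different first letter: kept
]

special_words = {de for _, de in candidate_pairs(pairs)}
print(candidate_pairs(pairs))
print(contains_special_vocabulary("das fahrrad steht vor dem haus", special_words))
```

The heuristic only reduces the manual workload: pairs with the same first letter are dropped regardless of whether the lexemes are actually related.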

Conclusion
Our findings show that the empirical results are consistent with the linguistic distances between the dialects and the differences between Swiss German and Standard German.For example, dialects similar to other dialects (such as ZH) positively affect others when included in the training data.In contrast, dialects that differ from others (such as VS) need in-dialect data to perform well.The usage of preterite and vocabulary specific to Swiss German impact the BLEU score negatively.Thus, future work is concerned with investigating methods dealing with more efficient dialect transfer and handling the differences between Swiss German and Standard German.

Limitations
No Learning Curve. In our experiments, we either used the full set of dialect data (SD) or none of the dialect data (LOO). A more fine-grained analysis would include computing learning curves to understand how much "new" dialect data is needed. This would also give more hints on the interplay between dialects. However, these experiments would be very costly to run.
No Tests on very large SOTA models. We limited our experiments to models with around 300M parameters and not very large billion-parameter models. There are two reasons: first, the training time and cost would have become prohibitively large, and second, as shown in Appendix A, the insights are expected to be largely the same.
Error Attribution Past Tense. Measuring the mistakes caused by the past tense is not trivial since the mistakes could have two main sources: 1) a non-tense-related error or 2) a tense-related error. For the latter, there are again two subcases: 1) the model could not handle the tense and made mistakes due to that (e.g., second example in Table 3), 2) the model behaves well, but the reference does not match the tense generated by the model (which could be caused by the translation and recording process, see Section 4.1). Thus, we cannot measure how many errors are due to which of the above error types, only the impact on the BLEU score.
Error Attribution Vocabulary. Similarly, for the special vocabulary, we can only measure the impact on the BLEU score. The errors could also be due to a non-vocabulary-related source. If they are caused by vocabulary, it is still not clear whether the error stems from using a wrong word or a synonym that is not covered by the reference.
Word List Subjectivity. The creation of the word list was done mostly ad-hoc and based on the subjective interpretations of the word-list creator. Furthermore, we did not differentiate between dialects and some dialects may use vocabulary similar to Standard German.

B Comparison of XLS-R 300M with CTC vs. sequence-to-sequence decoder
We executed all XLS-R-300M experiments with two different decoders: the standard CTC decoder and a sequence-to-sequence (Seq2Seq) decoder.
Comparison of two XLS-R models: XLS-R with CTC decoding and XLS-R with Seq2Seq head.

C Linguistic Distance Matrices by Feature Type
In Figures 8a, 8b and 8c we show the linguistic feature matrices by feature type. While Figure 2 shows the Relative Identity Values (RIV) across all 289 features of the DDAGSS, they are presented individually here: Figure 8a shows the values based on the 107 phonological features, Figure 8b based on the 118 morphosyntactic features and Figure 8c based on 64 lexical features. We can see that VS is the most distinct dialect across all feature types, followed by GR, BE and ES in an order that changes depending on the feature.

Figure 8: (a) Matrix of linguistic distance between dialect regions: phonological features. (b) Matrix of linguistic distance between dialect regions: morphosyntactic features. (c) Matrix of linguistic distance between dialect regions: lexical features.

D Experiment Details
The vocabulary used to preprocess the sentences is limited to lower-case characters and the German umlauts ä, ö, and ü. All characters with other accents are transformed into their corresponding character without accents, and hyphens are replaced with a space.
XLS-R. We use the fairseq implementation and replicate the training procedure and model settings from Plüss et al. (2023a). The runs on the full dataset and the Leave-One-Out (LOO) experiment are trained for 80K steps. The Single Dialect Fine-Tuning (SD) models are only trained for another 20K steps without any freezing during the warmup phase. All the final models correspond to the checkpoint with the best Word Error Rate on the validation dataset during training. There is no task-specific fine-tuning of the hyperparameters. The training for the 300M version of the model is conducted on 2 NVIDIA A100 40 GB GPUs. The 1B All-dialects model is trained on 4 NVIDIA A100 40 GB GPUs.
Trafo. We replicate the model architecture and training procedure from Plüss et al. (2022). This model is based on the FAIRSEQ S2T library (Ott et al., 2019; Wang et al., 2020). The runs on the full dataset and the Leave-One-Out (LOO) experiment are trained for 80K steps. The Single Dialect Fine-Tuning (SD) models are only trained for another 10K steps. There is no task-specific fine-tuning of the hyperparameters. The models are trained on a single NVIDIA A100 40 GB GPU.
Whisper. We use the Huggingface (Wolf et al., 2019) implementation of the Whisper small model including the provided fine-tuning procedure. The runs on the full dataset and the Leave-One-Out (LOO) experiments are trained for 80K steps and a warmup phase of 10K steps. The Single Dialect Fine-Tuning (SD) models are only trained for another 10K steps with a warmup phase of 1K steps. The learning rate is set to 1e-5. There is no task-specific fine-tuning of the hyperparameters. The models are trained on 4 NVIDIA A100 40 GB GPUs.

E Experiments - Alternative Metrics
In the main text, we use the BLEU score for our analysis. Here, we present the same type of results using different metrics: WER, TER and chrF. Figure 9 shows the LOO experiment results using WER, Figure 10 shows the LOO experiment results using TER, and Figure 11 shows the LOO experiment using chrF. In Table 6, the All-dialects scores for the different metrics are presented. Figure 12 shows the SD experiment results using WER, Figure 13 shows the SD experiment results using TER, and Figure 14 shows the SD experiment using chrF.
All the metrics yield the same underlying results and conclusions as using the BLEU score.

Figure 2: Matrix of linguistic distances between dialect regions.

Figure 3: Results of LOO Experiment: heatmap with retainment ratio. Each row shows the scores achieved for each dialect when leaving out the dialect of the row. The last column shows the average score of the ratios (the average excludes the diagonal values).

Figure 4: Results of SD Experiment, when the system is trained on the dialect of the row. The last column shows the average score of the ratios (the average excludes the diagonal values).

Figure 5: Comparison of distributions of sentence level BLEU scores for sentences with (Past Tense) and without past tense (No Past Tense). The Past Tense + Reference shows the BLEU distribution for samples with the rephrased reference. The differences between "No Past Tense" and "Past Tense" are significant according to Welch's t-test (p=5.95e-104).
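The significance statements in the captions of Figures 5 and 6 rely on Welch's t-test, which compares two sample means without assuming equal variances. The statistic and its degrees of freedom (Welch-Satterthwaite approximation) can be sketched with the standard library alone. The function name and the toy score lists are assumptions made for this sketch; the p-value itself requires the t-distribution, which is available, for instance, through `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (unbiased variances).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical sentence-level scores for two groups of samples.
group_special = [1.0, 2.0, 3.0, 4.0]
group_no_special = [2.0, 4.0, 6.0, 8.0]

t_stat, dof = welch_t(group_special, group_no_special)
print(t_stat, dof)
```

Unlike Student's t-test, the degrees of freedom are generally not an integer and shrink when the two groups have very different variances or sizes, as is the case when comparing a small subset (e.g., samples with special vocabulary) to the rest of the test set.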

Figure 9: Results of LOO Experiment: heatmap with retainment ratio. Each row shows the scores achieved for each dialect when leaving out the dialect of the row. The last column shows the average score of the ratios (the average excludes the diagonal values).
Table 2 presents the BLEU scores of the LOO experiment. For each dialect column, the LOO row shows the performance of the LOO model on this dialect's test set.
                                .8  68.4  72.7  72.9  69.5  70.6  72.5
       Leave One Out     70.4  67.7  64.9  71.0  72.3  67.4  61.2  72.2
       Single Dialect    57.6  68.7  65.5  69.3  71.2  69.0  71.8  69.5
Trafo  All dialects      62.0  59.1  57.5  64.8  65.9  62.1  61.1  63.8
Trafo  Leave One Out     57.8  50.5  49.9  59.3  62.2  56.2  47.7  59.0
Table 2 gives an overview of the results. We can see that the final BLEU scores for Trafo were 0, as 50 hours of data are too little to train such a model from scratch. Thus, we only show and discuss the XLS-R and Whisper retainment ratios in Figure 4.
Training on a single dialect is sufficient for that dialect. The results show that with XLS-R, single dialect training leads to at least 95% of the All-dialects performance.
VS Status confirmed. The SD experiments again confirm VS's status as the most distinct dialect, as training on VS alone leads to an improvement of 2% over the All-dialects setting. We also note that VS and ES have the highest distance in both the XLS-R and the Whisper experiments, which corresponds to the linguistic distance in Figure 2.
ZH is a central dialect. The SD experiments also confirm that the similarity of ZH to ES and CS is reflected in the empirical results. When training on ZH only, the CS and ES scores retain at least 90% (for Whisper the retainment is also at least 76%).

Table 3: Examples of Target, Rewritten Target, and generated Hypothesis for the Past Tense experiments.

Table 4: Examples of Targets and generated Hypotheses for the vocabulary experiments.