Quantifying the Dialect Gap and its Correlates Across Languages



Introduction
Across the globe, humans speak over seven thousand unique languages (Eberhard et al., 2022). Many of these languages contain a plethora of internal variation due to the environmental, cultural, and socioeconomic diversity inherent to large populations (Honkola et al., 2018). These dialects are categorised into two groups: standard and non-standard (Trudgill, 2004). Standard dialects are, by definition, supported by governmental and educational institutions, resulting in more opportunities of all kinds for their speakers. On the other hand, speakers of non-standard and minority dialects find themselves at a disadvantage compared to their counterparts (Trudgill, 1979). These effects are compounding; those who speak minority dialects are provided fewer opportunities to advance socially and economically, resulting in a self-fulfilling cycle of oppression. Many people who speak a minority dialect as a first language find themselves modifying their use of language throughout their lives to appear to belong to the group of standard dialect speakers, much to the detriment of the maintenance of dialectal diversity (Carlson and McHenry, 2006). In losing these dialects, we lose not only the form of expression itself but aspects of the unique culture and society it belongs to (Fishman, 2007).
NLP has been moving in recent years to provide more methods of communication, both between people and with digital systems. In doing so, it has been bridging information- and access-based gaps for people in many historically marginalized communities (Bouillon et al., 2021; Mariani et al., 2022; Zhang et al., 2022). However, it is important to acknowledge that variation within languages is rarely addressed in mainstream tools. Modern systems that do provide access to variants still focus on wealthy, standard dialects, such as British, Australian, and American English, while disregarding commonly spoken minority dialects like African American English. Speakers of under-resourced dialects, variants of both high- and low-resource languages with little available training data, face language barriers when using many of the tools taken for granted by speakers of well-resourced dialects. This reduced accessibility further entrenches existing disparities by continuing the historical trend of disenfranchising speakers of minority dialects (Trudgill, 1979).
In this paper, we examine the performance of large language models (LLMs) on two crucial multilingual tasks, machine translation and automatic speech recognition, across a diverse set of dialects and analyze the linguistic, socioeconomic, and computational factors that may contribute to the dialect gap. This study determines that the largest indicator of better performance for under-resourced dialects is linguistic proximity to well-resourced dialects, regardless of the size or wealth of the dialects' speaker base. We attribute this connection to the lack of dialectal data included in training large language models, which leads dialects to perform better or worse on the basis of incidental similarity to the dialect used in training. Unfortunately, the size of the performance gap and the amount and makeup of data required to overcome it are not predictable from external information about the language, since they vary across task, model, and environment. As a result, researchers will need to analyze individual systems to examine how the dialect gap can be closed for their work through a unique combination of higher-quality, larger, and more balanced datasets.

Dialect Diversity in Research
Studies in Linguistic Diversity A significant problem in the study of linguistic diversity across NLP is the lack of attention paid to language variation. In the past few years, increased awareness has been drawn within the NLP community to the disparities present in modern research. In particular, researchers have begun to notice the relative lack of papers that address languages spoken outside of Europe and East Asia, even in subfields like multilingual NLP (Blasi et al., 2022; Joshi et al., 2020; Ruder et al., 2022; Søgaard, 2022).
While these works offer insight into the disadvantages faced by speakers of under-resourced languages, they are still discussed under the assumption that if languages were appropriately attended to, all their speakers would gain equal access to NLP tools. Similarly, they present their comparisons as if all speakers of well-resourced languages, especially English, have superior access to tools. Unfortunately, this is not necessarily the case. Two-thirds of English's one and a half billion speakers are second-language (L2) speakers (Eberhard et al., 2022). Many L2 speakers struggle with NLP systems due to their accent or their use of code-switched and mixed language. Even many first-language (L1) speakers, such as speakers of African American or Scottish English, do not see their native dialect supported by speech, dialogue, or translation systems and are forced to mask their natural speech patterns, which is harmful to their mental health and sense of identity (Johnson et al., 2022; Santiago et al., 2021). As such, existing evaluations of linguistic diversity in NLP are fundamentally incomplete.

Dialectal Models
The advent of large language models has made it possible to train models that perform well on even low-resource languages (Aharoni et al., 2019; Conneau et al., 2020). The term LLM is not strictly defined, but in this study, we use it to refer to multilingual Transformer-based systems pretrained on large amounts of scraped internet data and finetuned for specific tasks. In these systems, under-resourced languages have their training supplemented by this unannotated, scraped data and by cross-lingual transfer (Dabre et al., 2020). The performance gain seen by low-resource languages when using LLMs does not extend to under-resourced variants of languages. Some LLMs provide allocational support for dialects by treating them as separate languages, but their performance is not necessarily comparable to that of the standard form. As an example, Arabic speakers often write in their native dialects when communicating casually online, a phenomenon noted by both the linguistic and NLP research communities (Alshutayri, 2017; Abdul-Mageed et al., 2018). Still, attempts by social media platforms to translate Arabic posts are far less successful than their attempts on French and English, despite many consumer translation systems offering support for major regional dialects of Arabic (Harrat et al., 2019). For dialects outside of those explicitly included in systems, this problem is only exacerbated by a lack of allocational support.
The Data Problem The same marginalised languages that face lower performance at the hands of LLMs also face a larger data problem across dialects. Most of the task-annotated data available online for low-resource languages comes from religious texts, government documents, or multinational newspapers (Agić and Vulić, 2019; Skadiņš et al., 2014; Chen et al., 2020). These sources often use a formal register and avoid dialectal markers, especially when their target population is mostly diglossic and has already had to learn a more standard dialect for survival in the modern world (Alshutayri, 2017; Abdul-Mageed et al., 2018). As a result, the LLMs trained on this data are not built to function on minority dialects and have unclear performance capabilities. Before this problem can be solved, questions must be answered about the amount, quality, and type of data needed to overcome the data problem. The survey done in this paper across languages provides insight into how well dialects perform 'as is' and identifies that linguistic and socioeconomic knowledge should be leveraged to inform future decisions on data collection and usage.

Tasks
The two tasks evaluated in this paper are machine translation (MT) and automatic speech recognition (ASR). These tasks are among the few with sufficient data for the evaluation of dialects, and both focus on increasing access to people, tools, and information by removing linguistic barriers (Jin et al., 2021). They are also safe tasks to use as a starting point because they do not deal with personal information or abusive language. The list of models, languages, and metrics used in the evaluation of each task can be found in Table 1. More information about the datasets and languages used can be found in Appendix A. In total, six model versions are evaluated for each task, with 30 dialects across seven languages compared for MT and 33 dialects across seven languages compared for ASR. Other than Tamil and Telugu for ASR, each language is taken from a different language family in order to extract information that is independent of specific linguistic features.

Machine Translation
Machine translation is already used in domains such as medicine, law, and information as a method of increasing access to systems (Büttner et al., 2022; Vieira et al., 2021; Johnson et al., 2019). NMT is a widespread consumer tool, to the point that Google has had to parse out bitext generated using it when scraping internet data for training (Ni et al., 2022).
We also evaluate the University of Helsinki's OpusMT, a model based on MarianMT and trained on Wikimedia monolingual and multilingual text (Tiedemann and Thottingal, 2020). This model is an interesting comparison to NLLB and NMT because it is not an LLM and represents a different approach: covering more languages at the cost of performance across the board. This model was constructed in an academic setting with a more transparent set of training data and significantly fewer parameters. All evaluations are conducted with English as either the target or source language due to data constraints.
Evaluation metrics are a biased measure of output quality and fluency but are required to empirically showcase the dialect gap. To reduce some of the negatives associated with each metric, we report two types of metrics that measure different aspects of the output. The first metric is BLEU, a classic n-gram evaluation technique for translation (Papineni et al., 2002). Secondly, a representation-backed metric is used to determine the semantic similarity between two sentences, since MT is a task with multiple possible solutions. Most semantic similarity metrics are based on transformer embedding models, so we use a multilingual variant of SentenceBERT (Reimers and Gurevych, 2019). Full results for both metrics are reported in Appendix C.
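As an illustration of how the n-gram metric behaves, a minimal sentence-level BLEU can be sketched in pure Python. This is a simplified version for intuition only, not the implementation used in this paper; published scores should come from a standard tool such as sacreBLEU, which adds proper tokenisation and smoothing.

```python
import math
from collections import Counter

def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of n-gram
    precisions times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped overlap: each hypothesis n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        # Floor at a tiny value so a zero match does not make log() blow up.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty discourages artificially short hypotheses.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0, and scores fall toward zero as n-gram overlap with the reference shrinks.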

Automatic Speech Recognition
Automatic speech recognition (ASR) is a task that is important in bringing access to those who are unable or disinclined to communicate through text (Ahn and Lee, 2016; Doumbouya et al., 2021). Of late, representation learning and LLMs for end-to-end ASR have become more common. Many models are trained on unsupervised audio data and then finetuned for specific tasks. This is the case for Meta's XLS-R, a model trained on thousands of hours of speech data across languages (Conneau et al., 2020). We evaluate both a multilingual variant and a monolingual variant of the 300M-parameter base model finetuned on a single language at a time using the Common Voice dataset (Ardila et al., 2020).
Another model examined is OpenAI's Whisper, which is trained on a combination of existing ASR datasets and automatically generated transcripts scraped from the internet (Radford et al., 2022). The version of the model tested here is the medium variant. Like XLS-R, this model was finetuned by language on the Common Voice dataset for an additional evaluation (Ardila et al., 2020).
Lastly, Google has released two ASR models: the monolingual Speech-To-Text (STT) and their newer multilingual Universal Speech Model (USM) (Chiu et al., 2018; Zhang et al., 2023). These models were both evaluated through the Google Cloud API because neither has been released for open-source use. STT in particular functions as a good comparison to the LLMs evaluated here because it is an older, monolingual model. Overall, six models are compared: three "monolingual" models (including those finetuned for a specific language) and three multilingual models.
While there has been discussion on whether word error rate (WER) and character error rate (CER) adequately predict performance, no better system has been adopted by the community at large (Favre et al., 2013). Other options exist, but these are primarily for downstream end-to-end tasks, such as speech translation, natural language understanding, and information retrieval (Kim et al., 2021; Roy, 2021). For this work, we follow the community standard and use WER, with CER scores reported in Appendix C.
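For concreteness, WER is the word-level Levenshtein edit distance between hypothesis and reference, divided by the reference length; CER is the same computation over characters. A minimal sketch (libraries such as jiwer implement this in practice):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length.
    Running the same algorithm over list(reference) and list(hypothesis)
    instead of the split words yields CER."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance where substitutions,
    # insertions, and deletions all cost 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason relative (rather than absolute) comparisons across dialects are used later in this paper.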

Linguistic Analysis of Dialects
There are many ways to identify and quantify the similarity between two variants of a language. Many have been explored in NLP for cross-lingual transfer using features from syntax, lexicon, and morphology (Philippy et al., 2023; Eronen et al., 2023; Lin et al., 2019; Ponti et al., 2019). There have also been studies on dialects in computational linguistics, examining whether dialects are consistent across corpora and registers (Dunn, 2021). A similar method is used in this paper to examine lexical similarity, using Spearman's Rank Correlation Coefficient, which has previously been used to calculate corpus similarity and homogeneity (Kilgarriff and Rose, 1998). In Appendix Figure 3a, the similarity between each dialect and the best-performing variant of that language is shown, as well as the lexical similarities between scripted and conversational samples from each dialect of the Babel dataset.
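A minimal sketch of this kind of rank-based corpus comparison is shown below: rank the words shared by two corpora by frequency in each, then correlate the two rankings. This is an illustration of the general Kilgarriff-style technique, not the paper's exact feature set; ties in frequency are broken arbitrarily here for simplicity.

```python
from collections import Counter

def spearman_lexical_similarity(corpus_a: str, corpus_b: str, top_n: int = 50) -> float:
    """Spearman's rank correlation over the frequency ranks of words
    shared by two corpora. 1.0 means identical lexical profiles."""
    freq_a = Counter(corpus_a.lower().split())
    freq_b = Counter(corpus_b.lower().split())
    # Consider the most frequent words overall that appear in both corpora.
    shared = [w for w, _ in (freq_a + freq_b).most_common(top_n)
              if w in freq_a and w in freq_b]
    n = len(shared)
    if n < 2:
        return 0.0
    def ranks(freq: Counter) -> dict:
        order = sorted(shared, key=lambda w: -freq[w])
        return {w: i for i, w in enumerate(order)}
    ra, rb = ranks(freq_a), ranks(freq_b)
    # Spearman's rho via the squared rank-difference formula
    # (valid when there are no ties).
    d2 = sum((ra[w] - rb[w]) ** 2 for w in shared)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Comparing a corpus with itself yields 1.0, and corpora with no shared vocabulary fall through to the degenerate case.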
Additionally, we examine the phonetic similarity of selected ASR datasets, specifically for Arabic and Spanish. Here, random samples were manually annotated for vowel positioning through formant analysis and plotted in Appendix Figure 3b. Then, the average Euclidean distance across vowels between each dialect and the standard form was taken to serve as a measure of phonetic similarity. More details on the exact methodology can be found in Appendix B.
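The distance computation itself is straightforward once vowels are placed in F1/F2 formant space. The formant values below are invented placeholders for illustration, not measurements from the paper's annotation:

```python
import math

# Hypothetical mean F1/F2 formant values (Hz) per vowel for a standard
# and a regional dialect; real values come from manual formant analysis.
standard = {"a": (700, 1300), "i": (300, 2300), "u": (320, 800)}
dialect  = {"a": (650, 1450), "i": (340, 2200), "u": (360, 900)}

def phonetic_distance(vowels_a: dict, vowels_b: dict) -> float:
    """Mean Euclidean distance between matching vowels in F1/F2 space;
    larger values indicate lower phonetic similarity."""
    shared = vowels_a.keys() & vowels_b.keys()
    return sum(math.dist(vowels_a[v], vowels_b[v]) for v in shared) / len(shared)
```

A dialect compared against itself scores 0.0, and the score grows as its vowel space drifts from the standard form.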

Dialect-Wise Performance Gaps
Examining the performance across dialects in Figure 1, some trends appear immediately. As mentioned in Appendix A, the dialects evaluated were largely dictated by data availability. As a result, Arabic and Spanish are heavily represented, while other lower-resource (dialect-wise) languages see coverage of only two to three dialects. This may also be reflected in the training data for pretrained models, resulting in Arabic and Spanish both having relatively more even dialectal performance than the other surveyed languages.
[Figure 1 caption: Within each language, the same dataset is used. Because Bengali, Georgian, and Tamil are heavily diglossic, the standard written form (spt) and the best-performing spoken form (cnv) are compared rather than regional dialects.]

For MT, there are steeper performance gaps when translating into the dialect. This makes sense if input robustness is taken into account; in other words, models may be able to handle some level of dialect variation in their input but cannot know to output the non-dominant dialect. Additionally, models that perform better on the standard dialect show steeper drop-offs in performance, something very clearly exemplified across the Finnish dialects. This demonstrates the interesting point that higher-performing models, which have access to more parameters and data during training, have greater inequalities in their coverage. The same trends are not apparent for ASR, where the worst-performing model (OpenAI's Whisper) has the highest variance across dialects. Interestingly, all three multilingual models seem to prefer the spoken dialect (cnv), likely because they are mostly trained on unsupervised internet data from websites like YouTube. On the other hand, the finetuned models prefer the written dialect (spt), which is understandable since most are finetuned using Common Voice, a heavily scripted data source.

Correlations with Proximity to Power
Even within the same task and model, different dialects have different performance disparities, as seen in Figure 1. To examine this phenomenon in an equivalent environment, we compare performance across MT using BLEU %, the percentage of the best-performing dialect's BLEU score achieved by the minority dialect. Likewise, for ASR, the relative percentage loss of performance for each dialect compared to the standard dialect is used. Note that this means that in Figure 2, a positive MT correlation and a negative ASR correlation both indicate a positive correlation between the metric and performance.
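Concretely, the two comparison metrics reduce to simple ratios (a sketch with variable names of our choosing, not code from the paper):

```python
def bleu_percent(dialect_bleu: float, best_bleu: float) -> float:
    """MT comparison metric: share (%) of the best-performing dialect's
    BLEU score achieved by a minority dialect. Higher is better."""
    return 100 * dialect_bleu / best_bleu

def wer_relative_loss(dialect_wer: float, standard_wer: float) -> float:
    """ASR comparison metric: relative % increase in WER for a dialect
    versus the standard dialect. Lower is better, hence the sign flip
    in interpreting correlations for ASR."""
    return 100 * (dialect_wer - standard_wer) / standard_wer
```

For example, a dialect scoring 20 BLEU against a best dialect at 25 BLEU achieves 80 BLEU %, while a dialect at 0.30 WER against a standard at 0.20 WER shows a 50% relative loss.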
In choosing metrics for comparison, we aimed to cover the range of economic, social, and linguistic factors that capture the idea of proximity to power. As proxies for wealth, we examine gross domestic product (GDP) for cumulative economic power and GDP per capita for individual economic power (The World Bank, 2023). Socially, we are interested in both population size and how well-served the population is more broadly, measured through the Human Development Index (HDI). Lastly, for linguistic factors, we utilize the lexical similarity and phonetic similarity extracted from evaluation data and normalized to a scale from −1 (lowest similarity) to 1 (highest similarity). Unfortunately, some economic and social metrics are only reported at the national level, so there is no data for minority dialect groups within countries. As a result, certain dialects (e.g., Kven Finnish, Vernacular Malay) are not included. In the past, population factors have been shown to loosely correlate with factors such as performance and appearance in NLP research (Blasi et al., 2022). Here, in Figure 2, we see that these correlations do not necessarily hold for dialects. In fact, these results contradict common expectations and narratives, which assume that wealthier, larger, and more educated populations are better served across the board.
Gross Domestic Product GDP represents the overall wealth of a speaker population and their economic power in the world. As such, we would expect groups with high cumulative wealth to be well-served by technology. While GDP has a small impact, it varies heavily by model and cannot be used as a consistent predictor of performance. Certain models show a relatively consistent positive correlation, such as OpusMT and USM/STT, but others show no correlation at all. Others showcase a correlation in only one direction, such as NLLB, which is uncorrelated when translating into English but positively correlated when translating into the dialect. On average, worse-performing models and environments show a stronger correlation, with translation into the dialect being much more correlated than translation into English.
Gross Domestic Product Per Capita GDP per capita is an important metric as a proxy for estimating the wealth of individuals in a population, and we would expect those with access to wealth to be well-served even if their population is smaller. Surprisingly, it seems to have no impact at all on MT across models, so wealthier minority populations are not better served than poorer ones despite having access to increased resources. In ASR, the result is even more unexpected, with wealth correlating negatively with performance.
Population Size Population size intuitively would correlate with better performance, but previous studies on language diversity in NLP have shown that even languages with extremely large populations are not well-served if they are impacted by other factors like geographic distance from research institutions and low wealth (Blasi et al., 2022). Here, population size has little impact on MT performance, to the point that certain models show a negative correlation between the two. On the other hand, in ASR there is a strong positive correlation across all models except for the finetuned version of Whisper. This is an unexpected result because Whisper originally showcases a matching positive correlation and is finetuned on the same Common Voice datasets as XLS-R but demonstrates a complete trend reversal. This difference between MT and ASR may be a result of the type of data used for training each and the sources it came from, but further analysis is needed to confirm this.
Human Development Index HDI is a measure of how well a population is served in other access-based metrics, such as education, healthcare, and standard of living. It would logically follow that a high HDI would correlate with better performance, but this does not hold for MT. Instead, MT performance shows no correlation at all with HDI. Surprisingly, HDI correlates negatively with ASR performance, so better-educated and healthier minority dialect speakers have a harder time accessing ASR systems despite being otherwise well-served economically and socially.
Lexical Similarity Lexical similarity, on the other hand, is strongly correlated with performance for both MT and ASR. Since dialectal data is not used for training regardless of population features, performance is likely based mostly on linguistic proximity to the standard form. This result is also more robust than the other correlations mentioned here because every dialect of every language evaluated was included, since the similarity score was not dependent on external data availability. Again, we also see in MT that the worse-performing directionality (EN → dialect) has a stronger correlation. This is expected in context, since these models do not provide allocational support to these dialects: they translate into the standard dialect regardless of user intent, but they may be robust to some amount of the lexical variation in the input.

Phonetic Similarity
The importance of linguistic similarity extends to phonetic similarity for ASR, which is strongly positively correlated with performance. Again we see that finetuning on the smaller, scripted Common Voice datasets makes the correlation stronger for XLS-R and Whisper, which suggests that models overfit to the dialects present in training data. It is important to remember that phonetics is a broad area of study in linguistics that encompasses many measures of acoustic similarity, so other forms of analysis may capture even higher-impact forms of variation between dialects. However, these results already clearly show that phonetic similarity plays a large part in determining the performance of dialects.
The results surrounding similarity suggest that the most useful method of addressing the dialect gap may lie in reducing the linguistic distance between the language used at evaluation and that used at training. In other words, this can be framed as a domain shift problem rather than a multilingual problem. A way to begin is by increasing the dialectal diversity of the training data to cover a larger variety of language patterns.
The Impact of Datasets

Machine Translation & Dataset Size
For many languages, lower performance in MT is seen in parallel with a smaller dialect gap. As an example, the Mandarin dialects perform comparably on NLLB and OpusMT, but the disparity becomes statistically significant under NMT, a model where Mandarin as a whole performs better. This trend suggests that the benefits of larger models and more training data are not equally felt by all dialects due to disparities in the training pipeline: more training data does not solve the dialect gap; it makes it worse. The question can then be raised: would training on more specifically dialectal data be sufficient to overcome these disparities?
To answer this question, two languages with enough dialectal data were chosen to finetune NLLB and OpusMT. Each model was trained with thirty different dataset sizes, on three different data subsets per size and three seeds per data subset, to ensure that the results were statistically significant.
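The resulting experimental grid can be sketched as follows. The dataset sizes here are placeholders, since the paper does not specify the exact size schedule; only the grid shape (thirty sizes × three subsets × three seeds) comes from the text:

```python
import itertools

# Placeholder size schedule; the actual sizes used in the paper differ.
DATASET_SIZES = [100 * i for i in range(1, 31)]  # thirty dataset sizes
SUBSETS_PER_SIZE = 3
SEEDS = [0, 1, 2]

def build_runs(sizes, n_subsets, seeds):
    """Enumerate every (size, subset, seed) finetuning run; scores are
    averaged over subsets and seeds so each point on a training curve
    reflects nine runs per size... three subsets times three seeds."""
    return list(itertools.product(sizes, range(n_subsets), seeds))

runs = build_runs(DATASET_SIZES, SUBSETS_PER_SIZE, SEEDS)
assert len(runs) == 30 * 3 * 3  # 270 finetuning runs per model and direction
```

Sampling multiple subsets per size separates the effect of data quantity from the luck of which sentences happen to be drawn.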
In Figure 2a, the training curves for each model and translation direction can be seen.
As more data is added, the languages begin to perform better, but not all at the same rate. For example, Vernacular Malay sees relatively little improvement as data is added to train NLLB, but OpusMT's initial training curve is steep. The same amount of data therefore causes two different outcomes depending on the model architecture and the data it was previously trained on. In some cases, the improvements are marginal, while in others, even a small amount of data is enough to completely overcome the dialect gap. Likewise, the same inconsistencies can be seen between the two directionalities of the same model. Despite both versions of OpusMT being trained on the same Low German data, translation into English sees benefits while translation into Low German remains poor. This makes it clear that the amount of data that dissipates the dialect gap in one situation may not be enough for another model or language.

Speech Recognition & Dataset Makeup
Besides finetuning on more dialectal data, another possible method of addressing the dialect gap is modifying the makeup of training or finetuning data. Across the board, the data used to finetune LLMs for speech recognition is heavily influential on performance. This difference can be seen when comparing the performance of these ASR systems on conversational and scripted samples from the IARPA Babel dataset (Bengali, Georgian, Tagalog, Tamil, and Telugu). The models evaluated here are largely trained on unsupervised speech data from the internet, which mostly comes from unscripted conversational recordings. As a result, the multilingual models perform slightly better on conversational speech. To test the impact of data makeup, XLS-R and Whisper were finetuned for three languages (Bengali, Georgian, and Tamil) on Common Voice, an entirely scripted dataset (Ardila et al., 2020). These languages are all spoken by a diglossic population that uses both a regional dialect and a more linguistically conservative standard written form. As a result, the lexical distance between conversational and scripted samples is larger than might otherwise be expected. In Figure 2b, finetuning on scripted data almost exclusively benefits performance on scripted samples; in some cases, such as with Whisper, this even comes at the detriment of performance on conversational samples. This ties back into the impact of lexical variation discussed in Section 6, since both scripted and conversational samples were collected from speakers of the same dialect with similar accents. The low lexical similarity between these registers underscores that ensuring the training dataset accurately and fully represents the lexical variation across a language and its dialects is an important step in creating systems that perform well across dialects, domains, and registers.

Implications of the Dialect Gap
The existence of a dialect gap means that not all speakers are inherently well-served by a tool just because their language is supported. Past analyses examined inequities from the perspective of multilingualism and therefore likely overestimated the number of speakers benefiting from the current system. As the field moves forward, it is important to step back and remember that languages are not static or monolithic.
Additionally, as we saw, the dialect gap is not identical in severity or structure across every system. This implies that researchers cannot take a one-size-fits-all approach towards solving the dialect gap; the issue needs to be addressed in different ways depending on the task and the existing state of the gap. A large component of dialect gaps is based on datasets, both their size and their makeup. As the NLP community moves towards furthering research for medium- and low-resource languages, discussions must be had on both collecting sufficient amounts of dialectal data and capturing the natural variations of every language by ensuring that data is collected from diverse populations. Appreciating and accounting for variation not only makes our systems more robust but supports groups that face marginalization in other ways.

Conclusion
This work examined an important subspace in NLP by evaluating the (regional) dialect gap present in the tasks with the highest likelihood of impacting speakers directly. Still, there are countless LLMs that have been rapidly gaining popularity in the past few years with the release of open-ended dialogue and image models. Most tasks outside of MT and ASR do not have the data necessary to analyze the impact of language variation, but as more data is collected and annotated, this may change. As a direct continuation of the line of inquiry started in this work, multi-dialectal analyses of the dialect gap across a wider variety of tasks should be next.
For MT and ASR, the next steps are two-fold. Firstly, the datasets used for evaluation and finetuning in this work were primarily determined by availability, but using a broader and higher-quality set of samples may reveal other interesting trends. Additionally, to address the dialect gap identified here, there is a clear path forward that involves collecting more dialectal data and ensuring it is representative of the languages and dialects it aims to serve. This should be done in conjunction with speakers of the language, linguists, and members of the NLP community to maximise utility while minimising the burden or harm on the speaker population. Lastly, this analysis is hardly complete. As new LLMs come out, it is on the developers of these tools and the researchers behind them to continuously produce evaluations of language diversity to ensure that the benefits these LLMs bring do not come at the cost of access for minority dialect speakers.

Limitations
Dataset Size & Quality Factors Dataset size is a very significant factor when evaluating models and drawing language-wide conclusions. While the languages seen in this work had enough data for evaluation, very few provided enough data for finetuning LLMs, and none provided enough to train a model from scratch. As a result, models were largely evaluated out of the box, which serves to identify performance gaps as they may appear in non-academic use cases but does not fully address solutions to this problem.
Likewise, dataset quality makes a massive impact on the results of training and evaluation. Because the number of available datasets was already quite low, crowd-sourced datasets such as Tatoeba were used without additional filtering, which may result in increased noise due to improper annotations. For some datasets, such as the IARPA Babel speech dataset, filtering was done, but spontaneous speech data in general is often paired with background noise and distortion, causing a further drop in performance.
Some languages have several datasets available, but because these datasets were not all collected with the same methodology (and therefore with similar errors and distortions), they were not directly comparable, so only one dataset was used or the language was not evaluated. Spanish speech, for example, has been recorded in the OpenSLR, CALLHOME, and Fisher datasets, but CALLHOME alone was chosen to be used. On the other hand, a multitude of English accent and dialect datasets are available for speech, but because each was collected independently, they again could not be directly compared and were therefore omitted. Lastly, some languages supported by models (Telugu and Tagalog) were not present in the Common Voice finetuning dataset used for the ASR experiments and were therefore omitted from a large part of the discussion surrounding dataset makeup.
Computational Constraints Many of the models evaluated are large industry models with hundreds of millions, if not billions, of parameters. Naturally, as an academic institution, we were limited in the computational power available to train these models; certain models were so large that even with a batch size of one they were incapable of running on the machines we have available. If we had greater computational power, we would have run our evaluations on the largest version of each model to provide a picture of the most state-of-the-art performance for each task and finetuned these larger models for longer. On the other hand, many minority dialect speakers do not have the economic resources to train or finetune super-massive models, so the evaluation of more accessible models is an appropriate reflection of what is available to these speakers. In the future, with access to greater resources, the evaluation of more systems and larger models, along with the evaluation of other user-facing tasks (Ruder et al., 2023), again through the optics of regional dialects, would be a valuable extension of this work.

Ethics Statement
Dialectal NLP research is a burgeoning field without many precedents set for ethical research, but direction can be taken from the field of multilingual NLP on how to work ethically with the languages of minoritised groups. In this paper, the issue of ethics was largely sidestepped through the use of anonymised, public, and voluntarily collected datasets and the evaluation of tasks with a low likelihood of causing harm. Additionally, despite the importance of idiolects and of moving beyond regional dialects, we purposefully did not work with dialects connected to identity features that may put people at risk, such as sexuality, gender, and religion. Even as this paper supports the collection of larger and more representative datasets, these arguments do not apply in cases where it would be against the wishes or best interests of the groups involved.

A Languages
In total, seven languages are evaluated per task in this study, covering over thirty dialects. The details of these datasets are discussed in Section A, but for easy reference, a table listing the datasets, the languages included, and any dialects is provided in Table 3. Additionally, any abbreviations used in figures throughout the paper are included, with language abbreviations pulled from ISO 639-1 and dialect abbreviations either based on ISO 3166 country codes or created from the regional names provided in the dataset.
AraBench AraBench is a dataset collected by the Qatar Computing Research Institute to encourage research into machine translation for Arabic dialects (Sajjad et al., 2020). The dataset includes parallel text data grouped by region, nation, and city from religious, media, and speech sources. Across the board for Arabic, Modern Standard Arabic (MSA) outperforms the vast majority of dialectal forms, even though the rise of social media has led to dialectal forms of Arabic being used more often in written communication. In fact, in personal chat, dialectal forms of Arabic are represented more than MSA (Chelghoum, 2017).

Google Region-Aware MT The dataset used for Mandarin and Portuguese MT is Google's Region-Aware Machine Translation Dataset, a small benchmarking dataset for few-shot translation for these two high-resource languages (Riley et al., 2022).
The dataset consists of Wikipedia articles that exist in both English and the target dialects.
Tatoeba The Tatoeba datasets are a set of translation datasets crowdsourced by the Tatoeba organisation6 (Tiedemann, 2020). Languages have varying amounts of data, ranging from over a million sentences in English to fewer than ten for languages such as Sindhi. The majority of parallel text available in these datasets pairs low-resource languages with English, so most translation systems trained on this data use English as either the source or target language. While Tatoeba is a project focused on language diversity, there are efforts within it to include some major regional dialects of languages such as German.

Conversational Telephone Speech The dataset used to evaluate Arabic for ASR is the Conversational Telephone Speech (CTS) dataset, a spontaneous spoken language dataset with transcriptions available through the Linguistic Data Consortium (Appen Pty Ltd, 2006c,d, 2007a,b, 2006a,b). This set of datasets encompasses the Gulf (Emirati, Omani & Saudi), Mesopotamian (Iraqi), and Levantine (Jordanian, Lebanese, Palestinian & Syrian) dialects of Arabic.

6 https://tatoeba.org/

CALLHOME
The dataset used to evaluate Spanish ASR is the CALLHOME telephone speech corpus, which encompasses several primarily Latin American Spanish datasets (Canavan and Zipperlen, 1996; Wheatley, 1996). In this work, we specifically focus on eight dialects: Argentinian, Chilean, Colombian, European, Mexican, Peruvian, Puerto Rican, and Venezuelan Spanish. Spanish is an interesting case in that it has two "standard" forms, Mexican and European Spanish.

IARPA Babel
The IARPA Babel dataset7 was a large set of speech recognition datasets collected for medium- and low-resource languages to make ASR effective on a broader set of languages. While many languages are included, most are not currently supported by large ASR systems, so five were selected for these experiments: Bengali (Kamrupa, Radha, & Varendra), Georgian (Eastern & Western), Tagalog (Central, Northern, & Southern), Tamil (Central, Madurai, Northern, Southern, & Western), and Telugu (Central, Eastern, Northern, & Southern). Each of these languages is written in a different script, increasing the difficulty. It is important to note that the Babel project's goal was not to represent all dialects of these languages, so certain significant dialects (e.g. Sri Lankan and Singaporean Tamil for the Tamil dataset) are omitted due to the focus on a single major region. The Babel dataset divides each language into two to five regional dialects, with half the samples being scripted speech and the other half spontaneous utterances. While we are primarily interested in spontaneous speech for this work, all of these languages have a distinct written form which is considered a separate dialect. As a result, the scripted audio samples are kept but separated from the spontaneous ones, to examine how much lexical variation (between the scripted and spontaneous samples of each dialect) impacts performance in comparison to phonetic variation (across dialects).

B Linguistic Analysis
Lexical Similarity The lexical similarity between dialects is calculated by randomly splitting the full dataset for each dialect into halves one hundred times. The hundred most frequently occurring words are then ranked by frequency to generate a vocabulary list. For each dialect pair of interest, the position of each word in the ranking is compared using Spearman's rank correlation coefficient:

$$\rho = 1 - \frac{6 \sum_{i} d_i^2}{n(n^2 - 1)}$$

Here, $d_i$ is the difference in ranking of the $i$-th word in the list and $n$ is the number of words being compared (100 in this case). If a word appears in one list but not the other, it is given the ranking $n + 1$. The same process is used to calculate homogeneity, except that both halves are taken from the same corpus. Homogeneity is calculated to ensure that a large degree of internal variation is not influencing the degree of lexical cross-similarity between dialects. While this analysis can be impacted by domain mismatches, focusing on only the top one hundred words means we deal primarily with very common words that are less likely to be specialized vocabulary.
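The core of this comparison can be sketched as follows. This is a minimal illustration, not the paper's exact code: the hundredfold random-half sampling is omitted, and iterating over the union of the two top-$k$ lists is an assumption where the description is ambiguous (with the $n+1$ penalty for missing words, the value can fall below $-1$ when the lists barely overlap).

```python
from collections import Counter

def top_k_ranks(tokens, k=100):
    """Rank the k most frequent words: word -> rank (1 = most frequent)."""
    return {w: r for r, (w, _) in enumerate(Counter(tokens).most_common(k), 1)}

def rank_similarity(tokens_a, tokens_b, k=100):
    """Spearman-style similarity between two dialects' top-k word rankings.

    A word present in one ranking but absent from the other is assigned
    rank k + 1, as described above. Iterating over the union of the two
    lists is an assumption; d_i is the rank difference for each word.
    """
    ra, rb = top_k_ranks(tokens_a, k), top_k_ranks(tokens_b, k)
    d_sq = sum((ra.get(w, k + 1) - rb.get(w, k + 1)) ** 2
               for w in set(ra) | set(rb))
    return 1 - (6 * d_sq) / (k * (k ** 2 - 1))
```

Comparing a corpus against itself yields a similarity of 1; disjoint vocabularies are heavily penalised by the $n+1$ convention.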

Phonetic Similarity
The phonetic similarity between dialects is computed via a time-intensive method for this paper, so the results should be considered preliminary until further annotations can be collected. For each language of interest, the primary vowels of that language are determined based on its orthographic choices. This amounted to three vowels in Arabic (/a/, /u/, & /i/) and five vowels in Spanish (/a/, /e/, /i/, /o/, & /u/). For simplicity, other highly sonorant phonemes, such as /j/, are not included, although some may consider them vowel-like. From this, the set of samples from each dialect in which all of the language's vowels are present is extracted, in order to reduce discrepancies due to variations in distortion or recording environment as much as possible. Ten samples per dialect, from unique speakers when possible, are then annotated with the first ($f_1$) and second ($f_2$) formants, which provide information on the height and backness of each vowel. While roundedness can be gleaned from further formants, this adds another level of complexity, so it was not considered in this study. The height of a vowel is then set as $f_1$ and its backness as $f_2 - f_1$. The average distance between a dialect's mean vowel positions and those of the best-performing dialect is taken to represent phonetic similarity.
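The mapping from formants to vowel-space distance can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the paper's code: it takes per-vowel mean formant values as input and assumes Euclidean distance in the (height, backness) plane.

```python
import math

def vowel_space(mean_formants):
    """Map mean (f1, f2) measurements in Hz per vowel to the paper's
    (height, backness) space: height = f1, backness = f2 - f1."""
    return {v: (f1, f2 - f1) for v, (f1, f2) in mean_formants.items()}

def phonetic_distance(dialect, reference):
    """Average Euclidean distance between a dialect's mean vowel positions
    and those of the best-performing (reference) dialect."""
    a, b = vowel_space(dialect), vowel_space(reference)
    shared = sorted(set(a) & set(b))
    return sum(math.dist(a[v], b[v]) for v in shared) / len(shared)
```

A dialect compared against itself has distance zero; a uniform 100 Hz shift in $f_2$ across all vowels yields an average distance of 100.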

C Full Evaluation Results
Machine Translation We include the full results achieved across dialects for transparency. The BLEU score results can be seen in Table 4 and the semantic similarity results in Table 5. The label "di → EN" refers to translation from the dialect into English, and "EN → di" refers to translation from English into the dialect. The closest language tag was used when possible, such as the regional dialect tags for Arabic in NLLB.
Automatic Speech Recognition Likewise, full results are included for both the raw and finetuned versions of our ASR models. Word error rates (WER) are provided in Table 6 and character error rates (CER) in Table 7. A "+ CV" in a model title refers to further finetuning on the Common Voice dataset. Additionally, the "spt" and "cnv" labels refer to the scripted and conversational splits of the languages in the IARPA Babel dataset. Spanish and Arabic are only evaluated with conversational samples, so there is no data in the scripted column.
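For reference, WER and CER are both Levenshtein edit distances normalised by the reference length, computed over words and characters respectively. A minimal self-contained implementation (practical evaluations typically use an established toolkit rather than hand-rolled code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences; substitutions,
    insertions, and deletions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref, hyp):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

For example, a one-word substitution in a three-word reference gives a WER of 1/3.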

D Correlations
The plots comparing various metrics with dialect performance can be found in Figure 4. The vertical axis for the MT plots is BLEU %, the percentage of the standard dialect's BLEU score achieved by the minority dialect. Likewise, the vertical axis for the ASR plots is WER %, the percentage by which the minority dialect's WER is worse, since a lower WER is better. This means that a higher score is better in the two left columns, while a lower score is better in the two right columns. As can be seen heuristically, the only obvious correlates are the two linguistic measures, which is confirmed in our calculations.

Figure 4: Correlation between various metrics and the relative performance of minority dialects, normalised relative to the best-performing dialect. Note: the MT graphs are with respect to BLEU scores, where a higher value means better performance, but the ASR graphs are with respect to WER, where the inverse is true. Linguistic similarity is positively correlated with better performance in both systems.
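The correlation calculation referred to above is a standard Pearson coefficient between each candidate metric and the relative performance measure. A minimal sketch follows; the data values are invented purely for demonstration, and the significance testing behind the p < 0.05 markers in Table 2 is omitted.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical illustration (values invented for demonstration only):
# lexical similarity vs. the percentage of the standard dialect's BLEU
# score achieved by each minority dialect.
lexical_similarity = [0.95, 0.80, 0.75, 0.60, 0.50]
bleu_percentage = [98.0, 85.0, 80.0, 62.0, 55.0]
r = pearson_r(lexical_similarity, bleu_percentage)
```

A strongly positive `r` here would indicate that lexically closer dialects retain more of the standard dialect's translation quality, matching the trend reported above.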

E Concerns on Ethics in Dialect Research
Dialectal research is a relatively new field in NLP, and as such it is important to consider the ethical impacts of conducting it. In the past, researchers have examined the ethics of many aspects of NLP as they relate to marginalized and vulnerable populations, including raising concerns about sources of data (Olson et al., 2023; Rogers et al., 2021; Shmueli et al., 2021), evaluating representational bias (Hutchinson et al., 2020; Lalor et al., 2022; Sun et al., 2019), and bringing awareness to allocational bias (Aji et al., 2022; Blasi et al., 2022).
Additionally, there has been significant work done in the medical domain, where the population being served is particularly vulnerable to both coercion and harm from publicized data (Don't Walk et al., 2022; Thompson et al., 2021; Suominen et al., 2007). We can take inspiration from these prior works when examining the ethical pitfalls of dialectal research, since minority dialect communities face many of the same hurdles as other minoritized groups, especially when their language reflects aspects of their identity that face persecution, such as gender, race, or class.

Currently, the issues that impact minority dialect speakers largely concern representational and allocational bias. Speakers of certain dialects are viewed as less intelligent or more difficult to understand, leading systems to classify them as such in applications such as call screening or academic evaluation (Koenecke et al., 2020; Wassink et al., 2022). Important tools, such as automated emergency service or medical phone systems, may not function well on their dialects, reducing access to necessary care. These biases are also perpetuated externally towards users: female voices are often used when building personal assistants, which continues the societal trend of female systems being "bossed around" or acting in a subservient manner. This is especially concerning given that no strong increase in trust or comfort has been found when personal assistants are of a specific gender, yet the vast majority of consumer systems use female voices anyway (Tolmeijer et al., 2021). As dialect research improves, there is unfortunately room for these problems to worsen. Currently, gender is easy to specify when generating synthetic speech, but if, in the future, ethnicity or socioeconomic class were equally easy to designate, what patterns would emerge in consumer systems? These tools may end up maintaining stereotypes regarding subservience and the "place" of certain groups in society.
Another aspect to consider is the consequence of datasets and tools that parse certain identity markers falling into the wrong hands. A key example from another domain, facial recognition, is the Stanford "Gaydar," which was purported to outperform humans in predicting sexual orientation from facial structure (Wang and Kosinski, 2018). This system was denounced by multiple prominent LGBTQ organizations for both its methodology and the risk it poses to closeted members of the community, since it could be used to out members living in unfriendly places (Levin, 2017). Likewise, in China, similar facial recognition systems have already been used to track the movement of Uighurs, a Muslim minority group facing persecution by the government (Mozur, 2019). As such, the existence of tools to identify a minority is unfortunately inextricably linked with the ability to repress that minority. Currently, there are no highly accurate systems that can classify most personal identity features from text alone, but as we work on the identification and usage of dialect data, this may soon change. When conducting this research, it is important to have a deep understanding of the communities impacted by these tools beyond their academic contribution, and to weigh the consequences of releasing such tools for public use against the value of open and collaborative science.

Figure 1 :
Figure 1: The performance of various dialects across Machine Translation and Automatic Speech Recognition. Within each language, the same dataset is used. Because Bengali, Georgian, and Tamil are heavily diglossic, the standard written form (spt) and the best-performing spoken form (cnv) are compared rather than regional dialects.
BLEU scores achieved over different finetuning dataset sizes for MT on Low German and Vernacular Malay. Change in WER after finetuning on scripted Bengali, Georgian, and Tamil data.

Figure 2 :
Figure 2: Impact of training dataset modifications on performance.

Figure 3 :
Figure 3: Linguistic similarity measures comparing dialects. Left: lexical similarity, taken either between each dialect and the best-performing dialect from that family or between scripted and conversational samples of the same dialect. Right: vowel positions for the most common vowels in two languages across dialects.

Table 2 :
Pearson correlation coefficients for each language metric. For MT, correlation is calculated against the percentage drop in BLEU performance, while for ASR, correlation is calculated against the percentage increase in WER. As such, the sign of the correlations is reversed for WER. Correlations with p < 0.05 are marked.

Table 3 :
The datasets, languages, dialects and abbreviations used throughout this paper.