TADA: Task-Agnostic Dialect Adapters for English

Large Language Models, the dominant starting point for Natural Language Processing (NLP) applications, fail at a higher rate for speakers of English dialects other than Standard American English (SAE). Prior work addresses this using task-specific data or synthetic data augmentation, both of which require intervention for each dialect and task pair. This poses a scalability issue that prevents the broad adoption of robust dialectal English NLP. We introduce a simple yet effective method for task-agnostic dialect adaptation by aligning non-SAE dialects using adapters and composing them with task-specific adapters from SAE. Task-Agnostic Dialect Adapters (TADA) improve dialectal robustness on 4 dialectal variants of the GLUE benchmark without task-specific supervision.

As LLMs become a general-purpose technology, they are applied in an increasing number of scenarios by users who are not formally trained in Machine Learning (Bommasani et al., 2021). Nonexperts rarely look beyond accuracy (Yang et al., 2018), making them less likely to value robustness above the cost of training (Ethayarajh and Jurafsky, 2020). Unmitigated dialect bias in this long tail of tasks has the potential to exacerbate harms due to unfair allocation of resources (Bender et al., 2021).
1 We release code for training both traditional and task-agnostic adapters for English dialects on GitHub, and finetuned models, adapters, and TADA modules on HuggingFace.

[Figure 1 image omitted; surviving label: Adv(Dial) - Adv(SAE)]
Figure 1: TADA trains adapters with both sequence and token level alignment loss between SAE and a target dialect. When stacked before task-specific SAE adapters, TADA provides dialect robustness for the target task.
Dialectal discrepancies originate in biases in the filtering of LLM pretraining data before finetuning (Gururangan et al., 2022). Despite dialects being definitionally similar, training which enables task-agnostic zero-shot transfer is underexplored relative to its potential utility (Bird, 2022). Such task-agnostic transfer methods are natural, practical, and offer a scalable solution for English dialects across the growing spectrum of NLP applications.
This work contributes the first pursuit of these goals with Task-Agnostic Dialect Adapters (TADA). Adapters, bottlenecks placed between transformer layers, provide a parameter-efficient (Houlsby et al., 2019) and composable (Pfeiffer et al., 2020) foundation for task-agnostic dialect adaptation, given the low-resourced nature of most dialects. As shown in Figure 1, TADA modules are trained to align non-SAE dialect inputs with SAE inputs at multiple levels with both a sequence-level contrastive loss and a novel morphosyntactic loss.
We show the empirical effectiveness of TADA on 4 dialect variants of GLUE (Wang et al., 2018) with perturbations from Ziems et al. (2023). We release TADA as a plug-and-play tool for mitigating dialect discrepancies, launching a scalable pathway to dialect-inclusive English NLP.

Related Work
NLP For English Dialects Existing work on NLP for English dialects has largely focused on data collection and weak supervision. Jørgensen et al. (2016) use online lexicons to provide weak supervision for AAE. Blevins et al. (2016) manually annotate a small dataset and use domain adaptation methods to enable transfer. Jurgens et al. (2017) collect a geographically diverse set of English data and use distant supervision signals to annotate a large and representative language ID corpus. Multi-VALUE (Ziems et al., 2022, 2023) develops a data augmentation framework for task-specific training in many common English dialects. Our work proposes a complementary task-agnostic intervention for English NLP.
Cross-Lingual Alignment Cross-lingual alignment has become a common approach for task-agnostic zero-shot transfer across languages. Explicit lexical alignment can be used to learn cross-lingual word embeddings for downstream tasks (Duong et al., 2016; Adams et al., 2017; Artetxe et al., 2018; Grave et al., 2019). More recent work shows that end-to-end models can implicitly learn to align representations (Zoph et al., 2016; Conneau and Lample, 2019; Conneau et al., 2020; Xue et al., 2021). These alignment methods often perform better on highly similar languages, making them theoretically well-suited for dialects. By using explicit alignment with composable modules, our work is the first to explore such techniques for English dialectal NLP.
Adapters A growing body of research has been devoted to finding scalable methods for adapting increasingly large-scale pre-trained models. Houlsby et al. (2019) adapt large models using bottleneck layers (with skip-connection) between each layer. This idea has been extended in many domains (Stickland and Murray, 2019; Pfeiffer et al., 2021; Rebuffi et al., 2017; Lin et al., 2020). Most relevant, Pfeiffer et al. (2020) showed that discrete language modeling adapters and task adapters can be composed for effective cross-lingual multi-task transfer. Our experiments exploit specialized dialectal data augmentation to extend this approach to English dialects using explicit alignment loss.
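To make the bottleneck idea concrete, the following is a minimal pure-Python sketch of a single adapter forward pass: down-projection, nonlinearity, up-projection, and a skip connection. The function names, the ReLU nonlinearity, and the zero initialization of the up-projection (so that the adapter starts as the identity) are illustrative assumptions, not the configuration used in this work.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def bottleneck_adapter(h: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    """Down-project, apply ReLU, up-project, then add the skip connection."""
    z = [max(0.0, a) for a in matvec(w_down, h)]  # hidden_dim -> bottleneck_dim
    delta = matvec(w_up, z)                       # bottleneck_dim -> hidden_dim
    return [x + d for x, d in zip(h, delta)]      # residual (skip) connection


def init_adapter(hidden_dim: int, bottleneck_dim: int, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Assumed init: small random down-projection, zero up-projection."""
    rng = random.Random(seed)
    w_down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
              for _ in range(bottleneck_dim)]
    w_up = [[0.0] * bottleneck_dim for _ in range(hidden_dim)]
    return w_down, w_up


def adapter_parameter_count(hidden_dim: int, bottleneck_dim: int) -> int:
    """Weights only (biases omitted in this sketch)."""
    return 2 * hidden_dim * bottleneck_dim
```

Because the bottleneck dimension is much smaller than the hidden dimension, the adapter adds far fewer parameters than a full layer, which is the sense in which adapters are parameter-efficient.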

TADA: Task-Agnostic Dialect Adapters
As an initial effort, TADA aims to provide task-agnostic dialect robustness for English NLP. To do so, we build on work from both multilingual NLP and computer vision and apply explicit alignment losses for transfer learning. Concretely, we first generate a synthetic sentence-parallel corpus using the morphosyntactic transformations created by Ziems et al. (2023). Using these parallel sentences, we train TADA to align using a contrastive loss at the sequence level and an adversarial loss at the token level. At test time, TADA modules are stacked with task-specific adapters trained on SAE to improve the dialect performance on the target task without further training.

Synthetic Parallel Data
While cross-lingual transfer has leveraged the wealth of sentence-parallel bitexts from machine translation to learn alignment, there are no large-scale parallel English dialectal datasets. Therefore, we leverage Multi-VALUE, a rule-based morphosyntactic translation system from SAE to non-SAE dialects, to create parallel data (Ziems et al., 2023).
We start with SAE sentences sampled from the Word-in-Context (WiC) Dataset (Pilehvar and Camacho-Collados, 2019). WiC is designed to contain lexically diverse sentences and is sourced from high-quality lexicographer-written examples (Miller, 1994; Schuler, 2005). This avoids our alignment modules overfitting to specific vocabulary or noise from low-quality examples. We generate 1,000 such pairs, an amount which could feasibly be replaced with human-translated data.
This data limitation is intentional, as Multi-VALUE could alternatively be used to do large-scale pretraining on transformed data (Qian et al., 2022). With smaller data requirements, the data used to train TADA can be manually curated by native speakers and linguists to most accurately describe the dialect via minimal pairs (Demszky et al., 2021). Additionally, it opens the potential for TADA to be used for non-English dialects, related languages, and codeswitched variants where small amounts of manually translated data already exist (Diab et al., 2010; Salloum and Habash, 2013; Klubička et al., 2016; Costa-jussà, 2017; Costa-jussà et al., 2018; Popović et al., 2020; Chen et al., 2022; Agarwal et al., 2022; Hamed et al., 2022). Furthermore, using a small amount of data, in combination with a parameter-efficient method, reduces compute costs as a barrier for dialect speakers to develop and own language technology within their communities (Ahia et al., 2021).
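The data pipeline above can be sketched as sampling SAE sentences and pairing each with a rule-transformed counterpart. The sketch below is an illustration only: `toy_transform` is a hypothetical single rule (dropping one present-tense copula) standing in for the far richer Multi-VALUE transformations, and none of the names reflect the Multi-VALUE API.

```python
import random
import re
from typing import List, Sequence, Tuple


def toy_transform(sentence: str) -> str:
    """Hypothetical stand-in rule: delete the first 'is'/'are' copula.

    This is NOT Multi-VALUE; it only mimics the shape of a rule-based
    SAE -> non-SAE rewrite so the pairing logic can be shown end to end.
    """
    return re.sub(r"\b(is|are)\s+", "", sentence, count=1)


def build_parallel_corpus(
    sae_sentences: Sequence[str], n_pairs: int = 1000, seed: int = 0
) -> List[Tuple[str, str]]:
    """Sample up to n_pairs SAE sentences and pair each with its transform."""
    rng = random.Random(seed)
    k = min(n_pairs, len(sae_sentences))
    sample = rng.sample(list(sae_sentences), k=k)
    return [(sae, toy_transform(sae)) for sae in sample]
```

For example, `toy_transform("She is happy")` returns `"She happy"`, giving one sentence-parallel pair. Because the corpus is small, the rule-based step could be swapped for human translation without changing the rest of the pipeline.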

Contrastive Sequence Alignment
Multilingual NLP has shown that L2 alignment on small amounts of data can provide performance gains competitive with augmentation using translated data during finetuning (Conneau et al., 2018). This operates on the intuition that similar input representations are likely to lead to similar outputs. TADA extends this approach to dialects by minimizing the L2 distance between a frozen representation of an SAE input, $\mathrm{CLS}_{sae}$, and the TADA representation of a non-SAE input, $\mathrm{CLS}_{dial}$:

$$L_{con} = \lVert \mathrm{CLS}_{sae} - \mathrm{CLS}_{dial} \rVert_2 \qquad (1)$$
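As a small illustration of the sequence-level objective, the sketch below computes the L2 distance between paired CLS vectors. Averaging over the batch and the function names are assumptions made for the example; in training, only the dialect-side representation would receive gradients, since the SAE side is frozen.

```python
import math
from typing import Sequence


def l2_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(
    cls_sae_batch: Sequence[Sequence[float]],
    cls_dial_batch: Sequence[Sequence[float]],
) -> float:
    """Mean L2 distance over paired (SAE, dialect) CLS vectors.

    Batch averaging is an assumption of this sketch.
    """
    if len(cls_sae_batch) != len(cls_dial_batch):
        raise ValueError("batches must be paired one-to-one")
    distances = [l2_distance(s, d) for s, d in zip(cls_sae_batch, cls_dial_batch)]
    return sum(distances) / len(distances)
```

The loss is zero exactly when every dialect representation coincides with its SAE counterpart, which matches the intuition that similar inputs to the task adapter should yield similar outputs.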

Adversarial Morphosyntactic Alignment
Since our translated data is aligned at the sequence level, the contrastive loss is only applied to the CLS representations. However, the variation, and therefore our ideal alignment procedure, operates at the morphosyntactic level.
Lacking token-level aligned data, we instead pursue morphosyntactic alignment using unsupervised adversarial alignment methods (Zhang et al., 2017; Lample et al., 2018). Since our goal is to capture morphosyntactic differences, we use an adversary which pools the entire sequence using a single-layer transformer (Vaswani et al., 2017) with a two-layer MLP scoring head. A transformer adversary has the expressive capacity to identify misalignment in both individual tokens and their relationships.
We leave the source dialect frozen, which has been shown in computer vision to lead to representations that are composable with downstream modules (Tzeng et al., 2017). Given the adversarial scoring network Adv, a frozen SAE representation SAE, and a Non-SAE representation after TADA Dial, we train Adv to maximize:

$$A = \mathrm{Adv}(\mathrm{Dial}) - \mathrm{Adv}(\mathrm{SAE}) \qquad (2)$$

Then, define the morphosyntactic loss for TADA by minimizing the critic loss from Adv:

$$L_{ms} = \mathrm{Adv}(\mathrm{Dial}) - \mathrm{Adv}(\mathrm{SAE}) \qquad (3)$$
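The two-player setup can be illustrated with a short sketch. It assumes a Wasserstein-style critic objective, Adv(Dial) - Adv(SAE), averaged over a batch: the adversary is updated to increase this quantity and TADA is updated to decrease it. The `linear_adversary` scorer is a hypothetical stand-in that scores a single pooled vector; the adversary described above is a single-layer transformer with an MLP head over the whole token sequence.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]


def mean_score(adv: Scorer, batch: Sequence[Vector]) -> float:
    """Average adversary score over a batch of representations."""
    return sum(adv(x) for x in batch) / len(batch)


def critic_objective(adv: Scorer, sae_batch: Sequence[Vector],
                     dial_batch: Sequence[Vector]) -> float:
    """Quantity the adversary tries to maximize (assumed critic form)."""
    return mean_score(adv, dial_batch) - mean_score(adv, sae_batch)


def morphosyntactic_loss(adv: Scorer, sae_batch: Sequence[Vector],
                         dial_batch: Sequence[Vector]) -> float:
    """Quantity TADA tries to minimize: the same critic objective.

    The SAE side is frozen, so only the dialect term would carry
    gradient to the TADA parameters.
    """
    return critic_objective(adv, sae_batch, dial_batch)


def linear_adversary(weights: Vector) -> Scorer:
    """Hypothetical toy scorer: a dot product with fixed weights."""
    return lambda x: sum(w * v for w, v in zip(weights, x))
```

When the dialect representations are indistinguishable from the SAE ones, the objective is zero for any scorer, so the adversary has nothing left to exploit.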

Plug-And-Play Application
Finally, we propose a procedure for applying TADA to downstream tasks. We use composable invertible adapters (Pfeiffer et al., 2020) as our starting point. Using the 1,000 sentences from WiC, we train these adapters to minimize the combination of the contrastive loss $L_{con}$ and the adversarial morphosyntactic loss $L_{ms}$:

$$L_{TADA} = L_{con} + L_{ms} \qquad (4)$$

At test time, TADA modules can be stacked behind traditional task adapters (Houlsby et al., 2019). TADA serves to directly align the representations of Non-SAE inputs to the SAE embedding space that these task adapters were trained on. Our experiments show that this consistently improves adapter performance without further training.
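The plug-and-play step amounts to function composition: the dialect input first passes through the TADA module and the result is handed to an unchanged SAE task adapter. The sketch below shows that ordering with plain callables; `stack`, `tada_loss`, and the unweighted sum of the two losses are assumptions for illustration, and the single-pass view ignores that real adapters sit inside the transformer layers.

```python
from typing import Callable, List

Vector = List[float]
Module = Callable[[Vector], Vector]


def stack(*modules: Module) -> Module:
    """Compose modules so the first argument is applied first."""
    def stacked(h: Vector) -> Vector:
        for module in modules:
            h = module(h)
        return h
    return stacked


def tada_loss(contrastive: float, morphosyntactic: float) -> float:
    """Combined training objective; the unweighted sum is an assumption."""
    return contrastive + morphosyntactic


def with_dialect_support(tada_module: Module, sae_task_adapter: Module) -> Module:
    """Reuse an SAE task adapter for dialect input without retraining it."""
    return stack(tada_module, sae_task_adapter)
```

Since the task adapter is untouched, one TADA module per dialect can be paired with any number of existing SAE task adapters.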

Evaluating TADA
We benchmark TADA on 4 VALUE-transformed versions (Ziems et al., 2022, 2023) of the GLUE Benchmark (Wang et al., 2018). As discussed in our limitations, these benchmarks are artificial but enable the evaluation of TADA across multiple tasks and dialects. First, we show how TADA compares to SAE models and task-specific baselines for African American English (AAE). Then, we show that TADA is effective across 4 global dialects of English. Finally, we perform an ablation to evaluate the contribution of each loss function.
For all TADA experiments, we train using 1,000 WiC sentences as described in Section 3.1. We train for 30 epochs with early stopping based on the lowest contrastive loss on a development set of 100 held-out WiC sentences. In Section 5, we report full hyperparameters along with the training details for SAE and VALUE models.
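The model-selection rule described here, keeping the epoch with the lowest held-out contrastive loss, can be sketched as a generic loop. The `train_one_epoch` and `dev_loss` callables are hypothetical placeholders for the real training and evaluation steps; the loop runs all epochs and keeps the best snapshot, which is one simple way to realize this kind of early stopping.

```python
import copy
from typing import Any, Callable, Tuple


def train_with_checkpoint_selection(
    state: Any,
    train_one_epoch: Callable[[Any], Any],
    dev_loss: Callable[[Any], float],
    max_epochs: int = 30,
) -> Tuple[Any, int, float]:
    """Return (best_state, best_epoch, best_loss) by held-out loss."""
    best_state, best_epoch, best_loss = None, -1, float("inf")
    for epoch in range(1, max_epochs + 1):
        state = train_one_epoch(state)
        loss = dev_loss(state)
        # Strict '<' keeps the earliest epoch when losses tie.
        if loss < best_loss:
            best_state = copy.deepcopy(state)
            best_epoch, best_loss = epoch, loss
    return best_state, best_epoch, best_loss
```

Selecting on the contrastive loss alone avoids relying on the adversarial term, whose value depends on the current state of the adversary.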

Training Details
TADA is trained with the ADAM optimizer for 30 epochs with a batch size of 16 and a learning rate of 5e-4, both selected by hyperparameter search. We keep the model and epoch with the lowest L2 loss on the 100 held-out examples. Training takes approx. 30 minutes on an Nvidia GeForce RTX 2080 Ti.
To find this hyperparameter setup, we performed a grid search over batch sizes in $\{8, 16, 32\}$ and learning rates in $\{5 \cdot 10^{-3}, 5 \cdot 10^{-4}, 5 \cdot 10^{-5}\}$ for AAVE and used the configuration with the lowest L2 loss on the 100 held-out examples.
For all SAE and VALUE GLUE models, we finetune RoBERTa Base for 10 epochs with the ADAM optimizer, a learning rate of $2 \cdot 10^{-5}$, a batch size of 16, and a linear learning rate warm-up of 6%. For all SAE and VALUE GLUE adapters, we finetune the original adapter architecture (Houlsby et al., 2019) inside RoBERTa Base for 20 epochs with the ADAM optimizer, a learning rate of $1 \cdot 10^{-4}$, a batch size of 16, and a linear learning rate warm-up of 6%. Training all baseline models took approx. 3 days on an Nvidia GeForce RTX 2080 Ti. Additionally, we report experimental results on the BERT-base model in Appendix A1.
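The search over this 3 x 3 grid can be sketched as follows. The `evaluate` callable is a hypothetical placeholder that would train TADA with the given configuration and return the held-out L2 loss; only the grid values come from the text.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Union

BATCH_SIZES = (8, 16, 32)
LEARNING_RATES = (5e-3, 5e-4, 5e-5)


def grid_search(
    evaluate: Callable[[int, float], float],
    batch_sizes: Sequence[int] = BATCH_SIZES,
    learning_rates: Sequence[float] = LEARNING_RATES,
) -> Dict[str, Union[int, float]]:
    """Return the configuration with the lowest held-out loss."""
    best = None
    for batch_size, lr in product(batch_sizes, learning_rates):
        loss = evaluate(batch_size, lr)
        if best is None or loss < best["dev_loss"]:
            best = {"dev_loss": loss, "batch_size": batch_size, "learning_rate": lr}
    if best is None:
        raise ValueError("the grid must contain at least one configuration")
    return best
```

With nine configurations and a short training run per configuration, an exhaustive search stays inexpensive.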
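A linear warm-up of 6% means the learning rate ramps from near zero to its peak over the first 6% of optimizer steps. The sketch below shows one such schedule; the linear decay to zero after warm-up is an assumption of the sketch (the text specifies only the warm-up), and `lr_at_step` is a hypothetical helper name.

```python
def lr_at_step(step: int, total_steps: int, peak_lr: float,
               warmup_fraction: float = 0.06) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Warm-up: linear ramp to peak_lr over the first warmup_fraction of steps.
    Afterwards: linear decay to zero (ASSUMED; not stated in the text).
    """
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)
```

For a run of 100 steps this gives 6 warm-up steps, with the peak rate reached at the last of them.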

TADA vs. Task-Specific
Since ours is the first work to attempt task-agnostic dialect adaptation, we benchmark TADA in comparison to prior task-specific methods in Table 1.
We first establish pure SAE baselines for both full finetuning and adapter training (Houlsby et al., 2019). Interestingly, the gap between SAE performance and AAE performance is similar for adapters (-8.8) and full finetuning (-8.9) when trained on SAE. The minimal effects of the limited capacity of adapters on disparity indicate that dialectal discrepancy is largely within the pretrained LLM before finetuning. Without mitigation, SAE models alone perform poorly on non-SAE input.
We then train two task-specific dialect mitigations following the approach of VALUE, which augments training data with pseudo-dialect examples during finetuning. This is a strong baseline, as it allows the model to adapt specifically to in-domain augmented examples rather than the general sentences used to align TADA modules. When trained on augmented data, adapters (80.7 Avg., where Avg. refers to the mean performance across GLUE tasks) seem to outperform full finetuning (77.5 Avg.). We hypothesize that random initialization of adapters prevents conflicting gradients across dialects which can lead to negative transfer (Wang et al., 2020).
Finally, we combine TADA with task-specific SAE modules for our task-agnostic approach. TADA succeeds in our goal of generalizable performance improvements, yielding improved robustness for 6 out of 7 tasks for an average increase of 2.8 points on the GLUE benchmark. However, TADA performs 4% worse on average than task-specific VALUE-augmented adapters. These adapters are trained on larger amounts of dialectal training data drawn directly from each task, which likely explains their superiority. However, as noted in Table 1, these approaches scale training and storage linearly with the number of tasks, while TADA requires only a constant overhead.
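The scaling argument can be made explicit with a small count of dialect-specific modules. The sketch assumes one augmented adapter per (task, dialect) pair for the task-specific approach and one TADA module per dialect reused across tasks; the function name is hypothetical and the SAE task adapters, needed in both settings, are not counted.

```python
from typing import Dict


def dialect_modules_required(num_tasks: int, num_dialects: int) -> Dict[str, int]:
    """Count dialect-specific modules to train and store under each approach."""
    if num_tasks < 0 or num_dialects < 0:
        raise ValueError("counts must be non-negative")
    return {
        # One dialect-augmented adapter for every (task, dialect) pair.
        "task_specific": num_tasks * num_dialects,
        # One task-agnostic module per dialect, independent of the task count.
        "tada": num_dialects,
    }
```

With the 7 GLUE tasks and 4 dialects considered here, that is 28 task-specific adapters against 4 TADA modules, and adding a new task leaves the TADA count unchanged.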
These results are the first to indicate the possibility of task-agnostic dialect adaptation.While performance lags behind the task-specific intervention, these results indicate similar quality is possible with vastly improved scalability.This scalability across tasks is key to truly addressing dialect disparities as NLP has a growing impact across a larger number of tasks.

Cross-Dialectal Evaluation
We then confirm that TADA generalizes across regional dialects using 3 global dialect translations introduced by Ziems et al. (2023) in Table 2. Beyond AAE, we select Nigerian English and Indian English, as they are each estimated to have over 100 million English speakers, and Singaporean English, as it was identified as particularly challenging.
Ultimately, this applicability across dialects reinforces TADA's potential as a general tool, but with key limitations in fully removing the dialect gap. Truly dialect-robust NLP requires generalization across both tasks and dialects, making measuring the performance of both essential. We recommend future works on dialect modeling evaluate both.

Ablation Study
Finally, we show the results from an ablation in Table 3 to evaluate the contributions of each loss function to the final TADA method. Contrastive loss alone yields performance close to TADA, but it underperforms the combined loss functions on 6 out of 7 tasks (-0.3 Avg.). This extends evidence for the efficacy of this simple loss function from the multilingual (Conneau et al., 2018) to the dialectal domain.
When contrastive loss is removed, the adversarial loss quickly becomes unstable and suffers from mode collapse. This leads to pathological results, with the resulting adapters harming performance for all tasks (-44.9 Avg.).

Conclusions
English dialects are underserved by NLP, but are both tractable targets for transfer learning and have huge speaking populations (Bird, 2022).Models which serve English speakers inherently serve a global population who use the language natively and as a second tongue.
However, current approaches to improve dialectal robustness in English have so far focused only on one task at a time. The scalability of these task-specific methods limits their impact as language technology applications become increasingly diverse and pervasive. We argue that task-agnostic dialectal methods are a clear, yet unexplored path to serve these communities effectively.
We propose a simple yet effective technique, TADA, to address this, utilizing morphosyntactic data augmentation and alignment loss at both the sequence and morphosyntactic level to train adapter modules. When composed with SAE task adapters, TADA modules improve dialectal robustness consistently on the multi-task GLUE benchmark. Future work should aim to further reduce the dialect discrepancy to create more inclusive and equitable English language technology.

Limitations
TADA makes use of the pseudo-dialectal translation systems of prior work (Ziems et al., 2022, 2023). We rely on them as they are validated by dialect speakers and have been shown to be predictive of performance on Gold Dialect data. However, they were designed as stress tests of robustness which isolate morphology and syntax. We are therefore unsure how TADA performs when it faces the topical and register shifts which often are associated with naturally occurring dialects. These limitations are similar to localization issues in translated benchmarks (Moradshahi et al., 2020).
In this work, we evaluate TADA on only Encoder-only LLMs. Increasingly, both Encoder-Decoder and Decoder-only models are seeing wide-scale use due to their flexibility (Wang et al., 2022). Evaluating TADA and developing alternate tailored task-agnostic methodologies on these alternate LLM architectures is left to future work.
fore not understand TADA to remove discrepancies across all speakers, as improvements may vary within subcommunities within a dialect (Koenecke et al., 2020). Additionally, as TADA is task-agnostic, it is especially vulnerable to dual use. To mitigate this, we will release TADA under a license that forbids usage with intent to deceive, discriminate, harass or surveil dialect-speaking communities in a targeted fashion.

Table 1: GLUE results of RoBERTa Base (Liu et al., 2019) for the 7 GLUE Tasks (Matthews Corr. for CoLA; Pearson-Spearman Corr. for STS-B; Accuracy for all others). T is the number of target tasks for dialect adaptation. Tasks where TADA improves the performance of task-specific SAE adapters are marked with +.
Table 3 to evaluate the contributions of each loss

Table 2: Multi-Dialectal evaluation results across all Tasks (Matthews Corr. for CoLA; Pearson-Spearman Corr. for STS-B; Accuracy for all others) for 4 Non-SAE Dialect Variants of GLUE created using Multi-VALUE.

Table 3: TADA Loss Ablation results for RoBERTa Base for the 7 GLUE Tasks (Matthews Corr. for CoLA; Pearson-Spearman Corr. for STS-B; Accuracy for all others) for African-American English. Our results show that the combined loss functions of TADA lead to the strongest results.

Table A1: GLUE results of BERT Base (Devlin et al., 2019) for the 7 GLUE Tasks (Matthews Corr. for CoLA; Pearson-Spearman Corr. for STS-B; Accuracy for all others). T is the number of target tasks for dialect adaptation.