DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules

Existing large language models (LLMs) that mainly focus on Standard American English (SAE) often perform significantly worse when applied to other English dialects. While existing mitigations tackle discrepancies for individual target dialects, they assume access to high-accuracy dialect identification systems. The boundaries between dialects are inherently flexible, making it difficult to categorize language into discrete predefined categories. In this paper, we propose DADA (Dialect Adaptation via Dynamic Aggregation), a modular approach to imbue SAE-trained models with multi-dialectal robustness by composing adapters which handle specific linguistic features. The compositional architecture of DADA allows for both targeted adaptation to specific dialect variants and simultaneous adaptation to various dialects. We show that DADA is effective for both single-task and instruction-finetuned language models, offering an extensible and interpretable framework for adapting existing LLMs to different English dialects.


Introduction
As Natural Language Processing (NLP) becomes even more impactful, the equitable distribution of its benefits becomes an increasing concern. Specifically, NLP tooling is often trained and evaluated on dominant language variants, such as Standard American English (SAE). This results in a significant decline in performance when these tools are applied to non-SAE dialects. Studies have revealed that SAE models tested on African American Vernacular English (AAVE) encounter difficulties in language identification (Jurgens et al., 2017a) as well as various other natural language tasks (Jørgensen et al., 2016a; Kiritchenko and Mohammad, 2018; Blodgett et al., 2018). These challenges extend to automated speech recognition used by virtual assistants (Koenecke et al., 2020) and hate speech detection.

Figure 1: DADA dynamically composes adapters which handle specific features of dialectal variation to adapt an SAE model to various dialects by leveraging their commonality. We train nearly 200 feature adapters to capture the linguistic differences between SAE and its dialect variants. These feature adapters can be composed flexibly and arbitrarily to target different dialects.
Existing research to mitigate this disparity has mainly focused on dialectal adaptation targeting individual dialects of interest (Ziems et al., 2022; Garcia and Firat, 2022; Ziems et al., 2023; Sun et al., 2022). This approach is a powerful first step, but it has key limitations, as English alone has 77 recognized variants which themselves vary internally (Koenecke et al., 2020; Demszky et al., 2021). Moreover, prior adaptation methods would require high-accuracy dialect identification systems for real-world use even if separate systems are designed for each dialect. Such systems are not yet available for many dialects and related languages (Malmasi et al., 2016; Aepli et al., 2023; Chakravarthi et al., 2021; Aepli et al., 2022). Alternatively, a more direct approach involves training the model on a combination of various dialect variants in a multitask learning manner (Caruana, 1997; Liu et al., 2019a). However, this approach requires training new models for dialectal NLP from scratch, simultaneously with data from all desired dialects. This repeated training process is prohibitive, especially given the trend towards larger language models with costs upwards of millions of dollars. Thus, there is a pressing need for an effective and extensible approach which can adapt existing models to the multi-dialectal setting.
Previous works have developed a collection of lexical and morphosyntactic features that describe the differences between SAE and various other English dialects (Kortmann et al., 2020; Ziems et al., 2023). Many dialects can be described by this common set of features or linguistic rules, with each dialect expressing a subset of the feature space. In addition, dialects are not deterministic speech patterns but rather ranges of acceptable use of these features that speakers adjust based on social and personal contexts (Ziems et al., 2023; Koenecke et al., 2020; Demszky et al., 2021). As a result, dialects do not neatly fit into predefined categories.
To this end, we develop a model which handles this reality by accommodating the diversity of English variants at a fine-grained level (linguistic features). Concretely, we propose Dialect Adaptation via Dynamic Aggregation (DADA): a modular approach to adapt an established model trained on SAE to dialect variants by composing linguistic features. DADA captures and encapsulates each feature using adapters (Houlsby et al., 2019) trained on individual feature rules. Feature adapters are dynamically aggregated at test time using adapter fusion (Pfeiffer et al., 2021), which enables the SAE model to flexibly adapt to dialects. The modular design of DADA enables targeted adaptation to specific dialect variants or simultaneous adaptation to multiple dialects, and its compositional nature allows easy re-use of feature adapters even as the density of feature usage varies across dialects, speakers, and time. Beyond performance, the modular architecture improves interpretability by allowing analysis of the component factors responsible for the observed performance improvement.
To sum up, our work contributes the following:
• We propose DADA, a modular approach to adapt the standard SAE model to dialect variants via a dynamic aggregation of different linguistic features. (Sec. 3)

Related Work

Dialect-specific systems would require another system to recognize these dialects so that the appropriate model can be used for each input. This task is itself challenging, with state-of-the-art systems showing relatively low accuracy even when distinguishing high-resource dialects of English (Zampieri et al., 2023). Our work avoids this flaw by modeling multiple dialects at once using multidialectal training data. Multidialectal training data has been shown to potentially increase robustness across all dialects in multiple prior works around data collection (Jurgens et al., 2017b) and augmentation (Ziems et al., 2023).
Parameter-Efficient Learning To efficiently transfer pretrained language models to downstream tasks, several techniques (He et al., 2022) have been proposed to update only a small number of extra parameters while keeping most pretrained parameters frozen. For example, adapter tuning (Houlsby et al., 2019; Pfeiffer et al., 2020a) adapts large models using small bottleneck modules. Prefix tuning (Li and Liang, 2021) and Prompt tuning (Lester et al., 2021) prepend additional tunable prefix tokens to the input or hidden layers. Recently, Brown et al. (2020); Liu et al. (2022a,b) prompt language models for specific tasks without any parameter updates through in-context learning. Numerous research efforts have been carried out to exploit the full potential of parameter-efficient components. Pfeiffer et al. (2021) propose to aggregate the adapters trained on source tasks with an attentional layer in order to transfer the acquired knowledge to a target task. Asai et al. (2022) introduce a similar approach, but aggregate soft prompts rather than adapters. MAD-X employs adapter stacking for effective cross-lingual multitask transfer (Pfeiffer et al., 2020b). Wang et al. (2022) propose AdaMix, which combines adapters with random routing and consistency regularization to improve performance. Liu et al. (2023) use a gating network to ensemble adapters, improving model robustness against multiple spurious correlations in datasets simultaneously.
Instruction Tuning Inspired by the success of prompting LLMs to adapt to various tasks (Brown et al., 2020), instruction tuning (Sanh et al., 2022; Wei et al., 2022; Ouyang et al., 2022) finetunes language models on a variety of tasks described through instructions to achieve multitask capability and to enhance zero-shot performance on unseen tasks. Since instruction tuning involves prompting the language models at the input level, our approach is orthogonal to it and can be employed in conjunction with it to enhance a model's multitask and multi-dialect abilities simultaneously.
Dialect Adaptation via Dynamic Aggregation
We introduce Dialect Adaptation via Dynamic Aggregation (DADA), a modular method for adapting an existing model trained on Standard American English (SAE) to accommodate dialect variants at a finer-grained level. Our proposed method deploys a dynamic aggregation of feature adapters, which characterize the divergence of linguistic features between SAE and its dialect variants. Specifically, DADA involves the creation of a synthetic training dataset for each individual feature using transformation rules (Ziems et al., 2023). These synthetic datasets are used to train respective adapters for each linguistic feature. Finally, we compose these feature adapters to create a single model via an additional fusion layer.

Synthetic Datasets
Previous works have discerned a series of linguistic divergences and devised Multi-VALUE, a collection of lexical and morphosyntactic transformation rules between SAE and its 50 dialect variants (Ziems et al., 2022, 2023), including Appalachian English (AppE), Chicano English (ChcE), Colloquial Singapore English (CollSgE), Indian English (IndE), and African American Vernacular English (AAVE), among others. For instance, a well-known linguistic feature of AAVE is the use of Negative Concord, where two negative morphemes are employed to convey a single negation (Martin et al., 1998). This transformation rule is sensitive to the verb-object dependency structure and necessitates an indefinite noun object (Green, 2002). As an example, the SAE sentence "He doesn't have a camera" could be rendered as "He don't have no camera" in AAVE.
Let T = {T_1, T_2, ..., T_N} denote the set of transformation rules between SAE and its dialect variants. For each transformation rule T_i ∈ T, we can generate a corresponding synthetic dataset D_i by applying the respective rule to each individual training example within the original training dataset D.
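As an illustration, a per-feature synthetic dataset can be built by applying a single rule to every example. The sketch below assumes each rule T_i is available as a callable that rewrites an SAE sentence and that examples are NLI-style dicts; both this interface and the field names are hypothetical stand-ins, not the actual Multi-VALUE API.

    def build_feature_datasets(rules, sae_dataset):
        """Return one synthetic dataset D_i per transformation rule T_i."""
        feature_datasets = {}
        for name, transform in rules.items():       # e.g. "negative_concord" -> callable
            synthetic = []
            for example in sae_dataset:
                synthetic.append({
                    "premise": transform(example["premise"]),
                    "hypothesis": transform(example["hypothesis"]),
                    "label": example["label"],
                })
            feature_datasets[name] = synthetic
        return feature_datasets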

Feature Adapter
Adapter tuning is known for its ability to adapt quickly to new tasks without catastrophic forgetting (Pfeiffer et al., 2021). Given these benefits and the inherent modularity of adapters, we develop a feature adapter A_i for each of the N linguistic transformation rules T_i ∈ T by training it on the corresponding synthetic dataset D_i created in Sec. 3.1. We insert an adapter module after each feedforward layer of the backbone model M that has been trained on the original SAE task datasets, in order to target specific lexical and morphosyntactic differences between SAE and its dialect variants.
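A minimal PyTorch sketch of such a feature adapter, assuming a Houlsby-style bottleneck inserted after each feedforward sublayer (the bottleneck size and module wiring here are illustrative, not the exact configuration used in the paper):

    import torch
    import torch.nn as nn

    class FeatureAdapter(nn.Module):
        """Bottleneck adapter A_i inserted after a frozen feedforward layer."""

        def __init__(self, hidden_size: int, bottleneck_size: int = 64):
            super().__init__()
            self.down = nn.Linear(hidden_size, bottleneck_size)
            self.up = nn.Linear(bottleneck_size, hidden_size)
            self.act = nn.GELU()

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # Residual connection preserves the frozen backbone representation.
            return hidden_states + self.up(self.act(self.down(hidden_states)))

    # When training on D_i, only the adapter parameters are updated; the backbone M stays frozen:
    #   for p in backbone.parameters():
    #       p.requires_grad = False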

Dynamic Aggregation
In Sec. 3.2, we described the process of training a feature adapter A_i for each linguistic transformation rule to capture a specific type of linguistic difference between SAE and its dialect variants. However, it is common for multiple linguistic differences to co-occur within a single sentence in real-world scenarios, thereby necessitating the model to simultaneously consider these distinct linguistic features to varying degrees.
Therefore, we propose to dynamically aggregate the N trained feature adapters, denoted as A = {A_1, A_2, ..., A_N}, into the SAE-trained backbone model M via an additional fusion layer (Pfeiffer et al., 2021). For this purpose, we first construct a super-synthetic training dataset D, employing the same approach as described in Sec. 3.1, but with all lexical and morphosyntactic transformation rules T = {T_1, T_2, ..., T_N} applied. After incorporating the N trained feature adapters A and a fusion layer into each layer of the backbone model, we train the fusion layers using the super-synthetic training dataset D, while keeping the feature adapters A and the backbone model M frozen.
Following Pfeiffer et al. (2021), we define the fusion layer as a composition of Key, Value and Query matrices at each layer l of the transformer, denoted by K_l, V_l and Q_l respectively. The output of the feedforward layer h_l is taken as the query vector, and the output of each feature adapter A_i, denoted as a_{l,i}, is used as input to both the value and key transformations. With this attention-like fusion layer (Vaswani et al., 2017), the outputs of all feature adapters are combined as follows:

s_l = softmax(h_l^T Q_l · [a_{l,1}, ..., a_{l,N}]^T K_l)
o_l = s_l · [a_{l,1}, ..., a_{l,N}] V_l

where [·, ·] indicates the concatenation of vectors and o_l is the output of the l-th fusion layer.
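A simplified PyTorch sketch of this attention-like fusion (single-head, per-token attention over the N adapter outputs; the tensor layout and module wiring are our assumptions, not the exact implementation):

    import torch
    import torch.nn as nn

    class FusionLayer(nn.Module):
        """Attention-like fusion of N feature-adapter outputs at one transformer layer."""

        def __init__(self, hidden_size: int):
            super().__init__()
            self.query = nn.Linear(hidden_size, hidden_size)   # Q_l
            self.key = nn.Linear(hidden_size, hidden_size)     # K_l
            self.value = nn.Linear(hidden_size, hidden_size)   # V_l

        def forward(self, h, adapter_outputs):
            # h:               [batch, seq, hidden]     feedforward output (query)
            # adapter_outputs: [batch, seq, N, hidden]  adapter outputs a_{l,1..N} (keys/values)
            q = self.query(h).unsqueeze(2)                      # [batch, seq, 1, hidden]
            k = self.key(adapter_outputs)                       # [batch, seq, N, hidden]
            v = self.value(adapter_outputs)                     # [batch, seq, N, hidden]
            s = torch.softmax(q @ k.transpose(-1, -2), dim=-1)  # mixture weights over adapters
            o = (s @ v).squeeze(2)                              # [batch, seq, hidden]
            return o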
Through training on the super-synthetic dataset D, a parameterized compositional mixture of feature adapters can be learned that identifies the linguistic features applied to a given input and activates the corresponding feature adapters, thereby effectively addressing linguistic discrepancies between SAE and its dialect variants.
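A minimal sketch of this training stage, assuming the fusion parameters can be identified by name (the naming convention and optimizer choice are our assumptions; the learning rate matches Appendix B):

    import torch

    def make_fusion_optimizer(model, lr=2.5e-5):
        """Freeze the backbone M and all feature adapters A; train only the fusion layers."""
        for name, param in model.named_parameters():
            param.requires_grad = "fusion" in name
        trainable = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.AdamW(trainable, lr=lr)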
To sum up, the compositionality of DADA enables targeted adaptation to specific dialect variants by selecting appropriate feature adapters. DADA uses modularity and compositionality to adapt a model to the linguistic features present at test time, since the pervasiveness of a feature can vary greatly based on its applicability and density (Demszky et al., 2021). This also allows DADA to simultaneously adapt to various dialects by using a comprehensive set of feature adapters. We explore this property further in Sec. 5, using its interpretability to study which individual feature adaptations are utilized.

Multi-Dialect Adaptation
In this section, we demonstrate how DADA can enable the adaptation of an existing SAE model to multiple dialect variants, taking the Multi-Genre Natural Language Inference (MNLI; Williams et al., 2018) task as an example.

Experimental Setup and Evaluation
As described in Sec. 3.2, we train a feature adapter for each transformation rule from Ziems et al. (2023), the collection of lexical and morphosyntactic transformation rules between SAE and its dialect variants. In total, we train nearly 200 feature adapters for downstream use. Here, we demonstrate that these features can be flexibly composed in DADA to improve model performance across multiple dialects simultaneously. We evaluate on five representative dialects: AppE, ChcE, CollSgE, IndE, and AAVE. We employ RoBERTa Base (Liu et al., 2019b), finetuned on the original SAE MNLI training dataset, as the backbone model.
For each transformation rule, we generate a synthetic dataset by applying only that specific transformation rule to each example in the original MNLI training dataset. We only retain examples that differ from the original example, i.e., examples that have actually been transformed. Afterward, we train feature adapters using these synthetic datasets, as described in Sec. 3.2. To aggregate the trained feature adapters into the backbone model, we train a large fusion layer for 5 epochs on a synthetic dataset that applies all dialectal variations simultaneously, termed Multi. Additionally, we include a null adapter that remains the identity function; it is kept for purely SAE inputs. In Appendix B, we report full hyperparameters along with the training details. We evaluate DADA on five English dialects (AppE, ChcE, CollSgE, IndE, AAVE) and report the results in Table 1. Following Ziems et al. (2022, 2023), we construct each dialect-specific MNLI dataset by utilizing the subset of transformation rules that correspond to the respective dialect.
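The null adapter itself can be as simple as an identity module; the sketch below is one possible realization (how it is weighted alongside the other feature adapters is handled by the fusion layer of Sec. 3.3):

    import torch.nn as nn

    class NullAdapter(nn.Module):
        """Identity adapter: leaves the frozen SAE layer output unchanged, so the
        fusion layer can fall back to pure SAE behaviour for undialectal inputs."""

        def forward(self, hidden_states):
            return hidden_states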

Results
Compared to the standard SAE model trained on the original MNLI dataset (SAE baseline), DADA demonstrates significant performance improvements across all evaluated dialects and even on SAE, with an average improvement of 2.16%. Moreover, DADA delivers comparable performance to the strong baselines obtained by further fine-tuning or adapter tuning the SAE-trained model on dialect-specific training data (Single Finetuning and Single Adapter). However, while these two approaches require a perfect dialect identification system and D separate models (where D is the number of target dialects), our approach uses a single model and therefore does not rely on dialect identification. This makes DADA a simpler and more realistic method for use when the target dialect distribution is unknown.
Compared to additional finetuning or adapter tuning of the standard SAE model on Multi (Multi Finetuning and Multi Adapter), DADA brings an average improvement of 0.32% and 0.47%, respectively. Moreover, it tunes fewer parameters during a single training run than Multi Finetuning. We confirm in Sec. 5 that the empirically strong performance of DADA stems from the effective use of the correct individual feature adapters.
Note that with DADA, when a new dialect arises, it can be integrated by identifying the linguistic transformation rules that govern the shift from SAE to the new dialect, training a feature adapter for each new transformation rule, and finally retraining the fusion layer. Furthermore, the potential for re-use of trained feature adapters is significant, as many dialects share common linguistic features.

Interpretability
As discussed in Sec. 3, DADA can implicitly identify the relevant linguistic features for a given input and activate the corresponding feature adapters.
We validate this by investigating the correlation between attention scores within each layer of DADA and the presence of linguistic features, to determine whether the contributing feature adapters are relevant to the features present.

Analyses Setup and Results
Here, we use the AAVE dialect and the MNLI task as an example. To adapt a standard MNLI-finetuned RoBERTa Base model to the AAVE dialect, we only need to take into account the 10 transformation rules between SAE and AAVE proposed by Ziems et al. (2022). We select the corresponding feature adapters from our collection and report the details in Table 2. These results demonstrate the superior performance of DADA over all other methods evaluated.

Correlation Analysis of Fusion Activation
We perform a correlation analysis of these 10 feature adapters with the linguistic features applied to the input data. For each transformation rule, we calculate the softmax activation of each adapter for every input to which the specific linguistic feature applies, and average over all activations within the same layer across all instances in the AAVE MNLI test set. For better clarity, our final metric is the average utilization score of each feature adapter on the inputs where a given transformation rule applies, minus that adapter's average utilization score over the entire dataset. We plot the results for layers 1, 3, 7, and 11 in Figure 3. We find significant correlations in utilization in the lower layers (0-3), while those in the middle and higher layers are negligible. This is consistent with our intuition, as the primary distinction between SAE and its dialect variants lies in their linguistic features (lexical and morphosyntactic), which are mainly captured by the lower layers of the model. This analysis demonstrates that DADA has the capability to detect which linguistic features are relevant to a given input and subsequently trigger the corresponding feature adapters. This highlights the interpretability of DADA with regard to the underlying factors that contribute to performance improvement.
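A sketch of this analysis, assuming the model exposes its per-layer fusion attention weights after a forward pass and that a predicate tells us whether a given transformation rule applies to an input (both the fusion_weights attribute and feature_applies are hypothetical hooks):

    import torch

    @torch.no_grad()
    def utilization_scores(model, dataloader, feature_applies, num_layers, num_adapters):
        """Per-(layer, adapter) average fusion activation on inputs where one rule
        applies, minus the adapter's average activation over the whole test set."""
        feat_sum = torch.zeros(num_layers, num_adapters)
        feat_count = 0
        all_sum = torch.zeros(num_layers, num_adapters)
        all_count = 0
        for batch in dataloader:
            model(**batch)                   # forward pass records fusion softmax weights
            w = model.fusion_weights         # [num_layers, num_adapters], token-averaged
            all_sum += w
            all_count += 1
            if feature_applies(batch):
                feat_sum += w
                feat_count += 1
        return feat_sum / feat_count - all_sum / all_count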

Multi-Task Dialect Adaptation
Recent LLMs such as T0 (Sanh et al., 2022), FLAN-T5 (Chung et al., 2022), and InstructGPT (Ouyang et al., 2022) are instruction-tuned (Wei et al., 2022) for various tasks, which is orthogonal to our method, making it possible to combine the two approaches easily. In this section, we demonstrate how DADA can be applied to instruction-tuned large language models to improve their task-agnostic performance on dialects.

Experimental Setup
Using the AAVE dialect as a case study to demonstrate the effectiveness of our method in adapting the SAE model across multiple tasks, we include the tasks from the AAVE-transformed version (Ziems et al., 2022) of the GLUE Benchmark (Wang et al., 2018): CoLA, MNLI, QNLI, QQP, SST-2, and STS-B. For our backbone model, we employ FLAN-T5 Base (Chung et al., 2022). Although the original paper incorporates GLUE within FLAN-T5's training data, we retrain the model on these specific tasks to enhance its suitability.

Multi-task training
For each transformation rule of the AAVE dialect, we construct synthetic training data following the procedure described in Sec. 3.1. However, in the case of a multi-task model, we construct a synthetic dataset for each task considered and utilize the mixture to train the corresponding feature adapter. Subsequently, we fuse these feature adapters by training a fusion layer on the super-synthetic dataset Multi-Task AAVE, which is constructed by applying all the AAVE transformation rules. In Appendix D, we provide the templates used to train the FLAN-T5 model. In Appendix B, we report full hyperparameters along with the training details.
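A sketch of how the training mixture for one feature adapter could be assembled in the multi-task setting (the transform callable, the instruction templates, and the field names are stand-ins for the actual Multi-VALUE rules and the Appendix D templates):

    import random

    def multi_task_mixture(transform, task_datasets, templates, seed=0):
        """Build one feature adapter's training mixture across tasks.
        task_datasets: {task_name: list of SAE examples};
        templates: {task_name: fn(example) -> (input_text, target_text)}."""
        mixture = []
        for task, dataset in task_datasets.items():
            for example in dataset:
                inp, tgt = templates[task](transform(example))
                mixture.append({"input": inp, "target": tgt, "task": task})
        random.Random(seed).shuffle(mixture)
        return mixture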
We assess the performance of DADA on the AAVE-transformed version of the GLUE Benchmark and compare its results with the SAE baseline and Adapter Tuning with Multi-Task AAVE.

Results
It is surprising to note that although single Adapter Tuning with Multi-Task AAVE demonstrates improvements on 4 out of 7 tasks, its overall average performance is even inferior to that of the SAE baseline. In contrast, DADA consistently outperforms both the SAE baseline and Adapter Tuning across all evaluated tasks, resulting in overall improvements of 1.80/1.92 points on the AAVE GLUE benchmark, respectively. Specifically, on the relatively large datasets, DADA achieves notable accuracy improvements of 2.0%/1.0% on MNLI-mm, 0.9%/1.2% on QNLI, and 1.5%/0.9% on QQP when compared to the SAE Baseline and Adapter Tuning, respectively. These results demonstrate that our proposed approach, DADA, is not limited to single-task applications but can be easily scaled up to accommodate various tasks for use with the increasingly common multi-task instruction-tuning setup used in popular large-scale industrial systems (Ouyang et al., 2022; OpenAI, 2023a; Anil et al., 2023; OpenAI, 2023b).
In Table 4, we also present the results obtained with ChatGPT (OpenAI, 2023a). Due to budget constraints, we were only able to evaluate 500 examples from the development set of each task. However, even with this limited evaluation, we can still observe that ChatGPT performs significantly worse than the SAE FLAN-T5 Base model on 5 out of 7 tasks. This emphasizes that merely scaling up the model is inadequate for tackling the challenge of dialect disparities; these limitations persist even in the context of large language models. Inspired by "expert" prompts (Odena et al., 2021; Shi et al., 2022), we incorporate a "Native Speaker" Prompt for ChatGPT: "You are a native [DIALECT_NAME] English speaker, and here is your task:". However, ChatGPT + "Native" Prompt does not yield improved results and, in fact, performs even worse than vanilla ChatGPT on all evaluated tasks. This highlights that dialect adaptation is not solved by trivial prompt-based interventions, which are also less grounded in expert linguistic resources than DADA.

Conclusion
In this paper, we present Dialect Adaptation via Dynamic Aggregation (DADA), a fine-grained and modular approach designed to adapt an established model trained on Standard American English to its dialect variants through the compositional aggregation of linguistic features. Our experiments demonstrate that the compositionality of DADA enables targeted adaptation to specific dialects and improves robustness across multiple evaluated dialects, including AppE, ChcE, CollSgE, IndE, and AAVE. Our analysis also highlights the interpretability of DADA, as shown through its capability to identify the relevant linguistic features for a given input and trigger the corresponding adapters. Furthermore, our experiments on FLAN-T5 illustrate the potential of applying DADA to task-agnostic instruction-tuned large language models, showcasing its generalizability.
Limitations
DADA involves training feature adapters and the fusion layer, which can make it computationally expensive, especially when dealing with a substantial number of linguistic rules. However, each training run only requires a small number of parameters to be learned, and feature adapter training can be parallelized. More importantly, the trained feature adapters exhibit significant reusability: the same set of feature adapters can be reused for multiple dialects, though the fusion layer would need to be retrained for those dialects. If a use case does not involve significant reuse, this aspect may indeed remain a limitation. We will release our trained feature adapters so that future studies will not need to incur the up-front training cost again.
Furthermore, while DADA has the flexibility to utilize any linguistic rules, in our experiments we specifically employed linguistic transformation rules that are well-established in prior work for English (Ziems et al., 2022, 2023). These rules were chosen because they were curated by linguists, validated by dialect speakers, and because English has many globally relevant dialects (Bird, 2022). However, evaluating DADA for other language groups and broader sets of lexical variation is a key area for future work.
While DADA mainly relies on Multi-VALUE (Ziems et al., 2022, 2023), the two are orthogonal processes with different assumptions about dialect use. For each dialect, Multi-VALUE defines the density of a dialectal feature as the probability of the feature occurring when it is applicable, as well as the probability of the corresponding perturbation being used when converting a sentence from SAE into that dialect. However, the actual prevalence of a feature also depends heavily on its applicability.
DADA instead focuses on adapting to the linguistic features present in a given sentence. We learn a parameterized compositional mixture of the dialectal features automatically, rather than relying on static assumptions of density. This avoids what we view as a major issue: it is often difficult to determine the dialect of an input, since dialects themselves vary depending on context and speaker. The density of a dialectal feature represents an approximation across the entire dialect, but may not be accurate for a specific speaker and context (Koenecke et al., 2020). On the other hand, DADA can dynamically recognize the applicable dialectal features for a given input and activate the corresponding feature adapters. It remains to be explored in future work how the density of dialectal features as captured in the linguistic literature relates to the compositional mixture of these features as learned in the fusion layer of DADA.
A Transformation Rules Details
Ziems et al. (2022, 2023) developed a collection of lexical and morphosyntactic transformation rules that account for the differences in linguistic features between SAE and its various dialect variants. In our study, we build upon this work by training transformation adapters for each rule in this collection. In their original paper, they present a comprehensive overview of each transformation rule; in its Appendix B, Tables 9-21, they provide detailed Multi-VALUE implementations, including an enumeration of the implemented dialects and features, accompanied by illustrative examples for each.

B Training Details
Multi-Dialect Adaptation We train feature adapters for each transformation rule using synthetic datasets, as described in Sec. 3.2, with a learning rate of 3e-4 and a batch size of 64, following Houlsby et al. (2019). To prevent significant performance differences among the trained feature adapters due to varying sizes of synthetic datasets, we fix the number of training steps to 10,000. For each feature adapter, we choose the checkpoint with the highest accuracy on the validation matched split of a synthetic dataset that applies all dialectal variations simultaneously, termed Multi. For dynamic aggregation, we train a large fusion layer for 5 epochs on Multi, with a learning rate of 2.5e-5 and a batch size of 64.
Multi-Task Dialect Adaptation For feature adapter training, we set the learning rate to 1e-3 and fix the number of training steps to 50,000. To fuse these feature adapters, we train a fusion layer for 5 epochs using a learning rate of 8e-5.
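For reference, the hyperparameters above can be collected into a single configuration sketch (the dictionary layout is ours; the values are those reported in this appendix):

    # Multi-dialect adaptation
    MULTI_DIALECT = {
        "feature_adapter": {"lr": 3e-4, "batch_size": 64, "train_steps": 10_000},
        "fusion_layer":    {"lr": 2.5e-5, "batch_size": 64, "epochs": 5},
    }

    # Multi-task dialect adaptation
    MULTI_TASK = {
        "feature_adapter": {"lr": 1e-3, "train_steps": 50_000},
        "fusion_layer":    {"lr": 8e-5, "epochs": 5},
    }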

C Utilization Correlation Coefficients Plots
In Sec. 5, we showcase the effectiveness of DADA in adapting the RoBERTa Base (Liu et al., 2019b) model, finetuned on the original SAE MNLI training dataset, to AAVE. To demonstrate the interpretability of DADA, we conduct an analysis of the utilization correlation among the 10 aggregated transformation adapters. We present utilization correlation coefficient plots for all layers in Figures 4 and 5.

Figure 2: The overall process of DADA. We first construct a synthetic dataset D_i by applying each linguistic transformation rule T_i ∈ T, such as drop_aux ("AAVE allows copula deletion and other auxiliary dropping"), to each individual training example within the original training dataset D. Then we develop a feature adapter A_i for each linguistic rule T_i by training it on the corresponding synthetic dataset D_i. We select the backbone model trained on the original SAE task datasets so that the feature adapter captures linguistic differences while disregarding the task-specific information.

Figure 4: Correlation coefficients between the transformation adapters (columns) and the inputs to which specific transformation rules (rows) apply, in layers 0-5.

Figure 5: Correlation coefficients between the transformation adapters (columns) and the inputs to which specific transformation rules (rows) apply, in layers 6-11.
D Templates

... that these questions are the same? {answer}

SST-2 The SST-2 (Stanford Sentiment Treebank; Socher et al., 2013) task is a widely used benchmark for sentiment analysis. It involves classifying the sentiment of a given sentence as either positive or negative. For the SST-2 task, we adopt the following template: Review: {sentence} Is this movie review sentence negative or positive? The answer is: {answer}

STS-B The Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017) ... {answer}

https://www.kaggle.com/c/quora-question-pairs

Table 1: Evaluation results of the SAE RoBERTa Base (Liu et al., 2019b) model for five English dialects: AppE, ChcE, CollSgE, IndE and AAVE. Due to the submission limitations of the GLUE benchmark, the results are reported on the validation mismatched split. The significance bars of the mean accuracies are determined through a paired bootstrap test conducted on the concatenation of each individual dialect dataset. D is the number of target dialects for dialect adaptation. DADA outperforms the standard SAE baseline on all five dialects and SAE (marked as (+)), with an average improvement of 2.16%. The underline indicates that the performance of DADA even surpasses that of individual models.

Null adapter For SAE inputs, every adapter has the potential to incorrectly change the model's original predictions. Therefore, we introduce a null adapter which preserves the output of the original SAE model at each layer. We conduct an ablation study to evaluate the necessity of the null adapter by comparing against a variant in which it is excluded, denoted DADA w/o null. As shown in Table 1, excluding the null adapter results in a slight drop in performance on SAE.

Table 2: Transformation rules for the AAVE dialect proposed by Ziems et al. (2022), along with the number of training examples in the corresponding synthetic datasets and the evaluation accuracies of the resulting feature adapters.

We evaluate the resulting model on the test split of the AAVE matched MNLI dataset, as shown in Table 2. In comparison to the standard SAE model, DADA demonstrates a 3.2% and 1.4% improvement on AAVE and SAE, respectively. Moreover, DADA outperforms simple additional finetuning and adapter tuning of AAVE on the SAE model by 0.4% and 0.5%, respectively, achieving the best performance of 86.6% on AAVE.

Table 3: AAVE adaptation results of RoBERTa Base (Liu et al., 2019b). Pretrained denotes the pretrained RoBERTa Base model, while SAE finetuned denotes the RoBERTa Base model that has been finetuned on the original SAE MNLI dataset. FT refers to "fine-tuning". DADA demonstrates superior performance on AAVE and SAE compared to baselines (marked as ✓).

Table 4: Multi-Task AAVE adaptation results of SAE FLAN-T5 Base (Chung et al., 2022) (Matthews Corr. for CoLA; Pearson-Spearman Corr. for STS-B; accuracy for all others). SAE Baseline denotes the FLAN-T5 Base model that has been finetuned using the original SAE mixture of task datasets. In comparison to both the SAE Baseline and Adapter Tuning with Multi-Task AAVE, DADA consistently exhibits superior performance across all evaluated tasks (marked with ✓). Due to budget constraints, the results of ChatGPT are reported on a 500-example subset of the development sets. Prompt-based interventions do not improve ChatGPT's performance on AAVE; on the contrary, they can even result in further degraded performance (marked with ↓).