Efficient Test Time Adapter Ensembling for Low-resource Language Varieties

Adapters are lightweight modules that allow parameter-efficient fine-tuning of pretrained models. Specialized language and task adapters have recently been proposed to facilitate cross-lingual transfer of multilingual pretrained models (Pfeiffer et al., 2020b). However, this approach requires training a separate language adapter for every language one wishes to support, which can be impractical for languages with limited data. An intuitive solution is to use a related language adapter for the new language variety, but we observe that this solution can lead to sub-optimal performance. In this paper, we aim to improve the robustness of language adapters to uncovered languages without training new adapters. We find that ensembling multiple existing language adapters makes the fine-tuned model significantly more robust to other language varieties not included in these adapters. Building upon this observation, we propose Entropy Minimized Ensemble of Adapters (EMEA), a method that optimizes the ensemble weights of the pretrained language adapters for each test sentence by minimizing the entropy of its predictions. Experiments on three diverse groups of language varieties show that our method leads to significant improvements on both named entity recognition and part-of-speech tagging across all languages.


Introduction
Massively multilingual pretrained models (Devlin et al., 2019; Huang et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020) combined with cross-lingual transfer now define the state of the art on a variety of NLP tasks (Hu et al., 2020). Within this paradigm, multilingual pretrained models are fine-tuned on annotated data of a task in a high-resource language, and transferred to other languages. Several recent works propose parameter-efficient fine-tuning methods that insert small adapter modules between the layers of pretrained models (Rebuffi et al., 2017; Houlsby et al., 2019). In this line of work, the pretrained model is usually frozen while only the adapters are fine-tuned for a downstream task, which both improves the model's ability to learn new tasks and keeps it compact in terms of storage on disk or in memory. The adapters can be applied to the cross-lingual transfer setting by training separate language and task adapters (Pfeiffer et al., 2020b; Üstün et al., 2020). Specifically, Pfeiffer et al. (2020b) propose to perform zero-shot transfer by first training language-level adapters on monolingual data in different languages and then a task adapter on annotated data in the source language.
One drawback of this framework is that a separate language adapter is required for each target language, which is problematic in cases where the data to train these adapters cannot be easily obtained, such as for languages with diverse regional or demographic variations. In fact, certain language varieties are not included in the standard language identification tools, which makes it challenging to reliably obtain even unlabeled data (Salameh et al., 2018;Caswell et al., 2020;Demszky et al., 2021). To give just one example, the Nordic languages and dialects form a dialect continuum where the total number of language varieties is difficult to estimate, and language varieties constantly emerge in culturally and linguistically diverse areas (Svendsen and Røyneland, 2008;Røyneland and Jensen, 2020). Although highly related, these language varieties have many systematic differences, which need to be addressed by NLP systems that equitably serve all speakers (Kumar et al., 2021). One potential mitigation strategy is directly using an adapter trained on another similar language variety, but we find this sub-optimal in experiments ( § 4).
Instead, we propose two methods to combine existing language adapters to adapt the model to new language varieties at test time without any training data. First, we find that simply ensembling multiple related language adapters can significantly improve the fine-tuned model, compared with using individual language adapters. Second, we propose Entropy Minimized Ensemble of Adapters (EMEA; Fig. 1), which adapts the ensemble weight of the language adapters for each test instance by minimizing the ensembled model's prediction uncertainty. Our experiments show that EMEA further improves over vanilla ensembling for three groups of uncovered language varieties on both the named entity recognition and part-of-speech tagging tasks.

Adapters for Cross-lingual Transfer
To facilitate our discussion, we briefly summarize the MAD-X framework (Pfeiffer et al., 2020b) for zero-shot cross-lingual transfer and identify its shortcomings. The goal of MAD-X is to fine-tune a multilingual pretrained model M to m downstream tasks T_1, T_2, ..., T_m, each of which could be in n languages L_1, L_2, ..., L_n. To this end, MAD-X relies on language and task adapters, which are lightweight functions inserted in the Transformer layers in M, usually a feed-forward down-projection followed by an up-projection. Specifically, let h be the output of an intermediate layer in M; then L_j(h) is the transformation that projects h into the embedding space for language L_j, and T_i(L_j(h)) is the transformation that projects L_j(h) into the embedding space for task T_i.
MAD-X trains the adapters T_i(·) and L_j(·) in two steps. First, for each language L_j, its adapter L_j is inserted into M to replace the output of each layer h with L_j(h). The resulting model, which we denote as L_j • M, is trained on unlabeled data in L_j using an unsupervised objective such as masked language modeling (MLM; Devlin et al., 2019). Second, for each task T_i, its adapter T_i is inserted on top of a source language adapter L_src. The resulting model T_i • L_src • M is trained on the downstream task T_i in language L_src. After these two steps, T_i • L_j • M can be used to perform zero-shot cross-lingual transfer for any task T_i and language L_j.
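As a concrete illustration of these building blocks, the NumPy sketch below implements a bottleneck adapter (down-projection, ReLU, up-projection, residual connection) and stacks a task adapter on a language adapter. The `make_adapter` helper, the random weights, and all shapes are illustrative stand-ins, not the MAD-X implementation; in practice the weights would be learned via MLM (language adapters) or task fine-tuning (task adapters).

```python
import numpy as np

def make_adapter(d_model, bottleneck, rng):
    """A bottleneck adapter: down-projection, ReLU, up-projection,
    plus a residual connection. Weights are random stand-ins for
    parameters learned during adapter training."""
    W_down = rng.normal(scale=0.1, size=(d_model, bottleneck))
    W_up = rng.normal(scale=0.1, size=(bottleneck, d_model))
    def adapter(h):
        return h + np.maximum(h @ W_down, 0.0) @ W_up
    return adapter

rng = np.random.default_rng(0)
L_src = make_adapter(16, 4, rng)   # language adapter, trained on monolingual data
T_task = make_adapter(16, 4, rng)  # task adapter, trained on labeled source data

h = rng.normal(size=(5, 16))       # layer output for a 5-token sentence
out = T_task(L_src(h))             # the T_i . L_src . M composition at one layer
```

At test time the language adapter can be swapped out (e.g., replacing `L_src` with another language's adapter) while the task adapter stays fixed, which is the substitution examined in the following sections.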
Shortcomings This approach requires a separate adapter for each language one wishes to support. The online database AdapterHub aims to improve the efficiency and reuse of trained language and task adapters (Pfeiffer et al., 2020a), but currently supports only about 50 languages, so most languages are not covered. More importantly, as mentioned in the introduction, certain languages have diverse regional varieties, and the difficulty of reliably obtaining data for them makes adapter-based approaches especially brittle in these cases. In § 3, we propose strategies to improve the robustness of language adapters to uncovered languages without training new adapters.

Generalizing Language Adapters to Related Languages
We consider the setting where we have a multilingual pretrained model M as well as the pretrained task adapters T_1, T_2, ..., T_m and language adapters L_1, L_2, ..., L_n. We want to use M and the existing adapters to support a new language L_new, which is not in {L_1, L_2, ..., L_n}, on a given task T without training a new adapter for L_new.

Related Language Adapters One potential solution is to find the most related language L_rel ∈ {L_1, L_2, ..., L_n} and then use T • L_rel • M to do inference in L_new. However, this has two disadvantages. First, the task adapter T is only trained in the setting of T • L_src • M, so it might not generalize well to the test time setting of T • L_rel • M (as shown in § 4.1). Second, while the pretrained model M may be relatively robust against distribution shifts (Hendrycks et al., 2020), the specialized language adapters might make the model brittle to language variations because they are trained for specific languages. Our experiments in § 4.1 show that this solution indeed leads to poor performance.
Adapter Ensembling As a first solution to this problem, we propose an extremely simple strategy of averaging the transformed outputs of multiple language adapters. Specifically, we use both the source language adapter L_src and adapters from related languages with similar linguistic properties to the new language. Let R be the set of the source and related language adapters. To do inference on a task T for the new language L_new, we transform the output h of each layer in M with the language adapters as L_avg(h) = 1/|R| · Σ_{L_i ∈ R} L_i(h).
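A minimal sketch of this averaging, with random bottleneck adapters standing in for pretrained language adapters (the `toy_adapter` helper and all shapes here are illustrative):

```python
import numpy as np

def ensemble_adapters(h, adapters):
    """L_avg(h): average the transformed outputs of the adapters in R.
    h: (num_tokens, d_model) layer output; adapters: list of callables."""
    return np.mean([adapt(h) for adapt in adapters], axis=0)

def toy_adapter(rng, d=8, r=2):
    """A random bottleneck adapter, used only for illustration."""
    W_down = rng.normal(scale=0.1, size=(d, r))
    W_up = rng.normal(scale=0.1, size=(r, d))
    return lambda h: h + np.maximum(h @ W_down, 0.0) @ W_up

rng = np.random.default_rng(0)
R = [toy_adapter(rng) for _ in range(3)]  # source + two related-language adapters
h = rng.normal(size=(4, 8))               # layer output for a 4-token sentence
h_avg = ensemble_adapters(h, R)           # replaces h before the task adapter
```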

Entropy Minimized Ensemble of Adapters
While ensembling is a simple and effective strategy to combine multiple potentially beneficial language adapters, the equal weighting of all language adapters could be sub-optimal for L_new; different language varieties, or even sentences, could benefit from a different weighting of the pretrained language adapters. To further improve adapter ensembling, we generalize L_avg(h) into a learnable weighted average Σ_{i=1}^{|R|} α_i L_i(h), where α_1, α_2, ..., α_{|R|} are learnable weights satisfying α_i ≥ 0 and Σ_{i=1}^{|R|} α_i = 1. Next, we propose the Entropy Minimized Ensemble of Adapters (EMEA) method, which learns the adapter weightings for each sentence without additional training.
The intuition behind our method is that a good adapter weight α for a test input x should make the model more confident in its prediction for x; that is, it should lead to lower model entropy over the input (Shannon, 1948; Wang et al., 2021). Specifically, for structured prediction tasks, we want to classify each word x_w in a test input x with W words into one of C possible classes. We consider the entropy H(x; α) = −Σ_{w=1}^{W} Σ_{c=1}^{C} p(c|x_w) log p(c|x_w), where p(c|x_w) is the model's predicted probability of class c for word x_w. Since each p(c|x_w) is a function of the ensemble weights α, we can calculate the gradient of each weight as g_i = ∇_{α_i} H(x; α).
To minimize the entropy loss, we can simply perform gradient descent on each α_i using the corresponding gradient g_i, i.e., α_i ← α_i − γ g_i, where γ is the learning rate. We then use the updated α to calculate the final prediction for x. In § 4, we find that a single gradient update step already leads to better performance than simple ensembling. We can additionally perform multiple steps of gradient descent to obtain a better α at the cost of lower inference speed. Alg. 1 shows the pseudocode of our method.
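The per-sentence procedure can be sketched end to end as follows. Everything here is an illustrative stand-in rather than the paper's implementation: a random linear head replaces the task adapter and output layer, a finite-difference gradient replaces backpropagation, and the ensemble weights are parameterized through a softmax so that the constraints α_i ≥ 0 and Σ_i α_i = 1 hold automatically.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prediction_entropy(alpha, adapter_outs, W_cls):
    """H(x; alpha), summed over the words of one test sentence.
    adapter_outs: (R, num_words, d) stacked language-adapter outputs
    W_cls: (d, C) stand-in linear classifier head"""
    h = np.tensordot(alpha, adapter_outs, axes=1)  # (num_words, d) weighted avg
    p = softmax(h @ W_cls)                         # (num_words, C) class probs
    return -(p * np.log(p + 1e-12)).sum()

def emea_step(logits, adapter_outs, W_cls, lr=0.01, eps=1e-5):
    """One entropy-minimization step on the ensemble weights.
    alpha = softmax(logits) keeps the weights on the simplex."""
    base = prediction_entropy(softmax(logits), adapter_outs, W_cls)
    grad = np.zeros_like(logits)
    for i in range(len(logits)):  # finite-difference gradient per weight
        bumped = logits.copy()
        bumped[i] += eps
        grad[i] = (prediction_entropy(softmax(bumped), adapter_outs, W_cls)
                   - base) / eps
    return logits - lr * grad

rng = np.random.default_rng(0)
adapter_outs = rng.normal(size=(3, 5, 8))  # 3 adapters, 5 words, d = 8
W_cls = rng.normal(size=(8, 4))            # 4 classes
logits = np.zeros(3)                       # start from the uniform ensemble
for _ in range(10):                        # EMEA-s10-style refinement
    logits = emea_step(logits, adapter_outs, W_cls)
alpha = softmax(logits)                    # per-sentence adapter weights
```

Running several steps mirrors EMEA-s10; the updated α is then used to make the final prediction for the sentence.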
Experiments

Model We use the mBERT (Devlin et al., 2019) model, which shows good performance for low-resource languages on the structured prediction tasks (Pfeiffer et al., 2020b; Hu et al., 2020). We use the English annotated data to train the task adapter. Each experiment is run with 3 different random seeds and we report the average performance. More details can be found in Appendix A.
Languages Due to the scarcity of datasets for dialects, we focus on three groups of closely related languages to simulate the setup of language varieties. Each group contains a language with a pretrained adapter available on AdapterHub.

Table 2: F1 of the baselines and our methods for each language group. EMEA-s1 updates the adapter weights with a single gradient step while EMEA-s10 updates them for 10 steps.

Baselines We compare with several baselines: 1) En: the English adapter; 2) Related: the best performing related language adapter; 3) Continual learning (CL): we use the English language adapter and update its parameters using the entropy loss for each test input; 4) Fusion: we learn another set of key, value, and query parameters in each layer that uses the layer output as a query to mix together the outputs of the adapters (Pfeiffer et al., 2021). Since we do not use labeled data in the new language, we train the fusion parameters on English labeled data.

Results
The results can be found in Tab. 2. For most languages, using the English adapter is better than the best individual related language adapter. This confirms our hypothesis that specialized language adapters are not robust to language variations. CL leads to slight improvements for some languages but is generally comparable to En. Fusion improves over En for the NER task, but it requires training and storing extra parameters. Its performance is also not consistent across languages and tasks, likely because it is only trained on English labeled data.
Using multiple language adapters brings significant gains Ensembling leads to significant gains for the non-Latin language group. It also brings improvements over, or is comparable to, the best baseline on the other languages. EMEA delivers further improvements across almost all languages, demonstrating the effectiveness of adapting language adapter weights to each test sentence. With only a single gradient update step on the ensemble weights, EMEA-s1 already leads to significant improvements over ensembling for NER. EMEA-s10 brings additional improvements on both tasks because it learns better ensembling weights with 10 gradient update steps (we list the inference cost of each method in Appendix B). We hypothesize that the proposed methods improve non-Latin languages more because these are low-performing languages that the model is more uncertain about.
Effect of test batch size In Fig. 2 we plot the result of using different test batch sizes with EMEA on the NER task. A smaller batch size leads to more fine-grained test time adaptation with a higher computational cost. Fig. 2 shows that a smaller batch size indeed leads to better performance while using a larger batch size still outperforms the baseline.
Significance of source language adapter We investigate whether the benefit of adding the source language adapter comes from the discrepancy between training and testing of the task adapter. We train different task adapters with language adapters other than English (en), and compare the improvement of adding the en adapter to the ensemble. Fig. 3 shows that the en adapter provides the largest benefit when it is used to train the task adapter, which verifies that using different language adapters with the task adapter between training and testing leads to sub-optimal cross-lingual transfer performance.

Figure 5: Mean and standard deviation of the weight for each adapter for the is (left) and hi (right) language groups.
Comparison to training new adapters To better understand how much data is required to train new language adapters that are competitive with EMEA, we trained new adapters using a small amount of monolingual data in the target language. We focus on two languages, mr and no, on the NER task, and show the results in Fig. 4. Note that this setting puts EMEA at a disadvantage because EMEA does not require any training. It takes about 100k monolingual examples for no to reach performance comparable to our method, while mr still lags behind EMEA. As large amounts of monolingual data are difficult to obtain for many language varieties and under-represented languages, EMEA can serve as a useful baseline for applying NLP models to such low-resource settings.
Analysis of weights We plot the mean and standard deviation of ensembling weights from EMEA in Fig. 5. The En adapter gets the highest weight for both language groups, in line with the results in Tab. 2 showing en as the best individual adapter. For the hi language group, the ar adapter tends to have the least benefit, probably because it has a different script from the languages we test on.

Related Work
Our work is related to parameter-efficient fine-tuning of pretrained models (Bapna et al., 2019; Pfeiffer et al., 2020b; Li and Liang, 2021; Guo et al., 2021). Specifically, Üstün et al. (2020) and Karimi Mahabadi et al. (2021) make adapters more generalizable by learning a parameter generator, while our work aims to utilize existing pretrained adapters without further training. Pfeiffer et al. (2021) propose to learn extra parameters using labeled data to combine pretrained multitask adapters, whereas our method does not require any training or labeled data. While we focus on language adapters in this work, our method is also applicable to ensembling domain or task adapters. Finally, our method is inspired by the test time adaptation framework proposed for image classification (Sun et al., 2020; Wang et al., 2021; Kedia and Chinthakindi, 2021). Instead of adapting a single model, we focus on efficient utilization of many pretrained language adapters to improve the model's robustness to language variations.

Discussion and Conclusion
Language and dialect cannot be simply categorized into monolithic entities. Thus, a truly intelligent NLP system should be able to recognize and adapt to personalized language varieties after it is trained and deployed. However, standard system evaluation is built on the assumption that an NLP model is fixed once it is trained. In this paper, we focus on a specific case of this general problem: we find that specialized language adapters might not be robust to unseen language variations, and that utilizing multiple existing pretrained language adapters alleviates this issue. We hope our findings can inspire future work on models that are robust and adaptive to language variations. We identify two limitations of this paper, which we leave to future work. First, there are limited datasets and benchmarks that evaluate NLP models' ability to generalize to unseen dialect variations. Therefore, we only test our method on the NER and POS tagging tasks because they have the best language coverage. It is an important future direction to construct high-quality datasets that consider language and dialect variations. Second, our method has slower inference speed due to test time computation. Future work can aim to reduce this cost through algorithmic or hardware innovations.

A Implementation Details
We preprocess the data using scripts in XTREME (Hu et al., 2020). We use the best performing adapter configuration in Pfeiffer et al. (2020b). For NER, we train the task adapter for 100 epochs with a learning rate of 1e-4. For POS tagging, we train the task adapter for 50 epochs with a learning rate of 1e-4. For EMEA, we search over learning rates γ ∈ {0.1, 1, 10} on the English validation set and pick γ = 10 for all experiments.
For Fusion, we use a learning rate of 5e-5, as recommended by Pfeiffer et al. (2021). For CL, we search over learning rates based on performance on English labeled data; we use a learning rate of 2e-5 and perform 1 gradient update step for each batch.
For our experiment on training new adapters, we find that training from scratch on no and mr is not competitive when using a very small amount of data. Therefore, we continue training from their related language adapters.

B Decoding Speed
We list the inference time of the various methods in Tab. 3. EMEA leads to better performance at the cost of lower inference speed. We leave it to future work to explore strategies that speed up the test time optimization.

C Examples of outputs
We compare the outputs of EMEA with the best baseline on the POS tagging task for Norwegian (no). Although both methods struggle with verb and adjective predictions, EMEA is often better at predicting the correct adjectives compared to the baseline.