TADA: Efficient Task-Agnostic Domain Adaptation for Transformers



Introduction
Pre-trained language models (Radford et al., 2018; Devlin et al., 2019) utilizing transformers (Vaswani et al., 2017) have emerged as a key technology for achieving impressive gains in a wide variety of natural language processing (NLP) tasks. However, these pre-trained transformer-based language models (PTLMs) are trained on massive and heterogeneous corpora with a focus on generalizability, without addressing particular domain-specific concerns. In practice, the absence of such domain-relevant information can severely hurt performance in downstream applications, as shown in numerous studies (i.a., Zhu and Goldberg, 2009; Ruder and Plank, 2018; Friedrich et al., 2020).
To impart useful domain knowledge, two main methods of domain adaptation leveraging transformers have emerged: (1) Massive pre-training from scratch (Beltagy et al., 2019; Wu et al., 2020) relies on large-scale domain-specific corpora and incorporates various self-supervised objectives during pre-training. However, the extensive training process is time- and resource-inefficient, as it requires a large collection of (un)labeled domain-specialized corpora and massive computational power. (2) Domain-adaptive intermediate pre-training (Gururangan et al., 2020) is considered more light-weight, as it requires only a small amount of in-domain data and continues training the PTLM from a previous checkpoint for fewer epochs. However, fully pre-training the model (i.e., updating all PTLM parameters) may result in catastrophic forgetting and interference (McCloskey and Cohen, 1989; Houlsby et al., 2019), in particular for longer iterations of adaptation. To overcome these limitations, alternatives such as adapters (Rebuffi et al., 2017; Houlsby et al., 2019) and sparse fine-tuning (Guo et al., 2021; Ben Zaken et al., 2022) have been introduced. These approaches, however, are still parameter- and time-inefficient, as they either add additional parameters or require complex training steps and/or models.
In this work, we propose Task-Agnostic Domain Adaptation for transformers (TADA), a novel domain specialization framework. As depicted in Figure 1, it consists of two steps: (1) We conduct intermediate training of a pre-trained transformer-based language model (e.g., BERT) on unlabeled domain-specific text corpora in order to inject domain knowledge into the transformer. Here, we fix the parameter weights of the encoder while updating only the weights of the embeddings (i.e., embedding-based domain-adaptive pre-training). As a result, we obtain domain-specialized embeddings for each domain with the shared encoder from the original PTLM, without adding further parameters for domain adaptation.
(2) The obtained domain-specialized embeddings, along with the encoder, can then be fine-tuned for downstream tasks in single- or multi-domain scenarios (Lange et al., 2021b), where the latter is conducted with meta-embeddings (Coates and Bollegala, 2018; Kiela et al., 2018) and a novel meta-tokenization method for different tokenizers.
Contributions. We advance the field of domain specialization with our embedding-based domain-adaptive pre-training (§2.1) and further study the effects of domain-specific tokenization (§2.2). We then utilize multiple domain-specialized embeddings with our newly proposed meta-tokenizers and powerful meta-embeddings in multi-domain scenarios (§2.3 and §2.4).

Domain Specialization
Following successful work on intermediate pre-training leveraging language modeling for domain adaptation (Gururangan et al., 2020; Hung et al., 2022a) and language adaptation (Glavaš et al., 2020; Hung et al., 2022b), we investigate the effects of training with masked language modeling (MLM) on domain-specific text corpora (e.g., clinical reports or academic publications). For this, the MLM loss $\mathcal{L}_{mlm}$ is commonly computed as the negative log-likelihood of the true token probability (Devlin et al., 2019; Liu et al., 2019):

$$\mathcal{L}_{mlm} = -\frac{1}{M} \sum_{m=1}^{M} \log P(t_m),$$

where $M$ is the total number of masked tokens in a given text and $P(t_m)$ is the predicted probability of the token $t_m$ over the vocabulary. Fully pre-training the model requires adjusting all of the model's parameters, which can be undesirable due to time- and resource-inefficiency and can dramatically increase the risk of catastrophic forgetting of previously acquired knowledge (McCloskey and Cohen, 1989; Ansell et al., 2022). To alleviate these issues, we propose a parameter-efficient approach that adds no additional parameters during intermediate domain-specialized adaptation: we freeze most of the PTLM parameters and only update the input embedding weights of the first transformer layer (i.e., the parameters of the embeddings layer) during MLM. With this, the model can learn domain-specific input representations while preserving the acquired knowledge in the frozen parameters. As shown in Figure 1, the encoder parameters are fixed during intermediate pre-training while only the embeddings-layer parameters are updated.
As a result, after intermediate MLM training, multiple embeddings specialized for different domains are all applicable with the same shared encoder. As these trained domain-specialized embeddings are easily portable to any downstream task, we experiment with their combination in multi-domain scenarios via meta-embeddings methods (Yin and Schütze, 2016; Kiela et al., 2018). We discuss this in more detail in §2.3.
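To make the procedure concrete, the following is a minimal sketch of embedding-only domain-adaptive MLM training with PyTorch and Hugging Face Transformers. It is an illustration under our setup rather than our exact training code; the model name and the hand-masked position are placeholder choices (in practice, 15% of tokens are masked dynamically).

```python
# Sketch: freeze the encoder, update only the embeddings layer during MLM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Freeze all parameters, then unfreeze only the embeddings layer.
for param in model.parameters():
    param.requires_grad = False
for param in model.bert.embeddings.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-5
)

batch = tokenizer(
    ["Hidradenitis suppurativa (HS) is a chronic relapsing skin disease."],
    return_tensors="pt",
)
labels = batch["input_ids"].clone()
batch["input_ids"][0, 2] = tokenizer.mask_token_id  # mask one token by hand
labels[batch["input_ids"] != tokenizer.mask_token_id] = -100  # ignore unmasked

loss = model(**batch, labels=labels).loss  # MLM negative log-likelihood
loss.backward()   # gradients reach only the (unfrozen) embeddings layer
optimizer.step()
```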

Domain-Specific Tokenization
Inspired by previous work on domain-specialized tokenizers and vocabularies for language model pre-training (Beltagy et al., 2019; Lee et al., 2019; Yang et al., 2020), we study the domain adaptation of tokenizers for transformers and train domain-specialized variants with the standard WordPiece algorithm (Schuster and Nakajima, 2012), analogously to the BERT tokenizer. As a result, the domain-specialized tokenizers cover more in-domain terms compared to the original PTLM tokenizers. In particular, this reduces the number of out-of-vocabulary tokens, i.e., words that have to be split into multiple subwords, whose embedding quality often does not match the quality of word-level representations (Hedderich et al., 2021).
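As an illustration, a domain-specialized WordPiece tokenizer can be trained with the Hugging Face tokenizers library roughly as follows; the corpus file name is a placeholder, and the vocabulary size is chosen to match BERT's.

```python
# Sketch: train a domain-specialized WordPiece tokenizer on in-domain text.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["clinical_background_corpus.txt"],  # hypothetical in-domain corpus
    vocab_size=30522,                          # match BERT's vocabulary size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("tok-clinical")

# In-domain terms are now split into fewer subwords than with the original
# BERT tokenizer, e.g., "hidradenitis" may stay at (near) word level.
print(tokenizer.encode("hidradenitis suppurativa").tokens)
```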

Meta-Embeddings
Given $n$ embeddings from different domains $D$, each domain has an input representation $x_{D_i} \in \mathbb{R}^E$, $1 \le i \le n$, where $n$ is the number of domains and $E$ is the dimension of the input embeddings. Here, we consider two variants: averaging (Coates and Bollegala, 2018) and attention-based meta-embeddings (Kiela et al., 2018).
Averaging merges all embeddings into one vector without training additional parameters by taking the unweighted average:

$$x^{AVG} = \frac{1}{n} \sum_{i=1}^{n} x_{D_i}.$$

In addition, a weighted average with dynamic attention weights $\alpha_{D_i}$ can be used. For this, the attention weights are computed as follows:

$$\alpha_{D_i} = \frac{\exp(V \cdot \tanh(W x_{D_i}))}{\sum_{j=1}^{n} \exp(V \cdot \tanh(W x_{D_j}))},$$

with $W \in \mathbb{R}^{H \times E}$ and $V \in \mathbb{R}^{1 \times H}$ being parameters that are randomly initialized and learned during training, and $H$ being the dimension of the attention vector, a predefined hyperparameter. The domain embeddings $x_{D_i}$ are then weighted using the learned attention weights $\alpha_{D_i}$ into one representation vector:

$$x^{ATT} = \sum_{i=1}^{n} \alpha_{D_i} \, x_{D_i}.$$

As averaging simply merges all information into one vector, it cannot focus on valuable domain knowledge in specific embeddings. In contrast, the attention-based weighting allows for dynamic combinations of embeddings based on their importance depending on the current input token.
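A minimal PyTorch sketch of the two variants, assuming the $n$ domain embeddings are stacked along an extra axis, could look as follows (tensor shapes and dimensions are illustrative):

```python
# Sketch: averaging (AVG) and attention-based (ATT) meta-embeddings.
import torch
import torch.nn as nn

class AttentionMetaEmbedding(nn.Module):
    def __init__(self, emb_dim: int, attn_dim: int):
        super().__init__()
        self.W = nn.Linear(emb_dim, attn_dim, bias=False)   # W ∈ R^{H×E}
        self.V = nn.Linear(attn_dim, 1, bias=False)         # V ∈ R^{1×H}

    def forward(self, domain_embs: torch.Tensor) -> torch.Tensor:
        # domain_embs: (batch, seq_len, n_domains, E)
        scores = self.V(torch.tanh(self.W(domain_embs)))    # (..., n_domains, 1)
        alpha = torch.softmax(scores, dim=-2)               # weights over domains
        return (alpha * domain_embs).sum(dim=-2)            # (batch, seq_len, E)

def average_meta_embedding(domain_embs: torch.Tensor) -> torch.Tensor:
    # Unweighted average over the domain axis; no extra parameters.
    return domain_embs.mean(dim=-2)

# Usage: combine three domain-specialized embeddings per token position.
embs = torch.randn(8, 128, 3, 768)      # (batch, seq_len, 3 domains, E=768)
att = AttentionMetaEmbedding(768, 128)  # H=128 is a hyperparameter
combined = att(embs)                    # (8, 128, 768)
```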
As shown in related work, these meta-embeddings approaches suffer from critical mismatch issues when combining embeddings of different sizes and input granularities (e.g., character- and word-level embeddings); this can be addressed by learning additional word-level mappings to the same dimensions, forcing all input embeddings into a common input space (Lange et al., 2021a).
Our proposed method prevents these issues by (a) keeping the input granularity fixed, which alleviates the need for learning additional mappings, and (b) locating all domain embeddings in the same space immediately after pre-training by freezing the subsequent transformer layers. We compare the results of the two variants in §4. More information on meta-embeddings can be found in the survey of Bollegala and O'Neill (2022).

Meta-Tokenization for Meta-Embeddings
To utilize our domain-adapted tokenizers in a single model with meta-embeddings, we have to align the different output sequences generated by each tokenizer for the same input (e.g., "Acetaminophen is an analgesic drug" is split by TOK-1 into 10 subwords, "Ace #ta #mino #phen is an anal #gesic dr #ug", but by TOK-2 into 7 subwords, "Aceta #minophen is an anal #gesic drug"). This is not straightforward due to mismatches in subword token boundaries and sequence lengths. We thus introduce three different aggregation methods to perform the meta-tokenization: (a) SPACE: We split the input sequence on whitespaces into tokens and aggregate, for each tokenizer, all subword tokens corresponding to a particular token in the original sequence.
(b) DYNAMIC: The shortest sequence from all tokenizers is taken as a reference. Subwords from longer sequences are aggregated accordingly. This assumes that word-level knowledge is more useful than subword knowledge and that fewer word splits are an indication of in-domain knowledge.
(c) TRUNCATION: This method is similar to the DYNAMIC aggregation, but it uses only the first subword for each token instead of computing the average when a token is split into multiple subwords.
Once the token and subword boundaries are determined, we retrieve the subword embeddings from the embedding layer corresponding to each tokenizer and perform the aggregation where necessary, in our case by averaging all subword embeddings. Examples for each method are shown in Table 1.
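As an illustration, the SPACE aggregation can be sketched as follows for a single Hugging Face tokenizer and its embedding layer (both placeholders). Every tokenizer then yields exactly one vector per whitespace token, so the outputs of different tokenizers are aligned for the meta-embeddings.

```python
# Sketch: SPACE aggregation — one averaged embedding per whitespace token.
import torch

def space_aggregate(text: str, tokenizer, embedding: torch.nn.Embedding) -> torch.Tensor:
    """Return one (averaged) embedding vector per whitespace token."""
    token_vectors = []
    for word in text.split():
        ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        subword_embs = embedding(torch.tensor(ids))     # (n_subwords, E)
        token_vectors.append(subword_embs.mean(dim=0))  # average the subwords
    return torch.stack(token_vectors)                   # (n_words, E)
```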

Experimental Setup
This section introduces four downstream tasks with their respective datasets and evaluation metrics. We further provide details on our models, their hyperparameters, and the baseline systems.

Tasks and Evaluation Measures
We evaluate our domain-specialized models and baselines on four prominent downstream tasks: dialog state tracking (DST), response retrieval (RR), named entity recognition (NER), and natural language inference (NLI), with five domains per task. Table 2 shows the statistics of all datasets.
DST is cast as a multi-class classification dialog task. Given a dialog history (a sequence of utterances) and a predefined ontology, the goal is to predict the dialog state, i.e., (domain, slot, value) tuples (Wu et al., 2020) such as (restaurant, pricerange, expensive). The standard joint goal accuracy is adopted as the evaluation measure: at each dialog turn, it compares the predicted dialog states against the annotated ground truth. The predicted state is considered accurate if and only if all the predicted slot values match the ground truth exactly.
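For illustration, joint goal accuracy reduces to an exact set comparison per turn; a minimal sketch:

```python
# Sketch: joint goal accuracy — a turn counts as correct only if the
# predicted set of (domain, slot, value) tuples matches the gold set exactly.
def joint_goal_accuracy(predictions, gold):
    """predictions, gold: lists of per-turn sets of (domain, slot, value) tuples."""
    correct = sum(pred == ref for pred, ref in zip(predictions, gold))
    return correct / len(gold)

# Example: the first turn matches exactly, the second misses one slot value.
pred = [{("restaurant", "pricerange", "expensive")}, set()]
ref = [{("restaurant", "pricerange", "expensive")},
       {("restaurant", "area", "centre")}]
print(joint_goal_accuracy(pred, ref))  # 0.5
```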
RR is a ranking task, relevant for retrieval-based task-oriented dialog systems (Henderson et al., 2019; Wu et al., 2020). Given the dialog context, the model ranks $N$ dataset utterances, including the true response to the context (i.e., the candidate set covers one true response and $N-1$ false responses). Following Henderson et al. (2019), we report the recall at the top rank given 99 randomly sampled false responses, denoted as $R_{100}@1$.
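A minimal sketch of this measure, with placeholder scores (e.g., dot products of context and response encodings):

```python
# Sketch: R_100@1 — the true response must rank first among 100 candidates
# (1 true response + 99 sampled negatives).
import torch

def recall_at_1(scores: torch.Tensor, true_idx: torch.Tensor) -> float:
    """scores: (n_contexts, 100) candidate scores; true_idx: (n_contexts,)."""
    top1 = scores.argmax(dim=-1)
    return (top1 == true_idx).float().mean().item()

scores = torch.randn(32, 100)                 # placeholder candidate scores
true_idx = torch.zeros(32, dtype=torch.long)  # true response placed at index 0
print(recall_at_1(scores, true_idx))
```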
NER is a sequence tagging task, aiming to detect named entities within a sentence by classifying each token into an entity type from a predefined set of categories (e.g., PERSON, ORGANIZATION), including a neutral type (O) for non-entities. Following prior work (Tjong Kim Sang and De Meulder, 2003; Nadeau and Sekine, 2007), we report the strict micro $F_1$ score.
NLI is a language understanding task testing the reasoning abilities of machine learning models beyond simple pattern recognition. The task is to determine whether a hypothesis logically follows from a premise, labeled as ENTAILMENT (true), CONTRADICTION (false), or NEUTRAL (undefined). Following Williams et al. (2018), accuracy is reported as the evaluation measure.

Background Data for Specialization
We take unlabeled background datasets from the original or related text sources to specialize our models with domain-adaptive pre-training (details are available in Appendix C). For MLM training, we randomly sample up to 200K domain-specific sentences and dynamically mask 15% of the subword tokens following Liu et al. (2019).
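Dynamic masking of this kind is readily available in Hugging Face Transformers; a minimal sketch (the model name and sentence are placeholders):

```python
# Sketch: dynamic masking following Liu et al. (2019) — 15% of subword
# tokens are re-sampled as [MASK] each time a batch is formed.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer(["A domain-specific training sentence."])
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
# batch["input_ids"]: freshly sampled mask positions for this batch;
# batch["labels"]: original ids at masked positions, -100 elsewhere.
```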

Models and Baselines
We experiment with the most widely used PTLM, BERT (Devlin et al., 2019), for NER and NLI. For the dialog tasks DST and RR, we experiment with BERT and TOD-BERT (Wu et al., 2020) following Hung et al. (2022a) to compare general- and task-specific PTLMs. We want to highlight that our proposed method can easily be applied to any existing PTLM. As baselines, we report the performance of the non-specialized variants and compare them against (a) full pre-training (Gururangan et al., 2020), (b) adapter-based models (Houlsby et al., 2019), and (c) our domain-specialized PTLM variants trained with TADA.

Hyperparameters and Optimization
During MLM training, we fix the maximum sequence length to 256 (DST, RR) or 128 (NER, NLI) subwords and lowercase the input. We train for 30 epochs in batches of 32 instances and search for the optimal learning rate among $\{5 \cdot 10^{-5}, 1 \cdot 10^{-5}, 1 \cdot 10^{-6}\}$. Early stopping is applied based on development set performance (patience: 3 epochs), and the cross-entropy loss is minimized using AdamW (Loshchilov and Hutter, 2019). For DST and RR, we follow the hyperparameter setup of Hung et al. (2022a). For NLI, we train for 3 epochs in batches of 32 instances. For NER, we train for 10 epochs in batches of 8 instances. Both tasks use a fixed learning rate of $5 \cdot 10^{-5}$.

Evaluation Results
For each downstream task, we first conduct experiments in a single-domain scenario, i.e., training and testing on data from the same domain, to show the advantages of our proposed task-agnostic, embedding-based domain-adaptive pre-training and tokenizers (§4.1). We further consider the combination of domain-specialized embeddings with meta-embeddings variants (Coates and Bollegala, 2018; Kiela et al., 2018) in a multi-domain scenario, where we jointly train on data from all domains of the respective task (§4.2).

Single-Domain Evaluation
We report downstream performance for the single-domain scenario in Table 3, with each subtable segmented into three parts: (1) at the top, we show baseline results (BERT, TOD-BERT) without any domain specialization; (2) in the middle, we show results of domain-specialized PTLMs via full domain-adaptive training and the adapter-based approach; (3) the bottom of the table contains results of our proposed approach, specializing only the embeddings and the domain-specific tokenization.
In both DST and RR, TOD-BERT outperforms BERT due to its training on conversational knowledge. Our proposed embedding-based domain adaptation (MLM-EMB) yields similar average performance gains as specialization with adapters for TOD-BERT. Inspired by previous work on domain-specialized subtokens for language model pre-training (Beltagy et al., 2019; Yang et al., 2020), we additionally train domain-specific tokenizers (MLM-EMBTOK) with the WordPiece algorithm (Schuster and Nakajima, 2012). The training corpora are obtained either from the background corpora only (S) or from the combination of the background corpus and the training set of each domain (X). Further, our domain-specialized tokenizers coupled with the embedding-based domain-adaptive pre-training exhibit similar average performance for DST and outperform the state-of-the-art adapters and all other methods for RR. Similar findings are observed for NLI and NER. MLM-EMB yields +0.7% performance gains over MLM-FULL in NLI and reaches similar average gains in NER. Especially for NLI, the domain-specialized tokenizers (MLM-EMBTOK) are beneficial in combination with our domain-specialized embeddings, while having considerably fewer trainable parameters. Given that TADA is substantially more efficient and parameter-free (i.e., it adds no extra parameters), this promises more sustainable domain-adaptive pre-training.

Multi-Domain Evaluation
In practice, a single model must be able to handle multiple domains, because the deployment of multiple models may not be feasible. To simulate a multi-domain setting, we utilize the domain-specialized embeddings from each domain (§4.1) and combine them with meta-embeddings (§2.3).
To train a single model per task applicable to all domains, we concatenate the training sets of all domains for each task. As baselines for DST and RR, we report the performance of BERT and TOD-BERT as well as a version fine-tuned on the concatenated multi-domain training sets (MLM-FULL). We test the effect of multi-domain specialization in two variants: averaging (AVG) and attention-based (ATT) meta-embeddings. We further conduct experiments to check whether including general-purpose embeddings from TOD-BERT (EMB+MLM-EMBs) is beneficial compared to using domain-specialized embeddings only (MLM-EMBs). The results in Table 4 show that combining domain-specialized embeddings outperforms TOD-BERT in both tasks. In particular, averaging meta-embeddings perform better in RR while attention-based ones work better in DST, improving over TOD-BERT by 3.8% and 2.2%, respectively. The results further suggest that combining only domain-specialized embeddings (i.e., without adding general-purpose embeddings) works better for both meta-embeddings variants. These findings are confirmed by the NLI and NER experiments. The meta-embeddings applied in our multi-domain scenarios outperform BERT by 0.7 points for NLI and 1.2 points for NER, respectively. An encouraging finding is that the two domains with the smallest number of training resources (FINANCIAL, SCIENCE) benefit the most compared to the other domains in the NER task. Such few-shot settings are further investigated in §5.1.
Overall, we find that meta-embeddings provide a simple yet effective way to combine several domain-specialized embeddings, alleviating the need to deploy multiple models.

Analysis
To analyze the advantages of our proposed embedding-based domain-adaptive pre-training methods and tokenizers more precisely, we study the following: few-shot transfer capability (§5.1), the effect of domain-specialized tokenizers on subword tokens (§5.2), and the combination of multiple domain-specialized tokenizers with meta-tokenizers in multi-domain scenarios (§5.3).

Few-Shot Learning
We report few-shot experiments in Table 5 using 1% and 20% of the training data for NLI. We run three experiments with different random seeds to reduce variance and report the mean and standard deviation for these limited-data scenarios. MLM-EMB on average outperforms MLM-FULL by 1% in the single-domain scenario, with the largest improvements for the SLATE and TRAVEL domains (3.3% and 2.7%, respectively). In contrast, the adapter-based models (MLM-ADAPT) perform worse in this few-shot setting. This demonstrates the negative interference (-10%) caused by the additional parameters that cannot be properly trained given the scarcity of task data for fine-tuning. In multi-domain settings, attention-based meta-embeddings on average surpass the standard BERT model in both few-shot setups. Overall, these findings demonstrate the strength of our proposed embedding-based domain-adaptive pre-training in limited-data scenarios.

Domain-Specific Tokenizers
To study whether domain-specialized tokenizers better represent the target domain, we select the development sets and count the number of words that are split into multiple tokens by each tokenizer.
The assumption is that the domain-specialized tokenizers allow for word-level segmentation, and thus word-level embeddings, instead of falling back to lower-quality embeddings from multiple subword tokens.
We compare three different tokenizers for each setting: (a) TOK-O: the original tokenizer from the PTLM without domain specialization; (b) TOK-S: a domain-specialized tokenizer trained on the in-domain background corpus; (c) TOK-X: a domain-specialized tokenizer trained on the concatenated in-domain background corpus plus the training set.
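This count can be computed with a short helper; a minimal sketch, assuming Hugging Face tokenizers and whitespace-separated words:

```python
# Sketch: count how many words in a development set are split into two or
# more subwords by a given tokenizer.
def count_split_words(sentences, tokenizer) -> int:
    n_split = 0
    for sentence in sentences:
        for word in sentence.split():
            ids = tokenizer(word, add_special_tokens=False)["input_ids"]
            if len(ids) > 1:  # word falls back to multiple subword tokens
                n_split += 1
    return n_split

# Comparing TOK-O, TOK-S, and TOK-X on the same development sentences then
# quantifies how much word-level coverage each tokenizer gains.
```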
Table 6 shows the results on all four tasks averaged across domains. It is evident that TOK-X, compared to TOK-O, significantly reduces the number of tokens split into multiple subwords. This indicates that the domain-specialized tokenizers cover more tokens at the word level, and thus convey more domain-specific information.
For domains with smaller background datasets, e.g., FINANCIAL and NEWS, the tokenizers are not able to leverage more word-level information. For example, TOK-S, which was trained on the background data, performs worse in these domains, as the background data is too small and the tokenizers overfit on background data coming from a similar, but not identical, source. Including the training corpora helps to avoid overfitting and/or shifts the tokenizers towards the dataset's word distribution, as TOK-X improves over TOK-S for both domains. This finding is well-aligned with the results in Table 3 (see §4.1) and supports our hypothesis that word-level tokenization is beneficial.

Study on Meta-Tokenizers
In §4.2, we experimented with multiple domain-specialized embeddings inside meta-embeddings. These embeddings are, however, based on the original tokenizers and not on the domain-specialized ones. While the latter are considered to contain more domain knowledge and achieve better downstream single-domain performance (§4.1), it is not straightforward to combine the output of different tokenizers for the same input due to mismatches in subword boundaries and sequence lengths. Therefore, we further conduct experiments with meta-tokenizers in the meta-embeddings setup following §2.4. We compare the best multi-domain models with our proposed aggregation approaches. The averaged results across domains are shown in Table 7 (per-domain results are available in Appendix D). Overall, we observe that the SPACE and DYNAMIC approaches work better than TRUNCATION. However, there is still a performance gap between using multiple embeddings sharing the same sequence from the original tokenizer and using the domain-specialized tokenizers. Nonetheless, this study shows the general applicability of meta-tokenizers in transformers and suggests future work toward leveraging domain-specialized tokenizers in meta-embeddings.
Related Work

Domain Adaptation. Domain adaptation is a type of transfer learning that aims to generalize a trained model to a specific domain of interest (Farahani et al., 2021). Recent studies have focused on neural unsupervised or self-supervised domain adaptation leveraging PTLMs (Ramponi and Plank, 2020), which does not rely on large-scale labeled target-domain data to acquire domain-specific knowledge. Gururangan et al. (2020) proposed domain-adaptive intermediate pre-training, continually training a PTLM with MLM on domain-relevant unlabeled data, leading to improvements in downstream tasks in both high- and low-resource setups. This approach has been applied to multiple tasks (Glavaš et al., 2020; Lewis et al., 2020) and across languages (Hung et al., 2023; Wang et al., 2023); however, it requires full pre-training (i.e., updating all PTLM parameters) during domain adaptation, which can potentially result in catastrophic forgetting and negative interference (Houlsby et al., 2019; He et al., 2021).

Parameter-Efficient Training. Parameter-efficient methods for domain adaptation alleviate these problems. They have shown robust performance in low-resource and few-shot scenarios (Fu et al., 2022), where only a small portion of parameters is trained while the majority of parameters is frozen and shared across tasks. These lightweight alternatives are shown to be more stable than their fully fine-tuned counterparts and perform on par with or better than expensive full pre-training setups; they include adapters, prompt-based fine-tuning, and sparse subnetworks. Adapters (Rebuffi et al., 2017; Houlsby et al., 2019), i.e., additional trainable neural modules injected into each layer of the otherwise frozen PTLM, including their variants (Pfeiffer et al., 2021), have been adopted in both single-domain (Bapna and Firat, 2019) and multi-domain (Hung et al., 2022a) scenarios. Sparse subnetworks (Hu et al., 2022; Ansell et al., 2022) reduce the number of training parameters by keeping only the most important ones, resulting in a more compact model that requires fewer parameters for fine-tuning. Prompt-based fine-tuning (Li and Liang, 2021; Lester et al., 2021; Goswami et al., 2023) reduces the need for extensive fine-tuning with fewer training examples by adding prompts or cues to the input data. These approaches, however, are still parameter- and time-inefficient, as they add additional parameters, require complex training steps, are less intuitive with regard to expressiveness, or are limited to the multi-domain scenario for domain adaptation. A broader overview and discussion of recent domain adaptation methods in low-resource scenarios is given in the survey of Hedderich et al. (2021).

Conclusions
In this paper, we introduced TADA, a novel task-agnostic domain adaptation method that is modular and parameter-efficient for pre-trained transformer-based language models. We demonstrated the efficacy of TADA on 4 downstream tasks across 14 domains in both single- and multi-domain settings, as well as high- and low-resource scenarios. An in-depth analysis revealed the advantages of TADA in few-shot transfer and highlighted how our domain-specialized tokenizers take domain vocabularies into account. We conducted the first study on meta-tokenizers and showed their potential in combination with meta-embeddings in multi-domain applications. Our work points to multiple future directions, including advanced meta-tokenization methods and the applicability of TADA beyond the tasks studied in this paper.

A Computational Information
All experiments are performed on Nvidia Tesla V100 GPUs with 32GB VRAM and run on a carbon-neutral GPU cluster. The number of parameters and the total computational budget for domain-adaptive pre-training (in GPU hours) are shown in Table 8.

Table 8: Overview of the computational information for the domain-adaptive pre-training. ‡BERT variants: BERT (NLI, NER) and TOD-BERT (DST, RR).

B Hyperparameters
Detailed explanations of our hyperparameters are provided in the main paper in §3.4. In our experiments, we only search for the learning rate in domain-adaptive pre-training. The best learning rate depends on the selected domains and methods for each task.

C In-domain Unlabeled Text Corpora
We provide more detailed information on the background datasets used for domain-adaptive pre-training in Table 9.

Fiction: The books corpus (Zhu et al., 2015), used as the pre-training data of BERT (Devlin et al., 2019). Size: 299.5 K.

News: The Reuters news corpus in NLTK (nltk.corpus.reuters), similar to the training data of CoNLL (Tjong Kim Sang and De Meulder, 2003). Size: 51.0 K.

Clinical: PubMed abstracts from clinical publications, filtered following Lange et al. (2022). Size: 299.9 K.

Financial: The financial phrase bank from Malo et al. (2014).
Table 9: Overview of the background datasets and their sizes as reported in the background column of Table 2. The background datasets are used to train the domain-specific tokenizers and the domain-adapted embeddings layer.

Figure 1:
Overview of the TADA framework consisting of two steps. Part A: Domain specialization is performed via embedding-based domain-adaptive intermediate pre-training with the Masked Language Modeling (MLM) objective on in-domain data. Part B: The domain-specialized embeddings are then fine-tuned for downstream tasks in single- or multi-domain scenarios with two meta-embeddings methods: averaging (AVG) and attention-based (ATT).

Table 1:
Examples of our proposed aggregation approaches for meta-tokenization (SPACE, DYNAMIC, TRUNCATION) for a given text and two different tokenizers (TOK-1, TOK-2). The bottom of the table shows the results after aggregation. [a b . . . z] denotes the average of all embedding vectors corresponding to the subword tokens a, b, . . ., z.
Table 2:
Overview of the selected datasets for 4 tasks (DST, RR, NLI, NER) on 14 domains. For each domain, we report the number of collected in-domain texts for domain-adaptive pre-training, as well as the size and license of the downstream dataset. All selected datasets are applicable for commercial usage. †Licenses: Open American National Corpus (OANC), Direct Universal Access (DUA), Creative Commons Attribution Share-Alike (CC-BY-SA), Creative Commons Attribution International License (CC-BY).

Table 3:
Results of our single-domain models with domain-specialized embeddings and tokenizers on four tasks.

By further domain-adaptive pre-training with full MLM training (MLM-FULL), TOD-BERT's performance decreases (i.e., -4% for RR and -0.8% for DST compared to TOD-BERT). It is argued that full MLM domain specialization causes negative interference: while TOD-BERT is being trained on domain data during intermediate pre-training, the model forgets the conversational knowledge obtained during the initial dialogic pre-training stage (Wu et al., 2020). The hypothesis is further supported by the observations for the adapter-based method, which gains slight performance increases.

Table 6:
The number of words that have to be split into multiple tokens (≥ 2 subwords) for the different tokenizers.

Table 7:
Results of meta-tokenizers in multi-domain experiments with meta-embeddings. Bold indicates the best performance; underline indicates the best-performing meta-tokenization aggregation method. ‡BERT variants: TOD-BERT (DST, RR) and BERT (NLI, NER).
References

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998-6008.

Chien-Sheng Wu, Steven C.H. Hoi, Richard Socher, and Caiming Xiong. 2020. TOD-BERT: Pre-trained natural language understanding for task-oriented dialogue. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 917-929, Online. Association for Computational Linguistics.

Wenpeng Yin and Hinrich Schütze. 2016. Learning word meta-embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1351-1360, Berlin, Germany. Association for Computational Linguistics.

Xiaojin Zhu and Andrew B. Goldberg. 2009. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1-130.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19-27.