Bridging Subword Gaps in Pretrain-Finetune Paradigm for Natural Language Generation

A well-known limitation of the pretrain-finetune paradigm lies in its inflexibility caused by the one-size-fits-all vocabulary. This potentially weakens the effect of applying pretrained models to natural language generation (NLG) tasks, especially when the subword distributions of upstream and downstream tasks differ significantly. To approach this problem, we extend the vanilla pretrain-finetune pipeline with an extra embedding transfer step. Specifically, a plug-and-play embedding generator is introduced to produce the representation of any input token according to the pre-trained embeddings of its morphologically similar tokens. Thus, embeddings of mismatched tokens in downstream tasks can also be efficiently initialized. We conduct experiments on a variety of NLG tasks under the pretrain-finetune fashion. Experimental results and extensive analyses show that the proposed strategy allows the vocabulary to be transferred freely, leading to more efficient and better-performing downstream NLG models.


Introduction
The pretrain-finetune paradigm has been highly successful in tackling challenging problems in natural language processing, e.g., domain adaptation (Sato et al., 2020; Yao et al., 2020), incremental learning (Khayrallah et al., 2018; Wan et al., 2020), as well as knowledge transferring (Liu et al., 2020b). The rise of large-scale pre-trained language models further attracts increasing attention towards this strategy (Devlin et al., 2019; Edunov et al., 2019). Typically, these methods first pretrain a universal model using a large-scale corpus, which is then finetuned to various downstream tasks via a few adjustments. Due to its simplicity yet impressive performance, the pretrain-finetune paradigm has become the dominant solution for building state-of-the-art models in many natural language understanding tasks (Xu et al., 2019; Yang et al., 2019a; Liu et al., 2020b). In comparison, this strategy often achieves disappointing or barely satisfactory performance in natural language generation (NLG) tasks. For example, several studies observe that M-BERT (Devlin et al., 2019) fails to enhance the decoder of a translation model (Edunov et al., 2019; Zhu et al., 2020), while Rothe et al. (2020) reach the same conclusion even when adapting an autoregressive model GPT (Radford et al., 2019). A natural question arises: What is the crucial bottleneck in the current pretrain-finetune framework, and how can it be broken?

Table 1: Segmentation of the English sequence "Cenozoic palaeohydrodynamic" learned from different data distributions as described in § 4. Highly frequent words in the thesis domain are split into fine-grained and under-represented tokens in pre-trained models.
M-BERT: Ce no zo ic pala eo hy dro dyn ami c
Out-of-Domain: Cen ozo ic pal a e o hydro dynamic
Thesis: Cenozoic palaeohydrodynamic

1 We release the code at https://github.com/DeepLearnXMU/embedding-transfer
* Jinsong Su is the corresponding author. This work was done when Xin Liu was interning at DAMO Academy, Alibaba Group.
In this paper, we provide the first answer from the perspective of subword discrepancy: the subword vocabulary extracted according to the pretraining data distribution is insufficient to cope with downstream NLG tasks. Such inflexibility stems from the fact that downstream NLG models have to inherit the vocabulary from their pre-trained counterparts. To deal with the open-vocabulary problem, it is the de facto standard for pre-trained models to employ heuristic subword segmentation methods (Sennrich et al., 2016; Kudo and Richardson, 2018). However, the segmentation is learned on the upstream corpus rather than on the finetuning data and is therefore likely to be sub-optimal (Cherry et al., 2018; Provilkov et al., 2020).
We argue that these factors lead to subword discrepancy and bring two defects. Firstly, the pre-trained model usually learns a fine-grained subword segmentation to maintain coverage of a large and diverse vocabulary. Consequently, downstream NLG models may suffer from more serious exposure bias and higher computational cost caused by the increased sequence lengths. As one example, M-BERT exploits 100 thousand fine-grained subwords to encode hundreds of languages, while most downstream NLG tasks, in fact, require only one language and its associated tokens. Secondly, words that are rare in the upstream task but frequent in the downstream task may be over-segmented and end up poorly understood (Provilkov et al., 2020). Consider the English sequence "Cenozoic palaeohydrodynamic" shown in Table 1: both words are frequent in a thesis-domain translation task and can be well preserved in its vocabulary. Nevertheless, they are segmented into under-represented tokens by pre-trained models, preventing the finetuning stage from better learning their compositionality for generation.
An alternative solution is to reconstruct the pre-trained model by exploiting either a task-specific vocabulary (Nguyen and Chiang, 2017; Kocmi and Bojar, 2018) or a subword regularization approach (Provilkov et al., 2020). However, retraining the upstream model from scratch for each task is time-consuming and infeasible for large-scale models like M-BERT, GPT, etc. To this end, we propose a simple yet generalized pretrain-finetune strategy, where an embedding transfer stage is inserted between pre-training and finetuning to eliminate their token granularity gaps. Unlike the prior strategy using a fixed vocabulary, our vocabulary is changeable, and its items, including mismatched ones, can be easily initialized by the pre-trained embeddings.
Concretely, we equip the pre-trained model with a plug-and-play embedding generator, which is able to produce the embedding of any token from its subwords and hyperwords that appear in the pre-trained vocabulary. To train this generator, we randomly split or merge some tokens and replace their original embeddings with those produced by the generator. The parameters of the generator are optimized under the vanilla pre-training framework to minimize the divergence before and after replacing the embeddings. Accordingly, we can use a task-specific vocabulary for the downstream task, where common tokens are immediately initialized with pre-trained embeddings while mismatched ones are initialized by our generator.
We conduct experiments on various NLG tasks, ranging from domain adaptation to knowledge transferring, and from machine translation to answer-aware question generation. Empirical results demonstrate the universal effectiveness of the proposed strategy compared with strong baselines and related approaches. Quantitative and qualitative analyses verify that tackling subword discrepancy alleviates the problems of exposure bias, large computational cost, and under-represented tokens in the vanilla pretrain-finetune paradigm. To summarize, the contributions of our work are as follows:
• Through in-depth analyses, we point out and formally analyze subword discrepancy, which affects the conventional pretrain-finetune strategy in NLG tasks.
• We propose a simple, flexible, and generalized pretrain-finetune training strategy, where an embedding generator is introduced to leverage the knowledge of the pre-trained model to initialize embeddings of any required tokens.
• Extensive experiments show that our strategy is able to efficiently decrease the vocabulary gaps in the pretrain-finetune paradigm and significantly boost the performance of NLG models.

Related Work
Recent studies observe that pre-trained models suffer from a bottleneck when they are applied to NLG tasks (Edunov et al., 2019; Zhu et al., 2020; Rothe et al., 2020). This problem has been attributed to many reasons. For example, Yang et al. (2019b) point out a pretrain-finetune discrepancy caused by the absence of masked frames in real data when adopting pretrained masked language models. Chronopoulou et al. (2019) investigate catastrophic forgetting in the finetuning stage. How to successfully employ pretrain-finetune to enhance NLG models thus remains a great challenge. We explore this problem from another direction, i.e., the unsuitable subword segmentation for downstream tasks.
Task-Specific Vocabulary A natural way to address this issue is to adopt a task-specific vocabulary. Lewis et al. (2020) first replace the embedding layer with an independent encoder, whose vocabulary and parameters are learned from the downstream corpus. Along this line, Sato et al. (2020) exploit external monolingual data to construct a new embedding layer and achieve improvements in domain adaptation. This series of studies empirically confirms the necessity of a suitable vocabulary for the finetuning stage. However, these methods have to learn the task-specific embeddings separately before each adaptation, which brings additional computational cost and thus limits their applicability. Besides, they completely discard the pre-trained embeddings, which have been shown to be useful by Aji et al. (2020). An extra encoder or embedding layer may also fail to be well optimized with insufficient downstream resources. Accordingly, Rothe et al. (2020) employ a task-specific vocabulary to retrain M-BERT, which is then used to initialize a neural machine translation (NMT) model. Considering more robust approaches, Kudo (2018) and Provilkov et al. (2020) randomly sample segmentations for each sentence at training time. Unlike the above methods, our goal is to build a plug-and-play component that involves neither retraining the pre-trained model nor learning task-specific embeddings separately.
Embedding Generator Our work is also related to studies on generating embeddings for out-of-vocabulary (OOV) words. In this context, researchers use embeddings of characters or subwords to predict those of unseen words (Pinter et al., 2017; Sasaki et al., 2019; Fukuda et al., 2020). For example, an embedding generator can be trained by reconstructing the original representation of each word from its bag of subwords. Sasaki et al. (2019) progressively improve the generator using an attention mechanism. Fukuda et al. (2020) further leverage similar words to enhance this procedure. Our work differs significantly from the above studies in two aspects. First, because their vocabulary is fixed once predefined, the embedding reconstruction can only draw on a few selected words. By contrast, our generator is able to produce embeddings of any tokens, since these embeddings are directly embedded into the pre-trained model with an objective of minimizing the divergence. Second, previous studies mainly focus on handling the OOV problem, while our work, to the best of our knowledge, is the first study that exploits an embedding generator to transfer granularity over subwords for the pretrain-finetune paradigm.

Methodology
In this section, we introduce our proposed pretrain-finetune strategy in detail.

Main Steps in Our Strategy
Figure 1: Illustration of our pretrain-finetune pipeline. We pretrain an embedding generator for the initialization of embeddings of unseen tokens. Thus, each downstream model can adopt its suitable vocabulary instead of the unchangeable one. E(·) and G(·) indicate the pretrained and generated embedding, respectively.
As shown in Figure 1, we extend the prior pretrain-finetune paradigm with an embedding transfer stage. Specifically, we revise the conventional pretrain-finetune pipeline as follows:
Pretrain. As usual, we first construct a pre-trained model using an existing large-scale corpus. In addition, we pretrain an embedding generator independently of any downstream task. It is expected to produce the embedding of any required token when fed the pre-trained embeddings of its subwords and hyperwords. Hence, it can be applied to any downstream task for embedding transferring.
Finetune. We initialize the word embeddings and the other parameters (inner layers) of the downstream model differently. For the former, we use the downstream-task training corpus to learn a task-specific subword segmentation and the corresponding vocabulary. For an unseen token, we apply the generator to produce its initial representation. Otherwise, we directly initialize it with the corresponding pre-trained embedding. For the latter, we directly copy the inner-layer parameters of the pre-trained model to the downstream model. Finally, we continue to train the downstream model on the finetuning data in the usual fashion.
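The embedding initialization in the Finetune step can be illustrated with a minimal Python sketch. The function name `init_downstream_embeddings`, the toy embedding table, and the stand-in generator are all hypothetical and only serve to show the control flow: tokens seen in pre-training reuse E(w), and unseen tokens receive G(w).

```python
from typing import Callable, Dict, List

Vector = List[float]


def init_downstream_embeddings(
    downstream_vocab: List[str],
    pretrained: Dict[str, Vector],
    generator: Callable[[str], Vector],
) -> Dict[str, Vector]:
    """Copy E(w) for tokens seen in pre-training; call G(w) for unseen ones."""
    table: Dict[str, Vector] = {}
    for token in downstream_vocab:
        if token in pretrained:
            table[token] = list(pretrained[token])  # reuse pre-trained E(w)
        else:
            table[token] = generator(token)  # generated G(w)
    return table


# Toy pre-trained table and a stand-in generator (both hypothetical).
pretrained_emb = {"motor": [1.0, 0.0], "##cycle": [0.0, 1.0]}


def toy_generator(token: str) -> Vector:
    # Placeholder for a trained embedding generator.
    return [0.5, 0.5]


emb = init_downstream_embeddings(
    ["motor", "motorcycle"], pretrained_emb, toy_generator
)
```

In this sketch the inner-layer parameters are not touched; they would be copied from the pre-trained model unchanged, as described above.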
As seen, our strategy is lightweight and avoids the issue of subword discrepancy, since it does not require retraining the pre-trained model and can be quickly applied to various downstream NLG models.

Constructing the Embedding Generator
To make the word embedding generator applicable to all downstream NLG models, we design it so that it can generate the embedding of any input token according to those of its morphologically similar tokens from the learned pre-training vocabulary. The basic intuition behind our design is as follows: if the input token is a complete word, like motorcycle, its semantic meaning is related to those of its subwords, motor and ##cycle. Conversely, if the input token is a subword, such as ##er, the words that contain the input token, which we call hyperwords, e.g., worker, writer and singer, can be exploited to learn its semantic meaning.
Concretely, given a mismatched token $w$, we borrow the segmentation principle from the pre-trained model to split $w$ into subwords based on the pre-training vocabulary, and traverse the pre-training vocabulary to select all longer tokens containing $w$. Then, we combine the generated subwords and the selected hyperwords to form the morphologically similar token set of $w$, denoted by $S_m(w)$. Afterwards, we explore three kinds of generators to produce the embedding $G(w)$ of $w$:
AVG-EG: Averaging-Based Embedding Generator Intuitively, we can simply define $G(w)$ as the average embedding of the words from $S_m(w)$:

$$G(w) = \frac{1}{|S_m(w)|} \sum_{w' \in S_m(w)} E(w') \tag{1}$$

where $E(w')$ is the pre-trained embedding of the token $w'$. In this way, our generator can be used directly, without any additional training time.
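As an illustration of how the morphologically similar set and the averaging generator fit together, the following Python sketch builds the set from a toy vocabulary and averages the member embeddings. It assumes a WordPiece-style greedy longest-match segmentation with "##" continuation markers and a plain substring test for hyperwords; the helper names (`wordpiece_split`, `similar_set`, `avg_eg`) and the toy vocabulary are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def surface(token: str) -> str:
    """Strip the '##' continuation marker, if any."""
    return token[2:] if token.startswith("##") else token


def wordpiece_split(
    word: str, vocab: Dict[str, Vector], continuation: bool = False
) -> List[str]:
    """Greedy longest-match-first split; returns [] if no split exists."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0 or continuation:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return []  # no vocabulary piece matches at this position
        start = end
    return pieces


def similar_set(w: str, vocab: Dict[str, Vector]) -> List[str]:
    """Subwords of w plus longer vocabulary tokens that contain w."""
    core = surface(w)
    subwords = wordpiece_split(core, vocab, continuation=w.startswith("##"))
    hyperwords = [
        t for t in vocab
        if len(surface(t)) > len(core) and core in surface(t)
    ]
    return subwords + hyperwords


def avg_eg(w: str, vocab: Dict[str, Vector]) -> Vector:
    """Average the pre-trained embeddings of the members of the similar set."""
    members = similar_set(w, vocab)
    if not members:
        raise ValueError(f"no morphologically similar tokens for {w!r}")
    dim = len(vocab[members[0]])
    return [sum(vocab[t][i] for t in members) / len(members) for i in range(dim)]


toy_vocab = {
    "motor": [1.0, 0.0],
    "##cycle": [0.0, 1.0],
    "worker": [2.0, 2.0],
    "writer": [4.0, 0.0],
}
word_embedding = avg_eg("motorcycle", toy_vocab)  # from motor and ##cycle
suffix_embedding = avg_eg("##er", toy_vocab)      # from worker and writer
```

The substring test is a simplification: a real vocabulary would likely need a stricter notion of containment than raw character overlap.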
ATT-EG: Attention-Based Embedding Generator Another natural solution is to softly fuse information from different morphologically similar words using an attention mechanism (Bahdanau et al., 2015). $G(w)$ is formally expressed as:

$$G(w) = \sum_{w' \in S_m(w)} \alpha(w')\, E(w'), \qquad \alpha(w') = \frac{\exp\left(W E(w')^{\top}\right)}{\sum_{w'' \in S_m(w)} \exp\left(W E(w'')^{\top}\right)} \tag{2}$$

where $W \in \mathbb{R}^{1 \times d}$ is a learnable vector and $d$ denotes the dimensionality of the word embedding. Unlike the first generator, this generator can be jointly trained with the pre-trained model, and is therefore capable of better quantifying the effects of the morphologically similar words in $S_m(w)$.
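A minimal sketch of the attention-based fusion follows, assuming that the score of each similar token is the dot product between the learnable vector W and that token's pre-trained embedding, normalized with a softmax. The function name `att_eg` and the toy values are hypothetical, and W is fixed by hand here rather than learned.

```python
import math
from typing import List

Vector = List[float]


def att_eg(member_embeddings: List[Vector], W: Vector) -> Vector:
    """Softmax-weighted sum of member embeddings with scores W . E(w')."""
    scores = [sum(wi * ei for wi, ei in zip(W, e)) for e in member_embeddings]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]
    dim = len(W)
    return [
        sum(a * e[i] for a, e in zip(alphas, member_embeddings))
        for i in range(dim)
    ]


members = [[1.0, 0.0], [0.0, 1.0]]
uniform = att_eg(members, [0.0, 0.0])  # zero scores reduce to plain averaging
skewed = att_eg(members, [5.0, 0.0])   # a larger score favors the first member
```

With a zero vector W the weights are uniform, so the averaging generator appears as a special case of this one.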

PATT-EG: Position-Aware Attention-Based Embedding Generator From the linguistic perspective, different locations of morphemes in a word reflect distinct semantic meanings. Consequently, we refine the above attention-based generator by considering six kinds of morphological relationships between $w$ and $w' \in S_m(w)$: if $w'$ is a subword of $w$, $w'$ can be the prefix/infix/suffix subword of $w$. In turn, if $w'$ is a hyperword of $w$, $w$ can be the prefix/infix/suffix subword of $w'$. Formally, $G(w)$ is again an attention-weighted combination of the embeddings in $S_m(w)$, whose weights additionally depend on $W_r \in \mathbb{R}^{6 \times d}$, a learnable parameter matrix, and $I \in \mathbb{R}^{1 \times 6}$, the one-hot vector indicating the relationship between $w$ and $w'$. Note that all the trainable generators are designed as lightweight architectures with only a few parameters. We believe this yields a more generalizable model and speeds up convergence. We compare and investigate these generators in the subsequent experiment section.
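The six relationships can be made concrete with a small Python sketch that maps a pair of tokens to a one-hot vector of length six. The index order, the helper names, and the use of surface-string positions (rather than positions in the actual segmentation) are assumptions made for illustration only.

```python
from typing import List

# Assumed index order: 0-2 when w' is a subword of w, 3-5 when w' is a hyperword.
RELATIONS = [
    "subword-prefix", "subword-infix", "subword-suffix",
    "hyperword-prefix", "hyperword-infix", "hyperword-suffix",
]


def surface(token: str) -> str:
    """Strip the '##' continuation marker, if any."""
    return token[2:] if token.startswith("##") else token


def position(short: str, long: str) -> int:
    """Return 0 if `short` is a prefix of `long`, 2 if a suffix, 1 otherwise."""
    if short not in long:
        raise ValueError(f"{short!r} does not occur in {long!r}")
    if long.startswith(short):
        return 0
    if long.endswith(short):
        return 2
    return 1


def relation_one_hot(w: str, w_prime: str) -> List[int]:
    """One-hot relation vector between a token w and a similar token w'."""
    a, b = surface(w), surface(w_prime)
    if len(b) < len(a):
        index = position(b, a)      # w' is a subword of w
    else:
        index = 3 + position(a, b)  # w' is a hyperword of w
    one_hot = [0] * len(RELATIONS)
    one_hot[index] = 1
    return one_hot


suffix_of_word = relation_one_hot("motorcycle", "##cycle")
prefix_of_word = relation_one_hot("motorcycle", "motor")
suffix_in_hyperword = relation_one_hot("##er", "worker")
```

Multiplying such a one-hot row vector by a 6 × d matrix selects one relation-specific row, which is how a position-aware generator can condition its attention on the relationship.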

Training the Embedding Generator
One principle of our strategy is that the generator is plug-and-play: it can be directly applied to initialize any unseen tokens in all downstream NLG tasks, avoiding the time cost of retraining the model. To this end, we borrow the pre-trained model and its associated corpus to train our generator before finetuning.
In the specific implementation, we first preprocess the sentences of the pre-training corpus, where two kinds of preprocessing operations are applied to simulate unseen tokens: 1) randomly selecting some consecutive subwords and combining them into an unseen token; and 2) randomly choosing a token and splitting it into several consecutive unseen tokens. Figure 2 provides an example of sentence preprocessing, where the word nothing is randomly split into two unseen subwords noth and ##ing, while the subwords ima and ##gine are concatenated into an unseen token imagine.

Figure 2: Illustration of the knowledge distillation procedure. Our strategy first performs a segmentation (different from the pre-trained one) on the original sentence to create unseen tokens, whose embeddings can be produced by our embedding generator. We fix the inner layers of the pre-trained model and force our model to narrow the distance between its output layer and the conventional one.
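The two preprocessing operations can be sketched in Python as follows. This is a simplified illustration under stated assumptions: tokens follow the WordPiece "##" convention, merging joins a word-initial piece with all of its continuation pieces, and splitting cuts one token at a single random point. The function names are hypothetical, and the sketch does not check that the resulting tokens are actually absent from the pre-training vocabulary, which a full implementation would need to do.

```python
import random
from typing import List


def body(token: str) -> str:
    """Strip the '##' continuation marker, if any."""
    return token[2:] if token.startswith("##") else token


def merge_random_span(tokens: List[str], rng: random.Random) -> List[str]:
    """Concatenate one word-initial piece with its '##' continuation pieces."""
    starts = [
        i for i in range(len(tokens) - 1)
        if not tokens[i].startswith("##") and tokens[i + 1].startswith("##")
    ]
    if not starts:
        return list(tokens)
    i = rng.choice(starts)
    j = i + 1
    while j < len(tokens) and tokens[j].startswith("##"):
        j += 1
    merged = tokens[i] + "".join(body(t) for t in tokens[i + 1:j])
    return tokens[:i] + [merged] + tokens[j:]


def split_random_token(tokens: List[str], rng: random.Random) -> List[str]:
    """Split one token with at least two characters at a random cut point."""
    candidates = [i for i, t in enumerate(tokens) if len(body(t)) >= 2]
    if not candidates:
        return list(tokens)
    i = rng.choice(candidates)
    text = body(tokens[i])
    cut = rng.randint(1, len(text) - 1)
    head = text[:cut]
    if tokens[i].startswith("##"):
        head = "##" + head
    return tokens[:i] + [head, "##" + text[cut:]] + tokens[i + 1:]


def detokenize(tokens: List[str]) -> str:
    """Rebuild the surface sentence from WordPiece-style tokens."""
    return "".join(
        t[2:] if t.startswith("##") else " " + t for t in tokens
    ).strip()


rng = random.Random(0)
original = ["i", "can", "ima", "##gine", "nothing"]
merged = merge_random_span(original, rng)
resplit = split_random_token(original, rng)
```

Both operations leave the surface sentence unchanged, so the original and the re-segmented sentence can be aligned word by word.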
Through this data preprocessing, we obtain a large number of samples with unseen tokens of various granularities, which improves the robustness of our generator. Then, we embed our generator into the pre-trained model to encode unseen words, and fix the parameters of the pre-trained model to train the generator according to the following objectives:
Reusing Pre-training Loss The generated embeddings should share the same latent space with the existing embeddings while representing appropriate semantic meaning. Accordingly, we minimize the vanilla loss of the pre-trained model as the basic training objective of our generator. The loss function can vary according to the upstream task and is denoted as $L_p(s')$, with $s'$ being the preprocessed training sentence.
Knowledge Distillation We further exploit knowledge distillation (Hinton et al., 2015) to narrow the divergence between hidden states in the pre-trained model before and after applying the generated embeddings. Given a training example $s$, the vanilla pre-trained model and our generator preprocess it to $s_p$ and $s'$, respectively. As shown in Figure 2, we transfer the knowledge of the output layer in terms of $s_p$ to that of $s'$. Euclidean distance is adopted to measure the divergence between the representation output by the vanilla pre-trained model, $h_p(w)$, and that of our model, $h'(w)$, with respect to the same word $w$. Since each word may be split into different sequences of tokens, we regard the average hidden state of the corresponding token sequence as its representation. Thus, the distillation loss $L_d(s_p, s')$ is defined over the words $w$ of $s$ in terms of the Euclidean distance $\lVert h_p(w) - h'(w) \rVert_2$. Finally, we assign a hyper-parameter $\lambda$ to balance the effects of $L_p(\cdot)$ and $L_d(\cdot)$, which is empirically set to 0.5 by default:

$$L(s_p, s') = L_p(s') + \lambda L_d(s_p, s') \tag{5}$$
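The following Python sketch illustrates the word-level distillation term and the combined objective. Each word is represented by the average of the hidden states of its tokens, and the Euclidean distances are averaged over the words of the sentence; that final averaging is an assumption of this sketch, as are the function names and the toy hidden states. The default weight of 0.5 follows the text.

```python
import math
from typing import List, Sequence

Vector = List[float]


def word_state(token_states: Sequence[Vector]) -> Vector:
    """Average the hidden states of the tokens that make up one word."""
    dim = len(token_states[0])
    count = len(token_states)
    return [sum(s[i] for s in token_states) / count for i in range(dim)]


def distill_loss(
    teacher_words: Sequence[Sequence[Vector]],
    student_words: Sequence[Sequence[Vector]],
) -> float:
    """Mean Euclidean distance between per-word representations.

    Averaging over words is an assumption of this sketch.
    """
    distances = [
        math.dist(word_state(t), word_state(s))
        for t, s in zip(teacher_words, student_words)
    ]
    return sum(distances) / len(distances)


def total_loss(pretrain_loss: float, distill: float, lam: float = 0.5) -> float:
    """Combined objective: pre-training loss plus weighted distillation loss."""
    return pretrain_loss + lam * distill


# Two words; the second is one token for the student but two for the teacher.
teacher = [[[0.0, 0.0]], [[1.0, 1.0], [3.0, 1.0]]]
student = [[[3.0, 4.0]], [[2.0, 1.0]]]
l_d = distill_loss(teacher, student)
l_total = total_loss(1.0, l_d)
```

Because both segmentations are pooled to word level before comparison, the teacher and the student may use different numbers of tokens for the same word.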

Experiments
In this section, we examine the effectiveness of the proposed strategy in a variety of NLG tasks. We first run a set of experiments to compare the variants of our approach and the related methods on domain adaptation translation tasks. Then, we assess the superiority of our approach on transferring the knowledge from M-BERT (Devlin et al., 2019) and M-BART (Liu et al., 2020c) to two downstream NLG tasks: machine translation (MT) and answer-aware question generation (QG).

Domain Adaptation
We conduct experiments on English-to-Chinese (En⇒Zh) domain adaptation translation tasks, where the pretrain-finetune paradigm is the standard approach. The pre-training corpus is extracted from the out-of-domain dataset LDC, from which 1.25M (M = million), 3K (K = thousand), and 3K sentence pairs are randomly sampled as the training, development and test sets, respectively. We verify the effectiveness of our strategy on two downstream domains, Thesis and Laws, whose data are collected from UM-Corpus (Tian et al., 2014). We follow the same settings as Zeng et al. (2018) to preprocess the two corpora and train models. The translation quality is evaluated by cased BLEU (Papineni et al., 2002), which is calculated by mteval-v13a.pl.
Implementation Details All the compared methods are re-implemented on top of FairSeq and built on Transformer (Vaswani et al., 2017). We apply the Adam optimizer (Kingma and Ba, 2015) with $\beta_1$ and $\beta_2$ being 0.9 and 0.999, respectively. The dropout ratio is set to 0.3 and each batch consists of 25K tokens. For both pre-training and finetuning, we employ a warm-up strategy where the linear warm-up phase takes 4K steps, reaching a maximum learning rate of $5 \times 10^{-4}$. The training of each model is early-stopped to maximize the BLEU score on the development set. Other hyperparameters follow the Base setting in Vaswani et al. (2017). We investigate the following methods:
• Methods that assign the domain-specific vocabulary to each downstream model, in which embeddings of the seen tokens are reused, while the mismatched ones are: 1) randomly initialized (Random Init, Aji et al., 2020); 2) learned by Word2Vec (Mikolov et al., 2013) using in-domain data; and 3) produced by a generator trained via reconstructing embeddings using Bag-of-Subwords (Embedding Recon).
• New Embedding Layer: These methods assign the domain-specific vocabulary to each downstream model, but completely discard the embeddings of the upstream models. The new embeddings are produced by: 1) a randomly initialized Independent Encoder (Lewis et al., 2020); and 2) a CBOW model trained on the downstream corpus (Sato et al., 2020).
• Our Strategy: Our embedding generators are trained using the setting of the pre-trained model for one epoch, as described in § 3.
Note that, to control for confounding variables, all the vocabulary transfers in the above models are conducted on the decoder side only.

Table 3: Evaluation of our model on knowledge transferring tasks. "w/" denotes "with". Random Init uses the same architecture as "w/ M-BERT" while being initialized randomly. "# Param." denotes the trainable parameter size of each model. "Speed" indicates the inference speed measured in sentences per second.
Immediately finetuning a downstream model with the out-of-domain vocabulary performs worse than merely training each model using in-domain data and a task-specific vocabulary. This is consistent with the findings of Edunov et al. (2019) and Zhu et al. (2020). We observe that over 13K and 11K tokens in the Out-of-Domain vocabulary are mismatched with those of Thesis and Laws, respectively, indicating that subword discrepancy indeed harms the performance of downstream NLG models. When adopting a task-specific vocabulary to retrain upstream models, all the translation qualities are improved, confirming the necessity of bridging subword gaps between upstream and downstream models. In addition, we also evaluate several existing embedding transfer strategies within the pretrain-finetune pipeline. Interestingly, randomly initializing embeddings of unseen tokens yields even slightly better results than utilizing "Word2Vec" and "Embedding Recon". We attribute this to the fact that the latter two generators are trained independently of the pre-trained model, so the generated and pre-trained embeddings do not share a latent space.

Results
Our models surpass all baselines and related methods in translation quality. Most importantly, in contrast to existing approaches that have to either retrain the pre-trained model from scratch or learn a separate embedding generator for each domain, our strategy can be immediately applied to any downstream task once the generator is ready. Specifically, PATT-EG achieves the best performance, confirming our hypothesis that softly summarizing information from morphologically similar tokens and considering the positions of morphemes facilitate embedding transfer. Besides, using knowledge distillation to narrow the divergence before and after applying our generator further improves the performance. Accordingly, we use PATT-EG + Knowledge Distillation as the default setting in subsequent experiments.

Knowledge Transferring
We test our method on transferring the knowledge from two advanced large-scale language models: non-autoregressive M-BERT and autoregressive M-BART. For computational efficiency, we randomly extract 4M samples from the conventional pre-training corpus to train our embedding generator, using the configurations of the pre-trained models with one epoch and a batch size of 4,096. Comparisons are conducted on machine translation and question generation tasks. The pre-trained model is employed in both the encoder and the decoder. As in the domain adaptation setting, we perform the embedding transfer only in the decoder. Since the two language models exploit different segmentation tools, i.e., WordPiece (Wu et al., 2016) and SentencePiece (Kudo, 2018), we set 32K and 10K as the number of word pieces and sentence pieces for downstream tasks, respectively.
Machine Translation For machine translation, we examine our method on the widely used English-to-German (En⇒De) benchmark WMT14. We follow Rothe et al. (2020) and Liu et al. (2020c) in the setup of this task.
Question Generation We use the SQuAD v1.1 (Rajpurkar et al., 2016) dataset for question generation. We follow the common setting to preprocess the dataset and train our models (Liu et al., 2020a). The answer and the passage are taken as the model input, while the question is the target output. ROUGE-L (Lin and Hovy, 2003), BLEU, and METEOR (Banerjee and Lavie, 2005) are used as the evaluation metrics.
Results As illustrated in Table 3

Analysis
To better understand subword discrepancy and our method, we conduct in-depth analyses on the WMT En⇒De task to investigate three questions: Q1: How does subword granularity affect NLG models? ( § 5.

Impact of Embedding Transfer
We further investigate how the embedding transfer impacts the initialization of downstream models. Figure 4 plots the BLEU scores of downstream models using embedding generators trained for different numbers of steps. The X-axis indicates the training steps of the generator. Both "+Ours" and "w/ M-BERT" are fully finetuned, but the latter does not employ our embedding generator, resulting in a flat line. It is encouraging to see that the BLEU score of the downstream model converges very fast, indicating that our generator can be used after only a few training steps. We argue that the commonalities in word compositionality lead to fast transfer learning when generating different embeddings, and the simple architecture of our generator further speeds up this procedure.

Computational Costs
As shown in Figure 4, our generator converges very fast (around 20K steps). The training process of our generator takes about 2 hours under our experimental setting. As a reference, the vanilla WMT finetuning process takes approximately 40 hours. In addition, our generator takes only about 3 minutes to produce 13K embeddings in Thesis, which is also insignificant compared to the finetuning time.
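To put these figures in proportion, a short worked calculation based only on the approximate numbers reported above (2 hours, 40 hours, 3 minutes, 13K embeddings) is:

```latex
\frac{T_{\text{generator}}}{T_{\text{finetune}}} \approx \frac{2\ \text{h}}{40\ \text{h}} = 5\%,
\qquad
\frac{13{,}000\ \text{embeddings}}{3 \times 60\ \text{s}} \approx 72\ \text{embeddings per second}
```

Since the generator is trained once and then reused, the roughly 5% overhead is paid a single time rather than once per downstream task.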
Most importantly, once the embedding generator is well trained, it is available for any downstream task. Thus, we argue that the computational costs are not an obstacle to the extensibility of our approach.
Table 4 gives an example to show the effectiveness of our model in handling under-represented tokens. The German word dankbar (gratifying) is over-segmented by M-BERT, and fails to be generated by the model trained under the conventional pipeline. By contrast, our approach offers an opportunity for the downstream model to preserve the word in its vocabulary, thus better learning its semantic meaning and correctly predicting it during inference.

Conclusion
In this paper, we point out that the one-size-fits-all subword vocabulary, despite its all-encompassing coverage, is not the preferred solution for the popular pretrain-finetune paradigm. It causes subword discrepancy between upstream and downstream models, which manifests as unsuitable granularity and under-represented words. Consequently, we propose a novel embedding transfer strategy with a plug-and-play embedding generator. Empirical results suggest that: 1) our approach is universally effective in overcoming subword discrepancy; 2) embedding transfer can bring benefits to computational efficiency; and 3) the embedding generator can be realized either by directly averaging the input embeddings or by applying trainable components, where the latter performs better but requires a small amount of training. As our approach is transparent to model architectures and tasks, we believe it can be widely applied and further raise the flexibility and applicability of pre-trained models.
In the future, we plan to investigate its effectiveness on other generation tasks, such as code generation (Jiang et al., 2021), summarization (Shi et al., 2021) and so on.