Enhancing Language Generation with Effective Checkpoints of Pre-trained Language Model

This work empirically explores how to effectively exploit the intermediate output of pre-trained language models (PrLMs) for language generation tasks. To this end, we propose an improved method that integrates publicly available checkpoints of PrLMs in the most convenient way and perform extensive experiments on six different kinds of PrLMs, including BERT, ELECTRA, GPT2, Multilingual BERT, and XLM-RoBERTa. Evaluation with automatic metrics shows that our approach significantly improves generation quality, by up to 1.8 BLEU points for neural machine translation (Korean-to-English, Korean-to-Chinese) and 1.8 ROUGE points for text summarization.


Introduction
Pre-trained Language Models (PrLMs), such as BERT, RoBERTa, and ELECTRA, have thoroughly changed the landscape of state-of-the-art performance on many Natural Language Understanding (NLU) tasks. Moreover, publicly released checkpoints of these PrLMs allow natural language processing (NLP) researchers to obtain state-of-the-art results while saving vast compute and time resources. The most widely used method to exploit a PrLM is fine-tuning. However, for Natural Language Generation (NLG) tasks, fine-tuning does not yield as much performance gain as in NLU tasks. Several previous studies proposed methods that make better use of the prior knowledge of the PrLM for NLG tasks (Yang et al., 2020; Zhu et al., 2020; Chen et al., 2020). Extending these studies, in this paper we propose an improved method that integrates checkpoints of PrLMs into Transformer models (Vaswani et al., 2017).
The existing methods for leveraging PrLMs in NLG tasks can be roughly classified into two categories: reusing the PrLM as a starting point and integrating the intermediate output of the PrLM. The former, widely used across NLP tasks, initializes part of the Transformer from the PrLM for generation tasks (Clinchant et al., 2019; Rothe et al., 2020) or replaces the input embedding with the PrLM. The latter first extracts a contextualized representation from an LM for an input sentence and fuses it into a neural model (Yang et al., 2020; Zhu et al., 2020; Chen et al., 2020). Following our preliminary experiments, we expand this second approach and explore it in several directions for better performance. In both approaches, whether to freeze or fine-tune the parameters of the PrLM is also an important issue. For the former (reusing the PrLM), several works demonstrated that freezing the PrLM at training time leads to a significant performance drop. Meanwhile, for the latter (integrating the PrLM), prior studies adopted whole or half freezing instead of fine-tuning the parameters of the PrLMs. Yang et al. (2020) suggested that fine-tuning a PrLM for neural machine translation (NMT) does not work as well as in other NLP tasks because of the availability of large training data and the high capacity of baseline NMT models (i.e., Transformer), where excessive fine-tuning leads to the catastrophic forgetting phenomenon (Goodfellow et al., 2015). Zhu et al. (2020) likewise show that freezing BERT in NMT is better than fine-tuning it by a large margin, which is in line with our experimental results. Thus, in this empirical study, we freeze the parameters of the PrLMs in all experiments. This paper therefore focuses on finding an effective way to integrate the intermediate output of frozen PrLMs into generation models.

Model
We propose a modified Transformer encoder that effectively integrates publicly available checkpoints of PrLMs.¹ Figure 1 shows the architecture of our proposed model. We add an extra flow for the PrLM through additional PrLM-dedicated modules, including PrLMAttn, Add&Norm, and FFN. Detailed mathematical formulations are given in Appendix B. Given an input sequence $x^s = \{x_1, ..., x_N\}$, there is a PrLM-input sequence $x^p = \{x^p_1, ..., x^p_M\}$ of length $M$ split by the PrLM-dedicated tokenizer. The PrLM-input sequence is fed to the PrLM to generate the PrLM representation $H^P = \mathrm{PrLM}(x^p)$. Based on a preliminary experiment, we adopt the second-to-last hidden state of the PrLM outputs as the contextualized representation. In our proposed encoder, the PrLM representation $H^P$ is merged with the source flow to generate the output $H^n_S$ of the $n$-th encoder layer:

$$A^n = \mathrm{Add\&Norm}(\mathrm{Attn}(H^{n-1}_S, H^{n-1}_S, H^{n-1}_S)),$$
$$P^n = \mathrm{FFN}(\mathrm{Add\&Norm}(\mathrm{PrLMAttn}(H^{n-1}_S, H^P, H^P))),$$
$$H^n_S = \mathrm{Add\&Norm}(\mathrm{FFN}(A^n) + P^n),$$

where PrLMAttn is the PrLM-dedicated attention module that takes the previous hidden state $H^{n-1}_S$ as the query and the PrLM representation $H^P$ as the key and value, and Attn is the original self-attention module. We adopt the summation strategy for merging the two flows, which yields better results than alternatives from previous work such as the gate network (Yang et al., 2020) and dropnet (Zhu et al., 2020).

¹ https://github.com/tmtmaj/Exploiting-PrLM-for-NLGtasks
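To make the architecture concrete, below is a minimal PyTorch sketch of one modified encoder layer. The class and module names (`PrLMEncoderLayer`, `prlm_attn`, etc.) and the residual/dropout placement are our illustrative assumptions and do not reflect the released implementation; the sketch only mirrors the description above (self-attention flow plus a PrLM-dedicated flow whose output is summed after the source FFN).

```python
import torch
import torch.nn as nn

class PrLMEncoderLayer(nn.Module):
    """Transformer encoder layer with an extra PrLM-dedicated flow (a sketch)."""

    def __init__(self, d_model=512, d_prlm=768, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Original source flow: self-attention -> Add&Norm -> FFN -> Add&Norm.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        # PrLM-dedicated flow: PrLMAttn -> Add&Norm -> FFN. kdim/vdim allow the
        # PrLM dimension to differ from the model dimension.
        self.prlm_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               kdim=d_prlm, vdim=d_prlm,
                                               batch_first=True)
        self.prlm_norm = nn.LayerNorm(d_model)
        self.prlm_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                      nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, h_s, h_p):
        # h_s: (batch, N, d_model) previous source hidden states H^{n-1}_S.
        # h_p: (batch, M, d_prlm) fixed PrLM representation H^P.
        a, _ = self.self_attn(h_s, h_s, h_s)
        a = self.norm1(h_s + self.drop(a))                  # 1st Add&Norm
        p, _ = self.prlm_attn(h_s, h_p, h_p)                # PrLMAttn
        p = self.prlm_ffn(self.prlm_norm(h_s + self.drop(p)))
        # Summation of both flows after the source FFN, then the 2nd Add&Norm.
        return self.norm2(a + self.drop(self.ffn(a)) + p)
```

Stacking six such layers, with `h_p` computed once per sentence by the frozen PrLM, would correspond to the encoder in Figure 1.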

Experiments
To demonstrate the effectiveness of the proposed method, we perform extensive experiments on two NMT tasks and an abstractive text summarization task. For translation, we use BLEU (Papineni et al., 2002) to evaluate translation quality; for text summarization, we report unigram and bigram overlap (ROUGE-1 and ROUGE-2) to assess informativeness, and the longest common subsequence (ROUGE-L) to assess fluency (Lin, 2004). All models are trained on a single NVIDIA Tesla V100 GPU (16130 MiB, Google Colab).
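For reference, the snippet below shows one common way to compute these metrics. The paper does not state which scoring implementations were used; sacrebleu and the rouge-score package here are our assumption, and the example sentences are illustrative.

```python
import sacrebleu
from rouge_score import rouge_scorer

# BLEU for translation (corpus-level; one reference stream).
hyps = ["the cat sat on the mat"]
refs = [["the cat sat on the mat"]]  # outer list: reference sets
bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU = {bleu.score:.2f}")

# ROUGE-1/2 (informativeness) and ROUGE-L (fluency) for summarization.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(target="the cat sat on the mat",
                      prediction="a cat sat on the mat")
for name, s in scores.items():
    print(f"{name}: F1 = {s.fmeasure:.3f}")
```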

Datasets and Experimental Setting
We evaluate our approach on language generation tasks such as translation and text summarization.
For translation tasks, we use two machine translation datasets: AIHub Ko→En² (containing 1.6M sentence pairs) and a Ko→Zh dataset.

Explorations for leveraging PrLM
Our default method for leveraging the PrLM is as follows: we use the second-to-last hidden state of the PrLM as the contextualized representation, integrate it on the source side (encoder) only, and merge it with the source input flow by summation after the FFN.
In this subsection, we start from this default setting and vary only the component under study in each experiment. We conducted the following four analyses on the Korean-Chinese and Korean-English datasets.

Which hidden state of the PrLM to extract?
We evaluated how to extract the contextualized representation from the PrLM. As shown in Table 3a, using the second-to-last (i.e., penultimate) hidden state of the PrLM performs best, as also demonstrated by Yang et al. (2020). As another attempt, we dynamically weighted the hidden state of each layer based on the sentence embedding of the $n$-th layer, which can be obtained by averaging that layer's outputs (known as PrLM embeddings; Dyn. in Table 3a).
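A minimal sketch of extracting the second-to-last hidden state with the HuggingFace transformers library follows. Using this library, and this particular checkpoint name, is our assumption for illustration; any public checkpoint that exposes hidden states works the same way.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; any public PrLM checkpoint can be substituted.
name = "google/electra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(name)
prlm = AutoModel.from_pretrained(name, output_hidden_states=True)
prlm.eval()  # the PrLM stays frozen in our setting

with torch.no_grad():
    batch = tokenizer("A sample input sentence.", return_tensors="pt")
    outputs = prlm(**batch)

# hidden_states is a tuple: (embeddings, layer 1, ..., layer L);
# index -2 is the second-to-last (penultimate) layer used as H^P.
h_p = outputs.hidden_states[-2]  # shape: (batch, M, d_prlm)
```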

How to merge the PrLM representation with the source input flow?
We compared different merging strategies for the contextualized representation of the PrLM: directly using the PrLM output as the input embedding (Direct in Table 3b), and four merging strategies, namely Summation, Average, a Gate Network (Yang et al., 2020), and Dropnet (Zhu et al., 2020), as sketched below. As shown in Table 3b, the summation of the PrLM flow and the source input flow obtains the largest improvement, so we adopt the Summation strategy in our experiments.
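The four merging strategies can be summarized as follows. This is a sketch: `gate_linear` and the simplified dropnet behavior follow the spirit of the cited papers rather than their exact formulations.

```python
import torch

def merge(src, prlm, strategy, gate_linear=None, training=True):
    """Merge the source flow `src` with the (length-aligned) PrLM flow `prlm`.

    Both tensors are (batch, N, d_model); `gate_linear` is a Linear layer
    mapping 2 * d_model -> d_model, used only by the gate strategy.
    """
    if strategy == "sum":      # Summation: adopted in this work
        return src + prlm
    if strategy == "avg":      # Average
        return (src + prlm) / 2
    if strategy == "gate":     # gate network in the spirit of Yang et al. (2020)
        g = torch.sigmoid(gate_linear(torch.cat([src, prlm], dim=-1)))
        return g * src + (1 - g) * prlm
    if strategy == "dropnet":  # simplified dropnet trick (Zhu et al., 2020)
        if training:           # randomly keep one flow per forward pass
            return src if torch.rand(()) < 0.5 else prlm
        return (src + prlm) / 2
    raise ValueError(f"unknown strategy: {strategy}")
```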

Where do the PrLM merge with the source input flow?
In Table 3c, we analyze where in a Transformer encoder layer it is best to merge the PrLM and source flows. There are four candidate positions: after Attn, after the 1st Add&Norm, after the FFN, and after the 2nd Add&Norm. Interestingly, merging after the FFN gives the best performance. Another observation is that merging the contextualized representation of the PrLM before an Add&Norm (i.e., after Attn or after the FFN) works better.
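Schematically, the four candidate positions correspond to the following points in the source flow. The function below is a sketch with hypothetical names; the module arguments stand in for the layer's own sublayers.

```python
def source_flow(h_s, p, attn, norm1, ffn, norm2, merge_pos):
    """Source flow of one encoder layer, merging the PrLM flow `p` at one of
    four positions; `attn`, `norm1`, `ffn`, `norm2` are the layer's modules."""
    a = attn(h_s)
    if merge_pos == "after_attn":        # before the 1st Add&Norm
        a = a + p
    a = norm1(h_s + a)                   # 1st Add&Norm
    if merge_pos == "after_norm1":
        a = a + p
    f = ffn(a)
    if merge_pos == "after_ffn":         # best in Table 3c
        f = f + p
    out = norm2(a + f)                   # 2nd Add&Norm
    if merge_pos == "after_norm2":
        out = out + p
    return out
```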

Where to add the PrLM flow?
We evaluated where to add the PrLM flow: source side, target side, and both sides. As shown in Table 3d, adding it only on the source side works best. Since the contextualized representation from the fixed PrLM contains universal information rather than information tailored to generation tasks, combining it directly with the target context may adversely affect performance.

Leveraging Multi-PrLMs
We hypothesized that because PrLMs are trained on different datasets (in size and domain) and with diverse configurations, each would contain distinct prior knowledge. We therefore tried to integrate two or more PrLMs simultaneously (Multi-PrLMs) by adding further extra modules to each encoder layer, as sketched below. Contrary to our expectations, as shown in Table 4, using Multi-PrLMs does not yield a significant performance boost over a single PrLM.
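Concretely, the Multi-PrLM variant adds one PrLM-dedicated branch per checkpoint and sums all flows into the source flow. A minimal sketch with hypothetical names:

```python
def merge_multi_prlm(ffn_out, h_s_prev, prlm_branches, prlm_reprs):
    """Sum the source FFN output with one PrLM-dedicated branch per checkpoint.

    prlm_branches: callables implementing PrLMAttn -> Add&Norm -> FFN;
    prlm_reprs: the corresponding fixed representations H^P (one per PrLM).
    """
    merged = ffn_out
    for branch, h_p in zip(prlm_branches, prlm_reprs):
        merged = merged + branch(h_s_prev, h_p)  # e.g. ELECTRA and mBERT branches
    return merged
```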

Fine-tuning vs. Freezing
We compared the impact of fine-tuning and freezing the parameters of the PrLM when using our method. Freezing the parameters of the PrLM gains a more significant improvement than fine-tuning. Another interesting observation is that fine-tuning ELECTRA base (112M parameters) leads to a significant performance drop, especially on the relatively large corpus (Ko→En, 1.6M). This suggests that the catastrophic forgetting issue is more pronounced in resource-rich scenarios and with larger PrLMs. Additionally, tuning separate learning rates (Yang et al., 2020) for the PrLM and the Transformer model may lead to better performance, but we leave this to future work.
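In practice, freezing amounts to excluding the PrLM's parameters from the optimizer. A minimal sketch, with the `prlm` object as in the earlier extraction sketch and an `nn.Transformer` as a stand-in for the generation model:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

# Hypothetical setup: a frozen PrLM plus a trainable Transformer model.
prlm = AutoModel.from_pretrained("google/electra-base-discriminator")
model = nn.Transformer(d_model=512, nhead=8)  # stand-in for the NMT model

# Freeze the PrLM: no gradients, and keep it in eval mode (disables dropout).
for param in prlm.parameters():
    param.requires_grad = False
prlm.eval()

# Only parameters that still require gradients are optimized.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=7e-4, betas=(0.9, 0.98))
```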

Inference Speed
We measured the inference speed of our method. Since our method has to obtain the intermediate output of the PrLM for each input sentence, inference takes more time than with the baseline model. The experimental results are shown in Table 6. Integrating a PrLM into the Transformer model reduces inference speed by about 5% (ELECTRA small, 14M parameters) to 17% (ELECTRA base, 112M parameters). However, considering the significant performance improvement and the ease of application to any language, such extra cost is acceptable.

Related Work
Previous studies rely on the structural compatibility of the Transformer and the PrLM. For example, Clinchant et al. (2019) initialized the encoder of the Transformer from BERT (fine-tuned or fixed) and observed that freezing the PrLM causes a considerable performance drop. Conneau and Lample (2019) verified initialization methods with CLM or MLM objectives trained on multilingual corpora and showed that such initialization is useful for MT. Rothe et al. (2020) used publicly available PrLM checkpoints to initialize the Transformer. While the initialization method is useful to some extent, it requires matching the vocabulary and model size/hyper-parameters to those of the PrLM. Zhu et al. (2020) proposed a method that extracts the last hidden state of BERT for an input sentence and fuses it into the encoder and decoder of the Transformer through an extra attention module, and evaluated its effectiveness on supervised, semi-supervised, and unsupervised NMT. Yang et al. (2020) introduced a concerted training framework with three techniques for fusing a PrLM and an NMT model. Although they also extract the hidden state of the PrLM and integrate it into the NMT model, the NMT model must follow the PrLM's configuration, such as its word segmentation rule and vocabulary.
Our work is related to both Zhu et al. (2020) and Yang et al. (2020) in that we all aim to extract the intermediate output of a PrLM and integrate it into a neural model for better generation quality. As an extension of Zhu et al. (2020), we propose an upgraded method developed through extensive empirical experiments. Our work differs from Yang et al. (2020) in that we use publicly available checkpoints with various configurations and fix the PrLM at training time.

Conclusion
While most previous work on PrLMs addresses their integration via fine-tuning, we propose an alternative in which a modified Transformer encoder takes the intermediate output of a frozen PrLM to exploit its prior knowledge effectively and straightforwardly. Our method does not have to consider the PrLM's configuration, such as its model size, model dimension, and vocabulary. Accordingly, our approach and the reported empirical settings can be smoothly applied to any language using any PrLM checkpoint.

A.1 Model Setting
In our experiments, we use the Transformer (Vaswani et al., 2017) as the baseline model for the NMT and abstractive text summarization tasks. Additionally, for the NMT tasks, we compare our approach to the following baselines:
• Zhu et al. (2020): a method that inserts the last hidden state of a fixed PrLM through a PrLM-dedicated attention module in the Transformer encoder and decoder.
• Clinchant et al. (2019) (Direct* in Table 1): a method that replaces the input embedding with the PrLM, fine-tuned at training time.
Both baselines use ELECTRA base as the PrLM.
For the Transformer model, we use the base Transformer configuration (Vaswani et al., 2017) with an embedding size of 512, 6 encoder and 6 decoder layers, 8 attention heads, shared source and target embeddings, the standard ReLU activation function, and sinusoidal positional embeddings. We train with a batch size of 3500 tokens and optimize the model parameters using the Adam optimizer with a learning rate of 7e-4, β₁ = 0.9, β₂ = 0.98, and learning-rate warm-up over the first 4000 steps. Additionally, we apply label smoothing with a factor of 0.1. We average the last 5 checkpoints and run inference with a beam size of 5. All models are trained for 50 epochs using the Torch-based toolkit Fairseq(-py). For the text summarization task, we reduce the number of encoder and decoder layers to 4 and use trigram blocking (Paulus et al., 2018) to reduce redundancy at inference time. Other settings are the same as above.
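For reference, the standard inverse-square-root schedule implied by this setting (linear warm-up to the peak learning rate over 4000 updates, then 1/√step decay) can be written as follows; restating the textbook formula is our addition, not code from the paper.

```python
def inverse_sqrt_lr(step, peak_lr=7e-4, warmup_steps=4000):
    """Linear warm-up to `peak_lr`, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

# e.g. inverse_sqrt_lr(1000) -> 1.75e-4, inverse_sqrt_lr(16000) -> 3.5e-4
```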
For all datasets, we first tokenize sentences with language-specific tokenizers, namely KoNLPy⁵ for Korean, jieba⁶ for Chinese, and Moses (Koehn et al., 2007) for English, and then apply Byte-Pair Encoding (Sennrich et al., 2016) to the tokenized sentences with 32K merge operations. Besides, most PrLMs have an input sequence length limit (e.g., 512 tokens), so we cut out the middle of long texts in the summarization dataset, as proposed by Sun et al. (2019).
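Cutting out the middle keeps the head and tail of a long input, in the spirit of Sun et al. (2019). A minimal sketch; the 512 limit comes from the PrLM, while the head/tail split below is our illustrative assumption.

```python
def truncate_middle(tokens, max_len=512, head=128):
    """Keep the first `head` and last `max_len - head` tokens; cut the middle."""
    if len(tokens) <= max_len:
        return tokens
    return tokens[:head] + tokens[-(max_len - head):]
```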

B Details of the Notations
Let Attn denote a multi-head attention module, which takes a query matrix $Q$, a key matrix $K$, and a value matrix $V$ and produces an output matrix as follows:

$$\mathrm{Attn}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, ..., \mathrm{head}_h)W^o,$$
$$\mathrm{attn}(q, k, v) = \mathrm{softmax}\left(\frac{(qW^q)(kW^k)^\top}{\sqrt{d_{model}}}\right)vW^v, \tag{8}$$

where concat denotes the concatenation operation, softmax denotes the softmax function, $d_{model}$ is the dimension of the model, and $W^o, W^q, W^k, W^v$ are parameter matrices. FFN consists of two fully-connected layers with a ReLU activation in between:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2,$$

where $\max(0, x)$ is the ReLU activation function, and $W_1, b_1, W_2, b_2$ are learnable parameters. Finally, Attn and FFN are connected with Add&Norm, which denotes a combination module containing a residual connection (He et al., 2016) and a layer normalization (Ba et al., 2016).
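For completeness, a direct NumPy transcription of the per-head formula in Eq. (8), with random parameter matrices for illustration (the dimensions and example shapes are our assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attn(q, k, v, W_q, W_k, W_v, d_model):
    """attn(q, k, v) = softmax((q W^q)(k W^k)^T / sqrt(d_model)) v W^v."""
    scores = (q @ W_q) @ (k @ W_k).T / np.sqrt(d_model)
    return softmax(scores) @ (v @ W_v)

# Tiny usage example, e.g. PrLMAttn: queries from the source flow,
# keys/values from the PrLM representation.
rng = np.random.default_rng(0)
d_model, d_head, N, M = 512, 64, 5, 7
q = rng.normal(size=(N, d_model))       # source hidden states
k = v = rng.normal(size=(M, d_model))   # PrLM representation
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = single_head_attn(q, k, v, W_q, W_k, W_v, d_model)  # shape: (N, d_head)
```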