NewsBERT: Distilling Pre-trained Language Model for Intelligent News Application

Pre-trained language models (PLMs) like BERT have made great progress in NLP. News articles usually contain rich textual information, and PLMs have the potential to enhance news text modeling for various intelligent news applications like news recommendation and retrieval. However, most existing PLMs are huge, with hundreds of millions of parameters. Many online news applications need to serve millions of users with low latency tolerance, which poses a great challenge to incorporating PLMs in these scenarios. Knowledge distillation techniques can compress a large PLM into a much smaller one while retaining good performance. However, existing language models are pre-trained and distilled on general corpora like Wikipedia, which have some gaps with the news domain and may be suboptimal for news intelligence. In this paper, we propose NewsBERT, which can distill PLMs for efficient and effective news intelligence. In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models, where the student model can learn from the learning experience of the teacher model. In addition, we propose a momentum distillation method that incorporates the gradients of the teacher model into the update of the student model to better transfer the useful knowledge learned by the teacher. Extensive experiments on two real-world datasets with three tasks show that NewsBERT can effectively improve model performance in various intelligent news applications with much smaller models.


Introduction
Pre-trained language models (PLMs) like BERT (Devlin et al., 2019) and GPT (Radford et al., 2019) have achieved remarkable success in various NLP applications (Yang et al., 2019). These PLMs are usually huge, with hundreds of millions of parameters (Qiu et al., 2020). For example, the BERT-Base model contains about 110M parameters and 12 Transformer (Vaswani et al., 2017) layers, which imposes a high demand on computational resources during model training and inference. However, many online applications need to provide services for a large number of concurrent users with low latency tolerance, which hinders the deployment of large-scale PLMs in these systems (Sanh et al., 2019).
In recent years, online news websites such as MSN News and Google News have gained huge popularity among users for digesting digital news (Wu et al., 2019b). These websites usually involve a series of intelligent news applications like automatic news topic classification (Wu et al., 2019c), news headline generation (Tan et al., 2017) and news recommendation (Okura et al., 2017; Wu et al., 2019a,b,d, 2021b). In these applications, text modeling is a critical technique because news articles usually contain rich textual content (Wang et al., 2020a). Thus, these applications would benefit greatly from the powerful language understanding ability of PLMs if they could be incorporated in an efficient way, which further has the potential to improve the news reading experience of millions of users.
Knowledge distillation is a technique that compresses a cumbersome teacher model into a lighter-weight student model by transferring useful knowledge (Hinton et al., 2015; Kim and Rush, 2016). It has been employed to compress many huge pre-trained language models into much smaller versions while keeping most of the original performance (Sanh et al., 2019; Sun et al., 2019; Wang et al., 2020b). For example, Sanh et al. (2019) proposed DistilBERT, which learns the student model from the soft target probabilities of the teacher model by using a distillation loss with softmax temperature (Jang et al., 2016), and they regularized the hidden state directions of the student and teacher models to be aligned. TinyBERT, an improved version of DistilBERT, was proposed subsequently; in addition to the distillation loss, it regularizes the token embeddings, hidden states and attention heatmaps of the student and teacher models to be aligned via a mean squared error loss. These methods usually learn the teacher and student models successively, where the student can only learn from the results of the teacher model. However, the learning experience of the teacher may also be useful for the learning of the student model (Zhang et al., 2018), which is not considered by existing methods. In addition, the corpora for pre-training and distilling general language models (e.g., Wikipedia) may also have some domain shift from news corpora, which may not be optimal for intelligent news applications.
In this paper, we propose a NewsBERT approach that can distill PLMs for various intelligent news applications. In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models in news intelligence tasks by sharing the parameters of the top layers, and meanwhile distill the student model by regularizing the output soft probabilities and hidden representations. In this way, the student model can learn from the teacher's learning experience to better imitate the teacher model, and the teacher can also be aware of the learning status of the student model to enhance student teaching. In addition, we propose a momentum distillation method that uses the gradients of the teacher model to boost the gradients of the student model in a momentum way, which can better transfer the useful knowledge learned by the teacher model to enhance the learning of the student model. We conduct extensive experiments on two real-world datasets that involve three news intelligence tasks. The results validate that our proposed NewsBERT approach can consistently improve the performance of these tasks using much smaller models and outperform many baseline methods for PLM distillation.
The main contributions of this work include:
• We propose a NewsBERT approach to distill pre-trained language models for intelligent news applications.
• We propose a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models by sharing knowledge in their learning process.
• We propose a momentum distillation method which uses the gradients of the teacher model to boost the learning of the student model in a momentum manner.
• Extensive experiments on real-world datasets validate that our method can effectively improve the performance of various intelligent news applications in an efficient way.

Related Work
In recent years, many researchers have explored using knowledge distillation techniques to compress large-scale PLMs into smaller ones (Tang et al., 2019; Sanh et al., 2019; Sun et al., 2019; Mirzadeh et al., 2020; Wang et al., 2020b; Wu et al., 2021a). For example, Tang et al. (2019) proposed a BiLSTM SOFT method that distills the BERT model into a single-layer BiLSTM using the distillation loss in downstream tasks. Sanh et al. (2019) proposed DistilBERT, which distills the student model at the pre-training stage using the distillation loss and a cosine embedding loss that aligns the hidden states of the teacher and student models. Sun et al. (2019) proposed a patient knowledge distillation method for BERT compression named BERT-PKD, which distills the student model by learning from the teacher's output soft probabilities and the hidden states produced by intermediate layers. Wang et al. (2020b) proposed MiniLM, which employs a deep self-attention distillation method that uses the KL-divergence loss between the teacher's and student's attention heatmaps computed by query-key inner products and the value relations computed by value-value inner products. The TinyBERT method distills the BERT model at both the pre-training and fine-tuning stages by using the distillation loss and MSE losses between the embeddings, hidden states and attention heatmaps. There are also a few works that explore distilling pre-trained language models for specific downstream tasks such as document retrieval (Lu et al., 2020; Chen et al., 2021). For example, Lu et al. (2020) proposed a TwinBERT approach for document retrieval, which employs a two-tower architecture with two separate language models to encode the query and document, respectively. They used the distillation loss function to compress the two BERT models into smaller ones. These methods usually train the teacher and student models successively, i.e., distilling the student model based on a well-tuned teacher model.
However, the useful experience evoked by the teacher's learning process cannot be exploited by the student, and the teacher is also not aware of the student's learning status. In addition, the corpus for pre-training and distilling these language models usually exhibits some domain shift from news texts. Thus, it may not be optimal to apply off-the-shelf distilled language models to intelligent news applications. In this work, we propose a NewsBERT method to distill pre-trained language models for intelligent news applications, which can effectively reduce the computational cost of PLMs while achieving promising performance. We propose a teacher-student joint learning and distillation framework, where the student model can exploit the useful knowledge produced by the learning process of the teacher model. In addition, we propose a momentum distillation method that integrates the gradients of the teacher model into the student model's gradients as a momentum to boost the learning of the student.

NewsBERT
In this section, we introduce our NewsBERT approach, which can distill PLMs for intelligent news applications. We first introduce the teacher-student joint learning and distillation framework of NewsBERT by using the news classification task as a representative example, then introduce our proposed momentum distillation method, and finally introduce how to learn NewsBERT in more complicated tasks like news recommendation.

Teacher-Student Joint Learning and Distillation Framework
The overall framework of our NewsBERT approach in a typical news classification task is shown in Fig. 1. It contains a teacher model with a parameter set Θ_t and a student model with a parameter set Θ_s. The teacher is a strong but large-scale PLM (e.g., BERT) with heavy computational cost, and the goal is to learn a light-weight student model that keeps most of the teacher's performance. Different from existing knowledge distillation methods that first learn the teacher model and then distill the student model from the fixed teacher, in our approach we jointly learn the teacher and student models and meanwhile distill useful knowledge from the teacher model. Both teacher and student language models contain an embedding layer and several Transformer (Vaswani et al., 2017) layers. We assume that the teacher model has NK Transformer layers on top of the embedding layer and the student model has N Transformer layers on top of the embedding layer. Thus, the inference speed of the student model is approximately K times faster than that of the teacher. We first use the teacher and student models to separately process the input news text (denoted as x) through their Transformer layers and obtain the hidden representation of each token. We use a shared attentive pooling (Yang et al., 2016) layer (with parameter set Θ_p) to convert the hidden representation sequences output by the teacher and student models into unified news embeddings, and finally use a shared dense layer (with parameter set Θ_d) to predict the classification probability scores based on the news embedding. By sharing the parameters of the top pooling and dense layers, the student model can get richer supervision information from the teacher, and the teacher can also be aware of the student's learning status. Thus, the teacher and student can learn reciprocally by sharing the useful knowledge they encode, which is helpful for learning a strong student model.
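The shared attentive pooling step can be sketched as follows. This is a minimal numpy illustration rather than the actual implementation: the matrix W and query vector q are stand-ins for the learned pooling parameters Θ_p, and both the teacher's and the student's token representations would be passed through this same shared function.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pooling(H, W, q):
    """Pool a sequence of hidden vectors H (seq_len x dim) into one news embedding.

    Attention weights are computed as softmax(tanh(H W) . q); because (W, q)
    are shared between teacher and student, both models produce news
    embeddings in the same space.
    """
    scores = np.tanh(H @ W) @ q   # (seq_len,) attention logits
    alpha = softmax(scores)       # attention weights over tokens
    return alpha @ H              # weighted sum, shape (dim,)

# Illustrative shapes: 10 tokens with 16-dimensional hidden states.
rng = np.random.default_rng(0)
H = rng.normal(size=(10, 16))
W = rng.normal(size=(16, 8))
q = rng.normal(size=(8,))
news_emb = attentive_pooling(H, W, q)
```

The same call would be made once on the teacher's token representations and once on the student's, yielding the two news embeddings h^t and h^s that the hidden losses later compare.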
Figure 1: The framework of NewsBERT in an example task, i.e., news classification.

Next, we introduce the knowledge distillation details of our approach. We assume the i-th Transformer layer in the student model corresponds to the layers [(i-1)K+1, ..., iK] in the teacher model. We call the stack of these K layers in the teacher model a "block". Motivated by (Sun et al., 2019), we apply a hidden loss to align the hidden representations given by each layer in the student model and its corresponding block in the teacher model, which can help the student better learn from the teacher. We denote the token representations output by the embedding layers in the teacher and student models as E^t and E^s, respectively. The hidden representations produced by the i-th layer in the student model are denoted as H^s_i, and the hidden representations given by the corresponding block in the teacher model are denoted as H^t_{iK}. The hidden loss function applied to these layers is formulated as follows:

L_hid = MSE(E^s, E^t) + Σ_{i=1}^{N} MSE(H^s_i, H^t_{iK}),

where MSE stands for the Mean Squared Error loss function. In addition, since the pooling layer is shared between the student and teacher, we expect the unified news embeddings learned by the pooling layers in the teacher and student models (denoted as h^t and h^s, respectively) to be similar. Thus, we propose to apply an additional hidden loss to these embeddings, which is formulated as follows:

L_pool = MSE(h^s, h^t).

Besides, to encourage the student model to make similar predictions to the teacher model, we use the distillation loss function to regularize the output soft labels. We denote the soft labels predicted by the teacher and student models as ŷ^t and ŷ^s, respectively. The distillation loss is formulated as:

L_soft = CE(ŷ^t / t, ŷ^s / t),

where CE stands for the cross-entropy function and t is the temperature value.
The overall loss function for distillation is a summation of the hidden losses and the distillation loss, which is formulated as:

L_d = L_hid + L_pool + L_soft.

Since the original teacher and student models are task-agnostic, both teacher and student models need to receive task-specific supervision signals from the task labels (denoted as y) to tune their parameters. Thus, the unified loss function L_s for training the student model is the summation of the overall distillation loss and the classification loss, which is written as follows:

L_s = L_d + CE(ŷ^s, y).

Since we do not expect the teacher to be influenced by the student too heavily, the loss function L_t for training the teacher model is only the classification loss, which is computed as follows:

L_t = CE(ŷ^t, y).

By jointly optimizing the loss functions of the teacher and student models via backward propagation, we can obtain a light-weight student model that can generate task-specific news representations for inferring the labels in downstream tasks, as the teacher model does.
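The loss computation described above can be summarized in a compact numpy sketch. Names and shapes here are illustrative assumptions; in a real implementation these would operate on framework tensors so gradients can flow, and the soft labels are modeled as temperature-softened softmax outputs over logits z.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def softmax(z, t=1.0):
    e = np.exp(z / t - np.max(z / t))
    return e / e.sum()

def cross_entropy(p, q):
    # CE between a target distribution p and a predicted distribution q
    return float(-np.sum(p * np.log(q + 1e-12)))

def student_loss(E_s, E_t, H_s, H_t_blocks, h_s, h_t, z_s, z_t, y, t=1.0):
    """L_s = L_hid + L_pool + L_soft + classification loss on hard label y."""
    # hidden losses: embedding layers plus each student layer vs. its teacher block
    L_hid = mse(E_s, E_t) + sum(mse(hs, ht) for hs, ht in zip(H_s, H_t_blocks))
    L_pool = mse(h_s, h_t)                                    # shared pooling outputs
    L_soft = cross_entropy(softmax(z_t, t), softmax(z_s, t))  # distillation loss
    L_task = cross_entropy(y, softmax(z_s))                   # classification loss
    return L_hid + L_pool + L_soft + L_task

def teacher_loss(z_t, y):
    # the teacher is trained with the classification loss only
    return cross_entropy(y, softmax(z_t))
```

Both losses would be optimized jointly, with the shared pooling and dense layers receiving gradients from both the teacher and the student objectives.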

Momentum Distillation
In our approach, each Transformer layer in the student model corresponds to a block in the teacher model, and we expect them to have similar behaviors in learning hidden text representations. To help the student model better imitate the teacher model, we propose a momentum distillation method that injects the gradients of the teacher model into the student model as a gradient momentum to boost the learning of the student model. We denote the gradients of the j-th layer in the i-th block of the teacher model as g^t_{i,j}, which are computed by optimizing the teacher's training loss L_t via backward propagation. The gradients of the k-th layer in the student model are denoted as g^s_k, which are derived from L_s. We use the average of the gradients of each layer in the i-th block of the teacher model as the overall gradients of this block (denoted as g^t_i), which is formulated as:

g^t_i = (1/K) Σ_{j=1}^{K} g^t_{i,j}.

Motivated by the momentum mechanism (Qian, 1999; He et al., 2020), we combine the block gradients g^t_i with the gradients of the corresponding layer in the student model in a momentum manner, which is formulated as follows:

g^s_i ← g^s_i + β g^t_i,

where β is a momentum hyperparameter that controls the strength of the gradient momentum of the teacher model. In this way, the teacher's gradients are explicitly injected into the student model, which has the potential to better guide the learning of the student by pushing each layer in the student model to have a similar function to the corresponding block in the teacher model.
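The momentum update above can be sketched in a few lines of numpy. This illustrates only the update rule, not the training loop; in practice it would be applied to per-layer gradient tensors inside the optimizer step.

```python
import numpy as np

def block_gradient(teacher_layer_grads):
    """Average the gradients of the K teacher layers in one block: g_i^t."""
    return np.mean(np.asarray(teacher_layer_grads), axis=0)

def momentum_update(g_student, teacher_layer_grads, beta=0.1):
    """Combine the student layer gradient with the teacher block gradient:
    g_i^s <- g_i^s + beta * g_i^t."""
    return np.asarray(g_student) + beta * block_gradient(teacher_layer_grads)
```

For example, with a block of K = 2 teacher layers whose gradients are [2, 2] and [4, 4], the block gradient is [3, 3]; with β = 0.1, a student gradient of [1, 1] becomes [1.3, 1.3].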

Applications of NewsBERT for News Intelligence
In this section, we briefly introduce the applications of NewsBERT in other news intelligence scenarios like personalized news recommendation. An illustrative framework of news recommendation is shown in Fig. 2, which is a two-tower framework.
The input is a sequence of a user's T historically clicked news articles (denoted as [D_1, D_2, ..., D_T]) and a candidate news article D_c, and the output is the click probability score ŷ, which can be further used for personalized news ranking and display. We first use a shared NewsBERT model to encode each clicked news article and the candidate news article into their hidden representations [h_1, h_2, ..., h_T] and h_c. Then, we use a user encoder to capture user interest from the representations of clicked news and obtain a user embedding u. The final click probability score is predicted by matching the user embedding u and h_c via a click predictor, which can be implemented by the inner product function. In this framework, the teacher and student NewsBERT models generate news embeddings separately, while the user encoder and click predictor are shared between the teacher and student models to generate the prediction scores, which are further constrained by the distillation loss function. In addition, the MSE hidden losses are simultaneously applied to all news embeddings generated by the shared NewsBERT model and the user embedding u generated by the user encoder, which encourages the student model to be similar to the teacher model in supporting user interest modeling.
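The scoring path of this two-tower setup can be sketched as follows. This is a numpy illustration under simplifying assumptions: the user encoder is reduced to an attentive pooling over clicked-news embeddings with a stand-in query vector q, and the click predictor is the inner product described above.

```python
import numpy as np

def click_score(clicked_news_embs, candidate_emb, q):
    """Predict a click score from clicked-news embeddings [h_1..h_T] and h_c.

    clicked_news_embs: (T, dim) array of clicked-news embeddings
    candidate_emb:     (dim,) candidate news embedding h_c
    q:                 (dim,) stand-in attention query of the user encoder
    """
    scores = clicked_news_embs @ q         # attention logits over the history
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()            # attention weights
    u = alpha @ clicked_news_embs          # user embedding u
    return float(u @ candidate_emb)        # inner-product click predictor
```

With two identical clicked-news embeddings [1, 0], the user embedding is [1, 0], so a candidate embedding [2, 0] scores 2.0; in training, both the teacher's and the student's scores would pass through this shared scorer before the distillation loss is applied.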

Datasets and Experimental Settings
We conduct experiments on two real-world datasets. The first dataset is MIND, a large-scale public news recommendation dataset. It contains the news impression logs of 1 million users on the Microsoft News website over 6 weeks (from 10/12/2019 to 11/22/2019). We used this dataset for learning and distilling our NewsBERT model in the news topic classification and personalized news recommendation tasks. The logs of the first 5 weeks were used for training and validation, and the rest were reserved for testing. Since many news articles appear in multiple dataset splits, in the news topic classification task we only used articles that do not appear in the training and validation sets for testing. The second dataset is a news retrieval dataset (named NewsRetrieval), which was sampled from the logs of the Bing search engine from 07/31/2020 to 09/13/2020. It contains the search queries of users and the corresponding clicked news. On this dataset, we finetuned models distilled on MIND to measure their cross-task performance in news retrieval. We used the logs in the first month for training, the logs in the next week for validation, and the rest for testing. The statistics of the two datasets are summarized in Table 1. In our experiments, motivated by (Chi et al., 2021), we used the first 8 layers of the pre-trained UniLM (Bao et al., 2020) model (specifically, UniLM V2) as the teacher model, and we used the parameters of its first 1, 2 or 4 layers to initialize student models with different capacities. In the news recommendation task, the user encoder was implemented by an attentive pooling layer, and the click predictor was implemented by the inner product. The query vectors in all attentive pooling layers were 256-dimensional. We used Adam (Kingma and Ba, 2015) as the model optimizer, and the learning rate was 3e-6. The temperature value t was set to 1. The batch size was 32. The dropout (Srivastava et al., 2014) ratio was 0.2.
The gradient momentum hyperparameter β was set to 0.1 and 0.15 in the news topic classification task and the news recommendation task, respectively. These hyperparameters were tuned on the validation set. Since the topic categories in MIND are imbalanced, we used accuracy and the macro-F1 score (denoted as macro-F) as the metrics for the news topic classification task. Following the MIND benchmark, we used the AUC, MRR, nDCG@5 and nDCG@10 scores to measure the performance of news recommendation models. On the news retrieval task, we used AUC as the main metric. We independently repeated each experiment 5 times and report the average results.

Performance Evaluation
In this section, we compare the performance of our NewsBERT approach with several baseline methods, including: (1) Glove (Pennington et al., 2014), a widely used pre-trained word embedding. We used Glove to initialize the word embeddings in a Transformer (Vaswani et al., 2017) model for news topic classification and in the NRMS (Wu et al., 2019d) model for news recommendation. (2) BERT (Devlin et al., 2019), a popular PLM with bidirectional Transformers. We compare the performance of the 12-layer BERT-Base model and its first 8 layers.
(3) UniLM (Bao et al., 2020), a unified language model for natural language understanding and generation, which is the teacher model in our approach. We also compare its 12-layer version and its variants using the first 1, 2, 4, or 8 layers.
(4) TwinBERT (Lu et al., 2020), a method to distill PLMs for document retrieval. For a fair comparison, we used the same UniLM model as in our approach. (5) TinyBERT, a state-of-the-art two-stage knowledge distillation method for PLM compression. We compare the performance of the officially released 4-layer and 6-layer TinyBERT models distilled from BERT-Base and the performance of student models with 1, 2, and 4 layers distilled from the UniLM model. Table 2 shows the performance of all compared methods in the news topic classification and news recommendation tasks. From the results, we have the following observations. First, compared with the Glove baseline, the methods based on PLMs achieve better performance. This shows that the contextualized word representations generated by PLMs are more informative for language modeling. Second, by comparing the results of BERT and UniLM (both the 8- and 12-layer versions), we find that UniLM-based models perform better in both tasks. This shows that UniLM is stronger than BERT in modeling news texts, which is why we used UniLM for learning and distilling our models. Third, compared with BERT-12 and UniLM-12, their variants using the first 8 layers perform better. This may be because the top layers in PLMs are adjusted to fit the self-supervision tasks (e.g., masked token prediction) while the hidden representations of intermediate layers have better generalization ability, which is also validated by (Chi et al., 2021). Fourth, compared with TwinBERT, the results of TinyBERT and NewsBERT are usually better. This may be because the TwinBERT method only distills the student model from the teacher's output soft labels, while the other two methods also align the hidden representations learned by intermediate layers, which helps the student model better imitate the teacher model.
Fifth, our NewsBERT approach outperforms all the other baseline methods, and our further t-test results show that the improvements are significant at p < 0.01 (comparing models with the same number of layers). This is because our approach employs a teacher-student joint learning and distillation framework in which the student can learn from the learning process of the teacher, which helps the student extract useful knowledge from the teacher model. In addition, our approach uses a momentum distillation method that injects the gradients of the teacher model into the student model in a momentum way, which helps each layer in the student model better imitate the corresponding part of the teacher model. Thus, our approach achieves better performance than the other distillation methods. Sixth, NewsBERT achieves satisfactory results that are even comparable with those of the original PLM. For example, there is only a 0.24% accuracy gap between NewsBERT-4 and the teacher model in the topic classification task. In addition, the student models are much smaller than the original 12-layer model, and their training and inference speeds are much faster (e.g., about a 12.0x speedup for the one-layer NewsBERT). Thus, our approach has the potential to empower various intelligent news applications in an efficient way.
Next, to validate the generalization ability of our approach, we evaluate the performance of NewsBERT on an additional news retrieval task. We used the NewsBERT model learned in the news recommendation task and finetuned it with the labeled news retrieval data in the two-tower framework used by TwinBERT (Lu et al., 2020). We compared its performance with several methods, including fine-tuning the general UniLM model and the TwinBERT and TinyBERT models distilled in the news recommendation task. The results are shown in Fig. 3, from which we have several findings. First, directly fine-tuning the generally pre-trained UniLM model is worse than using the models distilled in the news recommendation task. This is probably because language models are usually pre-trained on general corpora like Wikipedia, which have some domain shift from the news domain. Thus, generally pre-trained language models may not be optimal for intelligent news applications. Second, our NewsBERT approach also achieves better cross-task performance than TinyBERT and TwinBERT. This shows that our approach is more suitable for distilling PLMs for intelligent news applications than these methods.

Effectiveness of Teacher-Student Joint Learning and Distillation Framework
In this section, we conduct experiments to validate the advantage of our proposed teacher-student joint learning and distillation framework over conventional methods that learn the teacher and student models successively (Hinton et al., 2015). We first compare the performance of the student models under our framework and their variants learned in a disjoint manner. The results are shown in Fig. 4. We find that our proposed joint learning and distillation framework consistently improves the performance of student models with different capacities. This is because in our approach the student model can learn from the useful experience evoked by the learning process of the teacher model, and the teacher model is also aware of the student's learning status. In the disjoint learning framework, by contrast, the student can only learn from the outputs of a static teacher. Thus, learning the teacher and student models successively may not be optimal for distilling a high-quality student model.
We also explore the influence of the teacher-student joint learning and distillation framework on the teacher model. We compare the performance of the original UniLM-8 model (without a student) and its variants that serve as the teacher model for distilling the NewsBERT-1, NewsBERT-2 and NewsBERT-4 student models. The results are shown in Fig. 5. We find a very interesting phenomenon: the performance of some teacher models is better than that of the original UniLM-8 model, which does not participate in the joint learning and distillation framework. This may be because the teacher model also benefits from the useful knowledge encoded by the student model. These results show that our teacher-student joint learning and distillation framework can help learn the teacher and student models reciprocally, which may improve the performance of both.

Ablation Study
In this section, we conduct experiments to validate the effectiveness of several core techniques in our approach, including the hidden loss, the distillation loss and the momentum distillation method. We compare the performance of NewsBERT and its variants with one of these components removed.
The results are shown in Fig. 6. We find that the momentum distillation method plays a critical role in our method, because the performance declines considerably when it is removed. This may be because the gradients of the teacher model condense the knowledge and experience obtained from its learning process, which can better teach the student model to have a similar function to the teacher model and thereby yields better performance. In addition, the distillation loss function is also important for our approach. This is because the distillation loss regularizes the output of the student model to be similar to that of the teacher model, which encourages the student model to behave similarly to the teacher model. Besides, the hidden loss functions are also useful for our approach. This may be because the hidden loss functions align the hidden representations learned by the teacher and student models, which helps the student model imitate the teacher.

Hyperparameter Analysis
In this section, we conduct experiments to study the influence of the gradient momentum hyperparameter β on model performance. We vary the value of β from 0 to 0.3, and the results are shown in Fig. 7. We observe that the performance is not optimal when the value of β is too small. This is because the gradient momentum is too weak under a small β, and the useful experience from the teacher model cannot be effectively exploited. However, the performance starts to decline when β is relatively large (e.g., β > 0.2). This is because the gradients of the teacher model inevitably have some inconsistency with the gradients of the student model, and a large gradient momentum may cause the student model's updates to deviate from the appropriate direction. Thus, a moderate value of β between 0.1 and 0.2 is recommended.

Conclusion
In this paper, we propose a knowledge distillation approach named NewsBERT to compress pre-trained language models for intelligent news applications. We propose a teacher-student joint learning and distillation framework to collaboratively train both teacher and student models, where the student model can learn from the learning experience of the teacher model and the teacher model is aware of the learning status of the student model. In addition, we propose a momentum distillation method that combines the gradients of the teacher model with the gradients of the student model in a momentum way, which can boost the learning of the student model by injecting the knowledge learned by the teacher. We conduct extensive experiments on two real-world datasets with three different news intelligence tasks. The results show that our NewsBERT approach can effectively improve the performance of these tasks with considerably smaller models.