Efficient-FedRec: Efficient Federated Learning Framework for Privacy-Preserving News Recommendation

News recommendation is critical for personalized news access. Most existing news recommendation methods rely on centralized storage of users’ historical news click behavior data, which may lead to privacy concerns and hazards. Federated Learning is a privacy-preserving framework for multiple clients to collaboratively train models without sharing their private data. However, the computation and communication costs of directly learning many existing news recommendation models in a federated way are unacceptable for user clients. In this paper, we propose an efficient federated learning framework for privacy-preserving news recommendation. Instead of training and communicating the whole model, we decompose the news recommendation model into a large news model maintained on the server and a light-weight user model shared by both the server and the clients, where news representations and the user model are communicated between the server and the clients. More specifically, the clients request the user model and news representations from the server, and send their locally computed gradients to the server for aggregation. The server updates its global user model with the aggregated gradients, and further updates its news model to infer updated news representations. Since the local gradients may contain private information, we propose a secure aggregation method to aggregate gradients in a privacy-preserving way. Experiments on two real-world datasets show that our method can reduce the computation and communication cost on clients while keeping promising model performance.


Introduction
With the explosion of online information, the large quantity of news generated every day may overwhelm users and make it difficult for them to find the news they are interested in. To tackle this problem, many news recommendation methods (Wu et al., 2019a; Qi et al., 2021c) have been proposed to display news according to users' personalized interests. These methods are usually composed of two core modules, i.e., a user model and a news model. The user model is used to learn user representations from users' historical click behaviors. For example, Wang et al. (2018) use a candidate-aware attention network as the user model to help capture user interests in candidate news. The news model is used to learn news representations from news content. For example, Wu et al. (2019c) apply a multi-head self-attention network to capture the interactions between words in the news model. With the success of pre-trained language models (PLM) in NLP, a few PLM-empowered news recommendation methods have been proposed and achieve remarkable performance. For example, PLM-NR applies pre-trained language models to enhance news modeling. However, these methods require centralized storage of user behaviors, which are highly privacy-sensitive (Shin et al., 2018). Collecting private user data has raised many concerns (Wu et al., 2019d). Moreover, due to the adoption of data protection regulations such as GDPR, it may not be possible to analyze centralized user data in the future.
Federated learning is a privacy-preserving method to train models on private data decentralized across a large number of clients. In federated learning, each user keeps a local copy of the model and computes local model gradients with their local private data. A central server coordinates the clients and aggregates the local gradients to update the global model. Recently, Qi et al. (2020) proposed the FedRec method to train news recommendation models using federated learning. However, the model sizes of many existing news recommendation methods are large, especially their news models. For example, PLM-NR has 110.7M parameters in total, 110M of which are in the news model (BERT-Base version). Thus, the communication and computation costs of FedRec can be too high for clients with rather limited computation resources.
In this paper, we propose an efficient federated learning framework for privacy-preserving news recommendation named Efficient-FedRec. In our framework, we decompose the news recommendation model into a large news model and a light-weight user model. Instead of training and communicating the whole model, in our approach the clients only request the user model and the representations of the news involved in their local behaviors from the server. The clients locally compute the gradients of the user model and news representations on their local data, and send them to the server for aggregation. The central server uses the aggregated user model gradients to update its global user model, and updates the news model based on the aggregated news representation gradients. The updated news model is further used to infer updated news representations. The above process is repeated for multiple rounds until the model converges. In order to protect user privacy in model training, we develop a secure aggregation protocol based on the multi-party computation framework for privacy-preserving gradient aggregation. To protect the click history of a specific user, we exchange the news representations of the union news set involved in the behaviors of a group of users. We conduct extensive experiments on two real-world datasets, and the results show that our approach can effectively reduce the computation and communication cost on clients for federated news recommendation model training.
The main contributions of this work include:
• We propose an efficient federated learning framework for privacy-preserving news recommendation, which can effectively reduce the computation and communication cost on the user side.
• We develop an effective and efficient secure aggregation protocol to protect user privacy in model training.
• We conduct thorough experiments on two real-world datasets to verify the effectiveness and efficiency of our approach.

Neural News Recommendation
Personalized news recommendation is an important technique to alleviate information overload and improve the user reading experience. Many deep learning based recommendation methods have been proposed (Wu et al., 2019b; Okura et al., 2017; Zhu et al., 2019; Qi et al., 2021b,a). They usually contain two core modules, i.e., a user model and a news model. For example, LSTUR uses a CNN to learn contextual word embeddings and an attention layer to select informative words. It combines the long-term and short-term interests of users by using user ID embeddings and a GRU network in the user model. These methods learn news representations with shallow NLP models, which struggle to capture the semantic information of news. Recently, pre-trained language models (PLM) have achieved great success in NLP (Devlin et al., 2019; Bao et al., 2020), and a few PLM-empowered news recommendation methods have been proposed. For example, PLM-NR empowers news modeling by applying pre-trained language models. It replaces the news encoder in previous methods with a pre-trained language model, and obtains stable improvements on the news recommendation task. However, all the above methods train models on centralized data, which is highly privacy-sensitive. Such collection and analysis of private data have led to privacy concerns and risks (Shin et al., 2018). Besides, the adoption of data protection regulations such as GDPR places restrictions and high pressure on news platforms that use user data, in order to prevent user data leakage. Different from these methods, we do not use centralized storage for training in our framework, which can better preserve user privacy.

Federated Learning
Federated Learning (McMahan et al., 2017) is an effective method for privacy-preserving model training. It enables several users to collaboratively train models without sharing their data with a central server. In federated learning, users first request the latest model from the central server, and compute local gradients with their local private data. The central server aggregates the gradients to update the global model and distributes the updated global model to the users' local devices. Since the local gradients may leak some private information of users (Bhowmick et al., 2018; Melis et al., 2019), several privacy protection methods are applied, such as secure multi-party computation (MPC) (Knott et al., 2020), differential privacy (DP) (Ren et al., 2018), and homomorphic encryption (HE) (Aono et al., 2017).
Recently, several works have proposed to leverage federated learning in recommendation scenarios. Ammad et al. (2019) propose federated collaborative filtering (FCF). In FCF, users use their private rating data to compute the gradients of user embeddings and item embeddings. The user embeddings are updated locally with the gradients of user embeddings, and the gradients of item embeddings are aggregated to update the global item embeddings. Chai et al. (2020) propose secure federated matrix factorization (FMF). FMF is similar to FCF but updates user embeddings and item embeddings according to a matrix factorization algorithm. However, FCF and FMF are not suitable for news recommendation scenarios, since they represent items with ID embeddings and a large amount of fresh news is generated every day. Qi et al. (2020) propose FedRec, a privacy-preserving method for news recommendation model training. In FedRec, users use their local data to compute the gradients of the model parameters. A group of randomly sampled users send their local gradients to the central server to update the global model. However, the communication and computation cost of FedRec is unacceptable for user devices with limited resources due to the large size of news recommendation models, especially their news models. In this paper, we propose Efficient-FedRec to reduce the overhead on clients. We decompose the news recommendation model into a large news model maintained on the server and a light-weight user model shared between the clients and the server. Only the user model and a small number of news representations are communicated.

Methodology
In this section, we introduce our Efficient-FedRec method for privacy-preserving news recommendation. We first introduce the problem formulation and the news recommendation framework. Then we introduce the details of our Efficient-FedRec framework. The details of secure aggregation are presented in the last subsection.

Problem Formulation
Denote $U = \{u_1, u_2, ..., u_P\}$ as the user set, where $P$ is the number of users. Given a user $u$, his private behaviors $B_u$ are locally stored on his devices. In our approach, we denote all news in the behaviors of user $u$ as $N_u$. The news recommendation model is decomposed into a news model with parameter set $\Theta_n$ and a user model with parameter set $\Theta_u$. The server maintains the news model, generates news representations with parameter set $\Theta_e$, and keeps a global user encoder. The goal is to collaboratively train an accurate news recommendation model without leaking users' private information.

News Recommendation Framework
In this subsection, we introduce the news recommendation framework, which is shown in Figure 1. It is composed of two core modules, i.e., a news model and a user model.
News Model Given a news $n$, the news model is used to learn the news representation $\mathbf{n}$ from the news content. It can be implemented by various model structures. Several existing news recommendation methods use shallow NLP models. Wu et al. (2019c) use a combination of a multi-head self-attention network and an additive attention network, and An et al. (2019) use a combination of a CNN and an additive attention network. With the success of pre-trained language models (PLM) in NLP, a few methods have started to apply pre-trained language models in the news model. PLM-NR uses a combination of a pre-trained language model and an additive attention network as the news model. In our Efficient-FedRec, we apply the news model of PLM-NR.
User Model The user model is used to learn user representations from the news that users have historically clicked. Taking the representations of a user's historical clicked news $[\mathbf{n}_1, \mathbf{n}_2, ..., \mathbf{n}_M]$ as input, the user model computes the user representation $\mathbf{u}$ as output. It can be implemented by several model structures. Wang et al. (2018) use candidate-aware attention, and LSTUR combines user ID embeddings and a GRU network. In our Efficient-FedRec, we apply the user model of NRMS (Wu et al., 2019c), which uses a combination of a multi-head self-attention network and an additive attention network.
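As a rough illustration of how a light-weight user model turns clicked-news representations into a click score, the following sketch pools the clicked news with additive attention and scores a candidate with a dot product. It is a simplification under stated assumptions: the multi-head self-attention layer of NRMS is omitted, and the names `additive_attention`, `click_score`, `W`, `b` and `q` as well as all numeric values are hypothetical.

```python
import math

def additive_attention(vectors, W, b, q):
    """Pool vectors with weights softmax(q . tanh(W v + b)); returns (pooled, weights)."""
    scores = []
    for v in vectors:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, v)) + b_r)
                  for row, b_r in zip(W, b)]
        scores.append(sum(q_r * h_r for q_r, h_r in zip(q, hidden)))
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

def click_score(user_vec, news_vec):
    """Dot-product click predictor."""
    return sum(u * n for u, n in zip(user_vec, news_vec))

# Toy 2-dimensional news representations (illustrative values only).
clicked = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.5], [0.1, 0.2]]  # hypothetical attention projection
b = [0.0, 0.0]
q = [1.0, 1.0]                 # hypothetical attention query vector
user_vec, weights = additive_attention(clicked, W, b, q)
score = click_score(user_vec, [0.5, 0.5])
```

Only the small attention parameters and the pooled vectors live on the client in such a design; the expensive text encoder that produces the news representations is not needed to evaluate this function.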

Framework of Efficient-FedRec
In this subsection, we introduce the framework of our Efficient-FedRec. Each user who participates in model training is called a client. In our framework, client behaviors are locally stored on their devices, which prevents the risk of data leakage. Since the data of a single user is not enough to train an accurate news recommendation model, our framework enables multiple clients to collaboratively train a news recommendation model. To lower the communication and computation overhead on the client side, we decompose the news recommendation model into a large news model maintained on the server and a light-weight user model shared by both the server and the clients. At the t-th round, the model updating contains four steps, i.e., distributing the user model and news representations, training the local user model and news representations, gradient aggregation, and global model updating. The framework of our Efficient-FedRec is shown in Figure 2.
The first step is distributing the user model and news representations. Since the news model is heavy and users only need the news representations to predict click scores, in our framework users request the user model and a small number of news representations from the central server instead of the whole model. However, directly requesting the representations of the news in the user behaviors $N_u$ would leak private user information. In our work, we randomly sample a group of clients, who exchange the representations of the union news set involved in the behaviors of the group through a secure aggregation protocol (introduced in Section 3.4). Denoting the group of clients as $U_s = \{u_1, u_2, ..., u_s\}$, the union news set is computed as $N_s = \bigcup_{u_i \in U_s} N_i$. Thus the server only knows the news accessed by a group of clients. Finally, users keep a local copy of the user model $\Theta^t_u$ and the news representations of the union news set $\Theta^t_{es}$. It is noted that the news representations of the union news set are much smaller than the news model (analyzed in Section 4.3), which alleviates the communication cost.
The second step is training the local user model and news representations. Given a client $u$, we use the representations of his historical clicked news to compute the user representation $\mathbf{u}$ through the local user encoder. For a candidate news $n_c$, we use the candidate news representation $\mathbf{n}_c$ and the user representation $\mathbf{u}$ to compute a click score $s$ through a click predictor, which is a dot product in our framework. Following previous work (Wu et al., 2019c; Qi et al., 2020), we utilize the categorical cross-entropy loss for training. More specifically, for every clicked candidate news, we sample $K$ non-clicked news in the same impression. Denoting the label of the $i$-th news for user $u$ as $y_i$ and its prediction score as $s_i$, the loss of a training sample $b$ is computed as follows:
$$\ell_b = -\sum_{i=1}^{K+1} y_i \log\left(\frac{\exp(s_i)}{\sum_{j=1}^{K+1} \exp(s_j)}\right). \tag{1}$$
The final loss is the average loss of all training samples in $B_u$, which is computed as follows:
$$\mathcal{L}_u = \frac{1}{|B_u|} \sum_{b \in B_u} \ell_b. \tag{2}$$
Denote the local gradients of the user encoder as $g^t_{vu}$ and the local gradients of the news representations as $g^t_{esu}$, which are computed as follows:
$$g^t_{vu} = \frac{\partial \mathcal{L}_u}{\partial \Theta^t_u}, \quad g^t_{esu} = \frac{\partial \mathcal{L}_u}{\partial \Theta^t_{es}}. \tag{3}$$
In this step, since clients only compute the user model, the computation cost on clients is alleviated.
The third step is gradient aggregation. The server needs to compute the weighted sum of the gradients of the user model and of the news representations, together with the sample number, over the randomly sampled user group $U_s$. Since the local gradients may contain some private information (Bhowmick et al., 2018; Melis et al., 2019), we apply secure aggregation to compute the summations (introduced in Section 3.4). The aggregated gradients of the user model and news representations are denoted as $g^t_v$ and $g^t_{es}$, which are formulated as follows:
$$g^t_v = \frac{\sum_{u \in U_s} |B_u| \, g^t_{vu}}{\sum_{u \in U_s} |B_u|}, \quad g^t_{es} = \frac{\sum_{u \in U_s} |B_u| \, g^t_{esu}}{\sum_{u \in U_s} |B_u|}. \tag{4}$$
It is noted that each user only sends the gradients of the news representations in the union news set $N_s$, which are much smaller than the news model.
The fourth step is global model updating. The global user model is updated with FedAdam (Reddi et al., 2021) as follows:
$$m^t = \beta_1 m^{t-1} + (1 - \beta_1) g^t_v, \quad v^t = \beta_2 v^{t-1} + (1 - \beta_2) (g^t_v)^2, \quad \Theta^{t+1}_u = \Theta^t_u - \eta \frac{m^t}{\sqrt{v^t} + \tau}, \tag{5}$$
where $\eta$ is the learning rate, and $\beta_1$, $\beta_2$ and $\tau$ are parameters of FedAdam. The news model is updated through a backpropagation training process.
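The training loss of the second step and the FedAdam update of the user model can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: `sample_loss` and `fedadam_step` are hypothetical names, the default hyper-parameter values are illustrative, and the update is written for a gradient (hence the minus sign) in the form given by Reddi et al. (2021) without bias correction.

```python
import math

def sample_loss(scores, labels):
    """Categorical cross-entropy over one clicked news and K non-clicked news."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return -sum(y * (s - log_norm) for s, y in zip(scores, labels))

def fedadam_step(theta, grad, m, v, lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """One server-side adaptive update from an aggregated gradient (element-wise)."""
    new_m = [beta1 * m_i + (1 - beta1) * g for m_i, g in zip(m, grad)]
    new_v = [beta2 * v_i + (1 - beta2) * g * g for v_i, g in zip(v, grad)]
    new_theta = [t - lr * m_i / (math.sqrt(v_i) + tau)
                 for t, m_i, v_i in zip(theta, new_m, new_v)]
    return new_theta, new_m, new_v

# One clicked news (label 1) with K = 4 sampled non-clicked news.
scores = [2.0, 0.5, 0.1, -0.3, 0.0]
labels = [1, 0, 0, 0, 0]
loss = sample_loss(scores, labels)

# One update of a single toy parameter from an aggregated gradient of 2.0.
theta, m, v = fedadam_step([1.0], [2.0], [0.0], [0.0])
```

With uniform scores the loss equals log(K + 1), which is a convenient sanity check when wiring up negative sampling.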
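For each news in the union news set $n_i \in N_s$, the central server has its content and the gradients of its news representation $g^t_{e_i} \in g^t_{es}$. We use the news content as input and compute its news representation $\mathbf{n}_i$ through the news model. The gradients of the news model $g^t_n$ are computed as follows:
$$g^t_n = \sum_{n_i \in N_s} \left(\frac{\partial \mathbf{n}_i}{\partial \Theta^t_n}\right)^{\top} g^t_{e_i}, \tag{6}$$
where $\Theta^t_n$ denotes the parameters of the news model at the t-th round. We use the Adam optimizer to update the news model, which is computed as follows:
$$m^t_n = \beta_1 m^{t-1}_n + (1 - \beta_1) g^t_n, \quad v^t_n = \beta_2 v^{t-1}_n + (1 - \beta_2) (g^t_n)^2, \quad \Theta^{t+1}_n = \Theta^t_n - \eta \frac{m^t_n}{\sqrt{v^t_n} + \tau}, \tag{7}$$
where $\eta$ is the learning rate, and $\beta_1$, $\beta_2$ and $\tau$ are hyper-parameters of Adam. We further use the updated news model to infer news representations. Finally, the updated news representations and user encoder are distributed to all clients.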
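The server-side news model update is an application of the chain rule: the aggregated gradient of each news representation acts as the upstream gradient for the news model parameters. The sketch below makes this concrete for a toy linear news model $\mathbf{n}_i = W x_i$, which is purely an assumption for illustration (the framework uses a pre-trained language model); `news_model_forward` and `news_model_grad` are hypothetical names.

```python
def news_model_forward(W, x):
    """Toy linear news model: representation n = W x."""
    return [sum(w * x_c for w, x_c in zip(row, x)) for row in W]

def news_model_grad(W, contents, rep_grads):
    """Chain rule for n_i = W x_i: dL/dW = sum_i outer(g_e_i, x_i)."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x, g in zip(contents, rep_grads):
        for r, g_r in enumerate(g):
            for c, x_c in enumerate(x):
                grad[r][c] += g_r * x_c
    return grad

W = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]         # 2-dim representations, 3-dim content
contents = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]  # content features of two news
rep_grads = [[1.0, 0.0], [0.0, 2.0]]           # aggregated gradients of their representations
grad_W = news_model_grad(W, contents, rep_grads)
```

In an automatic-differentiation framework the same effect is typically obtained by recomputing the representations on the server and backpropagating with the received gradients as the upstream gradient, for example via the `gradient` argument of PyTorch's `Tensor.backward`.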

Secure Aggregation
In this subsection, we first introduce the secure aggregation proposed by Bonawitz et al. (2017), and then introduce how we apply it in our framework for secure gradient aggregation and news representation distribution. Secure aggregation is mainly based on multi-party computation (MPC). It aims to let the central server compute a weighted sum of vectors without accessing the local vector of each client in the federated learning scenario. Denoting the local vectors of the clients as $\{v_1, v_2, ..., v_n\}$, secure aggregation computes $v = \sum_{i=1}^{n} v_i$ in a privacy-preserving way. Meanwhile, it handles the user dropout problem on mobile devices.
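The core idea behind this kind of secure aggregation is pairwise masking: every pair of clients shares a random mask that one adds and the other subtracts, so the masks cancel in the sum while each individual upload looks random. The following is a heavily simplified sketch of that idea only. It assumes integer vectors, derives the pair masks from a shared `round_seed` as a stand-in for the key agreement of the real protocol, and omits the secret sharing that handles dropped clients; all function names are hypothetical.

```python
import random

MODULUS = 2 ** 32  # all arithmetic is done modulo a fixed integer

def pairwise_mask(i, j, dim, round_seed):
    """Mask shared by clients i and j (stand-in for a key-agreement-derived seed)."""
    rng = random.Random(f"{round_seed}:{min(i, j)}:{max(i, j)}")
    return [rng.randrange(MODULUS) for _ in range(dim)]

def mask_vector(i, vector, client_ids, round_seed):
    """Client i adds the mask for every larger id and subtracts it for every smaller id."""
    masked = list(vector)
    for j in client_ids:
        if j == i:
            continue
        mask = pairwise_mask(i, j, len(vector), round_seed)
        sign = 1 if i < j else -1
        masked = [(x + sign * m) % MODULUS for x, m in zip(masked, mask)]
    return masked

def server_sum(masked_vectors):
    """The server only adds the masked uploads; the pairwise masks cancel out."""
    dim = len(masked_vectors[0])
    return [sum(vec[d] for vec in masked_vectors) % MODULUS for d in range(dim)]

local = {0: [1, 2, 3], 1: [10, 20, 30], 2: [100, 200, 300]}
uploads = [mask_vector(i, vec, list(local), round_seed=7) for i, vec in local.items()]
total = server_sum(uploads)
```

Real-valued gradients would first have to be quantized to integers in a fixed range, which is a design choice outside the scope of this sketch.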
As introduced in Section 3.3, we use secure aggregation twice. The first time is to compute the union news set $N_s$ of a group of users. Given a user $u_i$, we first transform his local news set $N_i$ into a local vector $h_i$, whose dimension equals the number of all news. The vector $h_i$ is defined as follows:
$$h_i^j = \begin{cases} 1, & n_j \in N_i, \\ 0, & \text{otherwise}, \end{cases} \tag{8}$$
where $h_i^j$ is the $j$-th dimension of $h_i$, and $n_j$ is the $j$-th news in the total news set. We apply secure aggregation to compute the sum of the vectors $h = \sum_{u_i \in U_s} h_i$. The inverse transformation of Eq. (8) is used to compute the union news set $N_s$ from $h$. The sampled group of users then requests the news representations in the union news set $N_s$ from the central server.
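The encoding of local news sets as indicator vectors and the recovery of the union from their sum can be illustrated as follows. The helper names `to_indicator` and `union_from_sum` and the toy news identifiers are hypothetical, and a plain element-wise sum stands in for the secure aggregation step.

```python
def to_indicator(local_news, all_news):
    """h_i: 1 at position j if the j-th news is in the user's local set, else 0."""
    local = set(local_news)
    return [1 if news_id in local else 0 for news_id in all_news]

def union_from_sum(summed, all_news):
    """Inverse transformation: a news belongs to the union if its summed entry is positive."""
    return {news_id for news_id, count in zip(all_news, summed) if count > 0}

all_news = ["n1", "n2", "n3", "n4"]
group = [["n1", "n2"], ["n2", "n4"]]  # local news sets of two sampled users
vectors = [to_indicator(local, all_news) for local in group]
summed = [sum(column) for column in zip(*vectors)]  # plain sum in place of secure aggregation
union = union_from_sum(summed, all_news)
```

Because only the summed vector reaches the server, the server learns which news the group touched but not which member touched which news.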
The second time is to securely aggregate gradients. Each user flattens their local weighted gradients of the news representations $|B_u| \cdot g^t_{esu}$, their local weighted gradients of the user model $|B_u| \cdot g^t_{vu}$, and their sample number $|B_u|$ into one vector, and applies secure aggregation to compute the summation. It is noted that only the news in the union news set has gradients of news representations.
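A possible layout for the flattened vector, and the way the server could turn the summed vector back into sample-weighted average gradients, is sketched below. The layout (user-model gradient, then one slot per news in the union set, then the sample count), the zero-filling of news a user did not interact with, and the names `pack` and `unpack_average` are assumptions made for illustration; a plain sum again stands in for secure aggregation.

```python
def pack(user_grad, news_grads, num_samples, union_news, rep_dim):
    """Flatten |B_u| * g_user, |B_u| * g_news (zeros for absent news) and |B_u|."""
    flat = [num_samples * g for g in user_grad]
    for news_id in union_news:
        g = news_grads.get(news_id, [0.0] * rep_dim)
        flat.extend(num_samples * x for x in g)
    flat.append(float(num_samples))
    return flat

def unpack_average(summed, user_dim, union_news, rep_dim):
    """Divide the summed weighted gradients by the summed sample count."""
    total = summed[-1]
    user_grad = [x / total for x in summed[:user_dim]]
    news_grads = {}
    for k, news_id in enumerate(union_news):
        start = user_dim + k * rep_dim
        news_grads[news_id] = [x / total for x in summed[start:start + rep_dim]]
    return user_grad, news_grads

union_news = ["a", "b"]
client_1 = pack([1.0], {"a": [1.0, 0.0]}, 1, union_news, rep_dim=2)
client_2 = pack([4.0], {"b": [2.0, 2.0]}, 3, union_news, rep_dim=2)
summed = [x + y for x, y in zip(client_1, client_2)]  # plain sum in place of secure aggregation
avg_user_grad, avg_news_grads = unpack_average(summed, 1, union_news, rep_dim=2)
```

Sending the sample count inside the same vector means a single aggregation call yields both the numerator and the denominator of the weighted average.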

Experiments
In this section, we demonstrate the efficiency and effectiveness of our Efficient-FedRec. We conduct several experiments to answer the following research questions:
• RQ1: How does our method perform compared with baseline methods?
• RQ2: Are the communication and computation overhead significantly reduced compared with baseline methods?
• RQ3: How does the news model size influence the performance and overhead of our framework?
• RQ4: How does the user group size influence the risk of user information leakage and the effectiveness of our method?
• RQ5: How does the user number influence the performance of our framework?

Dataset and Experimental Settings
We conduct thorough experiments on two public datasets, i.e., MIND and Adressa. MIND (Wu et al., 2020b) is a public English news recommendation dataset. Adressa (Gulla et al., 2017) is publicly released by Adresseavisen, a local newspaper company in Norway. Following Qi et al. (2020) and Hu et al. (2020), we use the clicks of the 6th day to build the training dataset and construct historical clicks from the samples of the first 5 days. We randomly sample 20% of the clicks of the last day for validation and use the remaining clicks for testing. The historical clicks of the validation and testing datasets are constructed from the samples of the first 6 days. Since Adressa does not contain negative samples, we randomly sample 20 news for each click for testing. The detailed dataset statistics are summarized in Table 1. Following many previous news recommendation works (Wu et al., 2020b; Qi et al., 2020; Wu et al., 2020a), we use AUC, MRR, nDCG@5 and nDCG@10 as evaluation metrics. In our experiments, we apply BERT-Base (Devlin et al., 2019) for MIND and nb-bert-base (Kummervold et al., 2021) for Adressa to initialize the pre-trained language model in the news encoder. The dimension of news representations is 400. To mitigate overfitting, we apply dropout in the user model with a dropout rate of 0.2. The learning rate is 0.00005. The number of negative samples associated with each positive sample is 4. The user group size is 50 on both MIND and Adressa. All hyper-parameters are selected according to results on the validation set. We repeat each experiment 5 times independently, and report the average results with standard deviations.
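For readers unfamiliar with the ranking metrics, the sketch below computes AUC, MRR and nDCG@k for a single impression with binary click labels. It follows one common definition of these metrics (pairwise AUC with ties counted as one half, gain $2^y - 1$ and a $\log_2$ position discount for nDCG); the exact evaluation script used in the experiments is not specified here, so this is an assumption rather than a reproduction.

```python
import math

def auc(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ranked_labels(labels, scores):
    """Labels reordered by descending score."""
    return [y for _, y in sorted(zip(scores, labels), key=lambda pair: -pair[0])]

def mrr(labels, scores):
    """Mean reciprocal rank of the clicked news."""
    ranked = ranked_labels(labels, scores)
    return sum(y / (rank + 1) for rank, y in enumerate(ranked)) / sum(labels)

def ndcg_at_k(labels, scores, k):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    def dcg(ordered):
        return sum((2 ** y - 1) / math.log2(i + 2) for i, y in enumerate(ordered[:k]))
    return dcg(ranked_labels(labels, scores)) / dcg(sorted(labels, reverse=True))

labels = [0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.1, 0.3, 0.2]
impression_auc = auc(labels, scores)
impression_mrr = mrr(labels, scores)
impression_ndcg5 = ndcg_at_k(labels, scores, 5)
```

Dataset-level numbers are usually obtained by averaging these per-impression values.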

Performance Evaluation (RQ1)
In this section, we compare our Efficient-FedRec framework for privacy-preserving news recommendation with several baseline methods.
News recommendation methods with centralized storage:
(1) DFM (Lian et al., 2018), a multi-channel deep fusion model for news recommendation;
(2) DKN (Wang et al., 2018), a knowledge-aware news recommendation method;
(3) LSTUR, using user ID embeddings to capture long-term user interests and a GRU network to capture short-term interests;
(4) NAML (Wu et al., 2019a), learning news representations via multi-view learning;
(5) NRMS (Wu et al., 2019c), using two self-attention networks for better news and user modeling;
(6) CenRec (Qi et al., 2020), a centralized version of FedRec;
(7) PLM-NR, applying a pre-trained language model to empower the performance of news recommendation. For fair comparison, we use the user model in NRMS.
Privacy-preserving news recommendation methods:
(8) FCF (Ammad et al., 2019), federated collaborative filtering for recommendation;
(9) FedRec (Qi et al., 2020), a privacy-preserving method for news recommendation model training. For fair comparison, we do not add differential privacy;
(10) FedRec(BERT), applying FedRec to train PLM-NR in a privacy-preserving way.
Our method:
(11) Efficient-FedRec, using our Efficient-FedRec framework to train PLM-NR in a privacy-preserving and efficient way.
The experimental results of all these methods are shown in Table 2.
We have several observations from Table 2. First, compared with SOTA news recommendation methods with centralized storage (DKN, NAML, NRMS, LSTUR and PLM-NR), our Efficient-FedRec achieves comparable performance. Moreover, our method does not need users to share their behavior data. This validates that our method can train accurate news recommendation models while protecting user privacy. Second, our method performs better than FCF. This is because FCF is not suitable for news recommendation, since there are severe cold-start problems in the news recommendation scenario (Qi et al., 2020; Wu et al., 2020b). Third, our Efficient-FedRec outperforms FedRec. This is because we use a pre-trained language model in the news model, which helps to better understand the semantics of news content. Fourth, compared with FedRec(BERT), our Efficient-FedRec achieves comparable performance. This is because our method has the same gradients as FedRec(BERT) if dropout and batch normalization are not applied in the news model. Finally, FedRec(BERT) and Efficient-FedRec perform worse than PLM-NR, and FedRec performs worse than CenRec. This is probably because user behaviors are non-i.i.d., which may make it difficult for federated learning to achieve good results (Wang et al., 2019).

Efficiency Analysis (RQ2)
In this subsection, we analyze the communication and computation cost of our Efficient-FedRec on MIND. The average size of the union news set is 1,320 per round, whose gradient and parameter size is 1.06M. We assume users use CPUs for calculation. Figure 3 shows the average computation time and the communication overhead of each user per round for several privacy-preserving methods. From Figure 3, we have several observations. First, the average computation time of Efficient-FedRec is lower than those of FedRec and FedRec(BERT). This is because in our framework users do not need to compute the news model, which lowers the computation overhead. Second, the communication overhead of Efficient-FedRec is much lower than the overhead of FedRec and FedRec(BERT). This is because in our framework users only request the parameters and send the gradients of the user model and a small number of news representations, which are much smaller than the gradients and parameters of the whole model.
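The 1.06M figure is consistent with a simple count, assuming that it covers the 400-dimensional representations received for the union news set plus gradients of the same shape sent back. The short calculation below spells this out; the interpretation of the figure as "received values plus sent values" is an assumption, not a statement from the experiments.

```python
union_news_per_round = 1320  # average union news set size reported on MIND
rep_dim = 400                # dimension of news representations

received_values = union_news_per_round * rep_dim  # news representations from the server
sent_values = union_news_per_round * rep_dim      # gradients of those representations
total_values = received_values + sent_values

total_in_millions = total_values / 1e6
news_model_parameters = 110_000_000               # news model of the BERT-Base version
ratio = news_model_parameters / total_values
```

Under this reading, the news-related traffic per round is roughly two orders of magnitude smaller than the parameter count of the news model alone.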

The Influence of News Model Size (RQ3)
In this subsection, we apply BERT models of different sizes in the news model to study the influence of the news model size on MIND. The computation cost of clients is tested on CPU, while the computation cost of the server is tested on GPU, which is reasonable since clients usually have limited computation resources. The results are shown in Table 3, from which we have several observations. First, the recommendation performance increases with the news model size, which shows the effectiveness of applying a large news model. Second, the communication and computation costs of our method on clients are lower than those of FedRec. This is because in Efficient-FedRec clients only compute the user model and only request the user model and the representations of the news involved in their local behaviors. Additionally, the gap between the overhead of Efficient-FedRec and FedRec becomes larger with a larger news model, which demonstrates the superiority of our method in using large news models. Third, the computation overhead of our method on the server is larger than that of FedRec, because in our framework the news model is trained on the central server. However, the overall computation time of Efficient-FedRec is lower than that of FedRec. This is because the server can use powerful GPU clusters to update the news model. It is noted that we simulate the client computation cost with 100% CPU utilization. The computation time on real devices will be larger than the results reported in Table 3.

Influence of User Group Size (RQ4)
In this section, we study the influence of the user group size on the union news set size, the convergence round, the overall communication cost and the secure aggregation time. The results are shown in Figure 4. As shown in Figure 4a, the size of the union news set increases with the user group size. When the user group size is 40, the average size of the union news set is 1,115, which is about 10 times the average size of a user's local news set, i.e., 114. Therefore, when the user group size is large enough, it is hard for the server to recover the news a specific user has interacted with. Then, we study the impact of the user group size on the communication cost. Since a larger user group size leads to a larger union news set, the communication cost per user increases. In Figure 4b we also find that larger user group sizes make the model converge faster. The impact of the user group size on the overall communication cost is shown in Figure 4c; this cost is influenced by the user group size, the communication cost per user and the convergence round. The overall communication cost increases with a larger user group size. Finally, we study the impact of the user group size on the secure aggregation time (per user). As shown in Figure 4d, the computation cost of secure aggregation increases with a larger user group size. The computation time of secure aggregation is 0.71s when the user group size is 50. Considering privacy protection ability, communication cost and secure aggregation cost, we set the user group size to 50 in our experiments on MIND.

Influence of User Number (RQ5)
In this subsection, we study the influence of the number of users who participate in model training. We randomly sample different numbers of users from MIND. The experimental results are shown in Figure 5. We can observe that the performance increases with the number of users, which validates the idea of training news recommendation models collaboratively with a large number of users. Moreover, it shows that our Efficient-FedRec can effectively exploit useful information from the behaviors of multiple users.

Conclusion
In this paper, we propose an efficient federated learning framework for privacy-preserving news recommendation named Efficient-FedRec. We decompose the news recommendation model into a large news model maintained by the server and a light-weight user model. Users request news representations and the user model from the central server and compute gradients with their local data. The central server aggregates the gradients to update the user model and the news model. The updated news model is further used by the server to infer news representations. In order to protect the private information in users' local gradients, we apply secure aggregation to aggregate gradients. In order to protect the history of news that users have interacted with, we exchange the news representations of the union news set involved in the behaviors of a group of users. Experiments on two real-world datasets validate that our method can effectively reduce both the communication and computation cost on the user side while keeping the model performance.

Ethical Statements
User Information Protection in Dataset In this paper, we conduct experiments on two public datasets, i.e., MIND and Adressa. The MIND dataset was released in (Wu et al., 2020b). It is a public English news recommendation dataset. In this dataset each user was de-linked from the production system when securely hashed into an anonymized ID using one-time salt mapping to protect user privacy. We agreed to the Microsoft Research License Terms before downloading this dataset and complied with these license terms when using it. The Adressa dataset was released in (Gulla et al., 2017). It is a public Norwegian news recommendation dataset. The users in this dataset are anonymized to protect user privacy. We follow the dataset license when using this dataset. Thus, all the datasets used in our paper are public datasets where private user information is well protected.
Influence of User Group Size In our framework, the user group that updates the model in each round consists of randomly sampled users. We conduct experiments to analyze the influence of the user group size, and the results are summarized in Section 4.5. The experimental results show that as long as the user group size is sufficiently large, which is usually easy to satisfy in practical applications, the information about the news a user has interacted with can be well protected.