A Secure and Efficient Federated Learning Framework for NLP

In this work, we consider the problem of designing secure and efficient federated learning (FL) frameworks. Existing solutions either involve a trusted aggregator or require heavyweight cryptographic primitives, which degrade performance significantly. Moreover, many existing secure FL designs work only under the restrictive assumption that no client drops out of the training protocol. To tackle these problems, we propose SEFL, a secure and efficient FL framework that (1) eliminates the need for trusted entities; (2) achieves similar or even better model accuracy compared with existing FL designs; and (3) is resilient to client dropouts. Through extensive experimental studies on natural language processing (NLP) tasks, we demonstrate that SEFL achieves accuracy comparable to existing FL solutions, and that the proposed pruning technique can improve runtime performance by up to 13.7x.


Introduction
Deep Neural Networks have played a significant role in advancing many applications (Yuan et al., 2021; Ding et al., 2017). The field of Natural Language Processing (NLP) leverages Recurrent Neural Networks (RNNs) and Transformers to achieve outstanding performance on many tasks. The Transformer was first introduced in (Vaswani et al., 2017) using a self-attention mechanism, and it achieved prominent performance in various NLP tasks. The benefits of RNNs and Transformers in NLP are well-publicized, but various privacy and security problems still pose challenges to the utilization of these models by data owners, especially users with sensitive data such as location, health, and financial datasets. Federated Learning (FL) (McMahan et al., 2017a) empowers different data owners (e.g., organizations or edge devices) to collaboratively train a model without sharing their own data, thus allowing them to address key issues like data privacy. Although the data exchanged in FL contains less information than the user's raw data (Bonawitz et al., 2019), one might still be concerned about how much information remains. Recent research has shown that attackers can still infer sensitive information about the training data, or even reconstruct it solely from publicly shared model parameters (Zhu et al., 2019).
§ Equal contribution, alphabetical order
Although a series of works (Bonawitz et al., 2017; Truex et al., 2019; Papernot et al., 2018; Wu et al., 2021) have been proposed to protect FL protocols from leaking sensitive information, they either involve a trusted third party (a centralized aggregator) or do not tolerate client dropouts. Therefore, the data owners either need to blindly trust the centralized aggregator or must be online all the time during the training period, which makes the entire design less practical. To address the aforementioned issues, in this work, we develop a secure and efficient FL framework, SEFL. It employs two non-colluding servers, i.e., an Aggregation Server (AS) and a Cryptography Service Provider (CSP). AS collects the encrypted local updates from clients and securely aggregates them, while CSP manages the cryptographic primitives, i.e., the decryption key. The overarching goal of this framework is to support accurate and efficient RNN and Transformer training while preserving the privacy of training data against the untrusted servers. In other words, any server's knowledge about any single training example should be bounded by differential privacy (Dwork, 2008).
Our contributions are summarized as follows: (1) We present a novel secure FL framework that eliminates the need for trusted aggregators. (2) SEFL is more resilient to clients dropping out than previous works. SEFL is able to produce a correct global model even when 75% of clients drop out of the training protocol. (3) To improve the training performance, we integrate the Hankel-matrix based local update/weight pruning method with SEFL to simultaneously reduce the volume of local updates and weight storage. The reduction in space, computational, and communication complexity is significant, from $O(l^2)$ to $O(2l - 1)$ for weight/update representation, where $l$ is the block size. With extensive experiments, we show that SEFL achieves comparable or even better accuracy than existing secure FL solutions over complex RNN and Transformer models, and the proposed pruning scheme improves SEFL's performance by up to 13.7×.

Figure 1: SEFL workflow (clients compute and encrypt each compressed local model update; AS aggregates the local updates in encrypted form).

Background
Differential privacy. Let $\epsilon, \delta > 0$ be privacy parameters. A randomized mechanism $M$ satisfies $(\epsilon, \delta)$-differential privacy ($(\epsilon, \delta)$-DP) if and only if, for any two adjacent datasets $D$ and $D'$ (differing by the addition or removal of one record) and any possible output set $S$, the following holds:

$$\Pr[M(D) \in S] \le e^{\epsilon} \Pr[M(D') \in S] + \delta.$$

The Gaussian Mechanism (GM) (Dwork et al., 2014) achieves differential privacy by perturbing a deterministic real-valued function $f$ with additive noise that is proportional to the function's sensitivity $S_f$:

$$\mathrm{GM}(D) = f(D) + \mathcal{N}(0, S_f^2 \sigma^2),$$

where $\mathcal{N}$ denotes a normal distribution and $\sigma$ is the noise scale.

Additively homomorphic encryption (AHE). AHE is a semantically secure public-key encryption scheme (Peter et al., 2012) with three algorithms Gen, Enc and Dec, where Gen generates the public and secret key pair $(pk, sk)$, Enc encrypts a message with $pk$, and Dec decrypts a ciphertext with the secret key $sk$. In addition, AHE provides a homomorphic addition operator $\oplus$ such that $\mathrm{Dec}(\mathrm{Enc}(m_1, pk) \oplus \mathrm{Enc}(m_2, pk) \oplus \cdots \oplus \mathrm{Enc}(m_k, pk), sk) = m_1 + \cdots + m_k$.

Two-party secure computation (2PC). 2PC allows two parties with private inputs $x_1$ and $x_2$ to jointly compute a given function $f$. Both parties learn nothing beyond the output of $f$. A typical 2PC design is the garbled circuit (GC) (Yao, 1986).
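The additive homomorphism of AHE can be illustrated with a toy Paillier-style scheme. This is a minimal sketch under our own assumptions: the paper does not name the AHE scheme it uses, the function names are hypothetical, and the fixed small primes make the example insecure by design.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy key generation from two small fixed primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu, n)  # (pk, sk)

def enc(m, pk):
    """Encrypt an integer 0 <= m < pk as (1 + n)^m * r^n mod n^2."""
    n2 = pk * pk
    while True:
        r = random.randrange(1, pk)
        if math.gcd(r, pk) == 1:
            break
    return (pow(pk + 1, m, n2) * pow(r, pk, n2)) % n2

def dec(c, sk):
    """Decrypt with the secret key (lam, mu, n)."""
    lam, mu, n = sk
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add(c1, c2, pk):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (pk * pk)

pk, sk = keygen()
c_sum = add(add(enc(3, pk), enc(4, pk), pk), enc(5, pk), pk)
print(dec(c_sum, sk))  # 12
```

Whoever holds only `pk` can combine ciphertexts with `add` but cannot read them, which is the property the aggregation step relies on.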

Workflow
We design the SEFL framework in the two non-colluding (untrusted) server setting, where an aggregation server (AS) aggregates the encrypted local model updates and another server (CSP) manages the cryptographic primitives (i.e., the decryption key). To ensure privacy, we require that any server's knowledge about any single training example is bounded by differential privacy. Figure 1 illustrates an overview of SEFL.
Initially, CSP generates the key pair $(pk, sk)$, stores the secret key $sk$ locally, and broadcasts the public key $pk$ to all other entities (AS and all clients). In our design, CSP is tasked with managing the cryptographic primitives (i.e., $sk$), so CSP is the only entity that can decrypt messages encrypted under $pk$. In the meantime, we assume that all entities agree on the same initial model $W^0$.
Each training iteration, i.e., the $i$-th training round, starts with all clients conducting local training on their respective private data $D_j$ of size $n_j$ to obtain the local model update $\Delta W_j^i$. Each client then prunes the obtained model update using weight pruning techniques and encrypts the compressed update by computing $\Delta\hat{W}_j^i \leftarrow \mathrm{Enc}\left(\frac{n_j \Delta W_j^i}{\sum_{k=1}^{K} n_k}, pk\right)$. Clients then submit the encrypted and compressed updates to the AS.
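The client-side weighting can be sketched in plaintext as follows. This is a minimal illustration with hypothetical helper names, and it leaves out pruning and encryption: each client scales its update by its share of the total data, so that a plain sum on the server already equals the weighted average.

```python
def weight_update(delta, n_j, n_total):
    """Scale a client's update by n_j / n_total (its share of all training data)."""
    return [n_j * d / n_total for d in delta]

def aggregate(weighted_updates):
    """Element-wise sum; stands in for the homomorphic addition performed by AS."""
    return [sum(column) for column in zip(*weighted_updates)]

# Two hypothetical clients holding 1 and 3 training examples.
sizes = [1, 3]
updates = [[4.0, 0.0], [0.0, 4.0]]
n_total = sum(sizes)

weighted = [weight_update(u, n, n_total) for u, n in zip(updates, sizes)]
print(aggregate(weighted))  # [1.0, 3.0]
```

A real deployment would additionally need to encode the real-valued entries as integers before encryption; that encoding is not described in the text and is omitted here.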
On the server side, AS homomorphically adds all encrypted (pruned) local updates and obtains $\Delta\hat{W}^i \leftarrow \Delta\hat{W}_1^i \oplus \Delta\hat{W}_2^i \oplus \cdots$. By the homomorphic property, $\Delta\hat{W}^i$ is an encryption of the weighted average of all pruned local updates, i.e., $\Delta\hat{W}^i = \mathrm{Enc}\left(\frac{\sum_{j=1}^{K} n_j \Delta W_j^i}{\sum_{k=1}^{K} n_k}, pk\right)$. To decrypt the aggregated global update, AS has to collaborate with CSP, as CSP is the only entity that manages the decryption key. However, sending $\Delta\hat{W}^i$ directly to CSP for decryption would expose the exact value of $\Delta W^i$ to CSP, which violates the privacy guarantee. One possible approach is to have AS homomorphically add some random noise to $\Delta\hat{W}^i$ and send the distorted global update to CSP for decryption. After receiving the decryption result, AS removes the noise to obtain the true answer. This prevents CSP from knowing the true value of the global model update; however, AS would learn this value, which is also a privacy violation. To ensure that neither of the two servers can learn the exact global update, in our design, AS first sends a masked $\Delta\hat{W}^i$ (distorted with a random mask) to CSP, and CSP decrypts the masked global update. Then, the two servers jointly evaluate a secure 2PC protocol in which AS inputs the random mask and CSP inputs the decrypted, masked global update. The true global update $\Delta W^i$ is then recovered inside the secure 2PC protocol. Next, each server independently samples a DP noise and provides it as input to the secure 2PC. These DP noises are added to the recovered $\Delta W^i$ inside the 2PC protocol. Finally, the protocol returns the global update distorted by DP noise to AS, with which AS updates the global model, $W^i \leftarrow W^{i-1} + \Delta W^i$. Note that the choice of DP noise is quite flexible; by default, SEFL uses Gaussian noise to distort the global update.
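The mask-then-recover flow can be sketched as a plaintext simulation. Everything below is an assumption made for illustration: the modulus, the function names, and the integer-rounded Gaussian noise are hypothetical, plain modular arithmetic stands in for Enc/Dec, and the 2PC protocol is replaced by an ordinary function (its ideal functionality), so the sketch shows the data flow but provides no security.

```python
import random

MODULUS = 2**61 - 1  # hypothetical plaintext modulus for this toy example

def as_mask(update, mask):
    """AS adds a random mask (done homomorphically in SEFL, in the clear here)."""
    return (update + mask) % MODULUS

def ideal_2pc(masked_plain, mask, noise_as, noise_csp):
    """Stand-in for the 2PC functionality: remove the mask, add both DP noises.

    In a real 2PC neither server would observe the unmasked intermediate value.
    """
    true_update = (masked_plain - mask) % MODULUS
    return true_update + noise_as + noise_csp

update = 1234                          # integer-encoded aggregate held under encryption
mask = random.randrange(MODULUS)       # known only to AS
masked_plain = as_mask(update, mask)   # what CSP sees after decryption

noise_as = round(random.gauss(0, 2.0))   # sampled independently by AS
noise_csp = round(random.gauss(0, 2.0))  # sampled independently by CSP
noisy_update = ideal_2pc(masked_plain, mask, noise_as, noise_csp)
print(noisy_update - update == noise_as + noise_csp)  # True
```

CSP only ever sees `masked_plain`, which is uniformly distributed when the mask is, and AS only receives the output that already carries both noise terms.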
SEFL repeats the training phases until it reaches the maximum training round T or the model has converged. Note that it is not necessary for AS to have all local updates from clients: according to our evaluation results, SEFL is able to train an accurate model when only 10% of clients contribute their local updates. Therefore, in practice, one can set an aggregation threshold, say L, which means that AS can start aggregating local updates as soon as it receives more than L updates.

Block-Hankel Matrix-based Pruning
Cryptographic primitives can help to provide stronger security guarantees. However, in practice, they often come at high computation and communication overhead. Adding cryptographic operations to an FL framework could hinder its adoption on resource-constrained edge devices such as mobile or IoT devices with limited resources (e.g., computation, memory size). Therefore, to be compatible with resource-constrained edge devices in federated learning, we aim to minimize the number of cryptographic operations required during training while maintaining the accuracy of FL. To achieve this, we design an efficient method to train a large NLP model that simultaneously reduces the volume of local updates and weight storage, and thereby the number of required cryptographic operations.
Pitfall of sparsity format in AHE. Typical weight pruning approaches need to store the indices of nonzero entries (Gurevin et al., 2021; Gui et al., 2019; Wen et al., 2016; Ma et al., 2020). However, the differing positions of nonzero values across clients can lead to significant inefficiency in the subsequent model update aggregation. As shown in Fig. 2 (a), assume AS aggregates two local updates with the same sparsity from C 1 and C 2 . We apply the compressed sparse row (CSR) format to represent the updates (∆W 1 and ∆W 2 in Fig. 2 (a)), where the non-zero elements of ∆W 1 and ∆W 2 are not located in the same positions. As the AHE-based update aggregation process is a black-box homomorphic addition operation, we cannot reconstruct the original sparse matrix from CSR since the indices are encrypted, and therefore we cannot correctly produce the aggregated update.
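The pitfall can be reproduced in plaintext. In this minimal sketch (hypothetical helper names, no encryption), two updates with the same sparsity but different nonzero positions are stored in CSR form, and their value arrays are added position by position, which is all a black-box homomorphic addition could do.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def from_csr(values, col_idx, row_ptr, n_cols):
    """Rebuild the dense matrix from its CSR representation."""
    dense = []
    for r in range(len(row_ptr) - 1):
        row = [0] * n_cols
        for k in range(row_ptr[r], row_ptr[r + 1]):
            row[col_idx[k]] = values[k]
        dense.append(row)
    return dense

w1 = [[5, 0], [0, 7]]
w2 = [[0, 2], [3, 0]]
v1, c1, p1 = to_csr(w1)
v2, c2, p2 = to_csr(w2)

# Position-wise addition of the value arrays, ignoring that the indices differ.
blind_sum = [a + b for a, b in zip(v1, v2)]
wrong = from_csr(blind_sum, c1, p1, 2)
correct = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(w1, w2)]

print(wrong)    # [[7, 0], [0, 10]]
print(correct)  # [[5, 2], [3, 7]]
```

The mismatch arises purely from the index arrays differing between clients, which motivates an index-free structured format.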
Crypto-friendly Block-Hankel matrix based pruning. We divide the local update into multiple modules with identical shape. Within each module, a special structured-matrix format is applied to approximate the original matrix without indices. In our framework, we investigate the use of blocks of Hankel matrices (BHM) to approximate blocks of the local update. As shown in Fig. 2 (b), we can perform aggregation based on the encrypted val vectors since the positions of the sequence vectors are identical. In addition, the resultant global model will have the same size, so download and upload communication is symmetric and balanced.
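Why a Hankel block aggregates cleanly can be seen from its definition: an $l \times l$ Hankel matrix is constant along its anti-diagonals, so it is fully determined by a vector of $2l - 1$ values with entry $(i, j)$ equal to element $i + j$ of that vector. The sketch below uses hypothetical helper names and plaintext values, and shows that adding the defining vectors is equivalent to adding the full blocks.

```python
def hankel_block(h, l):
    """Expand a length-(2l - 1) vector into the l x l Hankel matrix H[i][j] = h[i + j]."""
    if len(h) != 2 * l - 1:
        raise ValueError("a Hankel block of size l needs 2l - 1 values")
    return [[h[i + j] for j in range(l)] for i in range(l)]

def add_vectors(a, b):
    """Element-wise sum of two defining vectors (no indices needed)."""
    return [x + y for x, y in zip(a, b)]

def storage_counts(l):
    """Values stored per block: dense (l^2) versus Hankel (2l - 1)."""
    return l * l, 2 * l - 1

l = 3
h1 = [1, 2, 3, 4, 5]
h2 = [10, 20, 30, 40, 50]

block_of_sum = hankel_block(add_vectors(h1, h2), l)
sum_of_blocks = [
    [a + b for a, b in zip(r1, r2)]
    for r1, r2 in zip(hankel_block(h1, l), hankel_block(h2, l))
]
print(block_of_sum == sum_of_blocks)  # True
print(storage_counts(32))             # (1024, 63)
```

Because every client uses the same block layout, the server only needs to add vectors of equal length position by position, which is exactly what homomorphic addition provides.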
In what follows, we discuss the convergence analysis for pruned sub-networks.
For every network f with depth l and ∀ i ∈ {1, 2, . . . , n}, consider a randomly initialized neural network g with 2n layers and width poly(d, n, m), where d is the input size, n is the number of layers in f , and m is the maximum number of neurons in a layer. The weights are initialized from the uniform distribution on [-1, 1]. Then with probability at least 1 − β there is a weight-pruned subnetwork ĝ such that:

Proof 1. We start with the analysis of simple ReLU networks, where f (x) = w · x, g(x) = uσ(w g x).
To extend it to a single network layer, we compute

We now provide the general case analysis. With probability over 1 − β, we obtain:

Putting it all together. Our objective is to compress the weights and updates using the BHM format. Thus we minimize the loss function subject to the constraints of BHM. More specifically, we set constraints as S

The backward propagation process of the training phase can also be implemented using the BHM format, since pruning based on the block Hankel matrix has the same "effectiveness" as unpruned DNNs, as shown in (Zhao et al., 2017).
Compared to other index-required pruning methods, BHM pruning has the following advantages. First, it always guarantees the strong structure of the trained network, thereby avoiding the storage space, computation, and communication time overhead incurred by the complicated indexing process. Second, during training, the BHM-based approach directly trains weight matrices in the BHM format by updating only one vector for each block (i.e., $2l - 1$ vs. $l^2$ values). Third, the reduction in space, computational, and communication complexity from using BHM is significant. The weight tensor W

Experiments
We implement the SEFL system using PyTorch 1.4.0 and CUDA 10.1. All experiments are performed on an AWS EC2 cloud instance with a 2.30GHz Intel Xeon Gold 5218 Scalable processor and 8 NVIDIA Quadro RTX 6000 GPUs. We evaluate SEFL by conducting experiments using LSTM and Transformer models on the WikiText-2 (Merity et al., 2016) dataset. The LSTM model is adopted from (Hochreiter and Schmidhuber, 1997). The Transformer model (Vaswani et al., 2017) contains two layers with an embedding dimension of 200, two attention heads, and 200 hidden units. We use perplexity to measure the quality of the predicted data for both Transformer and LSTM.
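For reference, perplexity is the exponential of the average negative log-likelihood per token, so lower values indicate a better language model. The following is a minimal sketch of that definition; the function name is ours and it is not taken from the SEFL implementation.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to each target token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that is uniformly unsure among 4 tokens has a perplexity of about 4.
uniform = [math.log(0.25)] * 3
print(round(perplexity(uniform), 6))  # 4.0
```

In a PyTorch training loop the same quantity is usually obtained by exponentiating the mean cross-entropy loss over the evaluation tokens.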

Result Analysis
Comparisons with existing private FL. In Figure 3, we compare SEFL with the state-of-the-art private FL design, CDP-FL (Geyer et al., 2017). First,  ) to obtain a global aggregation model and then distort it with DP noise. The optimized SEFL (with the pruning technique) improves the accuracy by up to 11% and 15% over the CDP-FL method with the LSTM and Transformer models, respectively. One possible explanation is that pruning reduces the DP noise added to the aggregated model: to distort the model, Gaussian noise must be injected into each model element independently, so the smaller the model, the fewer times Gaussian noise is injected.
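The explanation offered above can be made concrete with a short calculation, given here as an illustrative sketch rather than a result from the evaluation. Assume (in our notation) that independent noise $z_k \sim \mathcal{N}(0, \sigma^2)$ is added to each of the $d$ released values:

```latex
\mathbb{E}\left[\lVert z \rVert_2^2\right]
  = \sum_{k=1}^{d} \mathbb{E}\left[z_k^2\right]
  = d\,\sigma^2
```

For one $l \times l$ block, BHM reduces $d$ from $l^2$ to $2l - 1$, so at a fixed $\sigma$ the expected squared norm of the injected noise vector shrinks by a factor of $l^2 / (2l - 1)$. This ignores how noise on the $2l - 1$ values spreads when the block is expanded, as well as any change in sensitivity, so it should be read as rough intuition only.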
Evaluating optimization. We compare SEFL (with the pruning optimization) with unoptimized SEFL in Figure 4. We report the average elapsed time in seconds over 10 replicated runs as the runtime performance. We report the accuracy and runtime performance for unpruned SEFL and for SEFL with BHM block sizes from 4 to 32. Note that the larger the block size, the smaller the compressed model will be. SEFL achieves a performance improvement of up to 13.7× over the unpruned SEFL. The pruning optimization not only brings performance improvements, but also optimizes the accuracy guarantees.

SEFL with client dropouts. We evaluate whether SEFL is able to handle client dropouts in Table 1, where we report accuracy when 25%, 50%, and 75% of clients drop out of the protocol. As shown in Table 1, when the dropout rate is relatively small, i.e., 25%, SEFL achieves almost the same accuracy guarantee as in the no-dropout case (0.5% and 5% accuracy degradation for LSTM and Transformer, respectively). Even when the majority of clients drop out, i.e., at a 75% dropout rate, SEFL still produces accurate models with only 17.41 and 46.34 higher perplexity. In summary, SEFL can handle a large number of client dropouts with relatively small degradation in accuracy. This result shows that our proposed approach is applicable to practical scenarios.

Conclusion
In this paper, we introduced a new secure and efficient FL framework, SEFL, that (i) eliminates the need for trusted entities, (ii) achieves model accuracy similar to existing FL approaches, and (iii) is resilient to client dropouts. We also proposed optimizations that mitigate the high computation and communication overhead caused by cryptographic primitives. This is achieved by applying a local weight pruning technique based on the block Hankel matrix. Through extensive experimental studies on NLP tasks, we demonstrate that SEFL achieves accuracy comparable to existing FL solutions and can significantly improve runtime performance.