Privacy-Preserving Distributed Deep Learning via Homomorphic Re-Encryption

The flourishing of deep learning on distributed training datasets raises concerns about data privacy. Recent work on privacy-preserving distributed deep learning relies on the assumption that the server does not collude with any learning participant. Once they collude, the server can decrypt and obtain the data of all learning participants. Moreover, since all learning participants share the same private key, each learning participant must connect to the server via a distinct TLS/SSL secure channel to avoid leaking data to the other participants. To fix these problems, we propose a privacy-preserving distributed deep learning scheme with the following improvements: (1) no information is leaked to the server even if any learning participant colludes with the server; (2) learning participants do not need separate secure channels to communicate with the server; and (3) the accuracy of the deep learning model is higher. We achieve these improvements by introducing a key transform server and applying homomorphic re-encryption to asynchronous stochastic gradient descent for deep learning. We show that our scheme adds a tolerable communication cost to the deep learning system while achieving additional security properties, and the computational cost for learning participants remains similar. Overall, our scheme is a more secure and more accurate deep learning scheme for distributed learning participants.


Background
In recent years, artificial intelligence (AI) [1] has been applied to more and more fields, such as medical treatment [2], the internet of things (IoT) [3] and industrial control [4]. Deep learning [5] is one of the most attractive and representative techniques of AI, and is mainly based on neural networks [6]. Thanks to the development of related computer hardware (such as GPUs) and the emergence of big data [7], deep learning has achieved striking success in tasks such as image classification [8], traffic identification [9] and self-learning [10]. Deep learning is therefore gaining increasing importance in our increasingly intelligent modern society.
Distributed deep learning is gaining increasing popularity. In this kind of deep learning, the training datasets are collected from multiple distributed data providers rather than a single one [11]. When the deep learning model is trained with more representative data items, the obtained model generalizes better, which leads to higher model accuracy. However, the collection and utilization of distributed datasets raise worrying security issues (especially privacy issues), which hinder the wider application of distributed deep learning to some extent [12].
There are many examples of privacy leaks, which may pose significant threats. For example, in 2018 it was reported that a data analytics company called Cambridge Analytica harvested millions of voter profiles, revealing the voters' private information. The data breach influenced voters' choices, threatening the fairness of the election. In another real-life case, Reuters Health recently reported that some health applications on mobile phones might share information with a host of unrelated companies, some of which have nothing to do with healthcare. A big concern of users is how their data will be used and by whom [13].
As data providers send more and more data to cloud servers, which own the high computational capability and large storage needed to process big data, it is reasonable for data providers to worry about their data privacy, even when they encrypt their data before sending it out [14].
Recently, Aono and Hayashi [15] presented a privacy-preserving deep learning scheme via additively homomorphic encryption. It is one of the most representative works in privacy-preserving deep learning. Their scheme is based on gradients-encrypted asynchronous stochastic gradient descent (ASGD), in combination with learning with errors (LWE)-based encryption and Paillier encryption. They first prove that sharing gradients, even partially, with an honest-but-curious parameter cloud server as in [16] may leak information. They then propose an additively homomorphic encryption based scheme, in which all learning participants encrypt their own computed gradients with the same public key and send them to the server. However, the security of the scheme relies on the assumption that the server does not collude with any learning participant.
We analyze potential risks of the system in [15] as follows.
(1) Since the learning participants share the same public key and private key, learning participant A could decrypt the gradients of learning participant B as long as it obtains B's encrypted gradients. (2) To make matters worse, in practice the server is likely to collude with one of the participants. Once the server and a learning participant collude, they can decrypt the gradients of all learning participants, because the server holds all encrypted gradients and the learning participant holds the private key. According to [15], they could then recover the private local data of all learning participants from the gradients.
In our scheme, all parties (a cloud server acting as the key transform server, a cloud server acting as the data service provider, and data providers acting as learning participants) are assumed to be honest-but-curious [17]. In other words, they will finish the given tasks, but may try to glean sensitive information, such as the local data of the data providers. We assume that the two servers, the key transform server (KTS) and the data service provider (DSP), are selected from different cloud companies and would not collude with each other because of conflicting interests (each company takes its own benefit as the first consideration) and reputation preservation; however, a server may collude with a data provider.
There are two main colluding scenarios between a server and a learning participant. First, the server and a learning participant may belong to the same company; that is, they may collude with each other for the common benefit of their company. Specifically, the cloud company a server belongs to may send an entity to act as a learning participant. The two cloud servers, however, are selected from different companies, so they are unlikely to collude with each other, both for the sake of their companies' reputations and because of their conflicting interests. Second, even if the cloud company a server belongs to cannot send an entity to act as a learning participant, it is easier for a cloud server to corrupt a learning participant than to corrupt another cloud server, because a cloud server generally has stronger security measures than a learning participant.
In a word, the cloud server in [15] is likely to collude with a learning participant, and it is reasonable for us to assume in this paper that the key transform server does not collude with the data service provider.

Our Contributions
Due to the security vulnerabilities of the scheme proposed in [15], we propose a multi-key based distributed deep learning scheme, built on homomorphic re-encryption, that protects the data privacy of learning participants even when a server colludes with one of the learning participants. We introduce a key transform server to re-encrypt the gradients encrypted by the learning participants. The data service provider makes the re-encrypted gradients additively homomorphic and performs the weights update computation. Finally, the learning participants download the new weights and decrypt them respectively. The detailed realization steps of our scheme are described in Section 4. In a word, our scheme enjoys the following properties on security, efficiency and accuracy.

• Security: our scheme protects the private local data of learning participants even without the assumption that the server would not collude with any learning participant.

• Efficiency: experimental results show that the computational cost of learning participants is similar to that in [15].

• Accuracy: our scheme provides slightly higher accuracy than that in [15].

More Related Works
Shokri et al. [16] designed a distributed deep learning system in which multiple learning participants can jointly train an ASGD-based deep neural network model without sharing their local datasets, but must selectively share key parameters of the model. The goal of our paper is to design an ASGD-based distributed deep learning system that does not require sharing key parameters of the model.
Papernot et al. [18] proposed a scheme to preserve the privacy of training data called private aggregation of teacher ensembles (PATE). They used "teacher" models for a "student" model instead of public models to protect sensitive data. Essentially, the scheme provides differential privacy.
NhatHai Phan et al. [19] proposed the deep private auto-encoder (dPA), a type of deep learning model. Instead of perturbing the result of deep learning, their scheme perturbs the objective function of the deep auto-encoder to realize differential privacy.
Abadi et al. [20] developed a differential privacy framework to analyze the privacy cost of crowdsourcing the training of a model over a large dataset containing sensitive data. They also designed algorithmic techniques for machine learning with a modest cost in privacy, computation, efficiency and accuracy. Hitaj et al. [21] showed that distributed deep learning was susceptible to an attack they devised: they trained a generative adversarial network (GAN) to generate samples that came from the same distribution as the original training dataset, which should be kept private. Moreover, they showed that existing record-level differential privacy could not resist their attack; an effective method is therefore needed for privacy-preserving distributed collaborative deep learning. They consider one of the learning participants as the adversary, which is practical; we consider this situation in this paper as well.
Mohassel et al. [22] proposed privacy-preserving machine learning protocols for logistic regression, linear regression and stochastic gradient descent based neural network training. In their schemes, the data providers are distributed and there are two servers. Their schemes are based on the assumption that the two servers do not collude, which is reasonable in practice because of conflicting interests. Our system is based on this assumption, too.
Ping Li et al. [23] suggested that distributed deep learning over a combined dataset should pay attention to two points. First, all data, including intermediate computation results, should be encrypted with different keys before being sent out. Second, the computational cost for data providers should be minimal. We fully consider these two points when designing our scheme, so that the data providers can be mobile smartphones. The same authors proposed a framework for privacy-preserving outsourced classification in cloud computing (POCC) in [24].
Qingchen Zhang et al. [25] pointed out that offloading some expensive operations of big data feature learning to cloud server(s) can improve system efficiency, while the data privacy of enterprises and governments must still be protected. They approximated the activation function as a polynomial function under the Brakerski-Gentry-Vaikuntanathan (BGV) cryptosystem. Our scheme does not need such an approximation and thus avoids the resulting accuracy decrease.
BD Rouhani et al. [26] proposed a scalable provably-secure deep learning framework called Deepsecure, in which all parties may leak information. The key to their framework is pre-processing techniques and an optimized Yao's Garbled Circuit protocol [27]. Our scheme considers the situation in which one of the learning participants may leak information, even its private key.

Paper Organization and Notations
The rest of the paper is organized as follows. In Section 2, we introduce the definitions of homomorphic re-encryption and ASGD-based deep learning, and illustrate that gradients may leak information; these are the preliminaries of our system. The architecture of our system is proposed in Section 3. The details of realizing our system via proxy-invisible homomorphic re-encryption are given in Section 4. We then perform a security analysis of our system in Section 5. Furthermore, we analyze the communication cost of our system and evaluate its computational cost with experiments in Section 6. Finally, we conclude the paper in Section 7.
To facilitate presentation, we summarize the main notations used in this paper in Table 1, including: wu, the number of weights update processes; Len(*), the bit length of the input data; H(*), the hash value of the input data; (k, p, q) → (g, n), a function with inputs k, p, q and outputs g, n; and p · q, the product of p and q.

Homomorphic Re-Encryption
The homomorphic re-encryption scheme (HRES) is an asymmetric cryptosystem that realizes proxy-invisible re-encryption and supports privacy-preserving data processing with access control. In addition, the addition scheme of an improved version of HRES, called the somewhat re-encryption scheme, has the property of additive homomorphism [28]. There are four roles in this scheme: data providers (DPs), the data service provider (DSP), the access control server (ACS), and data requesters (DRs). As the names imply, the DPs provide data to the ACS; the ACS transforms the received data to realize access control; the DSP then processes the transformed data; finally, the DRs request and obtain the data they need. Next, we introduce this improved homomorphic re-encryption scheme in detail; it mainly consists of the following five algorithms.
1. Key generation (KeyGen): (k, p, q) → (g, n, PK). First choose k as a security parameter, and two large primes p and q, where Len(p) = Len(q) = k. Then compute n = p · q and determine a generator g of G with maximal order [29], where G is the cyclic group of quadratic residues modulo n². The outputs (g, n) are public system parameters. The two servers, DSP and the access control server (ACS), respectively generate key pairs (sk_DSP = a, pk_DSP = g^a) and (sk_ACS = b, pk_ACS = g^b). They then negotiate the Diffie-Hellman key PK = (pk_ACS)^a = (pk_DSP)^b = g^{a·b} mod n², which is published to all data providers for them to encrypt their data. Each data provider also generates its own key pair; for example, the key pair of data provider i is (sk_i, pk_i) = (k_i, g^{k_i}).
2. Encryption (Enc): m_j → [m_j]_PK = (T_j, T_j'). This algorithm is performed by data providers. Its function is to encrypt a plaintext m_j (m_j ∈ Z_n) provided by data provider i under the Diffie-Hellman key PK. A random r ∈ [1, n/4] is first selected, and the resulting ciphertext consists of two parts: (1) T_j = (1 + m_j · n) · PK^r mod n²; (2) T_j' = g^r mod n².

3. First phase of re-encryption (FPRE): [m_j]_PK → [m_j]^+. DSP first selects a computation identifier CID and computes the hash value h_1 = H((pk_i)^{sk_DSP} || CID); it then uses h_1 to transform the received ciphertext into the intermediate ciphertext [m_j]^+ (the detailed formulas can be found in [28]).
4. Second phase of re-encryption (SPRE): [m_j]^+ → [m_j]_{pk_i}. After receiving [m_j]^+ from DSP, ACS performs the following: (1) h_2 = H((pk_i)^{sk_ACS} || CID); (2) it uses h_2 to transform [m_j]^+ into the final ciphertext [m_j]_{pk_i}.
5. Decryption (Dec): [m_j]_{pk_i} → m_j. Only the data requester that holds the corresponding private key sk_i can finish the decryption correctly: writing the final ciphertext as a pair (T_j, T_j'), the requester computes (T_j')^{sk_i} mod n² and then recovers m_j = ((T_j · ((T_j')^{sk_i})^{-1} mod n²) − 1)/n. The computation identifier CID is set to addition here. Because only the addition operation of HRES is used in this paper, we omit CID in the rest of the paper. According to [28], the properties of the above improved homomorphic re-encryption scheme can be summarized as follows.
(1) Additive homomorphism: [m_1]_PK · [m_2]_PK = [m_1 + m_2]_PK, i.e., the component-wise product of two ciphertexts is a ciphertext of the sum of the plaintexts. Because the proof of the additive homomorphism property of HRES is not provided in [28] and this property is important in our scheme, we prove it in this paper as follows.
Proof of additive homomorphism. Let [m_1]_PK = (T_1, T_1') with T_1 = (1 + m_1 n) · PK^{r_1} and T_1' = g^{r_1}, and [m_2]_PK = (T_2, T_2') with T_2 = (1 + m_2 n) · PK^{r_2} and T_2' = g^{r_2} (all modulo n²). Then T_1 · T_2 = (1 + m_1 n)(1 + m_2 n) · PK^{r_1 + r_2} = (1 + (m_1 + m_2) n) · PK^{r_1 + r_2} mod n², since the cross term m_1 m_2 n² vanishes modulo n², and T_1' · T_2' = g^{r_1 + r_2}. Because g is the generator with maximal order, the product is a fresh encryption of m_1 + m_2 under PK with randomness r_1 + r_2. Therefore the left side equals the right side, and HRES has the additive homomorphism property. Among the remaining properties established in [28] is (4): resistance to impersonation attacks and to collusion between any server and any distributed data provider, thanks to the two hash values h_1, h_2. In order to avoid repetition, we omit the proofs of the last three properties here; the details can be found in [28].
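To make the additive homomorphism concrete, the following Python sketch implements a simplified single-key variant of the encryption described above, with toy parameters. The re-encryption phases and the hash-based access control are omitted, and the exact formulas are assumptions modeled on [28] rather than the scheme verbatim:

```python
import random
from math import gcd

# Toy parameters (a real deployment uses Len(n) around 1024 bits or more).
p, q = 1019, 1031
n = p * q
n2 = n * n

# Pick a quadratic residue g coprime to n as the group generator.
while True:
    x = random.randrange(2, n2)
    if gcd(x, n) == 1:
        g = pow(x, 2, n2)
        break

a = random.randrange(1, n // 4)   # DSP's key share
b = random.randrange(1, n // 4)   # ACS's key share
PK = pow(g, a * b, n2)            # joint Diffie-Hellman key g^(a*b)

def enc(m):
    """Encrypt m in Z_n under PK: T = (1 + m*n) * PK^r, T' = g^r (mod n^2)."""
    r = random.randrange(1, n // 4)
    return ((1 + m * n) * pow(PK, r, n2) % n2, pow(g, r, n2))

def add(c1, c2):
    """Additive homomorphism: the component-wise product encrypts m1 + m2."""
    return (c1[0] * c2[0] % n2, c1[1] * c2[1] % n2)

def dec(c):
    """Decryption using both key shares: recover PK^r = (T')^(a*b), then m."""
    T1, T2 = c
    s = pow(T2, a * b, n2)
    return (T1 * pow(s, -1, n2) % n2 - 1) // n

print(dec(add(enc(5), enc(7))))  # 12
```

The product of the two ciphertext pairs decrypts to the sum of the plaintexts, exactly as in the proof above; in the full HRES the decryption exponent is never held by a single party.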

ASGD Based Deep Learning
Deep learning can be seen as a series of algorithms over a neural network consisting of multiple layers, each containing numerous neuron nodes. The neurons of one layer are connected with the neurons of the next layer via weight variables, which feed the activation function that computes the output of each layer. For example, the output of layer p + 1 is computed as out^{(p+1)} = f(W^{(p)} · out^{(p)} + b^{(p)}), where f(x) is the activation function and (W^{(p)}, b^{(p)}) denotes the weight vector (and bias) connecting layer p with layer p + 1.
The weight vector of the deep neural network, consisting of all weight variables, needs to be determined through deep learning. Concretely, take supervised learning as an example: a training dataset is provided first, and the cost function J is defined according to the target of the learning task. The cost function is then computed over the data items of the given training dataset, and the learning process minimizes J by adjusting the values of the weight variables.
The most frequently used adjusting method is called stochastic gradient descent (SGD). Denote the weight vector as W = (w_1, w_2, ..., w_n), which consists of all weights in the deep neural network. Generally, the cost function is computed iteratively over different randomly selected subsets (mini-batches) of the training dataset; for example, the computation result of the cost function over a subset consisting of t elements can be denoted as J_{|batch|=t}. The gradient vector G of the cost function J can then be represented as G = ∂J_{|batch|=t}/∂W = (∂J/∂w_1, ..., ∂J/∂w_n). In the learning process using SGD, the update rule for the weight vector is W := W − α · G, where α ∈ R is called the learning rate.
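As a concrete illustration of the update rule above, the following sketch runs mini-batch SGD on a single linear neuron with a squared-error cost; the dataset and hyperparameters are arbitrary examples:

```python
import random

# Toy noise-free dataset generated by y = 3*x + 1, so the optimum is (3, 1).
data = [(x, 3 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]

w, b = 0.0, 0.0      # weight vector W = (w, b)
alpha = 0.05         # learning rate

for step in range(2000):
    batch = random.sample(data, 3)           # mini-batch of t = 3 elements
    # Gradient G = dJ/dW of J = mean((w*x + b - y)^2) over the mini-batch.
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
    # Update rule: W := W - alpha * G.
    w, b = w - alpha * gw, b - alpha * gb

print(round(w, 2), round(b, 2))  # converges near (3.0, 1.0)
```

Because the toy data is noise-free, every mini-batch shares the same minimizer, so the stochastic updates settle at the true weights.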
To make the learning process more efficient, practical asynchronous stochastic gradient descent (ASGD) was proposed in [30], including data parallelism and model parallelism. Data parallelism means that the training dataset used for updating the weight vector can be distributed; that is, the dataset consists of data from multiple distributed data providers, and every machine used for training the model has a replica of all weights. Model parallelism separates the model into several parts that can be updated by different machines. A typical example of model parallelism, proposed in [11], is depicted in Figure 1. In this example, N = 4 learning participants are included in four machines (represented by four blue rectangles with dotted boundaries), and the five-layer deep neural network model is separated into N = 4 parts. The thick lines connecting nodes in different rectangles are called crossing weights, which connect the different parts of the model trained on different machines. The model is usually separated according to the computational capability of the machines.
In model parallelism, the weight variables in the deep neural network can be represented as W = (W_1, W_2, ..., W_N), where each component W_i (i = 1, 2, ..., N) is itself a weight vector. The update rule thus becomes W_i := W_i − α · G_i, where W_i contains the weight variables of the i-th part of the model updated by the i-th machine, and G_i is the gradient vector used for updating W_i. We adopt model parallelism in our system: learning participant i, KTS and the computing unit CU_i of DSP together act as machine i in Figure 1. Machine i is responsible for training the i-th part of the model assigned to it; that is, W_i is updated with the local data of learning participant i. If a weight variable w connects nodes in machine i and machine k, and machine k is responsible for updating w, then the gradient used for updating w should be generated by learning participant k and should be re-encrypted with the public key pk_k of learning participant k, so that machine k can further process the gradient.
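The per-machine update rule can be sketched as follows, with the global weight vector partitioned into N sub-vectors that are updated independently; the sizes and gradient values are arbitrary placeholders:

```python
N = 4
alpha = 0.1

# Global weight vector W split into N parts, one per machine.
W = [[0.5, -0.2], [1.0], [0.3, 0.7, -0.1], [0.0]]

# Gradient vectors G_i, each produced by learning participant i
# for its own part of the model.
G = [[0.1, 0.2], [-0.4], [0.0, 0.3, 0.1], [0.2]]

# Machine i applies W_i := W_i - alpha * G_i on its own part only.
for i in range(N):
    W[i] = [w - alpha * g for w, g in zip(W[i], G[i])]

print(W[1])  # the updated part held by machine 2
```

Each machine touches only its own sub-vector, which is what allows the N updates to proceed asynchronously in parallel.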

Figure 1. An example of model parallelism.

Gradients Leaking Information
The authors of [15] have shown that even a small portion of the gradients can leak the data providers' original training data. They proved this with four examples: the simplest situation with only one neuron, general neural networks, neural networks with regularization, and the most complex one, with Laplace noise added to the gradients. Here we restate the simplest example.
We focus on the learning process of a single neuron. Suppose (x, y) is the data item, where x = (x_1, x_2, ..., x_n) is the neuron input vector and y is the corresponding truth label. The cost function is set as the distance between y and y_predict, where y_predict is computed through the activation function f fed with x: y_predict = f(Σ_{i=1}^{n} w_i x_i + b), with b the bias value. Hence the cost function is J = (y_predict − y)². The gradients can then be computed as g_i = ∂J/∂w_i = 2(y_predict − y) · f'(Σ w_i x_i + b) · x_i and g = ∂J/∂b = 2(y_predict − y) · f'(Σ w_i x_i + b). So the data item component x_i can be revealed by computing x_i = g_i / g when the gradients are known, and the truth value y can be guessed when the input data item x = (x_1, x_2, ..., x_n) is an image, according to [15].
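This recovery can be checked numerically. The sketch below uses a sigmoid activation and arbitrary example values (all numbers are illustrative) and recovers each x_i as the ratio g_i / g:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary example: private input x, truth label y, current weights and bias.
x = [0.8, -1.5, 2.0]
y = 1.0
w = [0.3, -0.1, 0.25]
b = 0.05

z = sum(wi * xi for wi, xi in zip(w, x)) + b
y_pred = sigmoid(z)

# Gradients of the cost J = (y_pred - y)^2 for a single neuron,
# using sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
common = 2 * (y_pred - y) * y_pred * (1 - y_pred)
g = common                               # gradient w.r.t. the bias b
grads = [common * xi for xi in x]        # gradient w.r.t. each weight w_i

# An observer of the gradients recovers the private input: x_i = g_i / g.
recovered = [gi / g for gi in grads]
print(recovered)  # equals x up to floating-point error
```

The common factor cancels in the ratio, so the recovery works for any differentiable activation and any current weights, as long as g is nonzero.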

System Architecture
The system architecture is shown in Figure 2. There are three kinds of parties in our system: learning participants, KTS and DSP. Their functions are described in detail as follows.

Learning Participant
In our system, the learning participants (LPs) act as both data providers and data requesters. In other words, the learning participants provide the newly computed gradients for the servers to update the weight variables, and request the updated weight variables for the next gradient generation process. Since the learning participants are distributed in our system, the training datasets are distributed as well.
To preserve the data privacy of the learning participants, the gradients must be encrypted before being sent to DSP for further processing. Unlike the encryption method in [15], where all participants use one jointly generated key pair (sk, pk), learning participants in our system encrypt their own gradients with the Diffie-Hellman key PK generated by KTS and DSP. PK is known to all learning participants, but its corresponding private key is not revealed to any of them. As a result, no learning participant can decrypt the ciphertexts of other learning participants to obtain their private data, and each learning participant therefore does not need to set up an independent secure channel to communicate with the servers (KTS and DSP).
Every learning participant (suppose there are N participants) implements the following steps.
1. Generate its own key pair.The public key is public to all parties of this scheme.
2. Select a mini-batch of data from its own local training dataset randomly for this iteration.
3. Download the encrypted weights updated in the previous iteration from DSP.
4. Decrypt the above encrypted weights with its own private key.
5. Compute new gradients via partial derivatives, using the data obtained at step 2 and the weights obtained at step 4.
6. Encrypt the new gradients with the Diffie-Hellman key PK and send the ciphertexts to KTS.

Key Transform Server
An innovative and important design of our system is that a KTS is responsible for parameter generation and the first phase of re-encryption (FPRE). KTS mainly performs the following operations.
1. Select a parameter k and two large primes p, q randomly, where Len(p) = Len(q) = k.
2. Compute n = p · q and choose a generator g with maximal order [29].
3. Generate its key pair, then negotiate the Diffie-Hellman key with DSP.
4. Receive the ciphertexts from the learning participants and perform the FPRE over them.
5. Send the re-encrypted ciphertexts to the corresponding computing unit CU_i of DSP (i = 1, 2, ..., N).

Data Service Provider
The data service provider (DSP) is responsible for the second phase of re-encryption and the weights update. To make full use of the computational power of DSP and to facilitate model parallelism, DSP is split into N parts, named computing units (CUs) in our system. The steps DSP executes are as follows.
1. Generate its key pair.
2. Receive the re-encrypted ciphertexts from KTS and perform SPRE on them.
3. Update the weights using the encrypted gradients obtained in step 2.
4. Store the updated encrypted weights in the corresponding computing unit CU_i of DSP (i = 1, 2, ..., N).

System Realization
In this section we use proxy-invisible homomorphic re-encryption to realize the system described in Section 3. All steps of our proposed system, also shown in Figure 3, are listed in sequence as follows.
1. ParamGen: (k, p, q) → (g, n). First, KTS selects k as a security parameter and two large primes p and q, where Len(p) = Len(q) = k (Len(x) is the bit length of the input data x). Then KTS chooses a generator g with maximal order according to [29] and computes n = p · q. Finally, KTS publishes the public system parameters (g, n) to all entities in our system.
2. KeyGen: (g, n) → ((sk, pk), PK). Every learning participant generates its own key pair (sk_i, pk_i), with pk_i = g^{sk_i} mod n² (i = 1, 2, ..., N); KTS and DSP generate their key pairs (sk_KTS, pk_KTS) and (sk_DSP, pk_DSP) respectively:
pk_KTS = g^{sk_KTS} mod n² = g^a mod n²; pk_DSP = g^{sk_DSP} mod n² = g^b mod n². Then KTS and DSP negotiate their Diffie-Hellman key PK: KTS sends its public key pk_KTS = g^a mod n² to DSP, and DSP sends its public key pk_DSP = g^b mod n² to KTS, so both KTS and DSP can calculate PK = g^{a·b} mod n² respectively.
Finally, the public keys are published to all entities in our system.
3. Initialization: since we adopt model parallelism, the deep neural network is separated into N parts (N is the number of learning participants). A machine (consisting of a learning participant, KTS and a computing unit of DSP) is responsible for training one part of the network. The weight variables in each part of the network form a weight vector, so N weight vectors are assigned to the N machines respectively. Before the training process, the weight vectors are initialized by KTS and shared by all parties of the system. For clarity, we denote by W_i^{(j)} the weight vector updated by machine i in the j-th weights update iteration, and by G_i^{(j)} the gradient vector generated by learning participant i in the j-th iteration and used for updating W_i^{(j)}. The initial weight vectors are denoted as W_1^{(0)}, ..., W_N^{(0)}. The crossing weights connecting machine i and machine k (k > i) are assigned to machine k.
4. Data encoding: generally, weights and gradients are real numbers, but homomorphic re-encryption requires the numbers to be encrypted to be integers. Therefore, before encrypting weights and gradients, a data encoding process is performed: a real number x ∈ R is represented by an integer with n bits of precision, and we round down x · 2^n as the encoding result of the real number x.
5. Gradients generation and encryption: take the first weights update iteration (j = 1) as an example.
First, every learning participant (take learning participant i as an example) computes the gradient vector G_i^{(1)} using a mini-batch of data selected randomly from its local dataset and the initial weight vector W_i^{(0)}. Next, the components of the vector −α · G_i^{(1)} are encoded into integers. Finally, every learning participant encrypts the components of its own vector −α · G_i^{(1)} with the Diffie-Hellman key PK and sends them to KTS.

KTS encrypts each component of the initial weight vector W_i^{(0)} with PK.

6. First phase of re-encryption (FPRE): after receiving the ciphertexts [−α · G_i^{(1)}]_PK (i = 1, 2, ..., N) from learning participants 1, 2, ..., N respectively, KTS performs FPRE over them. Take the ciphertext received from learning participant i as an example: KTS first computes the hash value h_1 and then the re-encrypted intermediate ciphertext. It is noticeable that if a gradient component is used for updating a crossing weight, it should be re-encrypted with the public key of the learning participant the crossing weight is assigned to.

7. SPRE and homomorphic addition: DSP receives the intermediate ciphertexts [−α · G_i^{(1)}]^+ (1 ≤ i ≤ N) from KTS and stores them in the corresponding computation units CU_i. Each computation unit performs SPRE; taking CU_i as an example, it computes the hash value h_2 and obtains the ciphertexts E_{pk_i}(−α · G_i^{(1)}), which now have the property of additive homomorphism. Computation unit CU_i of DSP can therefore update the weight vector homomorphically: E_{pk_i}(W_i^{(1)}) = E_{pk_i}(W_i^{(0)}) · E_{pk_i}(−α · G_i^{(1)}), which corresponds to the plaintext update W_i^{(1)} = W_i^{(0)} − α · G_i^{(1)}. The result E_{pk_i}(W_i^{(1)}) is stored in the corresponding computation unit CU_i of DSP.
8. Decryption: from the second iteration of the weights update process onward, each learning participant downloads the updated weight vector from its corresponding computation unit of DSP; that is, learning participant i downloads E_{pk_i}(W_i^{(j)}). Then learning participant i decrypts the downloaded E_{pk_i}(W_i^{(j)}) with its private key sk_i, using the Dec algorithm of HRES.
Finally, the downloaded W_i^{(j)} is decoded back into real numbers and used for the next gradient generation. After wu iterations, each learning participant obtains its final weight vector W_i^{(wu)} by downloading and decrypting E_{pk_i}(W_i^{(wu)}) from the CU_i of DSP (1 ≤ i ≤ N) respectively.
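The data encoding of step 4 above can be sketched as follows. The 16-bit precision and the sample values are arbitrary, and mapping negative numbers into Z_n via modular reduction is a common fixed-point convention assumed here rather than taken verbatim from the paper:

```python
import math

PRECISION = 16          # bits of fractional precision (example value)
MODULUS = 1019 * 1031   # toy plaintext modulus n = p * q

def encode(x):
    """Encode a real number as floor(x * 2^PRECISION), reduced modulo n."""
    return math.floor(x * (1 << PRECISION)) % MODULUS

def decode(v):
    """Map back from Z_n, treating values above n/2 as negative."""
    if v > MODULUS // 2:
        v -= MODULUS
    return v / (1 << PRECISION)

# A gradient component scaled by -alpha survives an encode/decode round trip.
g = -0.03125
print(decode(encode(g)))  # -0.03125 (exactly representable in 16 bits)
```

Rounding down, as in the paper, loses at most 2^(-PRECISION) per value; encoded values then add homomorphically as long as the running sums stay well below the modulus.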

Security Analysis
In this section, we present the computational hardness assumption on which our scheme is based, and then study the security of our scheme.

Assumption
Decisional Diffie-Hellman (DDH) problem [31] over Z*_{n²}: for every probabilistic polynomial-time distinguisher F, there is a negligible function negl() such that for sufficiently large l, |Pr[F(g, g^x, g^y, g^{x·y}) = 1] − Pr[F(g, g^x, g^y, g^z) = 1]| ≤ negl(l), where x, y, z are chosen at random. That is to say, given g^x and g^y, the probability that an adversary distinguishes g^z from g^{x·y} is negligible.

Security of Our Scheme
In this section, we analyze the security of our scheme. As mentioned in Section 1.1, we consider the scenario in which KTS and DSP do not collude with each other, but KTS or DSP may collude with one of the learning participants. First we prove that our scheme is secure in the presence of semi-honest adversaries (A_LP, A_KTS, A_DSP) under the non-colluding setting. Then we analyze the security of our system when there are colluding adversaries.
Proof. Our scheme is secure in the presence of semi-honest adversaries under the non-colluding setting. Our scheme is based on the HRES, which is proved to be semantically secure in [28] based on the difficulty of the Decisional Diffie-Hellman problem above; we omit its proof here to avoid repetition. Our scheme involves three types of entities: LP, KTS and DSP. Three kinds of challengers, C_LP, C_KTS and C_DSP, are constructed against the three kinds of adversaries A_LP, A_KTS and A_DSP, who attempt to corrupt LP, KTS and DSP respectively.
When a new gradient G is generated, C_LP challenges A_LP as follows: when it comes to decryption of the updated weight, C_LP chooses [m']_{pk_j} at random and decrypts it; the result m' is sent to A_LP, and if A_LP responds with ⊥, C_LP returns ⊥. The result m' constitutes the view of A_LP, which is indistinguishable between the ideal and real executions because of the security of HRES. The views of A_KTS and A_DSP are indistinguishable for the same reason. Therefore our scheme is secure in the presence of semi-honest adversaries under the non-colluding setting. Next, we illustrate that our system remains secure even when KTS or DSP colludes with one of the learning participants. We suppose that the learning participants, KTS and DSP are honest-but-curious; that is, all parties in the system perform the executions they should, following the system steps, but may try to obtain the local data of learning participants. We analyze the two situations as follows: KTS colludes with one of the learning participants, or DSP colludes with one of the learning participants.
Situation 1: KTS colludes with one of the learning participants. If KTS colludes with learning participant A, they can share information with each other and try to obtain the private information of the other learning participants. That is, KTS can get the gradients of A, and even the private key of A; A can get the private key of KTS, too. In the scheme of [15], if the cloud server colludes with a learning participant and obtains the participant's private key, the server can decrypt the gradients of all learning participants, because their private keys are identical; since gradients leak information about the original data, the private data of all learning participants is then unsafe. This is impossible in our proposed scheme, because all learning participants encrypt their gradients with the Diffie-Hellman key generated jointly by KTS and DSP: KTS cannot decrypt the gradients by itself, even with the private key of a learning participant. Moreover, the generation of the hash value h_2 resists impersonation attacks and collusion between KTS and any learning participant, so KTS has to perform the steps of the first phase of re-encryption described in Section 4 honestly. Therefore, the private data of learning participants is secure in our scheme even if KTS colludes with one of the learning participants.
Situation 2: DSP colludes with one of the learning participants. If DSP colludes with learning participant B, the situation is similar to Situation 1: DSP can get the gradients of B and even B's private key, and B can get the private key of DSP as well as all the re-encrypted weights and gradients. Because the generation of the hash value h 1 resists impersonation attacks and collusion between DSP and any learning participant, DSP must perform the steps in the second phase of re-encryption described in Section 4 honestly. With the private key of learning participant B, DSP can only decrypt the gradients from B, and learns nothing about the gradients of the other learning participants. Therefore, the original data of the learning participants are safe even if DSP colludes with one of the learning participants.
To sum up, our scheme is resistant to collusion between either cloud server and any learning participant, a property that the scheme of [15] does not provide.
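The collusion resistance above rests on splitting the decryption capability between KTS (first-phase re-encryption, FPRE) and DSP (second-phase re-encryption, SPRE). The following is a minimal sketch of that split-key idea; it is NOT the paper's HRES (which is built on a Paillier-style system), but an additively homomorphic "lifted" ElGamal stand-in with the joint secret key shared as sk = sk_kts + sk_dsp, so that neither server alone, even holding one participant's key, can decrypt an aggregated ciphertext. All parameter choices (the Mersenne prime modulus, generator 3) are illustrative assumptions.

```python
# Hedged sketch: split-key re-encryption with lifted ElGamal, standing in for
# the paper's HRES. sk = sk_kts + sk_dsp; ciphertexts under the joint key PK
# need BOTH servers' shares before they can be turned over to a target key.
import random

P = 2**127 - 1      # toy group modulus (a Mersenne prime) -- assumption
G = 3               # group element used as generator -- assumption

def keygen():
    sk = random.randrange(2, P - 1)
    return sk, pow(G, sk, P)

# Joint key: PK = g^(sk_kts + sk_dsp); neither share alone suffices.
sk_kts, _ = keygen()
sk_dsp, _ = keygen()
PK = pow(G, sk_kts + sk_dsp, P)

def enc(m, pk):
    """Encrypt a small integer m under pk as (g^r, g^m * pk^r)."""
    r = random.randrange(2, P - 1)
    return pow(G, r, P), (pow(G, m, P) * pow(pk, r, P)) % P

def h_add(c, d):
    """Homomorphic addition of plaintexts: componentwise multiplication."""
    return (c[0] * d[0]) % P, (c[1] * d[1]) % P

def fpre(c, sk1):
    """First-phase re-encryption (KTS): strip the sk_kts share."""
    c1, c2 = c
    return c1, (c2 * pow(c1, P - 1 - sk1, P)) % P   # c1^(-sk1) via Fermat

def spre(c, sk2, pk_j):
    """Second-phase (DSP): strip the sk_dsp share, re-encrypt under pk_j."""
    c1, c2 = c
    gm = (c2 * pow(c1, P - 1 - sk2, P)) % P          # now just g^m
    r = random.randrange(2, P - 1)
    return pow(G, r, P), (gm * pow(pk_j, r, P)) % P

def dec(c, sk_j, bound=10**6):
    """Decrypt under sk_j; recover the small exponent by linear search."""
    c1, c2 = c
    gm = (c2 * pow(c1, P - 1 - sk_j, P)) % P
    acc = 1
    for m in range(bound):
        if acc == gm:
            return m
        acc = acc * G % P
    raise ValueError("plaintext out of range")
```

Usage mirrors the protocol flow: participants encrypt gradients under PK, DSP aggregates homomorphically, KTS applies `fpre`, DSP applies `spre` toward a participant's key pk_j, and only the holder of sk_j decrypts the sum.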

Performance Evaluation
In this section, we evaluate the communication cost and computational cost of our proposed scheme through theoretical analysis and simulation. First, we introduce the concrete parameter settings used in our experiments.
It is shown in [28] that the length of n, Len(n), greatly influences both the computational efficiency and the security of the HRES. Experimental results showed that a larger Len(n) means longer communication time but higher security. To balance efficiency and security, we set Len(n) = 1024 bits, which according to [28] guarantees that learning participants with limited resources can finish the decryption process efficiently.
In the experiment, we generated all private keys randomly. The influence of the private-key length on HRES was tested in [28]; the results show that for key lengths between 100 and 200 bits the computational costs are similar. To compare with the scheme in [15], we set the private-key length to 128 bits, which provides the same security-parameter length as in [15].
In the data encoding process, we set the precision to prec = 32; that is, data were encoded as 32-bit integers. We set N = 10, meaning the deep neural network was divided evenly into 10 parts and there were 10 learning participants, 10 machines and 10 weight vectors.
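The 32-bit encoding step can be sketched as a fixed-point mapping from real-valued gradients to integers. The split between integer and fractional bits below (`FRAC = 20`) is an assumption for illustration; the paper only fixes the total width prec = 32.

```python
PREC = 32    # total bits per encoded gradient (from the paper)
FRAC = 20    # fractional bits -- an assumed split, not specified in the paper

def encode(x):
    """Map a real gradient to a PREC-bit two's-complement integer."""
    v = round(x * (1 << FRAC))
    assert -(1 << (PREC - 1)) <= v < (1 << (PREC - 1)), "gradient out of range"
    return v & ((1 << PREC) - 1)    # two's-complement representation

def decode(v):
    """Invert encode(): reinterpret the PREC-bit value as a signed fixed-point number."""
    if v >= 1 << (PREC - 1):        # high bit set => negative value
        v -= 1 << PREC
    return v / (1 << FRAC)
```

With 20 fractional bits the round trip loses at most 2^-21 of precision, which is negligible relative to typical gradient magnitudes.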

Communication Cost Analysis
In this section we discuss the total communication cost of our scheme, C our system . There are three kinds of communication, as shown in Figure 3. To estimate the communication cost of our scheme, we first design the following formulas, where in Equation (43) C one iteration is the communication cost of the jth iteration. Next, we discuss the cost of transmitting one ciphertext between two parties. A ciphertext [m i ] in our proposed scheme is composed of two components, [m i ] = (T i , T i ′), where T i = (1 + m i • n) • PK^r mod n² and T i ′ = g^r mod n². The lengths of T i and T i ′ are determined by n²: each has 2 • Len(n) bits, so uploading or downloading one ciphertext requires transmitting 4 • Len(n) bits. Since we set Len(n) = 1024 bits, transmitting one ciphertext in our scheme means transmitting 4096 bits.
Finally, according to our system realization described in Figure 3, we obtain the communication cost of the first iteration, the cost of each iteration 2 ≤ j ≤ N wu , and from these the total communication cost of our scheme. To compare the communication cost of our scheme with that of the scheme in [15], we analyzed the increased communication factor of our scheme. Since Len(n) = log2 n = 1024 bits in our system, we can pack t = ⌊log2 n / (prec + pad)⌋ real numbers (after encoding into integers) into one HRES plaintext. We set the precision to prec = 32 bits in the encoding process and use pad = ⌈log2 N wu ⌉ bits to prevent overflows in ciphertext additions [15], so each ciphertext in our scheme can encrypt t gradients. When the number of weight-update iterations N wu is large enough, the increased factor of our scheme approaches 6. The comparison of communication cost between our scheme and the latest works can be found in Table 2.
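The packing capacity t can be computed directly from the parameters above. The sketch below assumes the padding is ⌈log2 N wu ⌉ bits, i.e. just enough headroom so that summing up to N wu ciphertexts cannot overflow a packed slot.

```python
import math

LEN_N = 1024   # Len(n) = log2 n, bits of one HRES plaintext
PREC = 32      # bits per encoded gradient

def pack_capacity(n_wu):
    """Encoded gradients per HRES plaintext: t = floor(Len(n) / (prec + pad))."""
    pad = math.ceil(math.log2(n_wu))   # overflow headroom for n_wu additions
    return LEN_N // (PREC + pad)
```

For example, with N wu = 1000 weight-update rounds, pad = 10 bits and each slot occupies 42 bits, so 24 gradients fit in one 1024-bit plaintext.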
According to the simulation, the average running time of the gradient-generation process when training the MLP in our scheme was T gradients generation = 1.3 ms, and the HRES running time for processing the 109,386 gradients of the network with 10 learning participants in one weights-update process was T HRES = 5192.4 ms; together these give the running time of one weights-update process in our system. The average running time per weights-update iteration for each step of homomorphic re-encryption in our scheme, including encryption, FPRE, SPRE, homomorphic addition and decryption, is depicted in Figure 4, denoted encTime, FPRETime, SPRETime, addTime and DecTime respectively. As mentioned above, the learning participants are responsible for encryption and decryption, KTS for FPRE, and DSP for SPRE and addition, so we calculated their computational costs separately in Table 4 (accurate to two decimal places). DSP had the maximum computational overhead (55%). The learning participants together accounted for 33% of the total computational cost; since there were 10 learning participants and the model was divided evenly, the average cost of each learning participant was only 3.3%, much lower than that of KTS or DSP. This configuration is reasonable because the computational capability of a cloud server is generally much higher than that of an individual data provider. The experimental comparison is shown in Figure 5 and Table 5. In Figure 5, T_enc of our scheme consists of three parts: the running times of encryption, FPRE and SPRE. In Table 5, T weights update is the running time of using our scheme to process the gradients and update the weights. It consists of five parts in our scheme: encryption (1112.4 ms) and decryption (584.1 ms) by the learning participants, FPRE (630.2 ms) by KTS, and SPRE (615.1 ms) and homomorphic addition (2250.6 ms) by DSP. The LWE-MLP scheme in [15], on the other hand, consists of three parts: encryption (899.2 ms) and decryption (785.4 ms) by the learning participants, and homomorphic addition (278.9 ms) by DSP. The running time of our system was about 2.64 times that of LWE-MLP [15]. Although the total running time of our system was longer, the computational cost of the learning participants was similar to that of [15], as shown in Table 6. The model accuracy was 97.1%, slightly higher than that in [15].
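The per-party percentages can be re-derived from the five per-step times reported above; note that they sum exactly to T HRES = 5192.4 ms.

```python
# Per-step running times in ms, as reported in the text (one weights-update).
times = {
    "enc": 1112.4, "dec": 584.1,    # learning participants
    "FPRE": 630.2,                  # KTS
    "SPRE": 615.1, "add": 2250.6,   # DSP
}

total = sum(times.values())                   # = T_HRES = 5192.4 ms
lp_share  = (times["enc"] + times["dec"]) / total   # all 10 LPs together
kts_share = times["FPRE"] / total
dsp_share = (times["SPRE"] + times["add"]) / total
per_lp    = lp_share / 10                     # model split evenly over 10 LPs
```

This reproduces the shares quoted above: roughly 33% for the learning participants collectively (3.3% each), 12% for KTS and 55% for DSP.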

Conclusions
Considering that the previous distributed deep learning scheme, which shares a single key pair for encryption, suffers from collusion between the cloud server and any learning participant [15], we propose a novel system using homomorphic re-encryption that realizes privacy-preserving distributed deep learning. We give the detailed realization steps of our system and implement them for validation. Furthermore, security analysis and performance evaluation are provided. Specifically, the communication cost of our scheme is tolerable. Experimental results show that although the running time of our system is larger than that of LWE-MLP in [15], the computational overhead of the learning participants is similar.
is decrypted and can be used in the next gradient-generation iteration, or in the deep learning model configuration when the training is finished. 9. Iteration: when the training is not finished, each learning participant repeats steps 4 to 8 to iterate the weights-update process. In other words, learning participant i computes the gradient vector G (j+1) i with another mini-batch of data from its local dataset and W (j) i . At the end of the training process, all the learning participants obtain the ultimate weight vector W. First, C LP multiplies it by −α and encodes the result. Then C LP encrypts the encoded result with PK as [m] PK , sends [m] PK to A LP , and outputs the entire view of A LP : [m] PK . A LP 's views in the ideal and real executions are indistinguishable because of the security of HRES. C KTS challenges A KTS as follows. C KTS runs encryption on two randomly chosen integers with PK as [G] PK and [w] PK . Then [G] PK is multiplied by [w] PK . Next, C KTS performs the FPRE on the result to get [m] + , which is sent to A KTS . If A KTS responds with ⊥, C KTS returns ⊥. A KTS 's views consist of the encrypted data. A KTS gets the same outputs in both the ideal and real executions because the LPs are honest and the HRES is proved to be secure. Therefore A KTS 's views are indistinguishable. C DSP challenges A DSP as follows. C DSP first chooses [m] + at random and performs the SPRE on it with C DSP 's private key to get [m] pk j . Then [m] pk j is sent to A DSP . If A DSP responds with ⊥, C DSP returns ⊥. A DSP 's views consist of the encrypted data. A DSP gets the same output [m] pk j in both the ideal and real executions because of the security of HRES. Therefore A DSP 's views are indistinguishable.
The three kinds of communication are: communication between learning participants and KTS (C p−k ), between KTS and DSP (C k−d ), and between DSP and learning participants (C d−p ).

Figure 4. (a) Average running time for every step. (b) Percentage of running time for every step.

Table 3. Experimental settings in model training.

Table 4. Average computational cost of parties in our scheme.