Article

Privacy-Preserving Data Sharing and Computing for Outsourced Policy Iteration with Attempt Records from Multiple Users

Key Laboratory of Internet Information Retrieval of Hainan Province, School of Cyberspace Security, Hainan University, Haikou 570228, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2624; https://doi.org/10.3390/app15052624
Submission received: 30 January 2025 / Revised: 26 February 2025 / Accepted: 27 February 2025 / Published: 28 February 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Reinforcement learning is a machine learning framework in which an agent learns, through extensive trial and error during its interaction with the environment, the best policy for maximizing the cumulative reward. In practice, the computing resources of a single user are limited, so the cooperation of multiple users is needed; however, joint learning by multiple users introduces the problem of privacy leakage. This research proposes a method that combines the homomorphic properties of cryptographic algorithms with a multi-key ciphertext fusion mechanism, allowing multiple users to safely share their attempts in an encrypted state and to perform reinforcement learning with an outsourcing service that reduces the users’ calculations. The proposed scheme is provably secure, and the experimental results show that it has an acceptable impact on performance while ensuring privacy protection.

1. Introduction

Artificial intelligence has become a frontier of human exploration [1] and has been widely applied in different scenarios. As more complicated problems are expected to be solved by artificial intelligence, reinforcement learning has attracted attention. In reinforcement learning, the agent optimizes its policy through the rewards or punishments obtained from interactions with the environment under different states. Due to this characteristic, it is considered an effective method for continuous decision-making problems, but the performance of reinforcement learning was limited by computing power in its early stages.
With the rapid development of computing and information technology, deep learning has helped reinforcement learning perform better, so it is used in more and more areas [2], such as industrial manufacturing, robot control, logistics management, national defense and the military, intelligent transportation and intelligent healthcare. The training of famous large language models such as ChatGPT [3] and DeepSeek-R1 [4] also uses reinforcement learning. As artificial intelligence is used more in scientific research [5], reinforcement learning has entered many fields of basic science, such as the control of tokamak plasmas [6] and the discovery of faster matrix multiplication algorithms [7].
Due to the complexity of real-world problems, researchers have extended reinforcement learning from a single agent to systems with multiple agents. Multi-agent reinforcement learning provides strong support for modeling and solving problems in autonomous driving [8], intelligent inventory management [9], wireless networks [10], etc.
All the developments above make reinforcement learning models more complex, and more data are needed to train them. Because the data of a single user are often insufficient to train a complex model, the cooperation of multiple users with their private data is needed, which makes privacy leakage a major risk factor in the artificial intelligence training process, as mentioned in [11].
This dilemma is especially evident in reinforcement learning, because the trial-and-error cost that a single user can afford is often not enough to train a model which can perform well. Therefore, a method of fusing the attempt records of multiple users for reinforcement learning training is necessary. At the same time, this approach needs to be able to protect the privacy of users and support outsourced computing to make reasonable use of computing resources.
To protect the privacy of different users while performing computations on their secrets, we have built a data sharing system in which users encrypt their data with their own unique keys and can later provide conversion keys to authorize the server’s use of the data. We have also designed a secret computation process that the server can carry out without user interaction.
There are two main challenges in keeping privacy protected while improving the speed of computation in the encrypted state. First, since it is necessary to count how many times different actions are taken under different states, we must determine whether one ciphertext equals another without weakening security. Second, we cannot allow the server to establish a connection between ciphertext and plaintext, which leaves the policy maintained by the server out of order and affects the performance of the model.
To address the first challenge, we redesigned the operational details of proxy re-encryption so that equality of ciphertexts cannot be determined before conversion, while converted ciphertexts show hash-like consistency; moreover, ciphertext conversion can only be performed with the owner’s permission. To address the second challenge, we prioritize states according to the action with the highest probability in the existing policy, which increases the chance of retrieving the corresponding state first and thereby determines the corresponding action more quickly. The main contributions of this paper are as follows:
  • Our solution firmly maintains user rights: different users can encrypt their data with unique keys and control their use. We use partially homomorphic encryption (PHE) to improve the versatility of the system, while the ciphertexts under the same mechanism can meet the special needs of reinforcement learning tasks.
  • Our solution takes both security and computational efficiency into account. It makes the efficiency of reinforcement learning in the encrypted state similar to that on plaintext while user privacy is protected, which greatly improves the usability of the solution.
  • Our design does not affect the dynamic learning characteristics of the reinforcement learning model. The continuous iteration process requires no additional communication, and the policy can adapt to new changes during use.

2. Related Works

To protect user privacy, a variety of tools are available, such as homomorphic encryption, secret sharing, oblivious transfer and garbled circuits. In general, homomorphic encryption requires more computing power, while secret sharing incurs a large communication overhead. Oblivious transfer and garbled circuits rely on specific hardware implementations, so they are not universal. In this privacy-preserving outsourced policy iteration method, both the statistics and the policy updates should run on the server side, while the specific values of states and actions should be encrypted by the user, which raises the need to perform calculations on secret data. To minimize unnecessary communication and user-side computing overhead, we consider using homomorphic encryption to protect user privacy.
Homomorphic encryption refers to algorithms that support calculations on encrypted data. Since the operations supported by a homomorphic encryption scheme usually cannot cover all operations available on plaintext, homomorphic encryption is divided into three categories: fully homomorphic encryption (FHE), somewhat homomorphic encryption (SWHE) and partially homomorphic encryption (PHE).
The first FHE scheme was introduced by Gentry et al. in [12]; with it, any polynomial function can be calculated in the encrypted state, but its computational cost is too high to be usable in actual scenarios. Therefore, the BGN algorithm [13] is more often used: it preserves efficiency by reducing the supported operations to a limited number of additions and multiplications, making it an SWHE scheme.
Moreover, PHE is another way to preserve efficiency; it supports only a single type of operation, usually addition or multiplication. If a PHE scheme supports addition, there is a homomorphic addition operation +̂ on ciphertexts E(m_1) and E(m_2) such that E(m_1 + m_2) = E(m_1) +̂ E(m_2). Similarly, if multiplication is supported, there is a homomorphic multiplication operation ×̂ such that E(m_1 × m_2) = E(m_1) ×̂ E(m_2). The detailed operations behind +̂ and ×̂ are defined individually in each PHE algorithm. The most widely used PHE is Paillier, proposed in [14].
The basic way to calculate on data encrypted with different keys is to preserve homomorphism across the different keys, as [15,16] have done. This approach is called multi-key homomorphic encryption (MKHE), but we do not adopt it because of its heavy computing cost.
Therefore, a way to switch the key of a ciphertext is needed, as introduced by Zhang et al. in [17]. They use key-switching matrices to change different keys into a unified one, which is efficient; in essence, however, this method reduces the security of multiple keys to that of a single key. That is to say, the matrix weakens the security of the original scheme, which is undesirable.
To keep the security unchanged while considering efficiency, many methods such as [18,19] introduce two servers. In [18], the ciphertexts are decrypted and then encrypted again by the second server, which is a simple approach, but the second server keeps all keys, so its loyalty and integrity must be ensured. In [19,20], proxy re-encryption and calculation are divided between different servers, but interactions between them are still needed to accomplish the conversion and calculation process, which is complicated and significantly increases the communication consumption.
Since the communication cost is huge, refs. [21,22] tried to remove the ciphertext conversion process, but they cannot reduce the communication during calculation. To eliminate the second server, the scheme proposed by Ma et al. in [23] performs the re-encryption between the server and the users, but the communication cost is still high. In summary, these schemes avoid rather than solve the problems they aim at.
Turning to actual scenarios similar to ours, ref. [24] shows a solution for aggregating a group of numbers from many participants, but privacy can be harmed because users may retain some information, and retreated users could violate the privacy of the users left. To protect this privacy, DeePAR [25], proposed by Zhang et al., realizes a complete aggregation under different keys, but the users’ secret keys are needed during the re-encryption process. To avoid this, Xiaoyin Shen et al. proposed a scheme in [26]. This method introduces fog nodes so that the calculation can be done with only one server and the communication cost is kept relatively low, but it introduces a new step of second-time gathering, which brings new cost.
In general, most related works aiming to perform calculations on separated datasets only take arithmetic operations such as addition, subtraction, multiplication and division into consideration, or even only a part of them, while our scheme additionally supports the judgment of equality. The detailed differences between our scheme and other methods that can perform similar fusion and calculation on differently encrypted data are shown in Table 1. Our scheme achieves each function with the fewest servers.
In [15,16], a homomorphic encryption under multiple keys is realized directly; no ciphertext conversion is needed, but a high calculation cost is incurred. Ref. [17] introduces a matrix to change one key into another for higher efficiency, but security is degraded. Refs. [18,19] reduce the calculation cost but introduce too much communication cost instead. Refs. [21,22] delete the step of ciphertext transformation to reduce the communication between the two servers, but communication is still unavoidable during calculation. Ref. [26] finally gets rid of the second server, as our scheme does, but our scheme additionally supports a safe judgment of equality.

3. Preliminaries

The main techniques involved in the proposed scheme can be summarized in two aspects: proxy re-encryption and homomorphic encryption. This section introduces these two techniques as well as policy iteration.

3.1. Proxy Re-Encryption

Proxy re-encryption, first proposed in [27], offers a method to convert a ciphertext c_1 into a different ciphertext c_2 such that both can be decrypted to the same plaintext m. This enhances the security of a cryptographic algorithm by avoiding the repeated use of one ciphertext: a ciphertext can be transformed into another without decryption and re-encryption, so the possibility of exposing the plaintext m is minimized.
The typical scenario of proxy re-encryption can be described as follows. Alice wants to send her data m to several users safely. To ensure security, the data m would have to be encrypted with the public key of each user; with a large number of users, this requires too much computation for Alice. Instead of encrypting the same data many times, she can encrypt the data once with an intermediate key and offer transform keys to the receivers. The receivers can then convert the ciphertext into one they can decrypt with their own secret keys. This conversion is completed by the receivers, which transfers the computing cost from the single sender to the multiple receivers without exposing the plaintext m, so privacy is protected.
In our scheme, a variant of proxy re-encryption is used in Section 5.4: it converts a ciphertext associated with the secret key s_t − s_{t_i} into one under the KGC’s secret key, using a re-encryption key corresponding to s_{t_i}. The correctness is verified in Section 6.

3.2. Homomorphic Encryption

Homomorphic encryption refers to cryptographic algorithms that possess homomorphism. As a concept from abstract algebra, a homomorphism is a mapping between two algebraic structures of the same kind that preserves the relevant structure; that is, properties such as identity, inverses and the binary operations remain compatible. A homomorphism Φ: X → Y satisfies (1):
Φ(u · v) = Φ(u) ∘ Φ(v)
in which “·” denotes an operation on X and “∘” denotes an operation on Y. Similarly, a homomorphic encryption is an encryption method Enc: M → C that satisfies (2):
Enc(u · v) = Enc(u) ∘ Enc(v)
in which “·” denotes an operation on the plaintext space M and “∘” denotes an operation on the ciphertext space C. This allows people to operate on ciphertexts and directly obtain the ciphertext of the result of the corresponding operation on the underlying plaintexts.
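As a toy illustration of this definition (not part of the proposed scheme), the map Φ(x) = g^x mod p turns addition of exponents into multiplication of group elements; the parameters below are arbitrary demo values:

```python
# Toy homomorphism Phi(x) = g^x mod p: addition of exponents maps to
# multiplication of group elements, so Phi(u + v) = Phi(u) * Phi(v) mod p.
p = 1_000_003   # demo modulus (hypothetical value)
g = 5           # demo base

def phi(x: int) -> int:
    return pow(g, x, p)

u, v = 123, 456
print(phi(u + v) == (phi(u) * phi(v)) % p)  # True
```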

3.3. BCP and BCP Enhance Method

According to the actual demands of the target scenario, the basic PHE algorithm chosen in our scheme is BCP, proposed in [28]. The homomorphic operation supported by this scheme is only addition, an attribute also called linear homomorphism. The operations and processes defined by [28] are as follows:
  • Algorithm initialization: choose two large primes p and q, compute N = pq, and randomly select g = a^{2N} mod N^2 with a ∈ Z*_{N^2};
  • Key generation: randomly choose the secret key s ∈ [1, N^2/2]; the public key is g^s mod N^2;
  • Encryption: the plaintext m must satisfy m ∈ Z_N. To encrypt m, first select a random r ∈ [1, N/4]; the ciphertext E(m) is calculated as (3):
    E(m) = (C_m^1, C_m^2) = (g^r mod N^2, g^{r·s}·(1 + mN) mod N^2)
  • Decryption: m can be recovered from E(m) and s as (4):
    m = ((C_m^2 · ((C_m^1)^s)^{−1} mod N^2) − 1) / N
  • Linear homomorphism: given two ciphertexts E(m_1) = (C_{m_1}^1, C_{m_1}^2) and E(m_2) = (C_{m_2}^1, C_{m_2}^2) under the same secret key s, E(m_1 + m_2) can be calculated as (5):
    E(m_1 + m_2) = E(m_1) +̂ E(m_2) = (C_{m_1}^1 · C_{m_2}^1, C_{m_1}^2 · C_{m_2}^2)
Note that the multiplications on each part of the ciphertexts are all modular operations with modulus N^2.
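The operations above can be sketched as follows. This is a minimal, insecure demonstration of the BCP algebra in (3) to (5) only; the toy primes stand in for the large safe primes a real deployment would require:

```python
# Minimal, insecure sketch of BCP [28]: toy primes demonstrate the
# algebra of Equations (3)-(5); real deployments need large safe primes.
import math
import random

p, q = 1019, 1031
N = p * q
N2 = N * N

while True:                       # pick a with gcd(a, N^2) = 1
    a = random.randrange(2, N2)
    if math.gcd(a, N2) == 1:
        break
g = pow(a, 2 * N, N2)             # g = a^{2N} mod N^2

s = random.randrange(1, N2 // 2)  # secret key
pk = pow(g, s, N2)                # public key g^s mod N^2

def encrypt(m):
    """Equation (3): E(m) = (g^r, g^{r s} (1 + m N)) mod N^2."""
    r = random.randrange(1, N // 4)
    return (pow(g, r, N2), pow(pk, r, N2) * (1 + m * N) % N2)

def decrypt(c):
    """Equation (4): m = ((C2 / C1^s mod N^2) - 1) / N."""
    c1, c2 = c
    u = c2 * pow(pow(c1, s, N2), -1, N2) % N2
    return (u - 1) // N

def hom_add(ca, cb):
    """Equation (5): component-wise product gives E(m1 + m2)."""
    return (ca[0] * cb[0] % N2, ca[1] * cb[1] % N2)

m1, m2 = 17, 25
assert decrypt(encrypt(m1)) == m1
assert decrypt(hom_add(encrypt(m1), encrypt(m2))) == m1 + m2
```

The decryption works because C_m^2 / (C_m^1)^s ≡ 1 + mN (mod N^2), and the homomorphic addition works because (1 + m_1 N)(1 + m_2 N) ≡ 1 + (m_1 + m_2)N (mod N^2).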
Beyond the operations already defined, we can also apply the proxy re-encryption technique to this cryptographic algorithm as follows: if three secret keys s, s_1, s_2 conform to the relationship shown in (6), proxy re-encryption between ciphertexts under them can be performed.
s = (s_1 + s_2) mod N^2
To keep the efficiency of BCP while supporting more necessary operations, a framework proposed by Catalano and Fiore in [29] is used in our scheme. With this method, we can support homomorphic multiplication while keeping the whole system fast. The framework builds a level-2 construction, written E′(m) here, on top of the basic encryption E(·) so that E′(m_1 · m_2) can be calculated. Although the multiplication operation is not used in this paper, we keep it for better security; the necessity of using this method is explained in Section 6.3. In addition, a realignment operation supported by the mechanism of [29] is needed: the plaintext part can be realigned as (7).
E′(m_1) = (a_1, β_1) = [m_1 − b_1, E(b_1)]
E′(m_2) = (a_2, β_2) = [m_2 − b_2, E(b_2)]
E′(m_2) = (a_1, β_2 +̂ E(a_2 − a_1)) = [m_1 − b_1, E(b_2 + a_2 − a_1)]
Note that the encryption E(a_2 − a_1) must be performed with the same key as the original ciphertexts β_1 and β_2; otherwise, Equation (7) does not hold. Of course, we can also align the plaintext parts of E′(m_1) and E′(m_2) to a new value a_3 under the same key, as in (8).
E′(m_1) = (a_3, β_1 +̂ E(a_1 − a_3)) = [a_3, E(b_1 + a_1 − a_3)]
E′(m_2) = (a_3, β_2 +̂ E(a_2 − a_3)) = [a_3, E(b_2 + a_2 − a_3)]
If m_1 = m_2, then b_1 + a_1 − a_3 is equal to b_2 + a_2 − a_3 (both equal m_1 − a_3). Of course, this equality is not directly visible in our scenario; a special method is used in Section 5.5 to determine whether the two values are equal.
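The level-2 construction and the realignment in (8) can be sketched as follows. This is an illustrative toy implementation with insecure parameters, using BCP as the underlying linearly homomorphic scheme; it shows that after realignment to a common a_3, the encrypted components of equal plaintexts encrypt the same value:

```python
# Toy sketch of the degree-2 framework of [29] over BCP: E'(m) = (m - b, E(b)).
# Realigning two ciphertexts to a common first part a3, as in (8), makes
# their encrypted components encrypt m - a3, so equal plaintexts match.
import math
import random

N = 1019 * 1031                   # toy modulus (insecure)
N2 = N * N
while True:
    a = random.randrange(2, N2)
    if math.gcd(a, N2) == 1:
        break
g = pow(a, 2 * N, N2)
s = random.randrange(1, N2 // 2)
pk = pow(g, s, N2)

def encrypt(m):                   # plain BCP encryption of m (mod N)
    r = random.randrange(1, N // 4)
    return (pow(g, r, N2), pow(pk, r, N2) * (1 + m * N) % N2)

def decrypt(c):
    u = c[1] * pow(pow(c[0], s, N2), -1, N2) % N2
    return (u - 1) // N

def hom_add(ca, cb):
    return (ca[0] * cb[0] % N2, ca[1] * cb[1] % N2)

def lvl2_encrypt(m):
    b = random.randrange(N)       # random bias
    return (m - b, encrypt(b))    # first part is kept in the clear

def realign(c, a3):
    first, beta = c               # Equation (8): the new second part
    return (a3, hom_add(beta, encrypt(first - a3)))  # encrypts b + a - a3

m = 77
c1, c2 = lvl2_encrypt(m), lvl2_encrypt(m)
r1, r2 = realign(c1, 5), realign(c2, 5)
assert decrypt(r1[1]) == decrypt(r2[1]) == m - 5
```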

3.4. Policy Iteration

Policy iteration is a basic method for solving Markov decision problems, which suits our research. It uses an iterative approach that avoids computing a matrix inverse, so it has high computational efficiency. If we can build an efficient outsourced policy iteration, it will become a solid foundation for follow-up research.
In policy iteration, we usually initialize the policy to choose each action uniformly. After an attempt, we can evaluate the policy by its performance and then update it. The evaluation is often expressed as (9):
v_{π_k}^{(j+1)} = r_{π_k} + γ P_{π_k} v_{π_k}^{(j)},  j = 0, 1, 2, …
in which v_{π_k}^{(j)} is the value of the last round, v_{π_k}^{(j+1)} is the new one, and γ is the discount factor. In fact, (9) describes the evaluation of an attempt process. The policy is then improved as (10):
π_{k+1} = argmax_π (r_π + γ P_π v_{π_k})
The best policy for a fixed environment must be the greedy deterministic policy, and our policy approaches it gradually.
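A plain-text version of this iteration can be sketched on a hypothetical 3-state, 2-action MDP; the transition probabilities and rewards below are made-up illustration values, not from this paper:

```python
# Tabular policy iteration on a hypothetical 3-state, 2-action MDP;
# transition tensor P and reward table R are made-up illustration values.
P = {  # P[a][s][s2] = probability of moving s -> s2 under action a
    0: [[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]],
    1: [[0.1, 0.9, 0.0], [0.2, 0.3, 0.5], [0.0, 0.1, 0.9]],
}
R = {0: [1.0, 0.0, 0.0], 1: [0.0, 2.0, 1.0]}  # R[a][s] = expected reward
gamma, n_states = 0.9, 3

def q_value(a, s, v):
    # r_a(s) + gamma * sum_{s'} P_a(s, s') v(s')
    return R[a][s] + gamma * sum(P[a][s][s2] * v[s2] for s2 in range(n_states))

pi = [0] * n_states                    # initial policy: always action 0
for _ in range(20):                    # outer improvement loop
    v = [0.0] * n_states
    for _ in range(500):               # policy evaluation by iterating (9)
        v = [q_value(pi[s], s, v) for s in range(n_states)]
    new_pi = [max((0, 1), key=lambda a: q_value(a, s, v))
              for s in range(n_states)]
    if new_pi == pi:                   # greedy improvement (10) is stable
        break
    pi = new_pi
print("policy:", pi)
```

The outer loop alternates evaluation (9) and greedy improvement (10) until the greedy policy no longer changes, which for a finite MDP happens after at most as many rounds as there are distinct policies.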

4. Model Description

The system model introduces the participants and their needs and abilities, and the threat model describes the potential threats in this system, which are also what we aim to deal with.

4.1. System Model

There are three types of participants in our system, as shown in Figure 1. Users offer their attempt histories to join the training of a policy for a shared problem, or simply query the policy on the Server for operation suggestions; if needed, Users can also ask for the trained policy. These different actions are shown in Figure 1 as UserA, UserB and UserC.
The Server collects encrypted data from the Users who offer it, and fuses those data, which come from different users and are protected by different secret keys, into a unified encrypted database. After that, the Server can perform further calculations to update the policy matrix. This policy matrix is usable but generates only encrypted results, so it cannot be used by anyone who does not hold the correct secret key.
The Key Generation Center (KGC) is also the manager of this system and acts as a trusted third party, but it does not join the operation too much. After the Server collects the encrypted data from the Users, the KGC helps the Server transform the ciphertexts so that they can support the needed calculations. It also helps convert result sets for those Users who legally ask for calculation results. Beyond what is mentioned above, the KGC does not undertake calculation cost. It distributes keys for all participants, including itself, and for tasks. In fact, the data from Users are, by default, converted by the Server into a domain protected by the key of the KGC, which gives the KGC the right to decrypt them.

4.2. Threat Model

The KGC is considered reliable. In contrast, the Server and Users may launch attacks on the system.
The Server is considered honest-but-curious: it follows the preset behavior honestly, but it is curious about the contents of the encrypted data.
Users may collude with other Users so that they can learn the contents of ciphertexts from other users, including data and requests. Since the Server frequently returns results to users, dealing with collusion between Users and the Server would clearly require the participation of the KGC or another non-colluding server, so collusion between Users and the Server is not considered here. Four types of attack are possible in our system:
  • Some Users try to learn the contents of ciphertexts of other users;
  • Some Users try to learn the contents of ciphertexts returned from the Server to a specific User;
  • The Server tries to learn the contents of the ciphertexts it receives and calculates on;
  • An attacker from the cloud (perhaps the server or other users) tries to reach users’ data by observing the inputs, intermediate results or final outputs.

5. Proposed Scheme

The proposed scheme consists of seven steps: system initialization, parameter negotiation, user data encryption, ciphertext conversion, calculation and model iteration, use of the encrypted policy, and policy provision to specific users.

5.1. System Initialization and Participator Registration

In the initialization step, the KGC establishes the encryption system for the whole system and generates and distributes a pair of public and secret keys for each User as well as for the Server.
The secret keys of User i (s_i), of the Server (s_s) and of the Key Generation Center (s_KGC) are all kept by the Key Generation Center.

5.2. Parameter Negotiation for Data Sharing Task

To share attempt histories for policy iteration, each User U_i individually chooses r_i ∈ [1, N/4] and calculates the corresponding negotiation parameter np_i as (11). The Server and the Key Generation Center both receive it, and the public parameter of this task, g^r, can be calculated as (12):
np_i = g^{r_i} mod N^2
g^r = ∏_{i=1}^{n} np_i mod N^2 = ∏_{i=1}^{n} g^{r_i} mod N^2
The Server releases g^r to the Key Generation Center and all Users joining this task. Once the KGC confirms the correctness of g^r, it generates a secret key for this task. Everyone joining this data fusion, including the Users and the Server, receives this secret key s_t, which is used to realize the re-encryption process in the following steps.
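The negotiation in (11) and (12) can be sketched as follows with toy parameters: each user publishes only np_i while keeping r_i private, yet the Server's product of all np_i equals g raised to the (never revealed) sum of the individual r_i:

```python
# Sketch of (11)-(12) with toy parameters: each user publishes
# np_i = g^{r_i} mod N^2 while keeping r_i private; the Server's product
# of all np_i equals g^r for r = sum of the r_i.
import math
import random

N = 1019 * 1031
N2 = N * N
while True:
    a = random.randrange(2, N2)
    if math.gcd(a, N2) == 1:
        break
g = pow(a, 2 * N, N2)

private_r = [random.randrange(1, N // 4) for _ in range(5)]  # one r_i per user
nps = [pow(g, r_i, N2) for r_i in private_r]                 # published values

g_r = 1
for np_i in nps:                  # Server aggregates, Equation (12)
    g_r = g_r * np_i % N2

assert g_r == pow(g, sum(private_r), N2)
```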

5.3. User Encryption

Before encryption, User U_i should randomly choose s_{t_i} ∈ [1, N^2/2] to build the authorization A_i for conversion as (13):
A_i = g^{s_{t_i}·r} mod N^2
For each plaintext m, the User builds the ciphertext structure introduced in Section 3.3. The first part needs a randomly chosen bias b ∈ Z_N, and the second part needs a random number r_i ∈ [1, N/4]. The ciphertext c for plaintext m is built as (14) to (16):
c = (m − b, β) = (m − b, (β_1, β_2))
β_1 = g^{r_i} mod N^2
β_2 = [g^{(s_t − s_{t_i})·r} · g^{s_KGC·r_i} · (1 + bN)] mod N^2
After all attempt histories are encrypted, they are sent to the Server. The authorization A_i is needed for the Server to fuse all encryptions into a united encrypted dataset. Once the User decides to start the training, its A_i is sent to the Server through a secret channel. This mechanism preserves the User’s control over permissions.

5.4. Ciphertext Conversion

With a ciphertext c and the authorization A_i from the corresponding user U_i, the Server can convert the ciphertext c into a united one c′. The ciphertext conversion process is calculated as (17) and (18):
c′ = (m − b, β′)
β′ = (g^{r_i} mod N^2, β_2 · A_i / g^{s_t·r} mod N^2)
The second parts of the ciphertext c and the united ciphertext c′ are actually BCP ciphertexts of the randomly chosen bias value b. Through this proxy re-encryption process, the secret key protecting them is converted to the secret key of the KGC, s_KGC, which completes the fusion of the user data. If needed, this target secret key can be replaced by any other one, and the owner of the target secret key can manage the whole training process. In our scheme, the target key for data fusion is temporarily set to the key of the Key Generation Center.
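The encryption of Section 5.3 and the conversion above can be sketched end to end as follows. The parameters are toy values for illustration only; the final assertion checks the claim that the converted second part is a plain BCP ciphertext of the bias b under s_KGC:

```python
# End-to-end sketch of user encryption (14)-(16) and ciphertext conversion
# (17)-(18) with toy, insecure parameters.
import math
import random

N = 1019 * 1031
N2 = N * N
while True:
    a = random.randrange(2, N2)
    if math.gcd(a, N2) == 1:
        break
g = pow(a, 2 * N, N2)

s_kgc = random.randrange(1, N2 // 2)   # KGC secret key
s_t = random.randrange(1, N2 // 2)     # task secret key (known to Server)
r = random.randrange(1, N // 4)        # negotiated exponent; g^r is public
g_r = pow(g, r, N2)

# --- User U_i ---
s_ti = random.randrange(1, N2 // 2)    # user's conversion secret
A_i = pow(g_r, s_ti, N2)               # authorization (13): g^{s_ti * r}
m = 42                                 # one attempt-record value
b = random.randrange(N)                # random bias
r_i = random.randrange(1, N // 4)
beta1 = pow(g, r_i, N2)                                   # (15)
beta2 = (pow(g_r, s_t - s_ti, N2)                         # (16)
         * pow(g, s_kgc * r_i, N2) * (1 + b * N)) % N2
c = (m - b, (beta1, beta2))

# --- Server: conversion (18) removes the task mask via A_i ---
beta2_conv = beta2 * A_i % N2
beta2_conv = beta2_conv * pow(pow(g_r, s_t, N2), -1, N2) % N2
c_united = (m - b, (beta1, beta2_conv))

# Decrypting the converted second part with s_KGC recovers the bias b.
u = beta2_conv * pow(pow(beta1, s_kgc, N2), -1, N2) % N2
assert (u - 1) // N == b
```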

5.5. Ciphertext Calculations and Policy Update

To initialize and update the policy matrix, the bias b and the random number r must be aligned. Consider a pair of converted ciphertexts for m_1 and m_2 as (19):
c_1 = (m_1 − b_1, β_1) = (m_1 − b_1, (g^{r_a} mod N^2, g^{r_a·s_KGC}·(1 + b_1 N) mod N^2))
c_2 = (m_2 − b_2, β_2) = (m_2 − b_2, (g^{r_b} mod N^2, g^{r_b·s_KGC}·(1 + b_2 N) mod N^2))
In order to safely verify whether the corresponding plaintexts m_1 and m_2 are equal by acting only on their ciphertexts c_1 and c_2, the Server first realigns the first parts of c_1 and c_2 to a unified value a_0, as introduced in Section 3.3, as (20):
δ_1 = (m_1 − b_1) − a_0,  c_1′ = (a_0, β_1 +̂ Enc(δ_1))
δ_2 = (m_2 − b_2) − a_0,  c_2′ = (a_0, β_2 +̂ Enc(δ_2))
If the plaintexts corresponding to c_1 and c_2 are equal, the plaintext corresponding to β_1 +̂ Enc(δ_1) (namely b_1 + δ_1 = m_1 − a_0) should equal that corresponding to β_2 +̂ Enc(δ_2) (namely m_2 − a_0). But they are encrypted with different randomness r, so their equality is still invisible. Consider such a pair of β_1 +̂ Enc(δ_1) and β_2 +̂ Enc(δ_2) as (21):
β_1 +̂ Enc(δ_1) = (g^{r_a} mod N^2, g^{r_a·s_KGC}·(1 + (m_1 − a_0)N) mod N^2)
β_2 +̂ Enc(δ_2) = (g^{r_b} mod N^2, g^{r_b·s_KGC}·(1 + (m_2 − a_0)N) mod N^2)
To unify the random numbers r_a and r_b used in these two ciphertexts, the Server sends all the first parts of the ciphertexts β_i +̂ Enc(δ_i) to the Key Generation Center for help. The KGC can use its secret key s_KGC to obtain g^{r_a·s_KGC} and g^{r_b·s_KGC}. To make the equality of m_1 and m_2 visible, the KGC calculates Δ_{r_a} and Δ_{r_b} so that (22) is met; the unified result U_KGC is kept by the Key Generation Center:
(g^{r_a·s_KGC} · Δ_{r_a}) mod N^2 = (g^{r_b·s_KGC} · Δ_{r_b}) mod N^2 = U_KGC
Receiving Δ_{r_a} and Δ_{r_b}, the Server can recognize ciphertexts whose corresponding plaintexts are the same, because after the bias alignment they now share an identical second part. For more ciphertexts, the number of needed Δ_{r_i} values can be as large as the number of ciphertexts.
Finally, the Server can count the number of every encrypted (s, a) pair and update the policy matrix. After traversing n attempt histories, the policy matrix is updated n times; the more updates are finished, the more usable the policy becomes.
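The effect of the Δ values in (22) can be sketched as follows with toy parameters: after each second part is multiplied by its Δ, equal plaintexts collapse to identical hash-like values, while different plaintexts stay distinct:

```python
# Sketch of the equality test (21)-(22): the KGC returns Delta values that
# unify the randomness, after which equal plaintexts produce identical
# "hash-like" second parts. Toy, insecure parameters only.
import math
import random

N = 1019 * 1031
N2 = N * N
while True:
    a = random.randrange(2, N2)
    if math.gcd(a, N2) == 1:
        break
g = pow(a, 2 * N, N2)
s_kgc = random.randrange(1, N2 // 2)

def enc(m):                        # BCP ciphertext of m under s_kgc
    r = random.randrange(1, N // 4)
    return (pow(g, r, N2), pow(g, r * s_kgc, N2) * (1 + m * N) % N2)

c1, c2, c3 = enc(7), enc(7), enc(9)     # two equal plaintexts, one different

# KGC: choose a unified value U_KGC and compute Delta_i = U / g^{r_i s_kgc}.
U = pow(g, random.randrange(1, N // 4) * s_kgc, N2)
def delta(c):
    return U * pow(pow(c[0], s_kgc, N2), -1, N2) % N2

# Server: multiply each second part by its Delta, Equation (22).
t1, t2, t3 = (c[1] * delta(c) % N2 for c in (c1, c2, c3))
assert t1 == t2                    # equal plaintexts collapse to one value
assert t1 != t3                    # different plaintexts stay distinct
```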

5.6. Policy Using

If a user wants to use the policy on the Server, it should ask the Key Generation Center for the task secret key s_t and the unified value U_KGC, both of which are distributed through a secret channel.
Having received the unified value U_KGC, the unified ciphertext for each state s_i and action a_i can be calculated as (23):
U_{s_i} = U_KGC · (1 + s_i N) mod N^2
U_{a_i} = U_KGC · (1 + a_i N) mod N^2
All of these values act like hash values. To make a request, the user encrypts it with the secret key it chose itself and secretly sends the authorization A_i to the Server, as in Section 5.3. The action returned from the Server corresponds to the one chosen by the policy: the Server records the (s, a) pair and encrypts it under the authorization A_i for this user. Once the user decrypts the returned ciphertext, it obtains one of the unified values U_{a_i}. After trying the action, the user reaches a new state and requests a new action. By continuously repeating the above process, the user can reach the target state, and the Server obtains a new attempt history for updating the policy.

5.7. Results Conversion and Decryption

With the steps above, multiple users can already train and use a policy in an encrypted state on a Server. If needed, a user can ask for the policy matrix: the Server S encrypts the labels of the policy matrix in just the same way as during policy use and sends its policy matrix to this user. The user can then naturally map each label to the plaintext it represents.

6. Security Analysis

To ensure the security of this system, the correctness of the introduced operations is proved. It is also necessary to ensure, according to the threat model, that one user or multiple colluding users cannot decrypt other users’ encrypted data, and that the Server or an adversary cannot obtain the plaintexts of the ciphertexts in this system. We also explain how the BCP enhancement method deals with probabilistic forgery attacks.

6.1. Correctness

The newly introduced step in this encryption is only a variant of proxy re-encryption on the ciphertext, introduced in Section 5.4. In this step, the first part of ciphertext c is unchanged, and the second part β is claimed to become a ciphertext under the secret key of the Key Generation Center, s_KGC. The proof that the secret key for the converted ciphertext β′ is s_KGC is given in (24):
β′ = (g^{r_i} mod N^2, β_2 · A_i / g^{s_t·r} mod N^2)
   = (g^{r_i} mod N^2, g^{s_t·r} · g^{s_KGC·r_i} · (1 + bN) / g^{s_t·r} mod N^2)
   = (g^{r_i} mod N^2, g^{s_KGC·r_i} · (1 + bN) mod N^2) = E(b) under s_KGC
The calculation processes on ciphertexts that already exist are separately proved correct in their source papers [28,29]. Therefore, all steps in the proposed scheme are correct and output exactly the right results.

6.2. Security

The security of the basic BCP algorithm and of the construction turning a linearly homomorphic encryption into a homomorphic encryption for degree-2 functions are individually proved in their source papers [28,29], so this section mainly analyzes the security of the parameter negotiation, ciphertext conversion and policy using steps of the proposed scheme. We prove the security of our scheme with the real/ideal paradigm proposed in [30], for which the following theorems are needed.
Theorem 1. 
In the process of Section 5.2, it is computationally infeasible for any participant, including the users, the server and the Key Generation Center, to obtain the r_i of others or the value r.
Proof. 
Even though the Key Generation Center acts as a trusted third party, we can set up a semi-honest adversary that obtains the views of the server, the Key Generation Center and multiple users User_i and User_j, denoted A^{SH}_{(Server, KGC, User_{i,j})}. This view includes all np_i = g^{r_i} mod N^2 together with the r_i and r_j of User_i and User_j. Based on the discrete logarithm problem (DLP), recovering any r_k chosen by a User_k other than User_i and User_j is computationally infeasible, except for those r_k completely identical to r_i or r_j, whose probability is obviously negligible. Therefore, the probability of gathering all parts of r is negligible, which indicates that r is computationally infeasible to obtain as well. □
Theorem 2. 
In the processes of Section 5.4, Section 5.5, Section 5.6 and Section 5.7, it is computationally infeasible for the server $Server$, or for several colluding users $User_i$ and $User_j$, to obtain the plaintext of another user $User_k$, based on the semantic security of the BCP algorithm proved in [28].
Proof. 
We assume two semi-honest adversaries, $\mathcal{A}^{SH}_{Server}$ for the server and $\mathcal{A}^{SH}_{User_i, User_j}$ for the users $User_i$ and $User_j$. First, the view of $\mathcal{A}^{SH}_{Server}$ may contain $A_i$ and $c$ from all users. To obtain the plaintext $m$, the value of $b$ is needed, but it is encrypted into $\beta$, as shown in (15) and (16), which constitutes a variant BCP ciphertext: $\beta_2$ can be seen as $g^{(st - st_i) \cdot r} \cdot E_2(b)$, where $E_2(b)$ denotes the second part of a BCP ciphertext of $b$ under the secret key $s_{KGC}$. Since $s_{KGC}$ is infeasible for $\mathcal{A}^{SH}_{Server}$ to obtain, $\beta$ is computationally undecryptable by the semantic security of BCP. Even after $g^{(st - st_i) \cdot r}$ is removed in subsequent steps, the structure of $E_2(b)$ is preserved. □
Second, the map from $E_2(b)$ to the corresponding action is given by the Key Generation Center $KGC$ to every user, such as $User_a$, so that each user can infer the actual meaning of the result returned by the server $Server$ after decrypting with the secret key $st_a$ it chose itself. However, the version of another user $User_k$'s encrypted data with $g^{(st - st_k) \cdot r}$ removed is not included in the view of $\mathcal{A}^{SH}_{User_i, User_j}$. That is, $g^{(st - st_k) \cdot r}$ is needed to recover the plaintext uploaded by $User_k$, which in turn requires $st_k$ or $A_k$. As designed in the proposed scheme, $st_k$ is never sent out and $A_k$ is sent over a secret channel as an authorization credential, both of which are infeasible for $\mathcal{A}^{SH}_{User_i, User_j}$ to obtain.

6.3. Necessity of the BCP Enhancement Method

Taking the actual scenario into account, the set of available actions is often relatively small. For example, in a two-dimensional maze there are usually only four actions: $(Up, Down, Left, Right)$. When they are encrypted with the BCP algorithm, the ciphertexts look like (15) and (16). By randomly choosing two action ciphertexts from one user, say $User_a$, and dividing their corresponding parts by each other, there is a non-negligible probability of up to 25% of obtaining a pair $(g^{r_a} \bmod N^2,\ g^{(st - st_a) \cdot r} \cdot g^{s_{KGC} \cdot r_a} \bmod N^2)$, with which an adversary could successfully forge a fake attempt history to the server.
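The 25% bound is simply the probability that two uniformly chosen actions out of four coincide, which a direct enumeration over all ordered pairs confirms:

```python
# Enumerate all ordered pairs of the four maze actions: the fraction of
# colliding pairs (same action drawn twice) bounds the forgery probability.
from itertools import product

ACTIONS = ("Up", "Down", "Left", "Right")
pairs = list(product(ACTIONS, repeat=2))
collision_prob = sum(a == b for a, b in pairs) / len(pairs)
assert collision_prob == 0.25        # 4 of the 16 ordered pairs collide
```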
By changing the encryption from the plaintext $m$ to a randomly chosen value $b$, the probability of such a collision becomes negligible. At the same time, an alignment method remains available for the server to recover equality. Moreover, as discussed in Section 6.2, the composite key used by $User_a$ consists of the Key Generation Center's public key $g^{s_{KGC}}$ and the secret key $st - st_a$. Both parts are designed to be infeasible for others to obtain, and they are even encrypted with different random numbers, which further renders the aforementioned attack impossible.

7. Experimental Results

7.1. Experimental Environment

The experiment is run on a classical reinforcement learning task called Grid World. In this game, an environment, an agent, and the relationships among states, actions, and rewards are built up, which together constitute a Markov Decision Process that the policy iteration process can solve. Two typical solutions of a Grid World problem, produced by different policies, are shown in Figure 2.
In a Grid World game, the different states are represented by different grids, as in Figure 2. The start state is filled in red and the target (finish) state in green. Wall grids, which cannot be reached or crossed, are filled in black. The blue grids show the states the agent has reached by following a given policy. Obviously, the policy of (a) in Figure 2 performs better than that of (b).
In each state the agent can try an action from the set $(Up, Down, Left, Right)$, meaning to move upward, downward, leftward, or rightward. If the next state in the chosen direction is out of range or is a wall, the agent stays in the same state, as if the action had not been taken. Once the agent reaches the goal state, it finishes an attempt, and the number of actions taken is treated as a punishment; the goal of policy iteration is to minimize this punishment. Together, these mechanisms compose a Markov Decision Process.
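As a plaintext reference point for the Markov Decision Process described above, the policy iteration loop can be sketched on a tiny maze. The 3 × 3 layout, the cost of −1 per action, and the fixed number of evaluation sweeps below are illustrative choices, not the configuration used in our experiments.

```python
# Minimal plaintext policy iteration for a toy Grid World (illustrative
# 3x3 layout with one wall; not the 20x10 experimental maze).
ROWS, COLS = 3, 3
WALLS = {(1, 1)}
GOAL = (2, 2)
ACTIONS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def step(state, action):
    """Deterministic move; out-of-range or wall moves keep the state."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if 0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS and nxt not in WALLS:
        return nxt
    return state

def policy_iteration():
    states = [(r, c) for r in range(ROWS) for c in range(COLS)
              if (r, c) not in WALLS]
    policy = {s: "Right" for s in states if s != GOAL}
    value = {s: 0.0 for s in states}
    while True:
        for _ in range(100):              # policy evaluation sweeps
            for s in states:
                if s != GOAL:
                    value[s] = -1 + value[step(s, policy[s])]
        stable = True                     # greedy policy improvement
        for s in states:
            if s == GOAL:
                continue
            best = max(ACTIONS, key=lambda a: -1 + value[step(s, a)])
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, value

policy, value = policy_iteration()
assert value[(0, 0)] == -4.0   # shortest route from the start takes 4 steps
```

Rolling the returned policy out from the start grid reaches the goal in four actions in this toy maze, matching the value of −4 at the start state.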

7.2. Performance Comparison with Original Policy Iteration

This experiment was conducted in a virtual machine running Ubuntu 22.04.1 LTS, deployed on a computer with 16 GB of DDR5 memory at 4000 MHz and an Intel Core i7-14700K CPU; the encrypted-state computation was run in PyCharm 2024.3.1.1 with Charm-Crypto 0.50.
To set up the maze environment, we set the maze space as a 20 × 10 grid matrix. The environment-building system randomly chooses one grid in the leftmost column as the start and one in the rightmost column as the goal. Then some grids are set as walls. To ensure a high probability that a path from start to goal exists, we set the probability of a grid being a wall to 20%.
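The environment construction just described can be sketched as follows; the function name, the fixed seed, and the set representation of walls are illustrative choices, not the authors' implementation.

```python
# Sketch of the maze construction of Section 7.2: a 20x10 grid, start in
# the leftmost column, goal in the rightmost, and each remaining grid
# becoming a wall with probability 20%. Names and seed are illustrative.
import random

def build_maze(rows=10, cols=20, wall_prob=0.2, seed=0):
    rng = random.Random(seed)
    start = (rng.randrange(rows), 0)           # leftmost column
    goal = (rng.randrange(rows), cols - 1)     # rightmost column
    walls = {(r, c)
             for r in range(rows) for c in range(cols)
             if (r, c) not in (start, goal) and rng.random() < wall_prob}
    return start, goal, walls

start, goal, walls = build_maze()
assert start[1] == 0 and goal[1] == 19 and not {start, goal} & walls
```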
In theory, the number of iterations needed for the policy to generate the best action chain in our scheme is the same as without encryption. We nevertheless ran the algorithm 100 times both under the encrypted state of our scheme and in the original way; the results of both are shown in Table 2.
In Table 2, the maze size and wall probability are fixed so that the experiments can be conducted without being frequently interrupted by unsolvable mazes. Because the mazes are randomly generated, the average number of steps and the average number of iterations of our scheme are not exactly equal to those of the original policy iteration without encryption, but they are quite close to each other.
The time cost per round is not evenly distributed, as shown in Figure 3. The difference in time costs is relatively apparent at the beginning but shrinks as the policy iteration proceeds. Early on, the policy is not yet optimized, so the attempt histories contain a large number of steps, which leads to longer update times for both methods. As the policy keeps improving, the attempt histories become shorter and more stable, resulting in similar times for the proposed scheme and the original policy iteration algorithm.
As shown in Figure 3, our scheme is sensitive to the actual number of state-action pairs in the attempt history, but its time cost is not excessive compared with the original policy iteration process. Considering that the computation and conversion on ciphertexts in our scheme are almost entirely done by the server, we conclude that our scheme is usable in practical scenarios.

8. Conclusions

In conclusion, we focus on the security of policy iteration with multiple users and design a scheme for fusing encrypted data into a unified database. The proposed solution completes the ciphertext conversion and the subsequent reinforcement learning using only one server. After the outsourced training, the encrypted policy on the server can be used by making requests without downloading the complete policy, and a complete decision chain generated during the use of the encrypted policy can be fed into further training on the server without additional communication. This scheme frees users from heavy computational costs and lets servers concentrate on computation instead of wasting resources waiting on communication.
For further research, we note that practical applications involve more complex computations, which are currently all carried out in plaintext. Researchers can continue to extend the relevant cryptographic mechanisms to support ciphertext computation for more complex artificial intelligence algorithms, which will make the training and use of artificial intelligence safer.

Author Contributions

Conceptualization, B.C.; methodology, B.C.; software, B.C.; validation, B.C. and J.Y.; formal analysis, B.C. and J.Y.; investigation, B.C. and J.Y.; writing—original draft preparation, B.C.; writing—review and editing, B.C.; visualization, B.C.; supervision, J.Y.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62162020, and the Haikou Science and Technology Special Fund, grant number 2420016000142.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhu, S.; Yu, T.; Xu, T.; Chen, H.; Dustdar, S.; Gigan, S.; Gunduz, D.; Hossain, E.; Jin, Y.; Lin, F.; et al. Intelligent Computing: The Latest Advances, Challenges, and Future. Intell. Comput. 2023, 2, 0006. [Google Scholar] [CrossRef]
  2. Shakya, A.K.; Pillai, G.; Chakrabarty, S. Reinforcement learning algorithms: A brief survey. Expert Syst. Appl. 2023, 231, 120495. [Google Scholar] [CrossRef]
  3. Liu, Y.; Han, T.; Ma, S.; Zhang, J.; Yang, Y.; Tian, J.; He, H.; Li, A.; He, M.; Liu, Z.; et al. Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models. Meta-Radiology 2023, 1, 100017. [Google Scholar] [CrossRef]
  4. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  5. Wang, H.; Fu, T.; Du, Y.; Gao, W.; Huang, K.; Liu, Z.; Chandak, P.; Liu, S.; Van Katwyk, P.; Deac, A.; et al. Scientific discovery in the age of artificial intelligence. Nature 2023, 620, 47–60. [Google Scholar] [CrossRef] [PubMed]
  6. Degrave, J.; Felici, F.; Buchli, J.; Neunert, M.; Tracey, B.; Carpanese, F.; Ewalds, T.; Hafner, R.; Abdolmaleki, A.; De Las Casas, D.; et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 2022, 602, 414–419. [Google Scholar] [CrossRef] [PubMed]
  7. Fawzi, A.; Balog, M.; Huang, A.; Hubert, T.; Romera-Paredes, B.; Barekatain, M.; Novikov, A.; R. Ruiz, F.J.; Schrittwieser, J.; Swirszcz, G.; et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 2022, 610, 47–53. [Google Scholar] [CrossRef] [PubMed]
  8. Zhang, R.; Hou, J.; Walter, F.; Gu, S.; Guan, J.; Röhrbein, F.; Du, Y.; Cai, P.; Chen, G.; Knoll, A. Multi-Agent Reinforcement Learning for Autonomous Driving: A Survey. arXiv 2024, arXiv:2408.09675. [Google Scholar]
  9. Zhang, C.; Wang, X.; Jiang, W.; Yang, X.; Wang, S.; Song, L.; Bian, J. Whittle Index with Multiple Actions and State Constraint for Inventory Management. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  10. Bellouch, M.; Zitoune, L.; Lahsen-Cherif, I.; Vèque, V. DQL-MultiMDP: A Deep Q-Learning-Based Algorithm for Load Balancing in Dynamic and Dense WiFi Networks. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024; pp. 1–6. [Google Scholar]
  11. Ren, J.; Xu, H.; He, P.; Cui, Y.; Zeng, S.; Zhang, J.; Wen, H.; Ding, J.; Huang, P.; Lyu, L.; et al. Copyright Protection in Generative AI: A Technical Perspective. arXiv 2024, arXiv:2402.02333. [Google Scholar]
  12. Gentry, C. Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, Bethesda, MD, USA, 31 May–2 June 2009; pp. 169–178. [Google Scholar]
  13. Boneh, D.; Goh, E.J.; Nissim, K. Evaluating 2-DNF Formulas on Ciphertexts. In Theory of Cryptography; Kilian, J., Ed.; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3378, pp. 325–341. [Google Scholar]
  14. Paillier, P. Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. In Advances in Cryptology—EUROCRYPT ’99; Stern, J., Ed.; Springer: Berlin/Heidelberg, Germany, 1999; Volume 1592, pp. 223–238. [Google Scholar]
  15. López-Alt, A.; Tromer, E.; Vaikuntanathan, V. On-the-fly multiparty computation on the cloud via multikey fully homomorphic encryption. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, New York, NY, USA, 19–22 May 2012; pp. 1219–1234. [Google Scholar]
  16. Chen, H.; Dai, W.; Kim, M.; Song, Y. Efficient Multi-Key Homomorphic Encryption with Packed Ciphertexts with Application to Oblivious Neural Network Inference. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 395–412. [Google Scholar]
  17. Zhang, J.; Wang, X.; Yiu, S.M.; Jiang, Z.L.; Li, J. Secure Dot Product of Outsourced Encrypted Vectors and its Application to SVM. In Proceedings of the Fifth ACM International Workshop on Security in Cloud Computing, Abu Dhabi, United Arab Emirates, 2 April 2017; pp. 75–82. [Google Scholar]
  18. Peter, A.; Tews, E.; Katzenbeisser, S. Efficiently Outsourcing Multiparty Computation Under Multiple Keys. IEEE Trans. Inf. Forensics Secur. 2013, 8, 2046–2058. [Google Scholar] [CrossRef]
  19. Wang, B.; Li, M.; Chow, S.S.M.; Li, H. A tale of two clouds: Computing on data encrypted under multiple keys. In Proceedings of the 2014 IEEE Conference on Communications and Network Security, San Francisco, CA, USA, 29–31 October 2014; pp. 337–345. [Google Scholar]
  20. Wang, B.; Li, M.; Chow, S.S.M.; Li, H. Computing encrypted cloud data efficiently under multiple keys. In Proceedings of the 2013 IEEE Conference on Communications and Network Security (CNS), Washington, DC, USA, 14–16 October 2013; pp. 504–513. [Google Scholar]
  21. Zhang, J.; He, M.; Zeng, G.; Yiu, S.M. Privacy-preserving verifiable elastic net among multiple institutions in the cloud. J. Comput. Secur. 2018, 26, 791–815. [Google Scholar] [CrossRef]
  22. Zhang, J.; Hu, S.; Jiang, Z.L. Privacy-Preserving Similarity Computation in Cloud-Based Mobile Social Networks. IEEE Access 2020, 8, 111889–111898. [Google Scholar] [CrossRef]
  23. Ma, X.; Ji, C.; Zhang, X.; Wang, J.; Li, J.; Li, K.C.; Chen, X. Secure multiparty learning from the aggregation of locally trained models. J. Netw. Comput. Appl. 2020, 167, 102754. [Google Scholar] [CrossRef]
  24. Li, T.; Li, J.; Chen, X.; Liu, Z.; Lou, W.; Hou, T. NPMML: A Framework for Non-interactive Privacy-preserving Multi-party Machine Learning. IEEE Trans. Dependable Secure Comput. 2020, 18, 2969–2982. [Google Scholar] [CrossRef]
  25. Zhang, X.; Chen, X.; Liu, J.K.; Xiang, Y. DeepPAR and DeepDPA: Privacy Preserving and Asynchronous Deep Learning for Industrial IoT. IEEE Trans. Ind. Inform. 2020, 16, 2081–2090. [Google Scholar] [CrossRef]
  26. Shen, X.; Luo, X.; Yuan, F.; Wang, B.; Chen, Y.; Tang, D.; Gao, L. Privacy-preserving multi-party deep learning based on homomorphic proxy re-encryption. J. Syst. Archit. 2023, 144, 102983. [Google Scholar] [CrossRef]
  27. Blaze, M.; Bleumer, G.; Strauss, M. Divertible protocols and atomic proxy cryptography. In Advances in Cryptology—EUROCRYPT’98; Nyberg, K., Ed.; Springer: Berlin/Heidelberg, Germany, 1998; Volume 1403, pp. 127–144. [Google Scholar]
  28. Bresson, E.; Catalano, D.; Pointcheval, D. A Simple Public-Key Cryptosystem with a Double Trapdoor Decryption Mechanism and Its Applications. In Advances in Cryptology—ASIACRYPT 2003; Laih, C.S., Ed.; Springer: Berlin/Heidelberg, Germany, 2003; Volume 2894, pp. 37–54. [Google Scholar]
  29. Catalano, D.; Fiore, D. Using Linearly-Homomorphic Encryption to Evaluate Degree-2 Functions on Encrypted Data. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, 12–16 October 2015; pp. 1518–1529. [Google Scholar]
  30. Goldreich, O. Foundations of Cryptography; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
Figure 1. System model.
Figure 2. Two solutions of the Grid World task: (a) A solution by the policy after 6000 iterations. (b) A solution by a random policy.
Figure 3. Time costs under different times of iterations.
Table 1. Comparison among related works and our scheme.
Scheme | 1 Server | 2 Servers | Multi-Key | Ciphertext Conversion | Equation Judgment
[15,16]×××
[17]××××
[18,19]×××
[21,22]×××
[26]×××
PPOPI××
Table 2. The performance of our scheme and original policy iteration method.
Property | PPOPI | Original Policy Iteration Without Encryption
Maze size | 20 × 10 | 20 × 10
Wall probability | 20% | 20%
Average steps of best path | 28.63 | 27.82
Average number of iterations | 5438 | 5396
Average time cost per round | 42.281 ms | 12.509 ms
