1. Introduction
Artificial intelligence has become a frontier of human exploration [1] and has been widely applied in many scenarios. As increasingly complicated problems are posed to artificial intelligence, reinforcement learning has drawn attention. In reinforcement learning, the agent optimizes its policy through the rewards or penalties obtained from interactions with the environment under different states. Because of this characteristic, it is considered an effective method for sequential decision-making problems. In its early stage, however, the performance of reinforcement learning was limited by computing power.
With the rapid development of computing and information technology, deep learning has helped reinforcement learning perform better, so that it is used in more and more areas [2], such as industrial manufacturing, robot control, logistics management, national defense and military, intelligent transportation, and intelligent healthcare. The training of well-known large language models such as ChatGPT [3] and DeepSeek-R1 [4] also uses reinforcement learning. As artificial intelligence is used more in scientific research [5], reinforcement learning has been applied in many fields of basic science, such as the control of tokamak plasmas [6] and the discovery of faster matrix multiplication algorithms [7].
Due to the complexity of real-world problems, researchers have extended reinforcement learning from a single agent to systems with multiple agents. Multi-agent reinforcement learning provides strong support for modeling and solving problems in autonomous driving [8], intelligent inventory management [9], wireless networks [10], etc.
All of these developments make reinforcement learning models more complex, so more data are needed to train them. Because the data of a single user are often insufficient to train a complex model, multiple users must cooperate using their private data, which makes privacy leakage a major risk factor in the training of artificial intelligence models, as mentioned in [11].
This dilemma is especially evident in reinforcement learning, because the trial-and-error cost that a single user can afford is often not enough to train a model which can perform well. Therefore, a method of fusing the attempt records of multiple users for reinforcement learning training is necessary. At the same time, this approach needs to be able to protect the privacy of users and support outsourced computing to make reasonable use of computing resources.
To protect the privacy of different users while still allowing computation on their data, we build a data sharing system in which users encrypt their data with their own unique keys and later provide conversion keys to authorize the server's use of the data. We also design a secure computation process that the server can carry out without user interaction.
There are two main challenges in protecting privacy while improving the speed of computation in the encrypted state. First, since it is necessary to count the number of times each action is taken under each state, we need to determine whether one ciphertext is equal to another, which poses a challenge to data security. Second, the server must not be able to establish a connection between a ciphertext and its plaintext, so the policy maintained by the server is unordered, which affects the performance of the model.
To address the first challenge, we redesign the operational details of proxy re-encryption so that it is impossible to determine whether two ciphertexts are equal before conversion, while converted ciphertexts show hash-like consistency. In addition, ciphertext conversion can only be performed with the permission of the owner. To address the second challenge, we prioritize states according to the action with the highest probability in the existing policy, which increases the chance of retrieving the corresponding state first and thereby determines the corresponding action more quickly. The main contributions of this paper are as follows:
Our solution firmly maintains user rights: different users encrypt their data with unique keys and control how the data are used. We use partially homomorphic encryption (PHE) to improve the versatility of the system. At the same time, ciphertexts under the same mechanism can meet the special needs of reinforcement learning tasks.
Our solution takes both security and computational efficiency into account. It makes the efficiency of reinforcement learning in the encrypted state similar to that in plaintext while user privacy is protected, which greatly improves the usability of the solution.
Our design does not affect the dynamic learning characteristics of the reinforcement learning model. The continuous iteration process does not require additional communication, and the policy can adapt to new changes during use.
2. Related Works
To protect user privacy, a variety of tools are available, such as homomorphic encryption, secret sharing, oblivious transfer, and garbled circuits. In general, homomorphic encryption requires more computing power, while secret sharing incurs a lot of communication overhead. Oblivious transfer and garbled circuits rely on specific hardware implementations, so they are not universal. In this privacy-preserving outsourced policy iteration method, both statistics and policy updates should run on the server side. The specific values of states and actions should be encrypted by the user, which raises the need to perform calculations in the encrypted state. To minimize unnecessary communication and user-side computing overhead, we use homomorphic encryption to protect user privacy.
Homomorphic encryption refers to algorithms that support calculations on encrypted data. Because the supported operations usually do not include all operations available on plaintext, homomorphic encryption is divided into three categories: Fully Homomorphic Encryption (FHE), Somewhat Homomorphic Encryption (SWHE), and Partially Homomorphic Encryption (PHE).
The first FHE scheme was introduced by Gentry et al. in [12]. With it, any polynomial function can be evaluated in the encrypted state, but its computational cost is too high for practical use. Therefore, the BGN algorithm [13] is more widely used; it preserves efficiency by restricting the supported operations to a limited number of additions and multiplications, which makes it an SWHE scheme.
Moreover, PHE is another way to preserve efficiency; it supports only a single type of operation, usually addition or multiplication. If a PHE scheme supports addition, there is a homomorphic addition operation $\oplus$ on ciphertexts $c_1$ and $c_2$ of plaintexts $m_1$ and $m_2$ such that $c_1 \oplus c_2$ decrypts to $m_1 + m_2$. Similarly, if multiplication is supported, there is a homomorphic multiplication operation $\otimes$ on ciphertexts $c_1$ and $c_2$ such that $c_1 \otimes c_2$ decrypts to $m_1 \cdot m_2$. The concrete operation behind $\oplus$ or $\otimes$ is defined individually by each PHE algorithm. The most widely used is Paillier, proposed in [14].
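To make the additive case concrete, the following minimal sketch implements a toy Paillier scheme in which multiplying two ciphertexts adds their plaintexts. The function names, the fixed generator g = n + 1, and the small primes are assumptions made for illustration only; real deployments use primes of at least 1024 bits and a vetted library.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier key generation from two primes (far too small to be secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """E(m) = (1 + n)^m * r^n mod n^2 for a random unit r."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(n, sk, c):
    """m = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n

def add(n, c1, c2):
    """Homomorphic addition: multiply the ciphertexts modulo n^2."""
    return c1 * c2 % (n * n)

n, sk = keygen(1009, 1013)
c_sum = add(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, sk, c_sum))  # 42
```

The modular inverse via `pow(x, -1, n)` requires Python 3.8 or later.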
The basic way to compute on data encrypted under different keys is to preserve homomorphism across keys, as done in [15,16]. This approach is called multi-key homomorphic encryption (MKHE), but we do not adopt it because of its heavy computing cost.
Therefore, a way to switch the key of a ciphertext is needed, as introduced by Zhang et al. in [17]. They use key-switching matrices to change keys into a unified one, which is efficient but in essence reduces the security of multiple keys to that of a single key. That is, the matrix weakens the security of the original scheme.
To keep the security unchanged while considering efficiency, many methods such as [18,19] introduce two servers. In [18], the ciphertexts are decrypted and then encrypted again by the second server, which is simple, but the second server holds all keys, so its loyalty and resistance to compromise must be ensured. In [19,20], proxy re-encryption and calculation are assigned to different servers, but interactions between them are still needed to complete the conversion and calculation, which is complicated and significantly increases the communication cost.
Since the communication cost is huge, refs. [21,22] tried to remove the ciphertext conversion process, but they cannot reduce the communication during calculation. To eliminate the second server, the scheme proposed by Ma et al. in [23] performs the re-encryption between the server and the users, but the communication cost is still high. In summary, these schemes cannot solve the problem they target merely by avoiding it.
Focusing on practical scenarios similar to ours, ref. [24] presents a solution for aggregating a group of numbers from many participants, but privacy can be harmed because users may retain some information, and users who have withdrawn could violate the privacy of those who remain. To protect this privacy, DeePAR [25], proposed by Zhang et al., realizes a complete solution under different keys, but the secret keys of users must be used during the re-encryption process. To avoid this, Xiaoyin Shen et al. proposed a scheme in [26]. It introduces fog nodes so that the calculation can be done with only one server, and it keeps the communication cost relatively low, but it introduces a new step of second-round gathering, which brings new cost.
In general, most related works that aim to perform calculations on separated datasets only take arithmetic operations such as addition, subtraction, multiplication, and division into consideration, or even just a subset of them, whereas our scheme additionally supports equality testing. The detailed differences between our scheme and other methods that can perform similar fusion and calculation on differently encrypted data are shown in Table 1. Our scheme provides this additional function with the fewest servers.
In [15,16], homomorphic encryption is realized directly under multiple keys. No ciphertext conversion is needed, but the calculation cost is high. Ref. [17] introduces a matrix to change one key into another for higher efficiency, but security is degraded. Refs. [18,19] reduce the calculation cost, but in exchange they introduce too much communication cost. Refs. [21,22] remove the ciphertext transformation step to reduce communication between the two servers, but communication is still unavoidable during calculation. Ref. [26] finally gets rid of the second server, as our scheme does, but our scheme additionally supports secure equality testing.
3. Preliminaries
The main techniques involved in the proposed scheme fall into two categories: proxy re-encryption and homomorphic encryption. This section introduces these two techniques together with policy iteration.
3.1. Proxy Re-Encryption
Proxy re-encryption, first proposed in [27], converts a ciphertext $c$ into a different ciphertext $c'$ that decrypts to the same plaintext $m$. This enhances the security of a cryptographic algorithm by avoiding the repeated use of one ciphertext. With it, a ciphertext can be transformed directly into another without a decryption and re-encryption step, so the exposure of the plaintext $m$ is minimized.
The typical scenario of proxy re-encryption can be described as follows: Alice wants to send her data $m$ to several users securely. To ensure security, the data would have to be encrypted with the public key of each user. With a large number of users, this requires too much computation from Alice, so she can instead encrypt the data once with an intermediate key and offer transform keys to the receivers. The receivers then convert the ciphertext into one they can decrypt with their own secret keys. Because the conversion is carried out by the receivers, the computing cost is shifted from a single sender to multiple receivers without exposing the plaintext $m$, so privacy is protected.
In our scheme, a variant of proxy re-encryption is used in Section 5.4; it converts a ciphertext under one secret key into a ciphertext under another secret key by means of a corresponding re-encryption key. The correctness is verified in Section 6.
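As a concrete illustration of converting a ciphertext from one key to another, the following minimal sketch uses a classic ElGamal-style proxy re-encryption in a tiny prime-order group. It is not the variant used in Section 5.4; the group parameters P, Q, G, the function names, and the toy key values are all assumptions chosen for readability, and the parameters are far too small to be secure.

```python
import secrets

P = 23  # toy safe prime, P = 2 * Q + 1
Q = 11  # prime order of the subgroup generated by G
G = 4   # generator of the order-Q subgroup of Z_P^*

def keygen():
    """Return a secret exponent a in [1, Q - 1] and the public key G^a."""
    a = secrets.randbelow(Q - 1) + 1
    return a, pow(G, a, P)

def encrypt(m, pk):
    """Ciphertext (m * G^r, pk^r) for a message m in [1, P - 1]."""
    r = secrets.randbelow(Q - 1) + 1
    return m * pow(G, r, P) % P, pow(pk, r, P)

def rekey(a, b):
    """Re-encryption key b / a mod Q, turning key-a ciphertexts into key-b ones."""
    return b * pow(a, -1, Q) % Q

def reencrypt(ct, rk):
    """The proxy raises the second component to rk; the plaintext never appears."""
    c1, c2 = ct
    return c1, pow(c2, rk, P)

def decrypt(ct, a):
    """Recover G^r from (G^(a*r))^(1/a), then divide it out of the first component."""
    c1, c2 = ct
    g_r = pow(c2, pow(a, -1, Q), P)
    return c1 * pow(g_r, -1, P) % P

a, pk_a = keygen()
b, pk_b = keygen()
ct_a = encrypt(9, pk_a)
ct_b = reencrypt(ct_a, rekey(a, b))
print(decrypt(ct_a, a), decrypt(ct_b, b))  # 9 9
```

The proxy only ever handles the ratio of the two secret keys and the ciphertext, which is the property the scenario above relies on.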
3.2. Homomorphic Encryption
Homomorphic encryption refers to cryptographic algorithms that possess homomorphism. As a concept from abstract algebra, a homomorphism is a mapping from one algebraic structure to another of the same kind that preserves all relevant structure; that is, properties such as identity, inverses, and binary operations are preserved. A homomorphism $f: X \to Y$ satisfies (1):
$$f(x \cdot y) = f(x) \circ f(y), \quad \forall x, y \in X \qquad (1)$$
where “·” denotes an operation on $X$ and “∘” denotes an operation on $Y$. Similarly, a homomorphic encryption is an encryption method $E$ that satisfies (2):
$$E(m_1 \cdot m_2) = E(m_1) \circ E(m_2), \quad \forall m_1, m_2 \in M \qquad (2)$$
where “·” denotes an operation on the plaintext group $M$ and “∘” denotes an operation on the ciphertext group $C$. This allows operations to be performed on ciphertexts that directly yield the ciphertext of the result of the corresponding operation on the underlying plaintexts.
3.3. BCP and BCP Enhance Method
According to the actual demands of the target scenario, the basic PHE algorithm chosen in our scheme is BCP, proposed in [28]. The only homomorphic operation it supports is addition, a property also known as linear homomorphism. The supported operations and specific processes defined in [28] are shown below:
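Algorithm Initialization: choose two large primes $p$ and $q$, compute $N = pq$, and randomly select $g \in \mathbb{Z}^{*}_{N^2}$;
Key Generation: randomly choose a secret key $s$; the public key is $h = g^{s} \bmod N^2$;
Encryption: the plaintext $m$ should satisfy $m \in \mathbb{Z}_N$. To encrypt $m$, first select a random $r$. The ciphertext $c = (A, B)$ can be calculated as (3):
$$A = g^{r} \bmod N^2, \qquad B = h^{r}(1 + mN) \bmod N^2 \qquad (3)$$
Decryption: $m$ can be calculated from $c = (A, B)$ and $s$ as (4):
$$m = \frac{\left(B \cdot (A^{s})^{-1} \bmod N^2\right) - 1}{N} \qquad (4)$$
Linear Homomorphism: given two ciphertexts $c_1 = (A_1, B_1)$ and $c_2 = (A_2, B_2)$ under the same secret key $s$, we can calculate $c_1 \oplus c_2$ as (5):
$$c_1 \oplus c_2 = (A_1 A_2 \bmod N^2,\; B_1 B_2 \bmod N^2) \qquad (5)$$
Note that the multiplications on each part of the ciphertexts are all modular operations modulo $N^2$.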
Beyond the operations already defined, the proxy re-encryption technique can also be applied to this cryptographic algorithm as follows: if three secret keys conform to the relationship shown in (6), the proxy re-encryption process among them can be carried out.
To keep the efficiency of BCP while supporting more of the necessary operations, our scheme uses the framework proposed by Catalano and Fiore in [29]. With this method, homomorphic multiplication can be supported while the whole system remains fast. The framework builds a two-part ciphertext structure on top of a basic encryption scheme so that the product of two plaintexts can be calculated. Although multiplication is not used in this paper, we keep the structure for better security; the necessity of this method is explained in Section 6.3. What we do need is the realign operation supported by the mechanism of [29]: the plaintext part of a ciphertext can be realigned as in (7).
Note that the encryption in (7) must be done with the same key as the original ciphertext; otherwise, Equation (7) does not hold. We can also align the plaintext parts of two ciphertexts to a new common value under their shared key, as in (8).
If the two underlying plaintexts are equal, the realigned second parts should correspond to equal plaintexts. Of course, this is not directly visible in our scenario; a special method is used in Section 5.5 to show whether they are equal.
3.4. Policy Iteration
Policy iteration is a basic method for solving Markov decision problems, which makes it suitable for our research. It uses an iterative procedure that avoids computing a matrix inverse, so it has high computational efficiency. An efficient outsourced policy iteration would therefore be a solid foundation for follow-up research.
In policy iteration, we usually initialize the policy to choose every action with equal probability. After an attempt, we evaluate the policy by its performance and update it. This iteration process is often expressed as (9):
In (9), the value from the last round is used to compute the new one. In fact, (9) describes an attempt process. The policy can then be iterated as in (10):
The best policy for a fixed environment must be a greedy deterministic policy, and our policy approaches it gradually.
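The evaluate-then-improve loop described above can be sketched on a toy problem. The following is a textbook form of policy iteration, not a reconstruction of (9) and (10); the four-state chain, the reward of -1 per step, the discount GAMMA, and all function names are assumptions made for illustration. It also starts from a fixed deterministic policy instead of a uniform random one to keep the code short.

```python
N_STATES = 4      # states 0..3; state 3 is the terminal goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right
GAMMA = 0.9       # discount factor (assumed)

def step(s, a):
    """Deterministic toy dynamics: return (next_state, reward)."""
    nxt = max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)
    return nxt, -1.0

def evaluate(policy, sweeps=100):
    """Iterative policy evaluation; avoids solving a linear system."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES - 1):  # the terminal state keeps value 0
            nxt, r = step(s, policy[s])
            v[s] = r + GAMMA * v[nxt]
    return v

def improve(v):
    """Greedy improvement: pick the action with the best one-step lookahead."""
    policy = []
    for s in range(N_STATES - 1):
        q = []
        for a in ACTIONS:
            nxt, r = step(s, a)
            q.append(r + GAMMA * v[nxt])
        policy.append(q.index(max(q)))
    return policy

def policy_iteration():
    policy = [0] * (N_STATES - 1)  # start by moving left everywhere
    while True:
        v = evaluate(policy)
        new_policy = improve(v)
        if new_policy == policy:   # stable greedy policy reached
            return policy, v
        policy = new_policy

best_policy, values = policy_iteration()
print(best_policy)  # [1, 1, 1]
```

On this chain the loop ends with the greedy deterministic policy that always moves right, which matches the observation that the best policy for a fixed environment is greedy and deterministic.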
4. Model Description
The system model introduces the participants together with their needs and abilities, and the threat model identifies the potential threats in this system, which are what we aim to deal with.
4.1. System Model
There are three types of participants in our system, as shown in Figure 1. Users offer their attempt histories to join the training of a policy for the same problem, or only query the policy on the server for operation suggestions. If needed, users can also request the trained policy. These different actions are shown in Figure 1 as UserA, UserB, and UserC.
The Server collects encrypted data from the users who offer it and fuses the data, which come from different users and are protected by different secret keys, into a unified encrypted database. After that, the Server can perform further calculations to update the policy matrix. This policy matrix is usable but only generates encrypted results, so it cannot be used by anyone who does not hold the correct secret key.
The Key Generation Center (KGC) is also the manager of the system and acts as a trusted third party, but it takes little part in the operation. After the Server collects the encrypted data from users, the KGC helps the Server transform the ciphertexts into a form that supports the needed calculations. It also helps convert result sets for users who legitimately request calculation results. Apart from this, the KGC does not bear any calculation cost. It distributes keys for all participants, including itself, and for tasks. By default, the Server converts the data from users into a domain protected by the key of the KGC, which gives the KGC the right to decrypt them.
4.2. Threat Model
The KGC is considered to be reliable. In contrast, the Server and Users may launch attacks on the system.
The Server is considered to be honest-but-curious: it follows the preset behavior honestly but is curious about the contents of encrypted data.
Users may collude with other Users in order to learn the content of ciphertexts from other users, including data and requests. Because the Server frequently returns results to users, any method for dealing with collusion between Users and the Server would need the participation of the KGC or another non-colluding server, so collusion between Users and the Server is not considered here. The possible attacks in our system are as follows:
Some Users try to learn the content of the ciphertexts of other users;
Some Users try to learn the content of the ciphertext returned from the Server to a specific User;
The Server tries to learn the content of the ciphertexts it receives and calculates on;
An attacker from the cloud (possibly the Server or other users) tries to obtain data from users by observing the input, intermediate results, or final output.
5. Proposed Scheme
The proposed scheme consists of seven steps: system initialization, parameter negotiation, user data encryption, ciphertext conversion, calculation and model iteration, use of the encrypted policy, and provision of the policy to specific users.
5.1. System Initialization and Participator Registration
In the initialization step, the KGC establishes the encryption system for the whole system, and it generates and distributes a pair of public and secret keys for each User as well as for the Server.
The secret keys of each User i, the Server, and the Key Generation Center are all kept by the Key Generation Center.
5.2. Parameter Negotiation for Data Sharing Task
To share attempt histories for policy iteration, each User individually chooses its own secret value and calculates the corresponding negotiation parameter as in (11). The Server and the Key Generation Center both receive it, and the public parameter of the task can be calculated as in (12):
The Server releases the public parameter to the Key Generation Center and to all Users who joined the task. Once the KGC confirms the correctness of the public parameter, it generates a secret key for this task. Everyone who joins the data fusion, including the Users and the Server, receives this task secret key, which is used to realize the re-encryption process in the following step.
5.3. User Encryption
Before encryption, each User should randomly choose a value to build the authorization for conversion as in (13).
For each plaintext $m$, the User builds the ciphertext structure introduced in Section 3.3. The first part needs a randomly chosen bias $b$ and the second part needs a random number $r$. The ciphertext $c$ for plaintext $m$ is built as in (14) to (16).
After all attempt histories are encrypted, they are sent to the Server. The authorization is needed for the Server to fuse all ciphertexts into a unified encrypted data set. Once the User decides to start training, its authorization should be sent to the Server through a secret channel. This mechanism enforces the User's permission.
5.4. Ciphertext Conversion
With a ciphertext $c$ and the authorization from the corresponding user, the Server can convert $c$ into a unified ciphertext. The ciphertext conversion is calculated as in (17) and (18):
The second part of the ciphertext $c$ and that of the unified ciphertext are in fact BCP ciphertexts of the randomly chosen bias value $b$. Through this proxy re-encryption process, the secret key protecting them is converted to the secret key of the KGC, which completes the fusion of User data. If needed, this target secret key can be replaced by any other one, and the owner of that key can then manage the whole training process. In our scheme, the target key for data fusion is temporarily set to be the key of the Key Generation Center.
5.5. Ciphertext Calculations and Policy Update
To initialize and update the policy matrix, the bias $b$ and the random number $r$ must be aligned. Consider a pair of converted ciphertexts for two plaintexts, as in (19):
To verify safely whether the corresponding plaintexts are equal using only operations on their ciphertexts, the server first aligns the first parts of the two ciphertexts to a unified value, as introduced in Section 3.3. The server realigns the first parts as follows:
If the plaintexts of the two original ciphertexts are equal, the plaintexts of the two realigned ciphertexts should also be equal. However, they are encrypted with different $r$, so whether they are equal is still invisible. Assume a pair of realigned ciphertexts as in (21):
To unify the random numbers used in these two ciphertexts, the server should send the first parts of the ciphertexts to the Key Generation Center for help. The Key Generation Center uses its secret key to compute the corresponding values. To make the equality of the two ciphertexts visible, the Key Generation Center calculates two values so that (22) is met; the unified result is kept by the Key Generation Center.
On receiving these values, the Server can identify ciphertexts with the same plaintext, because under the aligned bias they share the same second part. For more ciphertexts, as many such values are needed as there are ciphertexts.
Finally, the Server can count the number of occurrences of every encrypted pair and update the policy matrix. After traversing n attempt histories, the policy matrix is updated n times. As more updates are finished, the policy becomes usable.
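The counting step can be pictured with a short sketch in which the Server only sees opaque, hash-like tags for states and actions. The tag strings, the function name update_policy, and the choice to normalize counts into frequencies are assumptions made for illustration; the update rule actually used by the scheme is the one given by the equations of this section.

```python
from collections import Counter, defaultdict

def update_policy(tagged_history):
    """Count (state_tag, action_tag) pairs and turn the counts into frequencies.

    The tags stand in for unified ciphertexts: equal plaintexts yield equal
    tags, but the Server cannot map a tag back to its plaintext.
    """
    counts = defaultdict(Counter)
    for state_tag, action_tag in tagged_history:
        counts[state_tag][action_tag] += 1
    policy = {}
    for state_tag, actions in counts.items():
        total = sum(actions.values())
        policy[state_tag] = {a: c / total for a, c in actions.items()}
    return policy

# Hypothetical attempt history as seen by the Server.
history = [("s_7f", "a_c1"), ("s_7f", "a_c1"), ("s_7f", "a_9e"), ("s_02", "a_9e")]
policy = update_policy(history)
print(max(policy["s_7f"], key=policy["s_7f"].get))  # a_c1
```

Because the dictionary is keyed by tags, the resulting policy matrix is unordered with respect to the plaintext states, which is the situation the second challenge in the Introduction refers to.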
5.6. Policy Using
If a user needs to use the policy on the Server, it should ask the Key Generation Center for the task secret key and the unified value. Both should be distributed through a secret channel.
Having received the unified value, the user can calculate the unified ciphertext for each state and action as in (23).
In effect, all these values act as hash values. To make a request, the User encrypts it with a secret key of its own choosing and sends the authorization secretly to the Server, as in Section 5.3. The action returned by the Server is the one chosen by the policy. The Server records the state-action pair and encrypts it using the authorization for this user. Once the user decrypts the returned ciphertext, it obtains one of the unified values. After trying the action, the user reaches a new state and requests a new action. By repeating this process, the user can reach the target state, and the Server obtains a new attempt history for updating the policy.
5.7. Results Conversion and Decryption
With the steps above, multiple users can already train and use a policy in the encrypted state on a Server. If needed, a user can ask for the policy matrix; the Server encrypts the labels of the policy matrix in the same way as during policy use and sends the policy matrix to this user. The user can then map each label to the plaintext it represents.