Differentially Private Actor and Its Eligibility Trace

We present a differentially private actor and its eligibility trace in an actor-critic approach, wherein an actor takes actions by directly interacting with an environment, whereas the critic estimates only the state values obtained through bootstrapping. In other words, the actor reflects more detailed information about the sequence of taken actions in its parameters than the critic does. Moreover, their corresponding eligibility traces have the same properties. Therefore, it is necessary to preserve the privacy of an actor and its eligibility trace while training on private or sensitive data. In this paper, we confirm the applicability of differential privacy methods to actors updated using the policy gradient algorithm and discuss the advantages of such an approach compared with differentially private critic learning. In addition, we measured the cosine similarity between the differentially private eligibility trace and the non-differentially private eligibility trace to analyze whether anonymity is appropriately protected in the differentially private actor or critic. We conducted experiments on two synthetic examples imitating real-world problems in the medical and autonomous navigation domains, and the results confirmed the feasibility of the proposed method.


Introduction
Reinforcement learning (RL) defines the steps and procedures required to map situations to actions so as to maximize an accumulated reward signal [1] and serves as a practical framework for decision-making problems. RL can be applied to a variety of data types. In particular, it can be used on sensitive or private data, such as a patient's treatment procedure or trajectories for car navigation. However, with the development and deployment of diverse RL-based technologies, the demand for such private or sensitive data increases. Rather than using these raw data as they are, it is necessary to prevent personal privacy leakage while maintaining the original data's utility. Differential privacy (DP), first proposed by Dwork [2], is a privacy model used to design mechanisms that prevent the leakage of personal information due to multiple queries on a database storing sensitive data. Based on the DP concepts, many studies applying DP to deep learning have been conducted. In particular, Abadi et al. [3] proposed incorporating calibrated noise into the stochastic gradient descent computation to protect the privacy of training data.
Many prior RL works focused on DP have also been reported. First, Balle et al. [4] presented a differentially private algorithm for on-policy evaluation concerning synthetic medical domain problems. Lebensold et al. [5] recently extended their work to the actor-critic algorithm [6,7] with differentially private critics. They applied output perturbation [8] to guarantee privacy by adding randomized noise to all elements of the estimated state value vector in the least-squares policy evaluation method. Xie et al. [9] also introduced an application of off-policy evaluation to a differential privacy algorithm. They added noise to the stochastic gradient descent update [10,11] to guarantee privacy using off-policy evaluation in the GTD2 [12] algorithm. Wang and Hegde [13] suggested a differentially private Q-learning algorithm in a continuous space to preserve the privacy of the value function approximator by adding Gaussian process noise to the value function at each iteration during training.
However, most of the previous studies focused mainly on value-based algorithms and did not consider policy gradients and eligibility traces. Therefore, in the present research, we determine whether DP can be applied to the actor-critic algorithm, a representative policy-based method. In this approach, an actor takes actions and updates its parameters by directly interacting with an environment, whereas the critic estimates the state value only through bootstrapping. Although the state value function is necessary to learn the parameters u of an actor, it is not used to perform action selection. Instead, the actor learns its parameters in the direction of increasing the performance function J(u) computed when the actor follows the policy [1]. Therefore, the actor, which stores and updates its actions, requires more information than the critic, which only performs policy evaluation. In addition, an eligibility trace is a short-term memory vector that records credit according to the frequency and recency [14] of visited states or actions during each episode. The eligibility traces of the actor and the critic share the property of recording and updating information from the learning process. Therefore, if RL relies on private and sensitive data, it is necessary to ensure that the parameters and the eligibility trace of the actor are not leaked during training. For example, in the medical domain, the parameters and eligibility trace updated by an actor may indicate the recovery state of a patient and reveal the prescription defined by a doctor. Alternatively, in the car navigation domain, they may reveal the locations of a driver and the corresponding traces of movements.
Furthermore, their gradient used in the training process may represent a private tendency existing in the information (the recovery rate of a specific patient in the medical domain or the preferred private route of a user in the navigation domain). In this paper, we propose a method to protect the privacy of sensitive data corresponding to an actor and its eligibility trace during training in the actor-critic approach. To realize this, we apply the gradient perturbation method introduced by Abadi et al. [3] to the off-policy actor-critic [15] with some modifications.
The main contributions of this paper can be summarized as follows. We verify that the policy-based algorithm also guarantees DP and, more importantly, protects the privacy of its eligibility trace. In other words, we confirm the applicability of differential privacy to actors updated using the policy gradient method and demonstrate its superiority compared with differentially private critic learning. Furthermore, we measure the cosine similarity between the eligibility trace vectors of the differentially private and non-differentially private approaches to analyze whether their anonymity is protected in the differentially private actor or critic. We also evaluate the experimental results on two synthetic examples imitating real-world problems in the medical and autonomous navigation domains, thereby demonstrating the feasibility of the proposed method. We note explicitly that although the work reported by Lebensold et al. [5] (AC with DP critic) may seem similar to the proposed approach in terms of using the actor-critic algorithm, it differs in two aspects. First, the aim of their work was to guarantee DP in transfer learning for sample-inefficient RL. We do not consider transfer learning in our work; instead, we aim to guarantee the DP of an actor and its eligibility trace in online learning. Second, they implemented a trusted data aggregator (producer) providing a state value function guaranteed by DP-LSL [4], so that the state value was estimated using the first-visit Monte Carlo method without considering an eligibility trace, for several untrusted agents (consumers). The consumers then used this state value function as a critic in the actor-critic algorithm. Consequently, their work focused only on guaranteeing the DP of the critic, not the actor. In contrast, we focus on guaranteeing DP in the policy-based algorithm and propose a differentially private actor and its eligibility trace, instead of considering the critic.
The rest of this paper is organized as follows. We introduce the background concepts used in this work in Section 2 and provide the details of the proposed approach in Section 3. In Section 4, we discuss the details of the conducted experiments, providing results and discussion. Finally, we conclude this work and outline its limitations in Section 5.

Background
In this section, we introduce the essential definitions of the concepts used in this work: differential privacy and an off-policy actor-critic.

Differential Privacy
Differential privacy (DP) [2,8,16], introduced by Dwork et al., provides a robust standard framework to guarantee privacy in aggregated databases and protects against various adversarial attacks. DP relies on the concept of two neighboring data sets that differ by only a single element.

Definition 1.
A randomized mechanism M with domain D and range R, M : D → R, satisfies (ε, δ)-differential privacy if, for any two adjacent inputs d, d′ ∈ D and for any subset of outputs S ⊆ R, the following holds:

Pr[M(d) ∈ S] ≤ e^ε · Pr[M(d′) ∈ S] + δ. (1)

To approximate a deterministic real-valued function f : D → R with a differential privacy mechanism, we incorporate additive noise calibrated to the sensitivity s_f of f, defined as the maximum of the absolute distance |f(d) − f(d′)|, where d, d′ are adjacent input data sets. Choosing between the widely used Gaussian and Laplace noise mechanisms [2,8,16], in the present study we employ the Gaussian mechanism, defined as follows:

M(d) = f(d) + N(0, σ²), (2)

where N(0, σ²) is a normal (Gaussian) distribution with a mean of 0 and a standard deviation σ. To achieve (ε, δ)-differential privacy, we set σ = √(2 log(1.25/δ)) · s_f / ε [2]. We consider the deterministic real-valued function f : D → R to be the gradient of the performance of an actor (∇_u log π_u(a|s) = ∇_u π_u(a|s) / π_u(a|s)), which is the gradient of the probability of taking the action actually taken, divided by the probability of taking that action [1].
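The Gaussian mechanism described above can be sketched as follows. This is a minimal illustration under the σ = √(2 log(1.25/δ)) · s_f / ε calibration; the function name and argument names are our own, not part of the original work.

```python
import numpy as np

def gaussian_mechanism(f_value, sensitivity, epsilon, delta, rng=None):
    """Perturb a (vector-valued) query output with Gaussian noise whose
    scale is calibrated to the sensitivity s_f of f:
    sigma = sqrt(2 * log(1.25 / delta)) * s_f / epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    f_value = np.asarray(f_value, dtype=float)
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / epsilon
    return f_value + rng.normal(0.0, sigma, size=f_value.shape)
```

Note that a smaller privacy budget ε yields a larger σ, i.e., stronger perturbation of the released value.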

Off-Policy Actor-Critic
Off-policy actor-critic, introduced by Degris et al. [15], integrates properties of off-policy learning into the adaptability of action selection in the actor-critic framework (Algorithm 1).

Algorithm 1
Off-Policy Actor-Critic
1: Initialize the vectors e_u, e_v, and w to zero // eligibility traces for the actor (e_u) and critic (e_v)
2: Initialize the vectors u and v arbitrarily // parameters for the actor (u) and critic (v, w)
3: Initialize a state s
4: for each step do
5:   Choose an action a according to b(·|s)
6:   Observe the resultant reward r and next state s′
7:   Update the critic (GTD(λ) algorithm):
8:     δ ← r + γ(s′) v⊤x_{s′} − v⊤x_s ;  ρ ← π_u(a|s)/b(a|s)
9:     e_v ← ρ[x_s + γ(s)λ e_v]
10:    v ← v + α_v [δ e_v − γ(s′)(1 − λ)(w⊤e_v) x_{s′}]
11:    w ← w + α_w [δ e_v − (w⊤x_s) x_s]
12:  Update the actor:
13:    e_u ← ρ[∇_u log π_u(a|s) + γ(s)λ e_u]
14:    u ← u + α_u δ e_u
15:  s ← s′
16: end for

The actor learns the policy parameter u according to the temporal difference error estimated by the critic (GTD(λ) [17]). In this study, linear function approximation is used to estimate the state values V^π(s): V̂(s) = v⊤x_s, where x_s ∈ R^{N_v}, N_v ∈ N, is the feature vector of state s, and v ∈ R^{N_v} is the corresponding weight vector [15]. Using the importance sampling method, the actor and critic can conveniently update their parameters even if the action is selected by the behavior policy (b(a|s)) rather than the target policy (π_u(a|s)).
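A single Off-PAC update step under linear function approximation can be sketched in numpy as follows. The function name and the default step sizes are our own illustrative choices; the update equations follow Degris et al.'s Off-PAC with a GTD(λ) critic, here shown with a constant discount factor γ for simplicity.

```python
import numpy as np

def off_pac_update(u, v, w, e_u, e_v, x_s, x_s_next, r, rho, grad_log_pi,
                   gamma=0.99, lam=0.9, alpha_u=0.01, alpha_v=0.01, alpha_w=0.01):
    """One Off-PAC step: GTD(lambda) critic update followed by the actor update."""
    delta = r + gamma * (v @ x_s_next) - (v @ x_s)        # TD error
    e_v = rho * (x_s + gamma * lam * e_v)                 # critic eligibility trace
    v = v + alpha_v * (delta * e_v - gamma * (1 - lam) * (w @ e_v) * x_s_next)
    w = w + alpha_w * (delta * e_v - (w @ x_s) * x_s)
    e_u = rho * (grad_log_pi + gamma * lam * e_u)         # actor eligibility trace
    u = u + alpha_u * delta * e_u
    return u, v, w, e_u, e_v
```

In an episode loop, this function would be called once per step with the importance sampling ratio ρ = π_u(a|s)/b(a|s).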
In this algorithm, there are two kinds of eligibility traces: one for the actor (e_u) and one for the critic (e_v). In the case of the critic, the trace serves as an auxiliary to update the parameters of the state value function. In the case of the actor, it is used to update the parameter u of the policy. Because the critic acts only as a baseline for the actions taken by the actor, the eligibility trace of the actor has a more direct effect on action selection than that of the critic.
Therefore, in this study, we incorporate calibrated noise into the actor's gradient to protect the privacy of its eligibility trace. As we aim to determine how incorporating noise into an actor directly affects the target policy and its eligibility trace, we employ the ε-greedy policy [1] instead of the policy used in the off-policy actor-critic approach. More details are provided in Section 4.

Differentially Private Actor and Its Eligibility Trace Learning
In this section, we provide the details of the proposed algorithm: differentially private actor and its eligibility trace (DP-Actor) illustrated in Algorithm 2.
Algorithm 2 Differentially Private Actor and Its Eligibility Trace Learning
1: Initialize a weight vector w to zero
2: Initialize weight vectors v and u arbitrarily
3: for each episode do
4:   Initialize the eligibility trace vectors e_v and e_u to zero
5:   Initialize a state s
6:   for each step do
7:     Choose an action a by ε-greedy
8:     if a is selected by b(·|s): ρ ← π_u(a|s)/b(a|s)
9:     else: ρ ← 1
10:    Observe the resultant reward r and next state s′
11:    Update the critic (GTD(λ) algorithm):
12:      δ ← r + γ(s′) v⊤x_{s′} − v⊤x_s
13:      e_v ← ρ[x_s + γ(s)λ e_v]
14:      v ← v + α_v [δ e_v − γ(s′)(1 − λ)(w⊤e_v) x_{s′}]
15:      w ← w + α_w [δ e_v − (w⊤x_s) x_s]
16:    Update the noised actor:
17:      g(x) ← ∇_u log π_u(a|s) / (‖∇_u log π_u(a|s)‖₂ / |C|)
18:      g(x) ← g(x) + N(0, σ²C²I)
19:      e_u ← ρ[g(x) + γ(s)λ e_u]
20:      u ← u + α_u δ e_u
21:    s ← s′

Our work is based on prior works [3,15] with several modifications. First, although Abadi et al. [3] guaranteed DP for a non-convex objective function, we aim to provide DP for a convex one, as we use linear function approximation to estimate value functions, following Degris et al. [15]. The Gaussian distribution is used to generate a calibrated noise vector for the stochastic gradient ascent update of the actor. The loss function that serves as a measure of the performance of the policy π_u is defined as J(u). Its gradient (∇_u log π_u(a|s)) corresponds to the deterministic real-valued function f : D → R in Equation (2). We fix the mini-batch size to 1. Incorporating the noise vector generated by the Gaussian mechanism into actor learning consequently influences the corresponding eligibility trace vector e_u, which contains information about how recently and frequently states have been visited according to the actions taken by the actor. Second, as the proposed approach is based on linear function approximation of the value function (V̂(s) = v⊤x_s), where all parameters are represented as vectors and each element of the gradient receives a different noise value, we normalize the gradient vector and rescale it by the factor |C|, defined as the median of the absolute gradient values in each update step. Third, we modify the policy, defining it as ε-greedy instead of the policy used by Degris et al. [15].
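The noised-actor step (normalize the gradient, rescale by |C|, add Gaussian noise) can be sketched as follows. This is an illustrative sketch: the function name is our own, and it assumes a nonzero gradient vector.

```python
import numpy as np

def dp_actor_gradient(grad_log_pi, epsilon, delta, rng=None):
    """Normalize the actor's gradient by its L2 norm, rescale by |C|
    (the median of the absolute gradient values), and add Gaussian
    noise with per-element scale sigma * |C|."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(grad_log_pi, dtype=float)
    C = np.median(np.abs(g))                    # gradient-scale factor |C|
    g_norm = g / (np.linalg.norm(g) / C)        # normalized, rescaled gradient
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return g_norm + rng.normal(0.0, sigma * C, size=g.shape)
```

The returned noised gradient then replaces ∇_u log π_u(a|s) in the actor's eligibility trace and parameter updates.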
In their work, the behavior policy (b(a|s)) selects all actions, and the target policy (π_u(a|s)) is updated through importance sampling. However, we empirically demonstrate that it is more appropriate to use the ε-greedy policy. An action is selected by the target policy (π_u(a|s)) with a probability of 1 − ε, whereas the action is selected by the behavior policy (b(a|s)) with a probability of ε. A uniform random distribution is used for the behavior policy. The importance sampling ratio ρ is thus calculated in two ways: as π_u(a|s)/b(a|s) if the behavior policy takes the action, and as 1 if the target policy selects the action.
The proposed algorithm includes two key steps. First, we obtain the gradient of the performance vector of the actor (∇_u log π_u(a|s)), normalize it by its L2-norm, and multiply it by the median of the absolute gradient values |C|. Then, we add noise to this normalized gradient using a vector sampled from the Gaussian distribution N(0, σ²C²I). Second, the noised gradient of the actor influences the process of updating its eligibility trace e_u. Consequently, the affected eligibility trace propagates into the update of the policy parameter u, which finally impacts action selection. Overall, the parameter u and its eligibility trace e_u are protected in terms of privacy as a result of the noise calibrated into the stochastic gradient ascent during the actor's learning process.

Experiments
In this section, we report the results obtained by applying the proposed algorithm to two kinds of synthetic toy examples. We conducted simulations concerning real cases that require privacy protection. First, as the gradient update occurs in both the actor and the critic, we compared performance in terms of rewards for all cases in which the gradient is perturbed by the noise vector. The cases included the proposed model based on a differentially private actor (DP-Actor), a differentially private critic (DP-Critic), and a combined differentially private actor and critic (DP-Both). Moreover, we analyzed the non-differentially private (non-DP) model without noise perturbation as a baseline for performance evaluation. Algorithms 3 and 4 illustrate the DP-Critic and DP-Both approaches, in which we applied the same method of gradient perturbation as in the DP-Actor, e.g., for the actor:

g(x) ← ∇_u log π_u(a|s) / (‖∇_u log π_u(a|s)‖₂ / |C_actor|)
g(x) ← g(x) + N(0, σ²C²_actor I)

Please note that in these algorithms, |C_actor| and |C_critic| are the absolute medians of the gradients of the actor and critic, respectively. Second, to ensure that the privacy of eligibility traces was protected appropriately, we measured the cosine similarity of eligibility traces between the non-DP and DP cases (DP-Actor, DP-Critic, and DP-Both). Finally, in the policy improvement part, we explain the reasons for using the ε-greedy policy instead of the soft-max one, even though our work belongs to the family of policy-gradient algorithms, based on a performance comparison between the two policies.
We set the number of episodes in both environments to 500. All parameters were processed according to Table 1 to find the optimal parameter combination, with a discount factor γ = 0.99 and δ = 10^−6. As we incorporated the calibrated noise into each element of the gradient vector, we set C equal to the median value of the gradient. Therefore, it was proportional to the scale of the gradient of the actor or the critic.

Experimental Setup
We introduce the details of two synthetic examples imitating real-world problems. One is an environment representing a patient's medical history (Patient Treatment Progression), and the other (Taxi-v2) imitates a user's trajectories in the automatic navigation environment.

Patient Treatment Progression
According to Lebensold et al. [5], Markov decision process (MDP) experiments can be considered clinical in nature, meaning that a patient's treatment and status data are represented as a state vector, similar to [18]. We established the MDP setup similarly to [5], so that it comprised a chain of 100 states (MDP-100). In each state, an agent can execute two kinds of actions with state transition probabilities: staying or moving to the right. For example, if the agent chooses the moving-to-the-right action, then, with a probability of 0.1, it stays in its current state, or, with a probability of 0.9, it advances to the right. The agent received a reward of −1 at each time step and a reward of 1 in the absorbing state. Although this setup described policy evaluation in the medical domain [4], the eligibility trace recording historical states and action information can be used to represent the past progress of a patient as traces of recovery.

Figure 1 describes the second practical case, corresponding to the navigation tracking domain. We analyzed the case of Taxi-v2, introduced by [19]. Its purpose is to pick up passengers at one of the predefined locations (R, G, Y, or B) and drop them off at a specified destination (R, G, Y, or B). It is a 5 × 5 grid world comprising 500 discrete states corresponding to 25 taxi locations, 5 passenger locations, and 4 destinations. Actions are composed of six discrete options: {0: move south, 1: move north, 2: move east, 3: move west, 4: pick up a passenger, 5: drop off a passenger}. An agent received a −1 reward for each time step, a +20 reward for picking up and dropping off the passenger correctly at the proper destination, and −10 otherwise. This simple environment can be considered a database that stores and uses the trajectories of a person in the automatic navigation domain.
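The MDP-100 chain environment can be sketched as a small simulator. This is our own illustrative implementation of the dynamics described above (class and method names are assumptions, not code from the original work).

```python
import numpy as np

class ChainMDP:
    """Sketch of the MDP-100 chain: actions {0: stay, 1: move right};
    moving right succeeds with probability 0.9; the reward is -1 per
    step and +1 on reaching the absorbing final state."""
    def __init__(self, n_states=100, p_advance=0.9, seed=None):
        self.n_states = n_states
        self.p_advance = p_advance
        self.rng = np.random.default_rng(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action == 1 and self.rng.random() < self.p_advance:
            self.state += 1
        done = self.state == self.n_states - 1
        reward = 1.0 if done else -1.0
        return self.state, reward, done
```

An agent interacting with this chain would call `reset()` at the start of each episode and `step(action)` until `done` is True.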

Results
We analyze the results of our work from three perspectives. First, we evaluated our work in terms of average reward under various privacy budget values. Second, we measured the anonymity of the eligibility trace vectors when DP was applied, using cosine similarity. Finally, we explain the reasons for using the ε-greedy policy rather than a soft-max one. Table 2 presents the best parameter combinations in both considered environments (MDP-100 and Taxi-v2). Based on Figure 2, the following three observations were derived. First, the DP-Critic achieved the highest performance in all environments, meaning that the critic exhibited the greatest robustness to noise. Second, the differences in performance between the DP-Critic and DP-Actor were larger in Taxi-v2 than in MDP-100. Third, the learning variances of DP-Both and DP-Actor were greater than those of the DP-Critic (Table 3). The main reasons for these observations are related to the scale of the gradient values and the complexity of the considered environments. We empirically checked that the gradient of the critic ([δe_v − γ(s′)(1 − λ)(w⊤e_v)x_{s′}]) was larger than that of the actor (∇_u log π_u(a|s)), resulting in robustness to perturbation by noise. Accordingly, due to the relatively larger noise impact on the actor's gradient, DP-Actor exhibited a larger learning variance than DP-Critic. The complexity of the considered environments also contributed to these phenomena. We assumed that the complexity of an environment depends on the number of states and actions. Accordingly, MDP-100 was deemed less complex (100 states and 2 actions) than Taxi-v2 (500 states and 6 actions). Therefore, the behavioral freedom of an agent differed depending on the environment.
In other words, with an increase in the number of action types to be selected in a state and a rising number of states, the sensitivity to noise increased during training because the agent could not select proper actions due to the noise in its parameter update process (Table 3 reports the performance standard deviations). We observed a significant difference in performance between the proposed method and AC with DP-Critic in Taxi-v2; however, there are some aspects to consider before comparing these results directly. First, AC with DP-Critic uses DP-LSL [4], one of the output perturbation methods, to guarantee DP. In the proposed method, we instead used the gradient perturbation approach. Second, AC with DP-Critic did not employ an eligibility trace to accelerate learning progress, and therefore required a larger number of episodes to converge to a local optimum than the proposed method (the authors of [5] set the number of episodes to 10,000 in the corresponding experiment). Figure 3 presents the performance changes corresponding to various privacy budgets (ε). Although the performance of DP-Actor shown in Figure 2 was inferior to that of DP-Critic, as seen in Figure 3, the performance degradation at lower privacy budgets was more stable for DP-Actor than DP-Critic. DP-Actor exhibited almost identical performance over a wider privacy budget range (ε = {10, 5, 1, 0.5} in MDP-100, and ε = {10, 5, 1} in Taxi-v2) than DP-Critic (ε = {10, 5, 1} in MDP-100, ε = {10, 5} in Taxi-v2). We consider the reason for this difference to be the smaller value of |C| for the actor compared with that of the critic.
In other words, as we added noise with scale based on σ²C², its scale for the actor was relatively smaller than that of the critic (due to the smaller gradient scale), even as the privacy budget decreased in the process of determining the noise scale. Consequently, DP-Actor had more opportunity to lower the privacy budget without significant performance degradation compared with DP-Critic. We also noted that the performance changes associated with the privacy budget varied depending on the complexity of the environment. The complexity of an environment affected the distribution of performance estimates as the privacy budget changed. In MDP-100, this distribution was more cohesive than in Taxi-v2. Specifically, the DP-Both case, combining DP-Actor and DP-Critic, demonstrated more significant performance degradation from ε = 5 to ε = 1 in Taxi-v2 than in MDP-100. Therefore, the complexity of the environment interacting with the agent must be considered when defining a proper privacy budget.

Anonymity for the Eligibility Traces
We measured the cosine similarity between the eligibility trace vectors of non-DP and those of DP-Actor, DP-Critic, and DP-Both to determine to what extent the privacy of the vectors was protected, as shown in Table 4.
Based on the aforementioned observations, we derived the following conclusions. As |C| of the critic was relatively larger than that of the actor, the amount of noise incorporated into the critic was more prominent. Therefore, it was natural to expect the cosine similarity of the critics to be lower than that of the actors. On the contrary, the cosine similarity of DP-Critic differed considerably from that of DP-Actor, with higher similarity values. This is because the update processes of the gradients of the actor and the critic, as well as of their eligibility traces, differ. In the case of the actor, when the gradient perturbed by noise (g(x) + N(0, σ²C²I)) was applied, it directly influenced the eligibility trace (e_u ← ρ[g(x) + γ(s)λe_u]). However, in the case of the critic, the noised gradient ([δe_v − γ(s′)(1 − λ)(w⊤e_v)x_{s′}]) did not have any impact on its eligibility trace and affected only the bootstrapping parameter v of the critic. The influence of the inserted noise was further reduced when calculating the TD error (δ). Therefore, the cosine similarity in DP-Critic was inevitably higher than that of DP-Actor, meaning that DP-Critic was associated with a greater risk of privacy leakage than DP-Actor. As seen in Figures 2 and 4 and Table 3, the lower the performance due to noise, the lower the similarity of the eligibility trace vector. Therefore, it is necessary to balance the performance degradation against the degree of anonymity of the eligibility traces. We also found that the degree of anonymity of an eligibility trace vector depended on the complexity of the environment: with an increase in complexity, the cosine similarity of the overall eligibility trace vector decreased. In particular, DP-Critic exhibited lower similarity in Taxi-v2 than in MDP-100.
However, compared with DP-Actor, there was still a slightly higher risk of privacy leakage for DP-Critic in the case of Taxi-v2. Conversely, although DP-Actor exhibited performance degradation compared with the non-DP approach in both the MDP-100 and Taxi-v2 setups, its cosine similarity was maintained at a constant level, indicating that privacy was protected to a certain degree regardless of the environment. Table 4 reports the cosine similarities of the eligibility trace vectors between the vectors from non-DP (e_u, e_v) and the corresponding vectors from each case (DP-Actor (e_u), DP-Critic (e_v), and DP-Both (e_u, e_v)). We averaged the vectors generated from the same episode over 10 runs to avoid random seed effects.
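The cosine similarity measurement used above can be sketched as follows; the function name is our own, and the interpretation comment reflects how similarity is read in this anonymity analysis.

```python
import numpy as np

def cosine_similarity(e_nondp, e_dp):
    """Cosine similarity between two eligibility-trace vectors. Values
    near 1 mean the DP trace still mirrors the non-private trace
    (weaker anonymity); lower values indicate stronger anonymity."""
    e_nondp = np.asarray(e_nondp, dtype=float)
    e_dp = np.asarray(e_dp, dtype=float)
    denom = np.linalg.norm(e_nondp) * np.linalg.norm(e_dp)
    return float(e_nondp @ e_dp) / denom if denom > 0 else 0.0
```

In the experiments above, this quantity would be computed between trace vectors averaged over 10 runs of the same episode.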

Reasons for Using the ε-Greedy Policy
We parameterize the policy with ε-greedy rather than a soft-max over action preferences. Although parameterizing the policy with the soft-max distribution has advantages over the ε-greedy policy in policy-gradient algorithms [1], we empirically find that in all DP-applied cases, if we set the policy to ε-greedy, the DP-Actor converges to a local optimum faster than with the soft-max one. Figure 5 illustrates this phenomenon, which can be explained by the fact that the soft-max policy must consider the preferences over all other actions, with relatively small gradients. This property of the soft-max results in more exploration due to the inserted noise compared with the exploration required by the ε-greedy policy, which considers only the action with the largest preference with probability 1 − ε. Therefore, the soft-max policy inevitably converges to a local optimum more slowly than the ε-greedy one.
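The ε-greedy action selection over the target policy's preferences can be sketched as follows. This is our own illustrative helper (names assumed); returning whether the behavior policy was used makes it easy to set the importance sampling ratio ρ as described in Section 3.

```python
import numpy as np

def epsilon_greedy(target_probs, eps, rng=None):
    """With probability 1 - eps, take the target policy's most preferred
    action; with probability eps, sample uniformly from the behavior
    policy b. Returns (action, chosen_by_behavior)."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < eps:
        action = int(rng.integers(len(target_probs)))   # behavior policy b
        return action, True
    return int(np.argmax(target_probs)), False          # target policy pi_u
```

The flag can then be used as: ρ = π_u(a|s)/b(a|s) when the behavior policy selected the action, and ρ = 1 otherwise.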

Figure 5.
Learning curves for the two different policies (ε-greedy and soft-max). We selected the best-performing parameter sets among all parameter combinations after averaging the rewards for each parameter set over 10 runs.

Conclusions
We proposed a differentially private actor and a learning method for its eligibility trace in the actor-critic approach. Inspired by the work of Abadi et al. [3], we modified the off-policy actor-critic algorithm [15] and confirmed that DP was guaranteed in the policy-based algorithm. As a result of the experiments, we determined that the DP-Actor was more advantageous than the DP-Critic and DP-Both cases. First, although the DP-Actor demonstrated slightly inferior performance compared with the DP-Critic, it maintained stable performance over a larger privacy budget range. Second, we confirmed that the anonymity level of the eligibility trace vectors was higher when using the DP-Actor than in the other cases. We measured cosine similarities to quantify the degree of anonymity between the eligibility trace vectors of the considered DP cases and the case without DP, and found that the similarity values of DP-Actor exhibited greater independence from the non-DP eligibility trace than those of DP-Critic.
The proposed method can be applied only to discrete state and action spaces. As there are many continuous spaces in the real world, in future work, we plan to extend our work to more complicated continuous spaces, where we can construct the value functions using high-performing deep learning techniques (i.e., deep reinforcement learning). In this way, we aim not only to solve complicated problems reflecting real-world situations but also to enhance performance. In addition, we also plan to develop DP versions of various state-of-the-art RL algorithms.