Our work is based on prior works [3,15] with several modifications. First, although Abadi et al. [3] guaranteed DP for a non-convex objective function, we aim to provide DP for a convex objective, as we use linear function approximation to estimate value functions, following Degris et al. [15].
The Gaussian distribution is used to generate a calibrated noise vector for the stochastic gradient ascent update of the actor. The loss function that serves as a measure of the performance of the policy ($\pi$) is defined as $J(u)$. Its gradient ($\nabla_u J(u)$) corresponds to the deterministic real-valued function $f$ in Equation (2). We fix the mini-batch size to 1. Incorporating the noise vector generated by the Gaussian mechanism into actor learning consequently influences the corresponding eligibility trace vector ($e_u$), which records how recently and how frequently states have been visited according to the actions taken by the actor. Second, as the proposed approach is based on linear function approximation to estimate the value function ($\hat{V}(s)$), where all parameters are represented as vectors and each element of the gradient receives noise corresponding to a different value, we normalize the gradient vector and amplify it by a factor ($g_{med}$) defined as the median of the absolute values of the gradient components in each update, as sketched below.
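A minimal sketch of this normalization step, assuming the gradient is a NumPy vector; the function name and the small guard against a zero norm are illustrative additions, not part of the original method.

```python
import numpy as np

def normalize_and_amplify(grad):
    """Scale the gradient to unit l2-norm, then amplify it by the median
    of its absolute components (g_med), so that all elements share a
    comparable magnitude before noise is added. Sketch only; the helper
    name and the zero-norm guard are assumptions."""
    g_med = np.median(np.abs(grad))        # amplification factor g_med
    norm = np.linalg.norm(grad)
    return grad / (norm + 1e-12) * g_med   # guard against a zero norm
```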
Third, we modify the policy, defining it as $\epsilon$-greedy instead of following Degris et al. [15]. In their work, the behavior policy ($b$) selects all actions and updates the target policy ($\pi$) through importance sampling. However, we empirically demonstrate that it is more appropriate to use the $\epsilon$-greedy policy.
An action is selected by the target policy ($\pi$) with a probability of $1-\epsilon$, whereas the action is selected with a probability of $\epsilon$ by the behavior policy ($b$). A uniform random distribution is used in the behavior policy. The importance sampling ratio ($\rho$) can be calculated in two ways: as $\pi(a|s)/b(a|s)$ if the behavior policy takes the action, and as the ratio between $\pi(a|s)$ and 1 if the target policy selects the action ($a$).
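This action-selection scheme can be sketched as follows for a discrete action space, assuming the target policy is given as a probability vector `pi_probs`; the function name and the choice to sample from $\pi$ (rather than act greedily) are assumptions.

```python
import numpy as np

def select_action(pi_probs, epsilon, rng=None):
    """Modified epsilon-greedy selection: with probability 1 - epsilon the
    target policy pi chooses the action; with probability epsilon a uniform
    random behavior policy b chooses it. Returns the chosen action and the
    importance sampling ratio rho (sketch, not the original code)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(pi_probs)
    if rng.random() < epsilon:
        a = int(rng.integers(n))             # behavior policy b: uniform
        rho = pi_probs[a] / (1.0 / n)        # rho = pi(a|s) / b(a|s)
    else:
        a = int(rng.choice(n, p=pi_probs))   # target policy pi selects
        rho = pi_probs[a] / 1.0              # ratio between pi(a|s) and 1
    return a, rho
```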
The proposed algorithm includes two key steps. First, we obtain the gradient of the performance vector ($\nabla_u J(u)$) of the actor, normalize it with its $\ell_2$-norm, and multiply it by the median of the absolute gradient values ($g_{med}$). Then, we add noise to this normalized gradient using a vector sampled from the Gaussian distribution, with its scale calibrated to the privacy budget as in Equation (2).
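A sketch of this first step follows; the closed-form scale $\sigma = \sqrt{2\ln(1.25/\delta)}\,\Delta_2 f/\epsilon$ is the standard Gaussian-mechanism calibration and is assumed here, as are the helper names and the unit sensitivity of the normalized gradient.

```python
import numpy as np

def noise_scale(eps, delta, sensitivity):
    """Calibrated Gaussian-mechanism scale (standard form, assumed):
    sigma = sqrt(2 * ln(1.25 / delta)) * sensitivity / eps."""
    return np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / eps

def dp_actor_gradient(grad, eps, delta, rng=None):
    """Step 1 (sketch): l2-normalize the actor gradient, amplify it by
    g_med, and perturb it with a calibrated Gaussian noise vector."""
    rng = np.random.default_rng() if rng is None else rng
    g_med = np.median(np.abs(grad))                    # amplification factor
    g = grad / (np.linalg.norm(grad) + 1e-12) * g_med  # normalized gradient
    sigma = noise_scale(eps, delta, sensitivity=1.0)   # unit sensitivity assumed
    return g + rng.normal(0.0, sigma, size=g.shape)
```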
Second, the noised gradient of the actor influences the process of updating its eligibility trace ($e_u$). Consequently, the affected eligibility trace accelerates the process of updating the policy parameter ($u$), which finally impacts the selection of actions. Overall, the parameter $u$ and its eligibility trace $e_u$ are protected in terms of privacy as a result of the noise calibrated into the stochastic gradient ascent during the actor's learning process.
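A minimal sketch of this second step, following the accumulating-trace actor update of Degris et al. [15]; the labels `lam` (trace decay), `alpha` (step size), and `delta_td` (the critic's temporal-difference error) are ours.

```python
def actor_update(u, e_u, noised_grad, rho, delta_td, lam, alpha):
    """Step 2 (sketch): fold the noised gradient into the eligibility trace
    e_u, then take a stochastic gradient ascent step on the policy parameter
    u; both u and e_u thus carry the calibrated noise."""
    e_u = rho * (lam * e_u + noised_grad)   # trace update with IS ratio rho
    u = u + alpha * delta_td * e_u          # SGA step scaled by the TD error
    return u, e_u
```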