Article

Privacy-Preserving Design of Scalar LQG Control

1 School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
2 School of Electrical, Electronic, and Information Engineering “Guglielmo Marconi”—DEI, University of Bologna, 40136 Bologna, Italy
* Author to whom correspondence should be addressed.
Entropy 2022, 24(7), 856; https://doi.org/10.3390/e24070856
Submission received: 17 April 2022 / Revised: 19 June 2022 / Accepted: 20 June 2022 / Published: 22 June 2022
(This article belongs to the Special Issue Adversarial Intelligence: Secrecy, Privacy, and Robustness)

Abstract
This paper studies the agent identity privacy problem in the scalar linear quadratic Gaussian (LQG) control system. The agent identity is a binary hypothesis: Agent A or Agent B. An eavesdropper is assumed to perform a hypothesis test on the agent identity based on the intercepted environment state sequence. The privacy risk is measured by the Kullback–Leibler divergence between the probability distributions of the state sequences under the two hypotheses. By taking into account both the accumulative control reward and the privacy risk, an optimization problem for the policy of Agent B is formulated. This paper shows that the optimal deterministic privacy-preserving LQG policy of Agent B is a linear mapping. A sufficient condition is given to guarantee that the optimal deterministic privacy-preserving policy is time-invariant in the asymptotic regime. It is also shown that adding an independent Gaussian random process noise to the linear mapping of the optimal deterministic privacy-preserving policy cannot improve the performance of Agent B. The numerical experiments validate the theoretical results and illustrate the reward–privacy trade-off.

1. Related Work

During the last decades, control technologies have been widely employed and have significantly improved industrial productivity, management efficiency, and everyday convenience. The breakthrough of deep reinforcement learning (DRL) technology [1] enables control systems to become intelligent and applicable to more complicated tasks. Along with the increasing concerns about information security and privacy, adversarial problems in control systems have also attracted increasing attention recently.
The related literature is introduced and discussed in the following. Two types of adversarial problems are considered in these works: active attacks and privacy problems.

1.1. Research on Active Adversarial Attacks

Most previous works focus on studying active adversarial attacks on control systems, which aim to degrade the control efficiency or, even worse, to drive the system to an undesired state, and on developing the corresponding defense mechanisms. Depending on their methodologies, these works can be divided into two classes. One class develops adversarial reinforcement learning algorithms under attack. The other class makes a theoretical study of the adversarial problem in the standard control model.
DRL takes advantage of deep networks to represent complex non-linear value or policy functions. Like deep networks, DRL is also vulnerable to adversarial example attacks, i.e., the DRL-trained policy can be misled into taking a wrong action by adding a minor distortion to the observation of the agent [2]. In [2,3,4,5], the optimal generation of adversarial examples has been studied for given DRL algorithms. As a countermeasure, the mechanism of adversarial training uses adversarial examples in the training phase to enhance the robustness of the control policy under attack [6,7,8]. In [9,10], attack/robustness-related regularization terms are added to the optimization objective to improve the robustness of the policy.
In most theoretic studies, adversarial attack problems are modeled from the game theoretic perspective. Stochastic game (SG) [11] and partially observable SG (POSG) can model the indirect (In SG or POSG, players indirectly interact with each other by feeding their actions back to the dynamic environment.) interactions between multiple players in the dynamic control system and have been employed in the robust or adversarial control studies [12,13,14]. Cheap talk game [15] models direct (In the cheap talk game, the sender with private information sends a message to the receiver and the receiver takes an action based on the received message and a belief on the inaccessible private information.) interactions between a sender and a receiver. In [16,17,18,19], the single-step cheap talk game has been extended to dynamic cheap talk games to model the adversarial example attacks in the multi-step control systems. With uncertainty about the environment dynamics in a partially observable Markov decision process (POMDP), the robust POMDP is formulated as a Stackelberg game in [20], where the agent (leader) optimizes the control policy under the worst-case assumption of the environment dynamics (follower). Another kind of adversarial attack maliciously falsifies the agent actions and feeds the falsified actions back to the dynamic environment to degrade the control performance. The falsified action attack can be modeled by Stackelberg games [21,22], where the dynamic environment is the leader and the adversarial agent is the follower. In our previous work [23], the falsified action attack on the linear quadratic regulator control is modeled by a dynamic cheap talk game and the adversarial attack is evaluated by the Fisher information between the random agent action and the falsified action.
Optimal stealthy attacks have also been studied. In [24,25], Kullback–Leibler divergence is used to measure the stealthiness of the attacks on the control signal and the sensing data, respectively; then the optimal attacks against LQG control system are developed with the objective of maximizing the quadratic cost while maintaining a degree of attack stealthiness.

1.2. Research on Privacy Problems

Besides active attacks, passive eavesdropping in control systems leads to privacy problems. Most works focus on preserving the privacy-sensitive environment states. The design of agent actions in the Markov decision process has been investigated when the equivocation of states given system inputs and outputs is imposed as the privacy-preserving objective [26]. In [27,28,29,30], the notion of differential privacy [31] is introduced into multi-agent control, where each agent adds privacy noise to its states before sharing them with other agents while still guaranteeing that the whole control system network operates well. The reward function is a succinct description of the control task and is strongly related to the agent actions. The DRL-learned value function can reveal the privacy-sensitive reward function. Regarding this privacy problem, functional noise is added to the value function in Q-learning such that neighboring reward functions are indistinguishable [32]. As a promising computational secrecy technology, labeled homomorphic encryption has been employed to encrypt the private states, gain matrices, control inputs, and intermediary steps in the cloud-outsourced LQG [33].

2. Introduction

2.1. Motivation

In this paper, we consider the agent identity privacy problem in LQG control, which is motivated by inverse reinforcement learning (IRL). IRL algorithms [34] can reconstruct the reward functions of agents and therefore can also be maliciously exploited to identify the agents. Similar to many other privacy problems in the big data era, such as the smart meter privacy problem, the agent identity in a control system is privacy-sensitive. If the agent identity is leaked, an adversary can further launch the corresponding optimal attacks on the control system.

2.2. Content and Contribution

We model the agent identity privacy problem as an adversarial binary hypothesis testing and employ the Kullback–Leibler divergence between the probability distributions of environment state sequences under different hypotheses as the privacy risk measure. We formulate a novel optimization problem and study the optimal privacy-preserving LQG policy. This work is compared with the previous research on privacy problems in Table 1.
The rest of this paper is organized as follows. In Section 3, we formulate the agent identity privacy problem in the LQG control system. In Section 4, we optimize the deterministic privacy-preserving LQG policy and give a sufficient condition for time-invariant optimal deterministic policy in the asymptotic regime. In Section 5, we discuss the random privacy-preserving LQG policy and show that the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy. In Section 6, we present and analyze the numerical experiment results. Section 7 concludes this paper.

2.3. Notation

Unless otherwise specified, we denote a random scalar by a capital letter, e.g., $X$, its realization by the corresponding lower case letter, e.g., $x$, the Gaussian distribution with mean $\mu$ and variance $\sigma^2$ by $\mathcal{N}(\mu,\sigma^2)$, the expectation operation by $\mathbb{E}(\cdot)$, the Kullback–Leibler divergence between two probability distributions by $D(\cdot\,\|\,\cdot)$, and the natural logarithm by $\log(\cdot)$.

3. Agent Identity Privacy Problem in LQG Control

We consider an N-step LQG control in the presence of an eavesdropper as shown in Figure 1. There are two possible agents, Agent A and Agent B, which correspond to the null hypothesis $H=0$ and the alternative hypothesis $H=1$, respectively. We assume that the agents and the eavesdropper have perfect observations of the environment states. Based on the intercepted state sequence, the eavesdropper performs a binary hypothesis test (a binary hypothesis is considered in this paper for simplicity and can be extended to a multi-hypothesis setting) to identify the current agent, which results in an agent identity privacy problem. To better understand this privacy problem, we give an example from the emerging application of autonomous vehicles. An autonomous vehicle can be controlled by a human driver (Agent A) or an autonomous driving system (Agent B). An adversary, e.g., a compromised manager of the vehicle-to-everything (V2X) network, has access to the sensing data (environment state) of the autonomous vehicle and aims to attack the autonomous vehicle, e.g., to mislead it off the lane. To this end, the adversary first needs to identify whether the current driver is the autonomous driving system from the intercepted sensing data sequence. The agent identity privacy problem commonly exists in intelligent autonomous systems, e.g., unmanned aerial vehicles and robots, where the autonomous control agents depend strongly on the sensing data, are vulnerable to injection attacks, and therefore have privacy-sensitive identities.
The LQG control model for each agent is given as follows: for $H=0$ or $H=1$ and $1 \le i \le N$,
$$s_{i+1}^{(H)} = \alpha s_i^{(H)} + \beta a_i^{(H)} + z_i, \tag{1}$$
$$a_i^{(H)} = F_i^{(H)}\big(s_i^{(H)}\big), \tag{2}$$
$$r_i^{(H)} = R^{(H)}\big(s_i^{(H)}, a_i^{(H)}\big) = -\theta^{(H)}\big(s_i^{(H)}\big)^2 - \phi^{(H)}\big(a_i^{(H)}\big)^2, \tag{3}$$
$$S_1^{(H)} \sim b_1^{(H)} = \mathcal{N}\big(\mu_1, \sigma_1^2\big), \tag{4}$$
$$Z_i \sim \mathcal{N}\big(0, \omega^2\big), \tag{5}$$
where the parameters $\alpha \neq 0$, $\beta \neq 0$, $\theta^{(H)} > 0$, $\phi^{(H)} > 0$, $\mu_1$, $\sigma_1^2 > 0$, and $\omega^2 > 0$ are given. The initial environment state $s_1^{(H)}$ is randomly generated following an independent Gaussian distribution. In the $i$-th time step, on observing the environment state $s_i^{(H)}$, the agent corresponding to the hypothesis $H$ employs the control policy $F_i^{(H)}$ to (randomly) determine an action $a_i^{(H)}$ as in (2); the instantaneous control reward $r_i^{(H)}$ is jointly determined by the current state $s_i^{(H)}$ and action $a_i^{(H)}$ as in (3); the next state $s_{i+1}^{(H)}$ is jointly determined by the current state $s_i^{(H)}$, the current action $a_i^{(H)}$, and $z_i$, which is randomly generated following an independent zero-mean Gaussian distribution, as in (1). In the standard LQG problem, the agent corresponding to the hypothesis $H$ only aims to maximize the expected accumulative reward by optimizing the control policies $F_{1:N}^{(H)}$:
$$F_{1:N}^{*(H)} = \arg\max_{F_{1:N}^{(H)}} \mathbb{E}\left[\sum_{i=1}^{N} R^{(H)}\big(S_i^{(H)}, A_i^{(H)}\big)\right]. \tag{6}$$
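For concreteness, the dynamics (1)–(5) and the reward (3) can be simulated directly. The following is a minimal NumPy sketch (my own code, not the authors'; the parameter values are illustrative placeholders, not the defaults of Table 3):

```python
import numpy as np

def lqg_rollout(policy, alpha, beta, theta, phi, mu1, sigma1, omega, N, rng):
    """Simulate one N-step trajectory of the scalar LQG model (1)-(5).

    `policy` maps (step index i, state s) to an action a; the reward (3) is
    r_i = -theta * s_i**2 - phi * a_i**2.
    """
    s = mu1 + sigma1 * rng.standard_normal()      # S_1 ~ N(mu_1, sigma_1^2), eq. (4)
    states, actions, rewards = [s], [], []
    for i in range(1, N + 1):
        a = policy(i, s)                          # a_i = F_i(s_i), eq. (2)
        r = -theta * s**2 - phi * a**2            # eq. (3)
        z = omega * rng.standard_normal()         # Z_i ~ N(0, omega^2), eq. (5)
        s = alpha * s + beta * a + z              # eq. (1)
        actions.append(a); rewards.append(r); states.append(s)
    return np.array(states), np.array(actions), np.array(rewards)

# Example: a do-nothing policy on illustrative (non-paper) parameters.
rng = np.random.default_rng(0)
S, A, R = lqg_rollout(lambda i, s: 0.0, alpha=0.9, beta=1.0, theta=1.0,
                      phi=16.0, mu1=0.0, sigma1=1.0, omega=1.0, N=50, rng=rng)
print("average reward:", R.mean())
```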
The optimal LQG control policy has been well established [35] and can be described as follows: for $H=0$ or $H=1$ and $1 \le i \le N$,
$$\tilde\theta_{N+1}^{(H)} = 0, \tag{7}$$
$$\tilde\theta_i^{(H)} = L^{(H)}\big(\tilde\theta_{i+1}^{(H)}\big) = \theta^{(H)} + \tilde\theta_{i+1}^{(H)}\alpha^2 - \frac{\big(\tilde\theta_{i+1}^{(H)}\big)^2\alpha^2\beta^2}{\phi^{(H)} + \tilde\theta_{i+1}^{(H)}\beta^2} > 0, \tag{8}$$
$$\kappa_i^{*(H)} = -\frac{\tilde\theta_{i+1}^{(H)}\alpha\beta}{\phi^{(H)} + \tilde\theta_{i+1}^{(H)}\beta^2}, \tag{9}$$
$$F_i^{*(H)}\big(s_i^{(H)}\big) = \kappa_i^{*(H)} s_i^{(H)}. \tag{10}$$
For $H=0$ or $1$, it can be easily verified that the mapping $L^{(H)}$ is order-preserving, i.e., $L^{(H)}(x) \le L^{(H)}(x')$ if $0 \le x \le x'$. From Kleene's fixed point theorem [36], it follows that
$$\tilde\theta^{(H)} = \lim_{N\to\infty} \underbrace{L^{(H)}\big(L^{(H)}\big(\cdots L^{(H)}\big(L^{(H)}}_{N\ \text{iterations}}\big(\tilde\theta_{N+1}^{(H)}\big)\big)\cdots\big)\big) = \theta^{(H)} + \tilde\theta^{(H)}\alpha^2 - \frac{\big(\tilde\theta^{(H)}\big)^2\alpha^2\beta^2}{\phi^{(H)} + \tilde\theta^{(H)}\beta^2} = \frac{\sqrt{\big(\phi^{(H)} - \theta^{(H)}\beta^2 - \phi^{(H)}\alpha^2\big)^2 + 4\theta^{(H)}\phi^{(H)}\beta^2} - \big(\phi^{(H)} - \theta^{(H)}\beta^2 - \phi^{(H)}\alpha^2\big)}{2\beta^2}. \tag{11}$$
Therefore, if we consider the asymptotic regime as $N \to \infty$, the optimal control policies are time-invariant: for $H=0$ or $H=1$ and $i \ge 1$,
$$\kappa^{*(H)} = -\frac{\tilde\theta^{(H)}\alpha\beta}{\phi^{(H)} + \tilde\theta^{(H)}\beta^2}, \tag{12}$$
$$F_i^{*(H)}\big(s_i^{(H)}\big) = \kappa^{*(H)} s_i^{(H)}. \tag{13}$$
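The backward recursion (7)–(9) and the asymptotic gain (11)–(12) are straightforward to compute numerically. The sketch below is my own code under the convention used here (the gain $\kappa$ carries the negative sign, so the optimal action is $a = \kappa s$); the parameter values are illustrative:

```python
import numpy as np

def lqg_gains(alpha, beta, theta, phi, N):
    """Backward recursion (7)-(9): returns kappa_1, ..., kappa_N and tilde-theta_1."""
    theta_tilde = 0.0                             # tilde-theta_{N+1} = 0, eq. (7)
    kappas = np.zeros(N)
    for i in range(N, 0, -1):                     # i = N, N-1, ..., 1
        kappas[i - 1] = -theta_tilde * alpha * beta / (phi + theta_tilde * beta**2)  # eq. (9)
        theta_tilde = (theta + theta_tilde * alpha**2
                       - theta_tilde**2 * alpha**2 * beta**2
                       / (phi + theta_tilde * beta**2))                               # eq. (8)
    return kappas, theta_tilde

def lqg_gain_asymptotic(alpha, beta, theta, phi):
    """Closed-form fixed point (11) and the time-invariant gain (12)."""
    c = phi - theta * beta**2 - phi * alpha**2
    theta_inf = (np.sqrt(c**2 + 4 * theta * phi * beta**2) - c) / (2 * beta**2)
    return -theta_inf * alpha * beta / (phi + theta_inf * beta**2)

kappas, _ = lqg_gains(alpha=0.9, beta=1.0, theta=1.0, phi=16.0, N=200)
print(kappas[0], lqg_gain_asymptotic(0.9, 1.0, 1.0, 16.0))  # early-step gain ~ asymptotic gain
```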
For the agent identity privacy problem, we assume that the eavesdropper collects a sequence of environment states and carries out a binary hypothesis test on the agent identity. Thus, the privacy risk can be measured by the hypothesis testing performance. In information theory, the Kullback–Leibler divergence measures the “distance” between two probability distributions. When the value of the Kullback–Leibler divergence $D\big(p_{S_{1:N}^{(1)}} \,\|\, p_{S_{1:N}^{(0)}}\big)$ is smaller, the random environment state sequences $S_{1:N}^{(0)}$ and $S_{1:N}^{(1)}$ are statistically “closer” to each other and it is more difficult for the eavesdropper to identify the current agent, i.e., the hypothesis testing performance is poorer and the privacy risk is lower. In this paper, we employ the Kullback–Leibler divergence $D\big(p_{S_{1:N}^{(1)}} \,\|\, p_{S_{1:N}^{(0)}}\big)$ as the privacy risk measure.
Furthermore, we assume that both agents aim to improve their own expected accumulative rewards while only Agent B also seeks to reduce the privacy risk. This assumption makes sense in many scenarios. In the aforementioned autonomous vehicle example, Agent A denotes the human driver and does not need to change the optimal driving style; Agent B denotes the autonomous driving system and can be reconfigured, with respect to the human's optimal driving style, to improve the driving efficiency and to reduce the privacy risk. Under this assumption, Agent A takes the optimal LQG control policy described by (7)–(10) with $H = 0$. In the following, we focus on the privacy-preserving LQG control policy of Agent B. Taking into account the two design objectives of Agent B, we formulate the following optimization problem:
$$F_{1:N}^{\dagger(1)} = \arg\max_{F_{1:N}^{(1)}} \mathbb{E}\left[\sum_{i=1}^{N} R^{(1)}\big(S_i^{(1)}, A_i^{(1)}\big)\right] - \lambda D\big(p_{S_{1:N}^{(1)}} \,\|\, p_{S_{1:N}^{(0)}}\big), \tag{14}$$
where $\lambda \ge 0$ denotes the privacy-preserving design weight, and the random environment state sequence $S_{1:N}^{(0)}$ is induced by the optimal LQG policy $F_{1:N}^{*(0)}$ of Agent A. It follows from the chain rule of the Kullback–Leibler divergence and the Markov property of the state sequences that the privacy risk measure can be further decomposed as
$$D\big(p_{S_{1:N}^{(1)}} \,\|\, p_{S_{1:N}^{(0)}}\big) = D\big(p_{S_1^{(1)}} \,\|\, p_{S_1^{(0)}}\big) + \sum_{i=2}^{N} D\big(p_{S_i^{(1)}|S_{i-1}^{(1)}} \,\|\, p_{S_i^{(0)}|S_{i-1}^{(0)}}\big) = \sum_{i=2}^{N} D\big(p_{S_i^{(1)}|S_{i-1}^{(1)}} \,\|\, p_{S_i^{(0)}|S_{i-1}^{(0)}}\big), \tag{15}$$
where the first term vanishes since the initial state has the same distribution $\mathcal{N}(\mu_1,\sigma_1^2)$ under both hypotheses, as in (4).
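For linear policies, each per-step term in (15) is a divergence between two scalar Gaussians with common variance $\omega^2$, which reduces to the squared mean difference over $2\omega^2$. A small sketch (my own, with illustrative gains under the sign convention above) checks this against the general scalar Gaussian divergence formula:

```python
import numpy as np

def kl_gauss(m1, v1, m0, v0):
    """D( N(m1, v1) || N(m0, v0) ) for scalar Gaussians."""
    return 0.5 * (np.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)

# Conditional next-state laws given S_{i-1} = s when both agents use deterministic
# linear policies a = kappa * s (illustrative gains, not the paper's values):
alpha, beta, omega2 = 0.9, 1.0, 1.0
kappa0, kappa1, s = -0.15, -0.05, 2.0
m0 = (alpha + beta * kappa0) * s
m1 = (alpha + beta * kappa1) * s
step_kl = kl_gauss(m1, omega2, m0, omega2)
print(step_kl, beta**2 * (kappa1 - kappa0) ** 2 * s**2 / (2 * omega2))  # identical values

# The chain rule (15): the trajectory-level divergence is the sum over steps of the
# expected conditional divergences; the i = 1 term vanishes because S_1 has the same
# distribution under both hypotheses.
```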
It is obvious that the optimal privacy-preserving LQG control policy of Agent B depends on the value of $\lambda$. In the following two remarks, the optimal privacy-preserving LQG control policies are characterized for two special cases, $\lambda = 0$ and $\lambda \to \infty$, respectively.
Remark 1.
When $\lambda = 0$, Agent B only aims to maximize the expected accumulative reward $\mathbb{E}\big[\sum_{i=1}^{N} R^{(1)}\big(S_i^{(1)}, A_i^{(1)}\big)\big]$. In this case, the optimal privacy-preserving LQG policy of Agent B reduces to the optimal LQG policy of Agent B, i.e., $F_i^{\dagger(1)}\big(s_i^{(1)}\big) = F_i^{*(1)}\big(s_i^{(1)}\big) = \kappa_i^{*(1)} s_i^{(1)}$ for all $1 \le i \le N$.
Remark 2.
When $\lambda \to \infty$, Agent B only aims to minimize the privacy risk, which is measured by the Kullback–Leibler divergence $D\big(p_{S_{1:N}^{(1)}} \,\|\, p_{S_{1:N}^{(0)}}\big)$. In this case, the optimal privacy-preserving LQG policy of Agent B reduces to the optimal LQG policy of Agent A, i.e., $F_i^{\dagger(1)}\big(s_i^{(1)}\big) = F_i^{*(0)}\big(s_i^{(1)}\big) = \kappa_i^{*(0)} s_i^{(1)}$ for all $1 \le i \le N$, and the minimum privacy risk is achieved, i.e., $D\big(p_{S_{1:N}^{(1)}} \,\|\, p_{S_{1:N}^{(0)}}\big) = 0$.
When $0 < \lambda < \infty$, we characterize the optimal privacy-preserving LQG control policies of Agent B in different forms in the following sections. For ease of reading, we list the parameters and their meanings in Table 2.

4. Deterministic Privacy-Preserving LQG Policy

When the privacy risk is not considered, as shown in (10), the optimal LQG control policy of Agent B is a deterministic linear mapping. In this section, we study the optimal deterministic privacy-preserving LQG policy of Agent B. Accordingly, the policy of Agent B can be specified as: for $1 \le i \le N$,
$$F_i^{(1)}: \mathbb{R} \to \mathbb{R}. \tag{16}$$
In the following theorem, we characterize the optimal deterministic privacy-preserving LQG policy of Agent B.
Theorem 1.
At each step, the optimal deterministic privacy-preserving LQG policy of Agent B with respect to the optimization problem (14) is a linear mapping: for $1 \le i \le N$,
$$\hat\theta_{N+1}^{(1)} = 0, \tag{17}$$
$$\hat\theta_i^{(1)} = J_{N+1-i}\big(\hat\theta_{i+1}^{(1)}\big) = \theta^{(1)} + \hat\theta_{i+1}^{(1)}\alpha^2 + \frac{\lambda}{2\omega^2}\beta^2\big(\kappa_i^{*(0)}\big)^2 - \frac{\Big(\frac{\lambda}{2\omega^2}\beta^2\kappa_i^{*(0)} - \hat\theta_{i+1}^{(1)}\alpha\beta\Big)^2}{\phi^{(1)} + \hat\theta_{i+1}^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2} > 0, \tag{18}$$
$$\kappa_i^{\dagger(1)} = \frac{\frac{\lambda}{2\omega^2}\beta^2\kappa_i^{*(0)} - \hat\theta_{i+1}^{(1)}\alpha\beta}{\phi^{(1)} + \hat\theta_{i+1}^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2}, \tag{19}$$
$$F_i^{\dagger(1)}\big(s_i^{(1)}\big) = \kappa_i^{\dagger(1)} s_i^{(1)}. \tag{20}$$
Then, the maximum achievable weighted design objective of Agent B is
$$\max_{F_{1:N}^{(1)}} \mathbb{E}\left[\sum_{i=1}^{N} R^{(1)}\big(S_i^{(1)}, A_i^{(1)}\big)\right] - \lambda D\big(p_{S_{1:N}^{(1)}} \,\|\, p_{S_{1:N}^{(0)}}\big) = -\hat\theta_1^{(1)}\big(\mu_1^2 + \sigma_1^2\big) - \omega^2 \sum_{i=1}^{N-1} \hat\theta_{i+1}^{(1)}. \tag{21}$$
The proof of Theorem 1 is presented in Appendix A.
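The recursion of Theorem 1 is easy to implement. The sketch below is my own reconstruction (not the authors' code; parameter values are illustrative): it first computes Agent A's gains from (7)–(9) and then runs the backward recursion (17)–(19) for Agent B.

```python
import numpy as np

def privacy_preserving_gains(alpha, beta, theta1, phi1, theta0, phi0, lam, omega2, N):
    """Theorem 1 backward recursion: Agent B's gains kappa_dag_1..N (eq. (19)) and
    hat-theta_1 (used in the objective (21)); Agent A's gains come from (7)-(9)."""
    # Agent A's optimal LQG gains kappa^{*(0)}_i (kappa carries the sign, a = kappa * s).
    kappa0 = np.zeros(N)
    t = 0.0
    for i in range(N, 0, -1):
        kappa0[i - 1] = -t * alpha * beta / (phi0 + t * beta**2)
        t = theta0 + t * alpha**2 - t**2 * alpha**2 * beta**2 / (phi0 + t * beta**2)
    # Agent B's privacy-preserving recursion (17)-(19).
    m = lam * beta**2 / (2.0 * omega2)            # the factor (lambda / 2 omega^2) * beta^2
    theta_hat = 0.0                               # hat-theta_{N+1} = 0, eq. (17)
    kappa_dag = np.zeros(N)
    for i in range(N, 0, -1):
        denom = phi1 + theta_hat * beta**2 + m
        kappa_dag[i - 1] = (m * kappa0[i - 1] - theta_hat * alpha * beta) / denom     # eq. (19)
        theta_hat = (theta1 + theta_hat * alpha**2 + m * kappa0[i - 1] ** 2
                     - (m * kappa0[i - 1] - theta_hat * alpha * beta) ** 2 / denom)   # eq. (18)
    return kappa_dag, kappa0, theta_hat

kd, k0, th1 = privacy_preserving_gains(0.9, 1.0, theta1=8.0, phi1=1.0,
                                       theta0=1.0, phi0=16.0, lam=10.0, omega2=1.0, N=100)
print(kd[0], k0[0])   # as lambda grows, Agent B's gain moves toward Agent A's gain (Remark 4)
```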
Remark 3.
When $\lambda = 0$, it is easy to show that $\kappa_i^{\dagger(1)} = \kappa_i^{*(1)}$ for all $1 \le i \le N$, i.e., the optimal deterministic privacy-preserving LQG policy is consistent with the optimal privacy-preserving LQG policy shown in Remark 1.
Remark 4.
It is easy to show that $\lim_{\lambda \to \infty} \kappa_i^{\dagger(1)} = \kappa_i^{*(0)}$ for all $1 \le i \le N$, i.e., the optimal deterministic privacy-preserving LQG policy is consistent with the optimal privacy-preserving LQG policy shown in Remark 2.
Remark 5.
Although the objective in (14) is a linear combination of the expected accumulative reward and the privacy risk measured by the Kullback–Leibler divergence, the optimal linear coefficient $\kappa_i^{\dagger(1)}$ of the deterministic privacy-preserving LQG control policy of Agent B is a non-linear function of $\kappa_i^{*(1)}$ (the optimal linear coefficient when only the expected accumulative reward is maximized) and $\kappa_i^{*(0)}$ (the optimal linear coefficient when only the privacy risk is minimized).
Remark 6.
When Agent B employs the optimal deterministic privacy-preserving LQG policy at each step, the random state-action sequence is jointly Gaussian distributed.
In the asymptotic regime as N , the optimal LQG control policy is time-invariant. In this case, the design of the optimal policy becomes an easier task. Theorem 2 gives a sufficient condition such that the optimal deterministic privacy-preserving LQG policy of Agent B is time-invariant in the asymptotic regime.
Theorem 2.
When the model parameters satisfy the following inequality
$$\frac{\frac{\lambda}{2\omega^2}\beta^4\big(\kappa^{*(0)}\big)^2\phi^{(1)} + \Big[\phi^{(1)}\alpha^2 + \frac{\lambda}{2\omega^2}\beta^2\big(\alpha + \beta\kappa^{*(0)}\big)^2\Big]\Big(\phi^{(1)} + \frac{\lambda}{2\omega^2}\beta^2\Big)}{\Big(\phi^{(1)} + \frac{\lambda}{2\omega^2}\beta^2\Big)^2} < 1, \tag{22}$$
the optimal deterministic privacy-preserving LQG policy of Agent B is time-invariant in the asymptotic regime. More specifically, $J_N\big(J_{N-1}\big(\cdots J_2\big(J_1\big(\hat\theta_{N+1}^{(1)}\big)\big)\cdots\big)\big)$ converges to the unique fixed point $\hat\theta^{(1)}$ given by
$$\hat\theta^{(1)} = \lim_{N\to\infty} J_N\big(J_{N-1}\big(\cdots J_2\big(J_1\big(\hat\theta_{N+1}^{(1)}\big)\big)\cdots\big)\big) = \theta^{(1)} + \alpha^2\hat\theta^{(1)} + \frac{\lambda}{2\omega^2}\beta^2\big(\kappa^{*(0)}\big)^2 - \frac{\Big(\frac{\lambda}{2\omega^2}\beta^2\kappa^{*(0)} - \alpha\beta\hat\theta^{(1)}\Big)^2}{\phi^{(1)} + \beta^2\hat\theta^{(1)} + \frac{\lambda}{2\omega^2}\beta^2}; \tag{23}$$
and the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B can be described by
$$\kappa^{\dagger(1)} = \frac{\frac{\lambda}{2\omega^2}\beta^2\kappa^{*(0)} - \hat\theta^{(1)}\alpha\beta}{\phi^{(1)} + \hat\theta^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2}, \tag{24}$$
$$F_i^{\dagger(1)}\big(s_i^{(1)}\big) = \kappa^{\dagger(1)} s_i^{(1)}. \tag{25}$$
Under this condition, the asymptotic weighted design objective rate of Agent B achieved by the time-invariant optimal deterministic privacy-preserving LQG policy is
$$\lim_{N\to\infty} \frac{1}{N}\left\{\max_{F_{1:N}^{(1)}} \mathbb{E}\left[\sum_{i=1}^{N} R^{(1)}\big(S_i^{(1)}, A_i^{(1)}\big)\right] - \lambda D\big(p_{S_{1:N}^{(1)}} \,\|\, p_{S_{1:N}^{(0)}}\big)\right\} = -\omega^2\hat\theta^{(1)}. \tag{26}$$
The proof of Theorem 2 is given in Appendix B.
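Under the sufficient condition (22), the fixed point (23) can be computed by simple iteration. The sketch below is my own illustration (not the authors' code; the parameter values are placeholders, and the contraction check follows the form of (22) as reconstructed above):

```python
import numpy as np

def asymptotic_policy(alpha, beta, theta1, phi1, theta0, phi0, lam, omega2,
                      tol=1e-12, max_iter=10_000):
    """Fixed-point iteration for hat-theta^{(1)} in (23), the time-invariant gain (24),
    and the sufficient contraction condition (22)."""
    m = lam * beta**2 / (2.0 * omega2)
    # Agent A's asymptotic quantities from (11)-(12).
    c = phi0 - theta0 * beta**2 - phi0 * alpha**2
    t0 = (np.sqrt(c**2 + 4 * theta0 * phi0 * beta**2) - c) / (2 * beta**2)
    kappa0 = -t0 * alpha * beta / (phi0 + t0 * beta**2)
    # Sufficient condition (22).
    num = (m * beta**2 * kappa0**2 * phi1
           + (phi1 * alpha**2 + m * (alpha + beta * kappa0) ** 2) * (phi1 + m))
    contraction = num / (phi1 + m) ** 2 < 1.0
    # Iterate the map J of (23) until convergence.
    x = 0.0
    for _ in range(max_iter):
        x_new = (theta1 + alpha**2 * x + m * kappa0**2
                 - (m * kappa0 - alpha * beta * x) ** 2 / (phi1 + beta**2 * x + m))
        if abs(x_new - x) < tol:
            break
        x = x_new
    kappa_dag = (m * kappa0 - x * alpha * beta) / (phi1 + x * beta**2 + m)   # eq. (24)
    return kappa_dag, x, contraction

kd, th_inf, ok = asymptotic_policy(0.9, 1.0, theta1=8.0, phi1=1.0,
                                   theta0=1.0, phi0=16.0, lam=5.0, omega2=1.0)
print(ok, kd, -1.0 * th_inf)   # the asymptotic objective rate (26) is -omega^2 * hat-theta
```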

5. Random Privacy-Preserving LQG Policy

As shown in Theorem 1, the optimal deterministic privacy-preserving LQG policy of Agent B is a linear mapping. In this section, we first discuss the optimal random privacy-preserving LQG policy and then consider a particular random policy obtained by extending the deterministic linear mapping to a linear Gaussian random policy for Agent B. Here, the random policy of Agent B can be specified as: for $1 \le i \le N$,
$$F_i^{(1)}: \mathbb{R} \times \mathbb{R} \to \mathbb{R}_{\ge 0}. \tag{27}$$
With a slight abuse of notation, we denote the conditional probability density of taking the action $a_i^{(1)} \in \mathbb{R}$ given the state $s_i^{(1)} \in \mathbb{R}$ under the random policy $F_i^{(1)}$ by $F_i^{(1)}\big(a_i^{(1)} \mid s_i^{(1)}\big) \in \mathbb{R}_{\ge 0}$.
It can be easily shown that the optimal random privacy-preserving LQG policy of Agent B in the final step $F_N^{\dagger(1)}$ reduces to the deterministic linear mapping in (A2). For $1 \le i \le N-1$, it follows from backward dynamic programming that the optimal random privacy-preserving LQG policy of Agent B in the $i$-th step does not reduce to a deterministic linear mapping in general. That is because the conditional probability distribution $p_{S_{i+1}^{(1)}|S_i^{(1)}}$ induced by a random policy $F_i^{(1)}$ is a Gaussian mixture model, and the Kullback–Leibler divergence $D\big(p_{S_{i+1}^{(1)}|S_i^{(1)}} \,\|\, p_{S_{i+1}^{(0)}|S_i^{(0)}}\big)$ between a Gaussian mixture model and a Gaussian distribution generally does not reduce to the quadratic mean of $A_i^{(1)} - \kappa_i^{*(0)} S_i^{(1)}$ as in (A5). To the best of our knowledge, there is no analytically tractable formula for the Kullback–Leibler divergence between Gaussian mixture models, and only approximations are available [37,38,39]. Therefore, we do not give a closed-form solution of the optimal random privacy-preserving LQG policy in this paper.
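In the absence of a closed form, such divergences can be estimated numerically. The sketch below is my own illustration of a plain Monte Carlo estimate of the divergence between a Gaussian mixture and a Gaussian (the mixture weights and means are hypothetical); it is not one of the approximation methods of [37,38,39]:

```python
import numpy as np

def log_gauss(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_mixture(x, weights, means, var):
    # log-density of a Gaussian mixture with a common component variance `var`
    comps = np.stack([np.log(w) + log_gauss(x, m, var) for w, m in zip(weights, means)])
    return np.logaddexp.reduce(comps, axis=0)

def mc_kl_mixture_vs_gauss(weights, means, var, mean0, var0, n=200_000, seed=0):
    """Monte Carlo estimate of D(GMM || Gaussian): sample from the mixture and
    average the log-density ratio."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n, p=weights)
    x = np.asarray(means)[comp] + np.sqrt(var) * rng.standard_normal(n)
    return np.mean(log_mixture(x, weights, means, var) - log_gauss(x, mean0, var0))

# A two-component mixture (e.g., the next-state law induced by a policy randomizing
# between two linear gains) against a single Gaussian reference.
print(mc_kl_mixture_vs_gauss(weights=[0.5, 0.5], means=[-1.0, 1.0], var=1.0,
                             mean0=0.0, var0=1.0))
```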
In what follows, we focus on the linear Gaussian random policy: for $1 \le i \le N$,
$$F_i^{(1)}\big(s_i^{(1)}\big) = \kappa_i^{(1)} s_i^{(1)} + w_i^{(1)}, \tag{28}$$
where $w_i^{(1)}$ is the realization of an independent zero-mean Gaussian random process noise $W_i^{(1)} \sim \mathcal{N}\big(0, \delta_i^2\big)$. Thus, a linear Gaussian random policy $F_i^{(1)}$ can be completely described by the parameters $\big(\kappa_i^{(1)}, \delta_i^2\big)$. Theorem 3 characterizes the optimal linear Gaussian random privacy-preserving LQG policy of Agent B.
Theorem 3.
At each step, the optimal linear Gaussian random privacy-preserving LQG policy of Agent B with respect to the optimization problem (14) is the same deterministic linear mapping as in Theorem 1.
The proof of Theorem 3 is presented in Appendix C.
Remark 7.
Adding an independent zero-mean Gaussian random process noise to the linear mapping of the optimal deterministic privacy-preserving LQG policy cannot improve the performance of Agent B.
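As a quick numeric illustration of Remark 7 (my own sketch with illustrative gains and variances, not the paper's parameters): for fixed linear gains, the per-step conditional Kullback–Leibler divergence and the quadratic action cost both increase with the policy-noise variance $\delta^2$, so the noise only hurts both parts of the objective.

```python
import numpy as np

def kl_gauss(m1, v1, m0, v0):
    return 0.5 * (np.log(v0 / v1) + (v1 + (m1 - m0) ** 2) / v0 - 1.0)

alpha, beta, omega2, s = 0.9, 1.0, 1.0, 2.0
kappa_b, kappa_a, phi1 = -0.10, -0.15, 16.0
for delta2 in [0.0, 0.1, 0.5, 1.0]:
    m1 = (alpha + beta * kappa_b) * s            # mean of S_{i+1} given s under Agent B's noisy policy
    m0 = (alpha + beta * kappa_a) * s            # mean under Agent A's policy
    step_kl = kl_gauss(m1, omega2 + beta**2 * delta2, m0, omega2)
    action_cost = phi1 * (kappa_b**2 * s**2 + delta2)   # phi * E[A^2]; the reward term -phi*E[A^2] drops as delta2 grows
    print(delta2, round(step_kl, 4), round(action_cost, 4))
# Both the divergence and the quadratic action cost grow with delta2.
```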

6. Numerical Experiments

6.1. Convergence of the Sequence $\hat\theta_{N+1}^{(1)}, \hat\theta_{N}^{(1)}, \hat\theta_{N-1}^{(1)}, \ldots$

When the constraint (22) in Theorem 2 is satisfied, we first illustrate the convergence of the sequence $\hat\theta_{N+1}^{(1)}, \hat\theta_{N}^{(1)}, \hat\theta_{N-1}^{(1)}, \ldots$. In addition to the default model parameters in Table 3, we set $\theta^{(1)} = 8$, $\phi^{(1)} = 1$, and let the privacy-preserving design weight $\lambda = 1$, 5, or 10. With these parameters, it can be easily verified that the constraint (22) is satisfied. Figure 2 shows that $\hat\theta_{N+1-k}^{(1)} = J_k\big(J_{k-1}\big(\cdots J_2\big(J_1\big(\hat\theta_{N+1}^{(1)}\big)\big)\cdots\big)\big)$ converges after $k = 20$ iterations for different values of $\lambda$. Furthermore, different convergence patterns can be observed for different values of $\lambda$.

6.2. Impact of the Privacy-Preserving Design Weight λ

Here, we show the impact of the privacy-preserving design weight $\lambda$ on the trade-off between the control reward of Agent B and the privacy risk. We use the same parameters as in Section 6.1, but allow $0 \le \lambda \le 10{,}000$. Then, Theorem 2 is applicable and therefore the optimal deterministic privacy-preserving LQG policy of Agent B is time-invariant in the asymptotic regime. Figure 3 and Figure 4 show that both the asymptotic average control reward $\lim_{N\to\infty}\frac{1}{N}\mathbb{E}\big[\sum_{i=1}^{N}R^{(1)}\big(S_i^{(1)},A_i^{(1)}\big)\big]$ and the asymptotic average privacy risk $\lim_{N\to\infty}\frac{1}{N}D\big(p_{S_{1:N}^{(1)}} \,\|\, p_{S_{1:N}^{(0)}}\big)$ achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B decrease as $\lambda$ increases, i.e., the control reward of Agent B is degraded while the privacy is enhanced. When the privacy risk is not considered, the best control reward of Agent B is achieved at the cost of the highest privacy risk.
In addition to the analytical results, we also present simulation results with privacy considered in Figure 3 and Figure 4. Given $0 \le \lambda \le 10{,}000$, we employ the corresponding time-invariant optimal deterministic privacy-preserving LQG policy of Agent B and run the 10,000-step privacy-preserving LQG control with 100 randomly generated initial states. Then, the average control reward and the average privacy risk are evaluated and compared with the analytical asymptotic average control reward and asymptotic average privacy risk, respectively. As shown in Figure 3 and Figure 4, the simulation results match the analytical results quite well, which validates our analysis.
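The simulation procedure of this subsection can be sketched as follows (my own illustration; the gains and parameters below are placeholders rather than the values of Table 3): run the closed loop under a time-invariant gain for Agent B, accumulate the reward (3), and accumulate the per-step conditional divergences from (15).

```python
import numpy as np

def simulate(kappa_b, kappa_a, alpha, beta, theta1, phi1, omega2, mu1, sigma1,
             steps=10_000, runs=100, seed=0):
    """Empirical average reward of Agent B and average privacy risk (per-step
    conditional KL accumulated along the trajectory, divided by the horizon)."""
    rng = np.random.default_rng(seed)
    avg_reward, avg_risk = 0.0, 0.0
    for _ in range(runs):
        s = mu1 + sigma1 * rng.standard_normal()
        reward, risk = 0.0, 0.0
        for _ in range(steps):
            a = kappa_b * s
            reward += -theta1 * s**2 - phi1 * a**2
            # KL between N((alpha+beta*kappa_b)s, omega^2) and N((alpha+beta*kappa_a)s, omega^2)
            risk += beta**2 * (kappa_b - kappa_a) ** 2 * s**2 / (2 * omega2)
            s = alpha * s + beta * a + np.sqrt(omega2) * rng.standard_normal()
        avg_reward += reward / steps / runs
        avg_risk += risk / steps / runs
    return avg_reward, avg_risk

print(simulate(kappa_b=-0.10, kappa_a=-0.15, alpha=0.9, beta=1.0,
               theta1=8.0, phi1=1.0, omega2=1.0, mu1=0.0, sigma1=1.0,
               steps=2_000, runs=10))
```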

6.3. Impact of Parameter $\theta^{(1)}$

Here, we study the impact of the parameter $\theta^{(1)}$ on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we set $\phi^{(1)} = \phi^{(0)} = 16$ and allow $0.01 \le \theta^{(1)} \le 8$ and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000. It can be verified that Theorem 2 holds for these model parameters. For all $0.01 \le \theta^{(1)} \le 8$ and increasing values of $\lambda$, Figure 5 and Figure 6 show a trade-off between the control reward of Agent B and the privacy risk, which is consistent with the previous observations. For $\lambda = 0$, 10, 100, 1000, or 10,000, Figure 5 shows that the asymptotic average control reward of Agent B decreases as $\theta^{(1)}$ increases. This is reasonable since $\theta^{(1)}$ is a quadratic coefficient in the instantaneous reward function $R^{(1)}$. For $\lambda = 0$, 10, 100, 1000, or 10,000, Figure 6 shows that the asymptotic average privacy risk first decreases, then increases, and achieves its minimum value of 0 at $\theta^{(1)} = \theta^{(0)} = 1$. When $\theta^{(1)} = \theta^{(0)} = 1$, both agents have the same instantaneous reward function and employ the same optimal LQG control policy, which leads to the same state sequence distribution under both hypotheses and hence a Kullback–Leibler divergence of 0. As $\theta^{(1)}$ deviates from $\theta^{(0)}$, the instantaneous reward functions of the two agents differ more, which leads to more different state sequence distributions under the two hypotheses and a larger Kullback–Leibler divergence.

6.4. Impact of Parameter $\phi^{(1)}$

Here, we show the impact of the parameter $\phi^{(1)}$ on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we set $\theta^{(1)} = \theta^{(0)} = 1$ and allow $0.01 \le \phi^{(1)} \le 40$ and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000. It can be verified that Theorem 2 holds for these model parameters. For all $0.01 \le \phi^{(1)} \le 40$ and increasing values of $\lambda$, Figure 7 and Figure 8 also show a trade-off between the control reward of Agent B and the privacy risk. For $\lambda = 0$, 10, 100, 1000, or 10,000, Figure 7 shows that the asymptotic average control reward of Agent B decreases as $\phi^{(1)}$ increases. This is because $\phi^{(1)}$ is the other quadratic coefficient in the instantaneous reward function $R^{(1)}$. For $\lambda = 0$, 10, 100, 1000, or 10,000, Figure 8 shows that the asymptotic average privacy risk follows a similar pattern: it first decreases, then increases, and achieves its minimum value of 0 at $\phi^{(1)} = \phi^{(0)} = 16$. This pattern can be explained in the same way as in Section 6.3.

6.5. Impact of Parameter $\theta^{(0)}$

By fixing $\theta^{(1)} = 1$ and $\phi^{(1)} = \phi^{(0)} = 16$, we study the impact of the parameter $\theta^{(0)}$ on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we allow $0.01 \le \theta^{(0)} \le 8$ and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000. It can be verified that Theorem 2 holds for these model parameters. For all $0.01 \le \theta^{(0)} \le 8$ and increasing values of $\lambda$, Figure 9 and Figure 10 show a trade-off between the control reward of Agent B and the privacy risk. For $\lambda = 0$, 10, 100, 1000, or 10,000, Figure 9 and Figure 10 show that the asymptotic average control reward of Agent B achieves its maximum value while the asymptotic average privacy risk achieves its minimum value of 0 at $\theta^{(1)} = \theta^{(0)} = 1$. In this case, both agents have the same instantaneous reward function and employ the same optimal LQG control policy, which maximizes their control rewards, leads to the same state sequence distribution under both hypotheses, and therefore achieves a Kullback–Leibler divergence of 0.

6.6. Impact of Parameter $\phi^{(0)}$

By fixing $\phi^{(1)} = 16$ and $\theta^{(1)} = \theta^{(0)} = 1$, we study the impact of the parameter $\phi^{(0)}$ on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we allow $0.01 \le \phi^{(0)} \le 40$ and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000. From Figure 11 and Figure 12, we have similar observations on the impact of $\phi^{(0)}$ as in Section 6.5, and they can be explained in the same way.

7. Conclusions

In this paper, we consider the agent identity privacy problem in the scalar LQG control. We model this novel privacy problem as an adversarial binary hypothesis test and employ the Kullback–Leibler divergence to measure the privacy risk. We then formulate a novel privacy-preserving LQG control optimization by taking into account both the accumulative control reward of Agent B and the privacy risk. We prove that the optimal deterministic privacy-preserving LQG control policy of Agent B is a linear mapping, which is consistent with the standard LQG. We further show that the random policy formed by adding an independent Gaussian random process noise to the optimal deterministic privacy-preserving LQG policy cannot improve the performance. We also give a sufficient condition that guarantees that the optimal deterministic privacy-preserving LQG policy is time-invariant in the asymptotic regime.
This research can be extended in several directions. Studying the general random policy of Agent B is an interesting extension. The theoretical study can also be extended to develop privacy-preserving reinforcement learning algorithms. The problem can further be formulated as a non-cooperative game of multiple agents with conflicting objectives, where some agents only aim to optimize their own accumulative control rewards while the other agents consider the agent identity privacy risk in addition to their own accumulative control rewards.

Author Contributions

Conceptualization, E.F. and Z.L.; methodology, E.F., Y.T. and Z.L.; validation, E.F., Y.T. and C.S.; formal analysis, E.F., Y.T. and Z.L.; experiment, C.S.; writing—original draft preparation, E.F. and Y.T.; writing—review and editing, Z.L. and C.W.; supervision, Z.L. and C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China (62006173, 62171322) and the 2021-2023 China-Serbia Inter-Governmental S&T Cooperation Project (No. 6). We are also grateful for the support of the Sino-German Center of Intelligent Systems, Tongji University.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Proof of Theorem 1.
The proof is based on the backward dynamic programming.
We first consider the sub-problem of the final step. Given a probability distribution $p_{S_N^{(1)}}$, the final-step optimization problem of the deterministic control policy $F_N^{(1)}$ is
$$F_N^{\dagger(1)} = \arg\max_{F_N^{(1)}} \mathbb{E}\Big[R^{(1)}\big(S_N^{(1)},A_N^{(1)}\big)\Big] = \arg\max_{F_N^{(1)}} \Big\{-\theta^{(1)}\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(F_N^{(1)}(S_N^{(1)})\big)^2\big]\Big\}. \tag{A1}$$
Since $p_{S_N^{(1)}}$ is given, the first term $-\theta^{(1)}\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big]$ is fixed. Note that the second term is upper-bounded as $-\phi^{(1)}\mathbb{E}\big[\big(F_N^{(1)}(S_N^{(1)})\big)^2\big] \le 0$. The upper bound can be achieved by the optimal deterministic privacy-preserving LQG policy:
$$F_N^{\dagger(1)}\big(s_N^{(1)}\big) = \kappa_N^{\dagger(1)} s_N^{(1)} = 0, \tag{A2}$$
where
$$\kappa_N^{\dagger(1)} = \frac{\frac{\lambda}{2\omega^2}\beta^2\kappa_N^{*(0)} - \hat\theta_{N+1}^{(1)}\alpha\beta}{\phi^{(1)} + \hat\theta_{N+1}^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2} = 0 = \kappa_N^{*(1)}. \tag{A3}$$
Then, the maximum achievable objective of the final step is
$$\max_{F_N^{(1)}} \mathbb{E}\Big[R^{(1)}\big(S_N^{(1)},A_N^{(1)}\big)\Big] = -\theta^{(1)}\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big] = -\hat\theta_N^{(1)}\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big]. \tag{A4}$$
We then consider the sub-problem from the $(N-1)$-th step until the final step. Given a probability distribution $p_{S_{N-1}^{(1)}}$ and the optimal deterministic privacy-preserving LQG policy in the final step $F_N^{\dagger(1)}$, the sub-optimization problem of the deterministic control policy $F_{N-1}^{(1)}$ is
$$
\begin{aligned}
F_{N-1}^{\dagger(1)} &= \arg\max_{F_{N-1}^{(1)}} \mathbb{E}\left[\sum_{i=N-1}^{N} R^{(1)}\big(S_i^{(1)},A_i^{(1)}\big)\right] - \lambda D\big(p_{S_N^{(1)}|S_{N-1}^{(1)}} \,\big\|\, p_{S_N^{(0)}|S_{N-1}^{(0)}}\big) \\
&= \arg\max_{F_{N-1}^{(1)}} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(A_{N-1}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big] - \lambda D\big(p_{S_N^{(1)}|S_{N-1}^{(1)}} \,\big\|\, p_{S_N^{(0)}|S_{N-1}^{(0)}}\big) \\
&= \arg\max_{F_{N-1}^{(1)}} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(F_{N-1}^{(1)}(S_{N-1}^{(1)})\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big] \\
&\qquad - \lambda \mathbb{E}\left[\log\frac{\frac{1}{\sqrt{2\pi\omega^2}}\exp\Big(-\frac{\big(S_N^{(1)}-\alpha S_{N-1}^{(1)}-\beta F_{N-1}^{(1)}(S_{N-1}^{(1)})\big)^2}{2\omega^2}\Big)}{\frac{1}{\sqrt{2\pi\omega^2}}\exp\Big(-\frac{\big(S_N^{(1)}-\alpha S_{N-1}^{(1)}-\beta \kappa_{N-1}^{*(0)} S_{N-1}^{(1)}\big)^2}{2\omega^2}\Big)}\right] \\
&= \arg\max_{F_{N-1}^{(1)}} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(F_{N-1}^{(1)}(S_{N-1}^{(1)})\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[\big(\alpha S_{N-1}^{(1)}+\beta F_{N-1}^{(1)}(S_{N-1}^{(1)})+Z_{N-1}\big)^2\big] \\
&\qquad - \frac{\lambda}{2\omega^2}\beta^2\,\mathbb{E}\big[\big(F_{N-1}^{(1)}(S_{N-1}^{(1)})-\kappa_{N-1}^{*(0)}S_{N-1}^{(1)}\big)^2\big] \\
&= \arg\max_{F_{N-1}^{(1)}} -\Big(\theta^{(1)}+\hat\theta_N^{(1)}\alpha^2+\frac{\lambda}{2\omega^2}\beta^2\big(\kappa_{N-1}^{*(0)}\big)^2\Big)\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[Z_{N-1}^2\big] \\
&\qquad - \Big(\phi^{(1)}+\hat\theta_N^{(1)}\beta^2+\frac{\lambda}{2\omega^2}\beta^2\Big)\mathbb{E}\big[\big(F_{N-1}^{(1)}(S_{N-1}^{(1)})\big)^2\big] - \Big(2\hat\theta_N^{(1)}\alpha\beta-\frac{\lambda}{\omega^2}\beta^2\kappa_{N-1}^{*(0)}\Big)\mathbb{E}\big[S_{N-1}^{(1)}F_{N-1}^{(1)}(S_{N-1}^{(1)})\big] \\
&= \arg\max_{F_{N-1}^{(1)}} \int_{\mathbb{R}} \bigg[-\Big(\phi^{(1)}+\hat\theta_N^{(1)}\beta^2+\frac{\lambda}{2\omega^2}\beta^2\Big)\big(F_{N-1}^{(1)}(s_{N-1}^{(1)})\big)^2 - \Big(2\hat\theta_N^{(1)}\alpha\beta-\frac{\lambda}{\omega^2}\beta^2\kappa_{N-1}^{*(0)}\Big)s_{N-1}^{(1)}F_{N-1}^{(1)}(s_{N-1}^{(1)})\bigg] p_{S_{N-1}^{(1)}}\big(s_{N-1}^{(1)}\big)\,\mathrm{d}s_{N-1}^{(1)} \\
&= \int_{\mathbb{R}} \arg\max_{F_{N-1}^{(1)}(s_{N-1}^{(1)})} \bigg[-\Big(\phi^{(1)}+\hat\theta_N^{(1)}\beta^2+\frac{\lambda}{2\omega^2}\beta^2\Big)\big(F_{N-1}^{(1)}(s_{N-1}^{(1)})\big)^2 - \Big(2\hat\theta_N^{(1)}\alpha\beta-\frac{\lambda}{\omega^2}\beta^2\kappa_{N-1}^{*(0)}\Big)s_{N-1}^{(1)}F_{N-1}^{(1)}(s_{N-1}^{(1)})\bigg] p_{S_{N-1}^{(1)}}\big(s_{N-1}^{(1)}\big)\,\mathrm{d}s_{N-1}^{(1)}.
\end{aligned} \tag{A5}
$$
Since $\hat\theta_N^{(1)} = \theta^{(1)} > 0$, it follows that
$$-\Big(\phi^{(1)} + \hat\theta_N^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2\Big) < 0. \tag{A6}$$
Given any $s_{N-1}^{(1)} \in \mathbb{R}$, the objective of the inner optimization in (A5) is a concave quadratic function of $F_{N-1}^{(1)}\big(s_{N-1}^{(1)}\big)$. Therefore, we can obtain the optimal deterministic privacy-preserving LQG policy as
$$F_{N-1}^{\dagger(1)}\big(s_{N-1}^{(1)}\big) = \kappa_{N-1}^{\dagger(1)} s_{N-1}^{(1)}, \tag{A7}$$
where
$$\kappa_{N-1}^{\dagger(1)} = \frac{\frac{\lambda}{2\omega^2}\beta^2\kappa_{N-1}^{*(0)} - \hat\theta_N^{(1)}\alpha\beta}{\phi^{(1)} + \hat\theta_N^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2}. \tag{A8}$$
By using the optimal deterministic policies $F_{N-1:N}^{\dagger(1)}$, the maximum achievable objective of the sub-problem is
$$\max_{F_{N-1:N}^{(1)}} \mathbb{E}\left[\sum_{i=N-1}^{N} R^{(1)}\big(S_i^{(1)},A_i^{(1)}\big)\right] - \lambda D\big(p_{S_N^{(1)}|S_{N-1}^{(1)}} \,\big\|\, p_{S_N^{(0)}|S_{N-1}^{(0)}}\big) = -\hat\theta_{N-1}^{(1)}\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[Z_{N-1}^2\big]. \tag{A9}$$
The coefficient $\hat\theta_{N-1}^{(1)}$ can be specified as
$$\hat\theta_{N-1}^{(1)} = \theta^{(1)} + \frac{\phi^{(1)}\hat\theta_N^{(1)}\alpha^2 + \frac{\lambda}{2\omega^2}\beta^2\big(\kappa_{N-1}^{*(0)}\big)^2\phi^{(1)} + \frac{\lambda}{2\omega^2}\beta^2\hat\theta_N^{(1)}\big(\alpha + \beta\kappa_{N-1}^{*(0)}\big)^2}{\phi^{(1)} + \hat\theta_N^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2}. \tag{A10}$$
It can be easily justified that $\hat\theta_{N-1}^{(1)} > 0$ since $\hat\theta_N^{(1)} > 0$.
We now consider the sub-problem from the $(N-2)$-th step until the final step. Given a probability distribution $p_{S_{N-2}^{(1)}}$ and the optimal deterministic privacy-preserving LQG policies $F_{N-1:N}^{\dagger(1)}$, the sub-optimization problem of the deterministic control policy $F_{N-2}^{(1)}$ is
$$
\begin{aligned}
F_{N-2}^{\dagger(1)} &= \arg\max_{F_{N-2}^{(1)}} \mathbb{E}\left[\sum_{i=N-2}^{N} R^{(1)}\big(S_i^{(1)},A_i^{(1)}\big)\right] - \lambda \sum_{i=N-1}^{N} D\big(p_{S_i^{(1)}|S_{i-1}^{(1)}} \,\big\|\, p_{S_i^{(0)}|S_{i-1}^{(0)}}\big) \\
&= \arg\max_{F_{N-2}^{(1)}} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-2}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(A_{N-2}^{(1)}\big)^2\big] - \hat\theta_{N-1}^{(1)}\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[Z_{N-1}^2\big] - \lambda D\big(p_{S_{N-1}^{(1)}|S_{N-2}^{(1)}} \,\big\|\, p_{S_{N-1}^{(0)}|S_{N-2}^{(0)}}\big) \\
&= \arg\max_{F_{N-2}^{(1)}} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-2}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(F_{N-2}^{(1)}(S_{N-2}^{(1)})\big)^2\big] - \hat\theta_{N-1}^{(1)}\mathbb{E}\big[\big(\alpha S_{N-2}^{(1)}+\beta F_{N-2}^{(1)}(S_{N-2}^{(1)})+Z_{N-2}\big)^2\big] \\
&\qquad - \frac{\lambda}{2\omega^2}\beta^2\,\mathbb{E}\big[\big(F_{N-2}^{(1)}(S_{N-2}^{(1)})-\kappa_{N-2}^{*(0)}S_{N-2}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[Z_{N-1}^2\big] \\
&= \int_{\mathbb{R}} \arg\max_{F_{N-2}^{(1)}(s_{N-2}^{(1)})} \bigg[-\Big(\phi^{(1)}+\hat\theta_{N-1}^{(1)}\beta^2+\frac{\lambda}{2\omega^2}\beta^2\Big)\big(F_{N-2}^{(1)}(s_{N-2}^{(1)})\big)^2 - \Big(2\hat\theta_{N-1}^{(1)}\alpha\beta-\frac{\lambda}{\omega^2}\beta^2\kappa_{N-2}^{*(0)}\Big)s_{N-2}^{(1)}F_{N-2}^{(1)}(s_{N-2}^{(1)})\bigg] p_{S_{N-2}^{(1)}}\big(s_{N-2}^{(1)}\big)\,\mathrm{d}s_{N-2}^{(1)}.
\end{aligned} \tag{A11}
$$
Note that the objective functions in (A5) and (A11) have the same form. We have also proved that $\hat\theta_{N-1}^{(1)} > 0$. Therefore, we can use the same arguments to obtain the optimal deterministic privacy-preserving LQG policy as
$$F_{N-2}^{\dagger(1)}\big(s_{N-2}^{(1)}\big) = \kappa_{N-2}^{\dagger(1)} s_{N-2}^{(1)}, \tag{A12}$$
where
$$\kappa_{N-2}^{\dagger(1)} = \frac{\frac{\lambda}{2\omega^2}\beta^2\kappa_{N-2}^{*(0)} - \hat\theta_{N-1}^{(1)}\alpha\beta}{\phi^{(1)} + \hat\theta_{N-1}^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2}, \tag{A13}$$
the maximum achievable objective of the sub-problem as
$$\max_{F_{N-2:N}^{(1)}} \mathbb{E}\left[\sum_{i=N-2}^{N} R^{(1)}\big(S_i^{(1)},A_i^{(1)}\big)\right] - \lambda \sum_{i=N-1}^{N} D\big(p_{S_i^{(1)}|S_{i-1}^{(1)}} \,\big\|\, p_{S_i^{(0)}|S_{i-1}^{(0)}}\big) = -\hat\theta_{N-2}^{(1)}\mathbb{E}\big[\big(S_{N-2}^{(1)}\big)^2\big] - \hat\theta_{N-1}^{(1)}\mathbb{E}\big[Z_{N-2}^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[Z_{N-1}^2\big], \tag{A14}$$
and $\hat\theta_{N-2}^{(1)} > 0$.
We can further prove the optimal deterministic privacy-preserving LQG policies in the remaining steps and the maximum achievable weighted design objective of Agent B in Theorem 1 using the same arguments. □

Appendix B

Proof of Theorem 2.
The proof is based on the optimal deterministic privacy-preserving LQG policy of Agent B in Theorem 1.
For all $1 \le i \le \frac{N}{2}$ and $x \ge 0$, let
$$J(x) = \lim_{N\to\infty} J_{N+1-i}(x) = \theta^{(1)} + \alpha^2 x + \frac{\lambda}{2\omega^2}\beta^2\big(\kappa^{*(0)}\big)^2 - \frac{\Big(\frac{\lambda}{2\omega^2}\beta^2\kappa^{*(0)} - \alpha\beta x\Big)^2}{\phi^{(1)} + \beta^2 x + \frac{\lambda}{2\omega^2}\beta^2}, \tag{A15}$$
where the second equality follows from
$$\lim_{N\to\infty}\kappa_i^{*(0)} = \lim_{N\to\infty}\frac{-\tilde\theta_{i+1}^{(0)}\alpha\beta}{\phi^{(0)}+\tilde\theta_{i+1}^{(0)}\beta^2} = \frac{-\tilde\theta^{(0)}\alpha\beta}{\phi^{(0)}+\tilde\theta^{(0)}\beta^2} = \kappa^{*(0)}, \tag{A16}$$
since $\tilde\theta_{i+1}^{(0)}$ is obtained by applying $L^{(0)}$ to $\tilde\theta_{N+1}^{(0)}$ for $N-i$ iterations and $N-i \to \infty$.
When the model parameters satisfy the condition in (22), $J(x)$ is a contraction mapping, i.e., there exists $0 < \gamma < 1$ such that $|J(x) - J(x')| \le \gamma |x - x'|$ for all $x \ge 0$ and $x' \ge 0$. From Banach's fixed point theorem, there is a unique fixed point $\hat\theta^{(1)}$ of the contraction mapping $J$ such that
$$
\begin{aligned}
\hat\theta^{(1)} &= \lim_{N\to\infty} J_N\big(J_{N-1}\big(\cdots J_2\big(J_1\big(\hat\theta_{N+1}^{(1)}\big)\big)\cdots\big)\big) = \lim_{N\to\infty} J_N\Big(J_{N-1}\Big(\cdots J_{N+2-\frac{N}{2}}\Big(J_{N+1-\frac{N}{2}}\big(\hat\theta_{\frac{N}{2}+1}^{(1)}\big)\Big)\cdots\Big)\Big) \\
&= \lim_{N\to\infty} \underbrace{J\big(J\big(\cdots J\big(J}_{\frac{N}{2}\ \text{iterations}}\big(\hat\theta_{\frac{N}{2}+1}^{(1)}\big)\big)\cdots\big)\big) = \theta^{(1)} + \alpha^2\hat\theta^{(1)} + \frac{\lambda}{2\omega^2}\beta^2\big(\kappa^{*(0)}\big)^2 - \frac{\Big(\frac{\lambda}{2\omega^2}\beta^2\kappa^{*(0)} - \alpha\beta\hat\theta^{(1)}\Big)^2}{\phi^{(1)} + \beta^2\hat\theta^{(1)} + \frac{\lambda}{2\omega^2}\beta^2}.
\end{aligned} \tag{A17}
$$
From (19)–(21), (A16), and (A17), it is easy to verify the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B in (24)–(25) and the asymptotic weighted design objective rate in (26). □

Appendix C

Proof of Theorem 3.
The proof is similar as that of Theorem 1.
We first consider the sub-problem of the final step. Given a probability distribution $p_{S_N^{(1)}}$, the final-step optimization problem of the linear Gaussian random policy $F_N^{(1)}$ with parameters $\kappa_N^{(1)}$ and $\delta_N^2$ is
$$
\begin{aligned}
\big(\kappa_N^{\dagger(1)},\delta_N^{\dagger 2}\big) &= \arg\max_{\kappa_N^{(1)}\in\mathbb{R},\,\delta_N^2\in\mathbb{R}_{\ge 0}} \mathbb{E}\Big[R^{(1)}\big(S_N^{(1)},A_N^{(1)}\big)\Big] = \arg\max_{\kappa_N^{(1)}\in\mathbb{R},\,\delta_N^2\in\mathbb{R}_{\ge 0}} -\theta^{(1)}\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(A_N^{(1)}\big)^2\big] \\
&= \arg\max_{\kappa_N^{(1)}\in\mathbb{R},\,\delta_N^2\in\mathbb{R}_{\ge 0}} -\Big(\theta^{(1)} + \phi^{(1)}\big(\kappa_N^{(1)}\big)^2\Big)\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big] - \phi^{(1)}\delta_N^2.
\end{aligned} \tag{A18}
$$
It is obvious that the optimal parameters are $\kappa_N^{\dagger(1)} = 0$ and $\delta_N^{\dagger 2} = 0$, i.e., the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy in the final step.
Similarly, we then consider the sub-problem from the $(N-1)$-th step until the final step. Given a probability distribution $p_{S_{N-1}^{(1)}}$ and the optimal linear Gaussian random policy in the final step $F_N^{\dagger(1)}$, the sub-optimization problem of the linear Gaussian random policy $F_{N-1}^{(1)}$ with parameters $\kappa_{N-1}^{(1)}$ and $\delta_{N-1}^2$ is
$$
\begin{aligned}
\big(\kappa_{N-1}^{\dagger(1)},\delta_{N-1}^{\dagger 2}\big) &= \arg\max_{\kappa_{N-1}^{(1)}\in\mathbb{R},\,\delta_{N-1}^2\in\mathbb{R}_{\ge 0}} \mathbb{E}\left[\sum_{i=N-1}^{N} R^{(1)}\big(S_i^{(1)},A_i^{(1)}\big)\right] - \lambda D\big(p_{S_N^{(1)}|S_{N-1}^{(1)}} \,\big\|\, p_{S_N^{(0)}|S_{N-1}^{(0)}}\big) \\
&= \arg\max_{\kappa_{N-1}^{(1)},\,\delta_{N-1}^2} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(A_{N-1}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big] - \lambda D\big(p_{S_N^{(1)}|S_{N-1}^{(1)}} \,\big\|\, p_{S_N^{(0)}|S_{N-1}^{(0)}}\big) \\
&= \arg\max_{\kappa_{N-1}^{(1)},\,\delta_{N-1}^2} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(A_{N-1}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[\big(S_N^{(1)}\big)^2\big] \\
&\qquad - \lambda \mathbb{E}\left[\log\frac{\frac{1}{\sqrt{2\pi(\omega^2+\beta^2\delta_{N-1}^2)}}\exp\Big(-\frac{\big(S_N^{(1)}-\alpha S_{N-1}^{(1)}-\beta\kappa_{N-1}^{(1)}S_{N-1}^{(1)}\big)^2}{2(\omega^2+\beta^2\delta_{N-1}^2)}\Big)}{\frac{1}{\sqrt{2\pi\omega^2}}\exp\Big(-\frac{\big(S_N^{(1)}-\alpha S_{N-1}^{(1)}-\beta\kappa_{N-1}^{*(0)}S_{N-1}^{(1)}\big)^2}{2\omega^2}\Big)}\right] \\
&= \arg\max_{\kappa_{N-1}^{(1)},\,\delta_{N-1}^2} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(\kappa_{N-1}^{(1)}S_{N-1}^{(1)}+W_{N-1}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\mathbb{E}\big[\big(\alpha S_{N-1}^{(1)}+\beta\kappa_{N-1}^{(1)}S_{N-1}^{(1)}+\beta W_{N-1}^{(1)}+Z_{N-1}\big)^2\big] \\
&\qquad - \frac{\lambda}{2\omega^2}\beta^2\big(\kappa_{N-1}^{(1)}-\kappa_{N-1}^{*(0)}\big)^2\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \frac{\lambda}{2\omega^2}\beta^2\delta_{N-1}^2 - \frac{\lambda}{2}\log\frac{\omega^2}{\omega^2+\beta^2\delta_{N-1}^2} \\
&= \arg\max_{\kappa_{N-1}^{(1)},\,\delta_{N-1}^2} -\Big(\phi^{(1)}+\hat\theta_N^{(1)}\beta^2+\frac{\lambda}{2\omega^2}\beta^2\Big)\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big]\big(\kappa_{N-1}^{(1)}\big)^2 - \Big(2\hat\theta_N^{(1)}\alpha\beta-\frac{\lambda}{\omega^2}\beta^2\kappa_{N-1}^{*(0)}\Big)\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big]\kappa_{N-1}^{(1)} \\
&\qquad - \Big(\theta^{(1)}+\hat\theta_N^{(1)}\alpha^2+\frac{\lambda}{2\omega^2}\beta^2\big(\kappa_{N-1}^{*(0)}\big)^2\Big)\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\omega^2 - \Big(\phi^{(1)}+\hat\theta_N^{(1)}\beta^2+\frac{\lambda}{2\omega^2}\beta^2\Big)\delta_{N-1}^2 - \frac{\lambda}{2}\log\frac{\omega^2}{\omega^2+\beta^2\delta_{N-1}^2}.
\end{aligned} \tag{A19}
$$
(A19) consists of two independent optimizations: the optimization over $\kappa_{N-1}^{(1)} \in \mathbb{R}$ and the optimization over $\delta_{N-1}^2 \in \mathbb{R}_{\ge 0}$. Since $\hat\theta_N^{(1)} > 0$, it follows that
$$-\Big(\phi^{(1)} + \hat\theta_N^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2\Big)\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] < 0. \tag{A20}$$
The optimization over $\kappa_{N-1}^{(1)} \in \mathbb{R}$ has a concave quadratic objective. Then we can obtain the optimal linear coefficient as
$$\kappa_{N-1}^{\dagger(1)} = \frac{\frac{\lambda}{2\omega^2}\beta^2\kappa_{N-1}^{*(0)} - \hat\theta_N^{(1)}\alpha\beta}{\phi^{(1)} + \hat\theta_N^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2}. \tag{A21}$$
The objective of the optimization over $\delta_{N-1}^2 \in \mathbb{R}_{\ge 0}$ is decreasing in $\delta_{N-1}^2$. Then, the optimal variance is $\delta_{N-1}^{\dagger 2} = 0$. Therefore, the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy in the $(N-1)$-th step.
We then consider the sub-problem from the $(N-2)$-th step until the final step. Given a probability distribution $p_{S_{N-2}^{(1)}}$ and the optimal linear Gaussian random policies $F_{N-1:N}^{\dagger(1)}$, the sub-optimization problem of the linear Gaussian random policy $F_{N-2}^{(1)}$ with parameters $\kappa_{N-2}^{(1)}$ and $\delta_{N-2}^2$ is
$$
\begin{aligned}
\big(\kappa_{N-2}^{\dagger(1)},\delta_{N-2}^{\dagger 2}\big) &= \arg\max_{\kappa_{N-2}^{(1)}\in\mathbb{R},\,\delta_{N-2}^2\in\mathbb{R}_{\ge 0}} \mathbb{E}\left[\sum_{i=N-2}^{N} R^{(1)}\big(S_i^{(1)},A_i^{(1)}\big)\right] - \lambda \sum_{i=N-1}^{N} D\big(p_{S_i^{(1)}|S_{i-1}^{(1)}} \,\big\|\, p_{S_i^{(0)}|S_{i-1}^{(0)}}\big) \\
&= \arg\max_{\kappa_{N-2}^{(1)},\,\delta_{N-2}^2} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-2}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(A_{N-2}^{(1)}\big)^2\big] - \hat\theta_{N-1}^{(1)}\mathbb{E}\big[\big(S_{N-1}^{(1)}\big)^2\big] - \hat\theta_N^{(1)}\omega^2 - \lambda D\big(p_{S_{N-1}^{(1)}|S_{N-2}^{(1)}} \,\big\|\, p_{S_{N-1}^{(0)}|S_{N-2}^{(0)}}\big) \\
&= \arg\max_{\kappa_{N-2}^{(1)},\,\delta_{N-2}^2} -\theta^{(1)}\mathbb{E}\big[\big(S_{N-2}^{(1)}\big)^2\big] - \phi^{(1)}\mathbb{E}\big[\big(\kappa_{N-2}^{(1)}S_{N-2}^{(1)}+W_{N-2}^{(1)}\big)^2\big] - \hat\theta_{N-1}^{(1)}\mathbb{E}\big[\big(\alpha S_{N-2}^{(1)}+\beta\kappa_{N-2}^{(1)}S_{N-2}^{(1)}+\beta W_{N-2}^{(1)}+Z_{N-2}\big)^2\big] - \hat\theta_N^{(1)}\omega^2 \\
&\qquad - \frac{\lambda}{2\omega^2}\beta^2\big(\kappa_{N-2}^{(1)}-\kappa_{N-2}^{*(0)}\big)^2\mathbb{E}\big[\big(S_{N-2}^{(1)}\big)^2\big] - \frac{\lambda}{2\omega^2}\beta^2\delta_{N-2}^2 - \frac{\lambda}{2}\log\frac{\omega^2}{\omega^2+\beta^2\delta_{N-2}^2} \\
&= \arg\max_{\kappa_{N-2}^{(1)},\,\delta_{N-2}^2} -\Big(\phi^{(1)}+\hat\theta_{N-1}^{(1)}\beta^2+\frac{\lambda}{2\omega^2}\beta^2\Big)\mathbb{E}\big[\big(S_{N-2}^{(1)}\big)^2\big]\big(\kappa_{N-2}^{(1)}\big)^2 - \Big(2\hat\theta_{N-1}^{(1)}\alpha\beta-\frac{\lambda}{\omega^2}\beta^2\kappa_{N-2}^{*(0)}\Big)\mathbb{E}\big[\big(S_{N-2}^{(1)}\big)^2\big]\kappa_{N-2}^{(1)} \\
&\qquad - \Big(\theta^{(1)}+\hat\theta_{N-1}^{(1)}\alpha^2+\frac{\lambda}{2\omega^2}\beta^2\big(\kappa_{N-2}^{*(0)}\big)^2\Big)\mathbb{E}\big[\big(S_{N-2}^{(1)}\big)^2\big] - \hat\theta_{N-1}^{(1)}\omega^2 - \hat\theta_N^{(1)}\omega^2 - \Big(\phi^{(1)}+\hat\theta_{N-1}^{(1)}\beta^2+\frac{\lambda}{2\omega^2}\beta^2\Big)\delta_{N-2}^2 - \frac{\lambda}{2}\log\frac{\omega^2}{\omega^2+\beta^2\delta_{N-2}^2}.
\end{aligned} \tag{A22}
$$
Note that the objective functions in (A19) and (A22) have the same form. Therefore, we can use the same arguments to show that the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy in the $(N-2)$-th step, i.e., $\delta_{N-2}^{\dagger 2} = 0$ and
$$\kappa_{N-2}^{\dagger(1)} = \frac{\frac{\lambda}{2\omega^2}\beta^2\kappa_{N-2}^{*(0)} - \hat\theta_{N-1}^{(1)}\alpha\beta}{\phi^{(1)} + \hat\theta_{N-1}^{(1)}\beta^2 + \frac{\lambda}{2\omega^2}\beta^2}. \tag{A23}$$
We can further prove the optimal linear Gaussian random policies in the remaining steps reduce to the optimal deterministic privacy-preserving LQG policies based on the same arguments. □

References

1. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
2. Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; Abbeel, P. Adversarial attacks on neural network policies. arXiv 2017, arXiv:1702.02284.
3. Lin, Y.C.; Hong, Z.W.; Liao, Y.H.; Shih, M.L.; Liu, M.Y.; Sun, M. Tactics of adversarial attack on deep reinforcement learning agents. In Proceedings of the 2017 International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 3756–3762.
4. Behzadan, V.; Munir, A. Vulnerability of deep reinforcement learning to policy induction attacks. In Proceedings of the MLDM 2017, New York, NY, USA, 15–20 July 2017; pp. 262–275.
5. Russo, A.; Proutiere, A. Optimal attacks on reinforcement learning policies. arXiv 2019, arXiv:1907.13548.
6. Goodfellow, I.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. In Proceedings of the ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
7. Tramer, F.; Kurakin, A.; Papernot, N.; Goodfellow, I.; Boneh, D.; McDaniel, P. Ensemble adversarial training: Attacks and defenses. In Proceedings of the ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
8. Sinha, A.; Namkoong, H.; Duchi, J. Certifying some distributional robustness with principled adversarial training. In Proceedings of the ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
9. Zheng, S.; Song, Y.; Leung, T.; Goodfellow, I. Improving the robustness of deep neural networks via stability training. In Proceedings of the CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016.
10. Yan, Z.; Guo, Y.; Zhang, C. Deep defense: Training DNNs with improved adversarial robustness. In Proceedings of the NIPS 2018, Montréal, QC, Canada, 3–8 December 2018.
11. Shapley, L. Stochastic games. Proc. Natl. Acad. Sci. USA 1953, 39, 1095–1100.
12. Gleave, A.; Dennis, M.; Wild, C.; Kant, N.; Levine, S.; Russell, S. Adversarial policies: Attacking deep reinforcement learning. In Proceedings of the ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020.
13. Pinto, L.; Davidson, J.; Sukthankar, R.; Gupta, A. Robust adversarial reinforcement learning. In Proceedings of the ICML 2017, Sydney, NSW, Australia, 6–11 August 2017; pp. 2817–2826.
14. Horak, K.; Zhu, Q.; Bosansky, B. Manipulating adversary's belief: A dynamic game approach to deception by design for proactive network security. In Proceedings of the GameSec 2017, Vienna, Austria, 23–25 October 2017; pp. 273–294.
15. Crawford, V.P.; Sobel, J. Strategic information transmission. Econometrica 1982, 50, 1431–1451.
16. Saritas, S.; Yuksel, S.; Gezici, S. Nash and Stackelberg equilibria for dynamic cheap talk and signaling games. In Proceedings of the ACC 2017, Seattle, WA, USA, 24–26 May 2017; pp. 3644–3649.
17. Saritas, S.; Shereen, E.; Sandberg, H.; Dán, G. Adversarial attacks on continuous authentication security: A dynamic game approach. In Proceedings of the GameSec 2019, Stockholm, Sweden, 30 October–1 November 2019; pp. 439–458.
18. Li, Z.; Dán, G. Dynamic cheap talk for robust adversarial learning. In Proceedings of the GameSec 2019, Stockholm, Sweden, 30 October–1 November 2019; pp. 297–309.
19. Li, Z.; Dán, G.; Liu, D. A game theoretic analysis of LQG control under adversarial attack. In Proceedings of the IEEE CDC 2020, Jeju Island, Korea, 14–18 December 2020; pp. 1632–1639.
20. Osogami, T. Robust partially observable Markov decision process. In Proceedings of the ICML 2015, Lille, France, 6–11 July 2015.
21. Sayin, M.O.; Basar, T. Secure sensor design for cyber-physical systems against advanced persistent threats. In Proceedings of the GameSec 2017, Vienna, Austria, 23–25 October 2017; pp. 91–111.
22. Sayin, M.O.; Akyol, E.; Basar, T. Hierarchical multistage Gaussian signaling games in noncooperative communication and control systems. Automatica 2019, 107, 9–20.
23. Sun, C.; Li, Z.; Wang, C. Adversarial linear quadratic regulator under falsified actions. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022.
24. Zhang, R.; Venkitasubramaniam, P. Stealthy control signal attacks in linear quadratic Gaussian control systems: Detectability reward tradeoff. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1555–1570.
25. Ren, X.X.; Yang, G.H. Kullback–Leibler divergence-based optimal stealthy sensor attack against networked linear quadratic Gaussian systems. IEEE Trans. Cybern. 2021, 1–10.
26. Venkitasubramaniam, P. Privacy in stochastic control: A Markov decision process perspective. In Proceedings of the Allerton 2013, Monticello, IL, USA, 2–4 October 2013; pp. 381–388.
27. Ny, J.L.; Pappas, G.J. Differentially private filtering. IEEE Trans. Autom. Control 2013, 59, 341–354.
28. Hale, M.T.; Egerstedt, M. Cloud-enabled differentially private multiagent optimization with constraints. IEEE Trans. Control Netw. Syst. 2017, 5, 1693–1706.
29. Hale, M.; Jones, A.; Leahy, K. Privacy in feedback: The differentially private LQG. In Proceedings of the ACC 2018, Milwaukee, WI, USA, 27–29 June 2018; pp. 3386–3391.
30. Hawkins, C.; Hale, M. Differentially private formation control. In Proceedings of the IEEE CDC 2020, Jeju Island, Korea, 14–18 December 2020; pp. 6260–6265.
31. Dwork, C. Differential privacy. In Proceedings of the ICALP 2006, Venice, Italy, 10–14 July 2006; pp. 1–12.
32. Wang, B.; Hegde, N. Privacy-preserving Q-learning with functional noise in continuous spaces. In Proceedings of the NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019.
33. Alexandru, A.B.; Pappas, G.J. Encrypted LQG using labeled homomorphic encryption. In Proceedings of the ACM/IEEE ICCPS 2019, Montreal, QC, Canada, 16–18 April 2019.
34. Arora, S.; Doshi, P. A survey of inverse reinforcement learning: Challenges, methods and progress. Artif. Intell. 2021, 297, 103500.
35. Soderstrom, T. Discrete-Time Stochastic Systems; Springer: Berlin/Heidelberg, Germany, 2002.
36. Baranga, A. The contraction principle as a particular case of Kleene's fixed point theorem. Discret. Math. 1991, 98, 75–79.
37. Hershey, J.R.; Olsen, P.A. Approximating the Kullback–Leibler divergence between Gaussian mixture models. In Proceedings of the IEEE ICASSP 2007, Honolulu, HI, USA, 15–20 April 2007.
38. Durrieu, J.; Thiran, J.; Kelly, F. Lower and upper bounds for approximation of the Kullback–Leibler divergence between Gaussian mixture models. In Proceedings of the IEEE ICASSP 2012, Kyoto, Japan, 25–30 March 2012.
39. Cui, S.; Datcu, M. Comparison of Kullback–Leibler divergence approximation methods between Gaussian mixture models for satellite image retrieval. In Proceedings of the IEEE IGARSS 2015, Milan, Italy, 26–31 July 2015.
Figure 1. LQG control in the presence of an eavesdropper.
Figure 1. LQG control in the presence of an eavesdropper.
Entropy 24 00856 g001
Figure 2. For λ = 1 , 5 or 10, the convergence of θ ^ N + 1 k ( 1 ) = J k ( J k 1 ( ( J 2 ( J 1 ( θ ^ N + 1 ( 1 ) ) ) ) ) ) .
Figure 2. For λ = 1 , 5 or 10, the convergence of θ ^ N + 1 k ( 1 ) = J k ( J k 1 ( ( J 2 ( J 1 ( θ ^ N + 1 ( 1 ) ) ) ) ) ) .
Entropy 24 00856 g002
Figure 3. When 0 λ 10,000, comparison of the asymptotic average control reward lim N 1 N E i = 1 N R ( 1 ) S i ( 1 ) , A i ( 1 ) achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average control reward lim N 1 N E i = 1 N R ( 1 ) S i ( 1 ) , A i ( 1 ) achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 3. When 0 λ 10,000, comparison of the asymptotic average control reward lim N 1 N E i = 1 N R ( 1 ) S i ( 1 ) , A i ( 1 ) achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average control reward lim N 1 N E i = 1 N R ( 1 ) S i ( 1 ) , A i ( 1 ) achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Entropy 24 00856 g003
Figure 4. When 0 λ 10,000, comparison of the asymptotic average privacy risk lim N 1 N D p S 1 : N ( 1 ) | | p S 1 : N ( 0 ) achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average privacy risk lim N 1 N D p S 1 : N ( 1 ) | | p S 1 : N ( 0 ) achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 4. When 0 λ 10,000, comparison of the asymptotic average privacy risk lim N 1 N D p S 1 : N ( 1 ) | | p S 1 : N ( 0 ) achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average privacy risk lim N 1 N D p S 1 : N ( 1 ) | | p S 1 : N ( 0 ) achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Entropy 24 00856 g004
Figure 5. For 0.01 θ ( 1 ) 8 and λ = 0 (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average control reward lim N 1 N E i = 1 N R ( 1 ) S i ( 1 ) , A i ( 1 ) achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average control reward lim N 1 N E i = 1 N R ( 1 ) S i ( 1 ) , A i ( 1 ) achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 5. For 0.01 θ ( 1 ) 8 and λ = 0 (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average control reward lim N 1 N E i = 1 N R ( 1 ) S i ( 1 ) , A i ( 1 ) achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average control reward lim N 1 N E i = 1 N R ( 1 ) S i ( 1 ) , A i ( 1 ) achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Entropy 24 00856 g005
Figure 6. For 0.01 θ ( 1 ) 8 and λ = 0 (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average privacy risk lim N 1 N D p S 1 : N ( 1 ) | | p S 1 : N ( 0 ) achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average privacy risk lim N 1 N D p S 1 : N ( 1 ) | | p S 1 : N ( 0 ) achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 6. For 0.01 θ ( 1 ) 8 and λ = 0 (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average privacy risk lim N 1 N D p S 1 : N ( 1 ) | | p S 1 : N ( 0 ) achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average privacy risk lim N 1 N D p S 1 : N ( 1 ) | | p S 1 : N ( 0 ) achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Entropy 24 00856 g006
Figure 7. For $0.01 \le \phi^{(1)} \le 40$ and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000, comparison of the asymptotic average control reward $\lim_{N\to\infty}\frac{1}{N}\,\mathbb{E}\!\left[\sum_{i=1}^{N} R^{(1)}\!\left(S_i^{(1)}, A_i^{(1)}\right)\right]$ achieved by the time-invariant optimal LQG policy of Agent B and by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 8. For $0.01 \le \phi^{(1)} \le 40$ and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000, comparison of the asymptotic average privacy risk $\lim_{N\to\infty}\frac{1}{N}\, D\!\left(p_{S_{1:N}^{(1)}} \,\middle\|\, p_{S_{1:N}^{(0)}}\right)$ achieved by the time-invariant optimal LQG policy of Agent B and by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
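For readers who want to regenerate curves of the type shown in Figures 3–8, the following Python sketch combines the closed forms sketched above (after Figure 4) with a standard scalar Riccati iteration for Agent A's benchmark gain. It is not the authors' code: the function names (lqr_gain, avg_reward_and_risk), the grid search, and the chosen values of $\theta^{(1)}$ and $\phi^{(1)}$ are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): reward-privacy trade-off for a
# scalar LQG system under time-invariant linear policies a_i = kappa * s_i,
# assuming dynamics s_{i+1} = alpha*s_i + beta*a_i + z_i, z_i ~ N(0, omega2),
# and reward R(s, a) = -(theta*s^2 + phi*a^2).
import numpy as np

ALPHA, BETA, OMEGA2 = 1.0, 0.5, 0.5   # Table 3 defaults
THETA0, PHI0 = 1.0, 16.0              # Agent A's reward coefficients (Table 3)
THETA1, PHI1 = 4.0, 8.0               # Agent B's coefficients (assumed values)


def lqr_gain(theta, phi, alpha=ALPHA, beta=BETA, iters=2000):
    """Average-cost scalar LQR gain via Riccati fixed-point iteration."""
    p = theta
    for _ in range(iters):
        p = theta + alpha ** 2 * p - (alpha * beta * p) ** 2 / (phi + beta ** 2 * p)
    return -alpha * beta * p / (phi + beta ** 2 * p)


def avg_reward_and_risk(kappa_b, kappa_a, theta=THETA1, phi=PHI1):
    """Asymptotic average reward of Agent B and KL-divergence rate vs. Agent A."""
    c_b, c_a = ALPHA + BETA * kappa_b, ALPHA + BETA * kappa_a
    if abs(c_b) >= 1.0:                # non-stabilizing gain: reject
        return -np.inf, np.inf
    v = OMEGA2 / (1.0 - c_b ** 2)      # stationary state variance under Agent B
    reward = -(theta + phi * kappa_b ** 2) * v
    risk = (c_b - c_a) ** 2 * v / (2.0 * OMEGA2)
    return reward, risk


kappa_a = lqr_gain(THETA0, PHI0)       # Agent A's benchmark LQG gain
# Stabilizing gains for alpha = 1, beta = 0.5 satisfy -4 < kappa < 0.
kappa_grid = np.linspace(-3.99, -0.01, 4000)

for lam in [0.0, 10.0, 100.0, 1000.0, 10000.0]:
    def objective(k, lam=lam):
        # Deterministic privacy-preserving design: reward minus lambda * risk.
        r, d = avg_reward_and_risk(k, kappa_a)
        return r - lam * d if np.isfinite(r) else -np.inf

    kappa_b = max(kappa_grid, key=objective)
    r, d = avg_reward_and_risk(kappa_b, kappa_a)
    print(f"lambda={lam:8.0f}  kappa_B={kappa_b:+.3f}  reward={r:8.3f}  risk={d:.5f}")
```

Sweeping THETA1 or PHI1 over a grid, instead of sweeping lam, reproduces experiments of the Figure 5–8 type under the same assumptions.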
Figure 9. For $\theta^{(1)} = 1$, $\phi^{(1)} = \phi^{(0)} = 16$, $0.01 \le \theta^{(0)} \le 8$, and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000, comparison of the asymptotic average control reward $\lim_{N\to\infty}\frac{1}{N}\,\mathbb{E}\!\left[\sum_{i=1}^{N} R^{(1)}\!\left(S_i^{(1)}, A_i^{(1)}\right)\right]$ achieved by the time-invariant optimal LQG policy of Agent B and by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 10. For $\theta^{(1)} = 1$, $\phi^{(1)} = \phi^{(0)} = 16$, $0.01 \le \theta^{(0)} \le 8$, and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000, comparison of the asymptotic average privacy risk $\lim_{N\to\infty}\frac{1}{N}\, D\!\left(p_{S_{1:N}^{(1)}} \,\middle\|\, p_{S_{1:N}^{(0)}}\right)$ achieved by the time-invariant optimal LQG policy of Agent B and by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 11. For $\theta^{(1)} = \theta^{(0)} = 1$, $\phi^{(1)} = 16$, $0.01 \le \phi^{(0)} \le 40$, and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000, comparison of the asymptotic average control reward $\lim_{N\to\infty}\frac{1}{N}\,\mathbb{E}\!\left[\sum_{i=1}^{N} R^{(1)}\!\left(S_i^{(1)}, A_i^{(1)}\right)\right]$ achieved by the time-invariant optimal LQG policy of Agent B and by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 12. For $\theta^{(1)} = \theta^{(0)} = 1$, $\phi^{(1)} = 16$, $0.01 \le \phi^{(0)} \le 40$, and $\lambda = 0$ (without privacy), 10, 100, 1000, or 10,000, comparison of the asymptotic average privacy risk $\lim_{N\to\infty}\frac{1}{N}\, D\!\left(p_{S_{1:N}^{(1)}} \,\middle\|\, p_{S_{1:N}^{(0)}}\right)$ achieved by the time-invariant optimal LQG policy of Agent B and by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
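The same sketch extends, again under the assumed conventions stated earlier, to experiments of the Figure 9–12 type in which Agent A's coefficients are swept: for each candidate value of $\theta^{(0)}$ or $\phi^{(0)}$, recompute the benchmark gain via kappa_a = lqr_gain(theta0, phi0) before re-optimizing Agent B's gain, since the privacy-risk rate depends on Agent B's closed-loop coefficient only through its distance to Agent A's.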
Table 1. Comparison of research on privacy problems.

| Reference | Private Information | Privacy Model/Measure | Privacy Mechanism |
|---|---|---|---|
| [26] | State | Equivocation | Privacy-preserving policy design |
| [27,28,29,30] | State | Differential privacy | Adding privacy noise to state |
| [32] | Reward function | Differential privacy | Adding privacy noise to value function |
| [33] | The whole LQG system | Computational secrecy | Labeled homomorphic encryption |
| This work | Agent identity | Kullback–Leibler divergence | Privacy-preserving policy design |
Table 2. Parameters.

| Parameter | Meaning |
|---|---|
| $N$ | Number of steps |
| $H$ | Agent identity binary hypothesis |
| $\alpha$, $\beta$ | Time-invariant linear coefficients in the linear Gaussian dynamic model |
| $z_i$, $\omega^2$ | Independent zero-mean Gaussian-distributed disturbance noise in the $i$-th step and its variance |
| $s_i^{(H)}$ | State of the agent $(H)$ in the $i$-th step |
| $a_i^{(H)}$ | Action of the agent $(H)$ in the $i$-th step |
| $F_i^{(H)}$ | Policy of the agent $(H)$ in the $i$-th step |
| $\kappa_i^{(H)}$ | State feedback gain of a linear policy of the agent $(H)$ in the $i$-th step |
| $r_i^{(H)}$ | Instantaneous control reward of the agent $(H)$ in the $i$-th step |
| $R^{(H)}$, $\theta^{(H)}$, $\phi^{(H)}$ | Time-invariant instantaneous quadratic control reward function of the agent $(H)$ and its coefficients |
| $\mu_1$, $\sigma_1^2$ | Mean and variance of the Gaussian-distributed initial state |
| $\lambda$ | Privacy-preserving design weight |
Table 3. Default model parameters.

| Parameter | $\mu_1$ | $\sigma_1^2$ | $\alpha$ | $\beta$ | $\omega^2$ | $\theta^{(0)}$ | $\phi^{(0)}$ |
|---|---|---|---|---|---|---|---|
| Value | 1 | 1 | 1 | 0.5 | 0.5 | 1 | 16 |
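For completeness, the symbols of Table 2 and the defaults of Table 3 can be collected into a single configuration object. The sketch below is a hypothetical container, not part of the paper's implementation; theta1 and phi1 are placeholders because Table 3 does not fix Agent B's reward coefficients.

```python
# Hypothetical configuration container mirroring Table 2's symbols with the
# Table 3 default values; theta1/phi1 (Agent B's reward coefficients) are not
# given in Table 3 and are left here as assumed placeholders.
from dataclasses import dataclass


@dataclass
class ScalarLQGConfig:
    mu1: float = 1.0        # mean of the Gaussian initial state
    sigma1_sq: float = 1.0  # variance of the Gaussian initial state
    alpha: float = 1.0      # state coefficient in s_{i+1} = alpha*s_i + beta*a_i + z_i
    beta: float = 0.5       # action coefficient
    omega_sq: float = 0.5   # variance of the disturbance z_i
    theta0: float = 1.0     # Agent A's state-penalty coefficient theta^(0)
    phi0: float = 16.0      # Agent A's action-penalty coefficient phi^(0)
    theta1: float = 1.0     # Agent B's theta^(1) (assumed placeholder)
    phi1: float = 16.0      # Agent B's phi^(1) (assumed placeholder)
    lam: float = 0.0        # privacy-preserving design weight lambda
```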