Privacy-Preserving Design of Scalar LQG Control

This paper studies the agent identity privacy problem in the scalar linear quadratic Gaussian (LQG) control system. The agent identity is a binary hypothesis: Agent A or Agent B. An eavesdropper is assumed to perform a binary hypothesis test on the agent identity based on the intercepted environment state sequence. The privacy risk is measured by the Kullback–Leibler divergence between the probability distributions of state sequences under the two hypotheses. By taking into account both the accumulative control reward and the privacy risk, an optimization problem for the policy of Agent B is formulated. This paper shows that the optimal deterministic privacy-preserving LQG policy of Agent B is a linear mapping. A sufficient condition is given to guarantee that the optimal deterministic privacy-preserving policy is time-invariant in the asymptotic regime. It is also shown that adding an independent Gaussian random process noise to the linear mapping of the optimal deterministic privacy-preserving policy cannot improve the performance of Agent B. The numerical experiments justify the theoretical results and illustrate the reward–privacy trade-off.


Related Work
During the last decades, control technologies have been widely employed and have significantly improved industrial productivity, management efficiency, and the convenience of daily life. The breakthrough of deep reinforcement learning (DRL) [1] enables control systems to become intelligent and applicable to more complicated tasks. Along with increasing concerns about information security and privacy, adversarial problems in control systems have also attracted increasing attention recently.
The related works are introduced and discussed in the following. They consider two types of adversarial problems: active attacks and privacy problems.

Research on Active Adversarial Attacks
Most previous works focus on studying active adversarial attacks on control systems, which aim to degrade the control efficiency or, even worse, to lead the system to an undesired state, and on developing the corresponding defense mechanisms. Depending on their methodologies, these works can be divided into two classes. One class aims to develop adversarial reinforcement learning algorithms under attack. The other class makes a theoretical study of the adversarial problem in the standard control model. DRL takes advantage of a deep network to represent a complex non-linear value function or policy function. Like the deep network itself, DRL is vulnerable to the adversarial example attack, i.e., the DRL-trained policy can be misled into taking a wrong action by adding a minor distortion to the observation of the agent [2]. In [2][3][4][5], the optimal generation of adversarial examples has been studied for given DRL algorithms. As a countermeasure, the mechanism of adversarial training uses adversarial examples in the training phase to enhance the robustness of the control policy under attack [6][7][8]. In [9,10], attack/robustness-related regularization terms are added to the optimization objective to improve the robustness of the policy.
In most theoretical studies, adversarial attack problems are modeled from the game-theoretic perspective. The stochastic game (SG) [11] and the partially observable SG (POSG) can model the indirect interactions (in an SG or POSG, players indirectly interact with each other by feeding their actions back to the dynamic environment) between multiple players in a dynamic control system and have been employed in robust or adversarial control studies [12][13][14]. The cheap talk game [15] models direct interactions (in the cheap talk game, the sender with private information sends a message to the receiver, and the receiver takes an action based on the received message and a belief about the inaccessible private information) between a sender and a receiver. In [16][17][18][19], the single-step cheap talk game has been extended to dynamic cheap talk games to model adversarial example attacks in multi-step control systems. With uncertainty about the environment dynamics in a partially observable Markov decision process (POMDP), the robust POMDP is formulated as a Stackelberg game in [20], where the agent (leader) optimizes the control policy under the worst-case assumption on the environment dynamics (follower). Another kind of adversarial attack maliciously falsifies the agent actions and feeds the falsified actions back to the dynamic environment to degrade the control performance. The falsified action attack can be modeled by Stackelberg games [21,22], where the dynamic environment is the leader and the adversarial agent is the follower. In our previous work [23], the falsified action attack on linear quadratic regulator control is modeled by a dynamic cheap talk game and the adversarial attack is evaluated by the Fisher information between the random agent action and the falsified action.
Optimal stealthy attacks have also been studied. In [24,25], the Kullback–Leibler divergence is used to measure the stealthiness of attacks on the control signal and the sensing data, respectively; the optimal attacks against the LQG control system are then developed with the objective of maximizing the quadratic cost while maintaining a degree of attack stealthiness.

Research on Privacy Problems
Besides the active attacks, passive eavesdropping in control systems leads to privacy problems. Most works focus on preserving the privacy-sensitive environment states. The design of agent actions in the Markov decision process has been investigated when the equivocation of states given system inputs and outputs is imposed as the privacy-preserving objective [26]. In [27][28][29][30], the notion of differential privacy [31] is introduced into multi-agent control, where each agent adds privacy noise to its states before sharing them with other agents while guaranteeing that the whole control system network operates well. The reward function is a succinct description of the control task and is strongly correlated with the agent actions. The DRL-learned value function can therefore reveal the privacy-sensitive reward function. Regarding this privacy problem, functional noise is added to the value function in Q-learning such that neighboring reward functions are indistinguishable [32]. As a promising computational secrecy technology, labeled homomorphic encryption has been employed to encrypt the private states, gain matrices, control inputs, and intermediary steps in the cloud-outsourced LQG [33].

Motivation
In this paper, we consider the agent identity privacy problem in the LQG control, which is motivated by the inverse reinforcement learning (IRL). IRL algorithms [34] can reconstruct the reward functions of agents and therefore can also be maliciously exploited to identify the agents. Similar to many other privacy problems in the big data era, such as the smart meter privacy problem, the agent identity of a control system is privacy-sensitive. When the agent identity is leaked, an adversary can further employ the corresponding optimal attacks on the control system.

Content and Contribution
We model the agent identity privacy problem as an adversarial binary hypothesis testing and employ the Kullback-Leibler divergence between the probability distributions of environment state sequences under different hypotheses as the privacy risk measure. We formulate a novel optimization problem and study the optimal privacy-preserving LQG policy. This work is compared with the previous research on privacy problems in Table 1. The rest of this paper is organized as follows. In Section 3, we formulate the agent identity privacy problem in the LQG control system. In Section 4, we optimize the deterministic privacy-preserving LQG policy and give a sufficient condition for time-invariant optimal deterministic policy in the asymptotic regime. In Section 5, we discuss the random privacy-preserving LQG policy and show that the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy. In Section 6, we present and analyze the numerical experiment results. Section 7 concludes this paper.

Notation
Unless otherwise specified, we denote a random scalar by a capital letter, e.g., X, its realization by the corresponding lower case letter, e.g., x, the Gaussian distribution with mean µ and variance σ² by N(µ, σ²), the expectation operation by E(·), the Kullback–Leibler divergence between two probability distributions by D(·||·), and the natural logarithm by log(·).

Agent Identity Privacy Problem in LQG Control
We consider an N-step LQG control problem in the presence of an eavesdropper as shown in Figure 1. There are two possible agents, Agent A and Agent B, corresponding to the hypothesis H = 0 and the alternative hypothesis H = 1, respectively. We assume that the agents and the eavesdropper have perfect observations of the environment states. Based on the intercepted state sequence, the eavesdropper performs a binary hypothesis test (a binary hypothesis is considered in this paper for simplicity and can be extended to a multi-hypothesis setting) to identify the current agent, which results in an agent identity privacy problem. To better understand the privacy problem, we give an example from the emerging application of autonomous vehicles. An autonomous vehicle can be controlled by a human driver (Agent A) or an autonomous driving system (Agent B). An adversary, who may be a compromised manager of the vehicle-to-everything (V2X) network, has access to the sensing data (environment state) of the autonomous vehicle and aims to attack the autonomous vehicle, e.g., to mislead it off the lane. To this end, the adversary first needs to identify whether the current driver is the autonomous driving system from the intercepted sensing data sequence. The agent identity privacy problem commonly exists in intelligent autonomous systems, e.g., unmanned aerial vehicles and robots, where the autonomous control agents depend strongly on the sensing data, are vulnerable to injection attacks, and therefore have privacy-sensitive agent identities. The LQG control model for each agent is given as follows: For H = 0 or H = 1,

where the parameters α ≠ 0, β ≠ 0, θ (H) > 0, φ (H) > 0, µ 1 , σ 2 1 > 0, and ω 2 > 0 are given. The initial environment state s (H) 1 is randomly generated following an independent Gaussian distribution N (µ 1 , σ 2 1 ). In the i-th time step, on observing the environment state s (H) i , the agent takes an action a (H) i , and the next state is generated by the linear dynamics with the process noise z i randomly generated following an independent zero-mean Gaussian distribution as in (1). In the standard LQG problem, the agent with respect to the hypothesis H only aims to maximize the expected accumulative reward by optimizing the control policies F (H) 1:N . The optimal LQG control policy has been well established [35] and is described by (7)-(10) for H = 0 or H = 1 and 1 ≤ i ≤ N. For H = 0 or 1, it can be easily verified that the mapping L (H) is order-preserving. From Kleene's fixed point theorem [36], it follows that the coefficient θ̂ (H) i converges as the horizon grows. Therefore, if we consider the asymptotic regime as N → ∞, the optimal control policies are time-invariant for H = 0 or H = 1 and i ≥ 1.

For the agent identity privacy problem, we assume that the eavesdropper collects a sequence of environment states and carries out a binary hypothesis test on the agent identity. Thus, the privacy risk can be measured by the hypothesis testing performance. In information theory, the Kullback–Leibler divergence measures the "distance" between two probability distributions. When the Kullback–Leibler divergence between the state sequence distributions under the two hypotheses is smaller, the distributions are statistically "closer" to each other and it is more difficult for the eavesdropper to identify the current agent, i.e., a poorer hypothesis testing performance and a lower privacy risk. In this paper, we employ this Kullback–Leibler divergence as the privacy risk measure. Furthermore, we assume that both agents aim to improve their own expected accumulative rewards while only Agent B considers reducing the privacy risk. This assumption makes sense in many scenarios.
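The backward recursion (7)-(10) can be sketched numerically. The following is a minimal illustration of the standard scalar Riccati recursion, assuming the reward form −θs² − φa², the dynamics s_{i+1} = αs_i + βa_i + z_i, and a zero terminal value θ̂_{N+1} = 0; the function name and the numeric values in the test are our assumptions, not taken from the paper.

```python
def riccati_backward(alpha, beta, theta, phi, N):
    """Backward recursion for the scalar LQG problem:
    theta_hat_i = theta + alpha^2 * phi * theta_hat_{i+1} / (phi + beta^2 * theta_hat_{i+1}),
    assuming the terminal value theta_hat_{N+1} = 0.
    Returns the per-step linear gains (a_i = kappa_i * s_i) and theta_hat_1."""
    theta_hat = 0.0  # assumed terminal value
    gains = []
    for _ in range(N):
        # Gain for the step whose continuation value is theta_hat
        kappa = -alpha * beta * theta_hat / (phi + beta ** 2 * theta_hat)
        gains.append(kappa)
        theta_hat = theta + alpha ** 2 * phi * theta_hat / (phi + beta ** 2 * theta_hat)
    gains.reverse()  # gains[0] is the first-step gain kappa_1
    return gains, theta_hat
```

For a long horizon, the leading gains settle to the time-invariant value determined by the fixed point of the recursion, which is the asymptotic regime discussed above.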
In the aforementioned autonomous vehicle example, Agent A denotes the human driver and does not need to change the optimal driving style; Agent B denotes the autonomous driving system and can be reconfigured with respect to the human's optimal driving style to improve the driving efficiency and to reduce the privacy risk. Under this assumption, Agent A takes the optimal LQG control policy as described by (7)-(10) with H = 0. In the following, we focus on the privacy-preserving LQG control policy of Agent B. Taking into account the two design objectives of Agent B, we formulate the optimization problem in (14), where λ ≥ 0 denotes the privacy-preserving design weight and the random environment state sequence S (0) * 1:N is induced by the optimal LQG policy F (0) * 1:N of Agent A. It follows from the chain rule of the Kullback–Leibler divergence and the Markovian property of the state sequences that the privacy risk measure can be further decomposed into a sum of per-step conditional divergences. It is obvious that the optimal privacy-preserving LQG control policy of Agent B depends on the value of λ. In the following two remarks, the optimal privacy-preserving LQG control policies are characterized for two special cases, λ = 0 and λ → ∞, respectively.

Remark 1.
When λ = 0, Agent B only aims to maximize the expected accumulative reward. In this case, the optimal privacy-preserving LQG policy of Agent B reduces to the optimal LQG policy of Agent B, i.e., the policy described by (7)-(10) with H = 1.

Remark 2.
When λ → ∞, Agent B only aims to minimize the privacy risk. In this case, the optimal privacy-preserving LQG policy of Agent B reduces to the optimal LQG policy of Agent A, i.e., the policy described by (7)-(10) with H = 0, for all 1 ≤ i ≤ N, and the minimum privacy risk 0 is achieved. When 0 < λ < ∞, we characterize the optimal privacy-preserving LQG control policies of Agent B in different forms in the following sections. For ease of reading, we list the parameters and their meanings in Table 2.
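The per-step terms in the chain-rule decomposition of the privacy risk are Kullback–Leibler divergences between univariate Gaussian transition laws, for which a closed form exists. A minimal sketch (the function and parameter names are ours, not the paper's):

```python
import math

def gauss_kl(mu0, var0, mu1, var1):
    """Closed-form D(N(mu0, var0) || N(mu1, var1)) for univariate Gaussians."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def step_kl(s, alpha, beta, kappa_b, kappa_a, omega2):
    """One chain-rule term: KL between the Gaussian next-state laws induced
    from the same state s by the linear policies of Agent B and Agent A."""
    return gauss_kl((alpha + beta * kappa_b) * s, omega2,
                    (alpha + beta * kappa_a) * s, omega2)
```

With equal variances, each term reduces to the squared mean gap over 2ω², i.e., (β(κ_b − κ_a)s)²/(2ω²), which is the quantity the per-step privacy risk penalizes.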

Deterministic Privacy-Preserving LQG Policy
When the privacy risk is not considered, as shown in (10), the optimal LQG control policy of Agent B is a deterministic linear mapping. In this section, we study the optimal deterministic privacy-preserving LQG policy of Agent B, i.e., the policy of Agent B is restricted to a deterministic mapping from the observed state to the action at each step. In the following theorem, we characterize the optimal deterministic privacy-preserving LQG policy of Agent B.
Theorem 1. At each step, the optimal deterministic privacy-preserving LQG policy of Agent B with respect to the optimization problem (14) is a linear mapping of the observed state, and the maximum achievable weighted design objective of Agent B follows in closed form.
The proof of Theorem 1 is presented in Appendix A.

Remark 3.
When λ = 0, it is easy to show that κ (1) * i coincides with the optimal LQG linear coefficient of Agent B; thus, the optimal deterministic privacy-preserving LQG policy is consistent with the optimal privacy-preserving LQG policy shown in Remark 1.

Remark 4.
It is easy to show that, as λ → ∞, κ (1) * i converges to the linear coefficient of the optimal LQG policy of Agent A; thus, the optimal deterministic privacy-preserving LQG policy is consistent with the optimal privacy-preserving LQG policy shown in Remark 2.

Remark 5.
Although the objective in (14) is a linear combination of the expected accumulative reward and the privacy risk measured by the Kullback–Leibler divergence, the optimal linear coefficient κ (1) * i is not a linear combination of the optimal linear coefficient with respect to only maximizing the expected accumulative reward and κ (0) * i (the optimal linear coefficient with respect to only minimizing the privacy risk) when we consider the deterministic privacy-preserving LQG control policy of Agent B.

Remark 6.
When Agent B employs the optimal deterministic privacy-preserving LQG policy at each step, the random state-action sequence is jointly Gaussian distributed.
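Because the state sequence stays Gaussian under any linear policy (Remark 6), the weighted objective in (14) can be evaluated for a given sequence of linear coefficients by exact moment propagation. The following is a hedged sketch under the assumed dynamics s_{i+1} = αs_i + βa_i + z_i and reward −θs² − φa²; the variable names are ours.

```python
def weighted_objective(kappas_b, kappa_a, alpha, beta, theta, phi,
                       mu1, var1, omega2, lam):
    """Expected accumulative reward of Agent B minus lam times the chain-rule
    KL privacy risk, for linear policies a_i = kappa * s_i, evaluated by
    closed-form Gaussian moment propagation under Agent B's policy."""
    mu, var = mu1, var1
    objective = 0.0
    for kb in kappas_b:
        m2 = mu ** 2 + var                       # E[S_i^2] under Agent B
        reward = -(theta + phi * kb ** 2) * m2   # E[-theta*S^2 - phi*(kb*S)^2]
        # Per-step conditional KL between the two Gaussian transition laws,
        # averaged over Agent B's state distribution:
        kl = (beta * (kb - kappa_a)) ** 2 * m2 / (2.0 * omega2)
        objective += reward - lam * kl
        mu = (alpha + beta * kb) * mu                      # mean propagation
        var = (alpha + beta * kb) ** 2 * var + omega2      # variance propagation
    return objective
```

Such an evaluator makes the trade-off explicit: when Agent B's gains coincide with Agent A's, the privacy term vanishes; otherwise a larger λ strictly penalizes the mismatch.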
In the asymptotic regime as N → ∞, the optimal LQG control policy is time-invariant. In this case, the design of the optimal policy becomes an easier task. Theorem 2 gives a sufficient condition such that the optimal deterministic privacy-preserving LQG policy of Agent B is time-invariant in the asymptotic regime.

Theorem 2. When the model parameters satisfy the following inequality
the optimal deterministic privacy-preserving LQG policy of Agent B is time-invariant in the asymptotic regime. More specifically, the backward recursion generating the coefficients θ̂ (1) i converges to the unique fixed point θ̂ (1) as N → ∞, and the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B follows from this fixed point. Under this condition, the asymptotic weighted design objective rate of Agent B achieved by the time-invariant optimal deterministic privacy-preserving LQG policy can be characterized in closed form. The proof of Theorem 2 is given in Appendix B.
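The time-invariant policy in Theorem 2 is obtained from the unique fixed point of an order-preserving map. The λ-dependent map of Theorem 2 is not reproduced here; as an illustration, the sketch below iterates the privacy-free scalar Riccati map (our notation, illustrative parameters) and shows Kleene-style convergence to the same fixed point from different initializations.

```python
def iterate_to_fixed_point(f, x0, tol=1e-12, max_iter=100000):
    """Iterate an order-preserving scalar map until successive values agree."""
    x = x0
    for _ in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next
        x = x_next
    return x

# Privacy-free scalar Riccati map, used purely as an illustration of the
# fixed-point structure; parameter values are assumptions.
alpha, beta, theta, phi = 0.9, 0.5, 1.0, 16.0
L_map = lambda p: theta + alpha ** 2 * phi * p / (phi + beta ** 2 * p)
```

For α² < 1 this map is a contraction on p ≥ 0, so the iteration converges regardless of the starting point, which is the behavior the sufficient condition in Theorem 2 is meant to guarantee for the λ-dependent map.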

Random Privacy-Preserving LQG Policy
As shown in Theorem 1, the optimal deterministic privacy-preserving LQG policy of Agent B is a linear mapping. In this section, we first discuss the optimal random privacy-preserving LQG policy and then consider a particular random policy by extending the deterministic linear mapping to a linear Gaussian random policy for Agent B. Here, the random policy of Agent B is specified by the conditional probability (density), denoted with slight abuse of notation, of taking the action a (1) i ∈ R given the state s (1) i . To the best of our knowledge, there is no analytically tractable formula for the Kullback–Leibler divergence between Gaussian mixture models, and only approximations are available [37][38][39]. Therefore, we do not give a closed-form solution of the optimal random privacy-preserving LQG policy in this paper.
In what follows, we focus on the linear Gaussian random policy: for 1 ≤ i ≤ N, the action adds to the linear mapping a realization w (1) i of an independent zero-mean Gaussian random process noise W i with variance δ 2 i . Theorem 3 characterizes the optimal linear Gaussian random privacy-preserving LQG policy of Agent B.

Theorem 3. At each step, the optimal linear Gaussian random privacy-preserving LQG policy of Agent B with respect to the optimization problem (14) is the same deterministic linear mapping as in Theorem 1.
The proof of Theorem 3 is presented in Appendix C.

Remark 7.
Adding an independent zero-mean Gaussian random process noise to the linear mapping of the optimal deterministic privacy-preserving LQG policy cannot improve the performance of Agent B.
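Remark 7 can be illustrated in a single step: adding action noise with variance δ² lowers the expected reward by φδ² and inflates the transition-variance mismatch seen by the eavesdropper, so the one-step weighted objective is decreasing in δ². A hedged one-step sketch under the assumed scalar dynamics (names and numbers are ours):

```python
import math

def gauss_kl(mu0, var0, mu1, var1):
    """Closed-form D(N(mu0, var0) || N(mu1, var1))."""
    return 0.5 * (math.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def one_step_objective(delta2, s, kappa_b, kappa_a, alpha, beta,
                       theta, phi, omega2, lam):
    """One-step weighted objective of Agent B when the action is
    kappa_b*s + W with W ~ N(0, delta2)."""
    # Action noise costs phi*delta2 in expected reward.
    reward = -theta * s ** 2 - phi * (kappa_b ** 2 * s ** 2 + delta2)
    # Agent B's next-state variance becomes omega2 + beta^2*delta2, while
    # Agent A's transition law keeps variance omega2.
    kl = gauss_kl((alpha + beta * kappa_b) * s, omega2 + beta ** 2 * delta2,
                  (alpha + beta * kappa_a) * s, omega2)
    return reward - lam * kl
```

In this one-step illustration the objective is maximized at δ² = 0, consistent with the reduction to the deterministic policy stated in Theorem 3.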

Impact of the Privacy-Preserving Design Weight λ
Here, we show the impact of the privacy-preserving design weight λ on the trade-off between the control reward of Agent B and the privacy risk. We use the same parameters as in Section 6.1, but allow 0 ≤ λ ≤ 10,000. Then, Theorem 2 is applicable and therefore the optimal deterministic privacy-preserving LQG policy of Agent B is time-invariant in the asymptotic regime. Figures 3 and 4 show that both the asymptotic average control reward of Agent B and the asymptotic average privacy risk decrease as λ increases, which illustrates the reward–privacy trade-off. In addition to the analytical results, we also present the simulation results considering privacy in Figures 3 and 4. Given 0 ≤ λ ≤ 10,000, we employ the corresponding time-invariant optimal deterministic privacy-preserving LQG policy of Agent B and run the 10,000-step privacy-preserving LQG control with 100 randomly generated initial states. Then, the average control reward and the average privacy risk are evaluated and compared with the analytical results of the asymptotic average control reward and the asymptotic average privacy risk, respectively. As shown in Figures 3 and 4, the simulation results match the analytical results quite well, which validates our analysis.
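A simulation of the kind described above can be sketched as follows; the gains and parameters used in the test are illustrative assumptions, not the values in Table 3. The per-step privacy risk is accumulated via the closed-form conditional KL between the two Gaussian transition laws.

```python
import math
import random

def simulate_average(kappa_b, kappa_a, alpha, beta, theta, phi, omega2,
                     steps, seed=0):
    """Monte Carlo estimate of the average control reward and the average
    per-step KL privacy risk under a time-invariant linear policy of Agent B."""
    rng = random.Random(seed)
    s = rng.gauss(0.0, 1.0)  # illustrative initial state
    total_reward = total_kl = 0.0
    for _ in range(steps):
        a = kappa_b * s
        total_reward += -theta * s ** 2 - phi * a ** 2
        # Conditional KL between N((alpha+beta*kappa_b)s, omega2) and
        # N((alpha+beta*kappa_a)s, omega2), evaluated at the realized state:
        total_kl += (beta * (kappa_b - kappa_a) * s) ** 2 / (2.0 * omega2)
        s = alpha * s + beta * a + rng.gauss(0.0, math.sqrt(omega2))
    return total_reward / steps, total_kl / steps
```

When Agent B's gain matches Agent A's, the simulated privacy risk is exactly zero; a mismatched gain yields a strictly positive average risk, mirroring the trade-off in Figures 3 and 4.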

Impact of Parameter θ (1)
Here, we study the impact of the parameter θ (1) on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we set φ (1) = φ (0) = 16 and allow 0.01 ≤ θ (1) ≤ 8, λ = 0 (without privacy), 10, 100, 1000 or 10,000. It can be verified that Theorem 2 holds for those model parameters. For all 0.01 ≤ θ (1) ≤ 8 and by increasing the value of λ, Figures 5 and 6 show a trade-off between the control reward of Agent B and the privacy risk, which is consistent with the previous observations. For λ = 0, 10, 100, 1000 or 10,000, Figure 5 shows that the asymptotic average control reward of Agent B decreases as θ (1) increases. This is reasonable since −θ (1) is the quadratic coefficient in the instantaneous reward function R (1) . For λ = 0, 10, 100, 1000 or 10,000, Figure 6 shows that the asymptotic average privacy risk first decreases, then increases, achieving the minimum value 0 when θ (1) = θ (0) = 1. When θ (1) = θ (0) = 1, both agents have the same instantaneous reward function and employ the same optimal LQG control policy, which leads to the same state sequence distribution under both hypotheses and the minimum value 0 of the Kullback–Leibler divergence. As θ (1) deviates from the value of θ (0) , the instantaneous reward functions of the agents become increasingly different, which leads to increasingly different state sequence distributions under the two hypotheses and a larger value of the Kullback–Leibler divergence.
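The U-shaped privacy risk curve can be reproduced for the λ = 0 (without privacy) case by combining the fixed-point LQG gains with the stationary state variance of Agent B's closed loop; the asymptotic KL rate vanishes exactly when θ (1) = θ (0). A hedged sketch (our notation, illustrative parameters; it assumes a stable closed loop so the stationary variance exists):

```python
def lqg_gain(alpha, beta, theta, phi, iters=500):
    """Time-invariant LQG gain from the fixed point of the scalar Riccati map."""
    p = 0.0
    for _ in range(iters):
        p = theta + alpha ** 2 * phi * p / (phi + beta ** 2 * p)
    return -alpha * beta * p / (phi + beta ** 2 * p)

def asymptotic_kl_rate(theta1, theta0, alpha, beta, phi, omega2):
    """Asymptotic per-step KL rate between the state processes induced by the
    privacy-free LQG gains of Agent B (theta1) and Agent A (theta0)."""
    k1 = lqg_gain(alpha, beta, theta1, phi)
    k0 = lqg_gain(alpha, beta, theta0, phi)
    rho = alpha + beta * k1           # Agent B's closed-loop coefficient
    v1 = omega2 / (1.0 - rho ** 2)    # stationary state variance under Agent B
    return (beta * (k1 - k0)) ** 2 * v1 / (2.0 * omega2)
```

Sweeping theta1 across theta0 with this function produces a curve that dips to zero at theta1 = theta0 and grows on either side, matching the qualitative shape of Figure 6.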

Impact of Parameter φ (1)
Here, we show the impact of the parameter φ (1) on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we set θ (1) = θ (0) = 1 and allow 0.01 ≤ φ (1) ≤ 40, λ = 0 (without privacy), 10, 100, 1000 or 10,000. It can be verified that Theorem 2 holds for those model parameters. For all 0.01 ≤ φ (1) ≤ 40 and by increasing the value of λ, Figures 7 and 8 also show a trade-off between the control reward of Agent B and the privacy risk. For λ = 0, 10, 100, 1000 or 10,000, Figure 7 shows that the asymptotic average control reward of Agent B decreases as φ (1) increases. This is because −φ (1) is the other quadratic coefficient in the instantaneous reward function R (1) . For λ = 0, 10, 100, 1000 or 10,000, Figure 8 shows that the asymptotic average privacy risk has a similar pattern: it first decreases, then increases, achieving the minimum value 0 when φ (1) = φ (0) = 16. This pattern can be explained similarly to Section 6.3.

Impact of Parameter θ (0)
By fixing θ (1) = 1 and φ (1) = φ (0) = 16, we study the impact of the parameter θ (0) on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we allow 0.01 ≤ θ (0) ≤ 8 and λ = 0 (without privacy), 10, 100, 1000 or 10,000. It can be verified that Theorem 2 holds for those model parameters. For all 0.01 ≤ θ (0) ≤ 8 and by increasing the value of λ, Figures 9 and 10 show a trade-off between the control reward of Agent B and the privacy risk. For λ = 0, 10, 100, 1000 or 10,000, Figures 9 and 10 show that the asymptotic average control reward of Agent B achieves the maximum value while the asymptotic average privacy risk achieves the minimum value 0 when θ (1) = θ (0) = 1. In this case, both agents have the same instantaneous reward function and employ the same optimal LQG control policy, which maximizes their control rewards, leads to the same state sequence distribution under both hypotheses, and therefore achieves the minimum value 0 of the Kullback-Leibler divergence.
Impact of Parameter φ (0)
By fixing φ (1) = 16 and θ (1) = θ (0) = 1, we study the impact of the parameter φ (0) on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we allow 0.01 ≤ φ (0) ≤ 40 and λ = 0 (without privacy), 10, 100, 1000 or 10,000. From Figures 11 and 12, we have similar observations of the impact of φ (0) as in Section 6.5. These observations can be similarly explained as well.

Conclusions
In this paper, we consider the agent identity privacy problem in the scalar LQG control. We model this novel privacy problem as an adversarial binary hypothesis testing problem and employ the Kullback–Leibler divergence to measure the privacy risk. We then formulate a novel privacy-preserving LQG control optimization by taking into account both the accumulative control reward of Agent B and the privacy risk. We prove that the optimal deterministic privacy-preserving LQG control policy of Agent B is a linear mapping, which is consistent with the standard LQG. We further show that the random policy formulated by adding an independent Gaussian random process noise to the optimal deterministic privacy-preserving LQG policy cannot improve the performance. We also give a sufficient condition to guarantee a time-invariant optimal deterministic privacy-preserving LQG policy in the asymptotic regime.
This research can be extended in several future directions. Studying the general random policy of Agent B is an interesting extension. This theoretical study can also be extended to develop privacy-preserving reinforcement learning algorithms. The problem can further be formulated as a non-cooperative game of multiple agents with conflicting objectives, where some agents only aim to optimize their own accumulative control rewards while the other agents additionally consider the agent identity privacy risk.
Appendix A
Proof of Theorem 1. We first consider the sub-problem of the final step, in which the distribution of the state is fixed. The objective of the final step admits an upper bound, and the upper bound can be achieved by the optimal deterministic privacy-preserving LQG policy; this yields the maximum achievable objective of the final step. We then consider the sub-problem from the (N − 1)-th step until the final step. Given the state distribution and any state s (1) N−1 , we can obtain the optimal deterministic privacy-preserving LQG policy as a linear mapping with coefficient κ (1) N−1 . By using the optimal deterministic policies of the subsequent steps, the coefficient θ̂ (1) N−1 can be specified, and it can be easily justified that θ̂ (1) N−1 > 0 in (A11). Note that the objective functions in (A5) and (A11) have the same form. Therefore, we can use the same arguments to obtain the optimal deterministic privacy-preserving LQG policy and the maximum achievable objective of the sub-problem. We can further prove the optimal deterministic privacy-preserving LQG policies in the remaining steps and the maximum achievable weighted design objective of Agent B in Theorem 1 using the same backward induction arguments.

Appendix C
Proof of Theorem 3. The proof is similar to that of Theorem 1.
We first consider the sub-problem of the final step. Given a probability distribution of the state, it is obvious that the optimal parameters are κ (1) N = 0 and δ 2 N = 0, i.e., the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy in the final step.
Similarly, we then consider the sub-problem from the (N − 1)-th step until the final step. Given a probability distribution of the state, the optimization of κ (1) N−1 ∈ R has a concave quadratic objective, from which we can obtain the optimal linear coefficient. The optimization of δ 2 N−1 ∈ R ≥0 has a decreasing objective, so the optimal variance is δ 2 N−1 = 0. Therefore, the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy in the (N − 1)-th step.
We then consider the sub-problem from the (N − 2)-th step until the final step, given a probability distribution of the state S (1) N−2 and the optimal linear Gaussian random policies of the subsequent steps. Note that the objective functions in (A19) and (A22) have the same form. Therefore, we can use the same arguments to show that the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy in the (N − 2)-th step, i.e., δ 2 N−2 = 0 and the linear coefficient given in (A23).