Abstract
This paper studies the agent identity privacy problem in the scalar linear quadratic Gaussian (LQG) control system. The agent identity is a binary hypothesis: Agent A or Agent B. An eavesdropper is assumed to perform a hypothesis test on the agent identity based on the intercepted environment state sequence. The privacy risk is measured by the Kullback–Leibler divergence between the probability distributions of the state sequences under the two hypotheses. By taking into account both the accumulative control reward and the privacy risk, an optimization problem over the policy of Agent B is formulated. This paper shows that the optimal deterministic privacy-preserving LQG policy of Agent B is a linear mapping. A sufficient condition is given to guarantee that the optimal deterministic privacy-preserving policy is time-invariant in the asymptotic regime. It is also shown that adding an independent Gaussian random process noise to the linear mapping of the optimal deterministic privacy-preserving policy cannot improve the performance of Agent B. The numerical experiments validate the theoretical results and illustrate the reward–privacy trade-off.
1. Related Work
Over the last few decades, control technologies have been widely deployed and have significantly improved industrial productivity, management efficiency, and everyday convenience. The breakthrough of deep reinforcement learning (DRL) [1] enables control systems to be intelligent and applicable to more complicated tasks. Along with growing concerns about information security and privacy, adversarial problems in control systems have also attracted increasing attention recently.
The related literature is introduced and discussed in the following. These works consider two types of adversarial problems: active attacks and privacy problems.
1.1. Research on Active Adversarial Attacks
Most previous works focus on active adversarial attacks on control systems, which aim to degrade the control efficiency or, even worse, to drive the system to an undesired state, and on developing the corresponding defense mechanisms. Depending on their methodologies, these works can be divided into two classes. One class develops adversarial reinforcement learning algorithms under attack. The other class provides theoretical studies of the adversarial problem in the standard control model.
DRL takes advantage of deep networks to represent complex non-linear value or policy functions. Like deep networks, DRL is also vulnerable to adversarial example attacks, i.e., the DRL-trained policy can be misled into taking a wrong action by adding a minor distortion to the observation of the agent [2]. In [2,3,4,5], the optimal generation of adversarial examples has been studied for given DRL algorithms. As a countermeasure, adversarial training uses adversarial examples in the training phase to enhance the robustness of the control policy under attack [6,7,8]. In [9,10], attack- or robustness-related regularization terms are added to the optimization objective to improve the robustness of the policy.
In most theoretical studies, adversarial attack problems are modeled from a game-theoretic perspective. The stochastic game (SG) [11] and the partially observable SG (POSG) can model the indirect (in an SG or POSG, players indirectly interact with each other by feeding their actions back to the dynamic environment) interactions between multiple players in a dynamic control system and have been employed in robust or adversarial control studies [12,13,14]. The cheap talk game [15] models direct (in the cheap talk game, the sender with private information sends a message to the receiver, and the receiver takes an action based on the received message and a belief on the inaccessible private information) interactions between a sender and a receiver. In [16,17,18,19], the single-step cheap talk game has been extended to dynamic cheap talk games to model adversarial example attacks in multi-step control systems. With uncertainty about the environment dynamics in a partially observable Markov decision process (POMDP), the robust POMDP is formulated as a Stackelberg game in [20], where the agent (leader) optimizes the control policy under the worst-case assumption on the environment dynamics (follower). Another kind of adversarial attack maliciously falsifies the agent actions and feeds the falsified actions back to the dynamic environment to degrade the control performance. The falsified-action attack can be modeled by Stackelberg games [21,22], where the dynamic environment is the leader and the adversarial agent is the follower. In our previous work [23], the falsified-action attack on linear quadratic regulator control is modeled by a dynamic cheap talk game, and the adversarial attack is evaluated by the Fisher information between the random agent action and the falsified action.
Optimal stealthy attacks have also been studied. In [24,25], the Kullback–Leibler divergence is used to measure the stealthiness of the attacks on the control signal and the sensing data, respectively; then, the optimal attacks against the LQG control system are developed with the objective of maximizing the quadratic cost while maintaining a degree of attack stealthiness.
1.2. Research on Privacy Problems
Besides the active attacks, passive eavesdropping in control systems leads to privacy problems. Most works focus on preserving the privacy-sensitive environment states. The design of agent actions in the Markov decision process has been investigated when the equivocation of states given system inputs and outputs is imposed as the privacy-preserving objective [26]. In [27,28,29,30], the notion of differential privacy [31] is introduced into multi-agent control, where each agent adds privacy noise to its states before sharing them with other agents while guaranteeing that the whole control network operates well. The reward function is a succinct description of the control task and is strongly related to the agent actions. The DRL-learned value function can reveal the privacy-sensitive reward function. Regarding this privacy problem, functional noise is added to the value function in Q-learning such that neighboring reward functions are indistinguishable [32]. As a promising computational secrecy technology, labeled homomorphic encryption has been employed to encrypt the private states, gain matrices, control inputs, and intermediary steps in the cloud-outsourced LQG [33].
2. Introduction
2.1. Motivation
In this paper, we consider the agent identity privacy problem in LQG control, which is motivated by inverse reinforcement learning (IRL). IRL algorithms [34] can reconstruct the reward functions of agents and therefore can also be maliciously exploited to identify the agents. Similar to many other privacy problems in the big data era, such as the smart meter privacy problem, the agent identity of a control system is privacy-sensitive. When the agent identity is leaked, an adversary can further employ the corresponding optimal attacks against the control system.
2.2. Content and Contribution
We model the agent identity privacy problem as an adversarial binary hypothesis testing problem and employ the Kullback–Leibler divergence between the probability distributions of environment state sequences under different hypotheses as the privacy risk measure. We formulate a novel optimization problem and study the optimal privacy-preserving LQG policy. This work is compared with the previous research on privacy problems in Table 1.
Table 1.
Comparison of research on privacy problems.
The rest of this paper is organized as follows. In Section 3, we formulate the agent identity privacy problem in the LQG control system. In Section 4, we optimize the deterministic privacy-preserving LQG policy and give a sufficient condition for time-invariant optimal deterministic policy in the asymptotic regime. In Section 5, we discuss the random privacy-preserving LQG policy and show that the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy. In Section 6, we present and analyze the numerical experiment results. Section 7 concludes this paper.
2.3. Notation
Unless otherwise specified, we denote a random scalar by a capital letter, e.g., X, its realization by the corresponding lower case letter, e.g., x, the Gaussian distribution with mean and variance by , the expectation operation by , the Kullback–Leibler divergence between two probability distributions by , and the natural logarithm by .
3. Agent Identity Privacy Problem in LQG Control
We consider an N-step LQG control problem in the presence of an eavesdropper, as shown in Figure 1. There are two possible agents, Agent A and Agent B, which correspond to a hypothesis and an alternative hypothesis, respectively. We assume that the agents and the eavesdropper have perfect observations of the environment states. Based on the intercepted state sequence, the eavesdropper performs a binary hypothesis test (a binary hypothesis is considered in this paper for simplicity and can be extended to multiple hypotheses) to identify the current agent, which results in an agent identity privacy problem. To better understand the privacy problem, we give an example from the emerging application of autonomous vehicles. An autonomous vehicle can be controlled by a human driver (Agent A) or an autonomous driving system (Agent B). An adversary, who may be a compromised manager of the vehicle-to-everything (V2X) network, has access to the sensing data (environment state) of the autonomous vehicle and aims to attack the autonomous vehicle, e.g., to mislead the autonomous vehicle off the lane. To this end, the adversary needs to first identify whether the current driver is the autonomous driving system from the intercepted sensing data sequence. The agent identity privacy problem commonly exists in intelligent autonomous systems, e.g., unmanned aerial vehicles and robots, where autonomous control agents that depend strongly on sensing data are vulnerable to injection attacks, and therefore the agent identities are privacy-sensitive.
Figure 1.
LQG control in the presence of an eavesdropper.
The LQG control model for each agent is given as follows: For or , ,
where the parameters are given. The initial environment state is randomly generated following an independent Gaussian distribution. In the i-th time step, on observing the environment state, the agent corresponding to the hypothesis H employs the control policy to (randomly) determine an action as (2); the instantaneous control reward is jointly determined by the current state and action as (3); the next state is jointly determined by the current state, the current action, and a process noise randomly generated following an independent zero-mean Gaussian distribution, as (1). In the standard LQG problem, the agent corresponding to the hypothesis H only aims to maximize the expected accumulative reward by optimizing the control policies:
The optimal LQG control policy has been well established [35] and can be described as follows. For or , ,
For or 1, it can be easily verified that the mapping is order-preserving, i.e., if . From Kleene's fixed point theorem [36], it follows that
Therefore, if we consider the asymptotic regime as , the optimal control policies are time-invariant: For or , ,
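For reference, the following is a minimal sketch of this standard scalar LQG solution, assuming dynamics x_{i+1} = a x_i + b u_i + w_i and instantaneous cost q x_i^2 + r u_i^2 (reward taken as the negative cost); the parameter names are illustrative and not necessarily the paper's notation. The backward Riccati recursion yields the per-step linear gains, and iterating the Riccati map to its fixed point gives the time-invariant gain of the asymptotic regime.

```python
# Sketch of the standard scalar LQG solution under the illustrative model
# x_{i+1} = a*x_i + b*u_i + w_i with cost q*x_i^2 + r*u_i^2 (reward = -cost).
import numpy as np

def lqg_gains(a, b, q, r, q_terminal, N):
    """Backward Riccati recursion; returns the per-step gains k_i (u_i = k_i*x_i)
    and the Riccati sequence p_i."""
    p = np.empty(N + 1)
    k = np.empty(N)
    p[N] = q_terminal
    for i in range(N - 1, -1, -1):
        k[i] = -a * b * p[i + 1] / (r + b**2 * p[i + 1])
        p[i] = q + a**2 * p[i + 1] - (a * b * p[i + 1])**2 / (r + b**2 * p[i + 1])
    return k, p

def asymptotic_gain(a, b, q, r, tol=1e-12, max_iter=10_000):
    """Iterate the Riccati map to its fixed point (time-invariant regime)."""
    p = q
    for _ in range(max_iter):
        p_next = q + a**2 * p - (a * b * p)**2 / (r + b**2 * p)
        if abs(p_next - p) < tol:
            break
        p = p_next
    return -a * b * p / (r + b**2 * p)

# Illustrative usage: such a fixed-point gain could play the role of
# Agent A's time-invariant LQG gain in the later sketches.
k_A = asymptotic_gain(a=0.9, b=1.0, q=1.0, r=1.0)
```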
For the agent identity privacy problem, we assume that the eavesdropper collects a sequence of environment states and carries out a binary hypothesis test on the agent identity. Thus, the privacy risk can be measured by the hypothesis testing performance. In information theory, the Kullback–Leibler divergence measures the “distance” between two probability distributions. When the value of the Kullback–Leibler divergence is smaller, the random environment state sequences under the two hypotheses are statistically “closer” to each other and it is more difficult for the eavesdropper to identify the current agent, i.e., the hypothesis testing performance is poorer and the privacy risk is lower. In this paper, we employ the Kullback–Leibler divergence as the privacy risk measure.
Furthermore, we assume that both agents aim to improve their own expected accumulative rewards, while only Agent B additionally aims to reduce the privacy risk. This assumption makes sense in many scenarios. In the aforementioned autonomous vehicle example, Agent A denotes the human driver and does not need to change the optimal driving style; Agent B denotes the autonomous driving system and can be reconfigured with respect to the human's optimal driving style to improve the driving efficiency and to reduce the privacy risk. Under this assumption, Agent A takes the optimal LQG control policy as described by (7)–(10) with . In the following, we focus on the privacy-preserving LQG control policy of Agent B. Taking into account the two design objectives of Agent B, we formulate the following optimization problem:
where denotes the privacy-preserving design weight; the random environment state sequence is induced by the optimal LQG policy of Agent A. It follows from the chain rule of Kullback–Leibler divergence and the Markovian property of the state sequences that the privacy risk measure can be further decomposed as
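For concreteness, the following is a sketch of this standard decomposition, assuming scalar dynamics $X_{i+1} = aX_i + bU_i + W_i$ with $W_i \sim \mathcal{N}(0,\sigma_W^2)$, a linear policy $U_i = k_i^{B} X_i$ for Agent B, and per-step gains $k_i^{A}$ for Agent A; these symbols are illustrative rather than the paper's exact notation, and the second line uses the closed-form Kullback–Leibler divergence between two Gaussians with equal variance.

```latex
% A sketch under the illustrative assumptions above; the initial state
% distribution is the same under both hypotheses, so it contributes no
% divergence.
\begin{align}
D\!\left(P_{X_{0:N}\mid H_{B}} \,\middle\|\, P_{X_{0:N}\mid H_{A}}\right)
  &= \sum_{i=0}^{N-1} \mathbb{E}_{H_{B}}\!\left[
       D\!\left(P_{X_{i+1}\mid X_i, H_{B}} \,\middle\|\,
                P_{X_{i+1}\mid X_i, H_{A}}\right)\right] \\
  &= \sum_{i=0}^{N-1}
     \frac{b^{2}\,\bigl(k_i^{B}-k_i^{A}\bigr)^{2}\,
           \mathbb{E}_{H_{B}}\!\left[X_i^{2}\right]}{2\sigma_W^{2}} .
\end{align}
```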
It is obvious that the optimal privacy-preserving LQG control policy of Agent B depends on the value of . In the following two remarks, the optimal privacy-preserving LQG control policies are characterized for two special cases, and , respectively.
Remark 1.
When , Agent B only aims to maximize the expected accumulative reward . In this case, the optimal privacy-preserving LQG policy of Agent B reduces to the optimal LQG policy of Agent B, i.e., for all .
Remark 2.
When , Agent B only aims to minimize the privacy risk, which is measured by the Kullback–Leibler divergence . In this case, the optimal privacy-preserving LQG policy of Agent B reduces to the optimal LQG policy of Agent A, i.e., for all , and the minimum privacy risk is achieved, i.e., .
When , we characterize the optimal privacy-preserving LQG control policies of Agent B in different forms in the following sections. For ease of reading, we list the parameters and their meanings in Table 2.
Table 2.
Parameters.
4. Deterministic Privacy-Preserving LQG Policy
When the privacy risk is not considered, as shown in (10), the optimal LQG control policy of Agent B is a deterministic linear mapping. In this section, we study the optimal deterministic privacy-preserving LQG policy of Agent B. Therefore, the policy of Agent B can be specified as: For ,
In the following theorem, we characterize the optimal deterministic privacy-preserving LQG policy of Agent B.
Theorem 1.
At each step, the optimal deterministic privacy-preserving LQG policy of Agent B with respect to the optimization problem (14) is a linear mapping as: For ,
Then, the maximum achievable weighted design objective of Agent B is
The proof of Theorem 1 is presented in Appendix A.
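To make the structure of the backward dynamic programming concrete, the following is a minimal numerical sketch under the same illustrative model as the earlier sketches (dynamics x_{i+1} = a x_i + b u_i + w_i with noise variance sig2, reward -(q x^2 + r u^2), Agent A's time-invariant gain k_A, and privacy weight lam); the names, sign conventions, and recursion are assumptions, not the paper's exact formulas. With a quadratic value function, each backup is a concave quadratic in the linear gain, so the per-step maximizer is available in closed form.

```python
# Backward-dynamic-programming sketch of a linear privacy-preserving gain
# under the illustrative scalar model stated above.
import numpy as np

def privacy_preserving_gains(a, b, q, r, sig2, k_A, lam, q_terminal, N):
    p = q_terminal          # quadratic coefficient of the (negative) value function
    gains = np.empty(N)
    for i in range(N - 1, -1, -1):
        # Per-state cost coefficient as a function of the gain k:
        #   q + r*k^2 + lam*b^2*(k - k_A)^2/(2*sig2) + p*(a + b*k)^2
        # Setting its derivative in k to zero gives the optimal gain.
        num = lam * b**2 * k_A / sig2 - 2 * a * b * p
        den = 2 * r + lam * b**2 / sig2 + 2 * b**2 * p
        k = num / den
        gains[i] = k
        p = q + r * k**2 + lam * b**2 * (k - k_A)**2 / (2 * sig2) \
            + p * (a + b * k)**2
    return gains
```

In this sketch, setting lam = 0 recovers the standard LQG gains, while letting lam grow drives the gains toward k_A, which is consistent with Remarks 3 and 4.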
Remark 3.
When , it is easy to show that for all , i.e., the optimal deterministic privacy-preserving LQG policy is consistent with the optimal privacy-preserving LQG policy shown in Remark 1.
Remark 4.
It is easy to show that for all , i.e., the optimal deterministic privacy-preserving LQG policy is consistent with the optimal privacy-preserving LQG policy shown in Remark 2.
Remark 5.
Although the objective in (14) is a linear combination of the expected accumulative reward and the privacy risk measured by the Kullback–Leibler divergence, the optimal linear coefficient of the deterministic privacy-preserving LQG control policy of Agent B is a non-linear function of (the optimal linear coefficient when only the expected accumulative reward is maximized) and (the optimal linear coefficient when only the privacy risk is minimized).
Remark 6.
When Agent B employs the optimal deterministic privacy-preserving LQG policy at each step, the random state-action sequence is jointly Gaussian distributed.
In the asymptotic regime as , the optimal LQG control policy is time-invariant. In this case, the design of the optimal policy becomes an easier task. Theorem 2 gives a sufficient condition such that the optimal deterministic privacy-preserving LQG policy of Agent B is time-invariant in the asymptotic regime.
Theorem 2.
When the model parameters satisfy the following inequality
the optimal deterministic privacy-preserving LQG policy of Agent B is time-invariant in the asymptotic regime. More specifically, converges to the unique fixed point as
and the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B can be described by
Under this condition, the asymptotic weighted design objective rate of Agent B achieved by the time-invariant optimal deterministic privacy-preserving LQG policy is
The proof of Theorem 2 is given in Appendix B.
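Under a sufficient condition of this type, the backup map has a unique fixed point; a hedged sketch of the corresponding fixed-point iteration, reusing the illustrative backup from the previous sketch, is given below. Whether the iteration converges for a given parameter set should be checked against the condition in (22).

```python
# Fixed-point sketch for the time-invariant gain under the illustrative
# scalar model: iterate the same backup map until the quadratic value
# coefficient converges, then read off the gain.
def time_invariant_gain(a, b, q, r, sig2, k_A, lam, tol=1e-12, max_iter=100_000):
    p = q
    for _ in range(max_iter):
        num = lam * b**2 * k_A / sig2 - 2 * a * b * p
        den = 2 * r + lam * b**2 / sig2 + 2 * b**2 * p
        k = num / den
        p_next = q + r * k**2 + lam * b**2 * (k - k_A)**2 / (2 * sig2) \
                 + p * (a + b * k)**2
        if abs(p_next - p) < tol:
            return k
        p = p_next
    return k
```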
5. Random Privacy-Preserving LQG Policy
As shown in Theorem 1, the optimal deterministic privacy-preserving LQG policy of Agent B is a linear mapping. In this section, we first discuss the optimal random privacy-preserving LQG policy and then consider a particular random policy by extending the deterministic linear mapping to the linear Gaussian random policy for Agent B. Here, the random policy of Agent B can be specified as: For ,
With a slight abuse of notation, we denote the conditional probability (density) of taking the action given the state and the random policy by .
It can be easily shown that the optimal random privacy-preserving LQG policy of Agent B in the final step reduces to the deterministic linear mapping in (A2). For , it follows from the backward dynamic programming that the optimal random privacy-preserving LQG policy of Agent B in the i-th step does not reduce to a deterministic linear mapping in general. This is because the conditional probability distribution given a random policy is a Gaussian mixture model, and the Kullback–Leibler divergence between a Gaussian mixture model and a Gaussian distribution generally does not reduce to the quadratic mean of as in (A5). To the best of our knowledge, there is no analytically tractable formula for the Kullback–Leibler divergence between Gaussian mixture models, and only approximations are available [37,38,39]. Therefore, we do not give a closed-form solution of the optimal random privacy-preserving LQG policy in this paper.
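As a concrete illustration of why only approximations are available, the sketch below estimates the Kullback–Leibler divergence between a one-dimensional Gaussian mixture and a single Gaussian by Monte Carlo sampling, which is one of the standard approximation strategies surveyed in [37,38,39]; all function and parameter names are illustrative.

```python
# Monte Carlo estimate of D(p || q), where p is a 1-D Gaussian mixture and
# q = N(mu, sigma^2); no closed form is known for this divergence.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def kl_mixture_vs_gaussian(weights, means, stds, mu, sigma,
                           n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # Draw samples from the mixture p.
    comps = rng.choice(len(weights), size=n_samples, p=weights)
    x = rng.normal(means[comps], stds[comps])
    # log p(x) via log-sum-exp over the mixture components.
    log_p = logsumexp(np.log(weights)[None, :]
                      + norm.logpdf(x[:, None], means[None, :], stds[None, :]),
                      axis=1)
    log_q = norm.logpdf(x, mu, sigma)
    return float(np.mean(log_p - log_q))

# Illustrative usage: a two-component mixture against a single Gaussian.
d_hat = kl_mixture_vs_gaussian([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0], 0.0, 1.5)
```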
In what follows, we focus on the linear Gaussian random policy: For ,
where is the realization of an independent zero-mean Gaussian random process noise . Thus, a linear Gaussian random policy can be completely described by the parameters . Theorem 3 characterizes the optimal linear Gaussian random privacy-preserving LQG policy of Agent B.
Theorem 3.
At each step, the optimal linear Gaussian random privacy-preserving LQG policy of Agent B with respect to the optimization problem (14) is the same deterministic linear mapping as in Theorem 1.
The proof of Theorem 3 is presented in Appendix C.
Remark 7.
Adding an independent zero-mean Gaussian random process noise to the linear mapping of the optimal deterministic privacy-preserving LQG policy cannot improve the performance of Agent B.
6. Numerical Experiments
6.1. Convergence of the Sequence
When the constraint (22) in Theorem 2 is satisfied, we first illustrate the convergence of the sequence . In addition to the default model parameters in Table 3, we set , , and let the privacy-preserving design weight , 5 or 10. By using these parameters, it can be easily verified that the constraint (22) is satisfied. Figure 2 shows that converges after iterations for different values of . Furthermore, different convergence patterns can be observed for different values of .
Table 3.
Default model parameters.
Figure 2.
For , 5 or 10, the convergence of .
6.2. Impact of the Privacy-Preserving Design Weight
Here, we show the impact of the privacy-preserving design weight on the trade-off between the control reward of Agent B and the privacy risk. We use the same parameters as in Section 6.1, but allow 10,000. Then, Theorem 2 is applicable and therefore the optimal deterministic privacy-preserving LQG policy of Agent B is time-invariant in the asymptotic regime. Figure 3 and Figure 4 show that both the asymptotic average control reward and the asymptotic average privacy risk achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B decrease as increases, i.e., the control reward of Agent B is degraded while the privacy is enhanced. When the privacy risk is not considered, the best control reward of Agent B is achieved at the cost of the highest privacy risk.
Figure 3.
When 10,000, comparison of the asymptotic average control reward achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average control reward achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 4.
When 10,000, comparison of the asymptotic average privacy risk achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average privacy risk achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
In addition to the analytical results, we also present the simulation results with privacy considered in Figure 3 and Figure 4. Given 10,000, we employ the corresponding time-invariant optimal deterministic privacy-preserving LQG policy of Agent B and run the 10,000-step privacy-preserving LQG control with 100 randomly generated initial states. Then, the average control reward and the average privacy risk are evaluated and compared with the analytical asymptotic average control reward and asymptotic average privacy risk, respectively. As shown in Figure 3 and Figure 4, the simulation results match the analytical results quite well, which validates our analysis.
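A hedged sketch of this validation procedure is given below, reusing the illustrative scalar model of the earlier sketches: the closed loop is simulated with a time-invariant gain for Agent B, and the per-step reward and the per-step Kullback–Leibler contribution are averaged along the trajectories. The initial-state law and all names are assumptions rather than the paper's exact settings.

```python
# Monte Carlo validation sketch: simulate the closed loop with a
# time-invariant gain k_B, then average the per-step reward and the
# per-step KL contribution b^2*(k_B - k_A)^2*x^2/(2*sig2).
import numpy as np

def simulate(a, b, q, r, sig2, k_B, k_A, n_steps=10_000, n_runs=100, seed=0):
    rng = np.random.default_rng(seed)
    rewards, risks = [], []
    for _ in range(n_runs):
        x = rng.normal(0.0, 1.0)            # illustrative initial-state law
        tot_reward, tot_risk = 0.0, 0.0
        for _ in range(n_steps):
            u = k_B * x
            tot_reward += -(q * x**2 + r * u**2)
            tot_risk += b**2 * (k_B - k_A)**2 * x**2 / (2 * sig2)
            x = a * x + b * u + rng.normal(0.0, np.sqrt(sig2))
        rewards.append(tot_reward / n_steps)
        risks.append(tot_risk / n_steps)
    return np.mean(rewards), np.mean(risks)
```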
6.3. Impact of Parameter
Here, we study the impact of the parameter on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we set and allow , (without privacy), 10, 100, 1000 or 10,000. It can be verified that Theorem 2 holds for those model parameters. For all and by increasing the value of , Figure 5 and Figure 6 show a trade-off between the control reward of Agent B and the privacy risk, which is consistent with the previous observations. For , 10, 100, 1000 or 10,000, Figure 5 shows that the asymptotic average control reward of Agent B decreases as increases. This is reasonable since is the quadratic coefficient in the instantaneous reward function . For , 10, 100, 1000 or 10,000, Figure 6 shows that the asymptotic average privacy risk first decreases, then increases, and achieves the minimum value 0 when . When , both agents have the same instantaneous reward function and employ the same optimal LQG control policy, which leads to the same state sequence distribution under both hypotheses and the minimum value 0 of the Kullback–Leibler divergence. As deviates from the value of , the instantaneous reward functions of the two agents differ more, which leads to more different state sequence distributions under the two hypotheses and a larger value of the Kullback–Leibler divergence.
Figure 5.
For and (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average control reward achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average control reward achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 6.
For and (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average privacy risk achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average privacy risk achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
6.4. Impact of Parameter
Here, we show the impact of the parameter on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we set and allow , (without privacy), 10, 100, 1000 or 10,000. It can be verified that Theorem 2 holds for those model parameters. For all and by increasing the value of , Figure 7 and Figure 8 also show a trade-off between the control reward of Agent B and the privacy risk. For , 10, 100, 1000 or 10,000, Figure 7 shows that the asymptotic average control reward of Agent B decreases as increases. This is because is the other quadratic coefficient in the instantaneous reward function . For , 10, 100, 1000 or 10,000, Figure 8 shows that the asymptotic average privacy risk follows a similar pattern: it first decreases, then increases, and achieves the minimum value 0 when . This pattern can be explained as in Section 6.3.
Figure 7.
For and (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average control reward achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average control reward achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 8.
For and (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average privacy risk achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average privacy risk achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
6.5. Impact of Parameter
By fixing and , we study the impact of the parameter on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we allow and (without privacy), 10, 100, 1000 or 10,000. It can be verified that Theorem 2 holds for those model parameters. For all and by increasing the value of , Figure 9 and Figure 10 show a trade-off between the control reward of Agent B and the privacy risk. For , 10, 100, 1000 or 10,000, Figure 9 and Figure 10 show that the asymptotic average control reward of Agent B achieves the maximum value while the asymptotic average privacy risk achieves the minimum value 0 when . In this case, both agents have the same instantaneous reward function and employ the same optimal LQG control policy, which maximizes their control rewards, leads to the same state sequence distribution under both hypotheses, and therefore achieves the minimum value 0 of the Kullback–Leibler divergence.
Figure 9.
For , , , and (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average control reward achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average control reward achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 10.
For , , , and (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average privacy risk achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average privacy risk achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
6.6. Impact of Parameter
By fixing and , we study the impact of the parameter on the control reward of Agent B and the privacy risk. In addition to the default model parameters in Table 3, we allow and (without privacy), 10, 100, 1000 or 10,000. From Figure 11 and Figure 12, we make similar observations on the impact of as in Section 6.5, and they can be explained in the same way.
Figure 11.
For , , , and (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average control reward achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average control reward achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
Figure 12.
For , , , and (without privacy), 10, 100, 1000 or 10,000, comparison of the asymptotic average privacy risk achieved by the time-invariant optimal LQG policy of Agent B and the asymptotic average privacy risk achieved by the time-invariant optimal deterministic privacy-preserving LQG policy of Agent B.
7. Conclusions
In this paper, we consider the agent identity privacy problem in the scalar LQG control. We model this novel privacy problem as an adversarial binary hypothesis testing problem and employ the Kullback–Leibler divergence to measure the privacy risk. We then formulate a novel privacy-preserving LQG control optimization problem by taking into account both the accumulative control reward of Agent B and the privacy risk. We prove that the optimal deterministic privacy-preserving LQG control policy of Agent B is a linear mapping, which is consistent with the standard LQG. We further show that the random policy formed by adding an independent Gaussian random process noise to the optimal deterministic privacy-preserving LQG policy cannot improve the performance. We also give a sufficient condition that guarantees a time-invariant optimal deterministic privacy-preserving LQG policy in the asymptotic regime.
This research can be extended in several directions in our future work. Studying the general random policy of Agent B is an interesting extension. This theoretical study can also be extended to develop privacy-preserving reinforcement learning algorithms. The problem can further be formulated as a non-cooperative game of multiple agents with conflicting objectives, where some agents only aim to optimize their own accumulative control rewards while the other agents consider the agent identity privacy risk in addition to their own accumulative control rewards.
Author Contributions
Conceptualization, E.F. and Z.L.; methodology, E.F., Y.T. and Z.L.; validation, E.F., Y.T. and C.S.; formal analysis, E.F., Y.T. and Z.L.; experiment, C.S.; writing—original draft preparation, E.F. and Y.T.; writing—review and editing, Z.L. and C.W.; supervision, Z.L. and C.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported in part by the National Natural Science Foundation of China (62006173, 62171322) and the 2021-2023 China-Serbia Inter-Governmental S&T Cooperation Project (No. 6). We are also grateful for the support of the Sino-German Center of Intelligent Systems, Tongji University.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Proof of Theorem 1.
The proof is based on the backward dynamic programming.
We first consider the sub-problem of the final step. Given a probability distribution , the final step optimization problem of the deterministic control policy is
Since is given, the first term is fixed. Note the upper bound on the second term . The upper bound can be achieved by the optimal deterministic privacy-preserving LQG policy:
where
Then, the maximum achievable objective of the final step is
We then consider the sub-problem from the -th step until the final step. Given a probability distribution and the optimal deterministic privacy-preserving LQG policy in the final step , the sub-optimization problem of the deterministic control policy is
Since , it follows that
Given any , the objective of the inner optimization in (A5) is a concave quadratic function of . Therefore, we can obtain the optimal deterministic privacy-preserving LQG policy as
where
By using the optimal deterministic policies , the maximum achievable objective of the sub-problem is
The coefficient can be specified as
It can be easily justified that since .
We now consider the sub-problem from the -th step until the final step. Given a probability distribution and the optimal deterministic privacy-preserving LQG policies , the sub-optimization problem of the deterministic control policy is
Note that the objective functions in (A5) and (A11) have the same form. We have also proved that . Therefore, we can use the same arguments to obtain the optimal deterministic privacy-preserving LQG policy as
where
the maximum achievable objective of the sub-problem as
and .
We can further prove the optimal deterministic privacy-preserving LQG policies in the remaining steps and the maximum achievable weighted design objective of Agent B in Theorem 1 using the same arguments. □
Appendix B
Proof of Theorem 2.
The proof is based on the optimal deterministic privacy-preserving LQG policy of Agent B in Theorem 1.
For all and , let
where the second equality follows from
When the model parameters satisfy the condition in (22), is a contraction mapping, i.e., there exists such that for all and . From Banach's fixed point theorem, there is a unique fixed point with respect to the contraction mapping J such that
Appendix C
Proof of Theorem 3.
The proof is similar to that of Theorem 1.
We first consider the sub-problem of the final step. Given a probability distribution , the final step optimization problem of the linear Gaussian random policy with parameters and is
It is obvious that the optimal parameters are and , i.e., the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy in the final step.
Similarly, we then consider the sub-problem from the -th step until the final step. Given a probability distribution and the optimal linear Gaussian random policy in the final step , the sub-optimization problem of the linear Gaussian random policy with parameters and is
(A19) consists of two independent optimizations: the optimization of and the optimization of . Since , it follows that
The optimization of has a concave quadratic objective. Then we can obtain the optimal linear coefficient as
The optimization of has a decreasing objective. Then, the optimal variance is . Therefore, the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy in the -th step.
We then consider the sub-problem from the -th step until the final step. Given a probability distribution and the optimal linear Gaussian random policies , the sub-optimization problem of the linear Gaussian random policy with parameters and is
Note that the objective functions in (A19) and (A22) have the same form. Therefore, we can use the same arguments to show that the optimal linear Gaussian random policy reduces to the optimal deterministic privacy-preserving LQG policy in the -th step, i.e., and
We can further prove the optimal linear Gaussian random policies in the remaining steps reduce to the optimal deterministic privacy-preserving LQG policies based on the same arguments. □
References
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
- Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; Abbeel, P. Adversarial attacks on neural network policies. arXiv 2017, arXiv:1702.02284. [Google Scholar]
- Lin, Y.C.; Hong, Z.W.; Liao, Y.H.; Shih, M.L.; Liu, M.Y.; Sun, M. Tactics of adversarial attack on deep reinforcement learning agents. In Proceedings of the 2017 International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 3756–3762. [Google Scholar]
- Behzadan, V.; Munir, A. Vulnerability of deep reinforcement learning to policy induction attacks. In Proceedings of the MLDM 2017, New York, NY, USA, 15–20 July 2017; pp. 262–275. [Google Scholar]
- Russo, A.; Proutiere, A. Optimal attacks on reinforcement learning policies. arXiv 2019, arXiv:1907.13548. [Google Scholar]
- Goodfellow, I.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. In Proceedings of the ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Tramer, F.; Kurakin, A.; Papernot, N.; Goodfellow, I.; Boneh, D.; McDaniel, P. Ensemble adversarial training: Attacks and defenses. In Proceedings of the ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Sinha, A.; Namkoong, H.; Duchi, J. Certifying some distributional robustness with principled adversarial training. In Proceedings of the ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Zheng, S.; Song, Y.; Leung, T.; Goodfellow, I. Improving the robustness of deep neural networks via stability training. In Proceedings of the CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Yan, Z.; Guo, Y.; Zhang, C. Deep defense: Training DNNs with improved adversarial robustness. In Proceedings of the NIPS 2018, Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
- Shapley, L. Stochastic games. Proc. Natl. Acad. Sci. USA 1953, 39, 1095–1100. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Gleave, A.; Dennis, M.; Wild, C.; Kant, N.; Levine, S.; Russell, S. Adversarial policies: Attacking deep reinforcement learning. In Proceedings of the ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Pinto, L.; Davidson, J.; Sukthankar, R.; Gupta, A. Robust adversarial reinforcement learning. In Proceedings of the ICML 2017, Sydney, NSW, Australia, 6–11 August 2017; pp. 2817–2826. [Google Scholar]
- Horak, K.; Zhu, Q.; Bosansky, B. Manipulating adversary’s belief: A dynamic game approach to deception by design for proactive network security. In Proceedings of the GameSec 2017, Vienna, Austria, 23–25 October 2017; pp. 273–294. [Google Scholar]
- Crawford, V.P.; Sobel, J. Strategic information transmission. Econometrica 1982, 50, 1431–1451. [Google Scholar] [CrossRef]
- Saritas, S.; Yuksel, S.; Gezici, S. Nash and Stackelberg equilibria for dynamic cheap talk and signaling games. In Proceedings of the ACC 2017, Seattle, WA, USA, 24–26 May 2017; pp. 3644–3649. [Google Scholar]
- Saritas, S.; Shereen, E.; Sandberg, H.; Dán, G. Adversarial attacks on continuous authentication security: A dynamic game approach. In Proceedings of the GameSec 2019, Stockholm, Sweden, 30 October–1 November 2019; pp. 439–458. [Google Scholar]
- Li, Z.; Dán, G. Dynamic cheap talk for robust adversarial learning. In Proceedings of the GameSec 2019, Stockholm, Sweden, 30 October–1 November 2019; pp. 297–309. [Google Scholar]
- Li, Z.; Dán, G.; Liu, D. A game theoretic analysis of LQG control under adversarial attack. In Proceedings of the IEEE CDC 2020, Jeju Island, Korea, 14–18 December 2020; pp. 1632–1639. [Google Scholar]
- Osogami, T. Robust partially observable Markov decision process. In Proceedings of the ICML 2015, Lille, France, 6–11 July 2015. [Google Scholar]
- Sayin, M.O.; Basar, T. Secure sensor design for cyber-physical systems against advanced persistent threats. In Proceedings of the GameSec 2017, Vienna, Austria, 23–25 October 2017; pp. 91–111. [Google Scholar]
- Sayin, M.O.; Akyol, E.; Basar, T. Hierarchical multistage Gaussian signaling games in noncooperative communication and control systems. Automatica 2019, 107, 9–20. [Google Scholar] [CrossRef] [Green Version]
- Sun, C.; Li, Z.; Wang, C. Adversarial linear quadratic regulator under falsified actions. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022. [Google Scholar]
- Zhang, R.; Venkitasubramaniam, P. Stealthy control signal attacks in linear quadratic Gaussian control systems: Detectability reward tradeoff. IEEE Trans. Inf. Forensics Secur. 2017, 12, 1555–1570. [Google Scholar] [CrossRef]
- Ren, X.X.; Yang, G.H. Kullback-Leibler divergence-based optimal stealthy sensor attack against networked linear quadratic Gaussian systems. IEEE Trans. Cybern. 2021, 1–10. [Google Scholar] [CrossRef] [PubMed]
- Venkitasubramaniam, P. Privacy in stochastic control: A Markov decision process perspective. In Proceedings of the Allerton 2013, Monticello, IL, USA, 2–4 October 2013; pp. 381–388. [Google Scholar]
- Ny, J.L.; Pappas, G.J. Differentially private filtering. IEEE Trans. Autom. Control. 2013, 59, 341–354. [Google Scholar]
- Hale, M.T.; Egerstedt, M. Cloud-enabled differentially private multiagent optimization with constraints. IEEE Trans. Control. Netw. Syst. 2017, 5, 1693–1706. [Google Scholar] [CrossRef] [Green Version]
- Hale, M.; Jones, A.; Leahy, K. Privacy in feedback: The differentially private LQG. In Proceedings of the ACC 2018, Milwaukee, WI, USA, 27–29 June 2018; pp. 3386–3391. [Google Scholar]
- Hawkins, C.; Hale, M. Differentially private formation control. In Proceedings of the IEEE CDC 2020, Jeju Island, Korea, 14–18 December 2020; pp. 6260–6265. [Google Scholar]
- Dwork, C. Differential privacy. In Proceedings of the ICALP 2006, Venice, Italy, 10–14 July 2006; pp. 1–12. [Google Scholar]
- Wang, B.; Hegde, N. Privacy-preserving Q-learning with functional noise in continuous spaces. In Proceedings of the NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Alexandru, A.B.; Pappas, G.J. Encrypted LQG using labeled homomorphic encryption. In Proceedings of the ACM/IEEE ICCPS 2019, Montreal, QC, Canada, 16–18 April 2019. [Google Scholar]
- Arora, S.; Doshi, P. A survey of inverse reinforcement learning: Challenges, methods and progress. Artif. Intell. 2021, 297, 103500. [Google Scholar] [CrossRef]
- Soderstrom, T. Discrete-Time Stochastic Systems; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
- Baranga, A. The contraction principle as a particular case of Kleene’s fixed point theorem. Discret. Math. 1991, 98, 75–79. [Google Scholar] [CrossRef] [Green Version]
- Hershey, J.R.; Olsen, P.A. Approximating the Kullback Leibler divergence between Gaussian mixture models. In Proceedings of the IEEE ICASSP 2007, Honolulu, HI, USA, 15–20 April 2007. [Google Scholar]
- Durrieu, J.; Thiran, J.; Kelly, F. Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models. In Proceedings of the IEEE ICASSP 2012, Kyoto, Japan, 25–30 March 2012. [Google Scholar]
- Cui, S.; Datcu, M. Comparison of Kullback-Leibler divergence approximation methods between Gaussian mixture models for satellite image retrieval. In Proceedings of the IEEE IGARSS 2015, Milan, Italy, 26–31 July 2015. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).