UAV-Assisted Privacy-Preserving Online Computation Offloading for Internet of Things

Wei, Dawei; Xi, Ning; Ma, Jianfeng; He, Lei

doi:10.3390/rs13234853

Open AccessArticle

UAV-Assisted Privacy-Preserving Online Computation Offloading for Internet of Things

¹

School of Computer Science and Technology, Xidian University, Xi’an 710071, China

²

School of Cyber Engineering, Xidian University, Xi’an 710071, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2021, 13(23), 4853; https://doi.org/10.3390/rs13234853

Submission received: 17 October 2021 / Revised: 21 November 2021 / Accepted: 26 November 2021 / Published: 29 November 2021

(This article belongs to the Special Issue Multiple Access Edge Computing in Non-Terrestial and Terrestrial Internet of Things Networks)

Download

Browse Figures

Versions Notes

Abstract

:

Unmanned aerial vehicle (UAV) plays a more and more important role in Internet of Things (IoT) for remote sensing and device interconnecting. Due to the limitation of computing capacity and energy, the UAV cannot handle complex tasks. Recently, computation offloading provides a promising way for the UAV to handle complex tasks by deep reinforcement learning (DRL)-based methods. However, existing DRL-based computation offloading methods merely protect usage pattern privacy and location privacy. In this paper, we consider a new privacy issue in UAV-assisted IoT, namely computation offloading preference leakage, which lacks through study. To cope with this issue, we propose a novel privacy-preserving online computation offloading method for UAV-assisted IoT. Our method integrates the differential privacy mechanism into deep reinforcement learning (DRL), which can protect UAV’s offloading preference. We provide the formal analysis on security and utility loss of our method. Extensive real-world experiments are conducted. Results demonstrate that, compared with baseline methods, our method can learn cost-efficient computation offloading policy without preference leakage and a priori knowledge of the wireless channel model.

Keywords:

Internet of Things (IoT); computation offloading; differential privacy; unmanned aerial vehicle; deep reinforcement learning

1. Introduction

With the rapid development of unmanned aerial vehicles (UAVs), they are applied in the various applications, such as data collection and remote sensing among Internet of Things (IoT) sensors [1]. Although the benefits of high mobility, swift deployment, and low economic cost, large-scale application of UAV is limited by the computation capacity and energy. Recently, computation offloading is regarded as a promising solution for enabling UAV-assisted IoT to process huge data produced by IoT sensors [2,3].

Existing computation offloading methods for IoT focus on two main categories, i.e., one-shot optimization methods [4,5] and DRL-based methods [6,7]. Compared with one-shot optimization methods, the DRL-based methods can assist devices to learn computation offloading policy with higher energy efficiency and low time delay. Besides this benefit, DRL-based methods allow the devices to learn computation offloading policy without a priori knowledge of wireless channel model, which can be applied to solve the wireless channel dynamic between the UAV and IoT sensors [8].

Although the benefits of applying DRL into computation offloading, the vulnerabilities in DRL can be exploited by adversaries to interfere the UAV with learning policy [9], which hinders it from being applied to the real world. Figure 1 provides a case of computation offloading preference leakage over UAV-assisted IoT. The adversary misleads the UAV to offload tasks to malicious BSs by inversing the RL algorithm based on the observations of the offloading decision and the transmission radio link status.

1.1. Related Works and Challenges

(1) DRL-based Computation Offloading in UAV-Assisted MEC Networks: Zhou et al. [10] formulated the computing task scheduling problem as a constrained Markov decision process (CMDP) and solved it by proposing a novel risk-sensitive DRL method, where the UAV’s energy consumption violation is defined as the risk metric. Liu et al. [11] modified the vanilla Q-Learning algorithm to maximize the profit of the UAV under the constant cruising path. Recently, multi-UAV scenario also has been taken into consideration. Wei et al. [12] proposed a distributed DRL-based method that is enhanced by prioritized experience replay (PER), denoted as DDRL-PER. In their work, the proposed DDRL-PER method is adopted to solve the computation offloading problem over multi-UAV MEC network. Zhu et al. [13] decomposed the complex task and proposed a DRL-based method for the task offloading over UAV group, which optimizes the policy under the constraints of energy, dynamic network state. Seid et al. [14] designed a collaborative learning framework for computation offloading and resource allocation over multi-UAV-assisted IoT, where the UAV group was divided into multiple clusters. Then, the distributed deep deterministic policy gradient (DDPG) algorithms was adopted to solve the multi-cluster computation offloading problem. Sacco et al. [15] also proposed a multi-agent DRL-based method to solve the multi-tasks offloading problem over multi-UAV networks. Moreover, Gao et al. [16] combined game theory with DRL to solve the joint optimization of task offloading and multi-UAV trajectories.

Challenge 1: Existing DRL-based methods for computation offloading over UAV-assisted IoT take protecting computation offloading preference into consideration.

(2) Privacy Preserving in MEC: Considering the privacy preserving has not been widely researched over UAV-assisted MEC networks, we extend our review to the architectures of MEC networks that are not limited to UAV-assisted MEC networks. The privacy issues related to DRL-based offloading method were firstly investigated by He et al. [17]. The privacy constraints were generated in the value function, and the computation offloading policy was learned within several training episodes. Then, the authors adopted Q-learning method to solve the private problem. Furthermore, an active suppression method was proposed in Reference [18], which can prevent adversaries from eavesdropping. However, this method needs to modify the hardware, such as the antenna of mobile devices, in order to effectively suppress adversaries with different signal types. In addition, this active suppression method leads to excessive hardware modification costs when it is applied on a large scale UAV-assisted IoT. To improve the privacy-preserving learning efficiency of computation offloading policy, Min et al. [19] utilized the transfer learning and Dyna architecture to accelerate the learning speed, while He et al. [20] proposed the RL- and DRL-based methods based on generic framework of Lyapunov optimization, which is resistant to user presence inference attack. Existing works holds the assumption that there is no malicious BSs; thus, they mainly generate private constraints in value function to preserve the privacy. However, malicious BSs cannot be completely avoided in the real world. Once the value function is obtained by adversaries, the adversaries can compromise the mobile devices to offload computation tasks to malicious BSs.

Challenge 2: Existing privacy-preserving works that are designed for the DRL-based computation cannot prevent the malicious BSs from inferring the value function of the DRL algorithm.

1.2. Contributions

To solve the aforementioned privacy issues in existing works, we propose a differential privacy (DP)-based deep Q-learning (DP-DQL) method to solve the computation offloading preference leakage issue over UAV-assisted IoT. Our contributions can be summarized as follows.

We investigate a new privacy leakage issue within the online computation offloading over UAV-assisted IoT, namely computation offloading preference leakage.
We propose a differential privacy-based deep Q-learning (DP-DQL) method to protect computation offloading preference over UAV-assisted IoT. In the proposed DP-DQL method, the DQL is adopted as the basic framework for efficiently learning computation offloading policy without the a priori knowledge of the wireless channel model. Then, a generated Gaussian noise is generated in the policy updating process of DQL, which can protect the computation offloading preference. Finally, the learning speed of DP-DQL is accelerated by the PER technique [21] by replying the experience with high temporal-difference error.
We provide theoretical analysis for the differential privacy guarantee and utility loss. Then, the convergence, privacy protection, and cost efficiency of our method is demonstrated by extensive real-world experiments. The results show that our method can help UAV learn the cost-efficient computation offloading policy with the differential privacy guarantee.

The rest of this paper is organized as follows. Section 2 gives the necessary background, system model and problem formulation, details of the proposed method and theoretical analysis. Then, we design and conduct the experiments in Section 3. We discuss the impact of key parameters on the convergence and the limitations of the proposed method in Section 4. Finally, we conclude this paper in Section 5.

2. Materials and Methods

In this section, we first provide the background techniques. Then, we describe the system model and formulate the research problem of this paper. Finally, we show the proposed DP-DQL method and give the theoretical analysis in terms of privacy guarantee and utility loss.

2.1. Background Techniques

2.1.1. Differential Privacy

Differential privacy [22,23] establishes a strong standard for privacy guarantees in knowledge transfer, which aims to disable data analysis algorithms from distinguishing between two neighboring inputs. The key definitions are provided in the following.

Definition 1.

For any two neighboring inputs

z, z^{'} \in B

and subset of outputs

D \subseteq E

, the (

α, 𝓎

)-differential privacy can be guaranteed once the mechanism

C : B \to D

satisfies the inequality

\frac{1}{e^{α}} P (C (z) \in D) ⩽ P (C (z^{'}) \in D) + 𝓎 .

(1)

The definition of output’s global sensitivity is shown as follows.

Definition 2.

Given

\forall z, z^{'} \in B

as neighboring inputs, the output’s global sensitivity can be computed as

Γ_{C} = {sup}_{z, z^{'} \in B} ∥ C (z) - C (z^{'}) ∥,

(2)

where

∥ \cdot ∥

represents the norm function in

E

.

2.1.2. Deep Q-Learning

Deep Q-learning [24] leverages the deep neural network to approximate the value function, which aims to find a policy

Π^{*} (\cdot)

that can minimize the Bellman error as follows.

\frac{1}{2} {(Q^{Π} (s_{t}, a_{t}) - E [u_{t} + τ {max}_{a^{'}} Q^{Π} (s^{'}, a^{'})])}^{2} .

(3)

2.2. System Model and Problem Formulation

2.2.1. System Model

Figure 1 shows a UAV-assisted MEC system for IoT, which contains N fixed base stations (BSs), denoted as a set

N = {1, 2, . . ., N}

and a UAV.

(x_{n}, y_{n}, 0)

is the coordinate of BS n, and

(x, y, h)

is the coordinate of the UAV, where h indicates the flight height of UAV. Referring to Reference [25], UAV adopts Wi-Fi or LTE technology to communicate with BS and smart factories. At each time slot t, a computation task

T_{t}

(e.g., pattern recognition) is collected by the UAV from a smart factory, where

T_{t} \in T

. The task

T_{t}

is described as

T_{t} = (D_{t}, C_{t})

, where

D_{t}

is the maximum execution time, and

C_{t}

is the bits of task

T_{t}

. Morevoer, every

ξ

CPU cycles can process a bit in the task.

Due to the binary computation offloading is the special case of partial computation offloading [26,27], we investigate partial offloading in this paper for generality. To process a task, the UAV needs to decide how much of a task should be offload to BSs. Formally,

s_{t}

represents the offloaded proportion of a task

T_{t}

. To improve the performance of computation offloading decision, the UAV can adjust the offloaded proportion of a task. We define the CPU frequency of the BS and UAV as

f^{n}

and f, respectively. We assume that the each BS has the same CPU frequency, which can avoid some of fallacies of computing during BS parallel processes tasks [28]. Hence, the cost model is shown in the following. The time spent for locally processing a task

P_{t}^{L}

is

P_{t}^{L} = \frac{(1 - s_{t}) C_{t} ξ}{f} .

(4)

The local energy consumption

E_{t}^{L}

is

E_{t}^{L} = (1 - s_{t}) C_{t} {(f)}^{2} ξ β,

(5)

where the

β

is a coefficient related to the CPU architecture [29].

The cost on offloading task to BS n consists of two parts, i.e., the time cost and the energy cost. The time

P_{t}^{O}

for offloading and processing task is

P_{t}^{O} = \frac{s_{t} C_{t}}{r_{t}^{n}} + \frac{s_{t} C_{t} ξ}{f^{n}},

(6)

where the

C_{t} s_{t} / r_{t}^{n}

is the time for transmitting the task to BSs, and

ξ C_{t} s_{t} / f^{n}

represents the processing time in the BS.

Based on References [30,31], the energy consumption on transmitting a task to BS n depends on the transmit power

E P

, which can be shown as

E_{t}^{O} = E P \frac{s_{t} C_{t}}{r_{t}^{n}} .

(7)

In this paper, we assume that the BSs have sufficient energy. This is reasonable that fixed BSs are usually deployed in the area that has wired power supplied by grid. Hence, the energy consumption on the BS is not considered in this paper. For convenience, the major notations used in this paper are summarized in Table 1.

2.2.2. Threat Model and Privacy Issue

In this paper, we consider a new privacy leakage issue over UAV-assisted IoT, namely computation offloading preference leakage, which is shown in Figure 1. In this paper, we assume that the adversary knows the inputs and formats of UAV’s computation offloading policy in advance. It is reasonable to make the assumption because:

The BSs can provide customized services for UAV based on the formats of UAV’s computation offloading policy and the inputs of the policy.
Once the adversary induce the BS, it can monitor the inputs and formats of UAV’s computation offloading policy.

In accordance with above assumptions, the adversaries monitor UAV’s computation offloading decision and the input of computation offloading policy, e.g., radio link transmission rate between UAV and BSs. The adversaries utilize an algorithm, such as inverse reinforcement learning algorithm, to infer the UAV’s computation offloading preference based on monitoring results. Furthermore, the adversaries construct specific inputs for UAV’s computation offloading policy, e.g., improving radio link transmission rate with the help of malicious BSs, which can mislead UAV to offload computation tasks onto the malicious BSs.

2.2.3. Design Goals

To solve above privacy issues, our proposed method should reach the goals as follows.

Differential privacy guarantee: DP-DQL method should provide $(α, 𝓎)$ -differential privacy for UAV during learning process so that the value function of the UAV’s computation offloading policy will not be inferred by the adversaries based on the system state and offloading decision.
Minor utility loss: DP-DQL method should guarantee that, compared with the traditional DQL method, the performance of the DP-DQL method will not be significantly degraded by adding the differential privacy mechanism.

2.2.4. Problem Formulation

As claimed in Section 2.1.2, we formulate the problem of privacy-preserving computation offloading in UAV-assisted MEC network for IoT as the Markov decision process (MDP), which is defined as a tuple

M = (S, A, P, R)

.

(1): System state: The system state is the offloaded proportion. Formally, $s_{t} \in S$ ranges from 0 to 1.
(2): Action space: The UAV adjusts the offloaded proportion of a task by increasing or decreasing from 0 to 0.25. Formally, $a_{t} \in [0, 0.25]$ .
(3): Reward function: The weighted average of energy and time costs is adopted as the reward function, which is given as follows:

u_{t} = - η (E_{t}^{O} + E_{t}^{L} + P_{t}^{L} + P_{t}^{O}),

(8)

where

η

is the normalization function. To meet the differential private requirements, the value domain of the reward function is constrained from 0 to 1. The proof will be given in Section 2.4.1.

2.3. DP-Based Deep Q-Learning for Computation Offloading

In this section, we firstly give an overview of the DP-DQL. Then, the details for the DP-DQL are provided.

2.3.1. Overview

The steps of online learning are shown in Figure 2, which consists of four stages.

Initialization: Initializing the parameters used in DP-DQL approach.
Exploring: The UAV executes offloading action and obtains reward from the environment.
Generating differential disturbance: The UAV generates the specific Gaussian noise to prevent the computation offloading preference leakage.
PER-based policy updating: The UAV updates computation offloading policy with the help of PER technique.

Algorithm 1 shows the details of the DP-DQL method, and its description is given as follows.

Algorithm 1 Differential Privacy-based Deep Q-Learning for computation offloading method

1:: Initialize the parameters of DP-DQL method;
2:: for $j \in [1, T P]$ do
3:: Reset the environment;
4:: Reset differential dict $l (\cdot)$ ;
5:: for $t \in [0, V - 1]$ do
6:: Conduct the action $a_{t} = arg {max}_{a} l (s_{t}) + Π_{ζ} (s_{t}, a)$ ;
7:: Reach to the state $s_{t + 1}$ and get a reward $u_{t}$ ;
8:: Compute the maximum priority $𝓏_{t}$ via Equation (9) and store it with transition;
9:: if $t \equiv 0 m o d A$ then
10:: for $i \in [1, A]$ do
11:: Generating differential disturbance $δ_{t}$ ;
12:: Sample transition via Equation (13);
13:: Compute importance-sampling weight via Equation (14);
14:: Compute TD-error via Equation (9);
15:: Update the priority of the transition;
16:: Compute accumulated policy gradient $ψ$ by Equation (15);
17:: end for
18:: Softly updating;
19:: end if
20:: end for
21:: end for

2.3.2. Initialization (Lines 1–4)

The online policy

Π (\cdot)

and target policy

Π^{'} (\cdot)

are initialized with random weights

ζ

and

ζ^{'}

(Lines 1), where the target policy

Π^{'} (\cdot)

is used to slow the updating rate of online policy

Π (\cdot)

and, hence, improve the stability of the algorithm. The environment is constructed for learning computation offloading policy. Then, the differential dict

l (\cdot)

is initialized and reset to

N U L L

every

T P / H

training episodes. (Line 4). The differential dict

l (\cdot)

is defined as a Dictionary structure, where the key size and value size are set to

| A |

and 2, respectively.

2.3.3. Exploring (Lines 5–8)

If a task

T_{t}

is collected by UAV at time slot t, the UAV makes offloading decision by online computation offloading policy

Π (s_{t})

under differential disturbance

l (s_{t})

(Line 6). After receiving the reward

u_{t}

(Line 6), the UAV obtains a new state

s_{t + 1}

(Line 7). Based on old state

s_{t}

, action

a_{t}

, new state

s_{t + 1}

, and reward

u_{t}

, the UAV constructs a transition. Then, the UAV compute the priority

𝓏_{t}

and store it in replay buffer

Z

with the transition (Line 8). Based on Reference [21], the priority

𝓏_{t}

can be computed as follows:

𝓏_{t} = | Π_{ζ} (s_{t}, a_{t}) - Θ | .

(9)

The

Θ

in Equation (9) is given as follows:

Θ = \{\begin{matrix} u_{t} & i f t = = V \\ Π^{'} (s_{t + 1}, a {^{'}}_{t + 1}) τ + u_{t} & t < V \end{matrix},

(10)

where

a {^{'}}_{t + 1} = Π^{'} (s_{t + 1})

.

2.3.4. Generating Differential Disturbance (Lines 9–11)

Once replay buffer is filled with the transitions (Line 9), a mini-batch of the transitions will be sampled from replay buffer to update online and target policies (Line 10). Gaussian process

δ_{t} \sim Y (υ_{t}, ϕ_{t})

generates a differential disturbance

y_{t}

for each action

a \in A

(Line 11). The differential disturbance

δ_{t}

is appended to differential dict

l (s_{t}) \leftarrow δ_{t}

, then

l (\cdot)

is sorted. The

υ_{t}

and

ϕ_{t}

are given below based on [32]:

υ_{t} = \frac{(e^{Φ ϵ} - e^{- Φ ϵ}) l (s_{t - 1}) + (e^{Φ Ω} - e^{- Φ Ω}) l (s_{t + 1})}{e^{Φ Λ} - e^{- Φ Λ}},

(11)

ϕ_{t} = 1 - [\frac{(e^{Φ ϵ} - e^{- Φ ϵ}) e^{Φ ϵ}) + (e^{Φ Ω} - e^{- Φ Ω}) e^{Φ Ω})}{e^{Φ Λ} - e^{- Φ Λ}}],

(12)

where

Φ = {(4 (1 + Ψ) / A)}^{- 1} γ

,

Ψ

represents the balance factor,

ϵ = {∥ s_{t} - s_{t - 1} ∥}_{2}

,

Λ = {∥ s_{t + 1} - s_{t - 1} ∥}_{2}

, and

Ω = {∥ s_{t + 1} - s_{t} ∥}_{2}

.

2.3.5. PER-Based Policy Updating (Lines 12–21)

In this stage, the accumulated policy gradient

ψ

is calculated based on PER technique. Specifically, the sampling probability of a transition is computed by (Line 12)

P (i) = \frac{𝓏_{i}^{θ_{1}}}{\sum_{ℊ} 𝓏_{ℊ}^{θ_{1}}},

(13)

where the

ℊ

is the index of a transition. Then, the function of importance sampling weight (Line 13) is to decrease bias referred to Reference [21], which is given as follows:

Υ_{i} = \frac{{(| Z | p P (i))}^{- θ_{2}}}{{max}_{A} Υ_{i}},

(14)

where

P (\cdot)

is the sampled probability,

| Z |

represents the replay buffer size,

θ_{2}

is used to determine how much priority affects the sampling probability, and

Υ_{i}

is an importance factor of transition. Hence, the TD-error is calculated via Equation (9) (Lines 14), and the absolute TD-error value is adopted to update the ith transition priority (Lines 15). Finally, the accumulated policy gradient

ψ

is computed based on a chain rule in Reference [33] as shown below (Line 16):

ψ \leftarrow ψ + l (s_{i}) + Υ_{i} \nabla_{ζ} Π (s_{i}, a_{i}) 𝓏_{t},

(15)

where

\nabla_{ζ} Π (s_{i}, a_{i})

is the gradient of the online policy for vector

(s_{n}, a_{n})

. Then, the online policy

Π (\cdot)

is softly update as follows:

ζ^{Π} = ζ + ψ ω,

(16)

where

ω

is a soft update coefficient (Line 18). Then, set

ψ \leftarrow 0

. Finally, the target policy

Π^{'} (\cdot)

is updated (Line 18).

2.4. Theoretical Analysis

2.4.1. Differential Privacy Guarantee

To analyze the differential privacy guarantee of the DP-DQL method under the adversary model in Section 2.2.2, we firstly provide a necessary theorem.

Theorem 1.

Given the sample path k from Gaussian process

F (0, ρ^{2} K)

,

{max}_{μ \in [0, 1]} k ()

exists with high probability. For each

w > 0

in Sobolev space

G^{1}

, we have

P (w + 8.68 \sqrt{Φ} ρ ⩽ {max}_{⪵ \in [0, 1]} k (⪵)) ⩽ \frac{1}{e^{w^{2} / 2}},

(17)

where the

Φ = \frac{1}{4 γ (1 + Ψ) / A}

.

Proof.

Firstly, we define some necessary notations. For a sample path k, we define

k_{0 p} = {k (ℯ_{0}), k (ℯ_{2}), \dots, k (ℯ_{2 p})}

and

k_{1 p} = {k (ℯ_{1}), k (ℯ_{3}), \dots, k (ℯ_{2 p - 1})}

, where

p ⩾ 1

,

ℯ_{𝒾} = 𝒾 / 2 p

,

(𝒾 = 0, \dots, 2 p)

.

Then, we consider the base case that the sample path k contains two elements, i.e.,

k_{2} = k (0), k (1)

. Therefore, the expectation

E

is

\begin{matrix} E (max k_{2}) & = \frac{1}{2} E (| k (0) - k (1) |) \\ = \sqrt{\frac{(1 - e^{- Φ})}{π}} ρ \\ ⩽ \sqrt{\frac{Φ}{π}} ρ, \end{matrix}

(18)

which is based on

k (0) - k (1) \sim F (0, 2 (1 - e^{- Φ}) ρ^{2})

.

Finally, we consider the case of

| k (⪵) | > 2

. Specifically, we aim to bound the expectation

E (max k_{p})

for all

p > 1

. Referring to Chernoff bound, we have

e^{𝓉 E ({max}_{𝒾} 𝒿_{𝒾})} ⩽ E (e^{𝓉 {max}_{𝒾} 𝒿_{𝒾}}) ⩽ p e^{𝓉^{2} ρ^{2} / 2},

(19)

where

𝒿_{𝒾}

is the p independent Gaussian random variables based on

F (0, ρ_{𝒿}^{2})

. Let

𝓉 = \sqrt{2 ln p} / ρ_{𝒿}

, we have

{max}_{𝒾} 𝒿_{𝒾} ⩽ \sqrt{2 ln p} / ρ_{𝒿}

.

Let

𝓂_{p} = E (max k_{2^{p} 0})

. Due to

k_{2^{p} 0} \subset k_{2^{p + 1} 0}

, the series

𝓂_{p}

is non-decreasing. Then, we derive the upper bound of

𝓂_{p + 1} - 𝓂_{p}

. Referring to Reference [32], we have

\exists 𝒿, E (max (0, max k_{2^{p} 1} - k_{2^{p} 0})) ⩽ E (max (0, 𝒿))

. The bound of

E (max (0, 𝒿))

is given as follows.

\begin{matrix} e^{\frac{E (𝒿, 0)}{\sqrt{Φ / 2^{p}} ρ}} & ⩽ E (e^{\frac{E (𝒿, 0)}{\sqrt{Φ / 2^{p}} ρ}}) \\ ⩽ E (max (e^{\frac{𝒿}{\sqrt{Φ / 2^{p}} ρ}}, 1)) \\ ⩽ E (e^{1 + \frac{𝒿}{\sqrt{Φ / 2^{p}} ρ}}) \\ ⩽ e^{\sqrt{p} + 1} + 1 . \end{matrix}

(20)

Hence,

E (max (𝒿, 0)) ⩽ (\sqrt{p} + 1 + \frac{1}{e^{\sqrt{p} + 1}}) \sqrt{\frac{Φ}{2^{p}}} ρ .

(21)

Finally, we have

𝓂_{p + 1} - 𝓂_{p} ⩽ (\sqrt{p} + 1 + \frac{1}{e^{\sqrt{p} + 1}}) \sqrt{\frac{Φ}{2^{p}}} ρ .

(22)

Based on induction, we can get

\forall p

,

𝓂_{p} ⩽ \sqrt{\frac{Φ}{π}} ρ + \sum_{𝒾}^{\infty} (\sqrt{𝒾} + 1 + \frac{1}{e^{\sqrt{𝒾} + 1}}) \sqrt{\frac{Φ}{2^{𝒾}}} ρ < 8.68 \sqrt{Φ} ρ .

(23)

Referring to Reference [34],

E (max k)

shares the same upper bound of

𝓂_{p}

almost surely when k is continuous with probability one. Hence, Theorem 1 follows. □

Theorem 2.

The proposed DP-DQL method ensures the (

α, 𝓎 + H e^{- {(2 Ψ - 8.68 \sqrt{Φ} ρ)}^{2} / 2}

)-differential privacy once neighboring rewards

{∥ u^{'} - u ∥}_{\infty} ⩽ 1

, if

ρ < \frac{Ψ}{8.68 \sqrt{Φ}},

(24)

and

ρ ⩾ 1.41 σ \sqrt{ln \frac{1.25}{𝓎} ln (\frac{α}{𝓎} + e) \frac{V}{A}},

(25)

where

Φ = {(4 {(A)}^{- 1} γ (1 + Ψ))}^{- 1}

,

σ = \frac{X}{A} \sqrt{(8 γ (1 + Ψ)) (A + 16 γ (1 + Ψ))}

, X represents the Lipschitz constant, A is the mini-batch size, and V is the training episodes.

Proof.

To prove this theorem, we firstly prove that the neighboring value functions cannot be distinguished with differential privacy guarantee in one policy updating step. Then, we extend conclusion to multiple policy updating steps. Referring to Reference [35], we have

{∥ Π^{*} (\cdot) - Π (\cdot) ∥}_{\infty} ⩽ \frac{γ X (l (s_{t + 1} - l (s_{t})) + 2)}{A},

(26)

where

Π (\cdot)

is the value function from u and

Π^{*} (\cdot)

is the value function from

u^{*}

. Note that

{∥ u^{*} - u ∥}_{\infty} ⩽ 1

. The inequality

| Π (\cdot) - c | ⩽ 2 γ X (1 + Ψ) / A

holds with at least probability of

1 - e^{- {(2 Ψ - 8.68 \sqrt{Φ} ρ)}^{2} / 2}

, where

Π^{#} (\cdot)

is the neighboring value function of

Π (\cdot)

, according to Theorem 1. Then, for each

{∥ u^{*} - u ∥}_{\infty} ⩽ 1

,

{∥ Π^{*} (\cdot) - Π (\cdot) ∥}_{\infty} ⩽ 4 γ X (1 + Ψ) / A

holds based on triangle inequality with the same

Π^{#} (\cdot)

. Let

k = Π^{*} (\cdot) - Π (\cdot)

, we can obtain Equation (27) in the Sobolev space.

{∥ k ∥}_{G}^{2} ⩽ (1 + \frac{Φ}{2}) {(\frac{4 γ X (1 + Ψ)}{A})}^{2} + \frac{X^{2}}{2 Φ} .

(27)

Let

Φ = B / (4 γ X (1 + Ψ))

, and the Equation (27) can be rewritten as

{∥ k ∥}_{G}^{2} ⩽ \frac{16 X^{2} γ^{2} {(1 + Ψ)}^{2} + 4 γ A X^{2} (1 + Ψ)}{A^{2}} .

(28)

Referring to Reference [36], we can get

P [max_{u, u^{'}} ∥ u^{'} - u ∥ ⩽ α] > 1 - (𝓎 + H e^{- \frac{{(2 Ψ - 8.68 \sqrt{Φ} ρ)}^{2}}{2}}),

(29)

by adding Gaussian disturbance

l \sim F (0, ρ^{2} K)

to

Π (\cdot)

within a policy updating step on the basis of Equation (25). This conclusion can be generalized to multiple policy updating steps by the theorem in Reference [37]. Hence, Theorem 2 is proved. □

2.4.2. Minor Utility Loss

Before giving the final proof of utility loss, we show a necessary theorem and its proof.

Theorem 3.

Assume that

U_{a}^{#}

is the optimal result of the inequality constraint problem

\begin{matrix} \underset{U_{0}, U_{1}, s, U_{n}}{m a x i m i z e} & \sum_{a} U_{a}^{T} u {^{'}}_{a} \\ s . t . & \sum_{a} e^{T} U_{a} ⩽ \frac{| S |}{1 - τ}, \\ U_{a} ⩾ 0 \\ \sum_{a} (I - τ χ_{a}^{T}) U_{a} = e, \end{matrix}

(30)

and we can obtain

E [\sum_{a} U_{a}^{# T} u_{a}] ⩾ \sum_{a} U_{a}^{* T} u_{a} - \frac{2 \sqrt{2} | S | ρ}{\sqrt{π} (1 - τ)} .

(31)

Proof.

Given

u_{a}^{#} = u_{a} + δ_{a}

, we can get Equation (32) based on the strong duality and non-negativity in (♠) and (▪), respectively.

\begin{matrix} E [\sum_{a} U_{a}^{# T} u_{a}] & = E [\sum_{a} U_{a}^{# T} (u_{a}^{#} - δ_{a})] \\ ⩾ E [\sum_{a} U_{a}^{* T} (u_{a}^{#} - \sum_{a} U_{a}^{# T} δ_{a})] \\ = E [\sum_{a} U_{a}^{* T} (u_{a} + δ_{a}) - \sum_{a} U_{a}^{# T} δ_{a}] \\ \overset{♠}{=} \sum_{a} U_{a}^{* T} u_{a} + E [\sum_{a} {(U_{a}^{*} - U_{a}^{#})}^{T} δ_{a} \\ \overset{▪}{⩾} \sum_{a} U_{a}^{* T} u_{a} - \frac{2}{| S | (1 - τ)} E [\sum_{a} {∥ δ_{a} ∥}_{1}] \\ = \sum_{a} U_{a}^{* T} u_{a} - \frac{2 \sqrt{2} ρ | S |}{\sqrt{(1 - τ) π}} . \end{matrix}

(32)

□

Finally, we prove the convergence of the proposed method through Theorem 4.

Theorem 4.

Compared with vanilla DQL, the utility loss of the our DP-DQL method tends to be 0 even under the worst case, where

H = 1

.

Proof.

We have Equation (33) by solving Equation (30)

E [\sum_{a} u_{a} U_{a}^{# T}] ⩾ \sum_{a} u_{a} U_{a}^{* T} - \frac{2 \sqrt{2} ρ | S |}{\sqrt{π} (1 - τ)},

(33)

according to Theorem 3. Further, according to Reference [32], the equation

E [u^{#} e^{T}] = E [\sum_{a} u_{a} U_{a}^{# T}]

holds. Based on the strong duality, we have

\sum_{a} u_{a} U_{a}^{* T} = u^{*} e^{T}

. As

E {[∥ u^{*} - u^{#} ∥]}_{1} = U^{T} u^{*} - E [e^{T} u^{#}]

. Considering state space is infinity, the upper bound of

E {[∥ u^{*} - u^{#} ∥]}_{1}

tends to be zero. Moreover, Reference [24] guarantees the convergence of vanilla DQL. Therefore, our DP-DQL method achieves minor utility loss compared with vanilla DQL and converges within finite training episodes. □

3. Results

In this section, we design four experiments to evaluate the convergence, privacy protection and cost efficiency of the DP-DQL method.

3.1. Experiment Settings

Scenario: In this paper, the UAV flies around the area at a constant height to collect data and make computation offloading decision based on its offloading policy. The device for locally processing task is a Raspberry Pi 3B+, which is adopted as the airborne computer. Figure 3a shows the architecture of UAV used for experiments, and the flying area is shown in Figure 3b. In this paper, we firstly randomly deploy three laptops in Figure 3b to represent the BSs. Then, we deploy three actual rodeside units (RSU) to evaluate the variation of results. Because the computing power of actual RSU is similar to that of Raspberry Pi [38,39], the actual RSU is represented by the Raspberry Pi 4.

Parameter settings: The radio link transmission rate from n-th BS to the UAV is

r_{t}^{n} \in

{2 Mb/s, 6 Mb/s, 10 Mb/s}. At each time slot t, the values of data size

C_{t}

for the task

T_{t}

is

C_{t} \in

{20 Mb, 40 Mb, 60 Mb}, and each task should be processed within

D_{t} = 3

s. During task processing, each bit needs

ξ = 1000

CPU cycles to process [40]. As defined in Reference [41], the transmit power form BSs to the UAV

E P

is

0.2

W. The proposed DP-DQL method is implemented by Pytorch 1.1 and Python 3.6. We adopt a four-layer fully-connected feedforward neural network to implement the online and target policies. The learning rate

γ

is set to 0.001, while the discounted factor

τ

is set to 0.999. The replay buffer size

| Z |

is 1024. The size of mini-batch A is 128. During online learning, there are

T P = 100

training episodes and

V = 50

learning steps in every training episode. For convenience, the all values of parameters are summarized in Table 2.

3.2. Baseline Methods

We evaluate the efficiency of our DP-DQL method by comparing it with two baseline methods:

Greedy: This method has been widely adopted as a baseline method, where all tasks are fully offloaded to the BSs.
Deep Q-learning with non-differentially-private mechanism (DQL-non-DP) [19]: We adopt a model-free method designed for healthcare IoT network [19] and adjust it according to the system state space of this paper. This method can learn the cost-efficient computation offloading policy and serve as the baseline of cost efficiency for the DP-DQL method. The DQL-non-DP method shares the same hyperparameters with the DP-DQL method in the following experiments.

3.3. The Convergence of the DP-DQL Method

In this paper, the proposed DP-DQL method is modified at the action selection step in exploring and the accumulated policy gradient computing step in PER-based policy updating with the aim of protecting computation offloading preference. However, these two modifications may affect the learning performance of the proposed method. To evaluate the potential effect, we vary the

σ

to test the impact of the modifications on the learning performance of the proposed method. According to Theorem 2, once the other hyperparameters and experiment parameters are determined,

σ

will be an important parameter in the DP-DQL method to determine privacy level. In this paper, we set

σ \in {0, 0.2, 0.4, 0.6, 0.8}

. Note that

σ = 0

is a special case that privacy-preserving mechanism is not applied. Figure 4 shows the results. The results indicate that, with

σ

increasing, DP-DQL method needs more training episodes to approximate the learning performance than the non-privacy case. It raises a problem that what value of

σ

should we choose to achieve the best learning performance while preserving computation offloading preference. From Figure 4, it can be seen that the DP-DQL method allows UAV to learn the a stability computation offloading policy within 20 TPs, 20 TPs, and 34 TPs under the case of

σ = 0

,

σ = 0.2

, and

σ = 0.4

, respectively. When

σ > 0.4

, the DP-DQL method will continue to oscillate with no sign of convergence. Hence, we can see that DP-DQL method performs well when

σ ⩽ 0.4

.

3.4. The Privacy Protection of the DP-DQL Method

According to the threat model in Section 2.2.2, the adversary tries to increase the similarity of the distributions of the vanilla value function and recovered value function in the state space

S

. In this paper, we adopt the t-test to quantitatively evaluate the above similarity. To conduct the t-test, we firstly make a set of hypotheses, including

K_{0}

and

K_{W}

. Note that the subscript 0 is the null hypothesis, and

W

is the alternative hypothesis.

$K_{0}$ : The distributions of the vanilla value function is the same as that of recovered value function in the state space $S$ .
$K_{W}$ : The distributions of the vanilla value function is not the same as that of recovered value function in the state space $S$ .

Referring to Section 3.3, we set the value function with

σ = 0

as the vanilla value function, while the value function with

σ = 0

,

σ = 0.2

,

σ = 0.4

,

σ = 0.6

, and

σ = 0.8

is set as recovered value function. We randomly generate the twenty pairs (

r_{t}^{n}

,

C_{t}

), and input them to the vanilla value function and recovered value function, respectively. Then, we obtain two sets of values. Finally, we calculate the p-value of two sets. Table 3 shows the results. It can be seen that the p-value is less than 0.001 in most cases, except the case of

σ - 0

. Hence, the null hypothesis

K_{W}

is accepted with strong evidence. It indicates that the value function of proposed DP-DRL method cannot be recovered by inverse RL.

3.5. The Cost Efficiency of the DP-DQL Method

In this experiment, our aim is to evaluate how much the privacy preserving mechanism in the proposed method affects its cost efficiency, compared to the baseline methods. We adopt the weighted average of the cost

P_{t}^{L}, E_{t}^{L}, P_{t}^{O},

and

E_{t}^{O}

as the comparing metric, which is calculated based on Equation (8). Note that, due to the different ranges of the cost

P_{t}^{L}, E_{t}^{L}, P_{t}^{O},

and

E_{t}^{O}

, they are normalized as Equation (8). According to Section 3.3, we set

σ = 0.2

and

σ = 0.4

in the experiments. To avoid statistical deviation, we perform each experiment with 10 random seeds.

Figure 5 shows the influence of the radio transmission rate

r_{t}^{n}

on the cost efficiency of DP-DQL and baseline methods. The radio transmission rate

r_{t}^{n}

is set to be 2 Mb/s, 6 Mb/s, and 10 Mb/s, respectively, while the bits of a task is

C_{t}

= 40 Mb. Compred with the baseline method, i.e., DQL-non-DP method, we can see that DP-DQL method has little cost efficiency reduction. For instance, we select the DP-DQL method with

σ = 0.2

to compare with DQL-non-DP method. When two methods converge, the average cost of DP-DQL method is 15%, 18%, and 20% less than that of DQL-non-DP method in the case of

r_{t}^{n} = 2

Mb/s,

r_{t}^{n} = 6

Mb/s, and

r_{t}^{n} = 10

Mb/s. Moreover, we observe that the proposed DP-DQL method requires more training episodes, which varies from 20 TPs to 35 TPs, to achieve the similar cost than DQL-non-DP method. The extra training episodes used by the proposed DP-DQL method indicate the tradeoff between privacy and cost. Furthermore, we can find that, with the increase of radio transmission rate

r_{t}^{n}

, the Greedy method can achieve better the cost efficiency of DP-DQL method in the cases of

r_{t}^{n} = 6

Mb/s and 10 Mb/s. The reason is that promising wireless channel status reduces the transmitting cost. However, compared with DP-DQL method, the Greedy method cannot preserve computation offloading preference.

Moreover, the cost efficiency of DP-DQL is evaluated under different task bits

C_{t}

, compared with baseline methods. The bits of a task

C_{t}

is set to be 20 Mb, 40 Mb, and 60 Mb, while the radio transmission rate

r_{t}^{n}

is 10 Mb/s. The results in Figure 6 show that DP-DQL method and DQL-non-DP method outperform the Greedy method in most cases, except the early learning stage of the case of

C_{t} = 20

Mb. The largest improvement of rewards is in the case of

C_{t}

= 60 Mb, which is 260%. We can see that the difference in rewards between the DP-DQL method and DQL-non-DP method varies relatively little with

C_{t}

. The maximum difference is only 12%, indicating that the proposed method is not overly affected by privacy-preserving mechanisms in terms of cost efficiency and can learn a cost-efficient computation offloading policy. The reason is that offloading all of a task to the BS will not cause too much cost in transmission time when the size of a task is small. With the size of a task increases, transmission time becomes the bottleneck of the cost efficiency of the Greedy method.

3.6. The Performance of the DP-DQL Method Deployed in a Realistic Scenario

Through changing the laptops to the actual RSUs in the scenario, we re-examine the performance of the proposed method and DQL-non-DRL method in terms of cost efficiency. The parameters are set the same as Section 3.5. Figure 7 and Figure 8 show the cost efficiency of the proposed method and baseline methods under different transmission rate

r_{t}^{n}

and bits of a task

C_{t}

, respectively. It can be seen that the proposed method can still converge with finite TPs. However, by pairwise comparing Figure 5 with Figure 7, and Figure 6 with Figure 8, we can see that the experiments with actual RSUs increase the cost. The reason is that the weaker CPU computing power of the actual RSU increases the time cost and eventually leads to an increase in the total cost. Specifically, based on the assumption that the BS energy consumption is not considered in Section 2.2.1, the energy cost does not change when the actual RSUs are used to replace the laptops, which is still the local energy cost

E_{t}^{L}

plus the transmission energy cost

E_{t}^{O}

. However, the time

P_{t}^{O}

for offloading and processing will increase due to the weaker CPU computing power of the actual RSUs. To further verify our point, setting

σ = 0.4

, Figure 9 shows the proportion of time cost in the total cost. It can be seen that the proportion of time cost to the total cost increases after replacing with actual RSUs. Hence, changing laptops to the actual RSUs will increase the time cost and eventually lead to an increase in the total cost.

4. Discussion

In this section, we firstly discuss the impact of the parameter value on the learning performance of the proposed method. Then, we discuss the limitations of the proposed method.

4.1. Impact of the Key Parameters on the Convergence of DP-DQL Method

As shown in Line 4, Algorithm 1, the reset factor H is the updating frequency of the differential dict

l (\cdot)

, which can affect the convergence of the proposed method. In this experiment, we evaluate the influence of value selection of reset factor H on DP-DQL performance by setting the

σ = 0.2

. Figure 10 shows the results. In the figure, it can be seen that the average reward is not influenced by the value of reset factor H.

4.2. Limitations and Future Works

The proposed method supports only one UAV to offload the task to single BS at one time slot. This can cover most of the existing daily application scenarios, such as grid inspection, remote sensing, etc. However, the limited endurance of a single UAV limits its ability to perform more complex tasks. Multi-UAV collaboration provides a viable idea, but the proposed method is not able to support privacy-preserving computation offloading in multi-UAV-assisted IoT scenarios. In the future, techniques, such as distributed reinforcement learning and local differential privacy, offer potential solutions to the above needs.

5. Conclusions

In this paper, we propose a differential privacy-based deep Q-learning method for computation offloading over UAV-assisted IoT, which can protect UAV’s computation offloading preference. The formal analysis shows that the proposed DP-DQL method meets the design goals, i.e., differential privacy guarantee and minor utility loss. Furthermore, we evaluate the convergence and privacy of DP-DQL method by the real-world experiment. The results indicate that our DP-DQL method can achieve long-term energy performance under the privacy guarantee, compared with baseline methods. In the future, we will further investigate various privacy issues on DRL-based computation offloading methods.

Author Contributions

Conceptualization, D.W. and N.X.; methodology, D.W.; software, L.H.; validation, D.W. and L.H.; formal analysis, D.W.; investigation, D.W.; writing—original draft preparation, D.W.; writing—review and editing, N.X.; visualization, D.W.; supervision, J.M.; funding acquisition, N.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Key R&D Program of China (Grant No. 2018YFE0207600), National Natural Science Foundation of China (No. 61902291), the Fundamental Research Funds for the Central Universities (Project No. XJS201503) and China Postdoctoral Science Foundation Funded Project (2019M653567).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Fraga-Lamas, P.; Ramos, L.; Mondéjar-Guerra, V.M.; Fernández-Caramés, T.M. A Review on IoT Deep Learning UAV Systems for Autonomous Obstacle Detection and Collision Avoidance. Remote Sens. 2019, 11, 2144. [Google Scholar] [CrossRef] [Green Version]
Qiu, T.; Chi, J.; Zhou, X.; Ning, Z.; Atiquzzaman, M.; Wu, D.O. Edge Computing in Industrial Internet of Things: Architecture, Advances and Challenges. IEEE Commun. Surv. Tutor. 2020, 22, 2462–2488. [Google Scholar] [CrossRef]
Zhou, F.; Hu, R.Q.; Li, Z.; Wang, Y. Mobile Edge Computing in Unmanned Aerial Vehicle Networks. IEEE Wirel. Commun. 2020, 27, 140–146. [Google Scholar] [CrossRef] [Green Version]
Hong, Z.; Chen, W.; Huang, H.; Guo, S.; Zheng, Z. Multi-Hop Cooperative Computation Offloading for Industrial IoT-Edge-Cloud Computing Environments. IEEE Trans. Parallel Distrib. Syst. 2019, 30, 2759–2774. [Google Scholar] [CrossRef]
Chen, W.; Zhang, Z.; Hong, Z.; Chen, C.; Wu, J.; Maharjan, S.; Zheng, Z.; Zhang, Y. Cooperative and Distributed Computation Offloading for Blockchain-Empowered Industrial Internet of Things. IEEE Internet Things J. 2019, 6, 8433–8446. [Google Scholar] [CrossRef]
Dai, Y.; Zhang, K.; Maharjan, S.; Zhang, Y. Deep Reinforcement Learning for Stochastic Computation Offloading in Digital Twin Networks. IEEE Trans. Ind. Inform. 2021, 17, 4968–4977. [Google Scholar] [CrossRef]
Ren, Y.; Sun, Y.; Peng, M. Deep Reinforcement Learning Based Computation Offloading in Fog Enabled Industrial Internet of Things. IEEE Trans. Ind. Inform. 2021, 17, 4978–4987. [Google Scholar] [CrossRef]
Yang, L.; Li, M.; Si, P.; Yang, R.; Sun, E.; Zhang, Y. Energy-Efficient Resource Allocation for Blockchain-Enabled Industrial Internet of Things with Deep Reinforcement Learning. IEEE Internet Things J. 2021, 8, 2318–2329. [Google Scholar] [CrossRef]
Pan, X.; Wang, W.; Zhang, X.; Li, B.; Yi, J.; Song, D. How You Act Tells a Lot: Privacy-Leaking Attack on Deep Reinforcement Learning. In Proceedings of the International Foundation for Autonomous Agents and Multiagent Systems, AAMAS 2019, Montreal, QC, Canada, 13–17 May 2019; pp. 368–376. [Google Scholar]
Zhou, C.; Wu, W.; He, H.; Yang, P.; Lyu, F.; Cheng, N.; Shen, X. Deep Reinforcement Learning for Delay-Oriented IoT Task Scheduling in SAGIN. IEEE Trans. Wirel. Commun. 2021, 20, 911–925. [Google Scholar] [CrossRef]
Liu, P.; He, H.; Fu, T.; Lu, H.; Alelaiwi, A.; Wasi, M.W.I. Task offloading optimization of cruising UAV with fixed trajectory. Comput. Netw. 2021, 199, 108397. [Google Scholar] [CrossRef]
Wei, D.; Ma, J.; Luo, L.; Wang, Y.; He, L.; Li, X. Computation offloading over multi-UAV MEC network: A distributed deep reinforcement learning approach. Comput. Netw. 2021, 199, 108439. [Google Scholar] [CrossRef]
Zhu, S.; Gui, L.; Zhao, D.; Cheng, N.; Zhang, Q.; Lang, X. Learning-Based Computation Offloading Approaches in UAVs-Assisted Edge Computing. IEEE Trans. Veh. Technol. 2021, 70, 928–944. [Google Scholar] [CrossRef]
Seid, A.M.; Boateng, G.O.; Anokye, S.; Kwantwi, T.; Sun, G.; Liu, G. Collaborative Computation Offloading and Resource Allocation in Multi-UAV-Assisted IoT Networks: A Deep Reinforcement Learning Approach. IEEE Internet Things J. 2021, 8, 12203–12218. [Google Scholar] [CrossRef]
Sacco, A.; Esposito, F.; Marchetto, G.; Montuschi, P. Sustainable Task Offloading in UAV Networks via Multi-Agent Reinforcement Learning. IEEE Trans. Veh. Technol. 2021, 70, 5003–5015. [Google Scholar] [CrossRef]
Gao, A.; Wang, Q.; Chen, K.; Liang, W. Multi-UAV Assisted Offloading Optimization: A Game Combined Reinforcement Learning Approach. IEEE Commun. Lett. 2021, 25, 2629–2633. [Google Scholar] [CrossRef]
He, X.; Liu, J.; Jin, R.; Dai, H. Privacy-Aware Offloading in Mobile-Edge Computing. In Proceedings of the 2017 IEEE Global Communications Conference (GLOBECOM 2017), Singapore, 4–8 December 2017; pp. 1–6. [Google Scholar]
He, X.; Jin, R.; Dai, H. Physical-Layer Assisted Privacy-Preserving Offloading in Mobile-Edge Computing. In Proceedings of the 2019 IEEE International Conference on Communications (ICC 2019), Shanghai, China, 20–24 May 2019; pp. 1–6. [Google Scholar]
Min, M.; Wan, X.; Xiao, L.; Chen, Y.; Xia, M.; Wu, D.; Dai, H. Learning-Based Privacy-Aware Offloading for Healthcare IoT with Energy Harvesting. IEEE Internet Things J. 2019, 6, 4307–4316. [Google Scholar] [CrossRef]
He, X.; Jin, R.; Dai, H. Deep PDS-Learning for Privacy-Aware Offloading in MEC-Enabled IoT. IEEE Internet Things J. 2019, 6, 4547–4555. [Google Scholar] [CrossRef]
Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. In Proceedings of the 4th International Conference on Learning Representations, (ICLR 2016), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
Dwork, C.; McSherry, F.; Nissim, K.; Smith, A.D. Calibrating Noise to Sensitivity in Private Data Analysis. Lecture Notes in Computer Science. In Proceedings of the Theory of Cryptography, Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, 4–7 March 2006; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3876, pp. 265–284. [Google Scholar]
Dwork, C.; Roth, A. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Messous, M.A.; Senouci, S.; Sedjelmaci, H.; Cherkaoui, S. A Game Theory Based Efficient Computation Offloading in an UAV Network. IEEE Trans. Veh. Technol. 2019, 68, 4964–4974. [Google Scholar] [CrossRef]
Li, K.; Tao, M.; Chen, Z. Exploiting Computation Replication for Mobile Edge Computing: A Fundamental Computation-Communication Tradeoff Study. IEEE Trans. Wirel. Commun. 2020, 19, 4563–4578. [Google Scholar] [CrossRef] [Green Version]
Zhou, F.; Hu, R.Q. Computation Efficiency Maximization in Wireless-Powered Mobile Edge Computing Networks. IEEE Trans. Wirel. Commun. 2020, 19, 3170–3184. [Google Scholar] [CrossRef] [Green Version]
Tiwari, N.; Bellur, U.; Sarkar, S.; Indrawan, M. CPU Frequency Tuning to Improve Energy Efficiency of MapReduce Systems. In Proceedings of the 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), Wuhan, China, 13–16 December 2016; pp. 1015–1022. [Google Scholar]
He, X.; Jin, R.; Dai, H. Peace: Privacy-Preserving and Cost-Efficient Task Offloading for Mobile-Edge Computing. IEEE Trans. Wirel. Commun. 2020, 19, 1814–1824. [Google Scholar] [CrossRef]
Min, M.; Xiao, L.; Chen, Y.; Cheng, P.; Wu, D.; Zhuang, W. Learning-Based Computation Offloading for IoT Devices with Energy Harvesting. IEEE Trans. Veh. Technol. 2019, 68, 1930–1941. [Google Scholar] [CrossRef] [Green Version]
Chen, X.; Jiao, L.; Li, W.; Fu, X. Efficient Multi-User Computation Offloading for Mobile-Edge Cloud Computing. IEEE/ACM Trans. Netw. 2016, 24, 2795–2808. [Google Scholar] [CrossRef] [Green Version]
Wang, B.; Hegde, N. Privacy-Preserving Q-Learning with Functional Noise in Continuous Spaces. arXiv 2019, arXiv:1901.10634. [Google Scholar]
Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M.A. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Volume 32, pp. 387–395. [Google Scholar]
MacKay, D.J. Introduction to Gaussian processes. NATO ASI Ser. F Comput. Syst. Sci. 1998, 168, 133–166. [Google Scholar]
Wei, D.; Xi, N.; Ma, J.; Li, J. Protecting Your Offloading Preference: Privacy-aware Online Computation Offloading in Mobile Blockchain. In Proceedings of the 2021 IEEE/ACM 29th International Symposium on Quality of Service (IWQOS), Tokyo, Japan, 25–28 June 2021; pp. 1–10. [Google Scholar]
Hall, R.; Rinaldo, A.; Wasserman, L.A. Differential privacy for functions and functional data. J. Mach. Learn. Res. 2013, 14, 703–727. [Google Scholar]
Kairouz, P.; Oh, S.; Viswanath, P. The Composition Theorem for Differential Privacy. IEEE Trans. Inf. Theory 2017, 63, 4037–4049. [Google Scholar] [CrossRef]
Patel, A.; Shah, N.; Limbasiya, T.; Das, D. VehicleChain: Blockchain-based Vehicular Data Transmission Scheme for Smart City. In Proceedings of the 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 661–667. [Google Scholar]
Sakuraba, A.; Sato, G.; Uchida, N.; Shibata, Y. Performance Evaluation of Improved V2X Wireless Communication Based on Gigabit WLAN. Lecture Notes in Networks and Systems. In Proceedings of the International Conference on Broadband and Wireless Computing, Communication and Applications, Yonago, Japan, 28–30 October 2020; Springer: Cham, Switzerland, 2020; Volume 159, pp. 131–142. [Google Scholar]
You, C.; Huang, K.; Chae, H.; Kim, B. Energy-Efficient Resource Allocation for Mobile-Edge Computation Offloading. IEEE Trans. Wirel. Commun. 2017, 16, 1397–1411. [Google Scholar] [CrossRef]
Cheng, X.; Lyu, F.; Quan, W.; Zhou, C.; He, H.; Shi, W.; Shen, X. Space/Aerial-Assisted Computing Offloading for IoT Applications: A Learning-Based Approach. IEEE J. Sel. Areas Commun. 2019, 37, 1117–1129. [Google Scholar] [CrossRef]

Figure 1. The offloading performance leakage problem in UAV-assisted online computation offloading.

Figure 2. The overview of the proposed DP-DQL approach.

Figure 3. The illustration of real-world experiment.

Figure 4. Convergence performance versus different

σ

.

Figure 4. Convergence performance versus different

σ

.

Figure 5. Cost efficiency of the proposed method versus different transmission rate

r_{t}^{n}

.

Figure 5. Cost efficiency of the proposed method versus different transmission rate

r_{t}^{n}

.

Figure 6. Cost efficiency of the proposed method versus different bits of a task

C_{t}

.

Figure 6. Cost efficiency of the proposed method versus different bits of a task

C_{t}

.

Figure 7. Cost efficiency of the proposed method versus different transmission rate

r_{t}^{n}

.

Figure 7. Cost efficiency of the proposed method versus different transmission rate

r_{t}^{n}

.

Figure 8. Cost efficiency of the proposed method versus different bits of a task

C_{t}

.

Figure 8. Cost efficiency of the proposed method versus different bits of a task

C_{t}

.

Figure 9. The proportion of time cost in the total cost versus (a) different transmission rate

r_{t}^{n}

and (b) different bits of a task

C_{t}

.

Figure 9. The proportion of time cost in the total cost versus (a) different transmission rate

r_{t}^{n}

and (b) different bits of a task

C_{t}

.

Figure 10. Convergence performance versus different H.

Table 1. List of major notations.

Notations	Descriptions
n	The index of BSs
N	The number of BSs
$N$	The set of BSs
$x_{n}, y_{n}$	The x-coordinate and y-coordinate of BS n
$x, y, h$	The x-coordinate, y-coordinate, and height of UAV
t	Time index
$f^{n}, f$	The CPU frequency of n-th BS
f	The CPU frequency of the UAV
$T_{t}, T$	The computation task in time slot t and the set of computation task
H	The reset factor of the DP-DQL
$P_{t}^{O}, E_{t}^{O}$	The time and energy consumption at time slot t in BSs
$C_{t}, D_{t}$	The bits and maximum execution time of task $T_{t}$
$P_{t}^{L}, E_{t}^{L}$	The consumed time and energy to locally process task at time slot t
$E P$	The transmit power of transmitting a bit from BS n to the UAV
$a_{t}, s_{t}, u_{t}$	The action, state, and reward of the DP-DQL in t-th time slot
$T P, V$	The number of training episode and the maximum learning steps within a training episode
$τ, γ$	The discount factor and learning rate of the proposed DP-DQL
$A, Z$	The mini-batch size and the replay buffer
$ξ$	The bits which be processed during a CPU cycle
$r_{t}^{n}$	The radio link transmission rate between the UAV and BS n
$Ψ$	The balance factor of the DP-DQL

Table 2. Parameter values.

Parameter	Value	Parameter	Value
h	5 m	$C_{t}$	{20, 40, 60} Mb
f	$1 \times 10^{9}$ cycles/s	$f^{n}$	$3 \times 10^{9}$ cycles/s
$D_{t}$	3 s	$ξ$	1000
$r_{t}^{n}$	{2, 6, 10} Mb/s	$β$	$1 \times 10^{- 11}$
$E P$	0.2 W	$T P$	100
V	50	N	3
$γ$	0.001	$τ$	0.999
$θ_{1}$	0.5	$θ_{2}$	0.5
$\| Z \|$	1024	A	128
$θ_{1}$	0.5	$θ_{2}$	0.5

Table 3. The results of t-test.

	$σ$ values	0	0.2	0.4	0.6	0.8
	p-values	0.197	1.23 × 10 $^{- 15}$	5.73 × 10 $^{- 75}$	7.54 × 10 $^{- 64}$	4.25 × 10 $^{- 32}$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wei, D.; Xi, N.; Ma, J.; He, L. UAV-Assisted Privacy-Preserving Online Computation Offloading for Internet of Things. Remote Sens. 2021, 13, 4853. https://doi.org/10.3390/rs13234853

AMA Style

Wei D, Xi N, Ma J, He L. UAV-Assisted Privacy-Preserving Online Computation Offloading for Internet of Things. Remote Sensing. 2021; 13(23):4853. https://doi.org/10.3390/rs13234853

Chicago/Turabian Style

Wei, Dawei, Ning Xi, Jianfeng Ma, and Lei He. 2021. "UAV-Assisted Privacy-Preserving Online Computation Offloading for Internet of Things" Remote Sensing 13, no. 23: 4853. https://doi.org/10.3390/rs13234853

APA Style

Wei, D., Xi, N., Ma, J., & He, L. (2021). UAV-Assisted Privacy-Preserving Online Computation Offloading for Internet of Things. Remote Sensing, 13(23), 4853. https://doi.org/10.3390/rs13234853

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

UAV-Assisted Privacy-Preserving Online Computation Offloading for Internet of Things

Abstract

1. Introduction

1.1. Related Works and Challenges

1.2. Contributions

2. Materials and Methods

2.1. Background Techniques

2.1.1. Differential Privacy

2.1.2. Deep Q-Learning

2.2. System Model and Problem Formulation

2.2.1. System Model

2.2.2. Threat Model and Privacy Issue

2.2.3. Design Goals

2.2.4. Problem Formulation

2.3. DP-Based Deep Q-Learning for Computation Offloading

2.3.1. Overview

2.3.2. Initialization (Lines 1–4)

2.3.3. Exploring (Lines 5–8)

2.3.4. Generating Differential Disturbance (Lines 9–11)

2.3.5. PER-Based Policy Updating (Lines 12–21)

2.4. Theoretical Analysis

2.4.1. Differential Privacy Guarantee

2.4.2. Minor Utility Loss

3. Results

3.1. Experiment Settings

3.2. Baseline Methods

3.3. The Convergence of the DP-DQL Method

3.4. The Privacy Protection of the DP-DQL Method

3.5. The Cost Efficiency of the DP-DQL Method

3.6. The Performance of the DP-DQL Method Deployed in a Realistic Scenario

4. Discussion

4.1. Impact of the Key Parameters on the Convergence of DP-DQL Method

4.2. Limitations and Future Works

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI