UAV-Assisted Privacy-Preserving Online Computation Offloading for Internet of Things

Unmanned aerial vehicles (UAVs) play an increasingly important role in the Internet of Things (IoT) for remote sensing and device interconnection. Due to its limited computing capacity and energy, a UAV cannot handle complex tasks on its own. Recently, computation offloading based on deep reinforcement learning (DRL) has provided a promising way for the UAV to handle complex tasks. However, existing DRL-based computation offloading methods merely protect usage pattern privacy and location privacy. In this paper, we consider a new privacy issue in UAV-assisted IoT, namely computation offloading preference leakage, which lacks thorough study. To cope with this issue, we propose a novel privacy-preserving online computation offloading method for UAV-assisted IoT. Our method integrates the differential privacy mechanism into DRL, which can protect the UAV's offloading preference. We provide a formal analysis of the security and utility loss of our method. Extensive real-world experiments are conducted. The results demonstrate that, compared with baseline methods, our method can learn a cost-efficient computation offloading policy without preference leakage and without a priori knowledge of the wireless channel model.


Introduction
With the rapid development of unmanned aerial vehicles (UAVs), they are being applied in various scenarios, such as data collection and remote sensing among Internet of Things (IoT) sensors [1]. Despite the benefits of high mobility, swift deployment, and low economic cost, the large-scale application of UAVs is limited by their computation capacity and energy. Recently, computation offloading has been regarded as a promising solution for enabling UAV-assisted IoT to process the huge volume of data produced by IoT sensors [2,3].
Existing computation offloading methods for IoT fall into two main categories, i.e., one-shot optimization methods [4,5] and DRL-based methods [6,7]. Compared with one-shot optimization methods, DRL-based methods can help devices learn a computation offloading policy with higher energy efficiency and lower time delay. Besides this benefit, DRL-based methods allow the devices to learn the computation offloading policy without a priori knowledge of the wireless channel model, which makes them suitable for coping with the dynamics of the wireless channel between the UAV and IoT sensors [8].
Despite the benefits of applying DRL to computation offloading, the vulnerabilities of DRL can be exploited by adversaries to interfere with the UAV's policy learning [9], which hinders DRL from being applied in the real world. Figure 1 provides a case of computation offloading preference leakage over UAV-assisted IoT. The adversary misleads the UAV into offloading tasks to malicious base stations (BSs) by inverting the RL algorithm based on observations of the offloading decisions and the status of the transmission radio link.
Once the value function is obtained by adversaries, they can induce the mobile devices to offload computation tasks to malicious BSs.
Challenge 2: Existing privacy-preserving works designed for DRL-based computation offloading cannot prevent malicious BSs from inferring the value function of the DRL algorithm.

Contributions
To address the aforementioned privacy issues in existing works, we propose a differential privacy (DP)-based deep Q-learning (DP-DQL) method that tackles computation offloading preference leakage over UAV-assisted IoT. Our contributions can be summarized as follows.

1.
We investigate a new privacy leakage issue within the online computation offloading over UAV-assisted IoT, namely computation offloading preference leakage.

2.
We propose a differential privacy-based deep Q-learning (DP-DQL) method to protect the computation offloading preference over UAV-assisted IoT. In the proposed DP-DQL method, DQL is adopted as the basic framework for efficiently learning the computation offloading policy without a priori knowledge of the wireless channel model. Then, Gaussian noise is generated in the policy updating process of DQL, which protects the computation offloading preference. Finally, the learning speed of DP-DQL is accelerated by the prioritized experience replay (PER) technique [21], which replays the experiences with high temporal-difference error.

3.
We provide a theoretical analysis of the differential privacy guarantee and the utility loss. Then, the convergence, privacy protection, and cost efficiency of our method are demonstrated by extensive real-world experiments. The results show that our method can help the UAV learn a cost-efficient computation offloading policy with the differential privacy guarantee.
The rest of this paper is organized as follows. Section 2 gives the necessary background, system model and problem formulation, details of the proposed method and theoretical analysis. Then, we design and conduct the experiments in Section 3. We discuss the impact of key parameters on the convergence and the limitations of the proposed method in Section 4. Finally, we conclude this paper in Section 5.

Materials and Methods
In this section, we first provide the background techniques. Then, we describe the system model and formulate the research problem of this paper. Finally, we show the proposed DP-DQL method and give the theoretical analysis in terms of privacy guarantee and utility loss.

Differential Privacy
Differential privacy [22,23] establishes a strong standard for privacy guarantees in knowledge transfer; it aims to prevent data analysis algorithms from distinguishing between two neighboring inputs. The key definitions are provided in the following.
Definition 1. For any two neighboring inputs $z, z' \in B$ and any subset of outputs $D \subseteq E$, $(\alpha, y)$-differential privacy is guaranteed once the mechanism $C : B \to E$ satisfies the inequality
$$\Pr[C(z) \in D] \le e^{\alpha} \Pr[C(z') \in D] + y.$$
The definition of the output's global sensitivity is shown as follows.
Definition 2. Given any neighboring inputs $z, z' \in B$, the output's global sensitivity can be computed as
$$\max_{z, z' \in B} \| C(z) - C(z') \|,$$
where $\|\cdot\|$ represents the norm function in $E$.
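To make Definitions 1 and 2 concrete, the following minimal sketch applies the classical Gaussian mechanism to a scalar output. It is an illustration only: the calibration used here is the textbook one for $0 < \alpha < 1$, not the Gaussian-process noise schedule of DP-DQL, and the function names are hypothetical.

```python
import math
import random


def gaussian_sigma(sensitivity: float, alpha: float, y: float) -> float:
    """Noise scale of the classical Gaussian mechanism.

    Textbook calibration: sigma = sensitivity * sqrt(2 ln(1.25 / y)) / alpha,
    which yields (alpha, y)-differential privacy when 0 < alpha < 1.
    This is an assumption of the sketch, not the paper's noise schedule.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("this calibration assumes 0 < alpha < 1")
    if not 0.0 < y < 1.0:
        raise ValueError("y must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / y)) / alpha


def gaussian_mechanism(value, sensitivity, alpha, y, rng=random):
    """Release `value` perturbed with zero-mean Gaussian noise."""
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, alpha, y))
```

A larger global sensitivity or a smaller privacy budget $\alpha$ both increase the noise scale, which is the privacy-utility tension examined later in the experiments.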

Deep Q-Learning
Deep Q-learning [24] leverages a deep neural network to approximate the value function and aims to find a policy $\Pi^*(\cdot)$ that minimizes the Bellman error.
Figure 1 shows a UAV-assisted mobile edge computing (MEC) system for IoT, which contains $N$ fixed base stations (BSs), denoted as a set $\mathcal{N} = \{1, 2, \ldots, N\}$, and a UAV. $(x_n, y_n, 0)$ is the coordinate of BS $n$, and $(x, y, h)$ is the coordinate of the UAV, where $h$ indicates the flight height of the UAV. Referring to Reference [25], the UAV adopts Wi-Fi or LTE technology to communicate with the BSs and smart factories. At each time slot $t$, a computation task $T_t$ (e.g., pattern recognition) is collected by the UAV from a smart factory, where $T_t \in \mathcal{T}$. The task $T_t$ is described as $T_t = (D_t, C_t)$, where $D_t$ is the maximum execution time and $C_t$ is the size of task $T_t$ in bits. Moreover, processing one bit of the task takes $\xi$ CPU cycles.
Because binary computation offloading is a special case of partial computation offloading [26,27], we investigate partial offloading in this paper for generality. To process a task, the UAV needs to decide how much of the task should be offloaded to the BSs. Formally, $s_t$ represents the offloaded proportion of a task $T_t$. To improve the performance of the computation offloading decision, the UAV can adjust the offloaded proportion of a task. We define the CPU frequencies of the BS and the UAV as $f_n$ and $f$, respectively. We assume that every BS has the same CPU frequency, which avoids some of the computing fallacies that arise when the BSs process tasks in parallel [28]. The cost model is as follows. The time spent on locally processing a task is denoted by $P_t^L$ and the local energy consumption by $E_t^L$, where the latter depends on $\beta$, a coefficient related to the CPU architecture [29]. The cost of offloading a task to BS $n$ consists of two parts, i.e., the time cost and the energy cost. The time $P_t^O$ for offloading and processing the task is
$$P_t^O = \frac{C_t s_t}{r_t^n} + \frac{\xi C_t s_t}{f_n},$$
where $C_t s_t / r_t^n$ is the time for transmitting the task to the BS, and $\xi C_t s_t / f_n$ represents the processing time in the BS.
Based on References [30,31], the energy consumption $E_t^O$ of transmitting a task to BS $n$ depends on the transmit power $EP$. In this paper, we assume that the BSs have sufficient energy. This is reasonable because fixed BSs are usually deployed in areas with a wired power supply from the grid. Hence, the energy consumption of the BSs is not considered in this paper. For convenience, the major notations used in this paper are summarized in Table 1.
Table 1. Major notations.
$f_n$: The CPU frequency of the $n$-th BS
$f$: The CPU frequency of the UAV
$T_t$, $\mathcal{T}$: The computation task in time slot $t$ and the set of computation tasks
$H$: The reset factor of the DP-DQL
$P_t^O$, $E_t^O$: The time and energy consumption at time slot $t$ in BSs
$C_t$, $D_t$: The bits and maximum execution time of task $T_t$
$P_t^L$, $E_t^L$: The consumed time and energy to locally process a task at time slot $t$
$EP$: The transmit power of transmitting a bit from BS $n$ to the UAV
$a_t$, $s_t$, $u_t$: The action, state, and reward of the DP-DQL in the $t$-th time slot
$TP$, $V$: The number of training episodes and the maximum number of learning steps within a training episode
$\tau$, $\gamma$: The discount factor and learning rate of the proposed DP-DQL
$A$, $Z$: The mini-batch size and the replay buffer
$\xi$: The number of CPU cycles needed to process one bit
$r_t^n$: The radio link transmission rate between the UAV and BS $n$
$\Psi$: The balance factor of the DP-DQL
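The cost model above can be sketched in a few lines of Python. Only the offloading time follows directly from the terms spelled out in the text; the local time, local energy, and transmission energy use common MEC forms (cycles divided by frequency, $\beta f^2$ per CPU cycle, and transmit power times transmission time) that are assumptions of this sketch. All function and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    bits: float      # C_t, task size in bits
    deadline: float  # D_t, maximum execution time in seconds


def local_time(task, s, xi, f_uav):
    """Assumed form: cycles of the non-offloaded share (1 - s) over UAV frequency."""
    return xi * task.bits * (1.0 - s) / f_uav


def local_energy(task, s, xi, f_uav, beta):
    """Assumed form: beta * f^2 joules per CPU cycle executed on the UAV."""
    return beta * f_uav ** 2 * xi * task.bits * (1.0 - s)


def offload_time(task, s, xi, f_bs, rate):
    """Transmission time C_t s_t / r plus BS processing time xi C_t s_t / f_n."""
    return task.bits * s / rate + xi * task.bits * s / f_bs


def offload_energy(task, s, rate, tx_power):
    """Assumed form: transmit power EP multiplied by the transmission time."""
    return tx_power * task.bits * s / rate
```

For example, with a 40 Mb task, half of it offloaded over a 10 Mb/s link to a 2 GHz BS and $\xi = 1000$, `offload_time` returns 2 s of transmission plus 10 s of processing.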

Threat Model and Privacy Issue
In this paper, we consider a new privacy leakage issue over UAV-assisted IoT, namely computation offloading preference leakage, which is shown in Figure 1. We assume that the adversary knows the inputs and formats of the UAV's computation offloading policy in advance. This assumption is reasonable because:

1.
The BSs can provide customized services for the UAV based on the formats of the UAV's computation offloading policy and the inputs of the policy.

2.
Once the adversary compromises a BS, it can monitor the inputs and formats of the UAV's computation offloading policy.
In accordance with the above assumptions, the adversaries monitor the UAV's computation offloading decisions and the inputs of the computation offloading policy, e.g., the radio link transmission rate between the UAV and the BSs. The adversaries utilize an algorithm, such as an inverse reinforcement learning algorithm, to infer the UAV's computation offloading preference based on the monitoring results. Furthermore, the adversaries construct specific inputs for the UAV's computation offloading policy, e.g., by improving the radio link transmission rate with the help of malicious BSs, which can mislead the UAV into offloading computation tasks onto the malicious BSs.

Design Goals
To solve the above privacy issues, our proposed method should achieve the following goals.

1.
Differential privacy guarantee: DP-DQL method should provide (α, y)-differential privacy for UAV during learning process so that the value function of the UAV's computation offloading policy will not be inferred by the adversaries based on the system state and offloading decision.

2.
Minor utility loss: DP-DQL method should guarantee that, compared with the traditional DQL method, the performance of the DP-DQL method will not be significantly degraded by adding the differential privacy mechanism.

Problem Formulation
As claimed in Section 2.1.2, we formulate the problem of privacy-preserving computation offloading in the UAV-assisted MEC network for IoT as a Markov decision process (MDP), which is defined as a tuple $M = (S, A, P, R)$.
(1) System state: The system state is the offloaded proportion. Formally, $s_t \in S$ ranges from 0 to 1.
(2) Action space: The UAV adjusts the offloaded proportion of a task by increasing or decreasing it by an amount between 0 and 0.25. Formally, $a_t \in [0, 0.25]$.
(3) Reward function: The weighted average of the energy and time costs, each passed through a normalization function $\eta$, is adopted as the reward function (Equation (8)). To meet the differential privacy requirements, the value domain of the reward function is constrained to the range from 0 to 1. The proof will be given in Section 2.4.1.
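The state transition and the bounded, weighted cost described above can be sketched as follows. This is a hedged illustration: the min-max form of the normalization, the `weight` parameter, and the explicit `increase` flag for the direction of the adjustment are assumptions, and the sign convention linking this weighted cost to the reward is not fixed here.

```python
def apply_action(s, delta, increase):
    """Move the offloaded proportion s by delta in [0, 0.25], clipped to [0, 1]."""
    if not 0.0 <= delta <= 0.25:
        raise ValueError("delta must lie in [0, 0.25]")
    s_next = s + delta if increase else s - delta
    return min(1.0, max(0.0, s_next))


def min_max(value, lo, hi):
    """Hypothetical normalization: map [lo, hi] onto [0, 1] with clipping."""
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))


def weighted_cost(time_cost, energy_cost, weight, time_range, energy_range):
    """Weighted average of normalized time and energy costs, always in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    t = min_max(time_cost, *time_range)
    e = min_max(energy_cost, *energy_range)
    return weight * t + (1.0 - weight) * e
```

Because both normalized terms lie in $[0, 1]$ and the weights sum to one, the result stays in $[0, 1]$, which matches the bounded value domain required by the privacy analysis.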

DP-Based Deep Q-Learning for Computation Offloading
In this section, we first give an overview of the DP-DQL method. Then, its details are provided.

Overview
The steps of online learning are shown in Figure 2, which consists of four stages.

1.
Initialization: Initializing the parameters used in DP-DQL approach.

2.
Exploring: The UAV executes offloading action and obtains reward from the environment.

3.
Generating differential disturbance: The UAV generates the specific Gaussian noise to prevent the computation offloading preference leakage.

4.
PER-based policy updating: The UAV updates the computation offloading policy with the help of the PER technique.
Algorithm 1 shows the details of the DP-DQL method, and its description is given as follows.
Algorithm 1 Differential privacy-based deep Q-learning for computation offloading
1: Initialize the parameters of the DP-DQL method;
2: for $j \in [1, TP]$ do
3:   Reset the environment;
4:   Reset the differential dict $l(\cdot)$;
5:   for $t \in [0, V - 1]$ do
6:     Conduct the action $a_t = \arg\max_a l(s_t) + \Pi_\zeta(s_t, a)$;
7:     Reach the state $s_{t+1}$ and get a reward $u_t$;
8:     Compute the maximum priority $z_t$ via Equation (9) and store it with the transition;
9:     if $t \equiv 0 \bmod A$ then
10:      for $i \in [1, A]$ do
11:        Generate the differential disturbance $\delta_t$;
12:        Sample a transition via Equation (13);
13:        Compute the importance-sampling weight via Equation (14);
14:        Compute the TD-error via Equation (9);
15:        Update the priority of the transition;
16:        Compute the accumulated policy gradient $\psi$ by Equation (15)
The online policy $\Pi(\cdot)$ and the target policy $\Pi'(\cdot)$ are initialized with random weights $\zeta$ and $\zeta'$ (Line 1), where the target policy $\Pi'(\cdot)$ is used to slow the updating rate of the online policy $\Pi(\cdot)$ and, hence, improve the stability of the algorithm. The environment is constructed for learning the computation offloading policy. Then, the differential dict $l(\cdot)$ is initialized and reset to NULL every $TP/H$ training episodes (Line 4). The differential dict $l(\cdot)$ is defined as a dictionary structure, where the key size and value size are set to $|A|$ and 2, respectively.

Exploring (Lines 5-8)
If a task $T_t$ is collected by the UAV at time slot $t$, the UAV makes an offloading decision through the online computation offloading policy $\Pi(s_t)$ under the differential disturbance $l(s_t)$ (Line 6). After executing the action, the UAV receives the reward $u_t$ and obtains a new state $s_{t+1}$ (Line 7). Based on the old state $s_t$, action $a_t$, new state $s_{t+1}$, and reward $u_t$, the UAV constructs a transition. Then, the UAV computes the priority $z_t$ and stores it in the replay buffer $Z$ together with the transition (Line 8). Based on Reference [21], the priority $z_t$ is computed via Equation (9), whose term $\Theta$ is evaluated with $a_{t+1} = \Pi'(s_{t+1})$.
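The exploring stage can be illustrated with a small sketch over a discrete set of candidate actions. It assumes the usual one-step temporal-difference error and a proportional PER priority (absolute TD error plus a small constant); these forms, the `eps` constant, and all function names are assumptions of the sketch and may differ from Equation (9).

```python
def select_action(q_values, disturbance):
    """Return the index a that maximizes q_values[a] + disturbance[a]."""
    if len(q_values) != len(disturbance):
        raise ValueError("one disturbance value per action is required")
    noisy = [q + d for q, d in zip(q_values, disturbance)]
    return max(range(len(noisy)), key=noisy.__getitem__)


def td_error(reward, q_sa, q_next_target, tau):
    """Assumed one-step TD error with discount factor tau."""
    return reward + tau * q_next_target - q_sa


def priority(delta, eps=1e-6):
    """Assumed proportional priority: |TD error| plus a small constant."""
    return abs(delta) + eps
```

Without disturbance the greedy action is returned; a sufficiently large disturbance on another action changes the choice, which is how the perturbation hides the true preference ordering from an observer.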

Generating Differential Disturbance (Lines 9-11)
Once the replay buffer is filled with transitions (Line 9), a mini-batch of transitions is sampled from the replay buffer to update the online and target policies (Line 10). A Gaussian process $\delta_t \sim Y(\upsilon_t, \varphi_t)$ generates a differential disturbance $\delta_t$ for each action $a \in A$ (Line 11). The differential disturbance $\delta_t$ is appended to the differential dict, $l(s_t) \leftarrow \delta_t$, and then $l(\cdot)$ is sorted. The parameters $\upsilon_t$ and $\varphi_t$ are set based on [32]; they depend on $\Phi = (4(1 + \Psi)/A)^{-1}\gamma$, where $\Psi$ represents the balance factor, and on the distances between consecutive states $\|s_t - s_{t-1}\|_2$, $\Lambda = \|s_{t+1} - s_{t-1}\|_2$, and $\Omega = \|s_{t+1} - s_t\|_2$.
PER-Based Policy Updating (Lines 12-21)
In this stage, the accumulated policy gradient $\psi$ is calculated based on the PER technique. Specifically, the sampling probability of a transition is computed by Equation (13) (Line 12), where $g$ is the index of a transition. Then, the importance-sampling weight (Line 13), whose function is to decrease bias as described in Reference [21], is given by Equation (14), where $P(\cdot)$ is the sampling probability, $|Z|$ represents the replay buffer size, $\theta_2$ is used to determine how much priority affects the sampling probability, and $\Upsilon_i$ is an importance factor of the transition. Next, the TD-error is calculated via Equation (9) (Line 14), and the absolute TD-error value is adopted to update the priority of the $i$-th transition (Line 15). Finally, the accumulated policy gradient $\psi$ is computed by Equation (15), based on the chain rule in Reference [33] (Line 16), where $\nabla_\zeta \Pi(s_i, a_i)$ is the gradient of the online policy for the vector $(s_i, a_i)$. Then, the online policy $\Pi(\cdot)$ is softly updated, where $\omega$ is a soft update coefficient (Line 18), and $\psi$ is reset to 0. Finally, the target policy $\Pi'(\cdot)$ is updated (Line 18).
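The sampling and bias-correction steps of PER can be sketched as below. The sketch follows the standard proportional variant of prioritized experience replay, which may differ in detail from Equations (13) and (14); the exponent names `priority_exponent` and `correction_exponent` are hypothetical.

```python
import random


def sampling_probabilities(priorities, priority_exponent):
    """P(g) = z_g**a / sum_k z_k**a, the proportional prioritization rule."""
    scaled = [z ** priority_exponent for z in priorities]
    total = sum(scaled)
    return [x / total for x in scaled]


def importance_weights(probs, correction_exponent):
    """w_g = (|Z| * P(g))**(-b), normalized by the largest weight."""
    n = len(probs)
    raw = [(n * p) ** (-correction_exponent) for p in probs]
    top = max(raw)
    return [w / top for w in raw]


def sample_indices(probs, batch_size, rng=random):
    """Draw transition indices with replacement according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

Transitions with a large TD error are replayed more often, and the importance-sampling weights shrink their gradient contribution accordingly so that the update is not biased toward them.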

Differential Privacy Guarantee
To analyze the differential privacy guarantee of the DP-DQL method under the adversary model in Section 2.2.2, we firstly provide a necessary theorem. Theorem 1. Given the sample path k from Gaussian process F(0, ρ 2 K), max ∈[0,1] k() exists with high probability. For each w > 0 in Sobolev space G 1 , we have where the Φ = Then, we consider the base case that the sample path k contains two elements, i.e., k 2 = k(0), k(1). Therefore, the expectation E is which is based on Finally, we consider the case of |k( )| > 2. Specifically, we aim to bound the expectation E(max k p ) for all p > 1. Referring to Chernoff bound, we have where j i is the p independent Gaussian random variables based on F(0, ρ 2 j ). Let t = 2 ln p/ρ j , we have max i j i 2 ln p/ρ j . Let m p = E(max k 2 p 0 ). Due to k 2 p 0 ⊂ k 2 p+1 0 , the series m p is non-decreasing. Then, we derive the upper bound of m p+1 − m p . Referring to Reference [32], we have ∃j, E(max(0, max k 2 p 1 − k 2 p 0 )) E(max(0, j)). The bound of E(max(0, j)) is given as follows. Hence, Finally, we have Based on induction, we can get ∀p, Referring to Reference [34], E(max k) shares the same upper bound of m p almost surely when k is continuous with probability one. Hence, Theorem 1 follows.
Let Φ = B/(4γX(1 + Ψ)), and the Equation (27) can be rewritten as Referring to Reference [36], we can get by adding Gaussian disturbance l ∼ F(0, ρ 2 K) to Π(·) within a policy updating step on the basis of Equation (25). This conclusion can be generalized to multiple policy updating steps by the theorem in Reference [37]. Hence, Theorem 2 is proved.

Minor Utility Loss
Before giving the final proof of utility loss, we show a necessary theorem and its proof.

Theorem 3.
Assume that U # a is the optimal result of the inequality constraint problem and we can obtain Proof. Given u # a = u a + δ a , we can get Equation (32) based on the strong duality and non-negativity in (♠) and ( ), respectively.
Finally, we prove the convergence of the proposed method through Theorem 4.

Theorem 4.
Compared with vanilla DQL, the utility loss of our DP-DQL method tends to 0 even in the worst case, where H = 1.

Proof. We have Equation (33) by solving Equation (30) according to Theorem 3. Further, according to Reference [32], the equation $E[u^{\#} e^{T}] = E[\sum_a u_a U_a^{\#T}]$ holds. Based on the strong duality, we have $\sum_a u_a U_a^{*T} = u^{*} e^{T}$. Considering that the state space is infinite, the upper bound of $E[\|u^{*} - u^{\#}\|_1]$ tends to zero. Moreover, Reference [24] guarantees the convergence of vanilla DQL. Therefore, our DP-DQL method achieves a minor utility loss compared with vanilla DQL and converges within a finite number of training episodes.

Results
In this section, we design four experiments to evaluate the convergence, privacy protection and cost efficiency of the DP-DQL method.

Experiment Settings
Scenario: In this paper, the UAV flies around the area at a constant height to collect data and make computation offloading decisions based on its offloading policy. A Raspberry Pi 3B+ is adopted as the airborne computer for locally processing tasks. Figure 3a shows the architecture of the UAV used for the experiments, and the flying area is shown in Figure 3b. We first randomly deploy three laptops in the area shown in Figure 3b to represent the BSs. Then, we deploy three actual roadside units (RSUs) to evaluate the variation of the results. Because the computing power of an actual RSU is similar to that of a Raspberry Pi [38,39], each actual RSU is represented by a Raspberry Pi 4.
Parameter settings: The radio link transmission rate from the $n$-th BS to the UAV is $r_t^n \in \{2\ \text{Mb/s}, 6\ \text{Mb/s}, 10\ \text{Mb/s}\}$. At each time slot $t$, the data size of the task $T_t$ is $C_t \in \{20\ \text{Mb}, 40\ \text{Mb}, 60\ \text{Mb}\}$, and each task should be processed within $D_t = 3$ s. During task processing, each bit needs $\xi = 1000$ CPU cycles to process [40]. As defined in Reference [41], the transmit power $EP$ from the BSs to the UAV is 0.2 W. The proposed DP-DQL method is implemented with PyTorch 1.1 and Python 3.6. We adopt a four-layer fully-connected feedforward neural network to implement the online and target policies. The learning rate $\gamma$ is set to 0.001, while the discount factor $\tau$ is set to 0.999. The replay buffer size $|Z|$ is 1024. The mini-batch size $A$ is 128. During online learning, there are $TP = 100$ training episodes and $V = 50$ learning steps in every training episode. For convenience, all parameter values are summarized in Table 2.
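For reference, the parameter settings listed above can be collected in a small configuration object. This is a sketch only: the class and field names are hypothetical, and "Mb" is assumed to mean $10^6$ bits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    rates_bps: tuple = (2e6, 6e6, 10e6)    # r_t^n, assuming 1 Mb = 1e6 bits
    task_bits: tuple = (20e6, 40e6, 60e6)  # C_t
    deadline_s: float = 3.0                # D_t
    cycles_per_bit: int = 1000             # xi
    tx_power_w: float = 0.2                # EP
    learning_rate: float = 0.001           # gamma
    discount: float = 0.999                # tau
    buffer_size: int = 1024                # |Z|
    batch_size: int = 128                  # A
    episodes: int = 100                    # TP
    steps_per_episode: int = 50            # V

    def total_steps(self):
        """Total number of learning steps over all training episodes."""
        return self.episodes * self.steps_per_episode

    def transmission_time(self, bits, rate, share=1.0):
        """Seconds needed to transmit `share` of a task of `bits` at `rate`."""
        return bits * share / rate
```

Under these assumptions, the settings amount to 5000 learning steps in total, and transmitting an entire 40 Mb task at 10 Mb/s would take 4 s, which is longer than the 3 s execution limit $D_t$.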

Baseline Methods
We evaluate the efficiency of our DP-DQL method by comparing it with two baseline methods:

1.
Greedy: This method has been widely adopted as a baseline method, where all tasks are fully offloaded to the BSs.

2.
Deep Q-learning with non-differentially-private mechanism (DQL-non-DP) [19]: We adopt a model-free method designed for healthcare IoT networks [19] and adjust it according to the system state space of this paper. This method can learn a cost-efficient computation offloading policy and serves as the baseline of cost efficiency for the DP-DQL method. The DQL-non-DP method shares the same hyperparameters with the DP-DQL method in the following experiments.

The Convergence of the DP-DQL Method
In this paper, the proposed DP-DQL method modifies the action selection step in exploring and the accumulated policy gradient computing step in PER-based policy updating with the aim of protecting the computation offloading preference. However, these two modifications may affect the learning performance of the proposed method. To evaluate the potential effect, we vary σ to test the impact of the modifications on the learning performance of the proposed method. According to Theorem 2, once the other hyperparameters and experiment parameters are determined, σ is an important parameter that determines the privacy level of the DP-DQL method. In this paper, we set σ ∈ {0, 0.2, 0.4, 0.6, 0.8}. Note that σ = 0 is the special case in which the privacy-preserving mechanism is not applied. Figure 4 shows the results. The results indicate that, as σ increases, the DP-DQL method needs more training episodes to approach the learning performance of the non-privacy case. This raises the question of which value of σ achieves the best learning performance while preserving the computation offloading preference. From Figure 4, it can be seen that the DP-DQL method allows the UAV to learn a stable computation offloading policy within 20 TPs, 20 TPs, and 34 TPs in the cases of σ = 0, σ = 0.2, and σ = 0.4, respectively. When σ > 0.4, the DP-DQL method continues to oscillate with no sign of convergence. Hence, the DP-DQL method performs well when σ ≤ 0.4.

The Privacy Protection of the DP-DQL Method
According to the threat model in Section 2.2.2, the adversary tries to increase the similarity between the distributions of the vanilla value function and the recovered value function in the state space S. In this paper, we adopt the t-test to quantitatively evaluate this similarity. To conduct the t-test, we first make a set of hypotheses, $K_0$ and $K_W$, where the subscript 0 denotes the null hypothesis and W denotes the alternative hypothesis.

1.
$K_0$: The distribution of the vanilla value function is the same as that of the recovered value function in the state space S.

2.
$K_W$: The distribution of the vanilla value function is not the same as that of the recovered value function in the state space S.
Referring to Section 3.3, we set the value function with σ = 0 as the vanilla value function, while the value functions with σ = 0, σ = 0.2, σ = 0.4, σ = 0.6, and σ = 0.8 are set as the recovered value functions. We randomly generate twenty pairs $(r_t^n, C_t)$ and input them to the vanilla value function and the recovered value function, respectively. Then, we obtain two sets of values. Finally, we calculate the p-value of the two sets. Table 3 shows the results. It can be seen that the p-value is less than 0.001 in most cases, except the case of σ = 0. Hence, the null hypothesis $K_0$ is rejected and the alternative hypothesis $K_W$ is accepted with strong evidence. This indicates that the value function of the proposed DP-DQL method cannot be recovered by inverse RL.
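The test statistic behind such a comparison can be sketched as follows. The sketch assumes Welch's unequal-variance t-test, which is one possible variant of the t-test; it returns the statistic and the degrees of freedom only, and the p-value would then be read from the Student t distribution.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two values")
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    if va + vb == 0.0:
        raise ValueError("both samples have zero variance")
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof
```

Here `sample_a` and `sample_b` would hold the outputs of the vanilla and recovered value functions on the same twenty generated inputs; a statistic far from zero corresponds to a small p-value and hence to rejecting $K_0$.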

The Cost Efficiency of the DP-DQL Method
In this experiment, our aim is to evaluate how much the privacy-preserving mechanism in the proposed method affects its cost efficiency, compared to the baseline methods. We adopt the weighted average of the costs $P_t^L$, $E_t^L$, $P_t^O$, and $E_t^O$ as the comparison metric, which is calculated based on Equation (8). Note that, due to their different ranges, the costs $P_t^L$, $E_t^L$, $P_t^O$, and $E_t^O$ are normalized as in Equation (8). According to Section 3.3, we set σ = 0.2 and σ = 0.4 in the experiments. To avoid statistical deviation, we perform each experiment with 10 random seeds. Figure 5 shows the influence of the radio transmission rate $r_t^n$ on the cost efficiency of DP-DQL and the baseline methods. The radio transmission rate $r_t^n$ is set to 2 Mb/s, 6 Mb/s, and 10 Mb/s, respectively, while the size of a task is $C_t = 40$ Mb. Compared with the DQL-non-DP baseline, the DP-DQL method shows only a small reduction in cost efficiency. For instance, we select the DP-DQL method with σ = 0.2 to compare with the DQL-non-DP method. When the two methods converge, the average cost of the DP-DQL method is 15%, 18%, and 20% less than that of the DQL-non-DP method in the cases of $r_t^n = 2$ Mb/s, $r_t^n = 6$ Mb/s, and $r_t^n = 10$ Mb/s, respectively. Moreover, we observe that the proposed DP-DQL method requires more training episodes, varying from 20 TPs to 35 TPs, to achieve a cost similar to that of the DQL-non-DP method. The extra training episodes used by the proposed DP-DQL method indicate the tradeoff between privacy and cost. Furthermore, as the radio transmission rate $r_t^n$ increases, the Greedy method can achieve better cost efficiency than the DP-DQL method in the cases of $r_t^n = 6$ Mb/s and 10 Mb/s. The reason is that a favorable wireless channel status reduces the transmission cost. However, compared with the DP-DQL method, the Greedy method cannot preserve the computation offloading preference.
Moreover, the cost efficiency of DP-DQL is evaluated under different task sizes $C_t$ and compared with the baseline methods. The size of a task $C_t$ is set to 20 Mb, 40 Mb, and 60 Mb, while the radio transmission rate $r_t^n$ is 10 Mb/s. The results in Figure 6 show that the DP-DQL method and the DQL-non-DP method outperform the Greedy method in most cases, except the early learning stage in the case of $C_t = 20$ Mb. The largest improvement in rewards is 260%, in the case of $C_t = 60$ Mb. The difference in rewards between the DP-DQL method and the DQL-non-DP method varies relatively little with $C_t$. The maximum difference is only 12%, indicating that the cost efficiency of the proposed method is not overly affected by the privacy-preserving mechanism and that it can learn a cost-efficient computation offloading policy. The exception at $C_t = 20$ Mb arises because offloading an entire task to the BS does not incur much transmission time when the task is small. As the size of a task increases, the transmission time becomes the bottleneck of the cost efficiency of the Greedy method.

The Performance of the DP-DQL Method Deployed in a Realistic Scenario
By replacing the laptops with the actual RSUs in the scenario, we re-examine the cost efficiency of the proposed method and the DQL-non-DP method. The parameters are set the same as in Section 3.5. Figures 7 and 8 show the cost efficiency of the proposed method and the baseline methods under different transmission rates $r_t^n$ and task sizes $C_t$, respectively. It can be seen that the proposed method can still converge within a finite number of TPs. However, by pairwise comparing Figure 5 with Figure 7, and Figure 6 with Figure 8, we can see that the experiments with actual RSUs increase the cost. The reason is that the weaker CPU computing power of the actual RSUs increases the time cost and eventually leads to an increase in the total cost. Specifically, based on the assumption in Section 2.2.1 that the BS energy consumption is not considered, the energy cost does not change when the actual RSUs replace the laptops; it is still the local energy cost $E_t^L$ plus the transmission energy cost $E_t^O$. However, the time $P_t^O$ for offloading and processing increases due to the weaker CPU computing power of the actual RSUs. To further verify this point, Figure 9 shows the proportion of the time cost in the total cost for σ = 0.4. It can be seen that the proportion of the time cost in the total cost increases after switching to the actual RSUs. Hence, replacing the laptops with the actual RSUs increases the time cost and eventually leads to an increase in the total cost.

Discussion
In this section, we first discuss the impact of the parameter values on the learning performance of the proposed method. Then, we discuss the limitations of the proposed method.

Impact of the Key Parameters on the Convergence of DP-DQL Method
As shown in Line 4 of Algorithm 1, the reset factor H determines the updating frequency of the differential dict l(·), which can affect the convergence of the proposed method. In this experiment, we evaluate the influence of the value of the reset factor H on the DP-DQL performance with σ = 0.2. Figure 10 shows the results. It can be seen that the average reward is not influenced by the value of the reset factor H.

Limitations and Future Works
The proposed method supports only one UAV offloading a task to a single BS in each time slot. This covers most existing daily application scenarios, such as grid inspection and remote sensing. However, the limited endurance of a single UAV restricts its ability to perform more complex tasks. Multi-UAV collaboration is a viable alternative, but the proposed method is not able to support privacy-preserving computation offloading in multi-UAV-assisted IoT scenarios. In the future, techniques such as distributed reinforcement learning and local differential privacy offer potential solutions to these needs.

Conclusions
In this paper, we propose a differential privacy-based deep Q-learning method for computation offloading over UAV-assisted IoT, which can protect the UAV's computation offloading preference. The formal analysis shows that the proposed DP-DQL method meets the design goals, i.e., differential privacy guarantee and minor utility loss. Furthermore, we evaluate the convergence and privacy of the DP-DQL method through real-world experiments.
The results indicate that, compared with baseline methods, our DP-DQL method can achieve long-term energy performance under the privacy guarantee. In the future, we will further investigate various privacy issues in DRL-based computation offloading methods.