# A Node Selection Strategy in Space-Air-Ground Information Networks: A Double Deep Q-Network Based on the Federated Learning Training Method


## Abstract


## 1. Introduction

## 2. System Model and Problem Definition

The local dataset of device i is denoted as $D_i$, and its size is denoted as $d_i$. $(x_i, y_i)$ represents a data sample from $D_i$, where $x_i$ denotes the sample data and $y_i$ represents the sample label. During local model training, $x_i$ is used as the input to the model, and $y_i$ is used as the expected output. The cross-entropy between $y_i$ and the model’s predicted output $y_i'$ is calculated as the local model’s loss function.

In each round of federated training, the ES selects a subset of devices $S_{sub}$, where $S_{sub} = \{1, 2, \dots, n\}$. The ES distributes the global model to each selected device, which then performs τ iterations to update the model using its local dataset $D_i$. After completing local model training, each device uploads its local model’s weight parameters to the ES. The ES performs the model aggregation algorithm to obtain the latest global model and then proceeds to the next round of federated training.
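The round structure described above (distribute, train locally, upload, aggregate) hinges on the server-side aggregation step. The text only says "model aggregation algorithm", so the following is a minimal Python sketch assuming a FedAvg-style rule that weights each device's model by its dataset size $d_i$; the function name and the weighting rule are illustrative assumptions, not the paper's stated method.

```python
import numpy as np

def aggregate(local_weights: list[np.ndarray], sizes: list[int]) -> np.ndarray:
    """Size-weighted average of local model weights.

    Assumption: a FedAvg-style rule, weighting device i by d_i / sum(d).
    Each entry of local_weights is a flattened parameter vector w_i^k.
    """
    total = sum(sizes)
    return sum(w * (d / total) for w, d in zip(local_weights, sizes))
```

For example, averaging two models with dataset sizes 1 and 3 weights the second model three times as heavily.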

Assuming device i requires $c_i$ CPU cycles to train a single sample and operates at a frequency of $f_i$, the computation time required for device i to execute local model training in the k-th round of federated training is given by

$$t_{i,k}^{\mathrm{cmp}} = \frac{\tau \cdot c_i \cdot d_i}{f_i}$$

The data transmission rate of device i in the k-th round is $b_i^k = B_{i,k} \log_2\left(1 + \frac{p_{i,k} \cdot g_{i,k}}{N_0}\right)$. Changes in environmental conditions can affect the data transmission rate and introduce uncertainty in transmission delays. The training time of device i in the k-th round of federated training is the sum of the computation time for local model updates and the model transmission time, where the transmission time is the model size divided by $b_i^k$. Therefore, the training time of device i can be obtained using the following equation:

$$t_{i,k} = t_{i,k}^{\mathrm{cmp}} + t_{i,k}^{\mathrm{com}}$$
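The per-device training-time model above can be sketched in a few lines of Python. This is an illustrative sketch, assuming the computation term $\tau \cdot c_i \cdot d_i / f_i$ and a transmission term of model size over transmission rate; the class and function names are not from the paper.

```python
from dataclasses import dataclass

@dataclass
class Device:
    c: float  # CPU cycles needed to train one sample (c_i)
    d: int    # number of local samples (d_i)
    f: float  # CPU frequency in cycles/s (f_i)

def training_time(dev: Device, tau: int, model_bits: float, rate_bps: float) -> float:
    """Per-round training time of a device: local computation plus model upload.

    Computation time: tau * c_i * d_i / f_i
    Transmission time: model size / data transmission rate (b_i^k)
    """
    t_cmp = tau * dev.c * dev.d / dev.f
    t_com = model_bits / rate_bps
    return t_cmp + t_com
```

With the simulation values from Section 4 (c_i = 6000 cycles/sample, d_i = 500 samples, f_i = 500 MHz, τ = 3, a 10 MB model at 8 Mbps), the transmission term dominates the round time.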

The training time of the k-th round is determined by the slowest participating device, i.e., $t_k = \max_{i \in S_{sub}} t_{i,k}$. The formulation of the optimization problem is as follows:

## 3. Design of Node Selection Algorithm for Minimum Training Cost in FL Based on DDQN

- (a)
- System state: This section considers a practical scenario with dynamic network bandwidth. However, it is assumed that the network state remains relatively stable within a short time slot and does not undergo drastic changes within a few tens of seconds. The state space S(t) for DRL is defined as the combination of the device’s data transmission rate β(t), operating frequency ζ(t), signal transmission power $T_p(t)$, and the number of samples owned by the device I(t). Thus, at time slot t, the system state can be represented by the following equation:

$$S(t) = \left\{\beta(t), \zeta(t), T_p(t), I(t)\right\}$$

- (b)
- The action space, denoted as A(t), is a vector consisting of discrete variables (0 or 1). ${a}_{i}^{t}\in A\left(t\right)$ represents the selection status of device i at time slot t. ${a}_{i}^{t}$ = 1 indicates that device i is selected to participate in federated training at time slot t, while ${a}_{i}^{t}$ = 0 indicates that device i does not participate in federated training at time slot t.

- (c)
- The policy π represents the mapping from the state space S(t) to the action space A(t), i.e., A(t) = π(S(t)). The goal of DRL is to learn an optimal policy π that maximizes the expected reward based on the current state.
- (d)
- The reward function r is aligned with the optimization objective, which is to minimize the weighted sum of time cost and energy cost. Therefore, the reward function r can be expressed as follows:

$$r = -\left(\lambda \cdot T_{cost} + (1-\lambda) \cdot E_{cost}\right)$$

- (e)
- The adjacent state S(t + 1) is determined based on the current state S(t) and the policy π. The specific expression is as follows:

$$S(t+1) = \left\{\beta(t+1), \zeta(t+1), T_p(t+1), I(t+1)\right\}$$
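The binary action vector A(t) of item (b) is easy to misread, so here is a minimal Python sketch of how a set of selected device indices maps to the 0/1 selection vector; the function name is illustrative.

```python
import numpy as np

def action_vector(selected: list[int], n_devices: int) -> np.ndarray:
    """Binary selection vector A(t): a_i = 1 iff device i is selected
    to participate in federated training at this time slot."""
    a = np.zeros(n_devices, dtype=int)
    a[selected] = 1
    return a
```

For example, selecting devices 0 and 3 out of 5 yields the vector [1, 0, 0, 1, 0].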

The immediate reward is $r_t$, received from the environment after the agent takes action a(t) based on policy π at time t. The goal of DRL is to maximize the sum of immediate rewards and discounted future rewards. Based on the Bellman equation and the temporal difference algorithm, to obtain the optimal policy using the value function approach in DRL, it is necessary to estimate the action-value function for future time steps and discount it using a discount factor, γ. The temporal difference target is defined as ${r}_{t}+\gamma \cdot Q({s}_{t+1},{a}_{t+1})$, where $Q(s_t, a_t)$ is the estimated action-value function at time t. Since $r_t$ is the true reward obtained at time t, it is considered more reliable than the estimate $Q(s_t, a_t)$. Therefore, the goal is to approximate the action-value function estimated at time t to ${r}_{t}+\gamma \cdot Q({s}_{t+1},{a}_{t+1})$. The temporal difference error, ${r}_{t}+\gamma \cdot Q({s}_{t+1},{a}_{t+1})-Q({s}_{t},{a}_{t})$, is minimized using a loss function such as mean square error to improve the accuracy of the neural network’s estimation of the state value.

This reduces the size of the action space from $2^N$ to N, where N is the total number of IoT devices. The Q network outputs the action $a_t$ based on the current state $s_t$, which interacts with the FL environment. Subsequently, the state of the FL environment transitions from $s_t$ to $s_{t+1}$, and a scalar is returned to the agent as an immediate reward, $r_t$. A boolean variable “done” is defined as the termination flag for federated training. When the communication rounds of federated training reach the upper limit, “done” is set to true, and the training process is terminated. The tuple $\langle s_t, a_t, r_t, s_{t+1}, done \rangle$ is stored in the experience replay buffer as a record of the interaction between the agent and the environment. When the interaction data in the experience replay buffer reach a certain quantity, the Q network trains on the data from the buffer, updating the node selection strategy.
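The experience replay buffer described above is a bounded FIFO store of interaction tuples with uniform random sampling. A minimal self-contained Python sketch (capacity and class name are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: stores <s_t, a_t, r_t, s_{t+1}, done> tuples
    and serves uniformly sampled minibatches for Q network training."""

    def __init__(self, capacity: int = 10_000):
        # deque with maxlen silently drops the oldest tuple when full
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Training only begins once `len(buffer)` exceeds the minimum sample size (mBatch = 64 in the simulation settings).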

First, the state $s_t$ and action $a_t$ are inputted into the Q network, yielding the action value at time t, denoted as $Q(s_t, a_t)$. Then, the next state $s_{t+1}$ is inputted into the Q network to obtain the q values for different actions, and the action corresponding to the maximum q value, denoted as $a_{maxq}$, is selected. Next, $s_{t+1}$ is inputted into the target Q network to find the q value corresponding to the action $a_{maxq}$, denoted as $Q^{target}(s_{t+1}, a_{maxq})$. Finally, $Q(s_t, a_t)$ is used as the predicted value of the network, and $r_t + \gamma \cdot Q^{target}(s_{t+1}, a_{maxq})$ is used as the target value of the network. Mean square error is used as the loss function to perform backpropagation on $r_t + \gamma \cdot Q^{target}(s_{t+1}, a_{maxq}) - Q(s_t, a_t)$. To ensure the stability of the training process, in practice, only the parameters of the Q network are updated, while the weight parameters of the target Q network are fixed. The detailed training flow of the LCNSFL node selection algorithm is as follows (Algorithm 1):

Algorithm 1. The training process of the FL node selection algorithm based on LCNSFL.

Input: Q network Q, target Q network Q_{tar}, target network update frequency f_{tar}, greedy factor e, greedy factor decay factor β, minimum sample size of the experience pool mBatch, maximum communication rounds of FL T, total number of devices N.

Output: The trained Q network Q and the trained target Q network Q_{tar}.

1: Initialize the local models of the devices, initialize the Q network as Q, initialize the target Q network as Q_{tar}, and set the step counter step = 0.

2: for t = 1 to T do

3: Collect the state s(t)

4: done = False

5: Generate a random number rd

6: if rd < e then

7: a(t) = random(0, N)

8: else

9: a(t) = $\underset{a(t)\in A}{\mathrm{arg\,max}}\,Q(s(t),a(t))$

10: end if

11: The edge server selects a device based on the action a(t), performs local model training on the selected device, and updates the global model.

12: Compute the instantaneous reward r(t) based on Formula (19) and update the state from s(t) to s(t+1).

13: if t = T then

14: done = True

15: end if

16: Store the tuple <s(t), a(t), r(t), s(t+1), done> in the experience pool.

17: if the number of samples in the experience pool > mBatch then

18: Randomly sample mBatch samples from the experience pool

19: Use the Q network to estimate the q value Q(s(t), a(t)) at time t.

20: Use the Q network to estimate the q values at time t+1 and obtain the action a_{maxq} corresponding to the maximum q value.

21: Use the target Q network to estimate the action value Q_{tar}(s(t+1), a_{maxq}) at time t + 1.

22: Optimize the Q network based on the stochastic gradient descent (SGD) method using Formula (23).

23: Update the greedy factor $e=e\cdot {e}^{-\beta}$.

24: if step % f_{tar} = 0 then

25: Update the parameters of the target Q network Q_{tar} using the parameters of the Q network.

26: end if

27: step = step + 1

28: end if

29: end for

30: return the trained Q network Q and the trained target Q network Q_{tar}.
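The greedy-factor action selection and decay used in Algorithm 1 (steps 5–10 and 23) can be sketched in Python. This is an illustrative sketch with assumed names; it takes the conventional ε-greedy direction, exploring when the random draw falls below the greedy factor e, so that exploration decreases as e decays.

```python
import math
import random

def select_action(q_values, e: float, n_devices: int) -> int:
    """Greedy-factor action selection: explore with probability e,
    otherwise take the action with the maximum q value."""
    if random.random() < e:
        return random.randrange(n_devices)       # explore: random device
    return max(range(n_devices), key=lambda a: q_values[a])  # exploit: argmax

def decay(e: float, beta: float) -> float:
    """Greedy factor decay after each training step: e <- e * exp(-beta)."""
    return e * math.exp(-beta)
```

With the simulation settings (initial greedy factor 1, decay factor β = 0.01), exploration fades gradually over the 500 training steps.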

In the DDQN, the action with the maximum q value at state $s_{t+1}$ is selected by the Q network, denoted as $a^*$. The value of action $a^*$ at state $s_{t+1}$ is then estimated using the target Q network. The calculation of the TD target in the DDQN is expressed as follows:

$$y_t = r_t + \gamma \cdot Q^{target}\left(s_{t+1}, \underset{a}{\mathrm{arg\,max}}\, Q(s_{t+1}, a; \theta); \theta^{-}\right)$$

During training, the parameters $\theta^{-}$ of the target Q network are frozen, meaning that the target Q network is not updated. The parameters of the Q network, denoted as θ, are copied to $\theta^{-}$ every $f_{req}$ iterations. The loss function of the Q network is defined as follows:

$$L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^{2}\right]$$
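The double-DQN TD target described above decouples action selection (online Q network) from action evaluation (target Q network). A minimal numpy sketch, assuming the two networks' q-value outputs for $s_{t+1}$ are given as vectors; the function name is illustrative:

```python
import numpy as np

def ddqn_target(r_t: float, q_next_online: np.ndarray, q_next_target: np.ndarray,
                done: bool, gamma: float = 0.9) -> float:
    """Double-DQN TD target: the online Q network picks the action,
    the frozen target Q network evaluates it."""
    if done:
        return r_t  # terminal state: no bootstrapped future value
    a_maxq = int(np.argmax(q_next_online))      # action selection: online net
    return r_t + gamma * q_next_target[a_maxq]  # action evaluation: target net
```

Using the target network only for evaluation mitigates the overestimation bias of vanilla DQN, where a single network both selects and evaluates the maximizing action.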

## 4. Performance Evaluation

#### 4.1. Experimental Environment Configuration

#### 4.1.1. Dataset and Data Allocation Method

#### 4.1.2. Simulation Parameter Settings

#### 4.1.3. Comparative Algorithms

#### 4.2. DDQN Model Training and Effect Analysis

#### 4.3. Comparison of Training Costs for Different Equipment Selection Strategies

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References


| Abbreviation | Definition |
|---|---|
| 6G | Sixth Generation |
| AI | Artificial Intelligence |
| Bnq | Best network quality device selection strategy |
| DDPG | Deep Deterministic Policy Gradient |
| DDQN | Double Deep Q-Network |
| DRL | Deep Reinforcement Learning |
| ES | Edge Server |
| FL | Federated Learning |
| IoT | Internet of Things |
| LCNSFL | Low-Cost Node Selection in Federated Learning |
| MDP | Markov Decision Process |
| MEC | Mobile Edge Computing |
| RLI | Reinforcement Learning Intelligence |
| SAGIN | Space-Air-Ground Information Network |
| SatCom | Satellite Communication |
| UAV | Unmanned Aerial Vehicle |

| Variable | Definition |
|---|---|
| ${c}_{i}$ | The number of CPU cycles required for device i to train a single sample |
| ${d}_{i}$ | The number of samples that device i has |
| ${f}_{i}$ | The CPU frequency of device i |
| $\tau$ | The number of local iterations |
| ${N}_{0}$ | The noise power |
| $k$ | The communication round number |
| ${B}_{i,k}$ | The bandwidth allocated to device i in the k-th communication round |
| ${p}_{i,k}$ | The transmission power of device i |
| ${g}_{i,k}$ | The channel gain of device i |
| ${b}_{i}^{k}$ | The data transmission rate of device i in the k-th communication round |
| ${w}_{i}^{k}$ | The local model of device i |
| $\sigma$ | The effective capacitance coefficient related to the computing chipset |

| Parameter Type | Parametric Description | Setting |
|---|---|---|
| Local model and model parameters | Number of local iterations | 3 |
| | Local dataset size | 500 |
| | Number of fully connected layers | 3 |
| | Activation function | ReLU |
| | Learning rate | 0.01 |
| | Optimizer | SGD |
| DDQN parameters | Training steps | 500 |
| | Reward discount factor | 0.9 |
| | Q network learning rate | 0.001 |
| | Greedy factor | 1 |
| | Greedy factor decay factor | 0.01 |
| | Minimum sample size (mBatch) | 64 |
| | Q network linear layer count | 2 |
| | Q network hidden layer dimension | 128 |
| | Target Q network update frequency (freq) | 20 |
| | Weighting factor (λ) | 0.6 |
| Device attributes | Number of devices | 50 |
| | Path loss index (η) | 3 |
| | A | 100 |
| | Model size | 10 MB |
| | Node bandwidth | 6~8 Mbps |
| | Slow node bandwidth | 0.1~0.3 Mbps |
| | Server bandwidth | 100 Mbps |
| | Transmission power | 0.2~0.4 W |
| | Working frequency | 500~900 MHz |
| | Energy consumption per bit for training | 0.02~0.04 J |
| | Cycles required to train one bit of data | 6000~7000 |

| Algorithm Name | Time Cost (s) | Energy Cost (J) | Weighted Cost |
|---|---|---|---|
| LCNSFL | 2821.1 | 22,816.5 | 12,818.0 |
| Random Selection | 7769.0 | 30,449.0 | 19,109.1 |
| Bnq Selection | 4879.0 | 26,587.8 | 15,733.4 |


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Wang, W.; Li, S.; Zhang, J.; Shan, D.; Zhang, G.; Gao, X.
A Node Selection Strategy in Space-Air-Ground Information Networks: A Double Deep Q-Network Based on the Federated Learning Training Method. *Remote Sens.* **2024**, *16*, 651.
https://doi.org/10.3390/rs16040651
