The Framework of 6G Self-Evolving Networks and the Decision-Making Scheme for Massive IoT

Abstract: The rapidly growing number of device connections will transform the Internet of Things (IoT) into the massive IoT. The use cases of massive IoT include the smart city, digital agriculture, smart traffic, etc., in which the service requirements differ and even change constantly. To fulfill these requirements, the networks must be able to automatically adjust the network configuration, architecture, resource allocation, and other network parameters according to the scenario, which is beyond the abilities of the fifth generation (5G) networks. Moreover, the sixth generation (6G) networks are expected to have endogenous intelligence, which can well support the massive IoT application scenarios. In this paper, we first propose the framework of the 6G self-evolving networks, in which autonomous decision-making is one of the vital parts. Then, we introduce the autonomous decision-making methods and analyze the characteristics of the different methods and mechanisms for 6G networks. To demonstrate the effectiveness of the proposed framework, we consider one of the typical scenarios of massive IoT and propose an artificial intelligence (AI)-based distributed decision-making algorithm to solve the problem of the offloading policy and the network resource allocation. Simulation results show that the proposed decision-making algorithm with the self-evolving networks can improve the quality of experience (QoE) with lower training complexity.

user experience rates, ultra-low latency, and so on. Compared to the eight representative key performance indicators (KPIs) of 5G defined by the International Telecommunication Union (ITU) [3], 6G is expected to satisfy higher requirements. According to [4], the peak data rate of 6G is expected to reach 1 Tbps, and the user experience data rate is expected to reach 10-100 Gbps. The energy efficiency (EE) and spectrum efficiency (SE) are expected to reach 10-1000 times and 3-5 times those of 5G, respectively. Furthermore, 6G is expected to provide 0.1 ms over-the-air latency, enable high mobility up to 500 km/h, and support a connection density of up to $10^7$ devices/km$^2$. In addition to the above traditional KPIs, some new KPIs will appear in the 6G era, such as computing power for compute first networking (CFN), security level for deterministic networks [5], and intelligence level for intelligent networking, which are to be defined in follow-up research. For clarity, we give the quantitative comparison of the KPIs between 5G and 6G in Table 1. In some massive IoT applications, such as super-smart vehicles and extended reality (XR), the service requirements are constantly changing, and it is of crucial importance to achieve low latency. Traditional methods of manually configuring the network cause large delay and extremely high complexity. On the other hand, driven by the rapid development of mobile Internet, big data, super-computing, sensor networks, brain science, and other fields, as well as the strong demands of economic and social development, AI technology is also accelerating its development [6]. Moreover, industry and academia have reached a consensus that 6G networks are expected to have endogenous intelligence. In short, the 6G network with endogenous intelligence will be well suited for massive IoT. In fact, network intelligence is not a new concept.
In the 4G era, a self-organizing network (SON) was proposed to realize the partial network intelligence [7]. By means of self-configuration, self-healing, and self-optimizing, SON reduced the operator resource expenditure and operating expenditure to a certain extent. Additionally, research on network intelligence has been gradually developed in the 5G era because of the higher and higher service requirements and the increasingly complex networks. The integration of AI and mobile communication wireless technology is one of the main research directions in the field in the 5G system. Various standardization organizations play extremely important roles in the promotion of 5G network intelligence. 3GPP SA2 established the research project "Study of enablers for Network Automation for 5G (eNA)" and enhanced the network data analytic functions (NWDAF) of the 5G core network in May 2017 [8]. In February 2017, the European Telecommunications Standards Institute (ETSI) also formally approved the establishment of a new industry standard group, "Experiential Networked Intelligence (ENI)" [9]. ITU-T established the "Machine Learning for Future Networks including 5G (FG-ML5G)" working group during the SG13 meeting held in November 2017.
Moreover, there are many studies about the integration of AI and 5G in academia, which focus on the intelligent resource allocation [10][11][12][13][14][15][16], intelligent network slicing [17][18][19][20][21][22], intelligent network operation and management [23][24][25][26][27][28], etc. From the above survey, we can find that the integration of AI and 5G just focuses on applying AI techniques in the 5G system to enhance some network functions and only achieve partial network intelligence to a certain extent, since the wireless network environment is constantly changing, and the deployed AI components need to be continuously trained and updated, which will hinder the intelligentization of the network. It is expected that AI will be embedded in future networks.
In a word, the 6G system will undergo unprecedented innovations compared to the 5G system. AI abilities are expected to be endogenous in the 6G system. Table 2 gives the differences of AI abilities in 5G and 6G. The 6G network with the endogenous intelligence can realize autonomous sensing, autonomous decision-making, and autonomous control, which can better satisfy the service requirements of the massive IoT scenarios. In terms of intelligent networking for 6G, much research has also been conducted. For example, Huawei proposed the intent-driven network, which was expected to adjust the network configuration based on the prediction of the user intent [29]. S. Wang et al. [30] proposed a distributed and autonomous network architecture for 6G, which was enabled by pervasive distributed intelligence. M. Peng et al. [31] proposed an extreme-intelligent and extreme-concise system architecture of radio access networks to fulfill the requirements of ultra-high data rates and ultra-low latency in 6G networks. T. Zhang et al. [32] further introduced the concept of 6G endogenous intelligence networks which was driven by artificial intelligence and discussed the characteristics and the key technologies of the endogenous intelligence networks. H. Yang et al. [33] presented an AI-enabled intelligent architecture for 6G, which consisted of the following four layers: sensing layer, data mining and analytics layer, control layer, and application layer. Furthermore, the proposed architecture can realize smart resource management, automatic network adjustment and intelligent service provisioning with a high level of intelligence. Considering the typical scenarios in 6G, N. Kato et al. [34] proposed the deep-learning-based path selection to optimize the performance of Space-Air-Ground integrated networks. Y. Xiao et al. proposed the self-learning AI architecture for 6G edge intelligence [35]. 
Analyzing the above works, we found that the existing research on 6G intelligent networking is either not completely free from manual intervention or gives little consideration to the self-update of the AI models and the self-evolution of the networks.
However, in some massive IoT scenarios, the number of active devices and the service requirements are constantly changing, which requires the network to be scalable and evolvable. Thus, we propose the framework of self-evolving networks, a closed-loop framework with the abilities of autonomous sensing, autonomous decision-making, and autonomous control, in which the AI models update autonomously to support the evolution of the networks. Table 3 gives the differences between our work and the existing works. It can be seen that our proposed framework is more suitable for the massive IoT scenarios.

Table 3. The differences of AI among the existing intelligent networks and our proposed 6G self-evolving networks.

| Literature | Architecture | Characteristic |
|---|---|---|
| [29] | Intent-driven network | The network is driven by the user intent; in other words, human involvement is still required. |
| [30] | Distributed autonomous network | The authors proposed the design principles of 6G networks, which consist of highly distributed, flat, and fully autonomous characteristics, but they did not expound how to realize the native AI. |
| [31] | Intelligent-concise radio access networks | The authors proposed to integrate AI with 4C (communications, computing, caching, and control). |
| [32] | Endogenous intelligence network | The authors proposed to introduce AI into all layers of the network. |
| [33] | AI-enabled intelligent architecture | The authors did not concentrate on the evolution of the AI abilities. |
| [34] | Space-Air-Ground integrated networks | The authors concentrated only on the path selection of Space-Air-Ground integrated networks. |
| [35] | Self-learning AI architecture | The authors mainly focused on 6G edge intelligence. |
| Our work | Self-evolving network | We propose the closed-loop framework, which has the abilities of autonomous sensing, autonomous decision-making, and autonomous control and can realize self-evolution without human involvement. |
There are many studies on autonomous decision-making schemes. For example, reference [36] discussed the cell association problem in ultra-dense networks and proposed Q-learning- and deep Q network (DQN)-based intelligent fast cell association algorithms. As for the task offloading and resource allocation problem, reference [37] proposed reinforcement learning (RL)-based joint task offloading and migration schemes to raise the total revenue of mobile users. Reference [38] proposed DQN- and deep deterministic policy gradient (DDPG)-based task offloading and resource allocation to reduce the sum cost of tasks. Reference [39] proposed a multi-agent deep reinforcement learning (MADRL)-based joint bit rate selection and radio resource allocation scheme in fog-computing-based radio access networks. Reference [40] proposed a DDPG-based computing offloading algorithm to reduce the total system delay cost, and reference [41] proposed a DRL-based joint resource management algorithm to improve the communication and computing efficiency in maritime networks. Additionally, many researchers focus on autonomous decision-making in vehicular networks: reference [42] proposed a reinforcement learning-based task offloading scheme to optimize the price decision and computing resource allocation in vehicle platoon networks, and references [43,44] proposed DDPG algorithms for decision-making in vehicular networks. However, few works focus on the supporting network and the distributed decision-making schemes for massive IoT scenarios. In this paper, we propose the framework of 6G self-evolving networks, which have the abilities of autonomous sensing, autonomous decision-making, and autonomous control and can realize self-evolution without human involvement. Then, we propose a distributed dueling double DQN (D3QN)-based decision-making algorithm to obtain the optimal task offloading and resource allocation policy.
Compared with Q-learning, DQN and double DQN (DDQN) algorithms, and other centralized algorithms, our method can avoid the overestimation of the Q value, and the distributed decision-making mechanism is more scalable.
In the rest of the paper, we first propose the framework of 6G self-evolving networks and introduce the common decision-making mechanism in Section 2. Then, we propose the distributed decision-making scheme used in massive IoT, including the system model, the algorithm, and the simulation results in Section 3. Section 4 shows the use cases of the proposed framework of 6G self-evolving networks. Finally, the conclusions and the future direction are given in Section 5.

Edge-Computing-Based Framework
In order to fulfill the different requirements of the constantly emerging new services in massive IoT, 6G networks are expected to be deeply integrated with AI and to be able to autonomously adjust the network configuration, architecture, and other parameters to achieve the best match between the networks and the services. In this part, we propose the edge-computing-based framework 6G self-evolving network for massive IoT in Figure 1. As shown in the figure, the cloud server is deployed to manage the edge devices, and the mobile edge computing (MEC) servers are deployed to realize the data collection, computing, and communication, which can greatly reduce the response latency of the massive IoT devices. Benefiting from the development of sensors, the future IoT devices will evolve to the agents with the abilities of sensing and computation. Moreover, the AI components can be deployed in MEC servers, cloud servers, and the control center to realize autonomous decision-making and autonomous configuration. The self-evolving network includes four stages, which are autonomous sensing, autonomous decision-making, autonomous configuration, and evaluation.

Autonomous Sensing
In the stage of autonomous sensing, benefiting from the agents, the network can sense parameters of the network environment, the service, etc. The dynamic sensing model based on AI techniques (such as DRL) is designed to achieve user-centric dynamic sensing of the network data and the service requirements. Traditional sensing schemes collect large amounts of data about the environment, network, user, and service at a fixed time and frequency. In contrast, AI-based sensing can adjust the set of sensed network parameters for different scenarios according to the feedback of the autonomous decision-making stage and the evaluation stage, which avoids sensing unnecessary network parameters, avoids wasting resources, and improves the level and efficiency of sensing intelligence.

Autonomous Decision-Making
In the stage of autonomous decision-making, considering the scalability and flexibility of massive IoT scenarios, some distributed learning models (such as multi-agent reinforcement learning (MARL)) should be designed to realize dynamic decision-making. First, we should judge whether the network can fulfill the service requirements, decide the direction of the network evolution, and quantify the difference between the target network and the current network. On this basis, the policy set is acquired based on the real-time sensing and analysis of the different network environments and service requirements, and the network can intelligently choose the best policy based on the prediction of the future service requirements. The output of the decision-making is used to perform the network configuration and is also fed back to the network sensing.

Autonomous Configuration
In the stage of autonomous configuration, based on the output of the stage of autonomous decision-making, the network can autonomously configure the network architecture, parameters, resources, etc., to ensure and optimize the user experience.

Evaluation
In the stage of evaluation, the status of network operation and the QoS of the users are evaluated to update the network environment and the ML models, as shown by the blue dotted line in Figure 1. The networks will evolve autonomously as the ML models are continuously updated.
From the above framework, it can be seen that the decision-making scheme is one of the important parts of achieving the self-evolution of the network, and we will focus on the decision-making scheme in this paper.
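The four stages can be read as one closed loop. The following is a minimal sketch of that loop, not the authors' implementation: the function names (`sense`, `decide`, `configure`, `evaluate`, `run_loop`), the single sensed parameter, and the toy latency model (latency inversely proportional to the allocated CPU units) are all assumptions made for illustration, and the update of the ML models in the evaluation stage is omitted.

```python
def sense(environment, sensing_set):
    """Autonomous sensing: read only the parameters in the current sensing set."""
    return {name: environment[name] for name in sensing_set}


def decide(observation, target_latency_ms):
    """Autonomous decision-making: quantify the gap to the target, pick a policy."""
    gap = observation["latency_ms"] - target_latency_ms
    if gap > 0:
        return {"policy": "scale_up", "gap_ms": gap}
    return {"policy": "hold", "gap_ms": gap}


def configure(config, decision):
    """Autonomous configuration: apply the chosen policy to the network config."""
    new_config = dict(config)
    if decision["policy"] == "scale_up":
        new_config["edge_cpu_units"] += 1
    return new_config


def evaluate(environment, config):
    """Evaluation: toy model (assumption), latency = work / allocated CPU units."""
    new_env = dict(environment)
    new_env["latency_ms"] = environment["work_ms"] / config["edge_cpu_units"]
    return new_env


def run_loop(environment, config, target_latency_ms, rounds=10):
    """Repeat sense -> decide -> configure -> evaluate until the target is met."""
    sensing_set = ["latency_ms"]
    for _ in range(rounds):
        observation = sense(environment, sensing_set)
        decision = decide(observation, target_latency_ms)
        if decision["policy"] == "hold":
            break
        config = configure(config, decision)
        environment = evaluate(environment, config)
    return environment, config


if __name__ == "__main__":
    env0 = {"work_ms": 40.0, "latency_ms": 40.0}
    print(run_loop(env0, {"edge_cpu_units": 1}, target_latency_ms=10.0))
```

In this toy run the loop stops as soon as the evaluated latency meets the target, which mirrors the idea that the evaluation result drives the next round of sensing and decision-making.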

Preliminaries for Decision-Making
In this section, we briefly introduce the decision-making methods. The decision-making methods can be roughly divided into two categories. One is the traditional approach based on numerical optimization, such as the fuzzy decision-making method, the game-theory-based decision-making method, etc. The other is based on AI, which can be divided into the following three categories according to the different training and decision-making mechanisms: centralized training and centralized decision-making, centralized training and distributed decision-making, and distributed training and distributed decision-making.

Centralized Training and Centralized Decision-Making
In the centralized training and centralized decision-making, the central node is set to manage all the nodes. The central node analyzes the whole data and jointly optimizes the policies of all the nodes and then sends the policies to the managed nodes. Therefore, on one hand, the central node needs to collect the information of the other nodes, which may cause the great overhead and processing delay. On the other hand, with the increase in the managed nodes, the computational overhead will become very large and unbearable. In addition, when there are new managed nodes added, the ML model should be updated, which reveals that the centralized training and centralized decision-making scheme is non-scalable.

Centralized Training and Distributed Decision-Making
As for the centralized training and distributed decision-making, the central node also needs to collect the information of the other nodes and complete the training of the AI models; then, it sends the trained AI models to the other nodes. The other nodes can thus achieve the real-time decision-making with the trained AI models. Compared to the centralized training and centralized decision-making, the centralized training and distributed decision-making can reduce the processing delay once the ML models are trained, which is more suitable for the delay-sensitive massive IoT scenarios.

Distributed Training and Distributed Decision-Making
With regard to the distributed training and distributed decision-making, the central nodes are not needed anymore, and the distributed nodes independently implement the decision-making according to the local information. Therefore, the distributed nodes only need to exchange a small amount of information with other nodes, or even no information at all, which greatly reduces the processing delay. In addition, the distributed training and distributed decision-making scheme remains scalable when new nodes are added.
For clarity, we succinctly summarize the pros and cons of the centralized and distributed decision-making mechanisms in Table 4.

The Distributed Task Offloading Scheme for Massive IoT
In some massive IoT scenarios, since the number of devices is huge and new terminal devices are constantly added, the centralized training and centralized decision-making scheme will cause large overhead. In this section, we concentrate on the distributed decision-making schemes in the specific cloud-edge-device scenario.

System Model
Considering the cloud-edge-device network framework in Figure 1, we pay attention to the task offloading and the resource allocation problem. We consider the network where one cloud server and $N$ edge servers, denoted as $\mathcal{N} = \{1, 2, \dots, N\}$, are deployed. $K$ users are active and generate $J$ service requests in the network, which can be respectively denoted as $\mathcal{K} = \{1, 2, \dots, K\}$ and $\mathcal{J} = \{1, 2, \dots, J\}$. Assume that one user generates only one service at a certain moment, and let $(D_j(t), F_j(t), T_j(t))$ denote the attributes of the service $j$ at time $t$, where $D_j(t)$ is the data size, $F_j(t)$ is the size of the computing task, and $T_j(t)$ indicates the maximum tolerable delay of the service $j$.

Transmission Delay
Define $b_{j,m}(t) \in \{0, 1\}$, $m \in \{0, 1, 2, \dots, N\}$, to indicate the connection relationship between the services and the servers, where $b_{j,0}(t) = 1$ indicates that the service $j$ is offloaded to the cloud server at time $t$, and $b_{j,m}(t) = 1$ with $1 \le m \le N$ indicates that the service $j$ is offloaded to the $m$th edge server at time $t$.
When the service is offloaded to an edge server, that is, $b_{j,m}(t) = 1$ with $1 \le m \le N$, the channel gain is denoted as $h_{j,m}(t)$, and the transmitting power is $p_{j,m}(t)$. Assuming that the transmitting channels are orthogonal and the noise obeys the Gaussian distribution $\mathcal{N}(0, \sigma^2)$, we can obtain the transmitting rate, which can be denoted as follows:

$$r_{j,m}(t) = W \log_2\left(1 + \frac{p_{j,m}(t)\, h_{j,m}(t)}{\sigma^2}\right), \tag{1}$$

where $W$ is the allocated bandwidth, which is assumed to be the same for all services. Therefore, the transmitting time can be denoted as follows:

$$T^{\mathrm{tr}}_{j,m}(t) = \frac{D_j(t)}{r_{j,m}(t)}. \tag{2}$$

When the service is offloaded to the cloud server, that is, $b_{j,0}(t) = 1$, we assume that the service is offloaded to the cloud server through the edge server, and the transmitting time can be denoted as

$$T^{\mathrm{tr}}_{j,0}(t) = \frac{D_j(t)}{r_{j,m}(t)} + T_r, \tag{3}$$

where $r_{j,m}(t)$ is the rate to the relaying edge server $m$, and $T_r$ is the time needed for transmission from the edge servers to the cloud server.
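As a numerical illustration of the transmission model described above, the sketch below assumes the standard Shannon-capacity form for the rate over an orthogonal Gaussian channel and treats the channel gain as a power gain. The function and parameter names are hypothetical and the numbers in the usage lines are arbitrary.

```python
import math


def transmit_rate(bandwidth_hz, power_w, channel_gain, noise_power_w):
    """Achievable rate in bit/s, assuming the Shannon-capacity form."""
    snr = power_w * channel_gain / noise_power_w
    return bandwidth_hz * math.log2(1.0 + snr)


def transmission_delay(data_bits, rate_bps, to_cloud=False, backhaul_delay_s=0.0):
    """Time to upload `data_bits`; a cloud offload adds the edge-to-cloud time."""
    delay = data_bits / rate_bps
    if to_cloud:
        delay += backhaul_delay_s
    return delay


if __name__ == "__main__":
    rate = transmit_rate(bandwidth_hz=1e6, power_w=3.0, channel_gain=1.0, noise_power_w=1.0)
    print(rate)
    print(transmission_delay(1e6, rate))
    print(transmission_delay(1e6, rate, to_cloud=True, backhaul_delay_s=0.1))
```

The `backhaul_delay_s` argument plays the role of the edge-to-cloud transmission time, so offloading to the cloud is never faster on the radio link than offloading to the relaying edge server.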

Computation Delay
Assume that $f_{j,m}(t)$ is the computing resource allocated to the $j$th service on server $m$. The computation delay can be denoted as

$$T^{\mathrm{c}}_{j,m}(t) = \frac{F_j(t)}{f_{j,m}(t)}. \tag{4}$$

Since the servers have limited computing resources, we can obtain the following constraint:

$$\sum_{j=1}^{J} f_{j,m}(t) \le C_m, \tag{5}$$

where $C_m$ ($m = 0, 1, 2, \dots, N$) denotes the computation resource of the cloud server and the edge servers.
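A minimal sketch of the computation model follows. It assumes the task size is measured in CPU cycles and the allocated resource in cycles per second, which the text does not specify; all function names are hypothetical.

```python
def computation_delay(task_cycles, allocated_cycles_per_s):
    """Delay of a task of `task_cycles` cycles at the allocated computing rate."""
    if allocated_cycles_per_s <= 0:
        raise ValueError("allocated computing resource must be positive")
    return task_cycles / allocated_cycles_per_s


def remaining_resource(capacity, allocations):
    """Computing resource left on one server after all current allocations."""
    return capacity - sum(allocations)


def respects_capacity(capacity, allocations):
    """True if the allocations on one server do not exceed its capacity."""
    return remaining_resource(capacity, allocations) >= 0


if __name__ == "__main__":
    print(computation_delay(task_cycles=2e9, allocated_cycles_per_s=1e9))
    print(respects_capacity(10.0, [4.0, 6.0]))
    print(respects_capacity(10.0, [4.0, 7.0]))
```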

QoE Model
From [45], the QoE for the user generating the jth service can be modeled as where α and β are the non-negative weights.

Problem Formulation
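In the massive IoT scenarios, users make their own decisions according to the local information without considering the policies of the other users. Thus, for the $j$th user, we aim to maximize the QoE by choosing the offloading scheme and adjusting the resource allocation, and the problem is formulated as follows:

$$\begin{aligned} \max_{b_{j,m}(t),\, f_{j,m}(t)} \quad & \mathrm{QoE}_j(t) \\ \text{s.t.} \quad & C_1: \sum_{m=0}^{N} b_{j,m}(t) = 1, \quad b_{j,m}(t) \in \{0, 1\}, \\ & C_2: \sum_{j=1}^{J} f_{j,m}(t) \le C_m, \quad m = 0, 1, \dots, N, \\ & C_3: T^{\mathrm{total}}_j(t) \le T_j(t), \end{aligned} \tag{7}$$

where $\mathrm{QoE}_j(t)$ is the QoE of the user generating the $j$th service, and $T^{\mathrm{total}}_j(t)$ is the sum of the transmission delay and the computation delay of the service $j$. $C_1$ indicates that the service $j$ can be offloaded to only one server, $C_2$ indicates the constraint on the computing resource of the servers, and $C_3$ indicates that the total delay should not exceed the maximum tolerable delay of the service $j$.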
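The three constraints can be checked mechanically for any candidate decision. The sketch below is an illustration only: `is_feasible` and `best_feasible` are hypothetical helpers, the utility attached to each candidate is a placeholder number rather than the QoE model of [45], and the exhaustive search merely stands in for the learning-based solution developed in the next section.

```python
def is_feasible(b, f, remaining_capacity, total_delay_s, max_delay_s):
    """Check C1 (one server), C2 (capacity) and C3 (delay) for one service.

    b[m] is the binary offloading indicator and f[m] the requested computing
    resource on server m (index 0 is the cloud); remaining_capacity[m] is the
    resource still free on server m.
    """
    one_server = all(x in (0, 1) for x in b) and sum(b) == 1               # C1
    fits = all(bm * fm <= cap for bm, fm, cap in zip(b, f, remaining_capacity))  # C2
    in_time = total_delay_s <= max_delay_s                                  # C3
    return one_server and fits and in_time


def best_feasible(candidates, remaining_capacity, max_delay_s):
    """Return the feasible candidate (b, f, delay, utility) with highest utility."""
    feasible = [
        c for c in candidates
        if is_feasible(c[0], c[1], remaining_capacity, c[2], max_delay_s)
    ]
    return max(feasible, key=lambda c: c[3], default=None)


if __name__ == "__main__":
    candidates = [
        ([0, 1, 0], [0.0, 2.0, 0.0], 0.4, 0.80),   # feasible
        ([1, 0, 0], [5.0, 0.0, 0.0], 0.9, 0.90),   # violates C3
        ([0, 0, 1], [0.0, 0.0, 4.0], 0.3, 0.95),   # violates C2
    ]
    print(best_feasible(candidates, [10.0, 3.0, 3.0], max_delay_s=0.5))
```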

The Distributed DQN-Based Algorithm
The above problem is non-convex since $b_{j,m}$ are binary variables. We then transform the problem into the Markov decision process (MDP), which is denoted by $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$.

State
The state set $\mathcal{S}$ describes the computation load and the remaining resource of the servers. At time $t$, the computation load of the server $m$ can be denoted as $L_m(t) = \sum_{j=1}^{J} b_{j,m}(t)$, the remaining resource is $F_m(t) = C_m - \sum_{j=1}^{J} f_{j,m}(t)$, and the state at time $t$ can be denoted as

$$s(t) = \{L_0(t), L_1(t), \dots, L_N(t), F_0(t), F_1(t), \dots, F_N(t)\}. \tag{8}$$
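The state can be assembled directly from the offloading indicators and the allocations. The sketch below follows the definitions of the load and the remaining resource given above; the function name `build_state` and the nested-list layout are assumptions made for illustration.

```python
def build_state(b, f, capacities):
    """Return [L_0, ..., L_N, F_0, ..., F_N] for J services and N + 1 servers.

    b[j][m] is 1 if service j is offloaded to server m (index 0 is the cloud),
    f[j][m] is the computing resource allocated to service j on server m,
    and capacities[m] is the total computing resource C_m of server m.
    """
    servers = range(len(capacities))
    load = [sum(row[m] for row in b) for m in servers]
    remaining = [capacities[m] - sum(row[m] for row in f) for m in servers]
    return load + remaining


if __name__ == "__main__":
    b = [[0, 1], [0, 1], [1, 0]]      # three services, cloud plus one edge server
    f = [[0, 2], [0, 3], [4, 0]]
    print(build_state(b, f, capacities=[10, 8]))
```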

Action
According to the current policy, the agent chooses an action from the action set. The action of the user generating the $j$th service, denoted as $a_j(t) \in \mathcal{A}_j$, can be described as

$$a_j(t) = \{b_{j,0}(t), b_{j,1}(t), \dots, b_{j,N}(t), f_{j,0}(t), f_{j,1}(t), \dots, f_{j,N}(t)\}. \tag{9}$$
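Value-based agents such as DQN output one Q value per discrete action, so the continuous allocation has to be mapped to a finite set. The sketch below shows one possible encoding, which is an assumption and not taken from the paper: each discrete index selects a server and one of a few allocation levels expressed as fractions of that server's capacity, and is decoded into the action vector described above.

```python
def num_actions(capacities, levels):
    """Size of the discretized action set: one entry per (server, level) pair."""
    return len(capacities) * len(levels)


def decode_action(index, capacities, levels):
    """Map a discrete index to [b_0, ..., b_N, f_0, ..., f_N].

    `levels` are hypothetical allocation fractions of the chosen server's
    capacity; server index 0 is the cloud.
    """
    num_servers = len(capacities)
    server, level = divmod(index, len(levels))
    if not 0 <= server < num_servers:
        raise IndexError("action index out of range")
    b = [0] * num_servers
    f = [0.0] * num_servers
    b[server] = 1                                   # offload to exactly one server
    f[server] = levels[level] * capacities[server]  # requested computing resource
    return b + f


if __name__ == "__main__":
    print(decode_action(3, capacities=[8.0, 4.0, 4.0], levels=[0.25, 0.5]))
```

A finer grid of levels reduces the discretization bias at the cost of a larger output layer.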

State Transition Probability
The state transitions to the next state when the action is taken, and we can use the state transition probability to describe the process, which can be denoted as follows:

$$P(s' \mid s, \mathbf{a}) = \Pr\{s(t+1) = s' \mid s(t) = s, \mathbf{a}(t) = \mathbf{a}\}, \tag{10}$$

where $\mathbf{a} = \{a_1, a_2, \dots, a_J\}$ denotes the set of actions of all the users generating services.

Reward
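The reward of the user generating the $j$th service is the QoE utility minus the action-selection cost, denoted as $\varphi_j(t)$:

$$R_j(t) = \mathrm{QoE}_j(t) - \varphi_j(t), \tag{11}$$

and the long-term reward can be formulated as follows:

$$V_j(s, \pi_j, \pi_{-j}) = \mathbb{E}\left[\sum_{\tau=0}^{\infty} \gamma^{\tau} R_j(t+\tau) \,\middle|\, s(t) = s, \pi_j, \pi_{-j}\right], \tag{12}$$

where $\pi_j(t)$ is the policy of the user generating the $j$th service at time $t$, $\pi_{-j}(t)$ denotes the policies of the other users, and $\gamma \in [0, 1]$ is the discount rate that determines the weight of the future reward. Notice that the above problem is a game process, and there exists a Nash equilibrium (NE) policy $\pi_j^*$, with the corresponding action $a_j^*$, for the user $j$, that is,

$$V_j(s, \pi_j^*, \pi_{-j}^*) \ge V_j(s, \pi_j, \pi_{-j}^*), \quad \forall \pi_j. \tag{13}$$

We first use the DQN to get the NE policy, and the target value can be denoted as

$$y_j = R_j + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; \theta^-). \tag{14}$$

In DQN, the neural network (NN) is used to approximate the Q-value function, and the loss function can be denoted as

$$L(\theta) = \mathbb{E}\left[\left(y_j - Q(s_t, a_t; \theta)\right)^2\right]. \tag{15}$$

Then, to solve the problem of overestimation in typical DQN, we use DDQN [46] to solve the game problem. The target can be denoted as

$$y_j = R_j + \gamma \hat{Q}\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta); \theta^-\right). \tag{16}$$

Finally, we further explore the utilization of D3QN [47] to solve the game problem. Different from DQN and DDQN, the output of the D3QN network consists of two parts, the state value $V(s)$ and the advantage value $A(s, a)$, and the target Q value can be denoted as

$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'), \tag{17}$$

where $V(s) = \mathbb{E}[R_j + \gamma V(s_{t+1})]$, and the last term, $\frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$, is the mean advantage over the entire action space. The training stage is completed in the centralized node and is described in Algorithm 1.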
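The two ingredients that distinguish D3QN from plain DQN, the dueling aggregation of state value and advantages and the double-DQN target, can be illustrated on plain lists of numbers. This is a sketch of the standard forms of these two operations with hypothetical function names; it contains no neural network, and the `done` flag for terminal transitions is an addition not discussed in the text.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q values.

    The mean advantage is subtracted so that V and A are identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def dqn_target(reward, gamma, q_target_next):
    """Plain DQN target: the target network both selects and evaluates."""
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, gamma, q_main_next, q_target_next, done=False):
    """Double-DQN target: the main network selects, the target network evaluates."""
    if done:
        return reward
    best = max(range(len(q_main_next)), key=q_main_next.__getitem__)
    return reward + gamma * q_target_next[best]


if __name__ == "__main__":
    print(dueling_q(1.0, [1.0, 2.0, 3.0]))
    print(dqn_target(1.0, 0.5, [4.0, 2.0]))
    print(double_dqn_target(1.0, 0.5, [0.1, 0.9], [4.0, 2.0]))
```

In the usage lines the main network prefers the second action, so the double-DQN target evaluates that action with the target network instead of taking the target network's own maximum, which is how the overestimation of the Q value is reduced.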

Require:
    Input the training steps $T$ and the episode number $T_{\mathrm{ep}}$
    Initialize the experience replay memory $D$
    Initialize the main network $Q(s, a; \theta)$ and the target network $\hat{Q}(s, a; \theta^-)$, and set $\theta^- \leftarrow \theta$
    Initialize the state $s_0$
1: while episode $\le T_{\mathrm{ep}}$ do
2:     Observe the current state $s_t$
3:     while step $t \le T$ do
4:         Select the action $a_t$ based on the $\epsilon$-greedy policy
5:         Obtain the current immediate reward $R(t)$
6:         Renew the state $s \to s'$
7:         Store the transition $(s, a_t, R(t), s')$ in $D$
8:         Sample a mini-batch of transitions from $D$
9:         Calculate the target Q value based on Equation (17)
10:        Calculate the loss based on Equation (15) and update $\theta$ by performing gradient descent to minimize the loss function
11:        Every $T_0$ steps, update the target network $\hat{Q}$ by setting $\theta^- \leftarrow \theta$
12:    end while
13: end while
Output the trained DQN and the optimal action $a^*$.
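The control flow of the training procedure (ε-greedy selection, replay memory, mini-batch updates, and periodic synchronization of the target network) can be sketched without a neural network. The example below deliberately swaps the deep network for a two-entry Q table on a single-state toy problem in which action 1 always yields reward 1 and action 0 yields reward 0. The toy reward, the annealing of the exploration probability from 0.9 to 0.1, and the hyperparameters other than the memory size of 500 are assumptions chosen so that the toy converges quickly.

```python
import random
from collections import deque


def train(episodes=200, steps_per_episode=20, gamma=0.9, lr=0.1,
          memory_size=500, batch_size=8, sync_every=50, seed=0):
    """Tabular stand-in for the double-DQN training loop on a toy problem."""
    rng = random.Random(seed)
    num_actions = 2
    q_main = [0.0] * num_actions          # stands in for Q(s, a; theta)
    q_target = list(q_main)               # stands in for the target network
    memory = deque(maxlen=memory_size)    # experience replay memory D
    total_steps = episodes * steps_per_episode
    step = 0
    for _ in range(episodes):
        for _ in range(steps_per_episode):
            # Epsilon-greedy selection with an assumed linear annealing schedule.
            epsilon = max(0.1, 0.9 * (1 - step / total_steps))
            if rng.random() < epsilon:
                action = rng.randrange(num_actions)
            else:
                action = max(range(num_actions), key=q_main.__getitem__)
            reward = 1.0 if action == 1 else 0.0   # toy reward (assumption)
            memory.append((action, reward))        # store the transition
            # Sample a mini-batch and move Q toward the double-DQN target.
            batch = rng.sample(list(memory), min(batch_size, len(memory)))
            for a, r in batch:
                best_next = max(range(num_actions), key=q_main.__getitem__)
                target = r + gamma * q_target[best_next]
                q_main[a] += lr * (target - q_main[a])
            step += 1
            if step % sync_every == 0:             # periodic target update
                q_target = list(q_main)
    return q_main, memory


if __name__ == "__main__":
    q_values, replay = train()
    print(q_values, len(replay))
```

In a full implementation the table update would be replaced by a gradient step on the loss of the Q network, and the transitions would also carry the state and the next state.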

Simulation Results
This part shows the simulation results. We consider a square area with a side length of 500 m. The $\epsilon$-greedy policy is used when choosing the action, and we set $\epsilon \in [0, 0.9]$. The learning rate is set to the empirical value 0.01, and the discount rate is set as $\gamma = 0.99$. The non-negative weights $\alpha$ and $\beta$ in Equation (6) are set to the empirical values 0.5 and 0.0005, respectively. The action-selection cost $\varphi_j$ in Equation (11) is set as 0.001. The size of the experience replay memory is 500.

Figure 2 gives the comparison of convergence performance among different algorithms. We can see that D3QN converges faster than the other algorithms, which shows that D3QN has the lowest training complexity and is more suitable for massive IoT scenarios.

Figure 3 gives the QoE utility of the different algorithms. The horizontal axis represents the number of users who are active and have service requests at a certain moment. Since Q-learning is the most basic reinforcement learning algorithm, and our D3QN-based offloading and resource allocation scheme is evolved from the DQN algorithm, we simply use the Q-learning algorithm and the DQN algorithm as the comparison methods, the same as reference [36]. From Figure 3, we can see that our proposed D3QN-based scheme achieves better performance in terms of the QoE utility, especially when the number of users is relatively large. The main reasons for this result are as follows: on one hand, the value of the resource allocation is continuous, and Q-learning needs to discretize non-discrete actions, which may cause some bias in decision-making; on the other hand, because of the overestimation of the Q value in DQN, the output policy of DQN may not be the optimal policy. From Figures 2 and 3, we can conclude that our proposed algorithm on the basis of 6G self-evolving networks can achieve better performance.
Because of the limited computing power of our personal computer, the number of users is set much smaller than in actual massive IoT scenarios, but we can obtain the same conclusion as above because the performance of the total QoE utility is not affected by the number of users. Moreover, in the actual execution of the proposed decision-making scheme, the centralized training is completed in the control center, and the terminal devices can realize real-time decision-making once the training is complete.

Use Cases
The proposed edge-based self-evolution framework and the autonomous decision-making algorithm can be applied in many massive IoT scenarios. In this section, we illustrate the application to the super-smart vehicle as an example.
Super-smart vehicle is one of the typical application scenarios of massive IoT in the 6G era, which is the upgrade of the current autonomous vehicles. Compared to the self-driving cars in the 5G era, super-smart vehicles will be more intelligent, which is reflected in the following aspects: (1) more diverse means of transportation will be used; (2) it is expected to realize point-to-point smart travel. On the other hand, the devices are always moving and the network environment is constantly changing, so it is necessary to embed the AI functions in 6G networks to realize the whole network intelligence.
Since the devices in super-smart vehicles have the features of high mobility and flexibility and the requirements for super-smart vehicles are constantly changing, it is necessary for the network to offer a flexible framework to support super-smart vehicles. By deploying AI components on the terminal devices and in the network, the super-smart vehicle systems have the abilities to autonomously sense the network environment (the states of cars, roads, people, etc.); to collect and process the traffic information, vehicle information and environmental information; and to analyze and predict traffic conditions, so that they can autonomously and quickly make decisions and control the transportation, which are fully supported by the proposed 6G self-evolving network. According to [48], the current network intelligence level is at L2-L3, and in future work, we will continue to discuss the application of the proposed 6G self-evolving network in super-smart vehicles and research the key technologies to realize the network intelligence level at L3-L4, and we expect to give the corresponding simulation results to prove it.

Conclusions and Future Work
To realize a future intelligent wireless network that satisfies the service requirements of the massive IoT application scenarios, it is necessary to embed the essential capabilities of AI into the wireless system. In this paper, we first propose the edge-computing-based self-evolution framework for 6G massive IoT, which is expected to realize autonomous sensing, autonomous decision-making, and autonomous configuration. Autonomous decision-making is one of the crucial parts of the proposed 6G self-evolution framework, so we introduce the decision-making mechanisms and analyze the pros and cons of the centralized decision-making schemes and the distributed decision-making schemes. Then, we consider the task offloading problem and propose the distributed D3QN-based algorithm. The simulation results show that the proposed D3QN-based scheme converges faster than the DQN- and DDQN-based algorithms and achieves better performance in terms of the QoE of users compared with the Q-learning and DQN algorithms.
Finally, there is still much research to be done in the future. For example, this paper concentrated on centralized training and distributed decision-making. As the network and devices for massive IoT are becoming increasingly dense, the complexity of centralized training will get higher and higher. Therefore, the distributed training and distributed decision-making schemes are being urgently researched, which is also part of our future work.
Author Contributions: The work presented in this paper corresponds to a collaborative development by all authors. X.S. defined the research line, B.L. developed the proposed algorithm and wrote the paper, and J.L. used the software to simulate the algorithm and analyze the simulation results. All authors have read and agreed to the published version of the manuscript.