An Intelligent Clustering-Based Routing Protocol (CRP-GR) for 5G-Based Smart Healthcare Using Game Theory and Reinforcement Learning

With advantages such as short and long transmission ranges, D2D communication, low latency, and high node density, the 5G communication standard is a strong contender for smart healthcare. Smart healthcare networks based on 5G are expected to have heterogeneous energy and mobility, requiring them to adapt to the connected environment. As a result, in 5G-based smart healthcare, building a routing protocol that optimizes energy consumption, reduces transmission delay, and extends network lifetime remains a challenge. This paper presents a clustering-based routing protocol to improve the Quality of Service (QoS) and energy optimization in 5G-based smart healthcare. QoS and energy optimization are achieved by selecting an energy-efficient cluster head (CH) with the help of game theory (GT) and selecting the best multipath route with reinforcement learning (RL). Cluster head selection is modeled as a clustering game with a mixed strategy that considers various attributes to find the equilibrium conditions. The distance between nodes, the distance between nodes and the base station, the remaining energy, and the mobility speed of the nodes are used to compute the cluster head (CH) selection probability. An energy-efficient multipath routing scheme based on reinforcement learning (Q-learning) is proposed. The simulation results show that our proposed clustering-based routing approach improves QoS and energy optimization compared to existing approaches. The average performances of the proposed schemes CRP-GR and CRP-G are 78% and 71%, respectively, while the existing schemes FBCFP, TEEN and LEACH have average performances of 63%, 48% and 35%, respectively.


Introduction
The Internet of Things (IoT) and 5G have increasingly been integrated into various facets of daily life, from smart cities to smart agriculture and from traditional to smart healthcare applications [1]. IoT and 5G-based systems enable the development of more accurate diagnostic tools, more effective treatment, and devices that improve quality of life. When IoT is used in a medical setting, it is referred to as the Internet of Medical Things (IoMT). IoMT has changed the medical field by enabling remote healthcare with social benefits, supporting disease diagnosis and patient monitoring with resource-efficient methods [2,3]. By utilizing pervasive computing methods based on the IoT, it is possible to monitor and control various things of importance in the medical domain, including medical devices, physician instruction, medication, drugs, and individuals. By integrating IoT and machine learning into remote healthcare monitoring, additional efficient medical-care methods can be discovered [4][5][6].
Advanced healthcare empowers telemedicine, telehealth, telesurgery and telerehabilitation, which permit remote monitoring and intensive care of subjects at hospitals/home [7][8][9][10]. The modern healthcare industry requires developing a network that integrates the human body and medical devices to form a body sensor network. Furthermore, the medical data are exchanged with the help of the IoMT framework to the medical cloud [11]. Figure 1 shows the generic architecture of smart healthcare based on 5G and IoMT. There are three major components in smart healthcare based on 5G and IoT: (a) cloud data center, (b) gateways (Base-stations) and (c) body sensor networks. IoT and 5G will play an important role in providing healthcare services to distant individuals (e.g., patients, physicians, and insurance companies). The information generated by the medical devices related to the human body can be provided to relatives and medical staff to check up on the patient anywhere, at any time. Furthermore, in 5G and IoT-based smart healthcare, gateways are utilized as central hubs between medical devices (i.e., sensors nodes) and a cloud data center. The gateway acts as a hub to collect data and perform computation in a network for health monitoring. In addition, the gateway connects the nodes present in the network to clinic sites. These features can be successfully utilized by equipping gateways with networking, processing, and appropriate intelligence to develop smart gateways for remote healthcare monitoring.
In 5G and IoT-based smart healthcare applications, the wearable/ambient nodes are constrained in resources, including battery, processing power, and memory. Therefore, designing a framework with energy-efficient communication for medical devices to improve the network lifetime is necessary. Clustering-based routing may be a viable technique for energy-efficient communication in WSN, since it manages the various devices in the network properly [12]. It plays a crucial part in reducing the number of nodes that take part in each transmission [13]. The medical nodes are organized into groups with the help of a clustering mechanism. Each group must have at least one central coordinator node designated as the Cluster Head (CH), and the remaining nodes are designated as cluster nodes (CN). All cluster nodes in the group transmit their medical information through the CH. If the cluster head (CH) selection is not optimal, additional communication may occur, resulting in increased energy consumption in a smart healthcare network.

Our Contribution
Several researchers have presented clustering-based routing schemes. However, all the existing approaches have different issues when considering 5G-based smart healthcare applications. Optimal cluster head selection and best routing path selection for data transmission are challenging tasks in such a dense, highly heterogeneous network with different energy levels, mobility speeds, and low transmission ranges. The main contributions of this research are:
1. A clustering-based routing protocol using game theory and reinforcement learning to reduce energy consumption and increase network lifetime for smart healthcare scenarios.
2. An algorithm to select the optimal cluster head (CH) from the available cluster heads (CHs), avoiding the occurrence of multiple cluster heads.
3. A reinforcement learning-based route selection algorithm for data transmission.
4. A comparison of the proposed method with existing approaches.
The rest of the paper is organized as follows. Section 2 presents a brief overview of existing protocols. Section 3 presents the network model, energy model, clustering game, and theoretical analysis. Section 4 details our clustering algorithm, and Section 5 details our routing algorithm. Section 6 analyzes the time complexity of the proposed algorithms. Section 7 discusses the results. Finally, the conclusion is presented in Section 8.

Literature Review
In recent years, IoT has been used for a wide range of applications, including smart healthcare, smart homes, and many others. Due to the limited capabilities of sensor nodes, utilizing the available resources efficiently is the main challenge for IoT-based systems. Therefore, several researchers have been motivated to design energy-efficient approaches for IoT-based WSN. In [14], a joint reliable and energy-efficient technique is presented, where a game-theoretical approach is used to provide secure communication in a wireless sensor network. The overhead associated with the trust-based technique is mitigated by employing a game-theoretical approach. The results show that the presented technique is suitable for IoT-based applications in terms of security and energy efficiency. In [15], an underwater channel model is presented. The proposed method does not consider channel-aware energy saving (i.e., duty cycle) for IoT-based smart healthcare. In [16], joint product life-cycle and IoT management algorithms are presented to save power and extend battery life. The approach is based on duty-cycle management and power transmission control to optimize battery charge consumption. The researchers did not consider energy parameters, wireless channels, and clustering techniques at the physical and network layers. In [17], a delay balancing approach is presented for data transmission and energy utilization in wireless sensor networks. Besides, a workload management policy is also considered for the network. In [18], a sensing layer-based technique is presented to analyze the power dissipation of numerous nodes in the network. Additionally, a novel architecture is presented based on base-station, sensing, and control layers. The study did not consider duty-cycle and channel-aware energy saving. In [19], a novel framework is presented for IoT-based applications.
The study focuses mainly on three layers: sensing, information processing, and control and presentation. However, the authors did not consider energy-saving strategies based on clustering or the use of multiple energy-optimizing parameters across the network.
In [20], Low Energy Adaptive Clustering Hierarchy (LEACH) is proposed, a well-known node selection and data transmission protocol for wireless sensor networks. In this protocol, a random technique is used to save CH energy in the network, and CH selection is based on standard rules governing how many times a node can serve as CH. In [21], a k-means clustering approach is presented for wireless sensor networks. The main issue in this approach is discovering the centroid vector, which separates the node groups and thus risks disconnecting connectivity. However, these approaches carry the overhead of cluster design and CH selection [22][23][24]. The overhead issues can be addressed with the help of medium access control (MAC). MAC introduces the concept of sleeping nodes when they have no data for transmission, and is divided into two categories: contention-free and contention-oriented [25]. Contention-oriented techniques suffer collisions when nodes attempt to use the channel concurrently. Additionally, packet loss increases as network and sensor density increase, so protocols with conflict in route selection are not suitable for dense networks. Therefore, time-division multiple access (TDMA) has been proposed to solve this issue [26,27]. TDMA uses a time-slot schedule that assigns an individual slot to each cluster member, resulting in increased network energy efficiency. Although the clustering algorithms described in [28,29] are based on TDMA, they do not take into account the possibility of data failure in the network. In addition, deploying clustering involves a planning overhead, which has an impact on network resources.
In [30], an interference-aware self-optimizing (IASO) scheme is proposed, which minimizes interference in the network. The technique has multichannel sensing ability and gain control capability. In [31], a greedy-model small-world (GMSW) approach is proposed for IoT-based applications to improve the robustness of the topological structure.
To our knowledge, no one has combined QoS and energy consumption challenges in a clustering-based routing model for 5G and IoT-based smart healthcare. Because of the variety of the network and the limited energy available to nodes, balancing QoS and energy consumption is a difficult challenge. Table 1 shows the summary of different clustering-based routing schemes.

Network Model
The proposed network model consists of base stations, sensor nodes and cluster heads (CHs). The end-user is connected to the base station through the internet. The sensor nodes in the network gather all the information needed and forward it to the base station through the cluster head (CH). Figure 2 shows the framework of clustering. The considered system model makes the following assumptions:
• All sensor nodes in the network are mobile after deployment.
• The three categories of nodes have different energy levels: we consider picocells as advanced nodes, femtocells as intermediate nodes, and the remaining nodes as normal nodes.
• All sensor nodes in the network have different mobility speeds and energy levels.
• The battery of each node has a different initial energy level, and it is neither rechargeable nor replaceable.
• All sensor nodes in the network have a unique ID.

Energy Model
Our proposed algorithm follows the first-order radio model [32][33][34][35] to handle energy dissipation, as shown in Figure 3. The energy consumed in transmitting and receiving "l" bits of data over a distance d is given in Equations (1)-(3), respectively:

E_Ta(l, d) = l · E_elec^Ta + l · ε_fs · d^2, if d < d_0 (1)
E_Ta(l, d) = l · E_elec^Ta + l · ε_mp · d^4, if d ≥ d_0 (2)
E_Ra(l) = l · E_elec^Ra (3)

where E_elec^Ra and E_elec^Ta are the energy consumed by the receiver and transmitter circuits per bit, and ε_fs and ε_mp are the free-space and multipath radio-mode amplification factors. The threshold distance d_0 can be obtained as d_0 = sqrt(ε_fs / ε_mp). Next, for the "l" data bits, the aggregate energy E_A is calculated as E_A = N · l · E_1, where "N" is the number of nodes in the cluster, "l" is the number of bits and E_1 is the aggregate energy of one bit. The total energy consumed by a node, E_c, is calculated as E_c = E_(i,CH) + E_(CH,sink) + E_A + E_m, where E_(i,CH) and E_(CH,sink) are the transmitting energy required from cluster member "i" to the CH and from the CH to the sink (base station), E_A is the aggregate energy of the data, and E_m is the mobility energy.
The distance between two nodes in the network is given by d = sqrt((x_i − x_(i+1))^2 + (y_i − y_(i+1))^2), where x_i and y_i are the coordinates of node "i", and x_(i+1) and y_(i+1) are the coordinates of the sink or a neighbour node. The parameters considered for CH election are the mobility speed of the nodes, the distance between the nodes and the base station, and the nodes' remaining energy.
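As an illustration, the radio model above can be sketched in Python. The constants are values commonly used with the first-order radio model in the LEACH literature, not parameters taken from this paper:

```python
import math

# Typical first-order radio model constants (assumed for illustration;
# the paper does not state its values in the text).
E_ELEC = 50e-9       # J/bit consumed by transmitter/receiver electronics
EPS_FS = 10e-12      # J/bit/m^2, free-space amplifier coefficient
EPS_MP = 0.0013e-12  # J/bit/m^4, multipath amplifier coefficient

def threshold_distance():
    """d_0 = sqrt(eps_fs / eps_mp): crossover between the two radio modes."""
    return math.sqrt(EPS_FS / EPS_MP)

def tx_energy(l_bits, d):
    """Energy to transmit l bits over distance d (Equations (1)-(2))."""
    if d < threshold_distance():
        return l_bits * E_ELEC + l_bits * EPS_FS * d ** 2
    return l_bits * E_ELEC + l_bits * EPS_MP * d ** 4

def rx_energy(l_bits):
    """Energy to receive l bits (Equation (3))."""
    return l_bits * E_ELEC

def distance(p, q):
    """Euclidean distance between node coordinates p = (x, y) and q."""
    return math.hypot(p[0] - q[0], p[1] - q[1])
```

With these constants, d_0 is roughly 87.7 m, so a 50 m hop uses the free-space term and a 100 m hop uses the multipath term.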

Cluster Game Modelling
The clustering game (CG) is a non-cooperative game used by the nodes to select the cluster head in the network. Every cluster in the network has a cluster head as a result of the clustering game. The cluster head collects data from the available nodes and sends it to the base station.
In the scenario where the nodes differ in their attributes, or perceive differences among one another, synchronization cannot be achieved. This leads to a cluster size of one, where each node in the network declares itself cluster head (CH). In such an equilibrium, the expected payoff of each node for being cluster head (CH) equals its expected payoff for not being cluster head (CH) [36].
The non-cooperative clustering game (CG) is defined as CG = {N, A, U}, where "N" is the number of nodes, "A" is the action set and "U" is the utility function. The proposed game for cluster head selection can be modeled as a mixed-strategy game with the following elements:
• Players: the N nodes.
• Actions: each player chooses between cluster head (CH) and non-cluster head (NCH).
• Utility: each player's utility is the value of its expected payoff function; a payoff of "0" means that no node declares itself cluster head (CH).
Regarding payoffs, if none of the players (i.e., nodes) in the network declares itself cluster head (CH), then the payoff is zero, since no player is able to send data to the base station. If at least one player declares itself cluster head (CH), then a non-CH player's payoff is z (i.e., the successful delivery of data to the base station). Finally, if a player declares itself cluster head (CH), its payoff z is reduced by the cost c of being cluster head (i.e., z − c).
For the analysis of possible equilibria in the two-node case, the expected payoffs of the two nodes (2 × 2) are presented in Table 2. The payoffs show that the game is symmetrical and depends only on the nodes' strategies. The profile (z − c_j, z − c_j) (i.e., both nodes declare themselves cluster head (CH)) is not a Nash equilibrium, because either node can obtain a better payoff by switching to non-cluster head (NCH) (i.e., z > z − c). Similarly, the profile (0, 0) is not a Nash equilibrium either, because any node would prefer to deviate and declare itself cluster head (CH), obtaining a positive payoff. The remaining profiles, (z − c_j, z) and (z, z − c_j), are Nash equilibria (i.e., one node declares itself cluster head (CH) and the other declares itself non-cluster head (NCH), such that neither node has an incentive to change its strategy). The utility function of this game selects an energy-efficient cluster head (CH) and is given as U_CG for node "i" as follows: U_CG(i) = z − c_i if node i declares itself CH; z if node i plays NCH and at least one other node declares itself CH; and 0 if no node declares itself CH.
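The equilibrium argument for the 2 × 2 game can be checked mechanically. The sketch below uses illustrative values z = 10 and c = 4 (any 0 < c < z yields the same equilibria) and brute-forces the unilateral-deviation test:

```python
# Illustrative payoffs for the 2x2 clustering game: z = reward for delivered
# data, c = cost of serving as CH (example values, 0 < c < z).
z, c = 10.0, 4.0

def payoff(my_action, other_action):
    """Payoff to one node given both pure actions ('CH' or 'NCH')."""
    if my_action == 'CH':
        return z - c                                # delivers data, pays CH cost
    return z if other_action == 'CH' else 0.0       # free-rides, or no CH at all

def is_nash(a1, a2):
    """A pure profile is Nash if neither node gains by deviating alone."""
    other = {'CH': 'NCH', 'NCH': 'CH'}
    return (payoff(a1, a2) >= payoff(other[a1], a2) and
            payoff(a2, a1) >= payoff(other[a2], a1))

profiles = {(a1, a2): is_nash(a1, a2)
            for a1 in ('CH', 'NCH') for a2 in ('CH', 'NCH')}
# Only the asymmetric profiles (one CH, one NCH) are pure Nash equilibria.
```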

Expected Payoff
To reach the equilibrium, the nodes play mixed strategies. This means any node may declare itself cluster head (CH) with probability p and non-cluster head (NCH) with probability 1 − p. Theorem 1. A mixed-strategy Nash equilibrium exists for the symmetrical clustering game, and the probability p of a player acting as cluster head (CH) in the equilibrium is given as p = 1 − (c/z)^(1/(N−1)). Proof. To find the Nash equilibrium in mixed strategies, which corresponds to the probability p of a node acting as cluster head (CH), we follow the methodology presented in [36]. First, we find the expected payoff of every available choice. When a node acts as cluster head (CH), the expected payoff is U_CH = z − c, which is independent of the other nodes' strategies. When a node plays non-cluster head (NCH), its data are delivered only if at least one of the remaining N − 1 nodes declares itself CH, so the expected payoff is U_NCH = z(1 − (1 − p)^(N−1)). The payoffs are equal in the equilibrium; therefore, no player has an incentive to change its strategy. Thus, z − c = z(1 − (1 − p)^(N−1)), and solving this equation for p yields the equilibrium probability.
Let us denote ω = c/z < 1. Figure 4 shows the probability values as the number of nodes increases for different values of the ω parameter (i.e., 0.05, 0.1, 0.3, 0.5, 0.7 and 0.9). As the number of players increases, the probability "p" decreases. When the attributes of the nodes (i.e., mobility speed, energy level and distance) are similar across the network, the equilibrium condition is given by the probability p = 1 − ω^(1/(N−1)). The probability of being cluster head (CH) is always 1 if N = 1, i.e., a lone node must always play cluster head (CH).
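A minimal sketch of this equilibrium probability, assuming the indifference-condition solution p = 1 − ω^(1/(N−1)) derived above:

```python
def ch_probability(N, omega):
    """Equilibrium probability of declaring CH in the symmetric game.

    omega = c/z < 1; derived from the indifference condition
    z - c = z * (1 - (1 - p)**(N - 1)).
    """
    if N == 1:
        return 1.0  # a lone node must always act as CH
    return 1.0 - omega ** (1.0 / (N - 1))

# A parameter sweep in the spirit of Figure 4: vary omega and the
# number of players.
sweep = {omega: [round(ch_probability(n, omega), 3) for n in (1, 2, 5, 20)]
         for omega in (0.05, 0.1, 0.3, 0.5, 0.7, 0.9)}
```

For any fixed ω, the probability is 1 at N = 1 and falls toward 0 as N grows, matching the discussion of the equilibrium behaviour.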
The average payoff of an arbitrary node "i" is its expected payoff under the mixed strategy. For the equilibrium strategy, both actions yield the same payoff, so the average payoff is P_NE = z − c. To understand the scenario with nodes of differing attributes (i.e., mobility speed, energy level and distance), consider the two-player game shown in Table 2. The probability of a node being cluster head (CH) for two nodes (n_1, n_2) in the mixed-strategy equilibrium is calculated as follows: let P_i be the probability of each node in the network with cost c_i, for i = 1, 2, 3, 4, . . . , n. On the basis of Equation (11), the probability of node n_1 being CH with cost c_1 is p_1 = 1 − c_2/z, and symmetrically p_2 = 1 − c_1/z, where c_1 and c_2 are the costs of being cluster head (CH). For "N" nodes with ω_i = c_i/z, the probability is specified as p_i = 1 − (1/ω_i)(∏_(j=1..N) ω_j)^(1/(N−1)). From these equations, it can be seen that when nodes with differing attributes are present in the network, each node's optimal probability depends on the costs of the other nodes.
Being a cluster head (CH) in the Nash equilibrium thus always depends on the cost of the neighbouring node "j". As "N" increases, the probability decreases, yet at least one node still declares itself cluster head (CH). Likewise, as "N" tends to 1, "p" tends to 1, and the node constantly declares itself cluster head (CH): lim_(N→1) p_n = 1.
The cost of being a cluster head (CH) is specified in terms of the following quantities: E_int(i) is the initial energy of node "i", E_c is the energy consumed by node "i" in data transmission to the base station, "D" is the distance between nodes, and M_s is the mobility speed of the node.
From the analysis of the clustering game, it is found that the benefit and cost do not depend on the value of "N". Although the probability decreases as "N" grows, at least one node is declared cluster head (CH) in equilibrium. Therefore, our proposed algorithm guarantees one cluster head (CH) per cluster, avoiding additional competition for cluster head (CH) selection.

Clustering Algorithm
In this section, we introduce a clustering algorithm based on game theory for 5G-based smart healthcare.

Initialization
Our proposed protocol consists of two phases, i.e., the setup phase and the steady-state phase. Cluster head (CH) selection and cluster formation are performed in the setup phase, while data transmission is performed in the steady-state phase. Initially, the number of members and the optimal number of clusters are obtained for the given "N" value. Then, each node in the network broadcasts a message to its neighbors for the nomination of the cluster head (CH). Finally, all the information is collected by the base station and saved.

Setup Phase
The probability P_k of node "i" in cluster "k" is calculated as in Equation (14). Each node's probability is compared with those of the other cluster nodes, and the node with the highest probability is selected as cluster head (CH). Once the election is complete, the cluster head (CH) broadcasts the "CH message" in the network along with its node ID and the "Join-Request" field set to "0". In response, the nodes send back the "Join-Response" field set to "1" to the cluster head (CH), along with the information <node position, remaining energy, speed, Node ID>, and declare themselves members of that cluster. A node may instead join a nearby cluster as a cluster member and withdraw its own nomination upon receiving the "CH message". Hence, the clusters in the network are formed.
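The setup-phase election can be sketched as follows. The scoring function is an illustrative stand-in for the paper's probability P_k (Equation (14) is not reproduced in the text), combining remaining energy, mobility speed and distance to the base station:

```python
def ch_score(node):
    """Illustrative stand-in for the election probability P_k: favour nodes
    with more remaining energy, lower speed, and a shorter distance to the
    base station. (This exact form is an assumption, not Equation (14).)"""
    return node['energy'] / (1.0 + node['speed'] + node['dist_to_bs'])

def elect_cluster_head(cluster_nodes):
    """Return the ID of the node with the highest election score; the
    remaining nodes join as cluster members."""
    return max(cluster_nodes, key=ch_score)['id']

cluster = [
    {'id': 'n1', 'energy': 0.9, 'speed': 2.0, 'dist_to_bs': 40.0},
    {'id': 'n2', 'energy': 0.8, 'speed': 0.5, 'dist_to_bs': 25.0},
    {'id': 'n3', 'energy': 0.4, 'speed': 1.0, 'dist_to_bs': 30.0},
]
head = elect_cluster_head(cluster)  # slowest, closest, well-charged node wins
```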

Steady-State Phase
When a node identifies an event in the network, the node transmits data to the cluster head (CH), and the cluster head (CH) transmits the data to the base station. In our proposed algorithm, reinforcement learning helps the cluster head (CH) and nodes to find the energy-efficient route for data transmission to the base station. It prevents the early death of cluster heads (CHs) and reduces traffic in the network. Due to node mobility, the network topology changes, leading to the deletion and addition of member nodes; i.e., control of a mobile node must be handed over to another cluster head (CH) immediately. Whenever a member node leaves its cluster region, the sink (base station) estimates the new position of the node and assigns it as a member node of a new cluster. The pseudo-code of the proposed algorithm is given in Algorithm 1.

Routing Algorithm
In this section, we introduce a reinforcement learning-based routing algorithm for 5G-based smart healthcare. Figure 5 shows the data processing in the network.

Q-Learning
Q-learning is a well-known reinforcement learning technique that selects the optimal action based on the current state and the delayed reward with the highest value, without using a specific model of the environment [37]. Three primary factors define the Q-learning formulation: s, the state (i.e., energy level and node position); v, the action (i.e., the choice among the available next-hop nodes); and R, the reward (i.e., successful data transmission, calculated by the reward function). Whenever an agent performs an action v, it instantly receives a reward R. Hence, the working procedure can be illustrated by the sequence (s_0, v_0, R_1, s_1, v_1, R_2, s_2, . . .): the agent receives reward R_1 when changing from state s_0 to state s_1 with action v_0, receives reward R_2 when it reaches state s_2 with action v_1, and so on. Furthermore, Q(s_t, v_t), called the Q-value, is the estimated value of the state-action pair. The relationship between Q-values and rewards is expressed by the update formula below.
Additionally, the core algorithm updates the Q-values by combining old and new information using Equation (18), given below:

Q_(t+1)(s_t, v_t) = (1 − α) Q_t(s_t, v_t) + α [R_(t+1) + γ · max Q_t(s_(t+1), v_t)] (18)
In Equation (18), α is the learning rate (0 ≤ α ≤ 1), s_(t+1) is the state given by the available next-hop nodes, γ is the discount factor (0 ≤ γ ≤ 1), and maxQ_t(s_(t+1), v_t) is the maximum estimated future reward. When α is set to 0, new information is ignored and only previous information is considered; when α is close to 1, only the new information is considered and past information is discarded. Furthermore, the γ factor weights upcoming rewards: when γ is set to 0, the agent considers only short-term rewards, and when γ is set to 1, the agent is interested in long-term rewards.
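The update in Equation (18) is the standard Q-learning step, which can be sketched as follows (the nested-dict table layout and the default α = γ = 0.9 are illustrative choices):

```python
def q_update(q, state, action, reward, next_state, alpha=0.9, gamma=0.9):
    """One Q-learning step, Equation (18):
    Q(s, v) <- (1 - alpha) * Q(s, v) + alpha * (R + gamma * max_v' Q(s', v'))."""
    best_next = max(q.get(next_state, {}).values(), default=0.0)
    old = q.setdefault(state, {}).get(action, 0.0)
    q[state][action] = (1 - alpha) * old + alpha * (reward + gamma * best_next)

q = {}  # Q-table starts as an all-zero (empty) table
q_update(q, 's0', 'v0', reward=1.0, next_state='s1')
# With an empty table, the update reduces to alpha * reward.
```

Repeated updates propagate rewards backward along routes, which is why the Q-values in Figure 6 fluctuate early and then converge.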
where D_t defines the shortest-path (delay) term, E_t defines the highest-energy-path term, E_pr is the energy remaining in node N_t, and d_t is the distance between nodes.

Our Proposed Routing Algorithm
Different routing algorithms have been proposed in the literature to address Quality of Service (QoS), energy consumption and link heterogeneity in WSN. Well-known examples include E-TORA (energy-aware TORA), EBCRP (energy-balanced chain-cluster routing protocol), HGMR (hierarchical geographic multicast routing), EADAT (energy-aware data aggregation protocol) and many more, as listed in Table 1. Due to attributes such as dynamic topology, distributed nature, high density and resource constraints of the 5G-based smart healthcare network, these cluster-based routing protocols are not suitable for our setting: they do not consider transmission delay, energy consumption, and unbalanced energy dissipation. To address these issues, we use a Q-learning-based algorithm. The network is divided into three grids (i.e., pico, femto and macro). In the learning phase, the grids periodically exchange information (i.e., the distance between nodes and the nodes' remaining energy). The value of each node state and the topological relations between nodes are stored in a Q-table, and the agent makes next-hop decisions based on the Q-table. In this paper, we classify hops based on Energy hop (E_hop) and Distance hop (D_hop), which improves Quality of Service (QoS) and reduces energy consumption by maintaining an energy balance between nodes. Furthermore, these values can vary considerably in range, so it is critical to scale them into a common interval. A feature Y can be scaled into the range [0, 1] as Y' = (Y − Y_min)/(Y_max − Y_min). In the Q-learning-based routing protocol, Q_t and R_(t+1) at each hop can be calculated with Equations (22) and (23), which are associated with E_hop and D_hop. At this point, when the agent changes state from s_t to s_(t+1), data are transmitted from N_t to N_(t+1).
Additionally, β (0 ≤ β ≤ 1) is a regulatory factor that reflects the relative importance of delay and energy consumption. When β is set to 1, the algorithm emphasizes decreasing transmission delay; when β is set to 0, the algorithm tries to balance the remaining energy; otherwise, both transmission delay and remaining energy are considered simultaneously. In this way, the value of β can be altered according to the QoS requirements.
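A sketch of the scaling and reward computation follows. Since Equations (22) and (23) are not reproduced in the text, the exact mixing of the delay and energy terms below is an assumption, with β weighting delay against remaining energy as described:

```python
def min_max_scale(y, y_min, y_max):
    """Scale a feature into [0, 1]; degenerate ranges map to 0."""
    if y_max == y_min:
        return 0.0
    return (y - y_min) / (y_max - y_min)

def reward(d_hop, e_hop, beta=0.9, d_min=0.0, d_max=100.0,
           e_min=0.0, e_max=1.0):
    """Illustrative per-hop reward (an assumed form, not Equations (22)-(23)):
    beta weighs the delay term against the energy term. Short hops and high
    remaining energy score well; beta = 1 ignores energy, beta = 0 ignores delay."""
    d_term = 1.0 - min_max_scale(d_hop, d_min, d_max)  # shorter is better
    e_term = min_max_scale(e_hop, e_min, e_max)        # more energy is better
    return beta * d_term + (1.0 - beta) * e_term
```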
By combining Equations (18), (22) and (23), the routing update rule is obtained as Equation (24). The sink is the end node of the network and its information collection unit; since every node's route terminates at the sink (base station), the sink has no further next-hop node, and thus maxQ_t(s_(t+1), v_t) = 0. In this situation, the update formula simplifies accordingly. Algorithm 2 presents a detailed explanation of the proposed routing algorithm. Each node keeps information about the network and a Q-value. Before information gathering, the base station assigns tasks by broadcasting. All nodes in the network update their Q-values according to Equation (24). Based on the updated Q-value table, the path with the highest reward is selected in Algorithm 3. If next-hop nodes are available for a node, implying that the node is in the vicinity of the base station, the node selects the available next-hop node with the highest Q-value. Otherwise, the node selects the node with the highest energy among all nodes within its reachable range as the next hop. Figure 6 shows the Q-value fluctuation over several rounds: the Q-value fluctuates in the initial rounds but converges after a certain number of rounds. Figure 7 shows that link disconnection is lower at a learning rate of α = 1.0. This is due to higher network density, because more nodes in the network increase the number of hops.
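The next-hop rule described for Algorithm 3 can be sketched as follows (the data structures are illustrative):

```python
def next_hop(q_row, neighbors, energy):
    """Choose the next hop as described for Algorithm 3: the reachable
    neighbor with the highest Q-value, falling back to the neighbor with
    the most remaining energy when no Q-values have been learned yet."""
    scored = [(n, q_row[n]) for n in neighbors if n in q_row]
    if scored:
        return max(scored, key=lambda nv: nv[1])[0]
    return max(neighbors, key=lambda n: energy[n])

q_row = {'b': 0.4, 'c': 0.7}                 # learned Q-values for this node
energy = {'b': 0.9, 'c': 0.2, 'd': 0.95}     # remaining energy per neighbor
hop = next_hop(q_row, ['b', 'c', 'd'], energy)   # highest Q-value wins
cold = next_hop({}, ['b', 'c', 'd'], energy)     # no Q yet: highest energy wins
```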

Time Complexity of Proposed Algorithms
In this section, an analysis of the time complexity of the proposed algorithms is presented.

Time Complexity of the Clustering Algorithm
The symbols used in the calculation are given below: N: number of nodes; n_1, n_2: nodes among N; U_i: node utility function; p(n_1): probability of node n_1 becoming cluster head (CH); count: total number of times a node has been cluster head (CH); C_k: number of optimal clusters. The time complexity can be calculated step by step and then combined to show the algorithm's overall complexity. Solution:
• The if-else statement in lines 18-23 checks count against N/C_k and assigns count ← N/C_k; this takes constant time, since it is independent of the input size. We denote this constant C_1.
• p ← (c_n, z) takes constant time, denoted C_2; it executes once using a pre-calculated value.
• The if-else statement in lines 28-43 repeats for the N nodes: the probability p(n_1) is evaluated n times and the probability p(n_2) is also evaluated n times, so this step takes n · n time.
• count = count + 1 executes n times.
• The if-else statement in lines 46-53 takes constant time, denoted C_4 and C_5.
From the step-by-step analysis above, the total time is T(n) = C_1 + C_2 + n · n + n + C_4 + C_5. Dropping the constants and the lower-order n term leaves n^2. Hence, the time complexity of the clustering algorithm is O(n^2).

Time Complexity of the Q-Learning Algorithm
The symbols used in the calculation are given below: Q-table: zeros matrix; d_t: distance between nodes; R_com: maximum communication distance; E_pr: remaining energy of the nodes. The time complexity of the algorithm can be determined incrementally and then added together to determine the overall complexity. Solution:
• The for-loop in lines 10-13 takes n steps to compute the next-hop node, and lines 14-20 take n steps to check the available next-hop nodes; due to the inner loop, this takes n · n time.
• Calculating the reward and the Q-value takes constant time, denoted C_1.
• Changing state s_t ← s_(t+1) takes constant time.
From the above, the total time is n + n · n + n + C_1.
Dropping the constants and the lower-order n terms leaves n^2. Hence, the time complexity of the Q-learning algorithm is O(n^2).

Time Complexity of the Best Path Selection Algorithm
The symbols used in the calculation are given below: Q-table: zeros matrix; E_pr: remaining energy of the nodes; d_t: distance between nodes. The time complexity can be calculated step by step and then combined to show the algorithm's overall complexity. Solution:
• The for-loop in lines 7-9 takes n time.
• The while-loop in lines 10-16 takes n time.
• The inner loop in lines 7-21 takes n · n time.
Summing gives n + n + n · n; dropping the lower-order n terms leaves n^2. Hence, the time complexity of the best path selection algorithm is O(n^2).

Results and Discussion
We used MATLAB to test the performance of the proposed algorithm and obtain simulation results.

Evaluation Metrics
We evaluated the proposed algorithm in terms of throughput, residual energy, packet delivery ratio, end-to-end delay and network lifetime, at different mobility speeds, and compared it with existing protocols. We present two schemes: CRP-GR (i.e., based on game theory and reinforcement learning) and CRP-G (i.e., based on game theory). The results were generated at a learning rate α = 0.9 and regulatory factor β = 0.9. Table 3 shows the simulation parameters. The throughput is the sum of the packets Pkt_(received) delivered to the base station in the period T_(period), where the packet size Pkt_(size) is in bits. Equation (26) shows the numerical form of the throughput: Throughput = (Pkt_(received) × Pkt_(size)) / T_(period).

Residual Energy
Residual energy is the average energy consumed, ENG_cons, at each round by a node divided by the total available energy, ENG_available. Equation (27) shows the numerical form of the residual energy.
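Equation (27) does not survive in this extract; taking the definition above literally (the left-hand symbol ENG_res is an assumption), the ratio would read as follows:

```latex
ENG_{res} = \frac{ENG_{cons}}{ENG_{available}}
```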

Packet Delivery Ratio
The packet delivery ratio is the proportion of data packets received, Pkt_received, by the base station to the sum of data packets sent, Pkt_send. Equation (28) shows the numerical form of the packet delivery ratio.
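Equation (28) is likewise missing from this extract; from the definition above it can be reconstructed as (a hedged reconstruction, not the authors' exact equation):

```latex
PDR = \frac{\sum Pkt_{received}}{\sum Pkt_{send}}
```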

Average End-to-End Delay
End-to-end delay can be defined as the average time between a packet leaving its source and its receipt at the corresponding destination. Equation (29) shows the average end-to-end delay of data packets per round.
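Equation (29) is not reproduced here; under the assumption that T_send^(i) and T_recv^(i) denote the send and receipt times of packet i and N the number of delivered packets per round (symbols not in the original), the average delay would take the usual form:

```latex
Delay_{avg} = \frac{1}{N}\sum_{i=1}^{N}\left(T_{recv}^{(i)} - T_{send}^{(i)}\right)
```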

Network Lifetime
Network lifetime is the number of alive nodes in the network during simulation time or after a specified scenario comes to an end.

Results
The proposed schemes are evaluated and compared with FBCFP, TEEN and LEACH in terms of throughput, packet delivery ratio, residual energy, average end-to-end delay and network lifetime at different mobility speeds. The TEEN and LEACH protocols are selected for comparison because they are standard routing protocols on which many later protocols, such as FBCFP [38], are based.

Throughput
Every cluster group in the proposed scheme has a cluster head selected on the basis of various factors (i.e., node mobility, the distance between node and base station, and residual energy). Figure 8a,b shows throughput versus the number of rounds at different mobility speeds of the nodes. It can be observed from the graphs that the throughput decreases as the mobility speed of the nodes increases. This is due to nodes dropping out of their clusters, which leads to frequent link disconnections. Furthermore, our proposed schemes CRP-GR and CRP-G provide better throughput than FBCFP, TEEN and LEACH. The throughput of our proposed schemes is better due to optimum cluster head selection and best next-hopping-node selection with reinforcement learning; the resulting reduction in packet loss also improves throughput. The average performances of the proposed schemes CRP-GR and CRP-G with respect to throughput are 78% and 71%, while those of the existing schemes FBCFP, TEEN and LEACH are 63%, 48% and 35%, respectively.

Packet Delivery Ratio
Our proposed schemes CRP-GR and CRP-G have the best packet delivery ratio compared to FBCFP, TEEN and LEACH. Figure 9a,b shows the packet delivery ratio versus the number of rounds at different mobility speeds of the nodes. The packet delivery ratio decreases as the number of nodes increases because the available bandwidth of the network is fixed. The delivery ratio of FBCFP, TEEN and LEACH is lower because they select repeated routes for data transmission from source to destination. Our proposed schemes perform better than the other schemes owing to the selection of the optimal cluster head (CH) for every cluster and better node distribution, which makes data transmission between the cluster head (CH) and base station fluent. The average performances of the proposed schemes CRP-GR and CRP-G with respect to the packet delivery ratio were 67% and 49%, while the existing schemes FBCFP, TEEN and LEACH had average performances of 42%, 25% and 19%, respectively.

Residual Energy
Our proposed scheme avoids the random selection of cluster heads (CHs) and their uneven distribution in the network. Randomly selecting and distributing cluster heads (CHs) in a distributed and centralized network can lead to long-distance transmission and higher energy consumption. Therefore, our schemes decrease energy consumption compared to FBCFP, TEEN and LEACH, as shown in Figure 10a,b at different mobility speeds of the nodes. In our proposed schemes, cluster head (CH) selection is based on game theory, considering various factors (i.e., minimum distance, mobility and energy of nodes). Because of this, energy dissipation during communication is reduced, and nodes in the network consume less energy. FBCFP, TEEN and LEACH spend more of their energy on cluster head (CH) selection at each round, data transmission and the monitoring of cluster members. The average performances of the proposed schemes CRP-GR and CRP-G with respect to residual energy were 69% and 52%, while the existing schemes FBCFP, TEEN and LEACH had average performances of 37%, 24% and 19%, respectively.

Average End-to-End Delay
Our proposed scheme has the best CH selection, which avoids communication overhead and decreases the transmission of redundant information to cluster heads (CHs). The average end-to-end delay of our proposed schemes is minimal compared to FBCFP, TEEN and LEACH, as shown in Figure 11a,b at different mobility speeds of the nodes. This is because of the early identification of the next hopping node and the use of the remaining energy of nodes for better route selection. In the other schemes, the cluster head (CH) selects the same route for data transmission in each round, leading to more congestion and delay. The proposed schemes CRP-GR and CRP-G decrease the average end-to-end delay by 9% and 5%, respectively, as compared to FBCFP, by 18% and 11% as compared to TEEN, and by 25% and 17% as compared to LEACH.
Figure 11. End-to-end delay vs. number of rounds.

Network Lifetime
The early identification and consideration of various factors (i.e., remaining energy, distance, next hopping node and node mobility) in our proposed schemes result in the optimal cluster head selection for each cluster and better data transmission, thereby increasing the network lifetime. This reduces clustering overhead, helping the network nodes stay alive longer and stabilizing the network. Our schemes remove extra duties from cluster heads, improve energy efficiency and decrease the number of dead nodes. The network lifetime of our proposed schemes is better than that of FBCFP, TEEN and LEACH, as shown in Figure 12a,b at different mobility speeds. The average performances of the proposed schemes CRP-GR and CRP-G with respect to network lifetime were 84% and 71%, while the existing schemes FBCFP, TEEN and LEACH had average performances of 58%, 45% and 39%, respectively.

Discussion
Previous works presented different techniques for cluster head (CH) selection and for routing paths for transmission. Different protocols, such as FBCFP, TEEN and LEACH, considered various factors (e.g., minimum distance, available connections between nodes and residual energy) to prolong the lifetime of the network. However, due to more processing and the disconnection of links, these protocols consume more energy, decreasing the network lifetime and packet delivery ratio. In our schemes, the early identification of the next hopping node with minimum distance and high energy level using reinforcement learning, together with optimum cluster head (CH) selection using game theory, improves network lifetime and packet delivery ratio while decreasing energy consumption and end-to-end delay.
In our proposed schemes, each cluster head (CH) selection is based on the distance between the node and base station, node mobility, link connections with other nodes, and residual energy. Optimum cluster head (CH) selection with game theory and best route path selection with reinforcement learning (RL) were used to prevent communication overhead and allow the smooth transmission of data between cluster heads (CHs) and the base station. In terms of throughput, residual energy, packet delivery ratio, end-to-end delay and network lifetime, our scheme outperforms the FBCFP, TEEN and LEACH schemes. This is because our scheme selects the optimal cluster head (CH) and the optimal route path for data transmission from nodes to the cluster head (CH) and from the cluster head to the base station. In contrast, random cluster head (CH) and route path selection requires more calculation and a longer transmission range, leading to energy wastage, which is prevented in our proposed schemes. The proposed scheme therefore minimizes the energy consumption of the nodes in the network: energy dissipation is reduced and the nodes' energy is conserved.
Figure 12. Network lifetime vs. number of rounds.

Conclusions
In heterogeneous 5G-based smart healthcare, the clustering-based routing protocol plays a vital role in transmitting data from the source to the base station without delay. This paper proposed a clustering-based routing protocol based on game theory and reinforcement learning (i.e., Q-learning) for the heterogeneous 5G-based smart healthcare network. The cluster head (CH) selection probability is calculated from different attributes, such as the distance between nodes, the distance between nodes and the base station, mobility speed and remaining energy, using a symmetric game with mixed strategies. The multipath routing based on Q-learning identifies energy-efficient paths and distances with the help of the derived iterative formula for the Q-table. The simulation results show that our proposed clustering-based routing protocol improves the QoS and energy optimization of the network compared to the existing schemes, i.e., FBCFP, TEEN and LEACH. Furthermore, network performance can be tuned using the learning rate α and regulatory factor β; choosing appropriate values of α and β improves network lifetime and reduces end-to-end delay for realistic demands. In the future, we will extend our research to investigate how to expand the network to adopt medical healthcare sensors and construct a flexible network that can deal with actual medical data in emergencies.

Institutional Review Board Statement: The study was conducted according to the guidelines of Sunway University, Malaysia.