A Q-Learning and Fuzzy Logic-Based Hierarchical Routing Scheme in the Intelligent Transportation System for Smart Cities

Abstract: A vehicular ad hoc network (VANET) is a major element of the intelligent transportation system (ITS). The purpose of ITS is to increase road safety and manage the movement of vehicles, and it is known as one of the main components of smart cities. As a result, these networks face critical challenges such as routing. Recently, many scholars have addressed this challenge in VANETs by using machine learning techniques to learn the routing procedure adaptively and independently. In this paper, a Q-learning and fuzzy logic-based hierarchical routing protocol (QFHR) is proposed for VANETs. This hierarchical routing technique consists of three main phases: identifying traffic conditions, routing at the intersection level, and routing at the road level. In the first phase, each roadside unit (RSU) stores a traffic table, which includes information about the traffic conditions of the four road sections connected to the corresponding intersection. Then, RSUs use a Q-learning-based routing method to discover the best path between different intersections. Finally, vehicles in each road section use a fuzzy logic-based routing technique to choose the most suitable relay node. QFHR has been simulated in network simulator version 2 (NS2), and its results have been compared with IRQ, IV2XQ, QGrid, and GPSR in two scenarios. The first scenario analyzes the results based on the packet sending rate (PSR). In this scenario, QFHR improves the packet delivery rate (PDR) by 2.74%, 6.67%, 22.35%, and 29.98%, decreases delay by 16.19%, 22.82%, 34.15%, and 59.51%, and lowers the number of hops by 6.74%, 20.09%, 2.68%, and 12.22% compared to IRQ, IV2XQ, QGrid, and GPSR, respectively. However, it increases the overhead by approximately 9.36% and 11.34% compared to IRQ and IV2XQ, respectively. Moreover, the second scenario evaluates the results with regard to the signal transmission radius (STR).
In this scenario, QFHR increases PDR by 3.45%, 8%, 23.29%, and 26.17% and decreases delay by 19.86%, 34.26%, 44.09%, and 68.39% and reduces the number of hops by 14.13%, 32.58%, 7.71%, and 21.39% compared to IRQ, IV2XQ, QGrid, and GPSR, respectively. However, it has higher overhead than IRQ (11.26%) and IV2XQ (25%).


Introduction
There is a great demand for vehicular ad hoc networks, in which vehicles communicate with each other without any infrastructure. The purpose of these networks is to reduce delays in traffic flow and improve driving quality. They have attracted the attention of universities and industries because they can improve road safety and provide many services to drivers and travelers. VANET is the core of an intelligent transportation system.

In this paper, a Q-learning and fuzzy logic-based hierarchical routing algorithm (QFHR) is proposed for VANETs. The method focuses on reducing delay in the routing process. QFHR utilizes a hierarchical routing technique: RSUs use a Q-learning-based routing operation to discover paths between different intersections in the urban environment. In existing Q-learning-based routing algorithms, researchers often consider vehicles as the state space in Q-learning; therefore, as the density of vehicles increases, the state space of the routing algorithm grows dramatically, which negatively affects its convergence speed. QFHR instead considers intersections as the state set to manage the convergence speed of the routing algorithm. On the other hand, vehicles in each road segment use a fuzzy logic-based greedy routing technique to choose the most suitable next-hop node. The main contributions of our work are as follows:
• In QFHR, a traffic detection algorithm is presented for identifying the traffic status of the four road sections connected to each intersection. This algorithm provides fresh traffic information for the Q-learning-based routing process and informs RSUs of the traffic status in the network at any moment.
• In QFHR, a Q-learning-based routing scheme called the intersection-to-intersection (I2I) routing algorithm is designed in accordance with a distributed strategy to obtain the best route between different intersections using traffic information. Moreover, the I2I routing algorithm manages network congestion and can quickly discover and replace congested paths.
• In QFHR, a greedy routing technique is designed for vehicles to find the best route in each road section. This algorithm addresses the local optimum issue using a fuzzy path recovery algorithm.
In the following, the paper is organized as follows: Section 2 reviews the related works. Section 3 explains reinforcement learning, especially Q-learning, and fuzzy logic, because the proposed method utilizes these techniques to design the routing process. Section 4 introduces the network model applied in QFHR. Section 5 describes QFHR in VANETs. Section 6 evaluates the performance of QFHR based on packet delivery rate, end-to-end delay, hop count, and routing overhead. Ultimately, Section 7 presents the conclusions of our paper.

Related Works
Sun et al. [21] have presented a position-based Q-learning routing (PbQR) scheme for VANETs. In this method, reliability and link stability are considered to choose relay nodes in the data transmission operation. PbQR regards vehicles as the state set in the learning algorithm. In this scheme, Hello messages are periodically exchanged between neighboring nodes to share their information. PbQR applies Q-learning in the routing process; thus, the agent always takes a greedy action, meaning that it always chooses the action with the highest value in its Q-table. PbQR evaluates link quality based on two factors, namely stability and continuity, to choose the next-hop node. The continuity factor is defined based on node degree. In this method, the reward function is equal to the sum of these factors in each node. PbQR defines a distance factor to specify the distance from the source to the destination, and the discount factor in Q-learning is implemented based on this distance factor.
Roh et al. [22] have offered the Q-learning-based load balancing routing (Q-LBR) protocol in VANETs. It is a UAV-assisted routing protocol; thus, it provides line-of-sight (LOS) communications for ground vehicles. Q-LBR utilizes three mechanisms to balance the load in the network. In the first mechanism, the authors have suggested an optimized load estimation for ground vehicles. In this technique, ground vehicles disseminate hello messages to transfer their buffer queue information to UAVs. In the second mechanism, Q-learning is used to establish communication paths in a load-balancing manner. To this end, it defines a new concept called the UAV routing policy area (URPA). Ultimately, the authors define a reward function that accelerates the convergence speed of the learning model. This approach introduces different packets for three types of services, namely emergency, real-time, and connection-oriented. These messages have various priorities, including high, medium, and low. Q-LBR involves two sections. In the first section, UAVs overhear broadcast messages to collect the congestion conditions of the ground vehicles and then detect the congestion level in the ground network. URPA information is broadcast in the second section. Q-LBR discovers paths similarly to AODV and DSR. It supports multipath routing; thus, it can manage the number of routing messages exchanged between nodes in the network.
Bi et al. [23] have introduced the reinforcement learning-based routing protocol in clustered networks (RLRC) for electric vehicles. RLRC utilizes a clustering process to divide the network into several clusters. RLRC uses an enhanced K-Harmonic Means (KHM) for the clustering process and considers two factors, namely the energy of vehicles and bandwidth when selecting the best cluster head (CH). KHM is another version of the K-Means clustering method. This scheme considers the harmonic mean as an alternative option instead of the minimum value. In the first step, it calculates partial derivatives to achieve the best position for the centroid. In each iteration, the algorithm improves the centroid. The clustering algorithm considers the relative distance to select CHs based on the least distance to the neighbors. Non-CH nodes calculate the distance between themselves and CHs and join the nearest CH. To reduce learning time, RLRC utilizes the state-action-reward-state-action (SARSA) algorithm to enhance the routing process. It regards the clustered network as the learning environment, and CHs play the agent role. In RLRC, Hello messages are disseminated in the network to refresh Q-values. In the learning algorithm, the reward function is defined with regard to the next-hop link state. It considers three scales, including hop count, link condition, and available bandwidth for calculating Q-values.
Yang et al. [24] have offered the heuristic Q-learning-based VANET routing (HQVR) protocol. It selects the intermediate vehicles based on link reliability. This scheme uses a distributed manner to implement the learning process based on the information extracted from beacon messages. However, the authors have not taken into account the road width. In this method, the convergence speed of the Q-learning algorithm is dependent on the beacon packet rate. HQVR considers the link lifetime as the learning rate to determine the convergence speed of the learning protocol. HQVR has a route discovery strategy, which depends on delay information. When a node compares the new path and the old path in terms of delay and realizes that the new path requires less time than the previous path, it switches to the new path. Feedback messages find several routes to the destination. As a result, the source vehicle chooses the best route among different paths.
Wu et al. [25] have offered the Q-learning-based VANET delay-tolerant routing protocol (QVDRP) for VANETs. This method uses several gateways to transfer data from source to destination. In this method, RSUs play the gateway role to communicate with the cloud server. In QVDRP, vehicles disseminate their generated data to RSUs. This reduces delay and maximizes the packet delivery rate when transferring data from one node to another. In QVDRP, the network is regarded as the learning system, and vehicles play the agent role. In the learning operation, data is exchanged between nodes to select the next action (i.e., the next-hop node). Each vehicle stores a Q-table, which contains the Q-values of other vehicles. In the learning operation, vehicles disseminate hello messages to refresh the Q-table. If the transmitter vehicle and the destination vehicle form a direct connection with each other, this link receives a positive reward. If the previous-hop node receives a message from another node after a threshold time, it obtains a discounted reward. Otherwise, its default reward is adjusted to 0.75. QVDRP predicts the input and output directions in each road segment to obtain a collision probability and reduce duplicated packets.
Karp and Kung in [26] have designed the greedy perimeter stateless routing (GPSR) protocol for ad hoc networks. The greedy routing strategy utilizes the local information of single-hop neighboring vehicles to deliver data packets through the node nearest to the destination. GPSR is a position-based routing approach that merges greedy and perimeter strategies. It refreshes the neighbor table by periodically exchanging beacon messages, which adds routing overhead, although its overall routing overhead and delay remain acceptable. GPSR faces some limitations in the routing process in VANETs because it ignores parameters such as delay, node speed, and movement direction.
Li et al. in [27] have suggested the Q-learning and grid-based routing protocol (QGrid) for VANETs. This protocol divides the network environment into several grids. Then, the Q-learning algorithm learns the traffic flow to select the optimal grid based on Q-values. In each grid, the relay node is selected based on two techniques, namely the greedy method and the second-order Markov chain prediction technique. In QGrid, the packet delivery issue from a vehicle to a fixed destination is investigated. In the routing process between different grids, a set of grids with the maximum Q-value is chosen. QGrid suggests no mechanism for controlling network congestion when the number of packets in the network increases. In addition, intersections and buildings may disrupt the data transfer process in urban areas; however, this issue has not been addressed in QGrid. Moreover, this method designs an off-line Q-table, which is fixed throughout the simulation process. Thus, QGrid cannot control the network load.
Lou et al. in [28] have suggested the intersection-based V2X routing via Q-learning (IV2XQ) for VANET. This hierarchical routing scheme uses a Q-learning-based routing algorithm at the intersection level to select the best routes between intersections. In IV2XQ, road intersections are regarded as the state space, and the road segments are considered as the action space. Thus, IV2XQ reduces the number of states in the Q-learning algorithm and improves its convergence speed. Moreover, vehicles use a greedy technique to choose the relay node based on the positions of vehicles in the road segments. In addition, one important task of RSUs is to monitor the network status to control congestion in the network. IV2XQ does not use any control packet for finding routes. Thus, it has low routing overhead and delay because IV2XQ only uses historical traffic information in the reinforcement learning-based routing process. However, it is very important to consider new information in the routing decisions.
Khan et al. in [29] have presented an intersection-based routing method using Q-learning (IRQ) in VANETs. This scheme defines two views, global and local, and proposes a traffic dissemination mechanism to create them. This mechanism helps IRQ update traffic information and provide a fresh global view for the network server. The global view is used for designing a Q-learning-based routing technique. This RL-based routing method finds the best routes between intersections and is executed by the central server. In the Q-learning algorithm, the discount factor is determined dynamically based on the distance and vehicle density in each road section. Thus, IRQ is compatible with the dynamic environment of VANETs. IRQ introduces an evaluation mechanism to detect and penalize congested paths, which improves the packet delivery rate. The local view is used for designing a greedy routing strategy on each road segment to discover the best next-hop node. It considers the vehicle status, including location, distance, connection time, and delay. Table 1 briefly expresses the strengths and weaknesses of the related works.

Table 1. Strengths and weaknesses of the related works.

IV2XQ [28]
- Strengths: determining the discount factor with regard to the density and distance of vehicles on the road section; reducing communication overhead; designing a congestion control mechanism; appropriate convergence speed; reducing the number of states in the Q-learning algorithm.
- Weaknesses: not considering factors such as speed, movement direction, and link lifetime in the routing process; not relying on new traffic information in the network; introducing a centralized reinforcement learning-based routing algorithm.

IRQ [29]
- Strengths: adjusting the discount factor with regard to the vehicle density and distance; high PDR; reducing latency in the routing process; presenting an evaluation mechanism for controlling the congested paths; suitable convergence speed; reducing the number of states in the Q-learning algorithm; considering new traffic information in the network.
- Weaknesses: high routing overhead; introducing a centralized reinforcement learning-based routing algorithm.

Base Concepts
In this section, two techniques, namely the Q-learning algorithm and fuzzy logic, are briefly explained because QFHR uses them to find the best route in the data transmission process in VANETs.

Reinforcement Learning
In artificial intelligence (AI), reinforcement learning (RL) is an important and useful tool that defines two main components: the agent and the environment. The agent chooses an action to interact with the environment, and the aim of this interaction is to achieve an optimal solution for a certain issue. RL supports a specific framework called the Markov decision process (MDP) for solving optimization issues [30,31]. MDP manages the optimization problem in a stochastic manner and is defined by four parameters (S, A, P, r). The first parameter, S, indicates the finite state space. The second parameter, A, is the finite action space. The third parameter, P, represents the transition function, which determines the next state s′ after performing the action a in the current state s. The last parameter, r, indicates the reward function of the learning problem. This function determines the reward obtained from the learning environment after the agent takes the action a_t in the state s_t, where t indicates time. Figure 2 displays the reinforcement learning process. The agent explores the environment by taking the action a_t in the current state s_t. Then, the environment calculates the reward value (r) and the next state s_{t+1} by considering the performed action and the current state. The purpose of this learning framework is to reach the most suitable policy π and achieve the maximum reward. The long-term purpose of the agent is to maximize the expected discounted reward, i.e., max Σ_{t=0}^{T} δ^t r_t(s_t, π(s_t)). Here, δ is the discount factor, which indicates the effect of future rewards on the Q-value and is adjustable in [0, 1]. If δ is close to 1, future rewards weigh almost as much as immediate ones; in contrast, if δ = 0, the agent considers only the last immediate reward received from the environment [30,31].
After determining reward values and transition probabilities, the Q-value is obtained from the Bellman equation presented in Equation (1):

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + δ · max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]   (1)

Figure 2. Reinforcement learning process.
In the Bellman equation, α is the learning rate, adjustable in [0, 1]. It determines how strongly the agent weighs new information against old information: if α = 0, the agent does not learn any new knowledge, and if α = 1, the agent only regards the most recent information.
The most popular RL technique is Q-learning. In this technique, the agent explores an unknown environment using trial and error. The agent stores a Q-table, which includes the state-action pairs and their corresponding Q-values. The Q-learning algorithm must maximize the Q-value by adjusting the action selection process in accordance with the reward received from the environment and evaluating the selected action in the current state. In each iteration, the Q-learning algorithm refreshes Q-values according to Equation (1). The agent then exploits the environment by selecting actions with the maximum Q-value. This scheme is called ε-greedy; it defines a probability value ε with which the agent explores instead of exploiting [32,33].
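As a concrete illustration, the update in Equation (1) and the ε-greedy action selection can be sketched as follows. The tiny three-state chain environment, the reward of 1 at the goal, and the parameter values are illustrative assumptions, not part of QFHR:

```python
import random

def epsilon_greedy(q, state, actions, epsilon):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])

def train(episodes=500, alpha=0.1, delta=0.9, epsilon=0.1):
    # Toy chain: states 0 -> 1 -> 2, actions move left (-1) or right (+1),
    # and reaching state 2 (the goal) yields reward 1.
    states, actions, goal = [0, 1, 2], [-1, +1], 2
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = epsilon_greedy(q, s, actions, epsilon)
            s_next = min(max(s + a, 0), goal)
            r = 1.0 if s_next == goal else 0.0
            # Bellman update from Equation (1)
            q[(s, a)] += alpha * (r + delta * max(q[(s_next, b)] for b in actions)
                                  - q[(s, a)])
            s = s_next
    return q

random.seed(0)
q = train()
```

Because δ (here `delta`) is below 1, the learned Q-values rank the action that moves toward the rewarded state above the one that moves away from it.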

Fuzzy Logic
Researchers' studies show that real and complicated processes cannot be modeled, measured, or managed accurately because they include various uncertainties such as incomplete data, random data, noisy data, outliers, and data loss. A robust solution for this issue is a useful mathematical technique called fuzzy logic (FL), which can describe human thinking in an approximate manner. Zadeh first introduced this theory in 1965. Note that the classical set theory presents a precise and certain definition of membership: an element either belongs to a set or it does not. In contrast, the fuzzy theory emphasizes a novel concept called partial membership [34,35]. According to this concept, an element may partially belong to a set; thus, results are not absolutely right or wrong.
In the fuzzy theory, assume that X indicates a reference set that includes elements such as x. Then, Equation (2) defines the fuzzy set A on X:

A = μ_A(x_1)/x_1 + μ_A(x_2)/x_2 + ... + μ_A(x_n)/x_n   (2)
where μ_A : X → [0, 1] is the membership function (MF). It determines the membership degree (i.e., μ_A(x_i)) of each element x_i in A, and "/" is a symbol separating μ_A(x_i) from x_i. Note that the MF is a key component of fuzzy sets; for example, the triangular, trapezoidal, and Gaussian functions are the most common MFs. Today, various applications utilize fuzzy inference mechanisms (FIMs) to improve their performance. The most famous FIMs are Mamdani and Sugeno (TSK). Each fuzzy system has four main parts: the fuzzifier, the defuzzifier, the rule base, and the fuzzy engine.
The fuzzifier produces fuzzy inputs from crisp values and allocates a membership degree to each element using the defined membership functions. The fuzzy engine applies the fuzzy rules stored in the rule base to the fuzzy inputs. Finally, the results obtained from the fuzzy engine are converted to crisp values by the defuzzifier. The most common defuzzification techniques are averaging and centroid [35,36].
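The fuzzifier-rules-defuzzifier pipeline can be sketched as follows. The speed/risk variables, the triangular membership functions, and the two rules are invented for illustration, and the defuzzifier here is a simple weighted average of output singletons (Sugeno-style) rather than a full Mamdani centroid:

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def infer(speed):
    # Fuzzifier: membership degrees of the crisp input
    slow = triangular(speed, -40.0, 0.0, 60.0)
    fast = triangular(speed, 40.0, 100.0, 140.0)
    # Rule base: IF speed is slow THEN risk = 20; IF speed is fast THEN risk = 80
    # Defuzzifier: weighted average of the rule outputs
    denominator = slow + fast
    if denominator == 0.0:
        return 0.0
    return (slow * 20.0 + fast * 80.0) / denominator
```

An input fully covered by one fuzzy set maps to that rule's output (e.g., a speed of 0 yields 20), while intermediate inputs blend the two rules in proportion to their membership degrees.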

Network Model
In QFHR, the network environment consists of various intersections, which are connected to one another through two-way roads. Furthermore, QFHR defines both communication links, namely vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I). Network elements, such as road sections, intersections, roadside units (RSUs), and vehicles, have a unique ID. Figure 3 displays this network model. In the following, the tasks of vehicles and RSUs in the network are explained:
• Roadside units (RSUs): These components are located at intersections, and their task is to monitor the network and control congestion in each road section. RSUs store a traffic table to record the traffic status of the four road sections connected to the corresponding intersection. This table is periodically updated. Moreover, each RSU holds a Q-table produced by the Q-learning-based routing algorithm to select the best routes between different intersections on the network.
• Vehicles: Each vehicle periodically sends a hello message to its neighboring nodes. Vehicles maintain a neighbor table in their memory to store information about neighboring nodes. Additionally, they can obtain their position and speed at any time using a positioning system.

Proposed Method
In this paper, a Q-learning and fuzzy logic-based hierarchical routing method (QFHR) is proposed for vehicular ad hoc networks. At each intersection, RSUs use a Q-learning-based routing algorithm to obtain the most suitable path between different intersections on the network. Furthermore, vehicles use a fuzzy logic-based routing technique to find the next relay node in each road section. QFHR consists of three main phases:

• Identifying traffic conditions;
• Routing algorithm at the intersection level;
• Routing algorithm at the road section level.

Identifying Traffic Conditions
To identify traffic conditions in each road section, QFHR presents a traffic detection process in the network. Algorithm 1 offers pseudo-code for identifying traffic conditions in QFHR. In this process, each vehicle (Vehicle_i, where i = 1, 2, ..., N and N represents the number of vehicles) periodically disseminates a hello packet to its neighbors. In QFHR, the hello broadcast interval (τ) is equal to one second. The hello message includes the vehicle ID (ID_i), road ID (ID_R), queue status (Q_i), vehicle location (x_i, y_i), and vehicle speed (v_x,i, v_y,i). Note that the vehicle speed is assumed unchanged within the time frame τ and is updated after receiving each hello message. Vehicle_i builds a neighbor table (Table_neighbor) to record the information about neighboring nodes. This operation is represented in Figure 4. If Vehicle_i receives a new hello packet, it searches for the sender's ID in its neighbor table; if the ID is found, the corresponding entry is updated, otherwise a new entry is inserted. Each entry records, among other fields, the queue status of the neighbor, which is obtained from Equation (3):

Q_j = QL_j / max QL_j   (3)
where QL_j and max QL_j are the number of packets in the queue and the maximum queue length of Vehicle_j, respectively.
• Connection quality (CQ_ij): It represents the quality of the connection between Vehicle_i and Vehicle_j. It is measured based on two parameters, namely the connection time and the hello packet reception rate, because the connection time indicates the stability of the link between the two nodes, and the packet reception rate evaluates the quality of their connection. Calculating the connection quality is difficult because the nodes are mobile. In the proposed method, CQ_ij is obtained from the weighted mean of the connection time of Vehicle_i and Vehicle_j (i.e., λ_ij) and the hello packet reception rate (i.e., η_ij) based on Equation (4):

CQ_ij = ω_1 λ_ij + (1 − ω_1) η_ij,  j ∈ Neighbor_i   (4)

where Neighbor_i represents the neighbors of Vehicle_i, and ω_1 is a weight coefficient with 0 ≤ ω_1 ≤ 1.
In the following, we demonstrate how to obtain the connection time (λ ij ) and the hello reception rate (η ij ).

• Connection time (λ_ij): The connection time of Vehicle_i and Vehicle_j is evaluated by Equation (5):

λ_ij = min(T_max, (R − d_ij) / |ΔV_ij|)   (5)

where R is the communication radius of vehicles, d_ij = sqrt((x_i − x_j)² + (y_i − y_j)²) is the Euclidean distance between the two vehicles, and ΔV_ij = v_i − v_j expresses the relative speed of Vehicle_j with regard to Vehicle_i. Moreover, (x_i, y_i) and (x_j, y_j) represent the spatial coordinates of Vehicle_i and Vehicle_j, respectively, and v_i and v_j are their speeds. T_max is the maximum connection time, which is a fixed value determined based on the simulation time. To estimate the connection time accurately in the dynamic environment of VANET, two configuration coefficients are added, yielding Equation (6):

λ_ij = min(T_max, (R + α_1 α_2 d_ij) / |ΔV_ij|)   (6)

where α_1 and α_2 are the configuration coefficients, determined as follows:
• If Vehicle_i and Vehicle_j move in a similar direction (i.e., 0 ≤ Δθ_ij ≤ π/3, where Δθ_ij is the movement direction of Vehicle_j with regard to Vehicle_i), then α_2 = 1. Now, if Vehicle_i is ahead of Vehicle_j, then α_1 = 1; otherwise, α_1 = −1. See Figure 5.

• If Vehicle_i and Vehicle_j move in the opposite direction (i.e., 2π/3 ≤ Δθ_ij ≤ π), then α_2 = −1. Now, if Vehicle_i approaches Vehicle_j, then α_1 = −1; otherwise, α_1 = 1. See Figure 6. In addition, Δθ_ij is obtained from Equation (7):

Δθ_ij = arccos( (v_x,i · v_x,j + v_y,i · v_y,j) / (|v_i| · |v_j|) )   (7)

where (v_x,i, v_y,i) and (v_x,j, v_y,j) are the velocity vectors of Vehicle_i and Vehicle_j, respectively. Then, each vehicle updates the connection time (λ_ij) using the window mean with exponentially weighted moving average (WMEWMA) approach.
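The direction test built on Equation (7) can be sketched as follows. The function names and the return value 0 for the intermediate range (π/3 < Δθ_ij < 2π/3), which the text does not specify, are assumptions:

```python
import math

def delta_theta(v_i, v_j):
    """Angle between the velocity vectors v_i and v_j, in [0, pi] (Equation (7))."""
    dot = v_i[0] * v_j[0] + v_i[1] * v_j[1]
    norm = math.hypot(*v_i) * math.hypot(*v_j)
    # Clamp against floating-point drift before acos
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def alpha2(v_i, v_j):
    """+1 for roughly the same direction, -1 for opposite directions."""
    angle = delta_theta(v_i, v_j)
    if angle <= math.pi / 3:
        return 1       # similar direction
    if angle >= 2 * math.pi / 3:
        return -1      # opposite direction
    return 0           # intermediate case, not specified in the text
```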
WMEWMA is an EWMA-based estimator that approximates the average value using a linear combination of a limited, exponentially weighted history [37,38]. Equation (8) calculates this estimate recursively:

X̄_t = β · X̄_{t−1} + (1 − β) · X_t   (8)

where X_t is the value of X at the moment t and X̄_t is the estimate after t samples. Over a window of the last L_window samples, Equation (8) can be rewritten as Equation (9):

X̄_t = β · X̄_{t−1} + (1 − β) · (1/L_window) Σ_{k=t−L_window+1}^{t} X_k   (9)

WMEWMA has two control parameters, namely the window size (L_window) and the control coefficient β, with 0 ≤ β ≤ 1. Each window records the last L_window values. When β has a higher value, the estimate depends more on historical values, which makes it more stable. In contrast, if β = 0, the estimate depends only on the most recent window. This metric yields a stable estimate and is easy to calculate.
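A sketch of the WMEWMA estimator follows; the exact way the window mean feeds the recursive update is an assumption based on the description above (a larger β leans on history, a smaller β tracks recent samples):

```python
from collections import deque

class Wmewma:
    """Window mean with exponentially weighted moving average (Equations (8)-(9))."""

    def __init__(self, window=4, beta=0.6):
        self.window = deque(maxlen=window)  # keeps only the last L_window samples
        self.beta = beta                    # 0 <= beta <= 1
        self.estimate = None

    def update(self, x):
        self.window.append(x)
        window_mean = sum(self.window) / len(self.window)
        if self.estimate is None:
            self.estimate = window_mean     # seed with the first window mean
        else:
            # Fold the window mean into the running estimate
            self.estimate = self.beta * self.estimate + (1 - self.beta) * window_mean
        return self.estimate

w = Wmewma(window=2, beta=0.5)
first = w.update(1.0)    # first sample: estimate = window mean = 1.0
second = w.update(3.0)   # window mean = 2.0, estimate = 0.5*1.0 + 0.5*2.0 = 1.5
```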
• Hello packet reception rate (η_ij): The quality of the link between Vehicle_i and Vehicle_j is measured as the ratio of the hello packets received by Vehicle_i to all hello packets transmitted by Vehicle_j in the time interval Φ. Each node uses a counter to count the number of hello messages received from each neighbor. Moreover, because the hello broadcast interval equals the specified time frame τ, the number of hello messages sent by Vehicle_j in a certain interval Φ is known. Therefore, each vehicle can calculate the hello reception rate according to Equation (10):

η_ij = R_hello(i, j) / S_hello(i, j)   (10)

where R_hello(i, j) and S_hello(i, j) are the number of received hello messages and the number of sent hello messages, respectively. Then, each vehicle refreshes the hello reception rate (η_ij) with the WMEWMA method using Equation (9). The recorded values are listed in Table 3.
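Putting Equations (10) and (4) together, the per-link quality estimate can be sketched as follows. Normalizing the connection time by T_max before taking the weighted mean, and the default ω_1 = 0.5, are assumptions:

```python
def hello_reception_rate(received, phi, tau=1.0):
    """Equation (10): hellos received over hellos the neighbor sent in interval phi.

    One hello is sent every tau seconds, so the expected send count is phi / tau.
    """
    sent = phi / tau
    return min(1.0, received / sent)

def connection_quality(conn_time, eta, t_max, omega1=0.5):
    """Equation (4): weighted mean of (normalized) connection time and eta."""
    return omega1 * min(1.0, conn_time / t_max) + (1 - omega1) * eta

eta = hello_reception_rate(8, 10.0)          # 8 of 10 expected hellos arrived
cq = connection_quality(50.0, eta, 100.0)    # blend link lifetime and reception rate
```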

Forming a Traffic Table

Each RSU maintains a traffic table that records the traffic status of the four road sections connected to its intersection. Its fields include the following:
- Road length (L_R): This field is the length of the road section, calculated from Equation (11) as the Euclidean distance between the RSUs at its two ends:

L_R = sqrt((x_m − x_n)² + (y_m − y_n)²)   (11)

where (x_m, y_m) and (x_n, y_n) are the spatial coordinates of RSU_m and RSU_n, respectively.

- Average road time (T_R): This field represents the average time required to travel the road section. The RSU estimates T_R using Equation (12), which considers two scales, the road length (L_R) and the average speed of vehicles in this road section:

T_R = L_R / ((1/n_R) Σ_{i=1}^{n_R} V_i)   (12)

where V_i is the speed of each neighbor obtained from the neighbor table and n_R is the total number of vehicles on the road section. The RSU refreshes this parameter in the traffic table with the WMEWMA scheme using Equation (9).
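The estimate in Equation (12) can be sketched as follows; returning None for an empty road section is an assumption, since the text does not cover that case:

```python
def average_road_time(road_length, speeds):
    """Equation (12): T_R = L_R / mean(V_i) over the n_R vehicles on the road."""
    if not speeds:
        return None          # no vehicles observed on this road section
    mean_speed = sum(speeds) / len(speeds)
    return road_length / mean_speed

# A 300 m road section with vehicles at 10 and 20 m/s averages 15 m/s,
# so traversing it takes about 20 s.
t_r = average_road_time(300.0, [10.0, 20.0])
```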

- Average connection quality (CQ_R): This field represents the average connection quality of vehicles (i.e., CQ_ij) in each road section. It is calculated from Equation (14) and recorded in the traffic table:

CQ_R = (1/n_R) Σ CQ_ij   (14)

where n_R is the total number of vehicles in the road section R. The RSU updates CQ_R in the traffic table with the WMEWMA scheme using Equation (9).

Algorithm 1 carries out these updates as follows:

for each Vehicle_i do
    if Vehicle_i receives a hello message from Vehicle_j then
        compare the ID of Vehicle_j with the IDs recorded in Table_neighbor
        if ID_j is found in Table_neighbor then
            update ID_R, x_j, y_j, v_x,j, v_y,j, Q_j, CQ_ij, and VT_j in Table_neighbor
        else
            insert a new entry for Vehicle_j in Table_neighbor
for each RSU_k do
    if RSU_k receives a hello message from Vehicle_j then
        compare the ID of Vehicle_j with the IDs recorded in Table_neighbor
        if ID_j is found in Table_neighbor then
            update ID_R, x_j, y_j, v_x,j, v_y,j, Q_j, CQ_ij, and VT_j in Table_neighbor
    RSU_k calculates the road connection quality (CQ_R) using Equation (14)
    RSU_k updates L_R, T_R, D_R, Q_R, and CQ_R in the traffic table using Equation (9)

Routing Algorithm at Intersection Level
In this routing process, the source vehicle first calculates the positions of itself and the destination vehicle using GPS and a digital map. Then, it must specify two intersections, namely the source intersection and the destination intersection. Note that each road section is connected to two intersections; thus, the source vehicle should determine the source intersection among the two intersections before starting the routing process. The source vehicle selects the intersection that is closer to the destination as the source intersection (Intersect_S). Moreover, the destination intersection should be determined among the two intersections connected to the destination road section. The destination vehicle chooses the intersection that is closer to the source vehicle as the destination intersection (Intersect_D). Now, the source vehicle must send the data to Intersect_S using the routing algorithm at the road level described in Section 5.3. At this step, a Q-learning-based distributed routing protocol is introduced to obtain the most suitable route between various intersections using traffic information. This algorithm is also called the intersection-to-intersection (I2I) routing algorithm. Algorithm 2 describes the pseudo-code related to this routing process in QFHR. In the I2I routing model, each message is the agent, and the entire network is the learning environment. The agent must discover the learning environment by performing different actions and experiencing various states. The state space contains the set of intersections, namely I = {Intersect_1, Intersect_2, ..., Intersect_p}. The action set represents the four road sections connected to each intersection, as shown in Figure 7. After selecting a road section, the packets are sent from the current intersection (Intersect_i at step t) to the next intersection (Intersect_j at step t+1). Then, the environment gives a reward to the learning agent based on the reward function presented in Equation (15).
where D_R_current(l), CQ_R_current(l), Q_R_current(l), and T_R_current(l) are the density of vehicles, the average connection quality, the average congestion status, and the average road time in the current road section R_current, respectively. Based on the reward function, if the selected intersection is the desired intersection, meaning that the packet reaches the destination intersection, the corresponding road section achieves the maximum reward (R_max). On the other hand, when the next intersection is a local minimum, meaning that the selected intersection has the minimum distance from the destination compared to its neighboring intersections, it receives the minimum reward (R_min). In other conditions, the reward depends on the vehicle density, connection quality, congestion status, and road traveling time. QFHR estimates the discount factor according to the network conditions because if this parameter is a constant value, the routing algorithm cannot adapt to the dynamic network. In addition, QFHR empirically sets α to a default value (α = 0.1) according to [28]. Moreover, the discount factor is calculated using Equation (16) based on two parameters, namely the vehicle density and the distance to the destination: where D_R(l) is the vehicle density on the current road, and (x_intersect_t, y_intersect_t), (x_intersect_t+1, y_intersect_t+1), and (x_D, y_D) are the spatial coordinates of the current intersection (intersect_t), the next intersection (intersect_t+1), and the destination, respectively. From Equation (16), if the Euclidean distance between intersect_t+1 and the destination is less than that between intersect_t and the destination, and the corresponding road section includes a sufficient number of vehicles (i.e., a suitable vehicle density), δ has a large value.
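The qualitative behavior of the reward and discount factor described above can be sketched as follows. This is only an illustrative shape, not a reproduction of Equations (15) and (16): R_max, R_min, the metric weighting, and the 0.9 cap on δ are all assumptions.

```python
import math

R_MAX, R_MIN = 100.0, -100.0  # assumed reward bounds, not values from the paper

def reward(next_is_destination, next_is_local_minimum,
           density, connection_quality, congestion, road_time):
    """Illustrative shape of Equation (15): terminal cases return the
    extreme rewards; otherwise the four road metrics are combined
    (the combination below is an assumption)."""
    if next_is_destination:
        return R_MAX
    if next_is_local_minimum:
        return R_MIN
    # Higher density/quality and lower congestion/travel time -> higher reward
    return density * connection_quality * (1 - congestion) / (1 + road_time)

def discount_factor(density, pos_next, pos_curr, pos_dest):
    """Illustrative shape of Equation (16): delta grows when the next
    intersection is closer to the destination than the current one and
    the road section has a suitable vehicle density."""
    d_next = math.dist(pos_next, pos_dest)
    d_curr = math.dist(pos_curr, pos_dest)
    progress = max(0.0, (d_curr - d_next) / max(d_curr, 1e-9))
    return min(0.9, progress * min(1.0, density))
```

As in the text, δ is large only when the chosen road section both makes geographic progress toward the destination and carries enough vehicles to keep the path connected.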
The agent discovers the network environment by performing different actions and visiting various states. Thus, it calculates a Q-value for each state-action pair and maintains it in a Q-table to be used in routing decisions. After the I2I routing algorithm converges, the Q-table is stored in the memory of the RSUs at the intersections. They use this table to select the next intersection with the highest Q-value. Then, the RSU sends packets to the next intersection using the routing algorithm at the road level described in Section 5.3. This process continues until the packets reach the destination vehicle.
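The I2I learning step can be sketched as a standard Q-learning update with the adaptive discount factor δ. The class structure is illustrative; only α = 0.1 and ε = 0.2 come from the paper (Section on simulation parameters), and states/actions are simple string labels here.

```python
import random

class I2IRouter:
    """Minimal sketch of the I2I Q-learning step: states are intersections,
    actions are the road sections leaving them."""

    def __init__(self, alpha=0.1, epsilon=0.2):
        self.q = {}            # (intersection, road_section) -> Q-value
        self.alpha = alpha     # fixed learning rate from the paper
        self.epsilon = epsilon # exploration probability from the paper

    def update(self, state, action, r, delta, next_state, next_actions):
        """Q(s,a) <- Q(s,a) + alpha * (r + delta * max_a' Q(s',a') - Q(s,a)),
        where delta is the adaptive discount factor of Equation (16)."""
        best_next = max((self.q.get((next_state, a), 0.0) for a in next_actions),
                        default=0.0)
        old = self.q.get((state, action), 0.0)
        self.q[(state, action)] = old + self.alpha * (r + delta * best_next - old)

    def choose(self, state, actions):
        """Epsilon-greedy selection of the next road section."""
        if random.random() < self.epsilon:       # exploration
            return random.choice(actions)
        return max(actions, key=lambda a: self.q.get((state, a), 0.0))
```

After convergence, an RSU simply calls `choose` with ε effectively zero, i.e., it forwards toward the neighboring intersection with the highest Q-value, as described above.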

Routing Algorithm at the Road Level
This section describes the QFHR routing algorithm at road sections. As stated in Section 5.1, each vehicle shares its position, speed, and queue status with its neighboring vehicles through hello messages. This information is stored in their neighbor tables and used in the routing process. In QFHR, the routing algorithm at the road level includes three phases:
• Vehicle-to-vehicle (V2V) routing algorithm;
• Route recovery algorithm;
• Vehicle-to-infrastructure (V2I) routing algorithm.
In the following, each of these three phases is explained in detail.

Vehicle-to-Vehicle (V2V) Routing Algorithm
In QFHR, each vehicle utilizes a greedy routing technique to select a relay vehicle in the road section. Algorithm 3 shows the pseudo-code of the V2V routing scheme. If Vehicle_S wants to send a packet to Vehicle_D, it first determines the location of Vehicle_D. There are three modes:
• First mode: Vehicle_S and Vehicle_D move on the same road section. In this case, Vehicle_D is determined as the target point (P_Target).
• Second mode: Vehicle_S and Vehicle_D move in different road sections. Vehicle_S, which is the source vehicle in this case, considers Intersect_S as P_Target.
• Third mode: Vehicle_S and Vehicle_D move in different road sections. Vehicle_S, which is an intermediate vehicle in this case, considers the intersection obtained from Algorithm 2 (i.e., Intersect_j^{t+1}) as P_Target.
Now, Vehicle_S checks its neighbor table to determine whether P_Target is its neighbor. If P_Target is in the neighbor table, Vehicle_S sends its data packet to P_Target directly. Otherwise, Vehicle_S searches its neighbor table for the neighboring vehicle closest to P_Target and sends the data packets to that vehicle. If Vehicle_S fails to find a vehicle closer to P_Target than itself, it faces a local minimum problem. In this case, Vehicle_S enters the route recovery phase described in Section 5.3.2 to select the next-hop node.
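The greedy forwarding decision above can be sketched as follows. The neighbor-table shape (a dict of vehicle IDs to positions) is a hypothetical simplification of the table described in Section 5.1.

```python
import math

def select_next_hop(self_pos, target_pos, neighbors):
    """Greedy V2V forwarding sketch: pick the neighbor strictly closer to
    P_Target than the sender itself; return None to signal a local minimum,
    which triggers the fuzzy route recovery phase.
    `neighbors` maps a vehicle ID to its (x, y) position."""
    own_dist = math.dist(self_pos, target_pos)
    best_id, best_dist = None, own_dist
    for vid, pos in neighbors.items():
        d = math.dist(pos, target_pos)
        if d < best_dist:
            best_id, best_dist = vid, d
    return best_id  # None -> local minimum, go to route recovery
```

For example, a sender at (0, 0) forwarding toward a target at (10, 0) would pick a neighbor at (5, 0) over one at (2, 0), since the former is closer to the target.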

Route Recovery Algorithm
In this phase, a fuzzy logic-based route recovery algorithm is implemented to evaluate each neighboring vehicle's chance of being selected as the next-hop node. The fuzzy logic-based route recovery process involves the following parts:

Fuzzy Inputs
The fuzzy model includes three inputs:
• Distance to P_Target (d_i,target): Vehicle_S obtains the distance from a neighboring vehicle (such as Vehicle_i) to P_Target based on the information stored in the neighbor table, using Equation (17). In this process, the vehicle closest to the destination compared with the other neighbors gains a greater chance of being selected as the relay vehicle.
where (x_i, y_i) and (x_target, y_target) denote the spatial coordinates of Vehicle_i and P_Target, respectively. Moreover, L_R indicates the length of the road related to Vehicle_S, calculated using Equation (11). The membership function chart of d_i,target is presented in Figure 8a. It involves three states: low, medium, and high.
• Queue status (Q_i): Vehicle_S extracts Q_i for each neighboring vehicle (such as Vehicle_i) from the neighbor table. The purpose of this parameter is to lower the chance that vehicles with high traffic are selected as the relay vehicle. See the membership function chart of Q_i in Figure 8b. This input considers three states: low, medium, and high.
• Connection quality (CQ_i,prev−hop): Vehicle_S calculates CQ_i,prev−hop for each neighboring node (such as Vehicle_i) using Equation (4) so that the vehicle with the highest connection quality is selected as the relay vehicle. The membership function chart of CQ_i,prev−hop is represented in Figure 8c. This input involves three states: low, medium, and high.
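The three-state fuzzification of each input can be sketched with triangular/shoulder membership functions. The breakpoints below are assumptions for a normalized input in [0, 1]; the paper's Figure 8 defines the actual membership charts.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b over the support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(x):
    """Map a normalized input in [0, 1] to low/medium/high membership
    degrees (breakpoints are illustrative assumptions)."""
    return {
        "low":    max(0.0, 1.0 - 2.0 * x),   # left shoulder at 0
        "medium": tri(x, 0.0, 0.5, 1.0),     # triangle centered at 0.5
        "high":   max(0.0, 2.0 * x - 1.0),   # right shoulder at 1
    }
```

An input of 0.5 is fully "medium" and neither "low" nor "high"; intermediate values belong partially to two adjacent states, which is what lets the rule base blend competing criteria smoothly.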

Fuzzy Output
In the fuzzy route recovery process, the fuzzy output (i.e., S_i) determines the chance of each neighboring vehicle being selected as the relay vehicle. In the proposed fuzzy system, a vehicle that has a short distance to P_Target, low traffic, and a high connection quality compared with the other neighboring vehicles obtains a high chance of being selected as the relay vehicle. See the diagram of the fuzzy membership function related to S_i in Figure 9. This fuzzy output consists of seven states: extremely low, very low, low, medium, high, very high, and extremely high.

Rule Base

Table 4 expresses the fuzzy rules introduced in the fuzzy route recovery algorithm. For example, "Rule 1" is defined as follows: IF d_i,target is Low AND Q_i is Low AND CQ_i,prev−hop is Low THEN S_i is High.

Vehicle-to-Infrastructure (V2I) Routing Algorithm

In QFHR, when choosing the next-hop node in a road section, each vehicle uses a greedy routing technique to transfer packets toward the intersection area. Then, the RSU selects the next intersection by searching the Q-table calculated in the I2I routing algorithm described in Section 5.2. At this step, each RSU employs a greedy strategy to send packets to the related road section. In this case, there are two modes:
• First mode: Vehicle_D moves in this road section. In this case, Vehicle_D is determined as the target point (P_Target).
• Second mode: Vehicle_D does not move in this road section. In this case, the next intersection is considered as P_Target.
Now, the RSU checks its neighbor table to determine whether P_Target is its neighbor. If P_Target is in its neighbor table, the data packet is sent to P_Target directly. Otherwise, the RSU finds the vehicle closest to P_Target in its neighbor table and transfers the packet to it. If the RSU fails to find a vehicle to send the data packet to, it stores the packet until it discovers a next-hop vehicle.
The corresponding pseudo-code is as follows:

Begin
1: if Vehicle_D moves in this road section then
2:   P_Target = Vehicle_D;
3: else
4:   P_Target = Intersect_j^{t+1}, which is obtained using Algorithm 2;
5: end if
6: if RSU is a neighbor of P_Target then
7:   RSU: Send the data packet to P_Target directly;
8: else
9:   RSU: Find the closest relay node to P_Target from its Table_neighbor;
10:  RSU: Send the packet to the relay node;
11: end if
12: if RSU cannot find the closest relay node to P_Target then
13:  RSU: Carry the packet until it discovers an appropriate relay node;
14:  while the buffer queue of RSU is not empty do
15:    RSU: Check its Table_neighbor periodically;
16:    if RSU finds a neighbor as the relay node then
17:      RSU: Send the packet to the relay node;
18:    end if
19:  end while
20: end if
End
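Putting the route recovery pieces together — fuzzified inputs, a Table 4-style rule base, and the seven-state output — the selection score S_i can be sketched with a Mamdani-style evaluation. The crisp scores for the seven output states and the aggregation scheme are assumptions; the paper's Figure 9 and Table 4 define the actual membership functions and rules.

```python
# Assumed crisp scores for the seven output states of S_i
OUTPUT_SCORE = {"extremely low": 0.05, "very low": 0.2, "low": 0.35,
                "medium": 0.5, "high": 0.65, "very high": 0.8,
                "extremely high": 0.95}

def chance(d, q, cq, rules, fuzzify):
    """Mamdani-style sketch: AND = min over the three fuzzified inputs
    (distance, queue status, connection quality), then a weighted average
    over the fired rules as a simple defuzzification.
    Each rule is (d_state, q_state, cq_state, output_state)."""
    md, mq, mcq = fuzzify(d), fuzzify(q), fuzzify(cq)
    num = den = 0.0
    for (sd, sq, scq, out) in rules:
        w = min(md[sd], mq[sq], mcq[scq])  # rule firing strength
        num += w * OUTPUT_SCORE[out]
        den += w
    return num / den if den else 0.0
```

Vehicle_S would evaluate `chance` for every neighbor and hand the packet to the one with the highest S_i, resolving the local minimum that the plain greedy step could not.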

Simulation and Evaluation of Results
In this section, QFHR is implemented in Network Simulator version 2 (NS2) [39] to evaluate its performance. In the simulation, parameters such as the packet delivery rate, delay, hop count, and routing overhead are analyzed, and the results are compared with IRQ [29], IV2XQ [28], QGrid [27], and GPSR [26]. The network dimensions are 3 km × 3 km, with 38 two-way road sections and 24 intersections. There are 5-20 vehicles per kilometer. The network contains 450 vehicles whose speed is 14 m per second. It is assumed that the communication range of each vehicle varies between 250 and 300 m, while the communication range of each RSU is fixed at 300 m. The simulation time is 1000 s. Hello messages are broadcast at a fixed interval (1 s), and the packet sending rate is 1-6 packets per second. Moreover, the packet size is 512 bytes. In QFHR, the Q-learning algorithm has a fixed learning rate (α = 0.1) and uses an ε-greedy strategy in the exploration and exploitation processes, with ε = 0.2. Table 5 describes the simulation parameters.

Packet Delivery Rate

The packet delivery rate (PDR) is defined as the ratio of packets received by the destination nodes to all packets transferred by the source nodes. In the simulation process, two scenarios are used to evaluate the packet delivery rate. The purpose of the first scenario is to calculate PDR based on the packet sending rate (PSR). As shown in Figure 10, when PSR is high, PDR decreases in all methods, and vice versa, because the packet sending rate indicates the number of packets produced in the network per time unit. Therefore, a high PSR increases the network load. In this case, the buffer capacity of vehicles fills up quickly, and it is very likely that some data packets will be lost due to high network congestion. Thus, PDR is reduced in the network.
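The PDR metric defined above is a simple ratio; a minimal sketch (with hypothetical counts) is:

```python
def packet_delivery_rate(received, sent):
    """PDR = packets received by destinations / packets sent by sources."""
    return received / sent if sent else 0.0

# Hypothetical counts from one simulation run
pdr = packet_delivery_rate(received=900, sent=1000)
```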
The purpose of the second scenario is to evaluate the packet delivery rate according to the signal transmission radius (STR) of vehicles. As shown in Figure 11, when STR is large, PDR also increases, because the number of single-hop neighbors of each vehicle grows and a vehicle can cover a wider communication range. In this case, vehicles have a better chance of finding a next-hop node, which reduces the probability of the local optimum issue. According to Figures 10 and 11, QFHR achieves a better PDR than the other approaches. In Figure 10, which evaluates PDR based on PSR, QFHR increases the packet delivery rate by 2.74%, 6.67%, 22.35%, and 29.98% compared to IRQ, IV2XQ, QGrid, and GPSR, respectively. Additionally, in Figure 11, which shows PDR based on STR, QFHR improves PDR by 3.45%, 8%, 23.29%, and 26.17% compared to IRQ, IV2XQ, QGrid, and GPSR, respectively. This is mainly because QFHR performs the Q-learning-based routing process in a distributed manner to discover various paths between network intersections, whereas IRQ, IV2XQ, and QGrid use centralized routing processes. The distributed routing method is more compatible with the dynamic environment of a VANET and can find routes with lower delay and higher connection quality. Thus, it improves the packet delivery rate. Moreover, if the network is highly congested, QFHR can quickly detect this issue and switch to a new path. In contrast, IRQ selects the best route based on traffic information from the central server. In this scheme, the central server must wait to receive traffic messages from RSUs to detect congestion in the network. This process causes a high delay and may lose some data packets.
Furthermore, in IV2XQ, the central server chooses the best route based on historical traffic information stored in its memory, and the route selection process depends only on the vehicle density in the road sections. In addition, QGrid selects the best grid (i.e., the grid with maximum density) using a greedy routing process. However, the high-density grid is not always desirable and may cause congestion in the network and packet loss. GPSR is a greedy routing process and suffers from the local minimum problem, which increases packet loss.

Delay
End-to-end delay is defined as the time required to send packets from the source node to the destination node. The first scenario tests delay based on the packet sending rate, as displayed in Figure 12. According to this figure, a high PSR leads to high delay in all routing approaches because a high PSR causes congestion in the network. As a result, data packets wait longer in the buffer queue, meaning that the queuing delay increases. Therefore, delay increases in the data transmission process. The second scenario evaluates delay based on the signal transmission radius (STR) of vehicles, as shown in Figure 13. When STR is large, delay decreases in all methods because the number of hops in the routing path is reduced. According to Figures 12 and 13, QFHR has the lowest delay compared with the other approaches. In Figure 12, the proposed scheme lowers delay by 16.19%, 22.82%, 34.15%, and 59.51% compared to IRQ, IV2XQ, QGrid, and GPSR, respectively. Moreover, in Figure 13, QFHR decreases delay by 19.86%, 34.26%, 44.09%, and 68.39% compared to IRQ, IV2XQ, QGrid, and GPSR, respectively. This is because QFHR uses a distributed Q-learning-based routing process to choose the next intersection with regard to the road congestion status and road connection quality. In addition, in the V2V routing process at each road section, a fuzzy logic-based route recovery process selects the next-hop vehicle in accordance with the queue status and connection quality of vehicles. These techniques prevent congestion in the network and lower delay. On the other hand, IRQ, IV2XQ, and QGrid are centralized routing methods in which the central server performs the route discovery process and sends the routes to vehicles in the network. This increases the delay in the packet transmission operation. Moreover, IRQ must periodically send traffic messages to the central server to calculate routes based on these messages, which is another factor increasing delay in IRQ.
Note that IRQ and IV2XQ are equipped with a congestion control mechanism; thus, they achieve an acceptable delay. However, QGrid does not include any congestion control mechanism, which is the most important reason for its high delay. GPSR experiences the worst delay compared with the other methods because it suffers from the local optimum problem.

Hop Count
Hop count is defined as the average number of intermediate nodes in the routing path between the source node and the destination node. The first scenario evaluates the number of hops based on the packet sending rate, as represented in Figure 14. Based on this figure, we can deduce that the number of hops has a direct relationship with the packet sending rate (PSR), meaning that a high PSR leads to a high number of hops in the routing path, because a high PSR causes congestion in the network and negatively affects the routing process. QFHR lowers the number of hops by 6.74%, 20.09%, 2.68%, and 12.22% compared to IRQ, IV2XQ, QGrid, and GPSR, respectively. The second scenario evaluates the number of hops based on the signal transmission radius (STR), as represented in Figure 15. According to this figure, when STR has a large value, the number of intermediate nodes in the path decreases. QFHR lowers the number of hops by 14.13%, 32.58%, 7.71%, and 21.39% compared to IRQ, IV2XQ, QGrid, and GPSR, respectively.

Routing Overhead
Routing overhead is equal to the ratio of the total failed data packets in the data transfer process plus the control packets used in the route discovery and maintenance processes to all the packets produced in the network. The first scenario tests the routing overhead based on the packet sending rate, as displayed in Figure 16. According to this figure, there is a direct relationship between PSR and routing overhead: when the packet sending rate goes up, the routing overhead increases, because a high PSR increases the collision probability and packet loss due to network congestion. As a result, the need for retransmitting packets grows, which raises the routing overhead. The second scenario evaluates the overhead of the various methods based on the signal transmission radius (STR); the results are shown in Figure 17. This figure shows that a high STR leads to low overhead because the packet delivery rate is high, which reduces the need for retransmitting data packets. According to Figure 16, which evaluates the overhead based on PSR, QFHR reduces the overhead by 30.23% and 41.4% compared to QGrid and GPSR, respectively. However, the proposed method increases the overhead by approximately 9.36% and 11.34% compared to IRQ and IV2XQ, respectively. Moreover, according to Figure 17, which shows the overhead based on the signal transmission radius, QFHR decreases overhead by 21.99% and 25.63% compared to QGrid and GPSR, respectively. However, it has a higher overhead than IRQ (11.26%) and IV2XQ (25%). IV2XQ uses historical traffic information stored in the server memory in its Q-learning-based routing process to discover paths between network intersections, meaning that it does not exchange any control packets for discovering these routes. As a result, its overhead is extremely low. In contrast, in QFHR, the RSUs are responsible for discovering routes between intersections, and the traffic information is periodically updated.
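The overhead metric defined at the start of this subsection is also a simple ratio; a minimal sketch (with hypothetical counts) is:

```python
def routing_overhead(failed_data, control, total_produced):
    """Overhead = (failed data packets + control packets) / all packets
    produced in the network, per the definition used in this section."""
    return (failed_data + control) / total_produced if total_produced else 0.0

# Hypothetical counts from one simulation run
overhead = routing_overhead(failed_data=10, control=40, total_produced=1000)
```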
This increases the routing overhead in this scheme. GPSR has the worst routing overhead due to the local optimum problem. QGrid also has a high routing overhead because it does not use any congestion control mechanism. As a result, as congestion in the network increases, packet loss increases, which raises the need for retransmitting data packets.

Conclusions
In this paper, a Q-learning and fuzzy logic-based hierarchical routing approach (QFHR) was proposed for VANETs. In the first step, a traffic identification algorithm was presented so that each RSU obtains information about the traffic conditions of the four roads connected to its intersection. Next, each RSU uses this traffic information in a Q-learning-based routing algorithm to discover the most suitable path between different intersections. In the last step, the routing algorithm at the road level was introduced. It is a greedy routing algorithm that uses fuzzy logic to recover routes and solve the local optimum problem. QFHR was implemented in NS2, and the results were compared with IRQ, IV2XQ, QGrid, and GPSR in two scenarios. The first scenario analyzes the results based on the packet sending rate (PSR). In this scenario, QFHR increases PDR by 2.74%, 6.67%, 22.35%, and 29.98%, reduces delay by 16.19%, 22.82%, 34.15%, and 59.51%, and lowers the number of hops by 6.74%, 20.09%, 2.68%, and 12.22% compared with IRQ, IV2XQ, QGrid, and GPSR, respectively. However, it increases the overhead by approximately 9.36% and 11.34% compared to IRQ and IV2XQ, respectively. Additionally, the second scenario evaluates the results with regard to the signal transmission radius (STR). In this scenario, QFHR increases PDR by 3.45%, 8%, 23.29%, and 26.17%, decreases delay by 19.86%, 34.26%, 44.09%, and 68.39%, and reduces the number of hops by 14.13%, 32.58%, 7.71%, and 21.39% compared to IRQ, IV2XQ, QGrid, and GPSR, respectively. However, it has a higher overhead than IRQ (11.26%) and IV2XQ (25%). These results show that QFHR performs well in terms of PDR, delay, and the number of hops, although it incurs a greater routing overhead than IRQ and IV2XQ. Future research will focus on reducing the routing overhead of QFHR.
This can be achieved in two ways: (1) providing a clustering technique to increase scalability and reduce routing overhead, and (2) adjusting the hello broadcast interval dynamically according to road traffic conditions.