A Q-Learning-Based Delay-Aware Routing Algorithm to Extend the Lifetime of Underwater Sensor Networks

Underwater sensor networks (UWSNs) have become a hot research topic because of their various aquatic applications. As the underwater sensor nodes are powered by built-in batteries which are difficult to replace, extending the network lifetime is a most urgent need. Due to the low and variable transmission speed of sound, the design of reliable routing algorithms for UWSNs is challenging. In this paper, we propose a Q-learning based delay-aware routing (QDAR) algorithm to extend the lifetime of underwater sensor networks. In QDAR, a data collection phase is designed to adapt to the dynamic environment. With the application of the Q-learning technique, QDAR can determine a global optimal next hop rather than a greedy one. We define an action-utility function in which residual energy and propagation delay are both considered for adequate routing decisions. Thus, the QDAR algorithm can extend the network lifetime by uniformly distributing the residual energy and provide lower end-to-end delay. The simulation results show that our protocol can yield nearly the same network lifetime, and can reduce the end-to-end delay by 20–25% compared with a classic lifetime-extended routing protocol (QELAR).


Introduction
For decades, underwater wireless sensor networks (UWSNs) have attracted significant interest. Many applications of UWSNs, including commercial exploitation, marine mammal studies and oceanography data collection [1,2] allow humans to sense the vast underwater domain and motivate research on UWSN design.
However, because of the harsh environment and limited spectrum source, communications in UWSNs are much more difficult than those in terrestrial sensor networks. One of the reasons is that the radiowaves employed in terrestrial sensor networks is not feasible in the underwater environment because of their rapid attenuation. For example, Berkeley Mica 2 motes have been reported to have only 120 cm communication range in an underwater environment at 433 MHz [3]. Currently, the only appropriate method for long distance communications is acoustic communication. The speed of sound in water is about 1500 m/s, five-orders slower than the speed of radiowaves, thus there is a long propagation delay in UWSNs [4]. Moreover, the sensor nodes are deployed under the sea, therefore, it is difficult to recharge their batteries [5]. Since the sensors are powered by batteries, the limited energy restricts the network lifetime of UWSNs. Network lifetime is the time span from the deployment to the instant when the network is considered nonfunctional [6]. In this paper, the network lifetime is defined The rest of this paper is organized as follows: in Section 2, related works on underwater routing protocols are discussed briefly. In Section 3, the basic Q-learning technique is introduced and adopted into our system model. In Sections 4 and 5, the QDAR algorithm is described in detail. The simulation results are shown and discussed in Section 6. Finally, we conclude this paper in Section 7.

Related Work
Underwater routing techniques are a hot research topic for UWSNs nowadays. There are several kinds of routing protocols that aim to improve energy efficiency, reduce end-to-end delay and prolong network lifetime [13,14]. In this section, we provide a review on research works that have been done on this topic.
Most energy-efficient routing protocols aim to reduce energy consumption and prolong network lifetime. A hierarchical routing algorithm called queen-bee evolution algorithm (QEGA) [15] works better in terms of energy consumption. QEGA has a high rate which results in premature convergence. Thus, the algorithm can find the optimal solution more quickly. However, QEGA does not consider the residual energy, which is important to extend network lifetime. The energy-saving vector-based forwarding (ES-VBF) protocol [16] defines a desirableness factor based on residual energy and location information. In the routing pipe, nodes with more residual energy are more possible to forward packets. Although the algorithm prolongs the network lifetime, it needs the location information of all the nodes, which is still a challenge to be solved.
The adaptive power controlled routing (APCR) [17] is an energy efficient routing schema that does not require any location information. In APCR, nodes are assigned to concentric layers according to the signal power of a received INTEREST packet broadcasted by sink nodes. Then, routing paths are decided based on layer numbers and residual energy. To improve the energy efficiency, nodes are able to adjust their transmission power to a set of values according to the information received during packet transmission. If forwarding nodes are found at multiple layers, the power is decreased. If no neighbor is found, the power is increased. Thus, APCR can achieve a high delivery ratio, but the number of forwarding nodes at each layer is not limited properly. If multiple nodes forward the same packet, the total energy consumption is increased.
The Q-learning-based adaptive routing (QELAR) protocol is proposed [18] and Q-learning is proved to perform well in UWSNs in several aspects. QELAR defines the reward function based on the residual energy of the sensor nodes. In this protocol, sensor nodes choose the node with more residual energy as the next hop, so that the network lifetime of the network can be extended. However, in QELAR protocol, each node takes the responsibility to learn the environment by metadata exchanging and decide the next hop, leading to a higher energy consumption for each node. Moreover, the protocol does not restrict end-to-end delay. When the number of the sensor nodes increases, the routing will detour with more and more nodes, then the end-to-end delay is prolonged. Thus, QELAR works inefficiently in some situations because of the long delay.
Many research works point out that the problems of latency in UWSNs are serious, especially for time-critical applications. In [19] the authors employ a probability model to describe the propagation delay of a link and select the next hop with lower delay. In [20], an underwater opportunistic routing (UWOR) is proposed. The forwarding set in which nodes can hear each other and prevent packet duplication is established. Each node in the forwarding set is assigned a relay priority which is related to the probability of successful transmission. The node with the highest priority and limited end-to-end delay can be chosen as the relay node. The simulation results show that UWOR can maximize good put while satisfying end-to-end latency requirements. However, it disables retransmission mechanisms, leading to a lower delivery ratio.
Moreover, there are protocols that can jointly reduce energy consumption and end-to-end delay. Modified energy weight routing (MEWR) protocol [3] is energy efficiency guaranteed using a minimum algorithm. In order to determine an optimal path with a low end-to-end delay as well as low energy consumption, the cost of a link is formulated as a mixture of energy weight and delay weight. In the path discovery phase, a node employs a greedy approach to find all its neighbors and determine an optimal one with the lowest cost. However, low energy consumption does not lead to long network lifetime effectively. Since MEWR does not take the residual energy of sensor nodes into account, it cannot optimize the energy distribution, which is crucial for network lifetime extension.

Q-learning Based Model
In this section, we give a brief introduction to the basic Q-learning technique, which is the fundamental theory of our algorithm. Next, we explain our Q-learning based system model.

The Basic Q-Learning Technique
Q-learning is a model-free reinforcement learning method. It provides agents with the capability of learning to act optimally. In this technique, an agent chooses an action at a particular state according to the reinforcement it receives. The reinforcement is composed of the direct reward or penalty and the future consequence the agent estimates. With the reinforcement, the agent can evaluate how good an action is in the current situation. The task of the Q-learning is to determine an optimal policy to get a higher reward. Once the agent receives the highest reward, it will stop at the state which is called absorbing goal. An agent in state x n at step n can go to the next state x n+1 by choosing action a n under the policy π. The probability of shifting to state x n+1 according to the law: P a n x n x n+1 = prob{x = x n+1 x n , a n } The agent receives rewards according both the values of direct reward and the future reward. A direct reward r n in state x n is received immediately after the agent performs action a n , only depending upon the state and action. A future reward is the total reward that the agent expects in the new state after action a n . The action-utility function Q π (x, a), which is used to describe the expected return of action a in state x under policy π, is critical to Q-learning: Q π (x n , a n ) = r n + γ ∑ x n+1 ∈X P a n x n x n+1 Q π (x n+1 , a) where: x n+1 ∈X P a n x n x n+1 R a n x n x n+1 (3) R a n x n x n+1 is the reward of action a n , which can make the agent go into state x n+1 from state x n . γ(0 ≤ γ < 1) is the discount factor. It means the reward received current step hence is worth less than that received now. Typically, to balance the direct and future reward, the value of γ is within (0.5, 0.99). If an optimal policy is performed thereafter, we can derive the optimal Q value. It is proved that at least one optimal stationary policy π * exists [21]. Under the optimal policy, Q * can be described as: Thus, in order to get an optimal Q, we have: Q(x n , a n ) = r n + γ ∑ x n+1 ∈X P a n x n x n+1 Q * (x n+1 ) (5) and: a n = arg maxQ(x n , a n ) where a n is the optimal action to get the optimal Q.

Q-Learning Based System Model
In the UWSNs, the sink nodes are on the surface while the source nodes are deployed underwater. The sink nodes can receive data from multiple distributed source nodes [22]. In our QDAR mechanism, when a source node has a packet to send to the sink node, it broadcasts to the sink node to request communication as well as collect information. This mechanism is explained in the next section. With the collected information, the sink node can lay out a virtual topology from the source node to the sink node and decide the routing path thereafter.
Each packet in the networks can be seen as an agent in the Q-learning technique. The Q-learning state is related to the node which holds the packet. When a packet is at node i, the state of the agent is x i . Packet forwarding from node i to node j is action a i . If the transmission is successful, the packet (agent) state shifts from x i to x j . The routing path is the policy π, directing the packet (agent) to take the proper action. Obviously, it is costly for a node to perform all actions by sending packets to all the neighbor nodes and get all these Q values. Therefore, in our mechanism, the sink assumes a virtual packet and sends it in the virtual topology. By doing this virtual experiment, the sink node can perform QDAR algorithm and determine the routing path because the sink holds the nodes information, such as the delay and the residual energy.

The QDAR Mechanism
The QDAR mechanism is designed for the overall routing process. There are five phases in QDAR mechanism: data_ready phase, routing decision, interest phase, packet forwarding and acknowledgement. The data_ready phase is responsible for information collection, after which the sink node can get the fundamental information for routing decision. Then, the sink node determines the path according to QDAR algorithm. During the interest phase, the sink node sends a packet to the source node along the determined path. Thus, the path can be constructed. Packets are sent along the path thereafter as long as the acknowledgement phase is successful. Otherwise, the whole mechanism restarts. The main communication procedure is briefly depicted as Figure 1. QDAR mechanism, when a source node has a packet to send to the sink node, it broadcasts to the sink node to request communication as well as collect information. This mechanism is explained in the next section. With the collected information, the sink node can lay out a virtual topology from the source node to the sink node and decide the routing path thereafter. Each packet in the networks can be seen as an agent in the Q-learning technique. The Q-learning state is related to the node which holds the packet. When a packet is at node i, the state of the agent is xi. Packet forwarding from node i to node j is action ai. If the transmission is successful, the packet (agent) state shifts from xi to xj. The routing path is the policy , directing the packet (agent) to take the proper action. Obviously, it is costly for a node to perform all actions by sending packets to all the neighbor nodes and get all these Q values. Therefore, in our mechanism, the sink assumes a virtual packet and sends it in the virtual topology. By doing this virtual experiment, the sink node can perform QDAR algorithm and determine the routing path because the sink holds the nodes information, such as the delay and the residual energy.

The QDAR Mechanism
The QDAR mechanism is designed for the overall routing process. There are five phases in QDAR mechanism: data_ready phase, routing decision, interest phase, packet forwarding and acknowledgement. The data_ready phase is responsible for information collection, after which the sink node can get the fundamental information for routing decision. Then, the sink node determines the path according to QDAR algorithm. During the interest phase, the sink node sends a packet to the source node along the determined path. Thus, the path can be constructed. Packets are sent along the path thereafter as long as the acknowledgement phase is successful. Otherwise, the whole mechanism restarts. The main communication procedure is briefly depicted as Figure 1.

Assumptions
The following assumptions apply: 1.
Nodes hold their own depth information and can embed it in the packets; 2.
The sink node keeps the successful and failed communication record of the nodes.
In the Source_initiated Query, if a source node has a packet to send to the sink, it first broadcasts a DATA_READY packet for both communication request and information collection. This phase is defined as the data_ready phase in this paper. During the data_ready phase, a node only forwards the packet if its depth is smaller than that of the previous node.
Once the sink node receives this packet, it can decide if it is interested in such data. If so, the sink node decides the routing path according to the QDAR algorithm and sends out an INTEREST packet to the source node along the decided path for path construction. This phase is defined as the interest phase. In this phase, each node records its previous hop and takes it as the next hop in the packet forwarding phase. Only nodes selected by QDAR algorithm join in the INTEREST packet forwarding, while the others turn into sleep state thereafter or take responsibilities for another communication task to save energy or improve network capacity.
After the INTEREST packet reaches the source node, the path is constructed. Then, the source node transmits the data packet through the constructed path. Finally, the sink node responds acknowledgement (ACK) packet after the successful data reception. If the source node receives ACK successfully, this communication episode concludes. The coming packets from the same source node are sent along the same path without repeated data collection and routing decision phases until transmission failure occurs.

The Packet Structures
Packet structures for DATA_READY packet and INTEREST packet, which can help establish the routing track, are designed as follows. DATA_READY packet contains the type and ID of the packet, source node address (SNA), time stamp (TS), current node depth (CND) and two arrays C and RES in which two costs and residual energy of this node (RE) are saved respectively. SNA is permanent during the network lifetime of the packet. Elements of C are calculation results of two cost functions: delay-related cost and energy-related cost. Denoting the residual energy of node i as e i res , we define the energy-related cost function ce e i res as: where e i ini is the initial energy of node i. The less energy a node remains, the more it costs to forward packets. A node with higher cost is more reluctant to communicate. A delay-related cost function ct t ij is also defined: where t ij represents the delay of sending packet between node i and node j. Clearly, the longer delay, the higher cost. On receiving a DATA_READY packet, the nodes have two tasks: 1.
Calculate. The nodes extract RE and TS. With these information, the nodes can compute their energy-related costs and delay-related costs depending on the delay and residual energy.

2.
Update and relay. The nodes update TS, C and RES packet fields with their own information or calculation results and relay the DATA_READY packet to neighbor nodes until the packet arrives at the sink node.
When the DATA_READY packet reaches the sink node, the sink node holds the relevant data and begins routing decision if it is interested in the packet. After the routing decision phase, the sink node creates an INTEREST packet composed of type and packet ID, destination node address (DNA), current node depth (CND) and path direction (PD). DNA is the source node address (SNA) of DATA_READY packet. The structures of both packets and their relationship are shown in Figure 2.

QDAR Algorithm
In this section, we describe the proposed QDAR algorithm in details. The important notations are listed in Table 1.

Name
Description Packet forwarding from node i to node j Constant cost , Weight of two costs The neighbor node set of i Delay sensitivity of communication between node i and j The modified Delay of communication between node i and j The time of failed communication between node i and j ce The energy-related cost of node i c The delay-related cost of node i γ The discount factor of future reward In order to design a Q-learning-based delay-aware routing protocol to extend the network lifetime of UWSNs, we define an action-utility function whose value is Q . The sink node has a matrix in which Q values of all the nodes are stored. These Q values are future rewards of packet forwarding and are used in the routing decisions. In our protocol, node i and packet forwarding from node i to node j are seen as state i and action j of Q-learning technique, respectively.
Firstly, we define a reward function for action j, which is related to both propagation delay and residual energy. If this transmission is successful, the reward of the action for node i is: where is action j. Because forwarding packet occupies channel bandwidth and disturbs other nodes, a constant cost is added into the function. is the weight of the sum of delay-related cost

QDAR Algorithm
In this section, we describe the proposed QDAR algorithm in details. The important notations are listed in Table 1. The neighbor node set of i ϕ ij Delay sensitivity of communication between node i and j ϕ ij The modified ϕ ij t ij Delay of communication between node i and j t ij The time of failed communication between node i and j ce e i res The energy-related cost of node i ct t ij The delay-related cost of node i γ The discount factor of future reward In order to design a Q-learning-based delay-aware routing protocol to extend the network lifetime of UWSNs, we define an action-utility function whose value is Q. The sink node has a matrix in which Q values of all the nodes are stored. These Q values are future rewards of packet forwarding and are used in the routing decisions. In our protocol, node i and packet forwarding from node i to node j are seen as state i and action j of Q-learning technique, respectively.
Firstly, we define a reward function R a j s ij for action j, which is related to both propagation delay and residual energy. If this transmission is successful, the reward of the action for node i is: where a j is action j. Because forwarding packet occupies channel bandwidth and disturbs other nodes, a constant cost β 0 is added into the function. β 1 is the weight of the sum of delay-related cost and energy-related cost. ϕ ij is the delay sensitivity. A higher ϕ ij means the delay is more repellent.
If the transmission fails, node i will resend the packet, which means the node should pay double energy-related cost and more delay-related cost. Thus, the reward function becomes: where t ij is the time that node i spends in the failed transmission. β 2 is the weight of the sum of delay-related cost and energy-related cost, in the same position as β 1 .
To further prolong the network lifetime, we design an adaptive detouring path strategy. We define a set N i for node i and an energy warning threshold e wth . The elements of N i are neighbor nodes of node i. When the residual energy of the next hop of node i, for example, node j, is lower than e wth , or the energy-related cost ce e j res > ce(e wth ), the sink node modifies the ϕ ih | h∈N, h =j values to In this way, the weight of delay-related cost of communication with node j is higher than those of the other nodes. With ϕ ih , the sink node can determine a detouring path by choosing nodes with more residual energy as the next hop of node i.
To calculate the direct reward, sink node keeps the communication record so as to estimate the state transition probabilities: P a j x ij and P a j x ii of each node. P a j x ij and P a j x ii are the probabilities of successful and failed packet forwarding respectively: Suppose there have been M t instances of communication up to time t and m t is the frequency of successful packet forwarding, we can define the direct reward function as: The highest Q value among all the actions is described as: If the next hop is the sink node, the Q of this node is much higher than the others. According to Equations (5) and (14), we can define the action-utility function for each neighbor node of node i as: Then, we choose the neighbor node with the highest Q as the next hop and update the previous Q storied with newly chosen Q * . Initially, if the next hop is not the destination node, Q values in the matrix are set to 0. Otherwise, Q values are set to 1. Algorithm 1 for the routing mechanism is conducted as below. nodes satisfy ce(e wth ) < ce e j res are saved in set N i0 ; end for if N i0 ! = ∅ then set ϕ ih to ϕ ih , h ∈ N i /(N i0 ); calculate the direct reward r; select the node x j with maximum Q value in set The QDAR mechanism can adapt to the dynamic underwater environment quickly. An energy-related cost function and a delay-related cost function are defined in the data_ready phase. After that, the sink node has the collected information and performs QDAR algorithm. Firstly, it defines two reward functions with the cost functions for both successful and failed transmission. In the reward functions, there is an alterable parameter. Based on this parameter, an adaptive detouring path strategy is designed. Thus, QDAR algorithm can work adaptively with different residual energy. Then, with the reward functions and the corresponding probability functions, the action-utility function (Q-value) is determined.
Finally, the sink node chooses the global optimal next hop and then determines the routing path. In the next section, the simulation results prove that the adaptive solution in QDAR ensures a longer network lifetime and a relatively shorter delay.

Performance Evaluation
In this section, we evaluate the performance of the QDAR algorithm. Firstly, the performance of the QDAR algorithm with different parameters is shown. Then, we compare QDAR with the Q-learning-based lifetime-aware (QELAR) routing protocol and the vector-based forwarding (VBF) routing protocol. Finally, the performances of QDAR and QELAR are compared with different underlying MAC protocols.

Experimental Framework
In our simulation, 80 sensor nodes are randomly deployed in a 5000 m × 5000 m × 1500 m three-dimensional space. These sensor nodes are identical in every feature, such as the initial energy, communication power, transmission range and so on. Five sink nodes are randomly deployed on the water surface. As it is convenient to maintain these sink nodes, we assume that these sink nodes have infinite energy and can communicate with each other by radio. Thus, packets can be sent to any sink node. We place a source node at the bottom layer of the network. The transmission speed is 1500 m/s. At the source node, packets are generated and prepared to send to the sink node following an independent Poisson process with a rate λ (packets/s). The initial energy, energy warning threshold are set to 150 J and 75 J. Because β 1 and β 2 are in equal positions, they are set to the same value: β 1 = β 2 . The constant cost β 0 = 1 and ϕ ij = 1. We use ϕ 0 to restrict the detour routing. The underlying MAC protocol is the underwater power control (UPC) protocol [24]. Moreover, the preamble signal length should be considered in the end-to-end delay as it is long in UWSNs. For example, the preamble of the Aqua-sent OFDM is 0.49 s, significantly increasing the end-to-end delay. In accordance with the hardware specifications of underwater OFDM modems [25], the simulation parameters are listed in Table 2.

Evaluation with Different Parameters
To evaluate the influence of ϕ 0 , β 1 and β 2 , we simulate the variance of residual energy and the average latency of each packet as these values change. β 1 and β 2 are the weights of the sum of delay-related cost and energy-related cost for successful and failed transmission respectively. ϕ ij is the delay sensitivity. ϕ 0 is the coefficient to modify ϕ ij . Higher β 1 and β 2 can distribute energy more evenly because the residual energy is more important in routing decision. Higher ϕ 0 means delay is more important so that the detouring routing is restricted.
In the simulation, ϕ 0 , β 1 and β 2 vary between 0.1 and 0.9 with a step of 0.2. Comparing the variance of residual energy in Figure 3 with varying parameters, we can conclude that with larger β 1 , β 2 and smaller ϕ 0 , the residual energy distributes more uniformly. When β 1 = β 2 = 0.9 and ϕ 0 = 0.1, the residual energy variance is about 18, less than 1/3 of that with β 1 = β 2 = 0.1 and ϕ 0 = 0.9. That is to say less nodes will drain the energy early. Thus, the network lifetime can be prolonged.

Evaluation with Different Parameters
To evaluate the influence of , and , we simulate the variance of residual energy and the average latency of each packet as these values change.
and are the weights of the sum of delay-related cost and energy-related cost for successful and failed transmission respectively. is the delay sensitivity.
is the coefficient to modify . Higher and can distribute energy more evenly because the residual energy is more important in routing decision. Higher means delay is more important so that the detouring routing is restricted.
In the simulation, , and vary between 0.1 and 0.9 with a step of 0.2. Comparing the variance of residual energy in Figure 3 with varying parameters, we can conclude that with larger , and smaller , the residual energy distributes more uniformly. When = = 0.9 and = 0.1, the residual energy variance is about 18, less than 1/3 of that with = = 0.1 and = 0.9. That is to say less nodes will drain the energy early. Thus, the network lifetime can be prolonged. However, as shown in Figure 4, the end-to-end delay is increased. When = = 0.9, = 0.1, the delay is 42 s, about 25% longer than that with = = 0.1, = 0.9.
This is because that the path detours to avoid nodes close to the sink node but with relatively lower energy. On one hand, higher and mean higher weight of the energy-related cost. Nodes with more residual energy cost less to communicate. Therefore, they are more favorable to packet forwarding, making the variance of residual energy lower. On the other hand, as 0 < < 1, indicates that the sink node attaches less importance to delay during the routing decision. Although a node is far away from the sink node, it can be the next hop as long as the residual energy is high enough. As a result, detour routing with more residual energy and longer latency is chosen as the routing path. However, as shown in Figure 4, the end-to-end delay is increased. When β 1 = β 2 = 0.9, ϕ 0 = 0.1, the delay is 42 s, about 25% longer than that with β 1 = β 2 = 0.1, ϕ 0 = 0.9. This is because that the path detours to avoid nodes close to the sink node but with relatively lower energy. On one hand, higher β 1 and β 2 mean higher weight of the energy-related cost. Nodes with more residual energy cost less to communicate. Therefore, they are more favorable to packet forwarding, making the variance of residual energy lower. On the other hand, as 0 < ϕ 0 < 1, ϕ 0 indicates that the sink node attaches less importance to delay during the routing decision. Although a node is far away from the sink node, it can be the next hop as long as the residual energy is high enough. As a result, detour routing with more residual energy and longer latency is chosen as the routing path. Furthermore, as ϕ 0 increases, the rates of the incremental variance and decreasing average latency reduce. The reason is that, when ϕ 0 increases to a certain degree, the sink node determines routing path with the lowest latency thereafter, paying little attention to the residual energy. However, when ϕ 0 is too small, the average latency increases significantly because of the detour routing. Thus, ϕ 0 can not increase or decrease infinitely. In the following evaluation, we set ϕ 0 to 0.3, 0.5 and 0.7 respectively and set β 1 , β 2 to 0.7.
Furthermore, as increases, the rates of the incremental variance and decreasing average latency reduce. The reason is that, when increases to a certain degree, the sink node determines routing path with the lowest latency thereafter, paying little attention to the residual energy. However, when is too small, the average latency increases significantly because of the detour routing. Thus, can not increase or decrease infinitely. In the following evaluation, we set to 0.3, 0.5 and 0.7 respectively and set , to 0.7.  Figure 5 depicts the average latency of each package with different and the comparison with the QELAR protocol, a Q-learning-based lifetime-extended protocol. As λ increases, more packets need to be forwarded and the average latency is prolonged. For example, when = 0.3, the latency increases about 23% with λ grows from 0.02 to 0.1. This is because both QDAR and QELAR choose detour routing to distribute the residual energy evenly. When there are more packets sent in UWSNs, nodes closer to the sink node have relatively less residual energy. Thus, the nodes far away from the sink node but with more residual energy are chosen to relay packets.   Figure 5 depicts the average latency of each package with different ϕ 0 and the comparison with the QELAR protocol, a Q-learning-based lifetime-extended protocol. As λ increases, more packets need to be forwarded and the average latency is prolonged. For example, when ϕ 0 = 0.3, the latency increases about 23% with λ grows from 0.02 to 0.1. This is because both QDAR and QELAR choose detour routing to distribute the residual energy evenly. When there are more packets sent in UWSNs, nodes closer to the sink node have relatively less residual energy. Thus, the nodes far away from the sink node but with more residual energy are chosen to relay packets. Furthermore, as increases, the rates of the incremental variance and decreasing average latency reduce. The reason is that, when increases to a certain degree, the sink node determines routing path with the lowest latency thereafter, paying little attention to the residual energy. However, when is too small, the average latency increases significantly because of the detour routing. Thus, can not increase or decrease infinitely. In the following evaluation, we set to 0.3, 0.5 and 0.7 respectively and set , to 0.7.  Figure 5 depicts the average latency of each package with different and the comparison with the QELAR protocol, a Q-learning-based lifetime-extended protocol. As λ increases, more packets need to be forwarded and the average latency is prolonged. For example, when = 0.3, the latency increases about 23% with λ grows from 0.02 to 0.1. This is because both QDAR and QELAR choose detour routing to distribute the residual energy evenly. When there are more packets sent in UWSNs, nodes closer to the sink node have relatively less residual energy. Thus, the nodes far away from the sink node but with more residual energy are chosen to relay packets.  It is also observed that, in UWSNs with a lower λ, QDAR can reduce latency more significantly. Especially when ϕ 0 = 0.7, the latency is reduced by nearly 25%. The reason is that at the initial stage of communication, although the residual energy reduces, it is still sufficient and does not worth detour routing. However, QELAR detours from the outset at the initial stage, while QDAR waits until the residual energy of some nodes is lower than a threshold. The higher the ϕ 0 is, the more attention is paid to latency. Another reason is that, in QDAR, it is the sink node that takes responsibility to construct routing path. Whereas in QELAR, every node in the network should learn the environment and choose an optimal action, leading to a longer delay. In this way, QDAR outperforms QELAR in the average latency by 20-25% as shown in Figure 5.

Comparison with QELAR and VBF
Since the network lifetime is another critical performance of network, we compare the network lifetime of QDAR, QELAR and VBF protocol [23]. The network lifetime is defined as the total performing time until the first node drains its energy. The ratio of QDAR network lifetime and QELAR network lifetime as well as the ratio of QDAR network lifetime and VBF network lifetime is shown in Figure 6. As depicted, the ratio of QDAR network lifetime and VBF network lifetime is about 1.2 with ϕ 0 = 0.7. When ϕ 0 = 0.3, the ratio increases to 1.3. This is because in VBF, all the nodes in the virtual pipe between a source and a destination have chances to relay packets. As a result, many packets are sent repeatedly by different nodes, making them drain energy more quickly. With λ growing from 0.02 to 0.1, more packets are sent and the nodes within the pipe die earlier. On the contrary, QDAR takes the residual energy into account and extends the network lifetime by avoiding to choose nodes with relative less energy. The ratio of QDAR network lifetime and QELAR network lifetime is smaller than 1. The reason is that QELAR can distribute routing traffic to nearly every node in UWSNs, extending the lifetime greatly. While QDAE also take delay into account, avoiding choosing nodes that are too far away from the sink node to forward packets. Thus, some nodes may exhaust energy earlier than that in QELAR. However, even for the worst case, the network lifetime can still achieve 82% of that in QELAR. The factor attributes to the result is our adaptive detouring path strategy: if the residual energy of a node is less than e wth , an adequate path with nodes remaining more energy is chosen to relay packets, to postpone energy depletion of the node.
It is also observed that, in UWSNs with a lower λ, QDAR can reduce latency more significantly. Especially when = 0.7, the latency is reduced by nearly 25%. The reason is that at the initial stage of communication, although the residual energy reduces, it is still sufficient and does not worth detour routing. However, QELAR detours from the outset at the initial stage, while QDAR waits until the residual energy of some nodes is lower than a threshold. The higher the is, the more attention is paid to latency. Another reason is that, in QDAR, it is the sink node that takes responsibility to construct routing path. Whereas in QELAR, every node in the network should learn the environment and choose an optimal action, leading to a longer delay. In this way, QDAR outperforms QELAR in the average latency by 20-25% as shown in Figure 5.
Since the network lifetime is another critical performance of network, we compare the network lifetime of QDAR, QELAR and VBF protocol [23]. The network lifetime is defined as the total performing time until the first node drains its energy. The ratio of QDAR network lifetime and QELAR network lifetime as well as the ratio of QDAR network lifetime and VBF network lifetime is shown in Figure 6. As depicted, the ratio of QDAR network lifetime and VBF network lifetime is about 1.2 with = 0.7. When = 0.3, the ratio increases to 1.3. This is because in VBF, all the nodes in the virtual pipe between a source and a destination have chances to relay packets. As a result, many packets are sent repeatedly by different nodes, making them drain energy more quickly. With λ growing from 0.02 to 0.1, more packets are sent and the nodes within the pipe die earlier. On the contrary, QDAR takes the residual energy into account and extends the network lifetime by avoiding to choose nodes with relative less energy. The ratio of QDAR network lifetime and QELAR network lifetime is smaller than 1. The reason is that QELAR can distribute routing traffic to nearly every node in UWSNs, extending the lifetime greatly. While QDAE also take delay into account, avoiding choosing nodes that are too far away from the sink node to forward packets. Thus, some nodes may exhaust energy earlier than that in QELAR. However, even for the worst case, the network lifetime can still achieve 82% of that in QELAR. The factor attributes to the result is our adaptive detouring path strategy: if the residual energy of a node is less than e , an adequate path with nodes remaining more energy is chosen to relay packets, to postpone energy depletion of the node.

Evaluation with Different MAC Protocols
We next examine how the underlying MAC protocol layer affects the performance of QDAR and QELAR in terms of total energy consumption. The slotted floor acquisition multiple access (SFAMA) Figure 6. Lifetime ratios of QDAR and different routing protocols (QELAR and VBF) with different packet generation rate λ (packets/s).

Evaluation with Different MAC Protocols
We next examine how the underlying MAC protocol layer affects the performance of QDAR and QELAR in terms of total energy consumption. The slotted floor acquisition multiple access (SFAMA) MAC protocol [26] and underwater power control (UPC) MAC protocol [24] are used for QDAR and QELAR, respectively. The results are shown in Figure 7, where the total energy consumption is normalized to (0, 1). As QDAR restricts routing decision with end-to-end delay, some packets may detour with more hops in QELAR than that of QDAR. More nodes forwarding packets means more total energy consumption. Thus, QELAR consumes more energy than QDAR regardless of the MAC protocol used. Besides, we can see that routing algorithms with UPC can achieve lower energy consumption than those with SFAMA. The reason is that UPC can reduce the transmission power level to achieve a better energy efficiency as well as smaller interference range. Moreover, as UPC can deal with the noise, interference and spatial channel reuse, QDAR do not need to pay much attention to these problems. Therefore, using UPC as the underlying MAC protocol, QDAR can prolong the network lifetime more efficiently. MAC protocol [26] and underwater power control (UPC) MAC protocol [24] are used for QDAR and QELAR, respectively. The results are shown in Figure 7, where the total energy consumption is normalized to (0, 1). As QDAR restricts routing decision with end-to-end delay, some packets may detour with more hops in QELAR than that of QDAR. More nodes forwarding packets means more total energy consumption. Thus, QELAR consumes more energy than QDAR regardless of the MAC protocol used. Besides, we can see that routing algorithms with UPC can achieve lower energy consumption than those with SFAMA. The reason is that UPC can reduce the transmission power level to achieve a better energy efficiency as well as smaller interference range. Moreover, as UPC can deal with the noise, interference and spatial channel reuse, QDAR do not need to pay much attention to these problems. Therefore, using UPC as the underlying MAC protocol, QDAR can prolong the network lifetime more efficiently.

Conclusions
In this paper, we have designed a novel delay-aware routing (QDAR) mechanism based on the Q-learning technique to extend the network lifetime of underwater sensor networks. The QDAR mechanism can reduce average latency as well as extend the network lifetime. The data_ready phase and the packet structures are designed for data collection. Then, the sink node applies the QDAR algorithm to determine the routing path. In QDAR, we extend the action-utility function with both residual energy-related cost and delay-related cost. In order to have a better tradeoff between residual energy of nodes and delay, an adaptive detouring path strategy is designed. When the residual energy is sufficient, a path with shorter delay is chosen. When the residual energy of a node is lower than a threshold, the weight of the delay-related cost is decreased so as to construct an adequate path avoiding nodes with relatively less energy, even though these nodes may be nearer to the sink node. Thus, QDAR can distribute the residual energy more evenly, which is crucial to extend the network lifetime. Moreover, as QDAR takes both direct rewards and future rewards into account, it can choose a global optimal next hop, whereas greedy algorithms only pay attention to the direct reward. After a routing decision, the path is constructed during the interest phase. Then, packets are forwarded and the communication ends with an acknowledgement. The QDAR mechanism can work adaptively and distributively in the dynamic underwater environment. We evaluate the performance of QDAR with different parameters and compare it with QELAR and VBF. The simulation results show that QDAR reduces the total energy consumption effectively and decreases the average latency

Conclusions
In this paper, we have designed a novel delay-aware routing (QDAR) mechanism based on the Q-learning technique to extend the network lifetime of underwater sensor networks. The QDAR mechanism can reduce average latency as well as extend the network lifetime. The data_ready phase and the packet structures are designed for data collection. Then, the sink node applies the QDAR algorithm to determine the routing path. In QDAR, we extend the action-utility function with both residual energy-related cost and delay-related cost. In order to have a better tradeoff between residual energy of nodes and delay, an adaptive detouring path strategy is designed. When the residual energy is sufficient, a path with shorter delay is chosen. When the residual energy of a node is lower than a threshold, the weight of the delay-related cost is decreased so as to construct an adequate path avoiding nodes with relatively less energy, even though these nodes may be nearer to the sink node. Thus, QDAR can distribute the residual energy more evenly, which is crucial to extend the network lifetime. Moreover, as QDAR takes both direct rewards and future rewards into account, it can choose a global optimal next hop, whereas greedy algorithms only pay attention to the direct reward. After a routing decision, the path is constructed during the interest phase. Then, packets are forwarded and the communication ends with an acknowledgement. The QDAR mechanism can work adaptively and distributively in the dynamic underwater environment. We evaluate the performance of QDAR with different parameters and compare it with QELAR and VBF. The simulation results show that QDAR reduces the total energy consumption effectively and decreases the average latency significantly by 20-25% at the cost of only a little reduction in network lifetime. Therefore, QDAR is more adequate for time-critical applications.