Article

A Q-Learning-Based Link-Aware Routing Protocol for Underwater Wireless Sensor Networks

1 Ocean Acoustic Technology Laboratory, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Beijing Engineering Technology Research Center of Ocean Acoustic Equipment, Beijing 100190, China
4 State Key Laboratory of Acoustics, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(12), 2374; https://doi.org/10.3390/jmse13122374
Submission received: 10 November 2025 / Revised: 3 December 2025 / Accepted: 11 December 2025 / Published: 14 December 2025
(This article belongs to the Special Issue Underwater Acoustic Communication and Marine Robot Networks)

Abstract

In Underwater Wireless Sensor Networks (UWSNs) with mobile nodes, the mobility of the nodes leads to dynamic changes in the network topology. Thus, pre-established routing paths may become invalid and next-hop nodes may be unavailable due to link disruptions. This implies that routing decisions for mobile UWSNs that do not account for changes in the connectivity state of communication links cannot guarantee reliable packet delivery. In this study, a Q-learning-based link-aware routing (QLAR) protocol designed for mobile UWSNs is proposed. The proposed QLAR protocol introduces the Link Expiration Time (LET) into the reward function of the Q-learning algorithm as a critical decision metric, thereby guiding the agent to prioritize more stable communication links with longer expected lifetime. In addition, multiple decision metrics are dynamically predicted and updated by actively perceiving and acquiring information from neighbor nodes through periodic control packet interactions. To achieve a balance among these metrics, the Entropy Weight Method (EWM) is employed to adaptively adjust their weights in response to real-time network conditions. Comprehensive simulation results demonstrate that QLAR outperforms existing routing protocols in terms of various performance metrics under different scenarios.

1. Introduction

Underwater Wireless Sensor Networks (UWSNs), as a promising network communication technology, show great potential for applications in environmental monitoring, navigation assistance, and military reconnaissance [1,2,3]. The routing performance [4] directly affects the communication reliability and energy efficiency of the network. Due to the narrow bandwidth, large propagation delay and restricted energy of underwater acoustic communication, the routing design is more challenging than that of terrestrial wireless networks.
In mobile UWSNs, node mobility [5] changes link connectivity, so pre-established paths fail easily [6,7]. Existing location-based or depth-based routing protocols [8,9] perform well in static or low-mobility scenarios, but the reliability of their routing decisions degrades significantly in environments with sparse node distribution, large positioning errors, or high mobility. Although opportunistic routing [10,11] can improve the delivery rate, it incurs additional energy consumption and channel contention.
In recent years, reinforcement learning (RL) [12,13] and machine learning (ML) [14,15] have been introduced into routing protocol research. Q-learning [16] in particular has become an important method for routing research in UWSNs. Related work [17] has demonstrated that Q-learning can optimize the forwarding strategy of nodes through interactive learning. However, existing research [18,19] focuses on static or weakly mobile networks and does not consider link stability in mobile scenarios. In addition, the weights of the decision metrics are usually fixed or set manually, which makes it difficult to adapt to the time-varying network environment and limits the performance of the algorithm in complex scenarios. To bridge these gaps, this work introduces a link-aware routing protocol that integrates mobility prediction with dynamic, data-driven metric weighting for adaptive decision-making in mobile UWSNs.
The proposed Q-learning-based link-aware routing (QLAR) protocol adopts a distributed routing decision-making framework organized into three phases: neighbor discovery, link awareness, and data transmission. During the neighbor discovery and link awareness phases, nodes predict the mobility states of neighbor nodes using an Extended Kalman Filter (EKF) and adjust the Hello message broadcast interval according to the link expiration time (LET). This allows nodes to effectively obtain neighbor information and update their neighbor tables. In the data transmission phase, nodes construct a set of candidate actions based on the previously obtained neighbor information and employ the entropy weight method (EWM) to dynamically evaluate multiple decision metrics, thereby enabling adaptive routing decisions. The main contributions of this study are summarized below:
  • Link-aware routing: Node mobility is predicted, and the link expiration time is calculated to dynamically adjust the neighbor maintenance frequency, so as to realize the awareness of link changes.
  • Dynamic weight allocation: The link expiration time is taken as one of the decision-making metrics, and the entropy weight method is used to dynamically assign the weights of multiple decision metrics according to the network state.
  • A network of fully mobile nodes: Existing routing protocols are primarily designed for fixed-topology networks that assume a stable network topology. This paper investigates a network consisting of mobile underwater sensor nodes, where these nodes concurrently generate data traffic. The study provides a novel approach to adaptive routing in dynamic topology networks.
The rest of this paper is organized as follows. In Section 2, an overview of existing routing protocols for UWSNs is provided. Section 3 presents the network scenario and the network assumptions adopted in this paper, along with the Q-learning framework applied to UWSNs routing. Section 4 details the proposed QLAR algorithm. In Section 5, the workflow of the proposed routing protocol is described. In Section 6, the performance of QLAR is evaluated through simulations. Section 7 concludes the paper.

2. Related Work

This section reviews the development of routing protocols in UWSNs [20,21], with a focus on traditional routing protocols [22,23,24,25,26,27,28] and Q-learning-based routing protocols [18,29,30,31,32]. A comparison of the protocols is listed in Table 1.
Traditional routing protocols in UWSNs can be broadly divided into location-based and localization-free categories. The Vector-Based Forwarding (VBF) protocol [22] limits the number of nodes participating in forwarding by establishing a virtual pipe. It is highly scalable, but its performance depends on the radius of the virtual pipe, and routing holes may occur in low-density or mobile networks. HH-VBF [23] further improves robustness by dynamically building virtual pipes at each hop, but it still struggles to adapt to highly dynamic topology changes. Compared to location-based routing, depth-based routing does not need coordinate information. The Depth-Based Routing (DBR) protocol [24] uses depth-based greedy forwarding, but its broadcast forwarding leads to redundancy overhead. EEDBR [25] improves DBR by incorporating residual energy, thus alleviating the problem of energy exhaustion. WDFAD-DBR [26] introduces a two-hop depth difference when calculating the forwarding delay, which improves the delivery rate but ignores energy consumption and may cause an imbalance in local energy consumption. The more recent ALRP [28] reduces redundancy and improves energy efficiency by limiting the forwarding area, but its performance still depends on the node distribution and mobility model.
Q-learning has been introduced into UWSNs to overcome the limitations of traditional protocols by enabling nodes to learn forwarding strategies from interactions. QELAR [18] is the first to apply the Q-learning algorithm to underwater routing, but its reward function only considers energy efficiency, and its performance is restricted to the single optimization objective. The QDAR [30] protocol considers both energy and latency, but it uses centralized routing decisions, which limits flexibility under dynamic topologies. QLFR [29], ROEVA [32], RLOR [31], and DROR [31] combine Q-learning with opportunistic routing to address routing holes in sparse scenarios. These protocols introduce multiple metrics, such as distance, the number of neighbors, or forwarding probability, to improve the quality of decision-making, but they all use fixed weights, making it difficult to adapt to time-varying environments. Furthermore, they rely on previously observed neighbor information and lack proactive prediction of node movement and link state. In terms of application scenarios, RLOR, ROEVA, and DROR target static network environments, while QLFR evaluates performance in mobile scenarios.
Compared to these prior works, the proposed approach differs in three aspects. First, it integrates LET prediction via an EKF, enabling proactive adaptation to node mobility. Second, it adjusts multi-metric routing weights using EWM, improving adaptability to time-varying network conditions. Third, the combination of LET-based prediction, adaptive weights, and Q-learning-driven forwarding enables the protocol to jointly optimize reliability, energy efficiency, and delay in mobile UWSNs.

3. System Model

In this section, the preliminary knowledge relevant to this study is introduced, including the motivation scenario, network model and assumptions, as well as the Q-learning framework for routing in UWSNs.

3.1. Motivation Scenario

A mobile multi-hop underwater wireless sensor network scenario for underwater reconnaissance missions is investigated in this paper. The network consists of a sink node positioned on the water surface and multiple sensor nodes randomly distributed within a specific 3D underwater region. Each underwater sensor node is equipped with sensing devices and acoustic modems, enabling it to continuously monitor environmental data, record information, and transmit the collected data. The sink node, which is equipped with both acoustic and radio frequency modems, receives the data from the sensor nodes and subsequently forwards it to the land datacenter.
As illustrated in Figure 1, nodes exchange state information through Hello packets regardless of whether there is data to be transmitted. When node $N_8$ detects sensitive information, it enters the data transmission phase and assumes the role of the source node. It then constructs a routing path to the destination (e.g., $N_8 \rightarrow N_9 \rightarrow N_{10} \rightarrow N_1 \rightarrow \text{sink}$) using the neighbor information and transmits the data packet to the distant sink node.

3.2. Network Model and Assumptions

The whole network is modeled as a dynamic graph $G(V, E)$, where $V = \{N_i \mid i = 1, 2, \ldots, N\}$ denotes the set of sensor nodes and $E$ denotes the set of edges, representing the communication links between nodes. Each underwater node is equipped with an inertial measurement unit and a navigation system to obtain real-time geographic location and mobility information. For the sake of simplicity and without loss of generality, this study is conducted under the following assumptions:
  • Each node is assigned a unique ID for identification.
  • All sensor nodes have the same initial energy and maintain a uniform communication range. The energy consumption of sensor nodes is considered to include the forwarding and receiving of data packets, excluding node mobility.
  • The sink node remains stationary and has an unlimited energy supply.
  • All sensor nodes have the same buffer size, which is quantified as the total number of data packets that can be stored.
  • The communication links between nodes are symmetrical and reliable.

3.3. Q-Learning Framework for Routing in UWSNs

The purpose of routing is to determine a path from the source to the destination. In distributed routing, routing decisions are composed of multiple single-hop routing decisions, and each forwarding action will influence whether the data packet can be delivered to the destination successfully. Since nodes can only obtain local network information, it is necessary to select the optimal relay node from the available neighbor nodes as the next-hop.
The routing problem can be modeled as a finite-state Markov Decision Process (MDP) [33,34]. It is composed of the state space $S$, the action space $A$, the state transition probability $P(s' \mid s, a)$, and the reward function $R(s, a)$. $S$ represents the nodes participating in packet forwarding, $A$ is the set of available neighbor nodes, and $R(s, a)$ represents the direct reward obtained by the data packet when it selects action $a$ (the next-hop neighbor) in state $s$ (the current node). The goal of the MDP is to find the optimal strategy $\pi$ that maximizes the expected cumulative reward.
To measure the advantages and disadvantages of different strategies, the V-value and Q-value are introduced. $V^{\pi}(s)$ is the expected cumulative reward of following policy $\pi$ in state $s$, while $Q^{\pi}(s, a)$ describes the expected cumulative reward of taking action $a$ in state $s$. The relationship between them is
$$V(s) = \max_{a} Q(s, a). \tag{1}$$
In dynamic underwater acoustic networks, the accurate state transition probability $P(s' \mid s, a)$ cannot be obtained. A model-free Q-learning algorithm is used to iterate the Q-values without explicitly modeling the state transition process of the MDP. Its update rule is:
$$Q_{t+1}(s_t, a_t) \leftarrow (1 - \alpha)\, Q_t(s_t, a_t) + \alpha \left[ R(s_t, a_t) + \gamma \max_{a \in A_{t+1}} Q_t(s_{t+1}, a) \right], \tag{2}$$
where $\alpha \in (0, 1)$ is the learning rate, $\gamma \in (0, 1)$ is the discount factor, and $R(s_t, a_t)$ denotes the direct reward obtained after forwarding the packet. By interactively updating the Q-value, the optimal forwarding policy can be obtained through $\pi(s) = \arg\max_{a} Q(s, a)$.
In the Q-learning framework [35], the agent finds an optimal policy that maximizes the long-term cumulative reward by continuously interacting with the environment. In the routing problem [36,37], the routing decision also aims at a long-term reward, but with a different meaning: the "long-term reward" is the reward of the optimal end-to-end path, rather than the optimal reward of a point-to-point link. For example, a link may have high throughput or low latency, but if the path containing that link is unreliable, the end-to-end transmission performance will be severely affected. By employing the Q-learning mechanism, the routing strategy can comprehensively consider local link characteristics and global path performance, dynamically balancing immediate and overall benefits.
The relationship between routing in UWSNs and the Q-learning algorithm is shown in Figure 2. Each node maintains its own local Q-table [38] and updates the Q-table based on local information and interactions with its neighbors. When a packet arrives at a node, the node selects the next-hop according to its local Q-table. Since each node only needs to store and update one row of Q-values corresponding to its own state, there is no need to maintain the complete Q-table of the entire network. Therefore, even in a relatively dense network environment, the action space of each state is still relatively small.
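The per-node bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the class name `NodeQTable`, the default values of `alpha` and `gamma`, and the choice to use a neighbor's advertised V-value in place of the maximum over the next state's Q-values are assumptions made for the example.

```python
class NodeQTable:
    """One row of Q-values kept by a single node (hypothetical helper)."""

    def __init__(self, alpha=0.5, gamma=0.9):
        self.alpha = alpha  # learning rate (assumed value)
        self.gamma = gamma  # discount factor (assumed value)
        self.q = {}         # neighbor ID -> Q-value for forwarding to that neighbor

    def v_value(self):
        """V(s) = max_a Q(s, a); 0.0 when no neighbor is known (assumption)."""
        return max(self.q.values(), default=0.0)

    def update(self, neighbor_id, reward, neighbor_v):
        """Blend the old Q-value with the reward plus the discounted neighbor V-value."""
        old = self.q.get(neighbor_id, 0.0)
        target = reward + self.gamma * neighbor_v
        self.q[neighbor_id] = (1 - self.alpha) * old + self.alpha * target
        return self.q[neighbor_id]

    def next_hop(self):
        """Greedy policy: the neighbor with the largest Q-value, or None."""
        return max(self.q, key=self.q.get) if self.q else None
```

Because each node stores only its own row, the memory cost grows with the number of neighbors rather than with the size of the network.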

4. QLAR Algorithms

Q-learning algorithms encounter several challenges when applied to routing problems in UWSNs, such as poor robustness, slow convergence and inefficient exploration. To overcome these limitations, the QLAR algorithm is designed. This section provides a detailed introduction to QLAR, analyzing its three key components: decision metrics, reward function and Q-value initialization.

4.1. Decision Metrics of QLAR

Decision metrics significantly influence routing decisions. Based on the residual energy, the distance of neighbor nodes, and the link expiration time, three routing decision metrics are defined: energy, distance, and link. By taking these three metrics into account, the network can balance energy efficiency against transmission performance while minimizing packet loss caused by link outages.

4.1.1. Energy Metric

Residual energy indicates the capacity of a node to participate in network activities. Therefore, to optimize the energy allocation among nodes and extend the network lifetime, the QLAR algorithm adopts the residual energy of nodes as the energy metric. For the current node N i and its neighbor node N j , the energy metric is defined as:
$$F^{[1]}(N_i, N_j) = \frac{E_{\mathrm{res}}(N_j)}{E_{\mathrm{init}}(N_j)}, \quad N_j \in \mathrm{Neighbor}(N_i), \tag{3}$$
where $E_{\mathrm{res}}(N_j)$ and $E_{\mathrm{init}}(N_j)$ respectively represent the residual battery energy and the initial energy of node $N_j$, and $\mathrm{Neighbor}(N_i)$ is the set of neighbor nodes for node $N_i$.

4.1.2. Distance Metric

In UWSNs, packets are required to be transmitted from underwater source nodes to surface sink nodes through multi-hop transmission. Nodes nearer to the destination typically have fewer hops and shorter transmission delays. Hence, in QLAR, the distance metric is employed as one of the decision metrics for routing decisions and is defined as follows:
$$F^{[2]}(N_i, N_j) = \Delta D(N_i, N_j), \quad N_j \in \mathrm{Neighbor}(N_i), \tag{4}$$
where $\Delta D(N_i, N_j) = D(N_i, \mathrm{sink}) - D(N_j, \mathrm{sink})$, with $D(N_i, \mathrm{sink})$ and $D(N_j, \mathrm{sink})$ respectively denoting the physical distances from node $N_i$ and $N_j$ to the target node. As can be observed from the function, a larger $\Delta D(N_i, N_j)$ indicates that node $N_j$ is closer to the destination than the current node $N_i$ and thus offers a better transmission path.

4.1.3. Link Metric

Due to the mobility of nodes, the network topology undergoes dynamic and continuous changes over time, resulting in the frequent establishment and disconnection of communication links. To quantify the stability of links between nodes, the LET [39] is employed as the link metric, which is represented by
$$F^{[3]}(N_i, N_j) = T_{\mathrm{LET}}(N_i, N_j), \quad N_j \in \mathrm{Neighbor}(N_i), \tag{5}$$
where $T_{\mathrm{LET}}(N_i, N_j)$ denotes the remaining link duration between node $N_i$ and node $N_j$. When $T_{\mathrm{LET}}(N_i, N_j) \le 0$, it indicates that the link is broken.
Consider the link $\mathrm{Link}(N_i, N_j)$ between nodes $N_i$ and $N_j$ (where $N_i, N_j \in V$ and $i \ne j$). To simplify the calculation of LET, the relative motion between $N_i$ and $N_j$ is considered. Specifically, $N_i$ is treated as a fixed node when $N_j$ enters its communication range, while $N_j$ is assumed to maintain a constant velocity [38] throughout the link duration time. Thus, the LET of link $\mathrm{Link}(N_i, N_j)$ can be calculated by
$$T_{\mathrm{LET}}(N_i, N_j) = \frac{-(ab + cd + ef)}{a^2 + c^2 + e^2} + \frac{\sqrt{(ab + cd + ef)^2 - (a^2 + c^2 + e^2)(b^2 + d^2 + f^2 - R^2)}}{a^2 + c^2 + e^2}, \tag{6}$$
where the parameters in Equation (6) can be calculated by the following formulae:
$$\begin{aligned}
a &= v_i \cos\theta_i \cos\varphi_i - v_j \cos\theta_j \cos\varphi_j, & b &= x_i - x_j, \\
c &= v_i \cos\theta_i \sin\varphi_i - v_j \cos\theta_j \sin\varphi_j, & d &= y_i - y_j, \\
e &= v_i \sin\theta_i - v_j \sin\theta_j, & f &= z_i - z_j.
\end{aligned}$$
In the aforementioned formulae, $R$ is the communication range of sensor nodes. $P_i = (x_i, y_i, z_i)$ and $P_j = (x_j, y_j, z_j)$ denote the current coordinates of node $N_i$ and node $N_j$, respectively, while $S_i = (v_i, \theta_i, \varphi_i)$ and $S_j = (v_j, \theta_j, \varphi_j)$ represent the velocities of node $N_i$ and node $N_j$. All these values can be obtained from the prediction model in Section 5.3. It is worth mentioning that a higher LET is more favorable for data routing.
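To make the LET calculation concrete, the following Python sketch evaluates the closed-form expression for one pair of nodes. It is a minimal illustration under stated assumptions: the function name `link_expiration_time` is hypothetical, angles are taken in radians, and the handling of the corner cases (no relative motion, a negative discriminant, or a negative root) is an assumption, since the text only states that a non-positive LET means the link is broken.

```python
import math


def link_expiration_time(pos_i, vel_i, pos_j, vel_j, comm_range):
    """Return the LET for positions (x, y, z) and velocities (v, theta, phi)."""
    (xi, yi, zi), (vi, th_i, ph_i) = pos_i, vel_i
    (xj, yj, zj), (vj, th_j, ph_j) = pos_j, vel_j

    # Relative velocity components (a, c, e) and relative position (b, d, f).
    a = vi * math.cos(th_i) * math.cos(ph_i) - vj * math.cos(th_j) * math.cos(ph_j)
    c = vi * math.cos(th_i) * math.sin(ph_i) - vj * math.cos(th_j) * math.sin(ph_j)
    e = vi * math.sin(th_i) - vj * math.sin(th_j)
    b, d, f = xi - xj, yi - yj, zi - zj

    speed_sq = a * a + c * c + e * e
    dist_sq = b * b + d * d + f * f
    if speed_sq == 0.0:
        # Assumption: with no relative motion the link never expires if in range.
        return math.inf if dist_sq <= comm_range ** 2 else 0.0

    dot = a * b + c * d + e * f
    disc = dot * dot - speed_sq * (dist_sq - comm_range ** 2)
    if disc < 0.0:
        # Assumption: the nodes never come within range, so the link is broken.
        return 0.0
    # Assumption: a negative root is clamped to 0, i.e. the link is already broken.
    return max(0.0, (-dot + math.sqrt(disc)) / speed_sq)
```

For example, with $N_i$ fixed at the origin, $N_j$ at $(-50, 0, 0)$ moving along the positive x-axis at 1 m/s, and $R = 100$ m, the neighbor leaves the communication range at $x = 100$, so the function returns 150 s.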

4.2. Reward Function

In the Q-learning algorithm, the update of the Q-value is determined by the direct reward and the future cumulative reward. Therefore, the design of the reward function directly affects the decision-making of the algorithm. In QLAR, all the aforementioned decision metrics are jointly considered when calculating the reward. Traditional methods assign fixed weights to these metrics, but in dynamic underwater environments, factors such as node movement, link breaks, and uneven energy consumption can cause the importance of metrics to change over time. To enhance the adaptability of the algorithm, the entropy weight method [40,41,42] is introduced to dynamically adjust the weight of each decision metric.
The entropy weight method assigns weights based on the distribution of metrics. The greater the difference of a metric among its neighbor nodes, the more information it provides, and the higher its weight. Conversely, if a metric has little difference among its neighbors, its weight should be reduced.
For the current node $N_i$ and its neighbor node set $\mathrm{Neighbor}(N_i)$, first calculate the original value $F^{[k]}(N_i, N_j)$ of each metric according to Equations (3)–(5) (where $k = 1, 2, 3$ and $N_j \in \mathrm{Neighbor}(N_i)$). Due to the different dimensions of each metric, they need to be normalized using the following formula:
$$X^{[k]}(N_i, N_j) = (1 - \alpha^{[k]}) + \alpha^{[k]}\, \frac{F^{[k]}(N_i, N_j) - \min_{N_j} F^{[k]}(N_i, N_j)}{\max_{N_j} F^{[k]}(N_i, N_j) - \min_{N_j} F^{[k]}(N_i, N_j)}, \tag{7}$$
where $X^{[k]}(N_i, N_j)$ represents the normalized metric, with its value being constrained within the range $[0, 1]$, and $\alpha^{[k]}$ is the efficiency coefficient which satisfies $\sum_{k=1}^{3} \alpha^{[k]} = 1$.
To measure each metric, the distribution of the normalized metric in the set of neighbor nodes needs to be calculated as follows:
$$p^{[k]}(N_j) = \frac{X^{[k]}(N_i, N_j)}{\sum_{N_j \in \mathrm{Neighbor}(N_i)} X^{[k]}(N_i, N_j)}. \tag{8}$$
If the difference of a certain metric among its neighbors is significant, the distribution of this metric will deviate from the uniform distribution, indicating that its influence on routing selection is greater.
Based on the above distribution, the information entropy of the metric can be calculated:
$$\mathrm{Ent}^{[k]}(N_i) = -\frac{1}{\ln M} \sum_{j=1}^{M} \left( p^{[k]}(N_j) + \delta \right) \ln \left( p^{[k]}(N_j) + \delta \right), \tag{9}$$
where $M$ is the size of $\mathrm{Neighbor}(N_i)$, and $\delta$ is a small smoothing factor (set to $10^{-6}$), which is used to avoid division by zero or logarithmic issues. Information entropy reflects the importance of a metric in decision-making. Take the energy metric as an example. If $\mathrm{Ent}^{[k]}(N_i)$ is close to 1, the metric distribution is uniform and the energy consumption difference between neighbors is small, so the weight should be reduced; conversely, a lower entropy indicates that the energy consumption difference between neighbors is large, so the weight should be increased.
Finally, weights are assigned based on entropy values:
$$\omega^{[k]}(N_i) = \frac{1 - \mathrm{Ent}^{[k]}(N_i)}{\sum_{k=1}^{3} \left( 1 - \mathrm{Ent}^{[k]}(N_i) \right)}. \tag{10}$$
It can be seen that the weight has an inverse relationship with entropy, meaning “the greater the difference, the higher the weight”. For instance, in the initial phase of a network, the energy of nodes is generally sufficient and the difference is small, the entropy value of the energy metric is high, and its weight is naturally small. During periods of rapid topology changes or increased fluctuations in link quality, the difference of the link metric among different neighbors increases, the entropy value decreases, and the corresponding weight increases, so that the routing decision pays more attention to the link stability.
The joint metric of the decision metrics is used to obtain the direct reward for the routing from node N i to node N j :
$$R(N_i, N_j) = \sum_{k=1}^{3} \omega^{[k]} X^{[k]}(N_i, N_j). \tag{11}$$
Based on the above description, the reward function is defined as:
$$R(s_t, a_t) = \begin{cases} R_{\max}, & \text{if } s_{t+1} \text{ is the destination} \\ R_{\min}, & \text{if } s_{t+1} \text{ has no neighbors} \\ R(N_i, N_j), & \text{if } s_{t+1} \text{ is } N_j \end{cases} \tag{12}$$
where $R_{\max}$ is the maximum reward, which occurs when the next-hop is the destination. This guarantees that packets are always delivered toward the destination. Conversely, $R_{\min}$ corresponds to the minimum reward, assigned when the next state has no neighbors, thus preventing routing holes. In other cases, the reward for the packet transmission from $N_i$ to $N_j$ is calculated using Equation (11).
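The chain from raw metrics to the joint reward (normalization, distribution, entropy, weights, weighted sum) can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `ewm_rewards` and the equal efficiency coefficients are hypothetical, and three corner cases are handled by assumption because the text does not specify them. When all neighbors tie on a metric, the min-max ratio is taken as 0. When there is a single neighbor, the entropy is taken as 1. Because the smoothing factor can push the entropy marginally above 1, $1 - \mathrm{Ent}$ is clamped at 0, and equal weights are used if every metric is uniform.

```python
import math


def ewm_rewards(metrics, alphas=(1 / 3, 1 / 3, 1 / 3), delta=1e-6):
    """metrics[k][j] is the raw value of metric k for neighbor j.

    Returns (weights, rewards), where rewards[j] is the joint reward of neighbor j.
    """
    m = len(metrics[0])  # number of neighbors
    normalized, divergences = [], []
    for values, alpha in zip(metrics, alphas):
        lo, span = min(values), max(values) - min(values)
        # Normalization with efficiency coefficient alpha (ties -> ratio 0, assumption).
        x = [(1 - alpha) + alpha * ((v - lo) / span if span > 0 else 0.0) for v in values]
        total = sum(x)
        p = [xi / total for xi in x]  # distribution over neighbors
        if m > 1:
            ent = -sum((pj + delta) * math.log(pj + delta) for pj in p) / math.log(m)
        else:
            ent = 1.0  # assumption: one neighbor offers no contrast
        normalized.append(x)
        divergences.append(max(0.0, 1.0 - ent))  # clamp (assumption)

    total_div = sum(divergences)
    k = len(metrics)
    if total_div > 0:
        weights = [d / total_div for d in divergences]
    else:
        weights = [1.0 / k] * k  # assumption: fall back to equal weights
    rewards = [sum(w * x[j] for w, x in zip(weights, normalized)) for j in range(m)]
    return weights, rewards
```

With three neighbors whose energy and LET values are identical but whose distance gains differ, for instance `ewm_rewards([[0.9, 0.9, 0.9], [10, 50, 90], [5, 5, 5]])`, the distance metric receives essentially all of the weight, matching the rule that a larger difference yields a higher weight.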

4.3. Q-Value Initialization

Traditional Q-learning typically assigns identical initial Q-values to all nodes [18,31]. However, this uniform initialization fails to distinguish between neighbor nodes, particularly during the initial phase of the network. Consequently, it might result in ineffective exploration, excessive resource consumption, and ultimately, degraded routing performance.
To address this issue, QLAR proposes a topology-based Q-value initialization method [43], which uses prior knowledge about the network environment to guide the learning process. As shown in Figure 3, a hemispherical hierarchical structure is established, centering around the surface sink node, and integer multiples of the communication range define the layers. The Q-values are then initialized based on the relative positions of neighbors to the sink node. Specifically, the initial Q-value from node N i to its neighbor N j is defined as:
$$Q_{\mathrm{init}}(N_i, N_j) = \left\lceil \frac{D(N_j, \mathrm{sink})}{R} \right\rceil, \quad N_j \in \mathrm{Neighbor}(N_i), \tag{13}$$
where $Q_{\mathrm{init}}(N_i, N_j)$ denotes the initial Q-value of node $N_i$ to its neighbor $N_j$, $D(N_j, \mathrm{sink})$ is the distance from the neighbor $N_j$ to the sink node, $R$ is the communication distance, and $\lceil \cdot \rceil$ symbolizes the ceiling function.

5. Routing Protocol Design

The proposed routing protocol includes the following phases: neighbor discovery, link awareness, and data transmission. The overall framework of the protocol is shown in Figure 4. During the neighbor discovery phase, nodes exchange information by broadcasting Hello messages and dynamically adjust the Hello broadcast interval based on the value of LET. In the link awareness phase, each node utilizes the Extended Kalman filter-based Prediction Model (EKPM) to estimate the mobility of neighbors and refreshes the neighbor list accordingly. Based on the first two phases, the routing decision is determined by employing the QLAR algorithm.

5.1. Packet Structure

The network defines two types of packets: Hello packets and Data packets. Specifically, Hello packets are employed for neighbor discovery and routing information exchange. By broadcasting Hello packets, nodes exchange the metadata necessary for Q-learning-based routing. Moreover, during data routing, each node perceives the network state by listening to Hello packets and transmits Data packets based on routing information.
The structure of Hello packets is illustrated in Figure 5a. The packet identification comprises the packet ID and the source node ID, and the metadata contains information regarding the sender node, including residual energy, three-dimensional coordinates, velocity information, and V-value. Similarly, the structure of Data packets, which is depicted in Figure 5b, consists of a packet header and a data payload. The payload holds the data to be transmitted, whereas the packet header stores the packet identification and routing information (specifically, the next-hop node ID).

5.2. Neighbor Discovery Phase

The QLAR protocol employs a neighbor discovery mechanism to ensure the exchange of information among network nodes. This mechanism is implemented through the periodic broadcast of Hello packets, which contain only the sender’s information and do not include any payload. Upon receiving a Hello packet, the receiver updates its local neighbor table with the information embedded in the packet and does not forward or broadcast the packet. The neighbor table holds information about the neighbor nodes, such as location information, mobility model, residual energy level, Q-value for reinforcement learning, and the estimated value of LET. By maintaining and updating this information in the neighbor table, each node remains aware of the network environment and its neighbor nodes.
Routing performance is correlated with the neighbor discovery capabilities of the Hello emission strategy [44,45]. Shorter Hello intervals enable faster detection of new neighbors and link breaks in UWSNs. However, overly short intervals result in unnecessary protocol overhead. In contrast, a longer interval reduces both overhead and energy consumption but limits the effectiveness of neighbor discovery and link break detection. Therefore, an adaptive Hello interval strategy is essential for UWSNs. The interval for Hello packets is defined as follows [46]:
$$T_H(N_i) = \tau \min_{N_j \in \mathrm{Neighbor}(N_i)} T_{\mathrm{LET}}(N_i, N_j), \tag{14}$$
where $T_{\mathrm{LET}}(N_i, N_j)$ is the LET of link $\mathrm{Link}(N_i, N_j)$ given by Equation (6), $\tau$ denotes the perception frequency, and $\tau \in (0, 1)$. For high-mobility nodes, the broadcast interval of Hello packets is dynamically shortened to ensure neighbor discovery and link break detection. In contrast, the broadcast interval is lengthened appropriately for low-mobility nodes.
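A small Python sketch of this adaptive interval is given below. The function name `hello_interval`, the default `tau`, and the `max_interval` cap are assumptions introduced for illustration. The cap covers two cases that the definition leaves open: a node with no live neighbors, and neighbors whose LET is unbounded because there is no relative motion. Ignoring neighbors with non-positive LET (broken links) is likewise an assumption.

```python
def hello_interval(let_values, tau=0.5, max_interval=30.0):
    """Return tau * min(LET) over live neighbors, capped at max_interval seconds."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie in the open interval (0, 1)")
    # Assumption: links with non-positive LET are already broken and are skipped.
    live = [t for t in let_values if t > 0.0]
    if not live:
        return max_interval  # assumption: no neighbors -> slowest beacon rate
    return min(tau * min(live), max_interval)
```

For instance, `hello_interval([10.0, 4.0, 8.0], tau=0.5)` yields 2.0 s, so the shortest-lived link drives the beacon rate.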

5.3. Link Awareness Phase

The link expiration time indicates the remaining duration of an available link. In Equation (6), LET is evaluated based on the location and velocity information of neighbor nodes, which is transmitted through Hello messages. This method calculates the maximum communication time between two nodes based on their relative mobility, thereby providing a basis for routing decisions. However, in dynamic UWSNs with unpredictable node mobility, merely relying on the information acquired from Hello messages is insufficient to predict link states in a timely manner. The EKPM is introduced to dynamically predict the trajectories of neighbor nodes.
Figure 6 illustrates the overall architecture of the prediction model, which consists of two core components: the prediction step and the update step. By iteratively executing these two steps, the model continuously tracks and refines the mobility states of neighbor nodes. In the prediction step, each node predicts the current position of its neighbors based on their previous states (including position and velocity) and the underlying mobility model. The predicted results are then used to calculate the relative distance and relative velocity between nodes, thereby estimating the LET and providing proactive information for routing decisions. In the update step, when a node receives a new Hello message, the prediction model is corrected using the mobility parameters of neighbor nodes carried in the message. Through the periodic execution of the prediction–update cycle, the EKPM enables nodes to perceive the link states of their neighbors.
To balance computational cost against prediction accuracy for routing decisions, the proposed prediction model is implemented using an EKF. The detailed implementation is as follows. The state vector is defined as $\mathbf{x}_k = [x_k, y_k, z_k, v_k, \varphi_k, \theta_k]^T$, where $x_k$, $y_k$, and $z_k$ represent the three-dimensional position, and $v_k$, $\varphi_k$, and $\theta_k$ respectively represent the velocity, azimuth, and pitch angle. According to the Gauss–Markov mobility model [47], the updates to velocity and direction can be represented as
$$v_k = \alpha v_{k-1} + (1 - \alpha)\, \bar{v} + \sqrt{1 - \alpha^2}\, \omega_v$$
$$\varphi_k = \alpha \varphi_{k-1} + (1 - \alpha)\, \bar{\varphi} + \sqrt{1 - \alpha^2}\, \omega_\varphi$$
$$\theta_k = \alpha \theta_{k-1} + (1 - \alpha)\, \bar{\theta} + \sqrt{1 - \alpha^2}\, \omega_\theta$$
Here $\alpha$ is the memory parameter of the Gauss–Markov model, and $\bar{v}$, $\bar{\varphi}$, and $\bar{\theta}$ are the mean values of the velocity, azimuth, and pitch angle.
The position is updated as follows:
$$x_k = x_{k-1} + v_{k-1} \cos\theta_{k-1} \cos\varphi_{k-1}\, \Delta t$$
$$y_k = y_{k-1} + v_{k-1} \cos\theta_{k-1} \sin\varphi_{k-1}\, \Delta t$$
$$z_k = z_{k-1} + v_{k-1} \sin\theta_{k-1}\, \Delta t$$
Therefore, the state equation can be defined as: x k = f ( x k 1 ) + G w k 1 , where G k is the mapping matrix of process noise. The Jacobian matrix is:
F = f x = 1 0 0 cos θ cos φ Δ t v cos θ sin φ Δ t v sin θ cos φ Δ t 0 1 0 cos θ sin φ Δ t v cos θ cos φ Δ t v sin θ sin φ Δ t 0 0 1 sin θ Δ t 0     v cos θ Δ t 0 0 0 α 0     0 0 0 0 0 α     0 0 0 0 0 0     α
The process noise vector is defined as w k 1 = [ ω k 1 v , ω k 1 φ , ω k 1 θ ] T , representing random perturbations of velocity and direction, assumed to be independent white noise with variances σ v 2 , σ φ 2 , σ θ 2 , respectively.
Using direct position observation, z k = [ x k , y k , z k ] T + s k , the covariance of the measured noise is R = diag ( σ x 2 , σ y 2 , σ z 2 ) .
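As a minimal sketch of the mobility model above, the following Python function advances one node state by a single time step. The function name `gauss_markov_step`, the argument layout, and the use of zero-mean Gaussian noise with caller-supplied standard deviations are illustrative assumptions, not settings taken from the simulations.

```python
import math
import random

def gauss_markov_step(state, dt, alpha, mean, sigma=(0.0, 0.0, 0.0), rng=random):
    """Advance [x, y, z, v, phi, theta] by one Gauss-Markov step.

    mean  = (v_bar, phi_bar, theta_bar): long-run means (assumed inputs).
    sigma = noise standard deviations for v, phi, theta (assumed inputs).
    """
    x, y, z, v, phi, theta = state
    # The position update uses the previous velocity and angles.
    x_new = x + v * math.cos(theta) * math.cos(phi) * dt
    y_new = y + v * math.cos(theta) * math.sin(phi) * dt
    z_new = z + v * math.sin(theta) * dt
    # Velocity and angles: memory term + pull toward the mean + scaled noise.
    scale = math.sqrt(1.0 - alpha ** 2)
    v_new, phi_new, theta_new = (
        alpha * prev + (1.0 - alpha) * m + scale * rng.gauss(0.0, s)
        for prev, m, s in zip((v, phi, theta), mean, sigma)
    )
    return [x_new, y_new, z_new, v_new, phi_new, theta_new]
```

With `alpha` close to 1 the node keeps its previous heading and velocity, while `alpha` close to 0 makes each step nearly independent of the last.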
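The prediction process of the EKF:
$$\hat{\mathbf{x}}_{k|k-1} = F\,\hat{\mathbf{x}}_{k-1|k-1}$$
$$P_{k|k-1} = F P_{k-1|k-1} F^T + Q$$
The update process of the EKF:
$$S_k = H P_{k|k-1} H^T + R$$
$$K_k = P_{k|k-1} H^T S_k^{-1}$$
$$\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + K_k\left(\mathbf{z}_k - H\hat{\mathbf{x}}_{k|k-1}\right)$$
$$P_{k|k} = (I - K_k H) P_{k|k-1}$$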
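The prediction and update steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: all helper names (`matmul`, `transpose`, `inv3`, `transition`, `jacobian`, `ekf_predict`, `ekf_update`) are hypothetical, $Q$ and $R$ are supplied by the caller, the observation matrix selects the three position components, and the state mean is propagated through the noise-free transition $f(\cdot)$ (the customary EKF choice) with $F$ used for the covariance.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[val / det for val in row] for row in adj]

def transition(x, dt, alpha, mean):
    """Noise-free state transition f(x); mean = (v_bar, phi_bar, theta_bar)."""
    px, py, pz, v, phi, th = x
    return [px + v * math.cos(th) * math.cos(phi) * dt,
            py + v * math.cos(th) * math.sin(phi) * dt,
            pz + v * math.sin(th) * dt,
            alpha * v + (1 - alpha) * mean[0],
            alpha * phi + (1 - alpha) * mean[1],
            alpha * th + (1 - alpha) * mean[2]]

def jacobian(x, dt, alpha):
    """Jacobian F of the transition, evaluated at state x."""
    v, phi, th = x[3], x[4], x[5]
    cp, sp, ct, st = math.cos(phi), math.sin(phi), math.cos(th), math.sin(th)
    F = [[1.0 if r == c else 0.0 for c in range(6)] for r in range(6)]
    F[0][3:] = [ct * cp * dt, -v * ct * sp * dt, -v * st * cp * dt]
    F[1][3:] = [ct * sp * dt, v * ct * cp * dt, -v * st * sp * dt]
    F[2][3:] = [st * dt, 0.0, v * ct * dt]
    F[3][3] = F[4][4] = F[5][5] = alpha
    return F

def ekf_predict(x, P, Q, dt, alpha, mean):
    """Prediction step: propagate the mean and the covariance."""
    F = jacobian(x, dt, alpha)
    FPFt = matmul(matmul(F, P), transpose(F))
    P_pred = [[p + q for p, q in zip(pr, qr)] for pr, qr in zip(FPFt, Q)]
    return transition(x, dt, alpha, mean), P_pred

def ekf_update(x_pred, P_pred, z, R):
    """Update step with a direct position observation z = [x, y, z]."""
    H = [[1.0 if c == r else 0.0 for c in range(6)] for r in range(3)]
    Ht = transpose(H)
    HPHt = matmul(matmul(H, P_pred), Ht)
    S = [[s + r for s, r in zip(sr, rr)] for sr, rr in zip(HPHt, R)]
    K = matmul(matmul(P_pred, Ht), inv3(S))
    innov = [z[i] - x_pred[i] for i in range(3)]
    x_new = [x_pred[i] + sum(K[i][j] * innov[j] for j in range(3)) for i in range(6)]
    KH = matmul(K, H)
    I_KH = [[(1.0 if r == c else 0.0) - KH[r][c] for c in range(6)] for r in range(6)]
    return x_new, matmul(I_KH, P_pred)
```

In a node, `ekf_predict` would run at every link-awareness tick and `ekf_update` only when a Hello packet from the tracked neighbor arrives.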
The neighbor discovery phase cooperates with the link awareness phase to obtain an accurate action space for Q-learning-based routing decisions. The pseudocode for these two phases is presented in Algorithm 1, which details the procedures for broadcasting and receiving Hello packets at each node.
Algorithm 1 Neighbor discovery and link awareness
Input: Hello packet, graph $G = (V, E)$
Output: Neighbor tables
for $t = 1 : \Delta t : t_{max}$ do
    for $N_i \in V$ do
        // Broadcast Hello packet
        if Hello timer is expired then
            Broadcast Hello packet
            Calculate Hello interval $T_H(N_i)$ using Equation (14)
            Reset Hello timer using $T_H(N_i)$
        end if
        // Neighbor discovery
        if $N_i$ receives Hello packet from $N_j$ then
            Get sender $N_j$ from Hello packet
            if $N_j \in NT(N_i)$ then
                Update $N_j$ neighbor record
                Perform update step of EKPM
            else
                Add a new record for $N_j$ in $NT(N_i)$
                Calculate the initial Q-value using Equation (13)
                Initialize a prediction model for $N_j$
            end if
        end if
        // Link awareness
        for $N_j \in Neighbor(N_i)$ do
            Perform predict step of EKPM
            Estimate $T_{\mathrm{LET}}(N_i, N_j)$ using Equation (6)
            if $T_{\mathrm{LET}}(N_i, N_j) \le 0$ then
                Remove $N_j$ from $Neighbor(N_i)$
            else
                Update neighbor table
            end if
        end for
    end for
end for
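The link-awareness loop of Algorithm 1 can be illustrated with the short Python sketch below. Equation (6) is not reproduced in this section, so the sketch assumes one common formulation: both nodes keep a constant velocity over the prediction horizon, the communication range is a fixed sphere, and the LET is the largest $t \ge 0$ for which the predicted distance stays within range. The names `link_expiration_time` and `prune_neighbors` and the dictionary layout of the neighbor table are hypothetical.

```python
import math

def link_expiration_time(p_i, v_i, p_j, v_j, comm_range):
    """Time until two constant-velocity nodes drift out of range (assumed model)."""
    dp = [b - a for a, b in zip(p_i, p_j)]   # relative position
    dv = [b - a for a, b in zip(v_i, v_j)]   # relative velocity
    a = sum(c * c for c in dv)
    b = sum(p * v for p, v in zip(dp, dv))
    c = sum(p * p for p in dp) - comm_range ** 2
    if c > 0:
        return 0.0        # already out of range
    if a == 0:
        return math.inf   # no relative motion: the link does not expire
    # Larger root of a*t^2 + 2*b*t + c = 0.
    return (-b + math.sqrt(b * b - a * c)) / a

def prune_neighbors(table, own_pos, own_vel, comm_range):
    """Drop neighbors whose LET is not positive and record the LET of the rest."""
    kept = {}
    for nid, rec in table.items():
        let = link_expiration_time(own_pos, own_vel, rec["pos"], rec["vel"], comm_range)
        if let > 0:
            kept[nid] = {**rec, "let": let}
    return kept
```

In the protocol, the positions and velocities passed to this routine would be the EKPM predictions rather than the raw values from the last Hello packet.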

5.4. Data Transmission Phase

In the proposed routing protocol, routing decisions are made by the sender nodes. Upon receiving a Data packet, the node in the network executes the following processing flow: First, the node extracts the Next-hop ID field from the packet header for authentication. If the node ID does not match the Next-hop ID, it indicates that this node is not the expected data forwarding node. Therefore, this data packet will be directly discarded to conserve network resources. Only when the node ID matches the Next-hop ID does the node initiate the routing decision process.
The pseudocode for the routing decision is given in Algorithm 2, which is illustrated by the example of node N i receiving a Data packet. Here, N T ( N i ) denotes the neighbor table of node N i , which records the state information of neighbor nodes. In the routing decision-making process, the node first calculates the direct reward value according to Equation (12), and then adjusts the Q-value associated with each neighbor using the Q-value update rule of Equation (2). Following the Q-value update, the node selects the neighbor with the highest Q-value as the optimal next-hop forwarding node. However, a routing hole [48] occurs when the neighbor table is empty, indicating that no available neighbors have been discovered within the node’s communication range. In this case, the protocol activates the recovery mechanism: the current node temporarily stores the packet in its buffer and delays the forwarding process until the network topology is updated or new neighbors become available.
To further enhance routing reliability, QLAR introduces a timeout timer and an implicit ACK mechanism [49] to detect packet loss. Specifically, when a node transmits a data packet, it retains a copy of the packet in its sending buffer and simultaneously starts a configurable timeout timer. Due to the broadcast nature of the wireless channel, the node can obtain an implicit ACK by overhearing the next-hop node forwarding the data packet. When the downstream node forwards the packet, a specific field in its packet header implicitly indicates that the packet has been successfully received. If the corresponding implicit ACK is not detected before the timer expires, the node infers that the data packet may have been lost, and the retransmission mechanism is automatically triggered.
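The timeout and implicit ACK logic described above can be sketched as a small bookkeeping class. This is an illustrative sketch only: the class and method names, the use of packet and node identifiers as dictionary keys, and the policy of restarting the timer on each retransmission are assumptions rather than details given in the protocol description.

```python
from dataclasses import dataclass

@dataclass
class Pending:
    next_hop: str
    deadline: float
    retries: int = 0

class ImplicitAckBuffer:
    def __init__(self, timeout, max_retries):
        self.timeout = timeout
        self.max_retries = max_retries
        self.pending = {}

    def sent(self, packet_id, next_hop, now):
        """Remember the transmitted packet and start its timeout timer."""
        self.pending[packet_id] = Pending(next_hop, now + self.timeout)

    def overheard(self, packet_id, forwarder):
        """Overhearing the chosen next hop forward the packet acts as the ACK."""
        entry = self.pending.get(packet_id)
        if entry is not None and entry.next_hop == forwarder:
            del self.pending[packet_id]

    def expire(self, now):
        """Return (retransmit, dropped) packet IDs whose timers have run out."""
        retransmit, dropped = [], []
        for pid, entry in list(self.pending.items()):
            if now < entry.deadline:
                continue
            if entry.retries < self.max_retries:
                entry.retries += 1
                entry.deadline = now + self.timeout
                retransmit.append(pid)
            else:
                del self.pending[pid]
                dropped.append(pid)
        return retransmit, dropped
```

A forwarding from any node other than the selected next hop is ignored, so only the intended receiver can clear the pending entry.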
Algorithm 2 Data transmission
Input: The Data packet to be transmitted, current node $N_i$, $NT(N_i)$
Output: The next-hop for the Data packet to be transmitted
// Routing decision using Q-learning
while node $N_i$ receives a Data packet do
    Extract Next-hop ID, Sender ID and Previous-hop ID fields from packet header
    if node $N_i$ is not the next-hop then
        Discard the Data packet
    else
        for $N_j \in Neighbor(N_i)$ do
            Derive decision metrics using Equations (3)–(5)
            Derive decision weights using Equation (10)
            Calculate the reward using Equation (12)
            Update the Q-value using Equation (2)
        end for
        Select next-hop $N_j$ with the maximum Q-value
        if routing hole then
            Cache the Data packet
        else
            Forward the Data packet to next-hop $N_j$
        end if
    end if
end while
// Retransmission mechanism
while packet transmission fails do
    if the maximum retransmission limit has not been reached then
        Forward the Data packet
    else
        Discard the Data packet
    end if
end while
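The routing decision in Algorithm 2 can be illustrated with the following Python sketch. Equations (2), (10), and (12) lie outside this section, so the sketch substitutes the generic tabular Q-learning update $Q \leftarrow (1-\eta)Q + \eta\,(r + \gamma \max Q')$ and takes the rewards as given inputs. The function name `choose_next_hop`, the learning rate `lr`, the discount factor `gamma`, and the dictionary-based tables are all hypothetical.

```python
def choose_next_hop(q_table, rewards, neighbor_best_q, lr=0.5, gamma=0.9):
    """Update the Q-value of every neighbor, then pick the best one.

    q_table         : {neighbor_id: current Q-value}
    rewards         : {neighbor_id: reward for forwarding to that neighbor}
    neighbor_best_q : {neighbor_id: best Q-value advertised by that neighbor}
    Returns None when the neighbor table is empty (routing hole), in which
    case the caller is expected to buffer the packet.
    """
    if not q_table:
        return None
    for nid in q_table:
        target = rewards[nid] + gamma * neighbor_best_q.get(nid, 0.0)
        q_table[nid] = (1.0 - lr) * q_table[nid] + lr * target
    return max(q_table, key=q_table.get)
```

Returning `None` instead of raising keeps the routing-hole case on the normal control path, which matches the buffering behavior of the recovery mechanism.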

6. Numerical Results and Discussion

In this section, the performance of the proposed routing protocol is evaluated using the discrete event simulator NS-3 [50,51].
In the simulations, a single sink node is positioned at the center of the water surface, serving as the destination. The underwater sensor nodes are distributed independently and uniformly within a 3D space measuring 500 m × 500 m × 500 m, following the Poisson Point Process, with mobility modeled by the Gauss–Markov mobility model [52,53]. The fixed connection range channel model [54] in NS-3 is used in the simulation. Any temporal overlap of data packets at the receiver is considered a collision and results in packet loss. The maximum transmission distance of nodes is set to $R = 200$ m, and the propagation speed of acoustic signals in water is $v_0 = 1500$ m/s. The network data traffic follows a Poisson distribution, and all sensor nodes randomly become source nodes that generate Data packets. The energy model from [28] is adopted. The power consumption of the sensor nodes in the transmitting, receiving, and idle states is $P_T = 2$ W, $P_R = 0.75$ W, and $P_I = 0.008$ W, respectively. All simulations are conducted based on a randomly generated network topology, and the results are averaged over 30 repeated trials. The specific simulation parameters are shown in Table 2.
The proposed routing protocol is analyzed and compared with four existing routing protocols: HH-VBF, ALRP, QELAR, and QLFR. They represent two distinct paradigms in underwater routing protocols. HH-VBF and ALRP are location-based routing protocols designed without a learning mechanism. HH-VBF is a classic vector-based forwarding routing protocol, while ALRP is a newer adaptive location-based routing protocol. QELAR and QLFR use the Q-learning algorithm to achieve intelligent routing decisions. QELAR focuses on the energy distribution of the network. Both QLAR and QELAR use single-path routing, implicit ACK, and a retransmission mechanism. QLFR combines opportunistic routing and a packet forwarding hold mechanism to maintain good performance in dynamic environments, which is consistent with the mobile-node scenarios we focus on. The simulation parameter settings for the four comparison protocols are shown in Table 2.
In the following sections, packet delivery ratio (PDR), average end-to-end delay (E2ED), energy efficiency (total energy consumption and energy tax [26]), and collision performance (total collision and average collision [28]) are utilized to evaluate the performance of the QLAR routing protocol with respect to node density, node speed, and network load. For a fixed deployment area, node density is characterized by the total number of nodes in the network, while network load is quantified by the average packet generation interval.

6.1. PDR Performance

The PDR is defined as the ratio of the number of packets successfully received by the destination node to the total number of packets generated by the source nodes, which is used to measure the data transmission efficiency of a routing protocol.
The PDR performance is first compared under different numbers of nodes. The simulation environment is set as follows: the average node moving speed is 2 m/s, the average packet generation interval is 30 s, and the number of underwater sensor nodes ranges from 20 to 60. The PDR performance of the five protocols under different network densities is shown in Figure 7. For QLAR and QELAR, PDR increases with the number of nodes. This improvement can be attributed to the fact that a higher node density reduces the probability of routing holes, ensuring more reliable packet delivery. In contrast, HH-VBF, ALRP, and QLFR exhibit a two-stage change with increasing node density. Under low-density conditions, insufficient network connectivity makes it challenging to establish reliable communication links between nodes, thereby resulting in low PDR. As node density increases, network connectivity improves and the PDR increases accordingly. However, beyond a certain density threshold, the growing number of packets rapidly increases network traffic and causes more packet collisions, and the resulting packet loss decreases the PDR. The confidence intervals in Figure 7 indicate that under low-density network conditions, the results of the five protocols fluctuate significantly among different experiments. As the network density increases, this fluctuation decreases and the performance tends to be more stable.
In addition, it is observed from Figure 7 that QLAR has a lower average PDR than QLFR when the number of nodes is below 40. Meanwhile, the average performance of QLAR in dense networks is 3.04%∼8.25% better than that of QLFR. This can be attributed to the following reasons: In QLFR, all qualified candidate neighbor nodes forward the received packets unless the packets have been previously forwarded. In QLAR, the forwarding of data packets is restricted to the selected next-hop forwarding node. Therefore, QLFR has a greater number of qualified forwarding nodes in sparse networks, which contributes to a higher PDR than that of QLAR. However, as the number of nodes increases, the PDR of QLAR improves and gradually exceeds that of QLFR. QLAR achieves an average performance improvement of 0.4%∼19.56% and 4.55%∼24.83% respectively compared with ALRP and HH-VBF. In HH-VBF and ALRP, forwarding decisions are made by the receiving nodes. As a result, the forwarder node cannot guarantee successful packet reception at the next-hop. By contrast, QLAR employs the proposed Q-learning-based algorithm to optimize routing decisions while incorporating an implicit acknowledgment and retransmission mechanism to ensure reliable data transmission.
In the case of varying node speeds, the number of nodes is set to 40 and the average packet generation interval to 30 s. The node speed is varied between 1 m/s and 5 m/s to observe the PDR performance, as shown in Figure 8. The 95% confidence intervals for QLAR, QLFR, ALRP and HH-VBF are narrow. The lower limit of the QLAR confidence interval overlaps with the upper limit of the QLFR confidence interval. On average, the PDR of QLAR is 16.93%, 8.37%, and 3.88% higher than that of HH-VBF, ALRP and QLFR, respectively. The PDR of these four protocols remains stable under varying node speeds. For QLAR, this is because the entropy weight method dynamically adjusts the weights of the joint metric in the reward function. As a result, QLAR can optimize routing decisions based on the network topology. This adaptive capability enhances the robustness of QLAR in dynamic environments, thereby contributing to its superior PDR performance. The PDR performance of QELAR is greatly affected by the speed and decreases as the node speed increases.
Figure 9 shows the variation of the PDR performance of the five protocols with the network load (i.e., average packet generation interval) under the conditions that the number of nodes is fixed at 40 and the average moving speed is 2 m/s. As the packet generation interval increases, the PDR performance of all five protocols improves. This is due to the fact that as the network load decreases, the probability of channel contention and data collisions decreases accordingly. As a result, PDR increases due to fewer packet forwarding failures caused by collisions. The confidence intervals for all cases are narrow. When the average packet generation interval is 40 s or 50 s, the confidence intervals of QLAR and QLFR overlap, and their PDR performance is similar. Compared with ALRP, HH-VBF and QELAR, the PDR performance of QLAR increases by an average of 9.21%, 17.51% and 19.83%, respectively.
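The confidence intervals discussed above can be computed from per-trial results with a short routine such as the one below. The article does not state how its intervals were obtained, so this sketch assumes a two-sided Student's t interval; the default critical value 2.045 is the 97.5th percentile for 29 degrees of freedom, which matches 30 repeated trials. The function names are hypothetical.

```python
import math
import statistics

def confidence_interval(samples, t_crit=2.045):
    """Two-sided interval: mean +/- t_crit * s / sqrt(n).

    The default t_crit suits n = 30 samples at the 95% level;
    pass a different critical value for other sample sizes.
    """
    n = len(samples)
    mean = statistics.mean(samples)
    half_width = t_crit * statistics.stdev(samples) / math.sqrt(n)
    return mean - half_width, mean + half_width

def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Overlapping intervals, as reported for QLAR and QLFR at some operating points, mean that the difference between the two averages should be read with caution.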

6.2. Average E2ED Performance

The average E2ED refers to the mean time taken for a packet to travel from the source node to the destination node, including all delays incurred during transmission, such as propagation, transmission, queuing, and contention delays.
Figure 10 presents the average end-to-end delay performance of the five protocols. As observed in Figure 10, QLAR has a narrower confidence interval and a more stable performance. QLAR and QELAR have similar latency performance, and their average E2ED is better than the other three protocols. Compared to ALRP, QLAR reduces the average E2ED by 0.75∼2.06 s, compared to HH-VBF by 1.67∼2.78 s, and compared to QLFR by 0.10∼0.73 s, while remaining stable under varying node densities. However, the average E2ED performance of HH-VBF, QLFR, and ALRP exhibits a different trend due to the forwarding mechanism employed by these protocols. Specifically, all three protocols utilize a holding time mechanism, wherein each receiving node holds the packet for a predefined duration before forwarding it. Furthermore, these protocols typically involve multiple nodes in the forwarding process at each hop. As network density increases, the number of forwarding nodes per hop also rises, consequently extending the total number of hops along the routing path, which increases the average E2ED as network density grows.
Figure 11 illustrates the average end-to-end delay as a function of node speed. It can be observed that for all routing protocols, the average E2ED remains relatively stable as node speed increases. The narrow 95% confidence intervals indicate that this trend has a high degree of statistical reliability. This suggests that the movement speed of nodes has little impact on the delay performance of these protocols. As depicted in Figure 11, the QLAR routing protocol outperforms QLFR, ALRP, and HH-VBF in terms of average E2ED, achieving reductions of 0.55 s, 1.42 s, and 2.29 s, respectively. This is because QLAR considers both node velocity and position in the data forwarding process, which prevents suboptimal routing path selection.
Figure 12 presents the average end-to-end delay of the five protocols under different network loads. It is observed that the average E2ED decreases as the average packet generation interval increases. This is because a reduction in packet generation frequency leads to a decrease in the network load, which lowers the probability of packet collisions. When collisions occur, they force packets to bypass the optimal forwarding path to the sink node, which introduces additional transmission delays. The reduced possibility of collisions reduces contention and waiting time during transmission, thereby decreasing end-to-end delay. Compared to QLFR, ALRP, and HH-VBF, QLAR reduces the average E2ED by 0.39∼0.76, 1.17∼1.76, and 2.13∼2.45 s, respectively. Notably, QLAR shows the lowest delay even in a high-traffic network. This can be attributed to the Q-learning method and the hop-by-hop transmission mechanism, which enable QLAR to select reliable links for data forwarding and minimize duplicate transmissions. Moreover, the proposed protocol reduces the time required for route construction through adaptive neighbor discovery and link awareness. Therefore, QLAR achieves a lower average E2ED.

6.3. Energy Consumption

The energy consumption performance of the protocol is evaluated using total energy consumption and energy tax. The energy tax measures the average energy consumption per node required for the successful delivery of a packet to the destination node. It is defined as:
$$\mathrm{EnergyTax} = \frac{E_{total}}{N \cdot Packet_{receive}}$$
where $E_{total}$ represents the total energy consumption of the network, $N$ denotes the number of nodes in the network, and $Packet_{receive}$ is the total number of packets successfully received by the sink node. This metric evaluates the energy efficiency of routing protocols in energy-constrained networks by measuring the energy cost per successful transmission.
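As a small worked illustration of the energy tax definition, the following Python helper evaluates the formula directly. The function name and the input values in the usage note are hypothetical and are not taken from the simulation results.

```python
def energy_tax(total_energy_j, num_nodes, packets_received):
    """Energy spent per node per successfully delivered packet, in joules."""
    if num_nodes <= 0 or packets_received <= 0:
        raise ValueError("num_nodes and packets_received must be positive")
    return total_energy_j / (num_nodes * packets_received)
```

For example, a hypothetical run with 4000 J of total consumption, 40 nodes, and 500 delivered packets gives `energy_tax(4000.0, 40, 500)`, about 0.2 J. Doubling the number of delivered packets at the same total energy halves the energy tax, which is why the metric falls when denser networks deliver more packets.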
Figure 13 illustrates the energy efficiency of the five routing protocols for different node densities. In Figure 13a, as the number of nodes increases, the total energy consumption of the network increases, with QLAR and QELAR having similar energy consumption. It can be seen from Figure 13b that when the number of nodes is 20, the confidence intervals of each protocol are relatively large and overlap. The energy tax shows a decreasing trend as the number of nodes increases in QLAR, QLFR and QELAR. While a higher node density leads to an increase in the total energy consumption of the overall network, dense networks benefit from a reduced probability of void holes. Therefore, a greater number of packets are successfully delivered to the sink node compared to sparse networks. According to the definition of energy tax in Equation (28), the increased number of successfully received packets at higher node densities ultimately results in a decrease in energy tax. However, the energy tax of HH-VBF and ALRP exhibits distinct characteristics. Under low-density network conditions, the energy tax decreases as node density increases. With the increase in network density, the total energy consumption of the network has risen rapidly, becoming the main influencing factor of energy tax. Therefore, the energy tax of ALRP and HH-VBF increases as the node density increases further.
Furthermore, as illustrated in Figure 13b, although the total energy consumption of QLAR and QELAR is similar, the delivery rate of QELAR is lower than that of QLAR, and the energy tax of QELAR is higher than that of QLAR. The energy tax of QLAR is reduced by approximately 0.16∼0.24 J compared to QLFR, 0.04∼0.27 J compared to ALRP, and 0.18∼0.28 J compared to HH-VBF. These results demonstrate that QLAR achieves a lower energy tax than HH-VBF, ALRP, and QLFR. This reduction in energy consumption can be attributed to the following factors: The proposed routing protocol employs the QLAR algorithm for hop-by-hop routing selection, which helps reduce the energy consumption caused by duplicate transmissions at each node. In addition, the reward function of QLAR takes into account the energy metric of neighbor nodes, which guarantees energy distribution across the network.
Figure 14 illustrates the energy efficiency at different node speeds. As node speed increases, the energy consumption of QLAR, QELAR, ALRP and HH-VBF increases, while the energy consumption of QLFR decreases. It can be seen from Figure 14a,b that the energy tax follows the same trend as the total energy consumption. On average, the energy tax of QLAR is 0.26 J, 0.20 J, 0.05 J and 0.21 J lower than that of HH-VBF, ALRP, QELAR and QLFR, respectively.
Figure 15 shows the simulation results of the energy efficiency under different network loads. In Figure 15a, the total energy consumption in the network decreases as the average packet generation interval increases. In Figure 15b, energy tax increases as network load decreases. A longer packet generation interval reduces the total energy consumption and the number of packets received by the sink node. Since energy consumption declines more slowly than the number of received packets, the energy tax increases as the network load decreases. Furthermore, as shown in Figure 15b, QLAR reduces energy tax by 0.18 J, 0.23 J and 0.24 J on average, respectively, compared to ALRP, QLFR and HH-VBF.

6.4. Collision Performance

The total collision and average collision are used to evaluate network congestion. A higher collision level typically suggests that the routing protocol cannot effectively restrain redundant paths. The average collision is defined as the average number of collisions that occur at each node during data transmission. The calculation formula is as follows:
$$\mathrm{AverageCollision} = \frac{C_{total}}{N}$$
where $C_{total}$ is the total number of collisions in the network, and $N$ is the number of nodes in the network.
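The average collision metric, together with one way of expressing a percentage improvement between two protocols, can be sketched as follows. The article does not state how its percentage figures are derived; the `relative_reduction` helper assumes the reduction is measured relative to the baseline protocol. Both function names and the example numbers are hypothetical.

```python
def average_collision(total_collisions, num_nodes):
    """Average number of collisions per node."""
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    return total_collisions / num_nodes

def relative_reduction(baseline, value):
    """Percentage by which value is lower than baseline (assumed definition)."""
    return 100.0 * (baseline - value) / baseline
```

With hypothetical counts of 1200 collisions across 40 nodes, `average_collision(1200, 40)` gives 30 collisions per node, and against a baseline of 80 per node `relative_reduction(80.0, 30.0)` gives 62.5.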
Figure 16 illustrates the collision performance of the five routing protocols for different node densities. As node density increases, the number of nodes participating in packet forwarding also increases, and more nodes compete for communication resources. The higher data traffic leads to more frequent packet transmissions and a higher probability of collision, so the collision level grows with node density. It can be observed that compared to QLFR, ALRP and HH-VBF, QLAR exhibits a lower collision level. Specifically, the average collision of QLAR is 62.83%, 80.80% and 79.53% lower than that of QLFR, ALRP, and HH-VBF, respectively.
Figure 17 presents the impact of network load on the collision performance. As the network load decreases, the number of collisions for all five protocols decreases accordingly. In terms of confidence intervals, the intervals for QLAR, QLFR and QELAR are significantly smaller than those for ALRP and HH-VBF, indicating that they exhibit less performance fluctuation and higher stability in different experiments. The average collision of QLAR is reduced by 69.72%, 67.32%, and 52.52%, respectively, compared to HH-VBF, ALRP, and QLFR. It can be observed that QLAR maintains a low collision level in high data traffic networks. This can be attributed to the hop-by-hop routing mechanism, which can effectively reduce redundant routing paths.

6.5. Impact of Hello Broadcast Interval

To further verify the impact of the Hello broadcast policy on routing performance, simulation analysis is conducted on the performance of QLAR and QLFR under the same fixed Hello broadcast interval. As shown in Figure 18a, when the Hello broadcast interval is large, the packet delivery rate of QLAR decreases, significantly lower than that of QLFR under the same broadcast interval, indicating that QLAR is more sensitive to link-state information. When the Hello broadcast interval is small, QLAR performs better than QLFR. Furthermore, the performance of QLFR is less affected by the Hello broadcast interval because its forwarding process does not depend on time-related link-state information. However, the results in Figure 18b show that the end-to-end delay performance of QLAR is less affected by the Hello broadcast interval. Under the same broadcast interval, QLAR’s latency performance is better than that of QLFR.
A smaller Hello broadcast interval leads to higher energy consumption and a greater probability of collisions. The results in Figure 19 and Figure 20 further demonstrate that, due to the multi-path routing mechanism used by QLFR, its energy consumption and collision probability are greater than those of QLAR at the same Hello broadcast interval.

6.6. Impact of Parameters

6.6.1. Broadcast Interval Related Parameter τ

The following section investigates the impact of the τ parameter on the performance of QLAR. According to the definition in Equation (14), the value of τ determines the broadcast interval of Hello packets, which directly influences the topology awareness and link maintenance capability of the network. To evaluate the effect of the Hello broadcast interval on network performance, simulations are conducted with the value of τ set to 0.3, 0.5, and 0.7. As described in Equation (14), the broadcast frequency of Hello packets is related to node mobility. Therefore, in the experiments, the node density is fixed at 40, and the node average speed ranges from 1 m/s to 5 m/s.
Figure 21a shows the energy tax of the network under different τ values. As τ increases, the energy tax decreases, which exhibits a negative correlation. A larger τ leads to a longer Hello broadcast interval, which results in fewer Hello packets being transmitted by nodes and thereby reduces energy consumption in the network. Meanwhile, a lower Hello broadcast frequency also reduces the probability of data collisions during data transmission. Similar to the energy tax, the average collision shown in Figure 21b decreases as τ increases.
Figure 21c illustrates the relationship between τ and the average end-to-end delay. The results indicate that the average E2ED increases as the τ value increases. This can be attributed to the fact that larger τ values result in a delay in updating the network topology, which increases the packet forwarding delay as outdated neighbor information is employed in routing decisions.
The packet delivery ratio under different τ values is depicted in Figure 21d. It can be observed that the packet delivery ratio decreases as τ increases. When τ = 0.7, particularly in high-mobility scenarios where the node speed is 5 m/s, the packet delivery ratio is significantly lower than that observed at τ = 0.3 and τ = 0.5. This is because a longer Hello broadcast interval reduces the adaptability to topology changes, thereby affecting the packet delivery performance.

6.6.2. The Link Metric

In this section, we examine how the performance of QLAR is affected by the link metric. The weight of the link metric in the reward function is set to zero. In this configuration (denoted as QLAR-LET0), the next-hop selection only considers the energy metric and distance metric, while the link stability information provided by LET is completely ignored. All other simulation settings, including the network topology, traffic model, and parameter configurations, are kept identical to the baseline experiments. The performance of QLAR-LET0 is then compared with the original QLAR through multiple independent runs.
Figure 22 presents the performance comparison of the proposed protocol under two conditions: with LET weight enabled and with LET weight disabled (i.e., set to zero). The evaluation is conducted with respect to PDR and average E2ED under varying node densities and node speeds. The results show that incorporating LET weight leads to superior network performance, as it not only improves the packet delivery ratio but also reduces end-to-end delay to a certain extent. This outcome indicates that introducing the link metric enhances nodes’ ability to perceive link stability, enabling the routing process to select links with longer expected lifetime and higher reliability. Consequently, the protocol can effectively reduce route disruptions and the overhead associated with frequent path reconstructions.

7. Conclusions

In this paper, routing protocols in UWSNs are investigated, and a Q-learning-based link-aware routing protocol, QLAR, is proposed. The QLAR protocol considers residual energy, position, and link state within the Q-learning framework to ensure effective selection of next-hop forwarding nodes. To adapt to changes in network topology, QLAR adaptively adjusts the weights of decision metrics. This self-optimizing mechanism enables dynamic routing decisions between the source and destination. According to the simulation results, the proposed QLAR protocol performs better than existing protocols in terms of packet delivery ratio, delay, and energy consumption. The multipath routing mechanism has been shown to improve data transmission performance, particularly in low-density network environments. Future work will focus on the investigation of multipath-based routing protocols, as well as the development of load balancing and redundancy suppression strategies tailored to the specific challenges of UWSNs. In addition, future work will incorporate current dynamics into the node motion to achieve higher fidelity for actual underwater conditions.

Author Contributions

Conceptualization, X.L.; methodology, Y.W.; software, X.L.; validation, Y.W. and M.Z.; formal analysis, X.L.; investigation, X.L.; resources, X.L.; data curation, J.R.; writing—original draft preparation, X.L.; writing—review and editing, X.L. and J.R.; visualization, X.L. and J.R.; supervision, Y.W. and M.Z.; project administration, Y.W. and M.Z.; funding acquisition, Y.W. and M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China under Grant 2021YFC2800200, the National Natural Science Foundation of China under Grant 61971472, and the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDA22030101.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Domingo, M.C. An overview of the internet of underwater things. J. Netw. Comput. Appl. 2012, 35, 1879–1890.
  2. Kao, C.C.; Lin, Y.S.; Wu, G.D.; Huang, C.J. A comprehensive study on the internet of underwater things: Applications, challenges, and channel models. Sensors 2017, 17, 1477.
  3. Lin, J.; Yu, W.; Zhang, N.; Yang, X.; Zhang, H.; Zhao, W. A survey on internet of things: Architecture, enabling technologies, security and privacy, and applications. IEEE Internet Things J. 2017, 4, 1125–1142.
  4. Luo, J.; Chen, Y.; Wu, M.; Yang, Y. A survey of routing protocols for underwater wireless sensor networks. IEEE Commun. Surv. Tutor. 2021, 23, 137–160.
  5. Caruso, A.; Paparella, F.; Vieira, L.F.M.; Erol, M.; Gerla, M. The meandering current mobility model and its impact on underwater mobile sensor networks. In Proceedings of the IEEE INFOCOM 2008-The 27th Conference on Computer Communications, Phoenix, AZ, USA, 13–18 April 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 221–225.
  6. Coutinho, R.W.; Boukerche, A.; Vieira, L.F.; Loureiro, A.A. Performance modeling and analysis of void-handling methodologies in Underwater Wireless Sensor Networks. Comput. Netw. 2017, 126, 1–14.
  7. Sandeep, D.; Kumar, V. Review on clustering, coverage and connectivity in underwater wireless sensor networks: A communication techniques perspective. IEEE Access 2017, 5, 11176–11199.
  8. Souiki, S.; Feham, M.; Feham, M.; Labraoui, N. Geographic routing protocols for Underwater Wireless Sensor Networks: A survey. arXiv 2014, arXiv:1403.3779.
  9. Coutinho, R.W.; Boukerche, A.; Vieira, L.F.; Loureiro, A.A. Geographic and opportunistic routing for underwater sensor networks. IEEE Trans. Comput. 2015, 65, 548–561.
  10. Darehshoorzadeh, A.; Boukerche, A. Underwater sensor networks: A new challenge for opportunistic routing protocols. IEEE Commun. Mag. 2015, 53, 98–107.
  11. Coutinho, R.W.; Boukerche, A.; Vieira, L.F.; Loureiro, A.A. Design guidelines for opportunistic routing in underwater networks. IEEE Commun. Mag. 2016, 54, 40–48.
  12. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285.
  13. Rodoshi, R.T.; Song, Y.; Choi, W. Reinforcement learning-based routing protocol for underwater wireless sensor networks: A comparative survey. IEEE Access 2021, 9, 154578–154599.
  14. Patil, S.D.; Patil, P.S. A Hybrid PSO-GSA Approach for Cluster Head Selection and Fuzzy Logic Data Aggregation in DEEC-based WSNs. Int. J. Comput. Netw. Inf. Secur. 2025, 17, 48–70.
  15. Juwaied, A.; Jackowska-Strumillo, L.; Sierszeń, A. Enhancing Clustering Efficiency in Heterogeneous Wireless Sensor Network Protocols Using the K-Nearest Neighbours Algorithm. Sensors 2025, 25, 1029.
  16. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
  17. Wang, Z.; Du, J.; Hou, X.; Wang, J.; Jiang, C.; Zhang, X.P.; Ren, Y. Toward communication optimization for future underwater networking: A survey of reinforcement learning-based approaches. IEEE Commun. Surv. Tutor. 2024, 27, 2765–2793.
  18. Hu, T.; Fei, Y. QELAR: A machine-learning-based adaptive routing protocol for energy-efficient and lifetime-extended underwater sensor networks. IEEE Trans. Mob. Comput. 2010, 9, 796–809.
  19. He, J.; Tian, J.; Pu, Z.; Wang, W.; Huang, H. Cross-Layer Routing Protocol Based on Channel Quality for Underwater Acoustic Communication Networks. Appl. Sci. 2024, 14, 9778.
  20. Ahmed, M.; Salleh, M.; Channa, M.I. Routing protocols based on node mobility for Underwater Wireless Sensor Network (UWSN): A survey. J. Netw. Comput. Appl. 2017, 78, 242–252.
  21. Ismail, A.; Wang, X.; Hawbani, A.; Alsamhi, S.; Abdel Aziz, S. Routing protocols classification for underwater wireless sensor networks based on localization and mobility. Wirel. Netw. 2022, 28, 797–826.
  22. Xie, P.; Cui, J.H.; Lao, L. VBF: Vector-based forwarding protocol for underwater sensor networks. In NETWORKING’06: Proceedings of the 5th International IFIP-TC6 Conference on Networking Technologies, Services, and Protocols; Performance of Computer and Communication Networks; Mobile and Wireless Communications Systems, Coimbra, Portugal, 15–19 May 2006; Proceedings 5; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1216–1221.
  23. Nicolaou, N.; See, A.; Xie, P.; Cui, J.H.; Maggiorini, D. Improving the robustness of location-based routing for underwater sensor networks. In Proceedings of the Oceans 2007-Europe, Aberdeen, Scotland, 18–21 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–6.
  24. Yan, H.; Shi, Z.J.; Cui, J.H. DBR: Depth-based routing for underwater sensor networks. In Proceedings of the NETWORKING 2008 Ad Hoc and Sensor Networks, Wireless Networks, Next Generation Internet: 7th International IFIP-TC6 Networking Conference, Singapore, 5–9 May 2008; Proceedings 7; Springer: Berlin/Heidelberg, Germany, 2008; pp. 72–86.
  25. Wahid, A.; Kim, D. An energy efficient localization-free routing protocol for underwater wireless sensor networks. Int. J. Distrib. Sens. Netw. 2012, 8, 307246.
  26. Yu, H.; Yao, N.; Wang, T.; Li, G.; Gao, Z.; Tan, G. WDFAD-DBR: Weighting depth and forwarding area division DBR routing protocol for UASNs. Ad Hoc Netw. 2016, 37, 256–282.
  27. Noh, Y.; Lee, U.; Wang, P.; Choi, B.S.C.; Gerla, M. VAPR: Void-aware pressure routing for underwater sensor networks. IEEE Trans. Mob. Comput. 2012, 12, 895–908.
  28. Wang, Q.; Li, J.; Qi, Q.; Zhou, P.; Wu, D.O. An adaptive-location-based routing protocol for 3-D underwater acoustic sensor networks. IEEE Internet Things J. 2020, 8, 6853–6864.
  29. Zhou, Y.; Cao, T.; Xiang, W. Anypath routing protocol design via Q-learning for underwater sensor networks. IEEE Internet Things J. 2020, 8, 8173–8190.
  30. Jin, Z.; Ma, Y.; Su, Y.; Li, S.; Fu, X. A Q-learning-based delay-aware routing algorithm to extend the lifetime of underwater sensor networks. Sensors 2017, 17, 1660.
  31. Zhang, Y.; Zhang, Z.; Chen, L.; Wang, X. Reinforcement learning-based opportunistic routing protocol for underwater acoustic sensor networks. IEEE Trans. Veh. Technol. 2021, 70, 2756–2770.
  32. Zhu, R.; Jiang, Q.; Huang, X.; Li, D.; Yang, Q. A reinforcement-learning-based opportunistic routing protocol for energy-efficient and void-avoided UASNs. IEEE Sensors J. 2022, 22, 13589–13601.
  33. Mahajan, P.; Balamurugan, P.; Kumar, A.; Chalapathi, G.; Chamola, V.; Khabbaz, M. Multi-Objective MDP-based Routing In UAV Networks For Search-based Operations. IEEE Trans. Veh. Technol. 2024, 73, 13777–13789.
  34. Jung, W.S.; Yim, J.; Ko, Y.B. QGeo: Q-learning-based geographic ad hoc routing protocol for unmanned robotic networks. IEEE Commun. Lett. 2017, 21, 2258–2261.
  35. Dorri, A.; Kanhere, S.S.; Jurdak, R. Multi-agent systems: A survey. IEEE Access 2018, 6, 28573–28593.
  36. Wang, P.; Wang, T. Adaptive routing for sensor networks using reinforcement learning. In Proceedings of the Sixth IEEE International Conference on Computer and Information Technology (CIT’06), Seoul, Republic of Korea, 20–22 September 2006; IEEE: Piscataway, NJ, USA, 2006; p. 219.
  37. Mammeri, Z. Reinforcement learning based routing in networks: Review and classification of approaches. IEEE Access 2019, 7, 55916–55950.
  38. Cui, Y.; Zhang, Q.; Feng, Z.; Wei, Z.; Shi, C.; Yang, H. Topology-aware resilient routing protocol for FANETs: An adaptive Q-learning approach. IEEE Internet Things J. 2022, 9, 18632–18649.
  39. Su, W.; Lee, S.J.; Gerla, M. Mobility prediction and routing in ad hoc wireless networks. Int. J. Netw. Manag. 2001, 11, 3–30.
  40. Zhu, Y.; Tian, D.; Yan, F. Effectiveness of entropy weight method in decision-making. Math. Probl. Eng. 2020, 2020, 3564835.
  41. Ishizaka, A.; Nemery, P. Multi-Criteria Decision Analysis: Methods and Software; John Wiley & Sons: Hoboken, NJ, USA, 2013.
  42. Gray, R.M. Entropy and Information Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
  43. Bin, W.; Kerong, B.; Yixue, H.; Mingjiu, Z. SQMCR: Stackelberg Q-learning based Multi-hop Cooperative Routing Algorithm for Underwater Wireless Sensor Networks. IEEE Access 2024, 12, 56179–56195.
  44. Hernandez-Cons, N.; Kasahara, S.; Takahashi, Y. Dynamic hello/timeout timer adjustment in routing protocols for reducing overhead in MANETs. Comput. Commun. 2010, 33, 1864–1878.
  45. Alsaqour, R.; Abdelhaq, M.; Saeed, R.; Uddin, M.; Alsukour, O.; Al-Hubaishi, M.; Alahdal, T. Dynamic packet beaconing for GPSR mobile ad hoc position-based routing protocol using fuzzy logic. J. Netw. Comput. Appl. 2015, 47, 32–46.
  46. Arafat, M.Y.; Moh, S. A Q-learning-based topology-aware routing protocol for flying ad hoc networks. IEEE Internet Things J. 2021, 9, 1985–2000.
  47. Camp, T.; Boleng, J.; Davies, V. A survey of mobility models for ad hoc network research. Wirel. Commun. Mob. Comput. 2002, 2, 483–502.
  48. Mohemed, R.E.; Saleh, A.I.; Abdelrazzak, M.; Samra, A.S. Energy-efficient routing protocols for solving energy hole problem in wireless sensor networks. Comput. Netw. 2017, 114, 51–66.
  49. Khasawneh, A.; Latiff, M.S.B.A.; Kaiwartya, O.; Chizari, H. A reliable energy-efficient pressure-based routing protocol for underwater wireless sensor network. Wirel. Netw. 2018, 24, 2061–2075.
  50. Riley, G.F.; Henderson, T.R. The ns-3 network simulator. In Modeling and Tools for Network Simulation; Springer: Berlin/Heidelberg, Germany, 2010; pp. 15–34.
  51. Martin, R.; Rajasekaran, S.; Peng, Z. Aqua-Sim Next generation: An NS-3 based underwater sensor network simulator. In Proceedings of the 12th International Conference on Underwater Networks & Systems, Halifax, NS, Canada, 6–8 November 2017; pp. 1–8.
  52. Hu, S.; Wang, G.; Liu, Z.; Gao, X. Application of Graph Signal Sampling in Underwater Distributed Cooperative Detection. In Proceedings of the 2025 5th International Conference on Sensors and Information Technology, Nanjing, China, 21–23 March 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 785–789.
  53. Ventura, G.; Ardizzon, F.; Tomasin, S. Authentication by location tracking in underwater acoustic networks. arXiv 2024, arXiv:2410.03511.
  54. Morozs, N.; Gorma, W.; Henson, B.T.; Shen, L.; Mitchell, P.D.; Zakharov, Y.V. Channel modeling for underwater acoustic network simulation. IEEE Access 2020, 8, 136151–136175.
Figure 1. Motivation network scenario.
Figure 2. The framework of the proposed routing protocol.
Figure 3. Hemispherical hierarchical structure. The black, blue, and purple dashed lines represent the boundaries of the first, second, and third layers, respectively, matching the colors of the sensor nodes in each layer.
Figure 4. The flowchart of the proposed routing protocol.
Figure 5. Packet structure: (a) Hello packet for neighbor discovery and routing information exchange, and (b) Data packet for data transmission.
Figure 6. The flowchart of EKPM.
Figure 7. PDR performance versus the number of nodes, with error bars (95% CI) from 30 simulation runs. The average node speed and packet generation interval are fixed at 2 m/s and 30 s, respectively.
Figure 8. PDR performance versus average node speed, with error bars (95% CI) from 30 simulation runs. The number of nodes and average packet generation interval are fixed at 40 and 30 s, respectively.
Figure 9. PDR performance versus network load, with error bars (95% CI) from 30 simulation runs. The number of nodes and average node speed are fixed at 40 and 2 m/s, respectively.
Figure 10. Average end-to-end delay versus the number of nodes, with error bars (95% CI) from 30 simulation runs. The average node speed and packet generation interval are fixed at 2 m/s and 30 s, respectively.
Figure 11. Average end-to-end delay versus average node speed, with error bars (95% CI) from 30 simulation runs. The number of nodes and average packet generation interval are fixed at 40 and 30 s, respectively.
Figure 12. Average end-to-end delay versus network load, with error bars (95% CI) from 30 simulation runs. The number of nodes and average node speed are fixed at 40 and 2 m/s, respectively.
Figure 13. Comparison of energy efficiency with different numbers of nodes: (a) Total energy consumption and (b) Energy tax. (Average node speed: 2 m/s, average packet generation interval: 30 s; error bars represent the 95% confidence interval.)
Figure 14. Comparison of energy efficiency with different node speeds: (a) Total energy consumption and (b) Energy tax. (Number of nodes: 40, average packet generation interval: 30 s; error bars represent the 95% confidence interval.)
Figure 15. Comparison of energy efficiency with different network loads: (a) Total energy consumption and (b) Energy tax. (Number of nodes: 40, average node speed: 2 m/s; error bars represent the 95% confidence interval.)
Figure 16. Comparison of collision performance with different numbers of nodes: (a) Total collision and (b) Average collision. (Average node speed: 2 m/s, average packet generation interval: 30 s; error bars represent the 95% confidence interval.)
Figure 17. Comparison of collision performance with different network loads: (a) Total collision and (b) Average collision. (Number of nodes: 40, average node speed: 2 m/s; error bars represent the 95% confidence interval.)
Figure 18. Performance comparison under fixed Hello broadcast intervals: (a) PDR and (b) Average end-to-end delay. (Number of nodes: 40, average packet generation interval: 30 s; error bars represent the 95% confidence interval.)
Figure 19. Energy efficiency comparison under fixed Hello broadcast intervals: (a) Total energy consumption and (b) Energy tax. (Number of nodes: 40, average packet generation interval: 30 s; error bars represent the 95% confidence interval.)
Figure 20. Collision performance comparison under fixed Hello broadcast intervals: (a) Total collision and (b) Average collision. (Number of nodes: 40, average packet generation interval: 30 s; error bars represent the 95% confidence interval.)
Figure 21. Routing performance under different τ parameters versus average node speed, with the number of nodes fixed at 40: (a) Energy tax. (b) Average collision. (c) Average E2ED. (d) PDR.
Figure 22. Performance comparison with and without link metric weight: (a) PDR versus the number of nodes. (b) Average E2ED versus the number of nodes. (c) PDR versus average speed. (d) Average E2ED versus average speed.
Table 1. Comparison of different protocols.
Protocols | Features | Challenges
VBF [22] | Robust and scalable | Not for sparse networks
DBR [24] | Improve packet delivery ratio | More energy consumption and redundant forwarding
ALRP [28] | Adaptive forwarding area | Restricted to node distribution
QELAR [18] | Q-learning-based, single metric (energy efficiency) | Restricted to a single optimization objective
QDAR [30] | Q-learning-based, energy efficiency and latency | Fixed metric weights, centralized decision-making
QLFR [29], ROEVA [32], RLOR [31], DROR [31] | Q-learning combined with opportunistic routing to overcome routing holes, multiple decision metrics | Additional holding time, fixed metric weights
Table 2. Parameter setup in the simulations.
Parameters | Value
Simulator | NS-3
Deployment area | 500 m × 500 m × 500 m
Number of sensor nodes | [20, 30, 40, 50, 60]
Number of sinks | 1
Channel model | Binary range-based model in [54]
Communication range | 200 m
The speed of sensor nodes | [1, 2, 3, 4, 5] m/s
Average packet generation interval | [10, 20, 30, 40, 50] s
Packet size | 80 bytes
Transmission power | 2 W
Idle power | 0.008 W
Receiving power | 0.75 W
Mobility model | Gauss–Markov mobility model, tuning parameter 0.85, standard deviation 0.05, correlation time 30 s
Energy model | Energy model in [28]
Packet generation model | Poisson distribution
Acoustic speed | 1500 m/s
Antenna | Omni-directional
Learning rate | 0.9
Discount factor | 0.3
τ factor | 0.3
HH-VBF | Radius of virtual pipeline 160 m
ALRP | k = 7, T_delay = 0.5 s
QLFR | The difference of holding time 0.2 s
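To make the learning-related entries of Table 2 concrete, the following minimal Python sketch applies the standard tabular Q-learning update of Watkins and Dayan [16] with the listed learning rate (0.9) and discount factor (0.3), and derives the largest one-hop propagation delay implied by the 200 m communication range and 1500 m/s acoustic speed. The names `SimParams`, `q_update`, and `max_hop_propagation_delay` are hypothetical, and the reward passed in is an arbitrary placeholder; it is not the LET- and EWM-based reward function of QLAR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Selected values copied from Table 2 (field names are illustrative)."""
    learning_rate: float = 0.9          # "Learning rate" row
    discount_factor: float = 0.3        # "Discount factor" row
    comm_range_m: float = 200.0         # "Communication range" row
    acoustic_speed_mps: float = 1500.0  # "Acoustic speed" row


def q_update(q_old: float, reward: float, max_next_q: float,
             p: SimParams = SimParams()) -> float:
    """Generic tabular Q-learning update:
    Q <- (1 - alpha) * Q + alpha * (r + gamma * max_a' Q(s', a')).
    The reward is a placeholder, not the paper's reward function.
    """
    target = reward + p.discount_factor * max_next_q
    return (1.0 - p.learning_rate) * q_old + p.learning_rate * target


def max_hop_propagation_delay(p: SimParams = SimParams()) -> float:
    """Propagation delay in seconds over one hop at maximum range."""
    return p.comm_range_m / p.acoustic_speed_mps


if __name__ == "__main__":
    # Example: an untrained entry (Q = 0) with a placeholder reward of 1.0
    # and a best next-hop Q-value of 0.5.
    print(q_update(0.0, reward=1.0, max_next_q=0.5))
    print(max_hop_propagation_delay())
```

With a learning rate of 0.9, each update keeps only 10% of the previous Q-value, so estimates track recent observations closely. The discount factor of 0.3 weights the immediate reward much more heavily than the downstream value.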
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
