Reinforcement Learning-Based Data Forwarding in Underwater Wireless Sensor Networks with Passive Mobility

Data forwarding for underwater wireless sensor networks has drawn considerable attention in the past decade. Because the underwater environment is harsh for communication, a major challenge of Underwater Wireless Sensor Networks (UWSNs) is timeliness. Furthermore, underwater sensor nodes are energy constrained, so network lifetime is another obstacle. Additionally, the passive mobility of underwater sensors causes dynamic topology changes in underwater networks. It is therefore important to consider the timeliness and energy consumption of data forwarding in UWSNs, along with the passive mobility of sensor nodes. In this paper, we first formulate the problem of data forwarding by jointly considering timeliness and energy consumption under a passive mobility model for underwater wireless sensor networks. We then propose a reinforcement learning-based method for the problem. We finally evaluate the performance of the proposed method through simulations. Simulation results demonstrate the validity of the proposed method. Our method outperforms the benchmark protocols in both timeliness and energy efficiency. More specifically, our method gains 83.35% more value of information and saves up to 75.21% energy compared with a classic lifetime-extended routing protocol (QELAR).


Introduction
Nowadays, marine surveillance, water contamination detection and monitoring, and oceanographic data collection are indispensable to the exploration, protection and exploitation of the aquatic environment [1]. Because of the huge amount of unexploited resources in the ocean, there is an urgent need for research in the field of sensors and sensor networks [2]. Underwater Wireless Sensor Networks (UWSNs) have become a main approach to gaining information from previously inaccessible waters. Traditional wireless sensor networks (WSNs) consist of a large number of sensor nodes randomly distributed in a detection field, and these nodes are usually either stationary or moving within limited ranges. However, in many practical scenarios, such as UWSNs, delay-tolerant networks and vehicular networks, the movement of nodes is relatively large. Nodes in UWSNs can be categorized as stationary nodes and moving nodes. Stationary nodes are anchored to the water bottom, while moving nodes, such as Autonomous Underwater Vehicles (AUVs), can move at a preset velocity. Nevertheless, only a few researchers take the passive mobility of nodes into account, i.e., the fact that nodes may move along internal currents or vortices. Underwater nodes have no access to GPS signals, and the network topology is completely time varying due to the irregular motion of water currents, which is essentially different from terrestrial WSNs. Meanwhile, due to dynamic topology changes and poor underwater communication conditions, data packets cannot be delivered rapidly to the sink nodes deployed on the water surface.
A major challenge of UWSNs is meeting real-time requirements, for instance in fishery surveillance and in real-time monitoring of precious assets such as petroleum pipelines. Specifically, delayed reports of sea properties such as temperature may lead to serious losses of temperature-sensitive sea animals, e.g., sea cucumbers, because they dissolve quickly at high temperatures. Moreover, detecting leakages of coal oil at an early stage prevents water contamination and further resource waste. Therefore, we adopt the concept of the value of information (VoI), which evaluates information in terms of timeliness [3]. Additionally, UWSNs are energy constrained because sensor nodes cannot be recharged or replaced, so their ability to route data diminishes when sensor nodes run out of energy. Network lifetime remains the performance bottleneck and is perhaps one main obstacle to the wide-scale deployment of wireless sensor networks [4,5]. Energy consumption is therefore also a fundamental issue in UWSNs.
In summary, it is important to consider the timeliness and energy consumption of data forwarding in UWSNs, along with the passive mobility of sensor nodes. Motivated by the timeliness demand and the energy constraint of UWSNs, we aim to explore data forwarding in UWSNs with passive mobility, jointly considering the timeliness of packets and the energy consumption of the sensor nodes. Due to the irregular dynamics of water, node movement is unpredictable, i.e., the future status of a node has little relevance to its historical trajectory. Consequently, the choice of the relay node for a sensor node depends on its current status and its neighborhood relationship. A reinforcement learning method is therefore proposed in this paper. To the best of our knowledge, we are the first to jointly consider timeliness and energy consumption of data forwarding in UWSNs with passive mobility.
The main contributions of this paper are as follows. We first formulate the problem of data forwarding by jointly considering timeliness and energy consumption under a novel passive mobility model for UWSNs. We then propose a reinforcement learning-based method for the problem. We finally evaluate the performance of the proposed method through simulations. Experimental results demonstrate the validity of the proposed method and its efficiency compared with two benchmark methods.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the preliminaries, including the system model, notations and problem definitions, and the proposed method. Section 4 shows the simulation results. Section 5 presents the discussion of the simulation results and the outlook for future work.

Related Work
Data forwarding for underwater wireless sensor networks has drawn a lot of attention in the past decade. There are several kinds of routing protocols that aim to improve energy efficiency, timeliness and adaptability to node mobility of UWSNs. In this section, we review the related work on this topic.
Lloret et al. have pointed out the urgent need for and significance of UWSNs [1,2]. To satisfy the timeliness demand of UWSNs, much research has been dedicated to decreasing the latency of data forwarding. Basagni et al. [6] devised a forwarding method named the Multi-modAl Reinforcement Learning-based RoutINg (MARLIN) protocol. The MARLIN strategy selects the best relay node along with the best communication channel, and it can be configured to seek reliable routes to the final destination or to provide faster packet delivery. Gjanci et al. [3] proposed a Greedy and Adaptive AUV Path-finding (GAAP) heuristic, which aims to find the path of the AUV such that the value of information of the data delivered to sink nodes is maximized. Their results showed that the GAAP strategy delivers much more value of information than the Random Selection (RS), Lawn Mower (LM) and Traveling Salesman Problem (TSP) strategies do. Nevertheless, the advantage of the GAAP strategy over the TSP strategy decreases with the network size, which enables the TSP strategy to collect more packets, and the average end-to-end delay of the GAAP strategy is higher than that of the TSP strategy.
Meanwhile, many energy-efficient forwarding methods are devised to prolong the network lifetime. Hu et al. [7] proposed a Q-Learning-based Energy-Efficient and Lifetime-Aware Routing (QELAR) Protocol for Underwater Sensor Networks. QELAR adopted Q-Learning algorithm which defines the residual energy of sensor nodes as the reward function. Therefore, in QELAR protocol, sensor nodes select the node with the most residual energy as the relay node, thus the network lifetime can be prolonged. However, QELAR did not constrain the end-to-end delay, which resulted in longer delay when the number of sensor nodes was increasing. Coutinho et al. [8] devised an Energy Balancing Routing (EnOR) Protocol for Underwater Sensor Networks. The EnOR protocol adopted the idea of balancing the energy consumption among neighboring nodes in the forward set by rotating the priority of them so as to extend the network lifetime. However, a large candidate set results in high delay because the link quality of the high priority nodes is usually low given the long distance between the sender and the high priority nodes. In addition, Jin et al. [9] proposed a Q-Learning-based Delay-Aware Routing (QDAR) Algorithm to Extend the Lifetime of Underwater Sensor Networks. It took both timeliness and energy efficiency into account by defining delay-related cost and energy-related cost.
Moreover, several studies dealt with topology changes due to the mobility of sensor nodes. For instance, Liu et al. [10] proposed an Opportunistic Forwarding Algorithm based on Irregular Mobility (OFAIM). OFAIM aims to maximize the network delivery ratio of UWSNs in a 3-D mobility model with irregular movement. However, there are only sensor nodes and no sink nodes in the scenario of OFAIM, and there is no description of how the data will be retrieved from underwater sensors.
Additionally, there are some approaches that reduce energy consumption in consideration of node mobility. Forster et al. [11] proposed a Role-Free Clustering with Q-Learning (CLIQUE) for WSNs, which determines the selection of cluster heads without control overhead. The number of hops to reach mobile sink nodes and the residual energy of sensor nodes are jointly adopted as the reward function, thus enhancing the energy efficiency. However, CLIQUE assumed that sensor nodes uniformly disseminate data without consideration of the limited storage of sensor nodes. Webster et al. [12] invented a clustering protocol for UWSNs based on the mobility model proposed by Caruso et al. [13], which aims to minimize the overall energy consumption.
We distinguish our work from the above-mentioned ones as follows. Existing studies dealt with either energy consumption or timeliness of data forwarding in stationary topology, or simply considered energy consumption in dynamic topologies. None of these studies jointly considered all of them. Therefore, we propose a data forwarding method in joint consideration of timeliness and energy efficiency in UWSNs with passive mobility.

System Model
The UWSN is represented by an undirected graph G(t) = (V, E(t)) at time slot t, where V is the set of sensor nodes and E(t) is the set of links between pairs of nodes within the communication range of each other at time slot t. As depicted in Figure 1, N sensor nodes are tethered to the water bottom via wires, and move passively due to internal currents or vortices.
The moving region of sensor v_i is a semi-sphere with a radius of R_i, while the communication range of v_i is denoted by CR.
Meanwhile, we have M sink nodes deployed on the water surface, and the set of sink nodes is denoted by S. Additionally, S(i, t) denotes the set of sink nodes that are within the communication range of v_i. Sink node s_m ∈ S is mounted on an autonomous draft so that s_m can hold its position. In addition, sink nodes are equipped with acoustic modems for sensors and RF modems for satellites, along with access to GPS localization. Data packets are periodically generated; P_{i,t} denotes the set of packets in v_i at time slot t, while p_{i,t} denotes the p-th packet in v_i at time slot t. Sensor nodes learn to forward packets to sink nodes in terms of Value of Information and the energy consumption of sensor nodes. Packets are supposed to be received by sink nodes via multi-hop relays. In order to leverage the broadcast property of the wireless channel, each packet is acknowledged implicitly. Specifically, after transmitting a packet, the sender starts listening to the channel. If it overhears the packet being retransmitted within a certain period of time, the packet is regarded as successfully transmitted; otherwise, the packet is considered to be lost and the sensor node will learn to retransmit it, which will be described in detail in Section 4.

Underwater Movement Model
The movement model is shown in Figure 2. We assume that the moving speed of v_i, denoted as SP_i(t), obeys the normal distribution N(µ_1, σ_1^2) and that its actual value range is (0, 2µ_1). (dθ_i(t), dφ_i(t)) denotes the movement direction of v_i at time slot t, where dθ_i(t) and dφ_i(t) obey the uniform distributions U(0, π) and U(0, 2π), respectively. The next location of v_i from its current location C_i(t) = (x_i(t), y_i(t), z_i(t)) will be: where R_i denotes the length of the tethering wire of v_i, the node is held still by its tethered wire and C_i(t + 1) can be written as
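As an illustration of the movement model described above, the following minimal Python sketch draws one passive movement step. It is not the paper's Equation (1): the spherical-to-Cartesian displacement, the redrawing of out-of-range speeds, the clipping at the anchor depth, and the projection back onto the tether sphere are all assumptions made for this sketch, and the names `next_location`, `anchor` and `tether_len` are hypothetical.

```python
import math
import random

def next_location(current, anchor, tether_len, mu, sigma, rng=random):
    """Return the next (x, y, z) position of a tethered node after one time slot.

    Requires mu > 0, otherwise the speed range (0, 2*mu) is empty.
    """
    # Speed ~ N(mu, sigma^2); redraw until it falls inside the range (0, 2*mu).
    speed = rng.gauss(mu, sigma)
    while not (0.0 < speed < 2.0 * mu):
        speed = rng.gauss(mu, sigma)
    # Direction: polar angle ~ U(0, pi), azimuth ~ U(0, 2*pi).
    theta = rng.uniform(0.0, math.pi)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    # Assumed spherical-to-Cartesian displacement for one time slot.
    step = (speed * math.sin(theta) * math.cos(phi),
            speed * math.sin(theta) * math.sin(phi),
            speed * math.cos(theta))
    candidate = [c + s for c, s in zip(current, step)]
    # Assumption: the node cannot go below its anchor (semi-spherical region).
    candidate[2] = max(candidate[2], anchor[2])
    offset = [c - a for c, a in zip(candidate, anchor)]
    dist = math.sqrt(sum(o * o for o in offset))
    if dist <= tether_len:
        return tuple(candidate)
    # Assumption: a taut tether holds the node at distance tether_len from the anchor.
    scale = tether_len / dist
    return tuple(a + o * scale for a, o in zip(anchor, offset))
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run repeatable.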

Value of Information
Immediate detection of regions of interest at an early stage provides sufficient time to take corresponding actions. Hence, we adopt the concept of value of information, which evaluates information in terms of timeliness: the later a packet is forwarded to the sink, the lower its value. The VoI of a packet can be expressed as Equation (2), where p_{i,t} represents the p-th packet in v_i at time slot t, t_l indicates the living time of packet p_{i,t} since it was generated, α is the decay factor, k is the discount coefficient and TTL is the maximum life of the packet, i.e., time to live. VoI(p_{i,t}) is a key factor in the decision of a sensor node as to which packet should be relayed. If the living duration of a packet reaches its TTL, it is discarded immediately.
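To make the role of VoI in packet selection concrete, here is a minimal Python sketch. The exponential decay form k · exp(−α · t_l) is an assumption chosen for illustration and is not taken from Equation (2); the default k = 1.0 is likewise assumed, while α = 0.5 and TTL = 10 time slots follow the simulation parameters reported later. The function names are hypothetical.

```python
import math

def value_of_information(age, alpha=0.5, k=1.0, ttl=10):
    """Assumed VoI curve: exponential decay with packet age, zero once TTL is reached."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= ttl:
        return 0.0  # expired packets carry no value and would be discarded
    return k * math.exp(-alpha * age)

def pick_packet(ages, alpha=0.5, k=1.0, ttl=10):
    """Return the index of the live cached packet with the highest VoI, or None."""
    live = [i for i, age in enumerate(ages) if age < ttl]
    if not live:
        return None
    return max(live, key=lambda i: value_of_information(ages[i], alpha, k, ttl))
```

Under any monotonically decreasing VoI curve, the packet selected this way is the most recently generated live packet in the cache.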

Energy Consumption
Each sensor node has a limited battery capacity and adjustable transmission power. The energy consumption of a sensor mainly includes the energy consumed by its sensor module, its processor module and its communication module, among which the communication module consumes the most energy. Hence, the energy consumption of a sensor node can be approximated by its communication energy consumption while ignoring its other energy consumption. According to the typical free-space spherical wave model of energy consumption, the energy consumption of a sensor node is given by Equation (3), where pl is the data volume that a sensor node receives or transmits, in bits; e_s is the circuit energy consumption of emitting or receiving one bit of data, in J/bit; e_r is the minimum signal energy per bit that can be received successfully by sensor nodes or sink nodes, in J/(bit · m^2); and d is the communication distance, in meters.
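The energy model can be sketched in Python as follows. The form pl · (e_s + e_r · d²) for transmission and pl · e_s for reception is an assumption based on the common free-space model and on the units stated above; it is not copied from Equation (3). The default coefficients are the simulation values reported later, and the function names are hypothetical.

```python
def tx_energy(bits, distance_m, e_s=5e-8, e_r=1e-8):
    """Assumed transmit cost in joules: circuit energy plus distance-squared amplification."""
    return bits * (e_s + e_r * distance_m ** 2)

def rx_energy(bits, e_s=5e-8):
    """Assumed receive cost in joules: circuit energy only."""
    return bits * e_s
```

With these assumptions, sending one 1000-bit packet over the full 300 m communication range costs about 0.9 J, so the amplification term dominates the circuit term by several orders of magnitude at long range.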

Forwarding Orientation
In order to prolong the longevity of UWSNs, it is significant to adopt an energy-efficient forwarding method. Inspired by the murmuration of a swarm of swallows, Pearce et al. [14] proposed a biotic model, the Hybrid Projection Model, which defines the murmuration via two metrics: the opacity and the orientation.
As can be seen in Figure 3, the orientation is mathematically defined as the average of the vectors from a node to its neighbors, which can be calculated by Equation (4), where e_ori(i, t) ∈ R^3 denotes the orientation vector, |H(i, t)| is the number of neighbors of v_i within its communication range and v_j(i, t) ∈ R^3 denotes the vector from v_i to its j-th neighbor v_j at time slot t. The orientation can be acquired locally via the Received Signal Strength (RSS) and Angle of Arrival (AoA) of the broadcast packets from the neighborhood. The length of e_ori(i, t) denotes the absolute value of the orientation, and the orientation direction is given by the direction of e_ori(i, t). Nodes with large orientation values are generally located on the edge of a neighborhood, whereas nodes with lower orientation values are near the centers of their neighborhoods and are more likely to be the relay node. It has been proved that determining the forwarding direction via the orientation metric is energy-efficient [12]. Moreover, there is no requirement for localization when using the orientation metric, which is very suitable for underwater sensors because they cannot access GPS signals. Therefore, we adopt the orientation metric to determine the data forwarding direction.
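The verbal definition above (the mean of the vectors pointing from a node to each of its neighbors) can be sketched in Python as follows. The function names and the convention of returning a zero vector for a node without neighbors are assumptions made for this illustration.

```python
import math

def orientation(neighbor_vectors):
    """Mean of the 3-D vectors pointing from a node to each of its neighbors."""
    n = len(neighbor_vectors)
    if n == 0:
        return (0.0, 0.0, 0.0)  # assumed convention for an isolated node
    return tuple(sum(v[axis] for v in neighbor_vectors) / n for axis in range(3))

def orientation_value(neighbor_vectors):
    """Length of the orientation vector, i.e., the absolute value of the orientation."""
    return math.sqrt(sum(c * c for c in orientation(neighbor_vectors)))
```

A node surrounded symmetrically by its neighbors gets an orientation value near zero, while a node whose neighbors all lie on one side gets a large value, which matches the center-versus-edge interpretation in the text.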

Problem Definition
Given a UWSN G(t) = (V, E(t)) at time slot t, our objective, as mentioned above, is to minimize the energy consumption of data forwarding while maximizing the value of information within a given monitoring duration T. Therefore, we aim to solve the problem of data forwarding by jointly considering timeliness and energy consumption.
As shown in Equation (6), p*_{i,t} represents the candidate packet that has the highest value of information in v_i at time slot t. Furthermore, if v_i is able to forward data to any neighbor at time slot t, p*_{i,t} will be delivered. In Equation (7), each sensor node has limited energy and is out of use when its residual energy drops to 0. The living time of packets cannot exceed the maximum living duration TTL, as shown in Equation (8). In Equation (9), the moving range of each sensor node is limited to the length of its tethering wire R_i.

Data Forwarding Method
In our scenario, the sensor nodes are dynamically moving due to water flow. In addition, the environment and neighborhood topology of each sensor node keep changing. We adopt a reinforcement learning-based method by which sensor nodes can distributively learn from the changing environments to forward data. This section describes the data forwarding method in detail. Specifically, we present the learning model, the learning method to choose a relay and the algorithm for packet forwarding.

Data Forwarding Procedure
The procedure of data forwarding mainly contains the following three stages, as can be seen in Algorithm 1.
(1) At the beginning of each time slot, each sensor node and sink node broadcasts its beacon signal, e.g., the identifier, orientation and residual energy. Therefore, each sensor node knows its neighbors.
(2) When v_i hears the beacon signal from s_m, it adds s_m to the set of its available sink nodes S(i, t). Similarly, if v_i can hear the beacon signal of sensor node v_j, v_i adds v_j to the set of its neighbors H(i, t). Additionally, the distance and orientation of each neighbor or reachable sink node can be acquired locally via the Received Signal Strength (RSS) and Angle of Arrival (AoA) of the beacon signal, respectively. If v_i cannot hear from any sink nodes or sensor nodes, it waits until the next time slot.
(3) Sensor node v_i selects the reachable sink node or next relay node by the algorithm RelaySelect, which performs a learned choice of a relay node. The RelaySelect algorithm will be introduced in detail in the third subsection.

Algorithm 1
  ...
    H(i, t) = ∅
  end for
  for each s_m ∈ S do
    s_m broadcasts its beacon signal
    for each v_i ∈ V do
      if v_i can hear s_m then
        S(i, t) = S(i, t) ∪ {s_m}
      end if
    end for
  end for
  for each v_i ∈ V do
    v_i broadcasts its beacon signal
    ...
    a_i(t) = RelaySelect(S(i, t), H(i, t), P(i, t))
  end for
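The neighbor discovery stage of the procedure can be sketched in Python as follows. The sketch assumes that a node hears a beacon exactly when the sender lies within the communication range CR, which idealizes the acoustic channel; the function name `discover` and the dictionary-based inputs are hypothetical.

```python
import math

def discover(sensors, sinks, comm_range):
    """Build S(i, t) and H(i, t) for every sensor from current positions.

    sensors, sinks: dicts mapping a node id to an (x, y, z) position.
    Returns (S, H): dicts mapping each sensor id to a set of reachable
    sink ids and a set of neighboring sensor ids, respectively.
    """
    S = {i: set() for i in sensors}
    H = {i: set() for i in sensors}
    for i, pos_i in sensors.items():
        # Assumption: a beacon is heard iff the sender is within comm_range.
        for m, pos_m in sinks.items():
            if math.dist(pos_i, pos_m) <= comm_range:
                S[i].add(m)
        for j, pos_j in sensors.items():
            if j != i and math.dist(pos_i, pos_j) <= comm_range:
                H[i].add(j)
    return S, H
```

A sensor whose two sets are both empty in a given time slot simply waits for the next slot, as described in stage (2).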

Q-Learning Model
Q-Learning is a model-free reinforcement learning technique, based on agents taking actions and receiving rewards from the environment in response to those actions [11]. Each action is assigned a Q-value that reflects its fitness. In the learning process, the agent calculates the reward of each potential action and updates the Q-value, by which the actual action can be determined. Q-Learning has been widely adopted in wireless ad hoc communications. The main challenge is the modeling of the Q-Learning process and the definition of Q-values.
Given the set X = {x_1, x_2, ..., x_t, ..., x_T} of states of an agent, a reward r_t(a_t) is received in state x_t after the agent takes action a_t ∈ A at time slot t.
To evaluate how good an action is at a state, the Q-value of action a_t at time slot t, Q(x_t, a_t), is updated as follows: where r_t(a_t) is the reward of taking action a_t at time slot t, Q(x_{t+1}, a_{t+1}) is the expected fitness at time slot (t + 1), γ is the learning discount factor and P^a_{x_t → x_{t+1}} represents the transition probability from state x_t to x_{t+1}.
In order to determine the optimal action, the action with the highest Q-value from state x_t to x_{t+1} at time slot t can be acquired as follows: For each state x_t ∈ X, the optimal action a*_t can be greedily acquired by updating the Q-value.
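As a generic illustration of the quantities named above, the following Python sketch computes a Q-value from a reward, a discount factor, transition probabilities and next-state Q-values, and then picks the greedy action. It uses the textbook form r + γ · Σ P · max Q, which is an assumption here and may differ in detail from the paper's own update equation; the function names and dictionary layouts are hypothetical.

```python
def q_update(reward, gamma, transition_probs, next_q):
    """Textbook one-step Q-value: reward plus discounted expected best next value.

    transition_probs: {next_state: probability of reaching it}
    next_q: {next_state: {action: Q-value}}
    """
    expected = sum(
        p * max(next_q.get(state, {}).values(), default=0.0)
        for state, p in transition_probs.items()
    )
    return reward + gamma * expected

def greedy_action(q_row):
    """Return the action with the highest Q-value in a {action: Q-value} mapping."""
    return max(q_row, key=q_row.get)
```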

Learning to Forward
If v_i transmits a packet to a relay node or a sink node, the state of v_i at time slot (t + 1) turns to 1, i.e., x_{t+1} = 1. Otherwise, x_{t+1} = 0. The action a_t in our scenario is a_i(t) = (p_{i,t}, v_j), which denotes the action of v_i forwarding packet p_{i,t} to v_j. Then, the reward of taking action a_t to reach the next state x_{t+1} is described as r(p_{i,t}, v_j). Lastly, the Q-value is updated to Q(p_{i,t}, v_j), which indicates the fitness of v_i forwarding packet p_{i,t} to v_j at time slot t.
In our data forwarding scenario, each sensor node is an independent learning agent and actions are options of a relay node or a sink node within its communication range. The following describes details of the model solution, including time, actions, transmission probabilities, rewards, and Q-values.
Agents: Agents are underwater sensor nodes.
Time: A node v_i handling a packet p is associated with a time slot t ∈ {0, 1, 2, 3, ..., T} defined by the sequential number of time slots.
Actions: Actions refer to the joint selection of a packet in the node's cache and of a relay node in its neighborhood. The set where a_i(t) = (p_{i,t}, v_j) is the action of forwarding packet p_{i,t} to relay node v_j.
Transmission Probabilities: Denote the probability of transmission from v_i to v_j at time slot t as Pr_{i,j}(t). Meanwhile, the transmission probability from the current relay node v_j to the next potential relay node v_k is denoted by Pr_{j,k}(t + 1). Pr_{i,j}(t) is computed by v_i, while Pr_{j,k}(t + 1) is computed by v_j and sent to v_i in the header of the broadcast packet in each round. The transmission probabilities can be calculated via the orientation metric by Equation (12), as follows.
$$\ldots\ \frac{1}{\pi}\arccos\frac{e_{ori}(j,t)\cdot v_{n_j}(j,t)}{|e_{ori}(j,t)|\,|v_{n_j}(j,t)|} \qquad (12)$$
Note that Pr_{j,k}(t + 1) is the prediction from the current time slot t because the topology at time slot (t + 1) cannot be ascertained yet due to the node mobility.
Rewards: The rewards mainly consist of two aspects, energy consumption and VoI, as shown in Equation (13), where r(p_{i,t}, v_j) represents the reward of v_i transmitting packet p_{i,t} to v_j, VoI(p_{i,t}) denotes the VoI of packet p_{i,t}, and E_r(i, t) represents the residual energy of v_i after transmission at time slot t.
Q-values: Q-values represent the goodness of actions, and agents aim to learn the actual fitness of potential actions. We initialize the Q-values as shown in Equation (14), where Q(p_{i,t}, v_j) refers to the Q-value of v_i in response to the action of choosing v_j as the relay node, VoI(p_{i,t}) denotes the VoI of the packet p_{i,t} to be transmitted, and E_r(i, t) represents the residual energy of v_i at the beginning.
Algorithm 2 describes the learning process of v_i ∈ V in each time slot as well as the corresponding determination of the packet to forward and its relay node.
If sink node s_m ∈ S is within the transmission range of v_i, v_i transmits the packet with the largest VoI in its cache to s_m directly. Otherwise, to identify an optimal forwarding decision, v_i learns the value of the function Q(p_{i,t}, v_j) and updates the Q-value. Based on this value, v_i determines the optimal forwarding action a_i(t) = (p_{i,t}, v_j). Each node starts with no knowledge of its surrounding environment. By broadcasting and listening in their neighborhoods, sensor nodes iteratively acquire and update their knowledge over time. The Q-values are updated as shown in Equation (15):
$$Q(p_{i,t}, v_j) = r(p_{i,t}, v_j) + \gamma \sum_{k \in H(j,t),\ k \neq j,\ k \neq i} Pr_{j,k}(t+1)\, r(p_{j,t+1}, v_k) \qquad (15)$$
where r(p_{j,t+1}, v_k) is the reward of v_j transmitting packet p_{j,t+1} to v_k at time slot (t + 1), approximated via Equation (13) based on the localization and neighborhood at time slot t. Additionally, Pr_{j,k}(t + 1) represents the probability of transmission from v_j to v_k and γ is the learning discount factor. In the learning process, sensor nodes calculate the reward of each potential relay node and update the Q-value. Finally, sensor nodes acquire the Q-table by which the most appropriate relay node can be determined.

Algorithm 2 RelaySelect(S(i, t), H(i, t), P(i, t))
  if ∃ s_m ∈ S(i, t) then
    ...
  for each p_{i,t} ∈ P_{i,t} do
    for each v_j ∈ H(i, t) do
      for each v_k ∈ H(j, t) and k ≠ i do
        Q(p_{i,t}, v_j) = r(p_{i,t}, j) + γ ∑_{k ∈ H(j,t), k ≠ j, k ≠ i} Pr_{j,k}(t + 1) r(p_{j,t+1}, k)
        ...
  return a_i(t)
  end for

In our method, each sensor node has to ascertain its neighborhood and then select the relay node in its neighborhood. Specifically, we have to execute two rounds of calculation for each sensor node in each time slot: (1) the determination of neighbor nodes within the sensor's communication range; (2) the selection of the neighbor node with the highest Q-value. In the first round of calculation, it takes a complexity of O( ) to calculate the distances between sensor nodes. In the second round, the complexity depends on the size of the neighborhood of sensor nodes. In the worst case, all the sensor nodes are in the same neighborhood, i.e., ∀j ≠ i, v_j ∈ H_i, and we need to calculate the Q-values of the (N − 1) neighbor nodes of v_i. Therefore, it takes a complexity of O(N(N − 1)) at most to select relay nodes for all the sensor nodes. Since the number of time slots is constant, the complexity of our method can be ascertained as O(N^2).
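The relay selection step can be sketched in Python as follows. The Q-value line follows the update shown in the RelaySelect listing (immediate reward plus the discounted, probability-weighted rewards of the relay's own neighbors); everything else is an assumption for illustration, including the callable-based interface, the choice of `min(sinks)` when several sinks are reachable, and the name `relay_select`. The reward and probability functions are passed in because their exact forms are not reproduced here.

```python
def relay_select(node_id, sinks, neighbors, packets,
                 reward, next_reward, prob, gamma=1.0):
    """Choose (packet, next hop) for one node in one time slot.

    sinks:      set of reachable sink ids, S(i, t)
    neighbors:  {j: iterable of j's neighbor ids H(j, t)} for each j in H(i, t)
    packets:    {packet_id: VoI} for the packets cached at this node
    reward(p, j):      immediate reward of sending packet p to neighbor j
    next_reward(j, k): approximated reward of j forwarding onward to k
    prob(j, k):        predicted transmission probability from j to k
    """
    if not packets:
        return None
    if sinks:
        # A sink is in range: send the packet with the largest VoI directly.
        best_packet = max(packets, key=packets.get)
        return (best_packet, min(sinks))  # tie-breaking among sinks is assumed
    best_action, best_q = None, float("-inf")
    for p in packets:
        for j, nbrs_of_j in neighbors.items():
            future = sum(prob(j, k) * next_reward(j, k)
                         for k in nbrs_of_j if k != node_id and k != j)
            q = reward(p, j) + gamma * future
            if q > best_q:
                best_action, best_q = (p, j), q
    return best_action  # None when the node has no neighbors
```

The nested loops visit every packet, every neighbor and every neighbor of that neighbor, so the work per node grows with the cache size and the neighborhood sizes.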

Results
In this section, we evaluate the performance of our proposed method compared with two well-known routing protocols: (i) QELAR, a machine learning-based protocol designed for minimizing and balancing node energy consumption [7]; (ii) DBR, a data forwarding method for UWSNs based on the depth of the sender [15]. It is worth mentioning that we use the total residual energy of sensor nodes, Value of Information and the ratio of packet delivery to sink nodes as the main metrics of performance evaluation.

Experimental Setup
The region of interest covers a space of 1000 m × 1000 m × 1000 m. We assume that the anchors are randomly deployed at the bottom and that the lengths of the tethering wires are also randomly generated, while the sink nodes are stationary at (333, 333, 1000) m and (666, 666, 1000) m. We consider UWSNs with different sizes of 10 and 100 sensor nodes, respectively. The sensors use Orthogonal Frequency-Division Multiplexing (OFDM) modulation, which allows simultaneous transmission from several users.
The simulation parameters are shown in Table 1. Each sensor node has a communication range of 300 m and an initial energy of 100 J. Packets have a length of 1000 bits and a TTL of 10 time slots. Sensor nodes move passively at a maximum speed of 100 m per time slot. The energy consumption coefficients e_s and e_r are set to 5 × 10^-8 J/bit and 10^-8 J/(bit · m^2), respectively. The decay factor of VoI, α, is set to 0.5, while the learning discount factor γ is set to 1, which speeds up the learning rate. All simulation results are obtained from 100 runs.

Simulation Metrics
Data forwarding performance is assessed through the following three metrics.
Value of Information: the VoI of packets acquired by the sink nodes within the monitoring duration.
Residual Energy: the total residual energy of sensor nodes within the monitoring duration.
Packet Delivery Ratio: the fraction of packets received by the sink nodes within the monitoring duration.
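The three metrics can be computed from a simulation log roughly as in the following Python sketch. It assumes that the VoI metric is the sum over delivered packets of their VoI on arrival and that the delivery ratio is taken over all generated packets; the function name `summarize` and its inputs are hypothetical.

```python
def summarize(delivered_voi, generated_count, residual_energy):
    """Aggregate one run into the three evaluation metrics.

    delivered_voi:   VoI of each packet at the moment a sink received it
    generated_count: total number of packets generated during the run
    residual_energy: remaining energy of each sensor node, in joules
    """
    if generated_count <= 0:
        raise ValueError("generated_count must be positive")
    return {
        "voi": sum(delivered_voi),                     # assumed: summed over packets
        "residual_energy": sum(residual_energy),       # total over all sensor nodes
        "pdr": len(delivered_voi) / generated_count,   # delivered / generated
    }
```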

Simulation Results
In this section, we present the simulation results. All results are obtained by averaging over 100 simulation runs.
(1) Value of Information. Figure 4 presents the value of information acquired by sink nodes in the scenario of 10 sensor nodes. Our method gains the highest VoI, 14.63% and 51.61% higher than QELAR and DBR, respectively. QELAR comes in second place while DBR obtains the lowest VoI among the three methods. Moreover, as shown in Figure 5, the advantage of our method in VoI grows as the network size increases: its VoI is 43.48% and 83.35% higher than QELAR and DBR, respectively. When forwarding data, QELAR and DBR choose the earliest packet in the cache. Not surprisingly, DBR achieves the lowest VoI because its forwarding decision depends on the accessibility of neighbors with smaller depths. Specifically, compared with QELAR and our method, sensor nodes have to wait longer for qualified neighbors, which leads to more decay of the VoI of packets. Our proposed method achieves the highest VoI because it explicitly takes VoI into account in its reward function (Section 4), which leads to the choice of the packet with the largest VoI in the sensor cache.
(2) Residual Energy. The residual energy of QELAR, DBR and our method with 10 sensor nodes is shown in Figure 6. The residual energy of QELAR is the lowest, while our method consumes the least energy among the three methods. More specifically, our method consumes 31.21% and 37.26% of the energy consumed by QELAR and DBR, respectively. As shown in Figure 7, our method still consumes the least energy among the three methods when the network size increases, only 24.79% and 31.43% of the energy consumption of QELAR and DBR, respectively. This is mainly because our method chooses packets and relay nodes jointly: it always selects the latest packets in the cache, while QELAR always selects the earliest packets. Moreover, in our method, earlier packets may have been discarded due to the TTL constraint by the time the later packets are forwarded, which avoids forwarding too many early packets in the cache, compared with QELAR. Therefore, the energy consumption of our method is much lower than that of QELAR.
(3) Packet Delivery Ratio. The packet delivery ratio (PDR) of QELAR, DBR and our method is given in Table 2. DBR achieves a higher PDR than the other two methods in both scenarios. Because packets are forwarded towards sensor nodes with smaller depths, a packet either stays in a sensor node or approaches the water surface, which prevents packets from being forwarded repeatedly between several sensor nodes and trapped in a certain region. Therefore, DBR reduces repeated forwarding between sensor nodes and increases the PDR. The PDRs of our method in the scenarios of 10 and 50 sensor nodes are 66.36% and 71.64%, respectively. Our method achieves a PDR slightly lower than that of QELAR, mainly because more packets with earlier generation times in the cache are discarded due to the maximum living duration.

Discussion and Conclusions
In this paper, we proposed a data forwarding method that jointly considers the VoI of packets and energy consumption under the passive mobility of sensors in UWSNs. We explicitly take both VoI and energy consumption into account in the reward function, thus reducing the energy consumption as well as enhancing the timeliness of data forwarding in UWSNs. In our method, the Q-value of the same sensor node can vary over time, which avoids the same node acting as a relay node until the depletion of its battery. Meanwhile, packets with a larger value of information have higher priority to be transmitted so as to achieve better timeliness. Although the packet delivery ratio of our method is relatively lower, our proposed method achieves much higher timeliness and consumes less energy than DBR and QELAR under the dynamic topology changes caused by the passive mobility of sensor nodes. Given that timeliness and energy consumption are more significant than the delivery ratio in our scenario, our method enhances the performance of UWSNs. In our scenario, the sink nodes are stationary, and the performance of data collection may be different if the sink nodes are moving on the surface of the detection region. As future work, we will study how the movement of sink nodes influences the data collection of UWSNs. Additionally, recent studies on harvesting ambient energy in UWSNs have drawn considerable attention. For instance, the kinetic energy of underwater currents can be harvested to prolong the lifetime of UWSNs. Therefore, we intend to carry out research on energy harvesting-aware data forwarding in UWSNs with passive mobility in the future.