Article

Energy-Efficient, Multi-Agent Deep Reinforcement Learning Approach for Adaptive Beacon Selection in AUV-Based Underwater Localization

by
Zahid Ullah Khan
1,
Hangyuan Gao
1,*,
Farzana Kulsoom
2,
Syed Agha Hassnain Mohsan
3,
Aman Muhammad
4 and
Hassan Nazeer Chaudry
5
1
College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
2
Department of Telecommunication Engineering, University of Engineering and Technology, Taxila 47080, Pakistan
3
School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China
4
School of Software, Shanxi Agricultural University, Jinzhong 030801, China
5
Department of Computer Science, Barani Institute of Information Technology, Rawalpindi 46000, Pakistan
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(3), 262; https://doi.org/10.3390/jmse14030262
Submission received: 11 December 2025 / Revised: 14 January 2026 / Accepted: 23 January 2026 / Published: 27 January 2026
(This article belongs to the Section Ocean Engineering)

Abstract

Accurate and energy-efficient localization of autonomous underwater vehicles (AUVs) remains a fundamental challenge due to the complex, bandwidth-limited, and highly dynamic nature of underwater acoustic environments. This paper proposes a fully adaptive deep reinforcement learning (DRL)-driven localization framework for AUVs operating in Underwater Acoustic Sensor Networks (UAWSNs). The localization problem is formulated as a Markov Decision Process (MDP) in which an intelligent agent jointly optimizes beacon selection and transmit power allocation to minimize long-term localization error and energy consumption. A hierarchical learning architecture is developed by integrating four actor–critic algorithms, which are (i) Twin Delayed Deep Deterministic Policy Gradient (TD3), (ii) Soft Actor–Critic (SAC), (iii) Multi-Agent Deep Deterministic Policy Gradient (MADDPG), and (iv) Distributed DDPG (D2DPG), enabling robust learning under non-stationary channels, cooperative multi-AUV scenarios, and large-scale deployments. A round-trip time (RTT)-based geometric localization model incorporating a depth-dependent sound speed gradient is employed to accurately capture realistic underwater acoustic propagation effects. A multi-objective reward function jointly balances localization accuracy, energy efficiency, and ranging reliability through a risk-aware metric. Furthermore, the Cramér–Rao Lower Bound (CRLB) is derived to characterize the theoretical performance limits, and a comprehensive complexity analysis is performed to demonstrate the scalability of the proposed framework. Extensive Monte Carlo simulations show that the proposed DRL-based methods achieve significantly lower localization error, lower energy consumption, faster convergence, and higher overall system utility than classical TD3. These results confirm the effectiveness and robustness of DRL for next-generation adaptive underwater localization systems.

1. Introduction

Accurate localization of autonomous underwater vehicles (AUVs) is fundamental to a wide range of mission-critical underwater operations, including long-term oceanographic mapping, environmental surveillance, offshore resource exploration, target tracking, and cooperative navigation within Underwater Acoustic Sensor Networks (UAWSNs) [1,2,3,4]. Despite its significance, achieving reliable positioning in underwater environments remains an open challenge. Acoustic propagation is strongly affected by depth-dependent sound speed variations, long propagation delays, multi-path reflections, Doppler distortions, and high absorption losses [5,6,7,8]. Furthermore, the inherently limited bandwidth, sparse connectivity, and strict energy constraints for the underwater nodes restrict the frequency of communication exchanges, making real-time localization both computationally and operationally difficult. Existing localization techniques can be broadly categorized into Range-Free and Range-Based methods [9,10,11]. Range-Free techniques rely on beacon connectivity, hop counts, or geometric constraints [8,9,10,11]. They are energy-efficient [12] but often inadequate in dynamic and deep-water environments. Methods such as the ellipse-based geometric model in [13] can tolerate beacon uncertainty but still suffer from limited accuracy under mobility. Range-based techniques using Round Trip Time (RTT), Received Signal Strength (RSS), Time of Arrival (ToA), Time Difference of Arrival (TDoA), and Angle of Arrival (AoA) achieve higher precision but require meticulous power control [14,15,16]. While increasing the number of active beacons improves estimation accuracy, it simultaneously amplifies communication overhead and power expenditure. Conventional schemes, such as Fixed Localization Algorithm (FLA) [17], which rely on constant transmit power and a fixed subset of beacons, degrade rapidly when channel conditions fluctuate or when the scale of the network increases. 
To address these challenges, this paper introduces a fully adaptive and energy-aware localization framework driven by deep reinforcement learning (DRL). The localization task is formulated as a Markov Decision Process (MDP), enabling an intelligent agent to jointly optimize beacon selection and per-beacon transmit power allocation. The state representation incorporates acoustic features, AUV depth, the number of available beacons, and instantaneous ranging uncertainty, allowing the agent to adaptively respond to real-time channel variations. The reward structure integrates localization accuracy, energy efficiency, and ranging reliability, enabling optimal trade-offs between precision and resource consumption. The proposed architecture strategically combines four complementary actor–critic algorithms, (i) Twin Delayed Deep Deterministic Policy Gradient (TD3), (ii) Soft Actor–Critic (SAC), (iii) Multi-Agent Deep Deterministic Policy Gradient (MADDPG), and (iv) Distributed Deep Deterministic Policy Gradient (D2DPG), to achieve robust learning across heterogeneous underwater environments. TD3 mitigates overestimation bias and stabilizes convergence through dual critics and delayed policy updates, SAC enhances exploration in non-stationary acoustic conditions through an entropy-regularized objective, MADDPG enables coordinated behavior when multiple AUVs or cooperative beacon clusters are involved, and D2DPG introduces communication-efficient distributed learning that is suitable for large-scale deployments. RTT-based geometric estimation with an isogradient sound speed model is used to compute the AUV’s position, and a risk metric based on mean ranging error is incorporated to penalize inaccurate decisions.
A theoretical Cramér–Rao Lower Bound (CRLB) analysis establishes the fundamental accuracy limits under realistic measurement noise and beacon perturbations, while a complexity assessment demonstrates scalability with respect to the number of beacons and model parameters. While several recent studies combine reinforcement learning with underwater localization, these approaches typically rely on a single learning paradigm or address only one aspect of the localization problem (e.g., trajectory planning or anchor selection). In contrast, our work introduces a unified and hierarchical DRL framework that integrates TD3, SAC, MADDPG, and D2DPG within a single policy architecture. Each component addresses a distinct challenge—continuous power control, robust exploration, multi-agent coordination, and distributed scalability—resulting in a localization system that is simultaneously adaptive, energy-efficient, and scalable. To the best of our knowledge, this is the first work to jointly optimize beacon selection and transmit power using a multi-algorithm DRL framework under realistic underwater acoustic propagation models.
The following are the contributions of the paper:
  • We develop a fully adaptive and energy-aware underwater localization framework based on deep reinforcement learning, enabling AUVs to dynamically optimize beacon selection and transmit power allocation under highly variable acoustic conditions.
  • We formulate the underwater localization problem as a Markov Decision Process (MDP) with a rich state representation that incorporates signal features, depth, ranging uncertainty, and beacon availability, enabling environment-aware decision making.
  • We design a multi-objective reward function that jointly balances localization accuracy, energy consumption, and ranging reliability through a risk metric derived from the mean ranging error.
  • We integrate multiple DRL algorithms, TD3, SAC, MADDPG, and D2DPG, each addressing different challenges, such as overestimation bias, entropy-driven exploration, multi-agent cooperation, and scalable decentralized learning.
  • We incorporate an RTT-based geometric localization model with an isogradient sound-speed profile, capturing depth-dependent acoustic propagation effects for more accurate position estimation.
  • We derive the CRLB for the proposed framework to analytically characterize the theoretical limits of localization accuracy under noisy acoustic measurements and beacon uncertainties.
  • We provide a detailed computational complexity analysis demonstrating that the proposed hierarchical DRL framework scales effectively with the number of beacons, transmit-power levels, and learning iterations, making it practical for large-scale UASN deployments.
The remainder of this paper is organized as follows. Section 2 reviews related literature and current state-of-the-art techniques. Section 3 presents the system model. Section 4 describes the proposed reinforcement learning (RL)-based localization framework. Section 5 details the simulation setup and experimental evaluations, followed by a comprehensive performance analysis. Finally, Section 6 concludes the paper with insights and future research directions.

2. Related Work

Localization is one of the most significant problems in UAWSNs because underwater propagation models are complicated by the unique properties of the medium and because traditional radio-based positioning performs poorly underwater. UAWSN localization schemes are generally divided into range-free and range-based schemes, which differ in approach and merits. Range-free methods calculate the location of target nodes from topological knowledge or network connectivity, requiring no actual distance or angle measurements. In [18], a robust multi-AUV cooperative localization scheme is presented that accounts for model mismatches, measurement errors, and time-varying communication channels. The method applies maximum entropy principles and belief propagation over factor graphs to improve localization accuracy and robustness. Unlike traditional methods, which degrade under uncertainty, the proposed algorithm sustains high-accuracy estimates in complex underwater conditions, making it suitable for oceanic rescue, resource development, and marine supervision. In [19], four probabilistic algorithms for bathymetry-based AUV localization are evaluated. The analysis fuses bathymetric data with depth and range measurements through diverse Bayesian filters, namely the extended Kalman filter, unscented Kalman filter, particle filter, and marginalized particle filter (MPF). On a simulation platform built with ROS Gazebo, the algorithms are evaluated with real-world lake data, and the results indicate that the MPF is typically the most accurate and best able to overcome the difficulties encountered by visual landmark-based techniques.
The authors of [20] introduce a combined method of localization, communication, and trajectory planning (LCP) in an AUV network, treating these components as a system instead of operating them in isolation. Prior research usually addresses these elements individually, which often creates conflicting resource requirements and performance issues. The suggested approach presents a cooperative localization method between AUVs in which trajectory planning allows an AUV to improve its localization accuracy even under severe underwater communication conditions. In contrast to previous methods, where planning influences localization accuracy only indirectly through geometric optimization, this method uses the trajectory-planning results directly to enhance localization. The algorithm is additionally optimized for obstacle-avoidance applications, further improving performance. Computer simulations and field experiments indicate that the proposed algorithm outperforms state-of-the-art alternatives in localization, communication, and planning, making it a more resilient solution for multi-AUV systems. Nevertheless, the approach may face problems with real-time computational complexity and scalability in large-scale networks, because it depends on accurate communication conditions to succeed.
The scholars in [21] present an Efficient AUV-aided Localization scheme (EAL) for large-scale UAWSNs that tackles major dilemmas in AUV-assisted localization. The first issue is complex path planning for several AUVs, where localization accuracy and travel-path optimization must be considered jointly. The second difficulty stems from the poor conditions of underwater localization, including non-synchronized clocks and stratification effects, which may seriously decrease localization accuracy. To address these issues, the paper presents a graph-based localization path-planning mechanism, which considers the influence of the route on localization accuracy and helps find effective AUV travel routes. Additionally, an asynchronous, iterative localization mechanism is developed to offset the stratification effect, ensuring correct localization even under unfavorable circumstances. Extensive simulations show that the proposed method achieves high localization accuracy and efficiency in large-scale UAWSNs and is a worthy candidate for real-world application. Nevertheless, the approach also has weaknesses associated with real-time implementation and the impact of environmental factors on system performance. In [22], the researchers present a reinforcement learning-based trajectory planning system for AUVs to enhance localization within UAWSNs. Unlike traditional methods that operate with fixed anchors, the method employs AUVs as mobile anchors to minimize localization uncertainty. The trajectory-planning procedure is designed to maximize entropy reduction within the network, which ensures an optimal route choice for the AUV and accurate localization. The AUV’s path is optimized using a modified version of the DDPG algorithm that balances efficiency and localization accuracy.
Simulations show that the suggested approach outperforms traditional approaches in accuracy and computational efficiency, though some issues with scalability and computational complexity remain. Other approaches to anchor and node localization include frequency- and weighted-fusion-based synchronization techniques [23], which further improve accuracy.
On the other hand, asynchronous schemes [24,25] seek to minimize communication overhead and computational complexity. It is worth noting that the consensus-based unscented Kalman filter in [24] converges faster to a localization solution, whereas [25] proposes iterative least-squares algorithms with both active and passive inputs, which enhance accuracy and versatility. Another factor with a significant impact on localization reliability is network topology optimization and sensor node selection. The scholars of [26] present a secure localization protocol for UWSNs in which cooperative beamforming of underwater mobile anchor nodes improves network security and energy efficiency. The scheme uses the TDOA algorithm to localize sensor nodes and addresses the security risk of eavesdroppers intercepting the communication. The localization problem is defined as a multi-anchor, multi-objective joint optimization whose objective is to maximize localization accuracy and reduce energy consumption, and the MADDPG algorithm, a form of DRL, is employed to solve it. The efficacy and precision of the suggested scheme are confirmed by both simulation and field experiment, yet the technique is burdened with issues regarding computational complexity and large network scale. The authors of [27] introduce a new method of multi-agent underwater source localization based on Multi-Agent Reinforcement Learning (MARL). The framework optimizes the paths of two AUVs, each towing an antenna, to maximize the likelihood of detecting an underwater source. The proposed shared-parameter MARL strategy overcomes the problem of a non-stationary multi-agent environment that requires non-synchronous actions. A neural network is trained in a simplified simulation environment before being tested in a more realistic simulation engine.
The findings indicate that the suggested strategy is resistant to a communication loss rate of up to 60% and is as effective as well-known localization strategies. Nonetheless, scaling and real-world underwater dynamics can be an issue for the approach, as it is trained on simplified simulations.
ToA- and TDoA-based approaches [28,29,30] tend to be highly accurate not only inside the localization region but also outside it. As an example, the scheme in [28] combines trajectory tracking and frequency-based anchor activation to reduce the initial location error, and Kalman filtering enhances motion-state estimation. Similarly, the geometric model-based scheme in [31] incorporates real-world motion measurements to minimize uncertainties in beacon-point placement and geometric shape distortion. Equally, in [32], the hop progress-based approach uses a mixture of geometric constraints and multi-hop communication to trade off distance-estimation accuracy against efficient anchor usage. Though range-free localization is generally associated with increased communication overhead, it is commonly used in UAWSNs with fixed infrastructure because it is robust and resilient to channel anomalies. Alternatively, range-based localization techniques estimate distances between sensor nodes and anchors from signal parameters such as AoA, RSS, ToA, and TDoA. Spatial–temporal signal processing is important in enhancing localization accuracy. As an illustration, the AoA-based approach in [33] obtains a closed-form solution from bearing-angle measurements to improve precision in noisy environments. Doppler-based approaches, as mentioned in [34], exploit frequency-shift estimation to locate mobile nodes, but attaining high Doppler-estimation accuracy is challenging because of the high dynamics caused by multi-path and environmental factors. Models based on kernel functions have also been used to address outliers in ToA measurements and enhance overall robustness [35]. With respect to timing, range-based localization may be synchronous or asynchronous.
In [36], the researchers present a cooperative online target detection system for multi-AUVs that uses side-scan sonar (SSS) sensors to achieve real-time, effective detection and positioning of underwater targets in an unknown hostile environment. The work addresses severe noise, geometric deformation of sonar images, and high false-alarm rates through a multi-scale cascaded network (MSCNet) with prior-based threshold segmentation. The paper proposes a dual-branch lightweight block (LWBlock) as a baseline for efficient feature extraction under real-time computational constraints. Moreover, the behavior-based DDBB path re-planning algorithm enables AUVs to scan targets autonomously according to information from an automatic target recognition (ATR) platform. The simulation and sea-test outcomes indicate that the proposed method obtains a recognition rate of 92.16%, with an inference time of 2.45 s, and the detection efficiency increases by 40% over that of a single AUV. Nevertheless, the approach may face scalability and real-time adoption problems in massive systems.

3. System Model

Consider a UAWSN consisting of an AUV and a set of spatially distributed beacons that assist in localization and communication. The AUV either follows a predefined trajectory for ocean monitoring or hovers at a fixed position to gather data from nearby underwater sensors. Let $\mathcal{B} = \{1, 2, \ldots, N\}$ denote the set of $N$ beacons located within the AUV’s acoustic communication range. The proposed system model architecture is depicted in Figure 1.

3.1. AUV Beacon Communication Model

At discrete time slot $k \in \mathbb{N}$, the AUV is positioned at
$$\mathbf{x}(k) = [\, x_1(k), x_2(k), x_3(k) \,]^\top \in \mathbb{R}^3$$
and moves with instantaneous velocity
$$\mathbf{v}(k) = [\, v_1(k), v_2(k), v_3(k) \,]^\top.$$
Each beacon $i \in \mathcal{B}$ is located at
$$\mathbf{l}_i(k) = [\, l_{i,1}(k), l_{i,2}(k), l_{i,3}(k) \,]^\top.$$
The AUV initiates the localization process by broadcasting a localization request at time $T_s(k)$. Each beacon responds with a signal using transmit power $a_i(k)$, selected from a finite discrete set of levels:
$$a_i(k) \in \left\{ \frac{m P_T}{M} \;\middle|\; 0 \le m \le M \right\},$$
where $P_T$ denotes the maximum transmit power, and $M$ is the quantization level of power control. The subset of active beacons at time $k$ is expressed as
$$Y(k) = \{\, i \in \mathcal{B} \mid a_i(k) > 0 \,\},$$
and the number of active beacons is given by
$$n(k) = \sum_{i=1}^{N} \mathbb{I}\big( a_i(k) > 0 \big),$$
where $\mathbb{I}(\cdot)$ denotes the indicator function.
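The discrete power set and the active-beacon bookkeeping above can be sketched as follows (a minimal NumPy illustration; the function names and example values are hypothetical, not from the paper):

```python
import numpy as np

def power_levels(P_T: float, M: int) -> np.ndarray:
    """Feasible discrete transmit powers {m * P_T / M : 0 <= m <= M}."""
    return np.arange(M + 1) * P_T / M

def active_set(a: np.ndarray):
    """Active beacon set Y(k) = {i : a_i(k) > 0} and its size n(k)."""
    idx = np.flatnonzero(a > 0)
    return idx, idx.size

# Example: N = 4 beacons, P_T = 10 (max power), M = 5 quantization levels.
levels = power_levels(10.0, 5)        # array([0., 2., 4., 6., 8., 10.])
a = np.array([0.0, 4.0, 0.0, 10.0])   # chosen per-beacon transmit powers
Y, n = active_set(a)                  # Y = [1, 3], n = 2
```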

3.2. Signal Propagation and Attenuation Model

Upon receiving the localization request, beacon $i$ transmits a response signal back to the AUV. The AUV records the RSS $\xi_i(k)$ and the time of reception $T_{r,i}(k)$. The propagation loss $L(d_i, f_c)$ of the acoustic wave between the AUV and beacon $i$ is modeled as
$$L(d_i, f_c) = 10\,\eta \log_{10}\!\frac{d_i}{d_0} + \alpha(f_c)\, d_i, \qquad d_i \ge d_0,$$
where $d_i = \| \mathbf{x}(k) - \mathbf{l}_i(k) \|_2$ is the AUV–beacon distance, $\eta$ is the geometric spreading factor, $d_0$ is the reference distance, $f_c$ is the acoustic center frequency (in kHz), and $\alpha(f_c)$ represents the frequency-dependent absorption coefficient (in dB/km).
The received power at the AUV from beacon $i$ is thus
$$P_{r,i}(k) = a_i(k) - L(d_i, f_c) + \zeta_i(k),$$
where $\zeta_i(k)$ denotes random small-scale fading and measurement noise, modeled as a zero-mean Gaussian variable $\mathcal{N}(0, \sigma_\zeta^2)$.
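The attenuation model can be exercised with the sketch below. The paper only specifies a generic frequency-dependent absorption coefficient $\alpha(f_c)$, so this illustration substitutes Thorp's well-known empirical formula; the spreading factor and all other parameter values are assumptions:

```python
import numpy as np

def thorp_alpha(f_khz: float) -> float:
    """Thorp's empirical absorption coefficient in dB/km (an assumption;
    the paper only states a generic frequency-dependent alpha)."""
    f2 = f_khz ** 2
    return 0.11 * f2 / (1 + f2) + 44 * f2 / (4100 + f2) + 2.75e-4 * f2 + 0.003

def path_loss_db(d_m: float, f_khz: float, eta: float = 1.5, d0: float = 1.0) -> float:
    """L(d, f_c) = 10*eta*log10(d/d0) + alpha(f_c)*d in dB
    (absorption converted from dB/km to dB/m)."""
    return 10 * eta * np.log10(d_m / d0) + thorp_alpha(f_khz) * d_m / 1000.0

def received_power_db(a_db: float, d_m: float, f_khz: float,
                      sigma_zeta: float = 0.0, rng=None) -> float:
    """P_r = a - L(d, f_c) + zeta, with zeta ~ N(0, sigma_zeta^2) in dB."""
    rng = np.random.default_rng(0) if rng is None else rng
    return a_db - path_loss_db(d_m, f_khz) + rng.normal(0.0, sigma_zeta)
```

As expected from the model, loss grows with both range (spreading plus absorption) and carrier frequency (absorption).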

3.3. Sound-Speed Profile and Range Estimation

Due to stratification in the underwater medium, the acoustic sound speed varies with depth. The isogradient sound-speed profile is approximated as a linear function of depth:
$$v_U(k) = B_0 + B_1\, x_3(k),$$
$$v_i(k) = B_0 + B_1\, l_{i,3}(k),$$
where $B_0$ denotes the sound speed at the sea surface, and $B_1$ is the gradient coefficient determined by the local thermohaline environment. The average sound speed $\bar{v}_i(k)$ along the direct path between the AUV and beacon $i$ can be expressed as
$$\bar{v}_i(k) = \frac{B_1 \big( x_3(k) - l_{i,3}(k) \big)}{\ln \dfrac{B_0 + B_1\, x_3(k)}{B_0 + B_1\, l_{i,3}(k)}}.$$
Using the RTT measurement $\Delta t_i(k) = T_{r,i}(k) - T_s(k)$, the estimated range between the AUV and beacon $i$ is
$$\hat{d}_i(k) = \tfrac{1}{2}\, \bar{v}_i(k)\, \Delta t_i(k).$$
The ranging error is defined as
$$\varepsilon_i(k) = \hat{d}_i(k) - d_i(k),$$
with $\varepsilon_i(k) \sim \mathcal{N}(0, \sigma_d^2)$ representing Gaussian measurement uncertainty.
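A minimal sketch of the isogradient mean-speed and RTT range computations; the values chosen for $B_0$ and $B_1$ are illustrative, since the paper does not fix them:

```python
import math

B0, B1 = 1500.0, 0.016  # surface speed (m/s) and gradient; illustrative values

def mean_sound_speed(x3: float, l3: float) -> float:
    """Average speed along the direct path for the linear profile
    v(z) = B0 + B1*z; reduces to the local speed when depths coincide."""
    if math.isclose(x3, l3):
        return B0 + B1 * x3
    return B1 * (x3 - l3) / math.log((B0 + B1 * x3) / (B0 + B1 * l3))

def rtt_range(x3: float, l3: float, delta_t: float) -> float:
    """Estimated range d_hat = 0.5 * v_bar * RTT."""
    return 0.5 * mean_sound_speed(x3, l3) * delta_t
```

The mean speed always lies between the endpoint speeds, which is a quick sanity check on the logarithmic form.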

3.4. AUV Position Estimation

Given the estimated distances $\{ \hat{d}_i(k) \}_{i \in Y(k)}$ and the known positions of the selected beacons, the AUV’s position $\hat{\mathbf{x}}(k)$ can be obtained by solving the nonlinear least-squares problem:
$$\hat{\mathbf{x}}(k) = \arg\min_{\mathbf{x}} \sum_{i \in Y(k)} \Big( \| \mathbf{x} - \mathbf{l}_i(k) \|_2^2 - \hat{d}_i^{\,2}(k) \Big)^2.$$
This optimization can be efficiently solved using iterative gradient descent or linearization methods, such as the Gauss–Newton algorithm.
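The Gauss–Newton option mentioned above can be sketched as follows; this is a minimal illustration under assumed names and values, not the authors' implementation:

```python
import numpy as np

def gauss_newton_localize(beacons: np.ndarray, d_hat: np.ndarray,
                          x0: np.ndarray, iters: int = 50) -> np.ndarray:
    """Gauss-Newton iteration for
    min_x sum_i (||x - l_i||^2 - d_hat_i^2)^2."""
    x = x0.astype(float).copy()
    for _ in range(iters):
        diff = x - beacons                          # (N, 3)
        r = np.sum(diff ** 2, axis=1) - d_hat ** 2  # residual vector
        J = 2.0 * diff                              # Jacobian dr_i/dx
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x += step
        if np.linalg.norm(step) < 1e-10:
            break
    return x

# Noiseless sanity check with four well-spread (hypothetical) beacons.
beacons = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 10.0],
                    [0.0, 100.0, 20.0], [100.0, 100.0, 30.0]])
x_true = np.array([40.0, 60.0, 15.0])
d_hat = np.linalg.norm(beacons - x_true, axis=1)
x_est = gauss_newton_localize(beacons, d_hat, x0=np.array([50.0, 50.0, 0.0]))
```

With noiseless ranges and well-conditioned geometry, the iteration recovers the true position; in practice, noisy $\hat{d}_i$ values yield the least-squares estimate instead.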

3.5. The Energy Consumption Model

The instantaneous energy expenditure of beacon $i$ at time $k$ is given by
$$E_i(k) = a_i(k)\, \tau_i(k),$$
where $\tau_i(k)$ is the transmission duration. The total network energy consumption is then
$$E_{\mathrm{tot}}(k) = \sum_{i \in Y(k)} E_i(k).$$
The goal of the reinforcement learning-based framework described in later sections is to jointly optimize the beacon selection policy $Y(k)$ and transmit power vector $\mathbf{a}(k) = [a_1(k), \ldots, a_N(k)]$ such that the long-term cumulative utility
$$J = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k \big( \omega_1\, \mathrm{Acc}(k) - \omega_2\, E_{\mathrm{tot}}(k) - \omega_3\, \mathrm{Risk}(k) \big) \right]$$
is maximized, where $\gamma \in (0, 1)$ is the discount factor; $\mathrm{Acc}(k)$ denotes the localization accuracy; $\mathrm{Risk}(k)$ is the average ranging error; and $\omega_1$, $\omega_2$, and $\omega_3$ are weighting coefficients. For notational simplicity, the time index $k$ will be omitted hereafter when the context is clear.
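Under these definitions, a realized episode's discounted utility can be evaluated as in the sketch below; the weights and discount factor are illustrative placeholders, not values from the paper:

```python
import numpy as np

def cumulative_utility(acc, e_tot, risk, w=(0.6, 0.3, 0.1), gamma=0.95):
    """Discounted utility sum_k gamma^k (w1*Acc(k) - w2*E_tot(k) - w3*Risk(k)).
    Weights w and gamma are illustrative, not the paper's values."""
    acc, e_tot, risk = (np.asarray(v, dtype=float) for v in (acc, e_tot, risk))
    per_step = w[0] * acc - w[1] * e_tot - w[2] * risk
    discounts = gamma ** np.arange(per_step.size)
    return float(np.sum(discounts * per_step))
```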

4. Proposed Method

Building upon the system model introduced in Section 3, this section presents the proposed DRL-based localization framework for AUVs operating in a UAWSN. The objective is to achieve high-precision and energy-efficient localization under dynamic acoustic conditions, constrained bandwidth, and mobility-induced uncertainties. The localization process is modeled as an MDP, where an intelligent agent adaptively selects beacons and allocates transmission power to minimize long-term localization error and energy expenditure. The proposed framework integrates multiple actor–critic algorithms, namely, TD3, SAC, MADDPG, and D2DPG, into a unified architecture. Each learning paradigm contributes specific advantages: TD3 mitigates overestimation bias, SAC enhances exploration through entropy regularization, MADDPG enables cooperative multi-agent optimization, and D2DPG ensures scalable distributed convergence.

4.1. Localization Problem Based on RL

The AUV localization process is formulated as an MDP
$$\mathcal{M} = \{ \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \},$$
where $\mathcal{S}$, $\mathcal{A}$, $\mathcal{P}$, $\mathcal{R}$, and $\gamma$ denote the state space, action space, transition dynamics, reward function, and discount factor, respectively.
At each time step $k$, the state vector $s(k) \in \mathcal{S}$ encodes the AUV position, acoustic features, and beacon status:
$$s(k) = [\, x_1(k), x_2(k), x_3(k), \bar{\xi}(k), n(k), \bar{\varepsilon}(k) \,],$$
where $(x_1(k), x_2(k), x_3(k))$ is the current position, $\bar{\xi}(k)$ is the mean RSS, $n(k)$ is the number of active beacons, and $\bar{\varepsilon}(k)$ is the average ranging error.
The action vector $\mathbf{a}(k) = [a_1(k), \ldots, a_N(k)]$ determines per-beacon transmit power levels, where each $a_i(k)$ takes discrete values $\{0, P_T/M, \ldots, P_T\}$, and $a_i(k) = 0$ denotes an inactive beacon. The stochastic transition function is given by
$$s(k+1) = f\big( s(k), \mathbf{a}(k), \omega(k) \big),$$
where $\omega(k) \sim \mathcal{N}(\mathbf{0}, \Sigma_\omega)$ represents noise and random mobility perturbations. The reward at time step $k$ balances localization accuracy, energy cost, and ranging risk:
$$r(k) = \omega_1 \exp\!\left( -\frac{\bar{\varepsilon}(k)}{\varepsilon_{\max}} \right) - \omega_2\, \frac{E_{\mathrm{tot}}(k)}{E_{\max}} - \omega_3\, \bar{\rho}(k),$$
where $\omega_1 + \omega_2 + \omega_3 = 1$. The overall learning objective is to maximize the long-term discounted return:
$$J(\pi) = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k\, r(k) \right].$$
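A minimal sketch of this reward, with placeholder values for the weights and the normalization constants $\varepsilon_{\max}$ and $E_{\max}$ (the paper does not report them):

```python
import numpy as np

def reward(eps_bar, e_tot, rho_bar, eps_max=10.0, e_max=100.0, w=(0.5, 0.3, 0.2)):
    """r(k) = w1*exp(-eps/eps_max) - w2*E/E_max - w3*rho, with w1+w2+w3 = 1.
    eps_max, e_max, and the weights are illustrative placeholders."""
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to one"
    return (w[0] * np.exp(-eps_bar / eps_max)
            - w[1] * e_tot / e_max
            - w[2] * rho_bar)
```

The reward is maximal at zero ranging error and zero energy spend, and decreases monotonically in all three penalty terms.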
AUV position estimation is obtained through RTT-based least-squares localization, with estimation error
$$\varepsilon(k) = \| \hat{\mathbf{x}}(k) - \mathbf{x}(k) \|_2,$$
where $\hat{\mathbf{x}}(k)$ and $\mathbf{x}(k)$ denote the estimated and true positions, respectively. A covariance-based confidence metric is defined as
$$\Sigma_{\hat{\mathbf{x}}}(k) = \sigma_d^2\, ( \mathbf{J}^\top \mathbf{J} )^{-1},$$
where $\mathbf{J}$ denotes the Jacobian of the range measurements.
Its trace contributes to the reward, encouraging stable, confident localization. Figure 2 illustrates the proposed model-assisted actor–critic learning framework that integrates a super-sampling Info-GAN with entropy-regularized deep reinforcement learning for AUV localization [37]. On the left, the super-sampling Info-GAN learns the underlying state-transition distribution from past state pairs ( s , s ) , where the generator, driven by random noise and an implicit transition model, produces predictive state pairs ( s t , s t + 1 ) , and the discriminator enforces realism by maximizing mutual information with real transitions. The generated transitions are combined with real environment interactions and stored in a shared replay buffer. On the right, the actor network selects actions a t π ( · | s t ) that interact with the environment, while the Soft Q network minimizes the entropy-regularized TD error and the Soft V network minimizes the corresponding value loss. The actor is updated to maximize the expected return under the learned critics. Overall, the framework leverages GAN-based transition super-sampling to improve data efficiency and stability while learning energy-efficient and accurate localization policies under uncertain underwater acoustic conditions.

4.2. Hierarchical Deep Reinforcement Learning Framework

To achieve adaptive decision making across varying underwater conditions, the proposed framework employs a hierarchy of DRL algorithms, each contributing a specific optimization capability.
(a) TD3-Based Power Adaptation: The core of the proposed hierarchical learning framework is a TD3 agent, which is responsible for continuous beacon power allocation under dynamic underwater acoustic conditions. TD3 is selected as the backbone learning algorithm due to its strong stability properties in continuous action spaces and its robustness against overestimation bias, which is particularly critical in energy-constrained underwater networks, where erroneous value estimates can lead to excessive power consumption. In the proposed framework, the actor network $\pi_\phi(s)$ maps the current system state $s(k)$, comprising the AUV position, average RSS, number of active beacons, and ranging uncertainty, to a continuous power-allocation vector $\mathbf{a}(k) = [a_1(k), \ldots, a_N(k)]$. Each element $a_i(k)$ represents the transmit power level of beacon $i$ and is subsequently quantized to the nearest feasible discrete level defined in Section 3. This design allows TD3 to operate in a continuous control space while respecting practical hardware constraints.
To mitigate value overestimation, TD3 employs two independent critic networks, $Q_{\theta_1}$ and $Q_{\theta_2}$, which estimate the state–action value function. Each critic is trained using the Bellman update:
$$Q_{\theta_j}(s, a) = \mathbb{E}\big[\, r + \gamma\, Q_{\theta_j'}\big( s', \pi_{\phi'}(s') \big) \,\big], \quad j \in \{1, 2\},$$
where $\theta_j'$ and $\phi'$ denote slowly updated target network parameters. The use of two critics ensures that overly optimistic value estimates common in single-critic deterministic policy gradient methods are suppressed by taking the minimum of the target Q-values. To further enhance learning stability, target policy smoothing is applied by injecting clipped Gaussian noise into the target action:
$$y = r + \gamma \min_{j} Q_{\theta_j'}\big( s', \pi_{\phi'}(s') + \mathrm{clip}(\mathcal{N}, -c, c) \big),$$
where $\mathcal{N} \sim \mathcal{N}(0, \sigma_n^2)$. This mechanism prevents the critic from exploiting sharp local peaks in the value function, which can arise from channel fading, multi-path effects, or RTT measurement noise in underwater environments. Each critic is trained by minimizing the mean-squared Bellman error:
$$L(\theta_j) = \mathbb{E}\Big[ \big( Q_{\theta_j}(s, a) - y \big)^2 \Big], \quad j \in \{1, 2\}.$$
The actor network is updated using deterministic policy gradients computed with respect to the first critic:
$$\nabla_{\phi} J = \mathbb{E}\!\left[ \nabla_{a} Q_{\theta_1}(s,a)\big|_{a = \pi_\phi(s)}\, \nabla_{\phi}\, \pi_{\phi}(s) \right],$$
which encourages the policy to select beacon power allocations that maximize the expected long-term localization reward while implicitly balancing energy consumption and ranging risk. A key feature of TD3 is delayed policy updates, where the actor and target networks are updated less frequently than the critics. In the proposed framework, the critic networks are updated at every time step, whereas the actor and target networks are updated once every d steps. This separation allows the critics to converge to more accurate value estimates before influencing the policy update, thereby reducing oscillations and improving convergence speed.
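For concreteness, the clipped double-Q target with target policy smoothing can be sketched in a few lines of NumPy. The critic and actor callables, noise scale, and clipping bound below are illustrative placeholders rather than the exact networks and hyperparameters used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, s_next, q1_target, q2_target, actor_target,
               gamma=0.99, sigma_n=0.2, c=0.5):
    """Clipped double-Q target with target policy smoothing (TD3).

    q1_target, q2_target: callables (s, a) -> scalar target-critic estimates.
    actor_target: callable s -> target action vector.
    """
    a_next = actor_target(s_next)
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = np.clip(rng.normal(0.0, sigma_n, size=a_next.shape), -c, c)
    a_smoothed = a_next + noise
    # Clipped double-Q: suppress overestimation via the pairwise minimum.
    q_min = min(q1_target(s_next, a_smoothed), q2_target(s_next, a_smoothed))
    return r + gamma * q_min
```

In a full implementation, the resulting target y would be regressed against both critics through the mean-squared Bellman error, with the actor and target networks updated only once every d critic steps.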
Overall, TD3 serves as the low-level continuous control module in the hierarchical framework, providing stable and energy-aware beacon power adaptation in the presence of nonlinear channel dynamics, stochastic noise, and mobility-induced uncertainties. Higher-level components, such as entropy-regularized exploration and cooperative learning extensions, are built on top of this stable TD3 backbone, as described in subsequent subsections.
The stability of the proposed learning framework follows from established results in actor–critic reinforcement learning. In particular, TD3 mitigates overestimation bias through clipped double-Q learning and delayed policy updates, which reduce gradient variance and prevent rapid oscillations during training. Under standard assumptions of bounded rewards, Lipschitz-continuous function approximators, and sufficiently small learning rates, gradient-based updates converge to a local stationary point of the expected return objective [38,39]. For the distributed setting, consensus-based parameter averaging in D2DPG ensures asymptotic agreement among local policies under connected communication graphs and diminishing step sizes, as established in distributed stochastic approximation theory.
(b) Entropy-Regularized SAC for Robust Exploration: The SAC augments the learning objective with an entropy term to prevent premature convergence:
$$J_{\mathrm{SAC}} = \mathbb{E}_{(s,a) \sim \mathcal{D}}\!\left[ Q_{\theta}(s,a) - \alpha \log \pi_{\phi}(a \mid s) \right],$$
where α controls the exploration–exploitation balance.
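A mini-batch estimate of this entropy-regularized objective reduces to a one-line average; the array inputs below stand in for critic outputs and policy log-probabilities sampled from the replay buffer and are purely illustrative:

```python
import numpy as np

def sac_actor_objective(q_values, log_probs, alpha=0.2):
    """Monte Carlo estimate of J_SAC = E[Q(s,a) - alpha * log pi(a|s)]
    over a mini-batch drawn from the replay buffer D."""
    return float(np.mean(q_values - alpha * log_probs))
```

A larger alpha rewards higher-entropy (more exploratory) policies, while annealing alpha toward zero recovers a deterministic objective.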
(c) Cooperative Multi-Agent and Distributed Extensions: For multi-AUV or cooperative scenarios, a MADDPG configuration is used, where each agent m optimizes
$$\nabla_{\phi_m} J_m = \mathbb{E}\!\left[ \nabla_{a_m} Q_{\theta_m}(s, a)\, \nabla_{\phi_m} \pi_{\phi_m}(s_m) \right],$$
based on centralized critics and decentralized actors. For large-scale deployments, D2DPG enables consensus-based policy sharing:
$$\phi_i^{t+1} = \phi_i^{t} + \eta \sum_{j \in \mathcal{N}_i} w_{ij}\big( \phi_j^{t} - \phi_i^{t} \big),$$
ensuring convergence $\|\phi_i^{t} - \phi^{*}\| \to 0$ across the network under standard consensus assumptions.
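The consensus update can be illustrated on a toy network: with a doubly stochastic weight matrix over a connected graph, repeated application drives all local parameter copies to a common value (here the network average). The weights and step size below are illustrative, not the deployed configuration:

```python
import numpy as np

def consensus_step(phis, W, eta=0.5):
    """One D2DPG consensus update:
    phi_i <- phi_i + eta * sum_j w_ij * (phi_j - phi_i)."""
    phis = np.asarray(phis, dtype=float)
    n = len(phis)
    new = phis.copy()
    for i in range(n):
        new[i] = phis[i] + eta * sum(W[i][j] * (phis[j] - phis[i])
                                     for j in range(n))
    return new
```

For example, three agents initialized at 0, 3, and 6 with uniform mixing weights converge to the common value 3 after repeated steps.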
The unified optimization objective across all variants is
$$\max_{\pi_\theta}\; \mathbb{E}_{\pi_\theta}\!\left[ \sum_{k=0}^{K} \gamma^{k} \Big( -\,\omega_1 f_1\big(\bar{\varepsilon}(k)\big) - \omega_2 f_2\big(E_{\mathrm{tot}}(k)\big) - \omega_3 f_3\big(\bar{\rho}(k)\big) \Big) \right],$$
subject to beacon power and velocity constraints:
$$a_i(k) \le P_T, \qquad n(k) \le N_{\max},$$
$$\| x(k+1) - x(k) \|_2 \le v_{\max}\, \Delta t.$$
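In practice, these constraints can be enforced by projecting each raw actor output onto the feasible set: quantizing to the nearest discrete power level, capping at $P_T$, and radially clipping the position update to the velocity limit. The level set and limits below are illustrative values, not those defined in Section 3:

```python
import numpy as np

def project_action(a, power_levels, P_T):
    """Quantize each continuous power output to the nearest feasible
    discrete level, then enforce the per-beacon cap a_i(k) <= P_T."""
    levels = np.sort(np.asarray(power_levels, dtype=float))
    idx = np.argmin(np.abs(levels[None, :] - np.asarray(a, float)[:, None]),
                    axis=1)
    return np.minimum(levels[idx], P_T)

def clamp_motion(x_next, x, v_max, dt):
    """Enforce ||x(k+1) - x(k)||_2 <= v_max * dt by radial clipping."""
    x, x_next = np.asarray(x, float), np.asarray(x_next, float)
    step = x_next - x
    norm = np.linalg.norm(step)
    limit = v_max * dt
    if norm > limit:
        step *= limit / norm
    return x + step
```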
The proposed framework adopts a unified actor–critic architecture in which TD3, SAC, MADDPG, and D2DPG are integrated in a hierarchical and modular fashion, rather than being executed as independent or parallel learning agents. At the core of the framework lies a shared deep neural network backbone composed of a common feature extractor, a single actor network, and a pair of critic networks. The feature extractor maps the state vector s ( k ) into a latent representation that is shared across all learning components, ensuring consistent state interpretation and reducing training complexity. The actor network outputs continuous beacon power-allocation actions a ( k ) , while two critic networks are instantiated following the TD3 paradigm to estimate state–action values and mitigate overestimation bias through clipped double-Q learning. Actor and critic updates follow the delayed-update strategy of TD3, which stabilizes training under continuous action spaces. To enhance exploration during early training stages, the actor objective is augmented with an entropy regularization term inspired by SAC. This entropy term encourages stochastic action sampling and prevents premature convergence to suboptimal deterministic policies. As training progresses, the entropy coefficient is gradually annealed, resulting in a purely deterministic TD3 policy at convergence, which is used during execution. For cooperative localization scenarios involving multiple AUVs, the critic networks are extended to a centralized training configuration following the MADDPG framework. In this setting, each critic receives joint observations and joint actions from all agents, enabling the learning of coordinated beacon selection and power control strategies. The actor networks, however, remain decentralized and identical across agents, allowing each AUV to operate using only local observations at execution time. 
This centralized training decentralized-execution (CTDE) structure ensures scalability while preserving coordination benefits.
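As noted above, the entropy coefficient is gradually annealed so that SAC-style stochastic exploration fades into a purely deterministic TD3 policy at convergence. One plausible realization is a simple linear schedule; the constants below are illustrative assumptions, not the exact schedule used in this work:

```python
def annealed_alpha(step, alpha0=0.2, decay_steps=50_000, alpha_min=0.0):
    """Linearly anneal the entropy coefficient from alpha0 to alpha_min,
    so stochastic exploration fades and the deterministic TD3 policy
    remains at convergence."""
    frac = min(step / decay_steps, 1.0)
    return alpha_min + (alpha0 - alpha_min) * (1.0 - frac)
```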
In large-scale or communication-constrained deployments, the framework further incorporates D2DPG as a training-level enhancement. Under D2DPG, each agent performs local actor–critic updates using its own experience replay buffer, followed by periodic consensus-based parameter synchronization with neighboring agents. This consensus step aligns local policy parameters across the network without requiring raw experience sharing, thereby reducing communication overhead and preserving data privacy. Importantly, D2DPG does not modify the policy architecture or reward structure; it solely governs how parameters are exchanged during training. Overall, the proposed framework maintains a single policy representation throughout training and deployment, with TD3 forming the deterministic learning backbone, SAC providing adaptive exploration, MADDPG enabling cooperative optimization, and D2DPG ensuring distributed scalability. At execution time, only the trained actor network is deployed onboard the AUV, guaranteeing low computational complexity and real-time feasibility in underwater environments. Algorithm 1 depicts the RL-based AUV localization framework.
Algorithm 1 RL-based AUV localization framework
1: Initialize: actor–critic networks π ϕ and Q θ , replay buffer D
2: Set learning rates α , β , discount factor γ , and exploration policy ϵ
3: for episode e = 1 to E do
4:     Initialize AUV position x ( 0 ) and select initial beacon set Y ( 0 )
5:     for time step k = 1 to K do
6:         Observe current state s ( k )
7:         Select action a ( k ) ∼ π ϕ ( s ( k ) ) (with exploration ϵ )
8:         Execute a ( k )
9:         Measure RTT and RSS, update estimated position x ^ ( k )
10:        Compute reward r ( k )
11:        Store transition ( s ( k ) , a ( k ) , r ( k ) , s ( k + 1 ) ) in replay buffer D
12:        Sample mini-batch from D and update actor–critic networks via TD3/SAC
13:        if multi-agent mode then
14:            Synchronize critic networks among agents (MADDPG)
15:        else if distributed mode then
16:            Exchange parameters with neighboring agents (D2DPG)
17:        end if
18:    end for
19:    Optional: decay exploration parameter ϵ or learning rates
20: end for
21: Return: trained policy π ϕ and value network Q θ

4.3. Complexity and Convergence

Let $N_b$ denote the number of active beacons, $L$ the number of episodes, and $H$ the horizon length. The per-iteration computational complexity is $O(N_b d_h^2)$, where $d_h$ is the hidden-layer width. TD3 and SAC scale linearly with $N_b$, while MADDPG adds $O(M N_b)$ overhead for inter-agent communication and D2DPG introduces $O(N_b \log N_b)$ for neighborhood consensus.
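These scaling laws can be summarized in a small cost model; the unit constants are purely illustrative, and only the asymptotic terms mirror the analysis above:

```python
import math

def per_iteration_cost(n_b, d_h, algo="td3", m_agents=1):
    """Asymptotic per-iteration cost model: O(N_b * d_h^2) for the shared
    actor-critic backbone plus algorithm-specific overhead (illustrative
    unit constants, not measured FLOPs)."""
    base = n_b * d_h ** 2
    if algo in ("td3", "sac"):
        return base                                  # linear in N_b
    if algo == "maddpg":
        return base + m_agents * n_b                 # + O(M * N_b)
    if algo == "d2dpg":
        return base + n_b * math.log2(max(n_b, 2))   # + O(N_b log N_b)
    raise ValueError(f"unknown algorithm: {algo}")
```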
Under bounded rewards and Lipschitz-continuous value functions, the policy sequence $\{\pi_{\theta_t}\}$ converges almost surely to a stationary point $\pi^*$ satisfying
$$Q^{*}(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim P}\!\left[ \max_{a'} Q^{*}(s', a') \right].$$
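This fixed point can be verified numerically on a small tabular MDP by iterating the Bellman optimality backup until convergence; the toy rewards and transitions below are illustrative only:

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, iters=500):
    """Iterate Q <- R + gamma * E_{s'~P}[max_a' Q(s', a')] to the fixed
    point Q*. R: (S, A) rewards; P: (S, A, S) transition probabilities."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(iters):
        V = Q.max(axis=1)        # V(s') = max_a' Q(s', a')
        Q = R + gamma * (P @ V)  # Bellman optimality backup
    return Q
```

For a single-state MDP with action rewards (0, 1) and γ = 0.9, the iteration converges to Q* = (9, 10), which indeed satisfies the optimality equation.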
Empirical results demonstrate that the proposed TD3, SAC, MADDPG, and D2DPG hybrid system achieves high localization accuracy, low energy usage, and resilient convergence in uncertain, bandwidth-limited underwater environments.

5. Results and Performance Evaluation

This section presents a comprehensive evaluation of the proposed DRL-based localization framework, highlighting its effectiveness in UAWSNs. We first describe the simulation setup and experimental parameters in Section 5.1, including network topology, sensor deployment, and RL configuration, and introduce the baseline schemes in Section 5.2. The subsequent analysis focuses on key performance metrics: localization accuracy (Section 5.3), energy consumption (Section 5.4), and overall system utility (Section 5.5). We further investigate the convergence characteristics of various DRL algorithms in Section 5.6 and evaluate the sensitivity of the framework to the number of deployed beacons in Section 5.7. Collectively, these results demonstrate that the proposed approach consistently outperforms classical TD-only, DoA-only, and hybrid methods in terms of accuracy, efficiency, and robustness, validating the advantages of reinforcement learning in adaptive, multi-node localization scenarios. Table 1 depicts the simulation parameters and their values.

5.1. Experimental Setup

All simulations were conducted on a high-performance workstation equipped with an NVIDIA GeForce RTX 5070 GPU, a 14th-generation Intel Core i7 processor, and 256 GB of RAM. The simulation environment was developed specifically for this study and implements the complete system model, antenna array geometry, signal processing pipeline, and reinforcement learning framework described in previous sections. A 100 × 100 m two-dimensional operational area was used to deploy sensor nodes, whose spatial distribution was varied across experiments to evaluate the impact of network density. The aggregate node, located at a fixed reference position, was equipped with L antennas to perform DoA estimation. TD and DoA measurements were corrupted with additive white Gaussian noise to emulate realistic underwater acoustic impairments over a range of SNR conditions.
The RL module was configured using a standard Q-learning formulation in which the state encapsulates the instantaneous localization error, the number of contributing sensor nodes, and relevant measurement parameters. The learning agent refines the initial TD/DoA-based estimate by selecting appropriate actions that minimize long-term localization error and energy expenditure, with exploration governed through an ε-greedy strategy. The proposed RL-enhanced localization scheme was evaluated against three classical methods: TD-only, DoA-only, and a non-RL hybrid TD+DoA estimator. Each scenario was averaged over 50 Monte Carlo trials to ensure statistical reliability. Table 2 illustrates the simulation parameters.
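The ε-greedy rule used by the Q-learning agent is standard: with probability ε a uniformly random action is explored, otherwise the greedy action is exploited. A minimal sketch with hypothetical Q-values:

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action from the Q-values of the current state: explore
    uniformly with probability epsilon, otherwise exploit the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))
```

Decaying ε over episodes, as in the setup above, shifts the agent from exploration toward exploitation.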

5.2. Baseline Schemes

In this section, we evaluate the performance of our proposed reinforcement learning–based localization methods, namely, SAC, TD3, D2DPG, and MADDPG, by comparing them with two state-of-the-art localization approaches and a theoretical performance bound. The first benchmark is the Floating Localization Algorithm (FLA) proposed in [31], which addresses localization for drifting restricted floating ocean sensor networks. The second benchmark is the recent reinforcement learning-based AUV localization framework presented in [40], which we refer to as SRLUWL and SDRLUWL. In addition to algorithmic comparisons, we also evaluate our proposed methods against the theoretical Cramér–Rao Lower Bound (CRLB) to assess their proximity to the optimal achievable localization accuracy. The comparative results demonstrate that the proposed SAC, TD3, D2DPG, and MADDPG-based localization methods achieve superior localization accuracy, faster convergence, and improved robustness under complex underwater acoustic environments.

5.3. Localization Accuracy Analysis

The localization accuracy trends presented in Figure 3 clearly demonstrate the effectiveness of the proposed RL-based framework compared to traditional estimation methods. As the number of sensor nodes increases, all methods experience some reduction in localization error due to improved geometric diversity and higher measurement redundancy. However, the magnitude and consistency of this improvement vary significantly across the compared approaches. The TD-only and DoA-only estimators exhibit the highest RMSE values across all node densities. Their curves decline slowly and remain relatively steep, indicating that neither modality alone provides sufficient information for high-precision localization, especially in scenarios with noise, multi-path distortions, or unfavorable beacon geometry. The TD-only curve shows mild fluctuations at lower node densities, a symptom of time-delay sensitivity to noise, while the DoA-only curve stays smoother but consistently performs worse due to angular estimation errors dominating in underwater conditions.
The traditional hybrid TD+DoA estimator performs better than the individual components, and its curve benefits more noticeably from the addition of sensor nodes. Nevertheless, the downward trend of its RMSE still lacks smoothness and exhibits small oscillations. These inconsistencies arise from the fact that classical hybrid fusion methods treat measurements uniformly, without contextual awareness of which nodes contribute the most information under varying geometrical configurations, depths, and channel distortions. As a result, the method fails to fully exploit the potential benefits of increasing node density. In contrast, the proposed RL-based method shows a markedly different accuracy profile. Its curve starts significantly lower than all baseline methods, even at low node counts, highlighting the agent’s ability to make informed decisions regarding node selection and transmit power allocation from the early stages. As the network becomes denser, the RL curve descends rapidly and then gradually flattens, showing a clear diminishing-return effect once the agent saturates its achievable accuracy. The smoothness of this curve reflects the stability and robustness of the learned policy: instead of relying on raw measurements alone, the RL framework selectively prioritizes high-quality signals, suppresses unreliable nodes, and dynamically allocates energy based on current channel conditions. A subtle but important detail in the RL curve is its minimal variance across the entire range of node densities. Unlike classical methods, which show small but noticeable irregularities due to measurement inconsistencies, the RL method maintains a monotonic and stable improvement trajectory. This behavior indicates that the RL agent has internalized a consistent strategy for handling multi-path noise, unequal node spacing, and dynamic underwater channel variations, all of which contribute to error spikes in non-learning approaches. 
Overall, the RMSE results verify that the RL-based localization system not only achieves the lowest absolute error but also demonstrates greater consistency, resilience to uncertainty, and optimal exploitation of increasing node availability. These characteristics confirm the superiority of the proposed framework in producing stable and reliable localization in underwater acoustic environments.

5.4. Energy Consumption Evaluation

The behavior shown in Figure 4 reveals several important characteristics of the energy consumption profile across different localization strategies. As the number of sensor nodes increases, the energy consumption of the TD-only and DoA-only estimators grows almost linearly. This trend is expected because both classical methods rely on fixed sensing routines, and the addition of each new node directly contributes to increased communication and processing overhead. The non-RL hybrid approach shows slightly reduced consumption compared to the single-modality methods, but the increase with node count remains steep due to its lack of adaptive control. In contrast, the proposed RL-based approach exhibits a noticeably sublinear increase in energy usage. The curvature of its graph flattens as the node count grows, indicating that the agent increasingly benefits from measurement redundancy and does not require all nodes to remain active during each localization cycle. The graph shows a clear knee point where energy expenditure begins to stabilize, demonstrating that the RL agent has learned to prioritize nodes that contribute the highest information gain while suppressing unnecessary transmissions. At larger node densities, the RL curve remains significantly below the other methods, confirming the model’s ability to exploit environmental structure and geometric diversity for more energy-efficient localization. Another important detail is that the RL method avoids the spikes in consumption observed in the non-RL approaches under mid-range node counts. These spikes arise from amplifying errors that trigger additional correction cycles in classical methods. However, the RL agent anticipates such conditions and selects power levels and refinement actions that maintain stable performance, resulting in a smooth and gradually increasing energy curve. 
This illustrates the method’s capacity not only to reduce overall consumption but also to enforce consistency and avoid abrupt increases in resource usage.
To illustrate the practical implications of energy efficiency, consider a typical underwater beacon powered by a lithium battery with a capacity of 100 Wh, which is representative of long-term underwater acoustic sensor deployments. Based on the average energy consumption trends observed in Figure 4, the TD-only and DoA-only localization schemes consume approximately 0.5 Wh per localization cycle, allowing roughly 200 localization cycles before battery depletion. In contrast, the proposed DRL-based approach reduces the per-cycle energy consumption to approximately 0.3 Wh, enabling nearly 330 localization cycles using the same battery capacity. This corresponds to an effective 40% reduction in energy consumption per cycle and an approximate 65% extension in operational lifetime compared to classical methods. Such a substantial reduction in battery stress directly lowers the frequency of costly underwater maintenance and redeployment operations and significantly enhances mission endurance. These gains are particularly critical in underwater environments, where battery replacement is logistically challenging, time-consuming, and expensive.
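The lifetime figures quoted above follow directly from the battery arithmetic; the sketch below reproduces the calculation under the stated assumptions (100 Wh capacity, 0.5 Wh versus 0.3 Wh per localization cycle):

```python
def cycles(capacity_wh, per_cycle_wh):
    """Localization cycles achievable on one battery charge."""
    return capacity_wh / per_cycle_wh

classical = cycles(100, 0.5)          # 200 cycles (TD-only / DoA-only)
drl = cycles(100, 0.3)                # ~333 cycles (proposed DRL)
energy_saving = 1 - 0.3 / 0.5         # 40% less energy per cycle
lifetime_gain = drl / classical - 1   # ~67% more cycles per charge
```

The exact cycle-count ratio is roughly 67%, consistent with the approximate 65% operational-lifetime extension reported above.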

5.5. Utility Function Performance

The utility performance plotted in Figure 5 provides a comprehensive view of how well each localization method balances accuracy, energy, and reliability. The TD-only and DoA-only curves start with relatively low utility values and grow slowly with increasing node density. This slow growth reflects their limited ability to leverage additional sensor nodes, as both methods are constrained by their reliance on a single measurement modality. The modest slope in their curves shows that even when more nodes become available, the resulting improvement in localization and thus utility is marginal. The non-RL hybrid estimator performs better and exhibits a steeper positive slope, indicating that combining TD and DoA indeed yields higher utility. However, its graph still displays irregular fluctuations, particularly in the mid-density region, where conflicting measurement noise causes occasional drops in utility. These dips illustrate the sensitivity of traditional hybrid fusion methods to geometric dilution of precision and noise variance when operating without adaptive adjustment. The RL-based curve, in contrast, shows both the highest utility values across all node configurations and the smoothest upward trajectory. The monotonic rise of the RL curve demonstrates how the learning agent progressively improves its decision making as more nodes become available. Its ability to fuse TD and DoA, selectively choosing the most informative nodes and optimal power levels, results in a consistently higher slope than the competing methods. The graph shows a clear saturation region where the utility plateaus, indicating that the agent has learned to operate near optimality, and further increases in node density yield diminishing marginal returns. This plateau confirms that the RL framework successfully identifies the optimal balance between accuracy improvements and energy expenditure. Another noteworthy feature is the absence of sudden dips or oscillations in the RL curve. 
This stability reflects the robustness of the learned policy, which mitigates the impact of measurement noise, varying node geometry, and changing environmental conditions. The smooth profile and consistently superior values in Figure 5 collectively validate that reinforcement learning provides not only improved accuracy and energy efficiency but also substantially improved overall system utility.

5.6. Convergence Behavior of Learning Algorithms

Figure 6 illustrates the convergence characteristics of the proposed DRL-based localization schemes, offering insight into their behavior in terms of RMSE evolution, energy consumption trends, and overall utility improvement over time. The RMSE curves show that all learning-based methods gradually reduce localization error as the number of interaction steps increases, reflecting successful policy refinement through continuous exploration and environment feedback. TD3 and SAC converge particularly rapidly in the early stages due to stabilized critic updates in TD3 and entropy-regularized exploration in SAC, allowing both agents to avoid suboptimal deterministic actions. MADDPG and D2DPG converge more slowly but eventually achieve lower RMSE values by leveraging cooperative and distributed learning, which enables improved credit assignment in multi-node settings. In comparison, FLA and SRLUWL exhibit only marginal improvement and follow nearly linear trajectories, underscoring their limited capacity to adapt to dynamic underwater conditions. The CRLB plotted alongside these curves serves as a fixed reference and demonstrates how the DRL methods progressively approach the theoretical accuracy limit. This collective behavior confirms that the learned policies not only reduce localization error effectively but also stabilize quickly enough for real-time acoustic deployments. Energy consumption over time shows a similarly advantageous trend for DRL-based methods, with all learning agents gradually reducing power usage as training progresses. D2DPG achieves the most pronounced reduction, suggesting that distributed value propagation enables the agents to identify energy-efficient transmission strategies that do not compromise localization precision. TD3 and MADDPG also demonstrate substantial reductions because both methods are capable of suppressing unnecessary beacon activations once sufficient information has been acquired. 
In contrast, the FLA baseline maintains consistently high energy expenditure, highlighting the limitations of static transmission power and fixed beacon participation. These observations collectively indicate that the DRL models internalize energy-aware behavior naturally as part of long-term reward optimization. The utility metric, which jointly incorporates accuracy, energy usage, and beacon reliability, increases steadily for all DRL-based approaches. D2DPG once again presents the steepest rise, benefiting from shared learning signals that enhance coordination in distributed underwater networks. MADDPG and SAC similarly exhibit monotonic growth patterns, maintaining stable increases throughout training. Baseline methods such as FLA and SRLUWL grow much more slowly and remain significantly below the DRL curves due to the absence of adaptive decision making. The evolving shape of the utility curves demonstrates that the DRL methods improve overall system performance holistically rather than optimizing isolated aspects, thereby achieving a balanced integration of accuracy, efficiency, and robustness throughout the learning process.

5.7. Sensitivity to Number of Beacons

Figure 7 examines the influence of beacon quantity on the performance of the proposed localization framework. As the number of active beacons increases, RMSE decreases across all algorithms because additional beacons improve geometric diversity and reduce uncertainty in the multilateration process. Although performance improves for all methods, the DRL-based algorithms consistently achieve significantly lower RMSE values, even when using fewer beacons. D2DPG, MADDPG, TD3, and SAC maintain clear advantages over the baselines, and their RMSE curves begin to saturate as the beacon count grows, indicating that the learned policies avoid redundant beacons and selectively utilize those offering the highest informational value. The CRLB serves as a benchmark and reveals that D2DPG approaches this theoretical lower bound more closely than other methods, particularly beyond nine to ten beacons, confirming that DRL-driven localization scales efficiently in dense acoustic networks. The energy consumption trends show that although energy consumption increases with the number of beacons for all methods, the growth rate differs significantly. DRL-based algorithms, especially D2DPG and SAC, exhibit the slowest increase because the learned policies avoid unnecessary transmissions and selectively activate only beneficial nodes. FLA, on the other hand, shows a steep rise in energy usage due to its fixed transmission power and inability to exploit the presence of multiple beacons intelligently. SDRLUWL and SRLUWL show moderate increases, reflecting partial adaptivity but lacking deeper optimization capability. These observations underline that the DRL frameworks handle additional beacons intelligently, activating them only when they contribute meaningfully to localization accuracy. The utility metric rises with increasing beacon count, demonstrating that larger beacon sets enhance overall performance by providing richer acoustic information.
D2DPG, MADDPG, and TD3 achieve the highest utility values, reflecting their superior ability to balance accuracy and energy efficiency. As the number of beacons increases beyond ten to eleven, the utility curves enter a saturation region where additional beacons provide only marginal improvement. This behavior is consistent with diminishing returns commonly observed in multilateration-based systems. Baseline methods remain significantly lower in utility due to their inability to adapt beacon selection or power levels effectively. The combined trends confirm that the RL-driven framework remains robust and efficient even as the network becomes denser, retaining high utility through informed and adaptive decision making.

5.8. Computational Complexity and Resource Analysis

Table 3 and Table 4 provide a comprehensive quantitative and asymptotic evaluation of the computational complexity associated with the considered localization algorithms. The classical Floating Localization Algorithm (FLA) exhibits the lowest computational burden, requiring approximately 1.2 × 10 6 floating-point operations (FLOPs) per episode and minimal memory resources due to its closed-form processing and absence of learning components. This low computational cost is accompanied by short training and execution times; however, the limited adaptivity of FLA results in inferior localization accuracy and energy efficiency, as reflected in Figure 3, Figure 4 and Figure 5. The learning-based baseline schemes, SRLUWL and SDRLUWL, introduce moderate computational overhead arising from neural network inference and parameter updates. Their increased FLOPs per episode and memory consumption stem from function approximation and experience replay, yet their complexity growth remains subquadratic with respect to the number of beacons, as shown in Table 4. While these methods achieve improved performance compared to classical approaches, their limited coordination capabilities constrain scalability in dense deployments.
Among the proposed DRL-based localization strategies, SAC and TD3 demonstrate a favorable trade-off between computational complexity and performance gains. Both algorithms maintain linear complexity, scaling as $O(N)$, while achieving significant reductions in RMSE and notable improvements in system utility. Although their training times are higher than those of non-learning methods, the average inference time per episode remains sufficiently low to support real-time underwater localization. In contrast, MADDPG and D2DPG incur higher computational and memory costs due to multi-agent coordination and the use of centralized or distributed critic architectures. This design leads to quadratic complexity growth, $O(N^2)$, particularly evident in scenarios with a large number of beacons. Nevertheless, the additional overhead is offset by superior convergence behavior, improved stability, and closer proximity to the theoretical Cramér–Rao Lower Bound. Overall, the presented results confirm that the proposed DRL framework remains computationally tractable and scalable, enabling a controllable trade-off between computational cost and localization performance in practical underwater acoustic networks.

5.9. Adaptivity Analysis Under Dynamic and Adverse Conditions

To explicitly demonstrate the adaptive capability of the proposed DRL-based localization framework, two stress-test experiments were conducted under dynamic and adverse operating conditions. These experiments evaluate how the learning-based approach responds to non-stationary environmental statistics and abrupt network disruptions, which are commonly encountered in long-term underwater acoustic deployments. In both experiments, the localization process is trained over 3000 episodes, and the Root Mean Square Error (RMSE) is recorded at each episode to capture the temporal evolution of localization accuracy. During the initial phase (episodes 1–1500), the environment remains stationary, allowing both methods to converge under nominal conditions. At episode 1500, a deliberate disturbance is introduced. The classical localization method employs fixed beacon participation and static processing rules, whereas the proposed DRL-based method dynamically adapts beacon selection and power allocation through continuous interaction with the environment.

5.9.1. Adaptivity to Time-Varying Noise Statistics

This experiment evaluates robustness against non-stationary measurement noise, which commonly arises in underwater acoustic channels due to temporal variations in temperature, salinity, and ambient interference. At episode 1500, the ranging noise statistics are altered to emulate a sudden environmental change without modifying the network topology. The resulting RMSE evolution is illustrated in Figure 8. Prior to the noise variation, both methods exhibit a gradual reduction in RMSE as training progresses, with the DRL-based method consistently achieving lower localization error than the classical approach. Following the change in noise statistics, the classical method experiences a noticeable increase in RMSE and subsequently shows a slow degradation trend, as observed in Figure 8. This behavior reflects the inherent limitation of fixed-parameter localization schemes, which are unable to adjust to evolving noise conditions once deployed.
In contrast, the DRL-based method exhibits a temporary increase in RMSE immediately after the noise variation, followed by a gradual recovery over subsequent episodes. As shown in Figure 8, the localization error steadily decreases and converges to a stable trajectory under the new noise regime. This behavior indicates that the learning agent successfully adapts its policy by implicitly learning the altered noise characteristics and adjusting beacon participation and power allocation accordingly. The absence of full recovery to pre-disturbance RMSE levels is expected, as higher noise variance imposes a fundamental accuracy limit. Overall, the results in Figure 8 demonstrate effective statistical adaptivity of the proposed framework under time-varying acoustic conditions.

5.9.2. Adaptivity to Sudden Beacon Failure

This experiment examines the framework’s response to a structural disturbance in which 30% of the beacons are randomly deactivated at episode 1500. This scenario models practical failure modes such as hardware malfunction, energy depletion, and communication outages in underwater sensor networks. The corresponding localization performance is shown in Figure 9. Before the beacon failure, both methods show improving localization accuracy, with the DRL-based approach maintaining a clear performance advantage. At the failure point, the classical method suffers a sharp and sustained increase in RMSE due to the sudden loss of geometric diversity and measurement redundancy, as illustrated in Figure 9. Although a slow decrease in RMSE is observed afterward, the localization accuracy remains significantly worse than pre-failure levels, indicating a permanent degradation caused by the reduced beacon set.
The DRL-based method also experiences an abrupt increase in RMSE immediately after beacon failure, reflecting the sudden topology change. However, unlike the classical approach, the DRL curve in Figure 9 exhibits a distinct recovery phase. After a short transient period, the RMSE decreases steadily and stabilizes at a new plateau. While this plateau is slightly higher than the pre-failure accuracy—an expected consequence of reduced beacon availability—it remains substantially lower than that of the classical method throughout the remainder of training. This behavior demonstrates the ability of the proposed framework to reselect informative beacons and reallocate transmission power among the remaining nodes to compensate for anchor loss. Overall, the results presented in Figure 8 and Figure 9 confirm that the proposed DRL-based localization framework exhibits strong adaptivity to both statistical disturbances and structural network changes, enabling robust operation under unknown, time-varying, and adverse underwater conditions.
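The failure event and the subsequent reselection among surviving beacons can be illustrated with a short sketch. Representing the beacon population as a set of IDs and scoring candidates with a learned utility are our own assumptions for illustration; the paper's agent performs this reselection implicitly through its policy.

```python
import random

def deactivate_beacons(active, fraction=0.30, rng=None):
    """Randomly deactivate `fraction` of the active beacons, modeling the
    30% failure event at episode 1500. `active` is a set of beacon IDs
    (an assumed representation)."""
    rng = rng or random.Random(0)
    n_fail = int(len(active) * fraction)
    failed = set(rng.sample(sorted(active), n_fail))
    return active - failed

def reselect(active, score):
    """Rank surviving beacons by a (placeholder) utility score, highest
    first, so the most informative anchors are favored after the failure."""
    return sorted(active, key=score, reverse=True)
```

After calling `deactivate_beacons`, the agent would continue training on the reduced set, reallocating transmit power among the top-ranked survivors; this corresponds to the recovery phase visible after the transient RMSE spike.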

6. Conclusions

In this article, we presented an adaptive, energy-efficient localization framework for AUVs in UAWSNs. Using DRL, the proposed framework jointly optimizes beacon selection and transmit power allocation, addressing the coupled challenges of localization accuracy and energy efficiency in dynamic underwater environments. The integration of multiple DRL algorithms, including TD3, SAC, MADDPG, and D2DPG, enables effective decision making in continuously evolving environments, while a real-time, boundary-based geometric localization model enhances the accuracy of position estimation. The derived CRLB provides insight into the theoretical performance limits of the localization process, and the computational complexity analysis demonstrates the scalability of the proposed framework to large-scale UAWSN deployments. Compared with conventional protocols, the proposed approach achieves substantially lower energy consumption and longer network lifetime. In summary, this article provides a robust solution for adaptive, energy-efficient AUV localization in UAWSNs, with clear potential for large-scale, real-world applications. Future work will explore additional optimization methods, real-time experimental validation, and the fusion of additional sensor modalities to further improve the robustness and precision of the localization framework.

Author Contributions

Conceptualization, Z.U.K. and H.G.; methodology, Z.U.K. and H.G.; software, Z.U.K., H.G., and F.K.; validation, Z.U.K., H.G., F.K., S.A.H.M., A.M., and H.N.C.; formal analysis, Z.U.K., H.G., F.K., S.A.H.M., A.M., and H.N.C.; investigation, Z.U.K., H.G., and F.K.; resources, H.G.; data curation, H.G., F.K., S.A.H.M., A.M., and H.N.C.; writing—original draft preparation, Z.U.K.; writing—review and editing, Z.U.K., H.G., F.K., S.A.H.M., A.M., and H.N.C.; visualization, Z.U.K., H.G., F.K., S.A.H.M., A.M., and H.N.C.; supervision, H.G.; project administration, H.G.; funding acquisition, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62372131), the Natural Science Foundation of Heilongjiang Province (LH2020F017), and the Fundamental Research Funds for the Central Universities (3072025YC0801).

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors extend their appreciation to Hongyuan Gao of the College of Information and Communication Engineering, Harbin Engineering University, for supervising this research project.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Essaky, S.; Raja, G.; Dev, K.; Niyato, D. ARReSVG: Intelligent Multi-UAV Navigation in Partially Observable Spaces Using Adaptive Deep Reinforcement Learning Approach. IEEE Trans. Veh. Technol. 2025, 74, 15429–15440. [Google Scholar] [CrossRef]
  2. Ioannou, G.; Forti, N.; Millefiori, L.M.; Carniel, S.; Renga, A.; Tomasicchio, G.; Binda, S.; Braca, P. Underwater inspection and monitoring: Technologies for autonomous operations. IEEE Aerosp. Electron. Syst. Mag. 2024, 39, 4–16. [Google Scholar] [CrossRef]
  3. Zhang, Y.; Zhu, J.; Wang, H.; Shen, X.; Wang, B.; Dong, Y. Deep reinforcement learning-based adaptive modulation for underwater acoustic communication with outdated channel state information. Remote Sens. 2022, 14, 3947. [Google Scholar] [CrossRef]
  4. Gang, Q.; Muhammad, A.; Khan, Z.U.; Khan, M.S.; Ahmed, F.; Ahmad, J. Machine learning-based prediction of node localization accuracy in IIoT-based MI-UWSNs and design of a TD coil for omnidirectional communication. Sustainability 2022, 14, 9683. [Google Scholar] [CrossRef]
  5. Yan, J.; Yi, M.; Yang, X.; Luo, X.; Guan, X. Broad-learning-based localization for underwater sensor networks with stratification compensation. IEEE Internet Things J. 2023, 10, 13123–13137. [Google Scholar] [CrossRef]
  6. Campo-Valera, M.; Diego-Tortosa, D.; Asorey-Cacheda, R. Signal Processing for Estimating the Time of Arrival and Amplitude of Nonlinear Underwater Acoustic Waves. In Research and Applications of Digital Signal Processing; IntechOpen: London, UK, 2025. [Google Scholar]
  7. Yan, J.; Guan, X.; Yang, X.; Chen, C.; Luo, X. A Survey on Integration Design of Localization, Communication and Control for Underwater Acoustic Sensor Networks. IEEE Internet Things J. 2025, 12, 6300–6324. [Google Scholar] [CrossRef]
  8. Muhammad, A.; Li, F.; Khan, Z.U.; Khan, F.; Khan, J.; Khan, S.U. Exploration of contemporary modernization in UWSNs in the context of localization including opportunities for future research in machine learning and deep learning. Sci. Rep. 2025, 15, 5672. [Google Scholar] [CrossRef] [PubMed]
  9. Nain, M.; Goyal, N.; Dhurandher, S.K.; Dave, M.; Verma, A.K.; Malik, A. A survey on node localization technologies in UWSNs: Potential solutions, recent advancements, and future directions. Int. J. Commun. Syst. 2024, 37, e5915. [Google Scholar] [CrossRef]
  10. Kim, Y.; Erol-Kantarci, M.; Noh, Y.; Kim, K. Range-free localization with a mobile beacon via motion compensation in underwater sensor networks. IEEE Wireless Commun. Lett. 2020, 10, 6–10. [Google Scholar] [CrossRef]
  11. Muhammad, A.; Li, F.; Mohsan, S.A.; Khan, Z.U.; Khan, W.; Khan, S.U.; Han, Z.; Khan, F. Magneto Inductive (MI) Channel Variables Prediction through Machine Learning Linear regression method, for Underwater and Underground WSNs. IEEE Access 2025, 13, 33124–33137. [Google Scholar] [CrossRef]
  12. Khan, S.U.; Khan, Z.U.; Alkhowaiter, M.; Khan, J.; Ullah, S. Energy-efficient routing protocols for UWSNs: A comprehensive review of taxonomy, challenges, opportunities, future research directions, and machine learning perspectives. J. King Saud Univ. Comput. Inf. Sci. 2024, 36, 102128. [Google Scholar] [CrossRef]
  13. Tian, X.; Du, X.; Liu, X.; Wang, L.; Zhao, L. A Low-Delay Source-Location-Privacy Protection Scheme with Multi-AUV Collaboration for Underwater Acoustic Sensor Networks. IEEE Sens. J. 2025, 25, 12236–12252. [Google Scholar] [CrossRef]
  14. Yue, Y.; Pan, Z.; Li, S.; Su, W.; Han, J. Reinforcement Learning Based Smart UUV-IoUT Localization in Underwater Acoustic Topology Network. IEEE Internet Things J. 2025, 12, 16637–16652. [Google Scholar] [CrossRef]
  15. Su, R.; Gong, Z.; Li, C.; Han, S. High accuracy AUV-aided underwater Localization: Far-field information fusion perspective. IEEE Trans. Signal Process. 2024, 72, 1877–1891. [Google Scholar] [CrossRef]
  16. Chen, C.; Mei, X. Brief Literature Survey on Localization in Ocean Wireless Sensor Networks. J. Phys. Conf. Ser. 2025, 3055, 012045. [Google Scholar] [CrossRef]
  17. Liu, C.; Lv, Z.; Xiao, L.; Su, W.; Ye, L.; Yang, H.; You, X.; Han, S. Efficient beacon-aided AUV localization: A reinforcement learning based approach. IEEE Trans. Veh. Technol. 2024, 73, 7799–7811. [Google Scholar] [CrossRef]
  18. Li, Y.; Yu, W.; Xu, H.; Guan, X. Robust Multiple Autonomous Underwater Vehicle Cooperative Localization Based on the Principle of Maximum Entropy. IEEE Trans. Autom. Sci. Eng. 2025, 22, 12960–12974. [Google Scholar] [CrossRef]
  19. Hong, J.; Fulton, M.; Orpen, K.; Barthelemy, K.; Berlin, K.; Sattar, J. A Quantitative Evaluation of Bathymetry-Based Bayesian Localization Methods for Autonomous Underwater Robots. IEEE J. Ocean. Eng. 2025, 50, 985–1000. [Google Scholar] [CrossRef]
  20. Li, Y.; Yu, W.; Guan, X. Trajectory Planning-Aided Cooperative Localization for Multi-AUV Networks Under Harsh Communication Conditions: A Co-Designed Approach. IEEE Trans. Netw. 2025, 33, 3088–3103. [Google Scholar] [CrossRef]
  21. Wang, Y.; Song, S.; Liu, J.; Guo, X.; Cui, J. Efficient AUV-aided localization for large-scale underwater acoustic sensor networks. IEEE Internet Things J. 2024, 11, 31776–31790. [Google Scholar] [CrossRef]
  22. Huang, P.; Li, Y.; Wang, Y.; Guan, X. Information-entropy-based trajectory planning for AUV-aided network localization: A reinforcement learning approach. IEEE Internet Things J. 2024, 12, 2122–2134. [Google Scholar] [CrossRef]
  23. Li, Y.; Liu, M.; Zhang, S.; Zheng, R.; Lan, J. Node dynamic localization and prediction algorithm for internet of underwater things. IEEE Internet Things J. 2021, 9, 5380–5390. [Google Scholar] [CrossRef]
  24. Yan, J.; Zhao, H.; Luo, X.; Wang, Y.; Chen, C.; Guan, X. Asynchronous localization of underwater target using consensus-based unscented Kalman filtering. IEEE J. Ocean. Eng. 2019, 45, 1466–1481. [Google Scholar] [CrossRef]
  25. Yan, J.; Guo, D.; Luo, X.; Guan, X. AUV-aided localization for underwater acoustic sensor networks with current field estimation. IEEE Trans. Veh. Technol. 2020, 69, 8855–8870. [Google Scholar] [CrossRef]
  26. Fan, R.; Boukerche, A.; Pan, P.; Jin, Z.; Su, Y.; Dou, F. Secure Localization for Underwater Wireless Sensor Networks via AUV Cooperative Beamforming with Reinforcement Learning. IEEE Trans. Mobile Comput. 2024, 24, 924–938. [Google Scholar] [CrossRef]
  27. Middelkoop, J.M.; Celi, F.; Faggiani, A.; Hummel, H.; Bhulai, S.; Tesei, A.; Been, R.; Ferri, G. Optimizing Source Localization via Reinforcement Learning in Multi-Agent Underwater Networks. In Proceedings of the OCEANS 2025 Brest, Brest, France, 16–19 June 2025; pp. 1–10. [Google Scholar]
  28. Li, Y.; Cai, K.; Zhang, Y.; Tang, Z.; Jiang, T. Localization and tracking for AUVs in marine information networks: Research directions, recent advances, and challenges. IEEE Netw. 2019, 33, 78–85. [Google Scholar] [CrossRef]
  29. Liu, J.; Wang, Z.; Cui, J.-H.; Zhou, S.; Yang, B. A joint time synchronization and localization design for mobile underwater sensor networks. IEEE Trans. Mobile Comput. 2015, 15, 530–543. [Google Scholar] [CrossRef]
  30. Liu, B.; Chen, H.; Zhong, Z.; Poor, H.V. Asymmetrical round trip based synchronization-free localization in large-scale underwater sensor networks. IEEE Trans. Wireless Commun. 2010, 9, 3532–3542. [Google Scholar] [CrossRef]
  31. Luo, H.; Wu, K.; Gong, Y.-J.; Ni, L.M. Localization for drifting restricted floating ocean sensor networks. IEEE Trans. Veh. Technol. 2016, 65, 9968–9981. [Google Scholar] [CrossRef]
  32. Liu, X.; Han, F.; Ji, W.; Liu, Y.; Xie, Y. A novel range-free localization scheme based on anchor pairs condition decision in wireless sensor networks. IEEE Trans. Commun. 2020, 68, 7882–7895. [Google Scholar] [CrossRef]
  33. Zhang, L.; Zhang, T.; Shin, H.-S.; Xu, X. Efficient underwater acoustical localization method based on time difference and bearing measurements. IEEE Trans. Instrum. Meas. 2020, 70, 8501316. [Google Scholar] [CrossRef]
  34. Gong, Z.; Li, C.; Jiang, F.; Zheng, J. AUV-aided localization of underwater acoustic devices based on Doppler shift measurements. IEEE Trans. Wireless Commun. 2020, 19, 2226–2239. [Google Scholar] [CrossRef]
  35. Pinheiro, B.C.; Moreno, U.F.; de Sousa, J.T.; Rodríguez, O.C. Kernel-function-based models for acoustic localization of underwater vehicles. IEEE J. Ocean. Eng. 2016, 42, 603–618. [Google Scholar] [CrossRef]
  36. Wang, Q.; He, B.; Zhang, Y.; Yu, F.; Huang, X.; Yang, R. An autonomous cooperative system of multi-AUV for underwater targets detection and localization. Eng. Appl. Artif. Intell. 2023, 121, 105907. [Google Scholar] [CrossRef]
  37. Wang, Z.; Sui, Y.; Qin, H.; Lu, H. State super sampling soft actor–critic algorithm for multi-AUV hunting in 3D underwater environment. J. Mar. Sci. Eng. 2023, 11, 1257. [Google Scholar] [CrossRef]
  38. Fujimoto, S.; Van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning (ICML); PMLR: Brookline, MA, USA, 2018; pp. 1587–1596. [Google Scholar]
  39. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor–Critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning (ICML); PMLR: Brookline, MA, USA, 2018; pp. 1861–1870. [Google Scholar]
  40. Liu, C.; Chen, Y.; Xiao, L.; Yang, H.; Su, W.; You, X. Reinforcement learning-based AUV localization in underwater acoustic sensor networks. In Proceedings of the 2023 IEEE/CIC International Conference on Communications in China (ICCC); IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
Figure 1. Proposed system model architecture.
Figure 2. Architecture of the proposed system.
Figure 3. Localization estimation results based on RMSE.
Figure 4. Performance comparison of DRL-based schemes over time. (a) RMSE evolution, illustrating the localization/tracking accuracy achieved by different algorithms across time slots. (b) Energy consumption profile, highlighting the cumulative and instantaneous energy usage associated with each method. (c) Utility trend, reflecting the overall system reward that jointly captures accuracy, energy trade-offs, and long-term operational efficiency.
Figure 5. Performance comparison of DRL-based schemes over number of beacons. (a) RMSE evolution, illustrating the localization/tracking accuracy achieved by different algorithms across number of beacons. (b) Energy consumption profile, highlighting the cumulative and instantaneous energy usage associated with each method. (c) Utility trend, reflecting the overall system reward that jointly captures accuracy, energy trade-offs, and long-term operational efficiency.
Figure 6. Convergence behavior of DRL-based localization algorithms. (a) RMSE convergence over time slots, demonstrating the progressive improvement in localization accuracy as the learning process stabilizes. (b) Energy consumption convergence, showing the reduction and stabilization of energy usage with continued training and policy refinement. (c) Utility convergence trend, indicating the steady increase in overall system reward as the algorithms balance accuracy and energy efficiency during learning.
Figure 7. Convergence behavior of DRL-based localization algorithms. (a) RMSE convergence over number of beacons, demonstrating the progressive improvement in localization accuracy as the learning process stabilizes. (b) Energy consumption convergence, showing the reduction and stabilization of energy usage with continued training and policy refinement. (c) Utility convergence trend, indicating the steady increase in overall system reward as the algorithms balance accuracy and energy efficiency during learning.
Figure 8. Adaptivity to time-varying measurement noise. The ranging noise statistics change abruptly at episode 1500.
Figure 9. Adaptivity to sudden beacon failure. At episode 1500, 30% of the beacons are deactivated.
Table 1. Summary of related work (UAWSN localization and related problems).
Ref./Year | Proposed Method | Key Focus | Addressed Challenges | Key Findings | Limitations
Li, Y. et al. (2025) [18] | Robust cooperative localization for multi-AUVs | Cooperative localization for multi-AUVs in UWSNs | Uncertainties in UWSNs, model mismatches, measurement errors, communication channel variability | Enhances accuracy, scalability, and robustness of multi-AUV localization by addressing uncertainty (maximum entropy + belief propagation). | Real-time uncertainty handling can be complex; potential performance drops in extreme environments; high computational cost.
Hong, J. et al. (2025) [19] | New probabilistic technique for AUV bathymetry localization | Bayesian AUV bathymetry localization | Visual landmark localization in UWSNs, finding true location in UWSN scenarios | Compares EKF, UKF, Particle Filter, and MPF; MPF is most accurate under various conditions. | Simulations may not fully replicate real underwater complexities; performance may vary in non-ideal conditions.
Li, Y. et al. (2025) [20] | Integration of LCP for AUVs | Integrated method for LCP in AUV networks | Independent operation challenges; conflicting demands between communication and planning | Cooperative localization guided by trajectory planning; mutual enhancement under different communication conditions. | Potential real-time computational complexity; limited scalability for large networks; dependency on precise communication conditions.
Wang, Y. et al. (2024) [21] | Efficient AUV-based localization approach | Large-scale UAWSN localization using multiple AUVs | Complex path planning for multiple AUVs, harsh underwater conditions | Unified framework integrating path planning and localization. | Challenges in real-time implementation; sensitivity to environmental conditions (e.g., stratification); computational complexity for large-scale networks.
Huang, P. et al. (2024) [22] | Trajectory planning localization for AUVs using RL | Anchor-based AUV localization for UAWSNs | Harsh submerged environment, high network maintenance cost, reliance on fixed anchors (buoys) | RL-based trajectory planning reduces entropy and shortens AUV trajectory while maintaining localization accuracy. | Limited scalability for large UAWSNs; computational complexity of RL algorithms.
Li, Y. et al. (2021) [23] | Dynamic node localization and prediction | Dynamic localization for IoT | Environmental changes, dynamic node behavior | Improves localization accuracy in changing environments. | Requires real-time environmental monitoring.
Yan, J. et al. (2020) [24] | DRL for privacy-preserving localization | Privacy-preserving localization in underwater networks | Security, privacy concerns | Provides a secure, privacy-preserving localization technique. | High computational cost for real-time applications.
Fan, R. et al. (2024) [26] | Secure localization in UWSNs using cooperative beamforming | Secure localization in UWSNs with cooperative beamforming | Privacy leaks, eavesdropping risks, multi-path complexities, stratification effects | Multi-anchor, multi-objective dual joint optimization improves security and energy performance; solved via MADDPG; validated in simulations and field experiments. | Complex and computationally intensive optimization; potential scalability issues for large deployments.
Middelkoop, J. et al. (2025) [27] | Multi-agent RL for underwater source localization | Trajectory optimization for underwater source localization | Non-stationary multi-agent environments, communication losses, trajectory optimization challenges | Shared-parameter MARL optimizes two-AUV trajectories to maximize detection probability. | Relies on simplified simulations; scalability limits for larger teams; sensitivity to real-world underwater dynamics.
Li, Y. et al. (2019) [28] | AUV-based localization | Tracking and localization of AUVs | Low accuracy in real-time tracking, interference in UAWSN environments | Summarizes recent advancements and future research directions. | Not suitable for large-scale IoT networks in UAWSNs.
Liu, J. et al. (2015) [29] | Joint time synchronization and localization | Localization for mobile UWSNs | Time synchronization in mobile networks | Combines time synchronization and localization to improve accuracy. | Computational complexity for real-time applications.
Liu, B. et al. (2010) [30] | Synchronization-free localization | Localization in large-scale UWSNs | Synchronization issues in large-scale networks | Proposes synchronization-free localization methods. | Limited scalability in real-world deployments.
Luo, H. et al. (2016) [31] | Localization of floating sensor networks | Drifting/floating localization for UAWSNs | Sensor node mobility, lack of fixed infrastructure | Reports advances for localization under node mobility. | Mobility modeling and system design assumptions may not fully reflect complex real deployments.
Liu, X. et al. (2020) [32] | Anchor-paired range-free localization | Range-free localization for WSNs | Limited coverage, limited communication range | Introduces a decision-making approach for node localization. | High complexity when handling large networks.
Zhang, L. et al. (2020) [33] | Bearing measurement and time difference | Localization in UAWSN environments | Time synchronization, environmental interference | Efficient localization approach for UAWSNs. | Sensitive to environmental changes.
Gong, Z. et al. (2020) [34] | Doppler shift-based AUV localization | AUV-based localization | Time synchronization errors, Doppler effects | Outperforms prior Doppler-based localization methods. | Dependency on AUV motion/conditions and external factors.
Pinheiro, B.C. et al. (2016) [35] | Kernel-function-based models | Localization of underwater vehicles | High error rates in mobile underwater vehicles | Improves localization accuracy for mobile vehicles. | Performance may degrade for large-scale networks.
Wang, Q. et al. (2023) [36] | Cooperative online target detection using multi-AUVs with SSS | Real-time target detection and positioning using multiple AUVs | Severe noise and geometric deformation (SSS), high false alarm rates, real-time computational constraints on AUVs | MSCNet for threshold segmentation and LWBlock for feature extraction. | Evaluated mainly in simulation/sea trials; potential scalability issues for larger detection networks.
Our Work | Multi-agent DRL-based adaptive localization (TD3, SAC, MADDPG, D2DPG) | Energy-efficient AUV localization with adaptive beacon selection | Dynamic underwater channels, energy constraints, scalability, non-stationarity | Achieves lower localization error, reduced energy consumption, faster convergence, and robustness under dynamic conditions. | Higher training complexity; requires offline training and sufficient exploration data.
Table 2. Simulation parameters.
Parameter | Value | Parameter | Value
Simulation area | 100 × 100 m² | Number of beacons (N) | 4–15
AUV motion model | Random waypoint/hovering | Beacon deployment | Random uniform
Carrier frequency (f_c) | 25 kHz | Reference distance (d_0) | 1 m
Geometric spreading factor (η) | 1.5 | Absorption coefficient (α(f_c)) | 0.035 dB/m
Maximum transmit power (P_T) | 10 W | Power quantization levels (M) | 5
Sound speed at surface (B_0) | 1500 m/s | Sound speed gradient (B_1) | 0.017 s⁻¹
RTT noise variance (σ_d²) | 0.05 m² | RSS noise variance (σ_ζ²) | 1 dB
Transmission duration (τ_i) | 50 ms | Discount factor (γ) | 0.99
Learning rate | 10⁻⁴ | Replay buffer size | 10⁶ samples
Mini-batch size | 256 | Exploration strategy | ε-greedy/entropy
Training episodes | 3000 | Steps per episode | 150
Monte Carlo runs | 50 | GPU used | NVIDIA RTX 5070
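For reproducibility, the simulation parameters of Table 2 can be collected into a single configuration object. The key names below are our own convention for illustration, not identifiers from the authors' code.

```python
# Simulation parameters mirroring Table 2 (key names are assumptions).
SIM_PARAMS = {
    "area_m": (100, 100),                 # simulation area, metres
    "num_beacons_range": (4, 15),         # swept in the experiments
    "carrier_freq_hz": 25e3,
    "ref_distance_m": 1.0,
    "spreading_factor": 1.5,
    "absorption_db_per_m": 0.035,
    "max_tx_power_w": 10.0,
    "power_levels": 5,
    "sound_speed_surface_mps": 1500.0,
    "sound_speed_gradient_per_s": 0.017,
    "rtt_noise_var_m2": 0.05,
    "rss_noise_var_db": 1.0,
    "tx_duration_s": 0.050,
    "discount_factor": 0.99,
    "learning_rate": 1e-4,
    "replay_buffer_size": 1_000_000,
    "batch_size": 256,
    "episodes": 3000,
    "steps_per_episode": 150,
    "monte_carlo_runs": 50,
}
```

Keeping these values in one place makes it straightforward to reproduce the Monte Carlo runs or to sweep individual parameters such as the beacon count.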
Table 3. Computational complexity and resource consumption comparison.
Algorithm | FLOPs/Episode | Avg. Time (ms) | Training Time (h) | Memory (MB)
FLA | 1.2 × 10⁶ | 4.5 | 0.4 | 35
SRLUWL | 3.8 × 10⁶ | 9.2 | 1.6 | 120
SDRLUWL | 4.6 × 10⁶ | 11.0 | 2.0 | 150
SAC | 8.9 × 10⁶ | 16.5 | 3.8 | 420
TD3 | 7.6 × 10⁶ | 14.8 | 3.4 | 390
MADDPG | 9.8 × 10⁶ | 18.2 | 4.5 | 480
D2DPG | 1.05 × 10⁷ | 19.6 | 5.0 | 520
Table 4. Asymptotic complexity growth with number of beacons.
Algorithm | Complexity Growth
FLA | O(N)
SRLUWL | O(N log N)
SDRLUWL | O(N log N)
SAC | O(N)
TD3 | O(N)
MADDPG | O(N²)
D2DPG | O(N²)