1. Introduction
Underwater acoustic communication networks (UACNs) play a critical role in marine environment monitoring, maritime rescue, and naval surveillance [1,2]. However, UACNs present significant challenges due to their inherent limitations, including narrow bandwidths, prolonged transmission delays, and severe multipath effects [3,4,5]. These limitations make UACNs susceptible to adversaries that eavesdrop on communication channels to steal sensitive information or disrupt communications by dynamically adjusting their jamming power. Such attacks can deplete the energy of underwater sensor nodes and potentially trigger denial-of-service (DoS) scenarios.
Traditional anti-jamming techniques, such as frequency hopping and spread spectrum, are impractical for UACNs due to the dynamic topology of underwater networks. Consequently, power control is critical for UACNs. However, conventional power control methods struggle to adapt to the rapid variability and complexity of underwater acoustic channels because they rely on convex optimization [6,7]. Recently, reinforcement learning (RL) algorithms have been used to address interference through power control and relay selection. In underwater relay cooperative communication networks, relay nodes enhance transmission quality by forwarding signals. When channel conditions and interference models are unknown, RL algorithms optimize relay strategies to improve anti-jamming performance. For example, an RL-based anti-jamming transmission method [8] reduces the bit error rate by optimizing transmission power. Similarly, Q-learning is used in multi-relay cooperative networks to enhance communication efficiency [9].
Building on these results, this study addresses the anti-jamming problem in UACNs with a novel transmitter anti-jamming method that integrates RL with relay-assisted communication. Specifically, a transfer reinforcement learning (TRL) algorithm based on a hybrid strategy is designed to optimize transmitter actions. By leveraging prior anti-jamming data from similar scenarios, the developed approach initializes Q-values and policy distributions to mitigate the inefficiencies of traditional RL methods, which often start with zero-initialized Q-value matrices. This initialization accelerates the convergence to optimal policies, enhancing the overall anti-jamming performance.
2. System Model
We investigated a relay-assisted anti-jamming communication system for UACNs, comprising a transmitter, a receiver, a relay node, and a hostile jammer. The transmitter operates intermittently, as its transmissions are vulnerable to disruption by the nearby hostile jammer, which emits interference signals to degrade the performance of information transmission. To enhance the system’s anti-jamming capability, a relay node is introduced to forward information between the transmitter and receiver. The relay node effectively mitigates the impact of jamming, thereby improving the overall reliability and performance of the communication system.
At time slot $k$, the transmitter sends a signal to the receiver at the given center frequency and bandwidth. The transmitter decides whether to trigger the relay, indicated by the relay trigger factor $x \in \{0, 1\}$ of the selected action in the feasible action set $A$, to utilize the relay node for forwarding the signal. When $x = 1$, the transmitter activates the relay node, which forwards the signal to the receiver at a fixed transmission power $P_r$ with transmission cost $C_r$. Otherwise, the transmitter sends the signal directly to the receiver. The transmitting power $P$ of the transmitter is selected from the range $(0, P_{\max}]$, where $P_{\max}$ is the maximum allowable transmission power. The cost incurred by the transmitter per unit of transmitted power is denoted as $C_s$.
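To make the action and cost structure above concrete, the following Python sketch enumerates the transmitter's feasible actions and their transmission costs. All names and numerical values (P_MAX, M_LEVELS, P_RELAY, C_RELAY, C_POWER) are illustrative assumptions rather than parameters taken from the paper.

```python
# Minimal sketch of the system-model quantities described above; the constants
# below are illustrative assumptions, not values from the paper.
from dataclasses import dataclass
from itertools import product

P_MAX = 10.0      # maximum allowable transmission power P_max (W), assumed
M_LEVELS = 5      # number of discrete power levels M, assumed
P_RELAY = 5.0     # fixed relay transmission power P_r (W), assumed
C_RELAY = 0.1     # relay transmission cost C_r, assumed
C_POWER = 0.03    # cost per unit of transmitted power C_s, assumed

# Discrete power levels spanning (0, P_max].
POWER_LEVELS = [P_MAX * (i + 1) / M_LEVELS for i in range(M_LEVELS)]

@dataclass(frozen=True)
class Action:
    """One transmitter action: relay trigger factor x and transmit power P."""
    x: int        # 1 -> forward via the relay, 0 -> direct link
    power: float  # transmit power chosen from POWER_LEVELS

# Feasible action set A: every combination of relay trigger and power level.
ACTION_SET = [Action(x, p) for x, p in product((0, 1), POWER_LEVELS)]

def transmission_cost(a: Action) -> float:
    """Energy/cost term charged to the transmitter for action a."""
    return C_POWER * a.power + (C_RELAY if a.x == 1 else 0.0)
```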
3. Anti-Interference Power Control Based on Reinforcement Learning
RL enables agents to optimize actions through feedback from environmental interactions. In this study, we developed a transmitter anti-interference communication scheme, HPTR, which leverages RL-based optimization to select relay nodes and transmission power under dynamic interference and uncertain channel conditions. The proposed scheme employs a transfer reinforcement learning (TRL) algorithm based on mixed strategies. By incorporating prior knowledge from similar scenarios, the algorithm accelerates the learning process, improving efficiency and effectiveness in complex and dynamic underwater environments.
State space: The transmitter’s performance is directly influenced by the state space. At slot $k$, the transmitter evaluates the previous communication performance, including the signal-to-interference-plus-noise ratio (SINR) of the direct link, $\mathrm{SINR}^{(k-1)}$, and the bit error rate (BER) over the entire communication link, $\mathrm{BER}^{(k-1)}$. Thus, the state space is represented as
$$s^{(k)} = \left[\mathrm{SINR}^{(k-1)}, \mathrm{BER}^{(k-1)}\right].$$
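As a minimal illustration of the state definition above, the sketch below represents one state as the previous slot's SINR and BER and quantizes it so that a tabular learner can index it; the field names and quantization steps are assumptions.

```python
# Minimal sketch of the state described above: the SINR of the direct link and
# the BER fed back for the previous slot. Field names are assumptions.
from typing import NamedTuple

class State(NamedTuple):
    sinr: float  # SINR observed on the direct link in slot k-1
    ber: float   # bit error rate reported by the receiver in slot k-1

# In practice both quantities are quantized so the Q-table stays finite.
def quantize(state: State, sinr_step: float = 0.05, ber_step: float = 0.05) -> tuple:
    return (round(state.sinr / sinr_step), round(state.ber / ber_step))
```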
Action space: Based on the policy distribution $\pi(s, a)$ and the Q-value $Q(s, a)$ of the current state, the transmitter selects an anti-jamming strategy $a = (x, P)$. Here, the relay trigger factor $x \in \{0, 1\}$ and the transmitting power $P \in \{P_{\max}/M, 2P_{\max}/M, \dots, P_{\max}\}$, where $M$ represents the number of discrete power levels. The probability of selecting strategy $a$ is given by the mixed strategy distribution $\pi(s, a)$, in which actions with larger Q-values are assigned higher probabilities.
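The paper's exact selection-probability formula is not reproduced here; as a hedged stand-in, the sketch below uses a Boltzmann (softmax) mixed strategy over the Q-values of the current state, which matches the qualitative behavior described later, namely that actions with larger Q-values receive higher probability. The temperature parameter is an assumption.

```python
# Hedged stand-in for the mixed-strategy action selection: a softmax over the
# Q-values of the current state, not the paper's exact formula.
import math
import random

def mixed_strategy(q_row: list[float], temperature: float = 0.5) -> list[float]:
    """Turn one row of the Q-table (all actions in one state) into probabilities."""
    scaled = [q / temperature for q in q_row]
    peak = max(scaled)                              # subtract max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def select_action_index(q_row: list[float], temperature: float = 0.5) -> int:
    """Sample an action index according to the mixed strategy."""
    probs = mixed_strategy(q_row, temperature)
    return random.choices(range(len(q_row)), weights=probs, k=1)[0]
```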
Reward function: The reward function is a key component of the learning process, as it directly impacts the decision-making strategy of the transmitter. Based on the selected anti-jamming strategy, when $x = 0$, the transmitter transmits the signal to the receiver through the direct link; otherwise, it triggers the relay node for information transmission. After receiving feedback from the receiver, the signal-to-interference-plus-noise ratio (SINR) at the destination and the relay trigger factor are used to compute the reward. The reward function consists of two terms: the first represents the effectiveness of the selected transmission strategy (either direct or relay-assisted) based on the SINR, and the second penalizes energy consumption by weighting the transmitter’s power cost.
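Continuing the sketch, a plausible reward of the described form, an SINR-based benefit minus a weighted energy cost, could look as follows; the exact weighting used in the paper is not given, so C_POWER and C_RELAY are the illustrative constants defined earlier.

```python
# Hedged sketch of the reward: SINR-driven benefit minus the weighted energy
# and relay costs. The exact form in the paper is not reproduced.
def reward(sinr: float, action: "Action") -> float:
    benefit = sinr                                      # effectiveness of the chosen link
    energy_penalty = C_POWER * action.power             # transmitter power cost
    relay_penalty = C_RELAY if action.x == 1 else 0.0   # cost of triggering the relay
    return benefit - energy_penalty - relay_penalty
```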
The Q-function $Q(s, a)$, representing the expected utility of taking action $a$ in state $s$, is updated iteratively using the Bellman equation as follows:
$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[ u + \gamma V(s') \right],$$
where $u$ is the obtained utility, $s'$ is the next state resulting from action $a$, and $V(s')$ is the value function, defined as the maximum $Q(s', a)$ across all possible actions:
$$V(s') = \max_{a \in A} Q(s', a).$$
In addition, the learning rate $\alpha$ represents the weight of current experience, while the discount factor $\gamma$ represents the degree of uncertainty about future utility. Based on the mixed strategy table $\pi(s, a)$, the probability of the action with the largest Q-value is increased, while the probabilities of the other actions are correspondingly reduced.
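A minimal tabular sketch of these updates, continuing the earlier definitions, is given below. ALPHA, GAMMA, and the policy-update step size are assumed values; the policy rule simply shifts probability toward the greedy action, as the text describes.

```python
# Tabular Q-learning update and mixed-strategy update, continuing the earlier
# sketches. ALPHA, GAMMA, and the step size are assumed values.
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9          # learning rate and discount factor (assumed)
N_ACTIONS = len(ACTION_SET)

# Zero-initialized here; Algorithm 1 overwrites these with transferred values.
Q = defaultdict(lambda: [0.0] * N_ACTIONS)                 # Q(s, a)
PI = defaultdict(lambda: [1.0 / N_ACTIONS] * N_ACTIONS)    # mixed strategy pi(s, a)

def q_update(s, a_idx: int, u: float, s_next) -> None:
    """Bellman update: Q <- (1 - alpha) Q + alpha (u + gamma V(s'))."""
    v_next = max(Q[s_next])                                # V(s') = max_a Q(s', a)
    Q[s][a_idx] = (1 - ALPHA) * Q[s][a_idx] + ALPHA * (u + GAMMA * v_next)

def policy_update(s, step: float = 0.1) -> None:
    """Increase the probability of the greedy action; scale down the others."""
    greedy = max(range(N_ACTIONS), key=lambda i: Q[s][i])
    PI[s] = [(1 - step) * p for p in PI[s]]
    PI[s][greedy] += step
```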
The proposed TRL algorithm leverages experience from large-scale UACNs in anti-jamming tasks to initialize the Q-values and policy distributions for the current scenario. By transferring knowledge from similar environments, the algorithm reduces the randomness of early exploration and accelerates learning convergence. In particular, several experiments of anti-jamming UACN transmission in similar scenarios are carried out before the learning process of transmitter strategy optimization begins. Each experiment lasts for $K$ time slots, during which the transmitter observes the current state, including the bit error rate, the signal-to-interference-plus-noise ratio, and other anti-jamming transmission performance indicators of the underwater acoustic communication network, together with the relay information. The transmitter’s mixed strategy is then selected according to the greedy policy. The preparation phase of the HPTR algorithm is shown in Algorithm 1.
Algorithm 1: HPTR preparation phase
1. Initialize the Q-table, the mixed strategy table, the learning rate, the discount factor, and the transmission parameters.
2. For each preparation experiment do
3.  For each time slot do
4.   Choose the anti-jamming strategy (relay trigger factor and transmission power) according to the greedy policy.
5.   If the relay is triggered, then the transmitter triggers the relay, which forwards the information at its fixed power; else, the transmitter forwards the information directly to the receiver at the selected power.
6.   Measure the SINR and BER of the current transmission and compute the utility function.
7.   Update the Q-value and the mixed strategy table according to the corresponding update rules.
8.  End for
9. End for
10. Output the initial Q-values and mixed strategy table.
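To make the flow of Algorithm 1 concrete, the sketch below replays experience from an environment representing similar scenarios to warm-start the Q-table and mixed strategy table before online learning. The env interface (reset/step), the episode and slot counts, and all helper names reuse the illustrative definitions from the earlier sketches and are assumptions, not the paper's implementation.

```python
# Preparation phase (transfer learning): gather experience in similar scenarios
# to warm-start Q and PI. `env` is a hypothetical environment stub whose
# reset() returns a quantized state key and whose step() returns (SINR, BER).
def preparation_phase(env, episodes: int = 20, slots: int = 200):
    """Warm-start Q and PI from experience gathered in similar scenarios."""
    for _ in range(episodes):
        s = env.reset()                          # quantized (SINR, BER) state key
        for _ in range(slots):
            a_idx = select_action_index(Q[s])    # greedy/mixed action choice
            action = ACTION_SET[a_idx]
            sinr, ber = env.step(action)         # relay-assisted or direct transmission
            u = reward(sinr, action)             # utility of the chosen strategy
            s_next = quantize(State(sinr, ber))
            q_update(s, a_idx, u, s_next)        # Bellman update of Q(s, a)
            policy_update(s)                     # shift PI(s, .) toward the greedy action
            s = s_next
    return Q, PI                                 # transferred initial tables
```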
By utilizing the initial Q-values and the mixed policy table generated by Algorithm 1, Algorithm 2 is initialized with these values to facilitate learning. In Algorithm 2, the transmitter observes the current state, which includes the SINR of the transmitter–receiver link and the bit error rate (BER) measured at the receiver in the previous time slot. Based on the observed state, the transmitter selects an action according to the predefined decision rules and evaluates the utility of the chosen action. During each time slot, the Q-function and the mixed strategy table are updated using the corresponding update rules. Through this iterative and interactive process, the transmitter gradually learns an optimal anti-jamming strategy within the dynamic game framework of anti-jamming transmission. As a result, the proposed approach effectively enhances the anti-jamming performance of the transmission system.
Algorithm 2: HPTR algorithm
1. Initialize the Q-table and the mixed strategy table with the output of Algorithm 1, together with the learning rate, the discount factor, and the transmission parameters.
2. For each time slot do
3.  Choose the anti-jamming strategy (relay trigger factor and transmission power) according to the mixed strategy.
4.  If the relay is triggered, then the transmitter triggers the relay, which forwards the information at its fixed power; else, the transmitter forwards the information directly to the receiver at the selected power.
5.  Measure the SINR and BER of the current transmission and compute the utility function.
6.  Update the Q-value, the value function, and the mixed strategy table according to the corresponding update rules.
7. End for
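A corresponding sketch of the online phase is shown below: the transmitter starts from the transferred Q and PI tables produced by the preparation phase and keeps updating them every slot while interacting with the actual jammed channel. The channel interface (observe/transmit) is hypothetical.

```python
# Online phase: learn against the actual jammed channel, starting from the
# transferred tables. `channel` is a hypothetical interface whose observe()
# returns the quantized state from the last slot and whose transmit() returns
# the measured (SINR, BER) for the chosen action.
def hptr_online(channel, slots: int = 1000):
    """Online anti-jamming learning, warm-started by preparation_phase()."""
    s = channel.observe()                        # quantized state from last slot's feedback
    for _ in range(slots):
        a_idx = select_action_index(Q[s])        # action drawn from the mixed strategy
        action = ACTION_SET[a_idx]
        sinr, ber = channel.transmit(action)     # direct link or relay-assisted forwarding
        u = reward(sinr, action)
        s_next = quantize(State(sinr, ber))
        q_update(s, a_idx, u, s_next)            # update Q with the observed utility
        policy_update(s)                         # update the mixed strategy table
        s = s_next
```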
4. Simulation Results and Analysis
The performance of the HPTR learning algorithm was evaluated and analyzed through simulation experiments. In the experiment, the equipment was placed 0.5 m below the water surface. The transmitter, located at coordinates (0, 0), sent signals to the receiver while selecting the appropriate transmission power. The power setting ranged from 1 to 10 W and was quantized into five discrete levels. The system operated with a center frequency of 20 kHz and a bandwidth of 2 kHz. The transmitter utilized the relay node to assist in forwarding information. Upon receiving the trigger signal from the transmitter, the relay, positioned at (0.5, 1.1), forwarded the signal to the receiver at a fixed transmission power, thereby enhancing the reliability of information delivery. The receiver, located at (1.9, 0.3), decoded the received messages using a selection combining approach and evaluated the BER and the signal-to-noise ratio (SNR) of the received signals to provide feedback to the transmitter and relay. Meanwhile, an intelligent jammer at (0.5, −0.2) employed software-defined radio equipment to monitor the channel’s transmission state and estimate the quality of the transmitted signals. Using a reinforcement learning algorithm, the jammer dynamically selected its interference power, which ranged from 1 to 10 W and was quantized into five levels, to maximize its long-term discounted utility. Based on the chosen interference power, the jammer transmitted interference signals targeting the receiver and relay. The transmitter power cost coefficient was set to 0.03, the interference power cost coefficient was 0.02, and the noise power was fixed at 0.1.
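For reference, the simulation parameters stated above can be collected into a single configuration sketch; only values explicitly reported in the text are included, and the dictionary layout itself is an illustrative choice.

```python
# Simulation parameters as reported in the text; the dict layout is illustrative.
SIM_CONFIG = {
    "transmitter_pos": (0.0, 0.0),
    "relay_pos": (0.5, 1.1),
    "receiver_pos": (1.9, 0.3),
    "jammer_pos": (0.5, -0.2),
    "depth_m": 0.5,                  # equipment placed 0.5 m below the surface
    "center_freq_hz": 20e3,
    "bandwidth_hz": 2e3,
    "tx_power_range_w": (1.0, 10.0),
    "tx_power_levels": 5,
    "jam_power_range_w": (1.0, 10.0),
    "jam_power_levels": 5,
    "tx_power_cost": 0.03,
    "jam_power_cost": 0.02,
    "noise_power": 0.1,
}
```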
The anti-jamming performance of the proposed HPTR algorithm was compared with that of two benchmark algorithms commonly used in UACNs.
QPR: An anti-jamming power allocation algorithm based on Q-learning, similar to the approach described in Ref. [10]. Among the algorithms discussed in Refs. [10,11,12], the one in Ref. [12] is the most comparable to the proposed HPTR algorithm and is thus selected for performance comparison.
PTR: A reinforcement learning algorithm based on hybrid strategies [13]. This algorithm introduces decision randomness to create uncertainty, deceive jammers, and enhance robustness against interference.
In the anti-interference attack and defense scenario of UACNs, whether the agents can quickly reach a Nash equilibrium is an important indicator for evaluating the convergence of the algorithm.
Figures 1 and 2 show the convergence of the HPTR algorithm in terms of utility. Both the transmitter and the jammer converge quickly: the transmitter’s utility gradually increases over time, while the jammer’s utility gradually decreases, which verifies the effectiveness of the proposed scheme.
As shown in Figure 3, the SINR and the transmitter utility of the proposed scheme improved over time, while the message BER decreased. Specifically, from the initial stage to 1000 time slots, the message error rate of HPTR was reduced from 0.63 to 0.60, the SINR increased from 0.24 to 0.275 (an increase of 14.5%), and the transmitter utility increased by 125%. Compared with QPR, the proposed scheme exhibited a higher SINR, improved transmitter utility, and a lower BER. For instance, at 1000 time slots, the HPTR scheme improved the transmitter utility by 12.5% and the SINR by 3.77% and reduced the bit error rate by 1.6% relative to QPR. This improvement is attributed to the transfer learning technology employed by HPTR, which leverages anti-jamming experience from similar UACNs. In the initial phase of transmitter strategy optimization, HPTR mitigates the inefficiencies of blind random exploration by incorporating this additional experience. Furthermore, the hybrid-strategy-based reinforcement learning algorithm optimizes the transmitter’s strategy to confuse the jammer, preventing it from executing precise attacks, thereby enhancing anti-jamming communication performance. Compared with PTR, the proposed HPTR method requires approximately 500 fewer time slots to converge to the optimal policy; the converged policy is then used until the network state or attack strategy changes. This faster convergence results from the anti-jamming experience accumulated by HPTR via transfer learning.