Article

An Underwater Acoustic Communication Signal Modulation-Style Recognition Algorithm Based on Dual-Feature Fusion and ResNet–Transformer Dual-Model Fusion

by Fanyu Zhou, Haoran Wu *, Zhibin Yue and Han Li
Naval Institute of Underwater Acoustic Technology, Naval University of Engineering, Wuhan 430033, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 6234; https://doi.org/10.3390/app15116234
Submission received: 24 April 2025 / Revised: 24 May 2025 / Accepted: 30 May 2025 / Published: 1 June 2025

Abstract

Traditional underwater acoustic reconnaissance technologies are limited in directly detecting underwater acoustic communication signals. This paper proposes a dual-feature ResNet–Transformer model with two innovative breakthroughs: (1) A dual-modal fusion architecture of ResNet and Transformer is constructed using residual connections to alleviate gradient degradation in deep networks and combining multi-head self-attention to enhance long-distance dependency modeling. (2) The time–frequency representation obtained from the smooth pseudo-Wigner–Ville distribution is used as the first input branch, and higher-order statistics are introduced as the second input branch to enhance phase feature extraction and cope with channel interference. Experiments on the Danjiangkou measured dataset show that the model improves the accuracy by 6.67% compared with the existing Convolutional Neural Network (CNN)–Transformer model in long-distance ranges, providing an efficient solution for modulation recognition in complex underwater acoustic environments.

1. Introduction

Underwater acoustic (UWA) communication often employs modulation techniques such as Multiple Frequency Shift Keying (MFSK), Multiple Phase Shift Keying (MPSK), Direct Sequence Spread Spectrum (DSSS), and Orthogonal Frequency Division Multiplexing (OFDM). Compared with active sonar pulse signals, UWA communication signals are more complex and diverse, with greater differences in signal characteristics and feature parameters [1,2,3]. Traditional UWA reconnaissance technologies cannot be applied directly to the detection of UWA communication signals [4], mainly because UWA communication signal reconnaissance also involves modulation-style recognition and classification. The difference between modulation recognition of UWA communication signals and that of radio signals lies chiefly in the channel environment, which means that identifying modulation patterns for UWA signals requires considering far more factors than for radio signals. For example, the propagation speed of sound waves in water is relatively low (approximately 1500 m/s), resulting in significant propagation delays in received signals (roughly 0.67 s per kilometer) [5]. Against this background, designing a modulation recognition method that effectively copes with the impacts of UWA channels has become an urgent problem.
The modulation recognition of UWA signals can be divided into traditional methods and deep learning-based methods. Traditional methods are further divided into pattern recognition approaches based on hypothesis testing and those based on feature extraction. Compared with hypothesis testing methods, feature extraction methods do not rely on prior knowledge of the signal, so they are widely used in non-cooperative scenarios. Swami et al. first proposed a modulation recognition method based on higher-order statistics (HOS) [6]. This method made up for the shortcomings of the traditional maximum-likelihood algorithm, such as extremely high computational complexity and poor robustness to phase and frequency offsets, and it avoided the reliance of contemporary feature classification methods on large numbers of samples or complex pre-processing. However, because the variance of its fourth-order statistics estimates increases at low signal-to-noise ratio (SNR), the decision boundaries between similar sub-classes become blurred and the probability of misclassification rises. Khandker et al. proposed a digital modulation signal classification method based on time–frequency representation (TFR) analysis and peak detection [7]. This method addressed the insufficient ability of traditional HOS methods to distinguish multi-peak signals such as FSK and OFDM. However, its classification performance depends heavily on the peak detection threshold: when the noise is too strong, peaks may be misjudged, which degrades the classification accuracy. Deep learning shows significant advantages by integrating traditional features with end-to-end learning. Xu et al. proposed a recognition method based on CNN [8]. Likewise, the deep CNN model proposed by Singh et al. [9] and the CNN model with optimized hyperparameters proposed by Beeharry et al. [10] address the errors caused by manually designed decision criteria. However, constrained by their network architectures, these methods are prone to overfitting in small-sample scenarios and in complex channels with fragmented features, and they exhibit limited robustness to low-SNR conditions and phase-modulated signals. With the emergence of the Long Short-Term Memory (LSTM) network, many researchers have introduced it into modulation recognition. Ke et al. proposed a learning framework based on an LSTM denoising autoencoder [11], which automatically extracts stable features from noisy radio signals; the classification accuracy for modulation signals such as 8PSK reaches 90% at an SNR of 8 dB. Dampage et al. proposed a joint classifier–demodulator–recognizer method based on the LSTM architecture [12], which has significant advantages in processing time-series data and improves classification accuracy. Zhang et al. proposed a recognition network that combines an SNR classifier, a denoising autoencoder, and LSTM [13], which effectively enhances the model's robustness to noise and its ability to distinguish features. In general, LSTM models capture long-sequence time domain features through a gating mechanism, effectively alleviating the overfitting problem of deep learning models and improving signal recognition accuracy in the low-SNR range.
However, such methods improve recognition accuracy mainly by stacking additional LSTM layers and adding denoising autoencoders, which results in a large number of training parameters and high computational complexity. Zhang et al. designed a hybrid neural network combining CNN and a Recurrent Neural Network (RNN) for the modulation recognition of UWA signals [14]. This method effectively improves the anti-noise performance of the model; however, owing to the inherent limitations of RNNs, it struggles to capture long-distance dependencies. West et al. designed a Convolutional Long Short-Term Deep Neural Network (CLDNN) [15]. This architecture retains the strong representation ability of CNN for local TFR features and introduces the long-term modeling ability of LSTM for time-series dynamics, breaking through the limitations of a single network and providing more comprehensive feature processing capabilities for signal modulation recognition. Nevertheless, CLDNN has limitations in training speed, recognition of complex modulation methods, and model complexity.
As the Transformer network has been recognized for its advantages in global feature extraction through self-attention, Cai et al. applied it to modulation recognition [16], effectively addressing the excessive parameter counts and insufficient temporal dependency modeling of comparable neural networks. However, this approach is suited only to capturing long-range dependencies and is weaker at extracting local detailed features. Zhang et al. proposed the CTRNet modulation recognition network by combining CNN and Transformer [17], addressing the neglect of local features in single Transformer models. Li et al. proposed a Two-Stream Transformer (TSTR) network that integrates CNN and Transformer to identify the modulation types of underwater acoustic signals [18]. This method primarily optimizes the self-attention mechanism and employs a soft thresholding mechanism to enhance the network's robustness at low SNR. Pang et al. proposed a parallel two-branch dynamic dilated Convolutional Neural Network combined with Transformer [19], which adaptively adjusts the receptive field to effectively capture signal features at different scales. References [20,21] propose models combining the multi-head self-attention mechanism and LSTM, which effectively model long-range dependencies and reduce confusion between similar modulations. However, existing hybrid neural network architectures based on multi-head self-attention commonly suffer from gradient degradation. In addition, their fusion feature settings are often suboptimal, leading to low recognition accuracy for phase-modulated signals under low-SNR conditions and leaving significant room for performance improvement.
To address the above challenges, this paper proposes a dual-feature ResNet–Transformer (DFRT) model. The main innovations are as follows: (1) A dual-modal feature fusion framework is constructed that uses ResNet residual connections to overcome the gradient degradation problem of traditional deep networks while leveraging the Transformer's multi-head self-attention mechanism to enhance long-distance dependency modeling. (2) The TFR obtained from the smooth pseudo-Wigner–Ville distribution (SPWVD) is used as the first input branch, and HOS is introduced as the second input branch to enhance the extraction of signal phase features and compensate for the vulnerability of a single feature to channel interference. (3) Experiments on the Danjiangkou measured dataset show that the model improves accuracy by 6.67% over existing CNN–Transformer models at long distances, providing new insights for UWA communication signal processing.

2. UWA Signal Preprocessing

The received UWA signal can be expressed as:
$$r(t) = s(t) * h(t) + n(t)$$
where $s(t)$ represents the transmitted signal, $h(t)$ represents the impulse response of the UWA channel, $*$ denotes convolution, and $n(t)$ represents channel noise.
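As an illustration of this signal model, the received waveform can be simulated by convolving a transmitted waveform with a channel impulse response and adding noise. The sketch below uses NumPy with arbitrary placeholder values for the sampling rate, carrier, impulse response, and noise level; none of these are measured Danjiangkou channel parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

fs = 48_000                                 # sampling rate in Hz (placeholder, not from the paper)
t = np.arange(4800) / fs                    # one 4800-sample frame, matching the dataset dimension
s = np.cos(2 * np.pi * 4_000 * t)           # toy transmitted signal s(t)
h = np.array([1.0, 0.0, 0.35, 0.0, 0.1])    # toy multipath impulse response h(t)
n = 0.05 * rng.standard_normal(t.size)      # additive channel noise n(t)

# r(t) = s(t) * h(t) + n(t): linear convolution, truncated to the frame length
r = np.convolve(s, h)[: t.size] + n
```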
The SPWVD [22] employs double smoothing processing with time domain and frequency domain windows. Compared with short-time Fourier transform (STFT) and wavelet transform (WT), SPWVD can provide higher time–frequency resolution and more effective suppression of cross-term interference, thereby offering a clearer representation of the signal’s TFR. The mathematical expression of SPWVD is given by:
$$\mathrm{SPWVD}_x(t,f) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} g(v)\,h(\tau)\,e^{-j2\pi f\tau}\, r\!\left(t-v+\frac{\tau}{2}\right) r^{*}\!\left(t-v-\frac{\tau}{2}\right)\mathrm{d}v\,\mathrm{d}\tau$$
where $g(v)$ and $h(\tau)$ denote window functions, and $*$ represents complex conjugation.
The TFR obtained via SPWVD initially has a size of 4800 × 4800. The time domain window is a Hamming window with a length of 480 and an overlap of 240 samples; the frequency domain window is a Hamming window with a length of 1200 and an overlap of 600 samples. To remove irrelevant information, the TFR is first cropped to 1000 × 4800 and then resized to 100 × 480 using bicubic interpolation. The final TFR of the UWA signal is shown in Figure 1.
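A sketch of the cropping and resizing steps described above, assuming the 4800 × 4800 SPWVD map has already been computed by a separate time–frequency routine and that the crop removes the irrelevant frequency band (the exact crop region is an assumption); OpenCV is used here only for the bicubic interpolation.

```python
import numpy as np
import cv2  # OpenCV, used here only for bicubic interpolation


def preprocess_tfr(tfr: np.ndarray) -> np.ndarray:
    """Crop a 4800 x 4800 SPWVD map and resize it to the 100 x 480 network input."""
    assert tfr.shape == (4800, 4800)
    cropped = tfr[:1000, :].astype(np.float32)   # keep the band of interest: 1000 x 4800
    resized = cv2.resize(cropped, (480, 100),    # cv2.resize expects (width, height)
                         interpolation=cv2.INTER_CUBIC)
    return resized                               # 100 x 480 input for the time-frequency branch
```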
As shown in Figure 1, the TFR demonstrates strong capability in discriminating frequency-modulated signals (such as 2FSK, 4FSK, etc.) but performs poorly in distinguishing phase-modulated signals (such as 2PSK, 4PSK, etc.). Therefore, another feature needs to be extracted to further aid the model in improving its ability to extract phase information from phase-modulated signals. References [6,23] demonstrate how HOS is applied to modulation recognition and provide related verification. Experimental results show that this method can effectively extract the phase information of phase-modulated signals. Therefore, this paper performs feature extraction of HOS for the signal following the above method.
For a zero-mean complex stationary random process y ( n ) , its second-order statistics are defined in two different ways based on the position of complex conjugation as:
$$C_{20} = E\!\left[y^{2}(n)\right] \quad \text{and} \quad C_{21} = E\!\left[\left|y(n)\right|^{2}\right]$$
Similarly, the fourth-order statistic can be defined as:
$$\begin{aligned} C_{40} &= \mathrm{cum}\left(y(n),\, y(n),\, y(n),\, y(n)\right) \\ C_{41} &= \mathrm{cum}\left(y(n),\, y(n),\, y(n),\, y^{*}(n)\right) \\ C_{42} &= \mathrm{cum}\left(y(n),\, y(n),\, y^{*}(n),\, y^{*}(n)\right) \end{aligned}$$
where * denotes complex conjugation.
For zero-mean complex stationary random processes $w, x, y, z$, the cumulant $\mathrm{cum}(w, x, y, z)$ can be expressed as:
$$\mathrm{cum}(w, x, y, z) = E[wxyz] - E[wx]E[yz] - E[wy]E[xz] - E[wz]E[xy]$$
Thus, Equations (3) and (4) can be optimized as:
$$\begin{aligned} C_{20} &= \frac{1}{N}\sum_{n=1}^{N} y^{2}(n) \\ C_{21} &= \frac{1}{N}\sum_{n=1}^{N} \left|y(n)\right|^{2} \\ C_{40} &= \frac{1}{N}\sum_{n=1}^{N} y^{4}(n) - 3C_{20}^{2} \\ C_{41} &= \frac{1}{N}\sum_{n=1}^{N} y^{3}(n)\,y^{*}(n) - 3C_{20}C_{21} \\ C_{42} &= \frac{1}{N}\sum_{n=1}^{N} \left|y(n)\right|^{4} - \left|C_{20}\right|^{2} - 2C_{21}^{2} \end{aligned}$$
Therefore, after subjecting the received signal to down-conversion, mean removal, and normalization, and then substituting it into Equation (6), the HOS of the signal can be obtained. Then, the obtained HOS are concatenated according to Equation (7) to obtain the feature X h :
$$X_h = \left[C_{20},\, C_{21},\, C_{40},\, C_{41},\, C_{42}\right]$$
Complex-valued data cannot be directly input into a neural network, so the real and imaginary parts of X h need to be split. The final HOS feature X h is obtained as:
$$X_h = \begin{bmatrix} C_{R20}, & C_{R21}, & C_{R40}, & C_{R41}, & C_{R42} \\ C_{I20}, & C_{I21}, & C_{I40}, & C_{I41}, & C_{I42} \end{bmatrix}$$
where $C_{RX}$ is the real part of $C_{X}$ and $C_{IX}$ is the imaginary part of $C_{X}$.
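For reference, Equations (6)–(8) can be implemented directly in NumPy. The sketch below assumes the input is already a down-converted complex baseband sequence; normalization to unit average power is one plausible reading of the normalization step.

```python
import numpy as np


def hos_features(y: np.ndarray) -> np.ndarray:
    """HOS feature vector of Equations (6)-(8) for a complex sequence y(n)."""
    y = y - y.mean()                             # mean removal
    y = y / np.sqrt(np.mean(np.abs(y) ** 2))     # normalization to unit average power
    N = y.size

    c20 = np.sum(y ** 2) / N
    c21 = np.sum(np.abs(y) ** 2) / N
    c40 = np.sum(y ** 4) / N - 3 * c20 ** 2
    c41 = np.sum(y ** 3 * np.conj(y)) / N - 3 * c20 * c21
    c42 = np.sum(np.abs(y) ** 4) / N - np.abs(c20) ** 2 - 2 * c21 ** 2

    x_h = np.array([c20, c21, c40, c41, c42])    # Equation (7)
    # Equation (8): split into real and imaginary parts for the real-valued network input
    return np.concatenate([x_h.real, x_h.imag])
```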

3. Network Model Design

3.1. Methodology

Through innovative deep learning architecture design (a dual-model framework integrating ResNet and Transformer) and dual-feature joint learning (combining TFR with HOS), this study systematically addresses the challenge of modulation recognition for underwater acoustic communication signals in complex channels. The proposed approach provides an efficient end-to-end solution for underwater non-cooperative communication scenarios. The referenced model architectures include ResNet [24], Transformer [16], CNN–Transformer [17,18,19], and ResNet–Transformer [25]. The core structure of the DFRT model is illustrated in Figure 2.
In the feature extraction module of the proposed model, we designed two parallel branches (corresponding to the oyster gray blocks in Figure 2) by considering three key factors: feature type discrepancy, network adaptability, and computational efficiency.
(1)
Feature Type Discrepancy: Time–frequency maps are 2D images containing both local structures (e.g., frequency-hopping patterns and energy distributions) and global trends (e.g., periodicity of modulation modes). HOS are low-dimensional vector features derived from signal statistics (e.g., real/imaginary parts of second/fourth-order statistics), reflecting non-Gaussianity and phase information as global statistical properties without local spatial structures.
(2)
Network Adaptability: Statistical features, being mathematically transformed statistics without inherent spatial dependencies, are ill-suited for convolutional operations. Thus, ResNet is ineffective for their extraction.
(3)
Computational Efficiency: The time–frequency branch employs a dual-network architecture to enhance multi-scale feature extraction, prioritizing expressive power. The statistics branch leverages Transformer’s global modeling capability to efficiently capture statistical correlations, avoiding overfitting from excessive layers.
In the classification module of the model, we designed the classification layer (corresponding to the dark gray block in Figure 2) with a focus on three key aspects: effective feature fusion, nonlinear expressiveness, and model generalization.
(1)
Effective Feature Fusion:
A multi-head attention mechanism is employed to fuse the TFR (256 dimensions) extracted by the ResNet branch and the HOS features (256 dimensions) from the Transformer branch. This approach addresses the limitations of single-modal features (e.g., the insensitivity of TFR to phase information and the lack of spatial details in statistics).
(2)
Nonlinear Expressiveness:
A fully connected layer maps the 256-dimensional fused features to 128 dimensions, reducing parameter redundancy and enhancing nonlinear separability. Meanwhile, nonlinear activation functions are applied to break linear classification constraints, enabling the model to capture complex nonlinear relationships among modulation types.
(3)
Model Generalization:
Dropout regularization is incorporated to mitigate overfitting and improve the model’s generalization capability across diverse underwater scenarios.
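To make the layout above concrete, the following skeleton shows one plausible wiring of the two branches, the attention-based fusion, and the classification head. It is a sketch rather than the authors' exact implementation: the branch modules are assumed to emit the 256-dimensional vectors stated above, the head count and dropout rate are placeholders, and the choice of the TFR feature as query and the HOS feature as key/value is one possible fusion arrangement.

```python
import torch
import torch.nn as nn


class DFRTSkeleton(nn.Module):
    """Sketch of the dual-feature ResNet-Transformer layout described in Section 3.1."""

    def __init__(self, tfr_branch: nn.Module, hos_branch: nn.Module,
                 num_classes: int = 8, num_heads: int = 4, p_drop: float = 0.5):
        super().__init__()
        self.tfr_branch = tfr_branch     # ResNet branch: TFR image -> 256-dim vector
        self.hos_branch = hos_branch     # Transformer branch: HOS features -> 256-dim vector
        self.fusion = nn.MultiheadAttention(embed_dim=256, num_heads=num_heads,
                                            batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(256, 128), nn.GELU(), nn.Dropout(p_drop), nn.Linear(128, num_classes))

    def forward(self, tfr: torch.Tensor, hos: torch.Tensor) -> torch.Tensor:
        f_tfr = self.tfr_branch(tfr).unsqueeze(1)    # (batch, 1, 256)
        f_hos = self.hos_branch(hos).unsqueeze(1)    # (batch, 1, 256)
        fused, _ = self.fusion(f_tfr, f_hos, f_hos)  # attention-based feature fusion
        return self.classifier(fused.squeeze(1))     # logits over the 8 modulation types
```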

3.2. Feature Extraction Layer

3.2.1. ResNet Model Structure

To effectively extract features of UWA signals for modulation classification, deep neural networks are typically used. However, in deep CNNs, gradients may vanish or explode as the number of network layers increases, leading to convergence failure. ResNet addresses these issues by introducing residual blocks and shortcut connections. Figure 3 illustrates the structure of a residual block in ResNet, where “3 × 3” denotes the size of the convolution kernel, “Conv” represents the convolution operation, “BatchNorm” denotes batch normalization, and “GeLu” represents the activation function operation. The input information x has two paths: one directly transmits to the output end, and the other goes through two convolution operations to obtain F ( x ) . The final output H ( x ) can be expressed as:
$$H(x) = F(x) + x$$
During forward propagation, when the input x already meets the task requirements, the residual branch output F ( x ) tends to 0, yielding an identity mapping and mitigating the network degradation problem. During backpropagation, the shortcut connection lets gradients flow back through only a single GeLu layer, effectively alleviating gradient vanishing and explosion.
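A minimal PyTorch sketch of the residual block in Figure 3, using 3 × 3 convolutions, batch normalization, GeLu, and a shortcut connection; the 1 × 1 projection used when the channel count changes is an added convenience rather than a detail taken from the figure.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Residual block of Figure 3: two 3x3 Conv-BatchNorm stages plus a shortcut."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.act = nn.GELU()
        # 1x1 projection so the shortcut matches when the channel count changes (assumption)
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.act(self.bn1(self.conv1(x)))    # first 3x3 Conv + BatchNorm + GeLu
        f = self.bn2(self.conv2(f))              # second 3x3 Conv + BatchNorm: F(x)
        return self.act(f + self.shortcut(x))    # H(x) = F(x) + x, followed by GeLu
```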
The ResNet model structure used in this paper is shown in Figure 4. To extract low-level features and accelerate the network’s convergence speed, the input signal is first fed into a convolutional layer and a batch normalization layer. The definition of batch normalization is:
$$\hat{x}_i = \frac{x_i - E[x_i]}{\sqrt{\mathrm{Var}[x_i]}}$$
where $x_i$ represents the $i$-th dimension of the feature batch, and $E[\cdot]$ and $\mathrm{Var}[\cdot]$ denote the mean and variance of $x_i$, respectively. To enable the neuron output to undergo a nonlinear transformation based on the input probability distribution, the batch-normalized data must pass through the GeLu activation function before being input into the residual block. The GeLu activation function is defined as:
$$\mathrm{GeLu}(x) = 0.5x\left[1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right)\right]$$
The resulting signal is then input into the residual blocks to further extract signal features. Block1, Block2, and Block3 are the modules shown in Figure 3, and the numbers below indicate the number of output channels for each module. Finally, to improve the model’s generalization ability, an adaptive pooling layer is used to pool the input feature map into an output feature map of size 6 × 15, which is then flattened into a feature vector of size 256 for subsequent algorithm use.

3.2.2. Transformer Model Structure

In the process of modulation recognition, capturing global features can avoid the interference of local noise and further improve the accuracy of classification. However, the single ResNet model structure needs to continuously stack convolutional layers to extract global features, which significantly increases the number of parameters in the entire network. The Transformer can establish long-distance dependencies through multi-head self-attention. While effectively capturing global features, it can also effectively reduce the number of network parameters. The structure of the Transformer model is shown in Figure 5.
The yellow part in Figure 5 represents the linear projection of the input sequence, and $l$ denotes the number of attention heads. For the $i$-th head $\mathrm{head}_i$, its query matrix $Q_i$, key matrix $K_i$, and value matrix $V_i$ can be expressed as:
$$Q_i = XW_i^{Q}, \quad K_i = XW_i^{K}, \quad V_i = XW_i^{V}$$
where $X \in \mathbb{R}^{n \times d}$ is the input sequence, $n$ is the sequence length, and $d$ is the feature dimension; $W_i^{Q} \in \mathbb{R}^{d \times d_k}$, $W_i^{K} \in \mathbb{R}^{d \times d_k}$, and $W_i^{V} \in \mathbb{R}^{d \times d_v}$ are the network parameters of the $i$-th head, where $d_k$ denotes the feature dimension of $Q$ and $K$, and $d_v$ denotes the feature dimension of $V$.
$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Softmax}\left(\mathrm{Scale}(Q_i, K_i)\right)V_i = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right)V_i$$
The blue part in Figure 5 represents the multi-head concatenation operation. The outputs $\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_l$ of the $l$ heads are concatenated along the feature dimension to obtain the head matrix $\mathrm{Head}$:
$$\mathrm{Head} = [\mathrm{head}_1, \ldots, \mathrm{head}_l]$$
The red part in Figure 5 represents a linear projection applied to $\mathrm{Head}$ to obtain the multi-head self-attention output $\mathrm{MultiHead}(Q, K, V)$:
$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1, \ldots, \mathrm{head}_l]\, W^{O}$$
where $W^{O} \in \mathbb{R}^{l d_v \times d}$ is a network parameter used to ensure that the final output dimension matches the original input dimension while integrating information from the different heads.
The “Mean” in the gray part of Figure 5 indicates taking the mean of the output along the time dimension; “LayerNorm” indicates performing layer normalization on the output. The output of the multi-head self-attention MultiHead ( Q , K , V ) has a corresponding feature vector at each time step. However, in subsequent processing, only a single feature vector representing the entire sequence is needed. Therefore, it is also necessary to take the average value of MultiHead ( Q , K , V ) in the time dimension to improve computational efficiency. Finally, to enhance the model’s performance and training stability, layer normalization is performed on the averaged sequence to obtain the final output of the Transformer model structure.
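The computation above maps closely onto PyTorch's built-in multi-head attention, followed by the mean over the time dimension and layer normalization. This is a sketch: the model dimension and head count are illustrative, and it assumes the HOS feature has already been embedded into a (batch, sequence length, feature dimension) tensor, a step not detailed in the text.

```python
import torch
import torch.nn as nn


class TransformerBranch(nn.Module):
    """Multi-head self-attention over the input sequence, then mean pooling and LayerNorm."""

    def __init__(self, d_model: int = 256, num_heads: int = 4):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d) -- Q = K = V = X as in Equations (11)-(14)
        attn_out, _ = self.mhsa(x, x, x)
        pooled = attn_out.mean(dim=1)   # "Mean": average over the time dimension
        return self.norm(pooled)        # "LayerNorm": final feature vector of the branch
```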

3.3. Classification Layer

The classification layer can be divided into the following steps:
(1)
Fusing dual features: Multi-head attention is adopted to fuse time–frequency features and HOS features, resulting in a fused feature dimension of 256.
(2)
Linear transformation: Linear transformation is achieved through matrix multiplication and addition operations, mapping the 256-dimensional input features to 128 dimensions.
(3)
Nonlinear activation: The GELU activation function is applied to perform the nonlinear transformation on the output of the linear transformation, introducing nonlinearity to enable the model to learn more complex features.
(4)
Dropout regularization: During training, a dropout layer is introduced to randomly set a portion of input units to zero, preventing the model from overfitting.
(5)
Final classification: Linear transformation is implemented through matrix multiplication and addition operations, mapping the 128-dimensional input features to 8 dimensions to obtain the final classification results.
The parameter table for the entire classification layer is presented in Table 1.
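Steps (2)–(5) correspond one-to-one to the layers in Table 1, whose parameter counts can be verified as 256 × 128 + 128 = 32,896 and 128 × 8 + 8 = 1032. A minimal sketch follows; the dropout probability is not stated in the paper, so the value used here is a placeholder.

```python
import torch.nn as nn

classification_head = nn.Sequential(
    nn.Linear(256, 128),   # 256 x 128 weights + 128 biases = 32,896 parameters
    nn.GELU(),             # nonlinear activation, no parameters
    nn.Dropout(p=0.5),     # placeholder dropout probability
    nn.Linear(128, 8),     # 128 x 8 weights + 8 biases = 1032 parameters
)
```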

4. Results

4.1. Datasets

To verify the effectiveness and reliability of the algorithm, this paper uses received and transmitted signals from Danjiangkou Reservoir as the dataset. The dataset was collected at Danjiangkou Reservoir through multiple campaigns from 5 December 2023 to the end of July 2024. All data were acquired under clear weather conditions with minimal vessel activity. The deployment depth of the experimental transducer was approximately 10 m, and the experimental hydrophone was deployed at a depth of about 4 m. The basic parameters of the collected dataset are shown in Table 2:
Among them, each signal modulation type has 100 samples collected at each distance. The SNR estimation result for a transmission and reception distance of 5 km is 6 dB. Due to the absence of 2FSK data at 10 km, data from 15 km were used to replace the missing 10 km data to ensure an equal number of samples for each signal type during the training phase. Meanwhile, the 2FSK data at 15 km were excluded during the testing phase.
In this experiment, cross-entropy is used as the loss function, and its expression is:
$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\log p_{ij}$$
where $N$ is the number of samples in the batch, $C$ is the number of categories, $y_{ij}$ is the label of the $i$-th sample, and $p_{ij}$ is the predicted probability that the model assigns to the $i$-th sample belonging to the $j$-th category.
The batch size is set to 16, and the optimizer is AdamW with a learning rate of 0.0005. All of the above training and prediction processes are implemented in PyTorch 2.3.0.
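The stated training configuration (cross-entropy loss, AdamW with a learning rate of 0.0005, batch size of 16) corresponds to the following PyTorch setup. The model and data here are stand-in placeholders so the snippet runs on its own; in practice the DFRT network and the Danjiangkou features would take their place.

```python
import torch
import torch.nn as nn

# stand-in model and data: a 256-dim feature vector per sample, 8 modulation classes
model = nn.Sequential(nn.Linear(256, 128), nn.GELU(), nn.Linear(128, 8))
features = torch.randn(16, 256)          # one batch of 16 placeholder feature vectors
labels = torch.randint(0, 8, (16,))      # integer class indices for the 8 modulations

criterion = nn.CrossEntropyLoss()                           # cross-entropy loss above
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # AdamW, learning rate 0.0005

optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
```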

4.2. Analysis

The experimental content is divided into two parts: (1) comparing the performance of the DFRT model with other models; (2) verifying whether HOS can improve the recognition accuracy. Both parts use classification accuracy as the evaluation metric, and parameters such as the optimizer and the number of training epochs are the same, except for the different contents being compared.
To validate the effectiveness of the DFRT model for underwater acoustic signal modulation classification, this paper conducts comparative experiments using three models: traditional CNN, CNN–Transformer, and DFRT. The advantages and disadvantages of each model are summarized in Table 3, and the accuracy results are presented in Figure 6. As the CNN relies only on simple convolutional and fully connected layers, it struggles to effectively capture the complex features of underwater acoustic signals. In contrast, the DFRT optimizes gradient propagation through a residual structure and combines the long-range dependency modeling capability of Transformer, significantly enhancing multi-scale feature extraction and cross-domain correlation analysis abilities. This allows it to exhibit superior accuracy across all distance ranges. In short-distance ranges (1, 5, 10 km), the DFRT model achieves an average recognition accuracy of 98.89%, representing a 2.68% improvement over the other two models. In long-distance ranges (15, 18 km), the DFRT model still maintains an accuracy of 95.84%, which is 6.67% higher than the suboptimal CNN–Transformer model (89.91%).
Secondly, to verify the effectiveness of HOS in enhancing signal features, this paper conducts comparative experiments based on the DFRT model using TFR and TFR-HOS features. The experimental results are shown in Figure 7. In short-distance ranges, the accuracy of the two approaches is similar, but in long-distance ranges, the input of mixed features increases the classification accuracy to 95.84%, an improvement of 2.92% compared to the input of single features (92.92%). Through analysis, it can be concluded that HOS compensates for the insensitivity of time–frequency domain analysis to phase information by capturing the non-Gaussian characteristics of signals, especially retaining key phase information under low-signal-to-noise ratio conditions.
To visually observe the impact of HOS on signal feature enhancement, Figure 8 displays the confusion matrices of different inputs at 18 km. Rows represent predicted modulation types, and columns represent true modulation types. Labels 0–7 denote 2FSK, 4FSK, 2PSK, 4PSK, OFDM, DSSS, 4DSSS, and 9DSSS, respectively. The results show that after fusing HOS features, the DFRT model improves the recognition accuracy for multiple modulated signals such as 2FSK, 2PSK, and OFDM.

5. Conclusions and Future Work

5.1. Conclusions

This paper proposes a dual-feature ResNet–Transformer (DFRT) network model that systematically addresses modulation-style recognition for UWA communication signals. In the feature extraction module, experimental comparisons were conducted using three architectures: CNN, CNN–Transformer, and ResNet–Transformer. As shown in Figure 6, the ResNet–Transformer architecture demonstrates significant advantages in accuracy across different distances, achieving an average improvement of 2.92% while maintaining a baseline accuracy of 90%, which validates the effectiveness of the architectural enhancements. In the classification module, a dual-feature fusion framework combining the time–frequency domain and HOS was constructed and compared with a single time–frequency domain input. Figure 7 shows that the dual-feature fusion framework significantly improves the network's accuracy under low-SNR conditions. In addition, Figure 8 demonstrates that this architecture corrects previously misclassified OFDM and 2PSK signals, confirming its enhancement of phase-modulated signal recognition. Through these two key improvements, the integrated DFRT network achieves superior performance in UWA signal modulation recognition: it not only improves classification accuracy but also effectively addresses the gradient vanishing issue prevalent in previous models, and it has been applied to the underwater acoustic communication field, offering a robust solution for modulation recognition in complex aquatic environments.

5.2. Future Work

Future research can be pursued in the following directions:
(1)
Further enhance the model’s robustness in extremely long-distance and dynamic time-varying channels, and improve the extraction of weak phase features by incorporating semi-supervised learning techniques. This will address the challenges of low SNR and phase ambiguity in complex underwater environments.
(2)
Expand the coverage of modulation types and develop a multi-task learning framework to achieve joint optimization of modulation recognition and channel estimation. This integrated approach will enhance the model’s versatility across diverse communication protocols.
Ultimately, translating the technology into practical applications will involve catering to the specific needs of both military and civil sectors, ensuring its adaptability to real-world underwater acoustic scenarios.

Author Contributions

Conceptualization, F.Z.; methodology, F.Z.; software, F.Z.; validation, F.Z., H.W.; formal analysis, Z.Y.; investigation, H.L.; resources, F.Z.; data curation, H.W.; writing—original draft preparation, F.Z.; writing—review and editing, F.Z. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is derived from a specific project and cannot be publicly disclosed, as doing so would compromise the project's intellectual property rights and violate confidentiality agreements. We appreciate the reader's understanding of this constraint.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AMR: Automatic Modulation Recognition
CLDNN: Convolutional Long Short-Term Deep Neural Network
CNN: Convolutional Neural Network
DFRT: Dual-Feature ResNet–Transformer
DSSS: Direct Sequence Spread Spectrum
HOS: Higher-Order Statistics
LSTM: Long Short-Term Memory
MFSK: Multiple Frequency Shift Keying
MPSK: Multiple Phase Shift Keying
OFDM: Orthogonal Frequency Division Multiplexing
RNN: Recurrent Neural Network
SNR: Signal-to-Noise Ratio
SPWVD: Smooth Pseudo-Wigner–Ville Distribution
STFT: Short-Time Fourier Transform
TFR: Time–Frequency Representation
TSTR: Two-Stream Transformer
UWA: Underwater Acoustic
WT: Wavelet Transform

Notation List

The following notations are used in this manuscript:
Attention( ): Mechanism in neural networks that computes a weighted sum of input vectors based on their relevance to a query, enhancing the model's ability to focus on important features.
cum( ): Cumulant operator, used here for the fourth-order cumulants of a zero-mean complex process as defined in Section 2.
E( ): Mathematical expectation operator, denoting the average value of a random variable over its probability distribution.
MultiHead( ): Refers to multi-head attention, a technique that splits the input into multiple subspaces and applies parallel attention mechanisms to capture diverse patterns.
Softmax( ): Normalization function that converts a vector of raw scores into a probability distribution over multiple classes.
SPWVD( ): Smoothed pseudo-Wigner–Ville distribution, a time–frequency analysis method that combines the Wigner–Ville distribution with kernel smoothing to reduce cross-terms.
tanh( ): Hyperbolic tangent function, a sigmoidal activation function that maps inputs to the range (−1, 1), often used in recurrent neural networks.
Var( ): Variance operator, measuring the spread of a random variable's distribution around its expected value.

References

  1. Preisig, J. Acoustic propagation considerations for underwater acoustic communications network development. ACM SIGMOBILE Mob. Comput. Commun. Rev. 2007, 11, 2–10. [Google Scholar]
  2. Khan, M.R.; Das, B.; Pati, B.B. Channel estimation strategies for underwater acoustic (UWA) communication: An overview. J. Frankl. Inst. 2020, 357, 7229–7265. [Google Scholar]
  3. Nvzhi, T.; Zeng, Q.; Zeng, Q.; Xu, Q. Research on development and application of underwater acoustic communication system. J. Phys. Conf. Ser. 2020, 1617, 012036. [Google Scholar]
  4. Fang, S. Principles and Techniques of Underwater Acoustic Reconnaissance; Science Press: Beijing, China, 2023; pp. 151–178. [Google Scholar]
  5. Yao, X.; Yang, H.; Sheng, M. Automatic modulation classification for underwater acoustic communication signals based on deep complex networks. Entropy 2023, 25, 318. [Google Scholar] [CrossRef] [PubMed]
  6. Swami, A.; Sadler, B.M. Hierarchical digital modulation classification using cumulants. IEEE Trans. Commun. 2000, 48, 416–429. [Google Scholar]
  7. Nadya, H.K.; Mansour, A.; Nordholm, S. Classification of digital modulated signals based on time frequency representation. In Proceedings of the 2010 4th International Conference on Signal Processing and Communication Systems, Gold Coast, QLD, Australia, 13–15 December 2010. [Google Scholar]
  8. Xu, Z.; Wu, X.; Gao, D.; Su, W. Blind modulation recognition of UWA signals with semi-supervised learning. In Proceedings of the 2022 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC), Xi’an, China, 25–27 October 2022. [Google Scholar]
  9. Singh, B.; Jindal, P.; Verma, P.; Sharma, V.; Prakash, C. Automatic Modulation Recognition Using Modified Convolutional Neural Network. In Proceedings of the 2025 3rd International Conference on Device Intelligence, Computing and Communication Technologies (DICCT), Dehradun, India, 21–22 March 2025. [Google Scholar]
  10. Beeharry, Y.; Emilien, D.G.D. A Modified Convolutional Neural Network Model for Automatic Modulation Classification. In Proceedings of the 2025 Emerging Technologies for Intelligent Systems (ETIS), Dehradun, India, 21–22 March 2025. [Google Scholar]
  11. Ke, Z.; Vikalo, H. Real-time radio technology and modulation classification via an LSTM auto-encoder. IEEE Trans. Wirel. Commun. 2021, 21, 370–382. [Google Scholar] [CrossRef]
  12. Dampage, U.; Amarasooriya, R.; Samarasinghe, S.M.; Karunasingha, N.A. Combined Classifier-Demodulator Scheme Based on LSTM Architecture. Wirel. Commun. Mob. Comput. 2022, 2022, 5584481. [Google Scholar]
  13. Zhang, B.; Chen, G.; Jiang, C. Research on modulation recognition method in low SNR based on LSTM. J. Phys. Conf. Ser. 2022, 2189, 012003. [Google Scholar] [CrossRef]
  14. Zhang, W.; Yang, X.; Leng, C.; Wang, J.; Mao, S. Modulation recognition of underwater acoustic signals using deep hybrid neural networks. IEEE Trans. Wirel. Commun. 2022, 21, 5977–5988. [Google Scholar] [CrossRef]
  15. West, N.E.; O'Shea, T. Deep architectures for modulation recognition. In Proceedings of the 2017 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), Baltimore, MD, USA, 6–9 March 2017. [Google Scholar]
  16. Cai, J.; Gan, F.; Cao, X.; Liu, W. Signal modulation classification based on the transformer network. IEEE Trans. Cogn. Commun. Netw. 2022, 8, 1348–1357. [Google Scholar] [CrossRef]
  17. Zhang, W.; Xue, K.; Yao, A.; Sun, Y. CTRNet: An Automatic Modulation Recognition Based on Transformer-CNN Neural Network. Electronics 2024, 13, 3408. [Google Scholar] [CrossRef]
  18. Li, J.; Jia, Q.; Cui, X.; Gulliver, T.A.; Jiang, B.; Li, S.; Yang, J. Automatic modulation recognition of underwater acoustic signals using a two-stream transformer. IEEE Internet Things J. 2024, 11, 18839–18851. [Google Scholar] [CrossRef]
  19. Pang, C.; Wang, F.; Chen, M.; Liu, Y.; Dong, Y. Modulation Recognition of Underwater Acoustic Signals Using Dynamic Dilated Convolutional Neural Network and Transformer. In Proceedings of the 2024 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Zhuhai, China, 22–24 November 2024. [Google Scholar]
  20. Shin, D.-M.; Park, D.-H.; Kim, H.-N. Deep Learning-Based Modulation Recognition with Multi-Scale Temporal Feature Extraction. In Proceedings of the 2025 International Conference on Information Networking (ICOIN), Chiang Mai, Thailand, 15–17 January 2025. [Google Scholar]
  21. Dong, Y.; Zhai, R.; Wang, B.; Zhong, Y.; Rong, Z.; Wang, Y. A Novel Distributed Solution for Automatic Modulation Classification Based on Federated Learning and Modified LSTM. IEEE Trans. Veh. Technol. 2025, 1–13. [Google Scholar] [CrossRef]
  22. Cohen, L. Time-Frequency Analysis: Theory and Applications; Pearson College Div: Toronto, ON, Canada, 1995. [Google Scholar]
  23. Xie, W.; Hu, S.; Yu, C.; Zhu, P.; Peng, X.; Ouyang, J. Deep learning in digital modulation recognition using high order cumulants. IEEE Access 2019, 7, 63760–63766. [Google Scholar] [CrossRef]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  25. Chen, J.; Zhao, R. Wireless Signal Recognition Based on ResNet-Transformer. In Proceedings of the 2023 11th International Conference on Intelligent Computing and Wireless Optical Communications (ICWOC), Chongqing, China, 16–18 June 2023. [Google Scholar]
Figure 1. TFR of UWA signal. (a) 2FSK; (b) 4FSK; (c) 2PSK; (d) 4PSK; (e) OFDM; (f) DSSS; (g) 4DSSS; (h) 9DSSS.
Figure 2. The model structure of the DFRT.
Figure 3. The structure of the residual block in the ResNet.
Figure 4. The model structure of ResNet.
Figure 5. The model structure of Transformer.
Figure 6. Classification accuracy of different models.
Figure 7. Classification accuracy of different input features.
Figure 8. Confusion matrix for different inputs at 18 km. (a) TFR; (b) TFR-HOC.
Table 1. The parameter table for the entire classification layer.
Layer (Type) | Input Shape | Output Shape | Param
Linear | [1, 256] | [1, 128] | 32,896
GeLu | [1, 128] | [1, 128] | 0
Dropout | [1, 128] | [1, 128] | 0
Linear | [1, 128] | [1, 8] | 1032
Table 2. Dataset parameters.
Modulations | 2FSK, 4FSK, 2PSK, 4PSK, OFDM, DSSS, 4DSSS, 9DSSS
Signal format | time domain signal
Signal dimension | 1 × 4800
Distance (km) | 1; 5; 10; 15; 18
Total number of samples | 4000
Training sets : test sets | 7:3
Table 3. The advantages and disadvantages of each model.

CNN
Advantages: (1) Simple architecture: characterized by a small parameter size and high computational efficiency, making it suitable for lightweight scenarios. (2) Local feature sensitivity: capable of capturing local structures in TFR (e.g., frequency-hopping patterns) with reasonable accuracy.
Disadvantages: (1) Inadequate long-range dependency modeling: fails to effectively capture global temporal or frequency domain correlations in signals (e.g., periodicity of modulation patterns). (2) Weak phase information extraction: results in low classification accuracy for phase-modulated signals (e.g., 2PSK, 4PSK) due to the lack of nonlinear phase feature modeling capability. (3) Gradient degradation issue: prone to gradient vanishing in deep networks, limiting the ability to learn complex features.

CNN–Transformer
Advantages: (1) Enhanced global feature capture: compared to a single CNN, it can extract long-range dependencies through the Transformer's self-attention mechanism. (2) Attempt at multi-modal feature fusion: integrates TFR with shallow statistical features to improve recognition accuracy for complex modulations.
Disadvantages: (1) High design complexity: the hybrid architecture requires coordinating parameters between convolutional and Transformer modules, increasing training complexity and potentially necessitating larger datasets or more sophisticated parameter tuning strategies. (2) Unresolved gradient degradation: training difficulties persist in deep networks due to the absence of residual connections.

DFRT
Advantages: (1) Effective gradient degradation mitigation: the ResNet residual connections ensure training stability in deep networks. (2) Enhanced phase feature extraction: HOS directly captures signal non-Gaussianity and phase information, compensating for the insensitivity of TFR to phase. (3) Global-local feature collaboration: the Transformer models long-range dependencies while ResNet extracts local details, jointly enhancing robustness in complex scenarios. (4) Dynamic feature fusion: the multi-head attention mechanism automatically adjusts feature weights according to task requirements, avoiding the limitations of fixed fusion.
Disadvantages: (1) Higher computational complexity: the dual-branch structure and multi-head attention mechanism result in a larger parameter size than single CNN or Transformer models. (2) Strong data dependency: HOS extraction depends on high-quality signal preprocessing, requiring more complex denoising under low-SNR conditions.
