Article

LW-MS-LFTFNet: A Lightweight Multi-Scale Network Integrating Low-Frequency Temporal Features for Ship-Radiated Noise Recognition

1 Key Laboratory of Modern Acoustics, Institute of Acoustics, Nanjing University, Nanjing 210093, China
2 Wuhan Digital Engineering Research Institute, Wuhan 430074, China
3 Wuhan Lingjiu Microelectronics Co., Ltd., Wuhan 430074, China
4 Department of Computer Science, Hubei University of Technology, Wuhan 430068, China
5 Department of Electrical and Electronic Engineering, Guangdong Polytechnic College, Zhaoqing 526100, China
6 School of Electronics Science and Engineering, Nanjing University, Nanjing 210093, China
7 Department of Physics, Nanjing University, Nanjing 210093, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
J. Mar. Sci. Eng. 2025, 13(11), 2073; https://doi.org/10.3390/jmse13112073
Submission received: 30 September 2025 / Revised: 27 October 2025 / Accepted: 29 October 2025 / Published: 31 October 2025
(This article belongs to the Section Ocean Engineering)

Abstract

Ship-radiated noise (SRN) recognition is vital for underwater acoustics, with applications in both military and civilian fields. Traditional manual recognition by sonar operators is inefficient and error-prone, motivating the development of automated recognition systems. However, most existing deep learning approaches demand high computational resources, limiting their deployment on resource-constrained edge devices. To overcome this challenge, we propose LW-MS-LFTFNet, a lightweight model informed by time-frequency analysis of SRN that highlights the critical role of low-frequency components. The network integrates a multi-scale depthwise separable convolutional backbone with CBAM attention for efficient spectral representation, along with two LSTM-based modules to capture temporal dependencies in low-frequency bands. Experiments on the DeepShip dataset show that LW-MS-LFTFNet achieves 75.04% accuracy with only 0.85 M parameters, 0.38 GMACs, and 3.27 MB of storage, outperforming representative lightweight architectures. Ablation studies further confirm that low-frequency temporal modules contribute complementary gains, improving accuracy by 2.64% with minimal overhead. Guided by domain-specific priors derived from time-frequency pattern analysis, LW-MS-LFTFNet achieves efficient and accurate SRN recognition with strong potential for edge deployment.

1. Introduction

Ship-radiated noise recognition is a critical research area in underwater acoustics, with significant military and civilian applications. Traditionally, this task has relied on the judgement of sonar operators, which requires considerable time and resources for training. Additionally, sonar operators are susceptible to errors in recognition due to both human factors and environmental influences [1]. These challenges underscore the urgent need for a robust and efficient automated recognition system. While substantial progress has been made in this area [2,3,4,5], most existing methods struggle in practical scenarios, as they are designed for high-performance computing environments without considering the strict limitations in the computational power and battery life of underwater devices [6,7]. Consequently, these methods are unsuitable for deployment on edge platforms. To address these limitations, it is essential to develop lightweight models that are both robust and efficient, which are specifically designed to meet the resource constraints of edge devices.
Ship-radiated noise recognition involves two key steps: feature extraction and classifier design. Accurate recognition relies heavily on discriminative features [8], which are typically categorized into three types: time-domain, frequency-domain, and time-frequency-domain features. Time-frequency analysis combines the advantages of time- and frequency-domain analyses, providing a comprehensive view of signal variations [9]. These features are subsequently input into classifiers, which are commonly built on classical machine learning or deep learning models. Although many methods based on classical machine learning have been proposed [10,11,12], the limited capacity of these models restricts their ability to capture the non-stationary nature of underwater signals and the nonlinear relationships introduced by multipath propagation and Doppler effects [13,14]. These limitations have led to a growing interest in deep learning, which offers stronger feature representation [15], greater noise resilience, and better scalability for real-world applications. Researchers have utilized a wide range of features to propose numerous deep learning-based models for ship-radiated noise recognition. Commonly used models include Convolutional Neural Networks (CNNs) [16], Recurrent Neural Networks (RNNs), Transformers [17], and their variants. To eliminate preprocessing, raw time-domain signals can serve directly as model inputs [5,18,19,20,21,22]. These methods have demonstrated effectiveness in waveform feature learning [23] and are well suited for end-to-end recognition. However, time domain-based methods are susceptible to noise interference and incur high computational costs, which present challenges in real-world applications.
Compared with raw time-domain signals, time-frequency features exhibit superior noise robustness and can more accurately capture the non-stationary nature of underwater acoustic signals [24]. A number of deep learning methods utilizing time-frequency features and their fusion have been proposed. For example, Wang et al. [25] employed STFT spectrograms as inputs to a multi-branch network with a convolutional attention block. Lin et al. [26] input a feature matrix into a ResNet18 network, which was derived by applying STFT to both the real and imaginary components of the original signal’s spectrogram. Likewise, Zhang et al. [27] fused the bispectrum with the STFT amplitude and phase spectra, and used the combined features as input to an integrated neural network. Zhang et al. [28] combined the Mel spectrum, MFCC, and their first- and second-order derivatives (MFCC-3D) with the STFT, and fed the fused features into a CNN-LSTM network. Hu et al. [14] developed a recognition model based on multi-scale convolution, which takes MFCC-3D features as input and employs channel attention for feature fusion. It is worth noting that in [14,28], LSTMs were used as encoders rather than for temporal feature extraction. Zhu et al. [1] extracted CQT and Mel spectrum features corresponding to propeller rotation noise, singing noise, and cavitation noise, respectively. These features were fused with adaptive weights and input into a CNN, achieving high recognition accuracy. To model temporal dependencies in underwater signals, Xu et al. [29] introduced a Bi-LSTM branch to capture long-term dependencies from Mel spectrograms, thereby enhancing recognition performance. Li et al. [30] were the first to apply Transformers in underwater acoustics to capture long-range dependencies in spectrograms, while Chen et al. [31] adopted the Swin Transformer as a more efficient alternative, substantially reducing the computational cost of multi-head attention.
Methods combining deep learning models with time-frequency features have achieved significant success in underwater acoustic signal recognition. However, most of these approaches are designed for high-performance computing environments, resulting in high computational complexity and memory demands that limit their deployment on resource-constrained edge devices. For instance, the model in [1] employed a ResNet18 backbone and four classifiers, leading to substantial computational overhead due to the complexity of its feature maps. Similarly, the bidirectional LSTM introduced in [29] proportionally increases computational costs as sequence length grows, thereby reducing real-time performance. Transformer-based methods, such as those in [30,31], also require considerable computational resources due to the multi-head attention mechanism. Given these challenges, there is a clear need for lightweight models that maintain high performance with reduced computational complexity and memory usage for deployment on resource-constrained edge devices [32]. Although several attempts have been made [23,33], research on lightweight models remains in its early stages.
To further advance the development of lightweight models for underwater signal recognition, we propose a novel lightweight network named LW-MS-LFTFNet, which is specifically designed around the time-frequency characteristics of ship-radiated noise. Our method effectively leverages time-frequency analysis and domain-specific knowledge to design a specialized lightweight feature-extraction backbone. Additionally, low-frequency temporal features are integrated as a key supplement, enabling high recognition accuracy with a reduced parameter count and computational cost. During testing, we ensured that there was no data leakage or contamination. The primary contributions of our work are as follows:
1. We propose a novel lightweight multi-scale (LW-MS) backbone based on depthwise separable convolutions, specifically tailored to the time-frequency characteristics of underwater acoustic signals. This backbone outperforms existing mainstream lightweight architectures in terms of both performance and efficiency.
2. We introduce two LSTM-based temporal modules to effectively incorporate low-frequency temporal features (LFTF) into the model. This approach enhances the model’s performance by capturing temporal dependencies in spectrograms, with only a slight increase in computational cost and parameter count.
3. LW-MS-LFTFNet achieves state-of-the-art performance among lightweight models, with a recognition accuracy of 75.04%, using only 0.85 M parameters, 0.38 GMACs, and 3.27 MB of storage. This optimal balance between accuracy and model complexity makes our model highly suitable for deployment on resource-constrained underwater edge platforms.
The remainder of this article is organized as follows. Section 2 provides an overview of our ship-radiated noise classification method in detail. Section 3 presents the experimental setup. Section 4 discusses the results and analysis, and Section 5 concludes the paper.

2. Materials and Methods

This section introduces the proposed lightweight model. Initially, time-frequency analysis is conducted on ship-radiated noise signals from the DeepShip dataset (School of Software, Northwestern Polytechnical University, Xi’an, China). The prior distribution patterns of time-frequency features are then summarized. Finally, the overall framework and the detailed structure of each module are presented.

2.1. Time-Frequency Pattern Analysis

Directly utilizing the time-frequency map of SRNs as input to CNNs may overlook the inherent characteristics of a ship’s operational dynamics. Therefore, a detailed analysis of the time-frequency spectrum of SRNs is crucial, as it can reveal essential domain-specific prior knowledge that can inform the design of lightweight models.
SRN is primarily composed of three components: mechanical noise, propeller noise, and hydrodynamic noise [1]. Hydrodynamic noise, generated by the interaction between water flow and the hull, is inherently irregular; its intensity fluctuates primarily with the ship’s speed, and it is often overshadowed by the louder mechanical and propeller noise. Mechanical noise stems from vibrations in the ship’s prime movers and auxiliary machinery, including diesel engines, main electric motors, generators, pumps, and air compressors. The spectrum of mechanical noise consists of strong line spectra and weak continuous spectra, predominantly occupying the low-frequency range of SRN. Propeller noise, produced by the interaction between the rotating propeller and the fluid, comprises three main components: propeller cavitation noise, singing noise, and blade rotation noise. Propeller cavitation noise arises from the collapse of numerous bubbles of varying sizes. This noise presents a continuous spectrum and is primarily distributed in the high-frequency range of the radiated noise, with spectral peaks typically within the 100–1000 Hz range. Singing noise is induced by vortex shedding that excites resonance in the propeller blades, resulting in an intense low-frequency line spectrum within the 100–1000 Hz range. Blade rotation noise is generated by the periodic cutting of the fluid by the rotating propeller blades, presenting as a low-frequency line spectrum with frequencies ranging from 1 to 100 Hz. This type of noise satisfies the following relationship:
$$ f_m = m \cdot n \cdot s $$
where $m$ is the harmonic order, $n$ is the number of propeller blades, $s$ is the propeller rotation speed (r/s), and $f_m$ is the frequency corresponding to the $m$-th harmonic. These characteristics serve as a foundation for model design and target recognition.
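For illustration, a minimal numerical sketch of this relation with hypothetical values (a four-blade propeller at 3 r/s; these values are not taken from the dataset):

```python
# Blade-rate harmonics f_m = m * n * s for a hypothetical 4-blade
# propeller rotating at 3 r/s; all lines fall below 100 Hz.
n_blades = 4                 # n: number of propeller blades
rot_speed = 3.0              # s: rotation speed (r/s)
harmonics = [m * n_blades * rot_speed for m in range(1, 6)]
print(harmonics)             # [12.0, 24.0, 36.0, 48.0, 60.0]
```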
To intuitively demonstrate and analyze the spectrum of ship-radiated noise, four samples from different ships are randomly selected from the open-source DeepShip dataset. Each segment is three seconds in length. The Mel spectrograms of these segments are shown in Figure 1. As observed in this figure, the majority of the energy is concentrated in the 0–1000 Hz band, with relatively weak continuous spectra appearing above 1000 Hz, leading to an overall irregular energy distribution. To further detail and compare the energy distribution characteristics of radiated noise from different ships, Mel spectrograms within the 1–100 Hz and 100–1000 Hz bands are calculated, as shown in Figure 2.
A comparative analysis of Figure 1 and Figure 2 reveals distinct energy distribution patterns in the radiated noise signals of different ships within the 0–1000 Hz frequency band. Specifically, the radiated noise from the cargo ship shows low energy in the 1–100 Hz range, characterized by a weak line spectrum. Within the 100–1000 Hz range, strong line spectra are superimposed on a relatively weak continuous spectrum. For the passenger ship, spectral energy concentrates in the 1–100 Hz range and around 1000 Hz, dominated by a continuous spectrum, with minimal energy at other frequencies. Notably, its spectrogram also exhibits significant temporal fluctuations: energy in the 1–100 Hz range is more concentrated around 1 s and 2.5 s, while energy at other times remains low. The oil tanker’s radiated noise energy is primarily focused below 500 Hz, with weak line and continuous spectra emerging above 500 Hz. Below 500 Hz, the spectral energy is relatively uniformly distributed across the 20–120 Hz range, exhibiting a strong continuous spectrum. The energy of the tug’s spectrum is weak within the 1–70 Hz range, with a combination of weak line and continuous spectra around 100 Hz. In the 100–1000 Hz range, the energy is concentrated around 512 Hz, showing an overall superposition of strong line and continuous spectra.
In this paper, the Mel spectrogram is used as the primary input to the network. To capture the dynamic characteristics of the spectrogram, the first- and second-order derivatives are also computed, which can enhance feature discriminability [28]. The feature extraction process is illustrated in Figure 3. To mitigate the impact of noise in the high-frequency range, the original signal is downsampled to 16,000 Hz. Subsequently, the Mel-3D spectrogram within the 1–8000 Hz range is computed as the primary input feature for the model. Since the low-frequency bands contain abundant information about SRN, Mel spectrograms in the 1–100 Hz and 100–1000 Hz ranges are extracted as auxiliary input features. These features provide a comprehensive representation of underwater acoustic signals by combining static, dynamic, and multi-band spectral information. The following subsection describes the architecture of the LW-MS-LFTFNet, which is designed to leverage this prior knowledge for underwater target recognition.
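A minimal sketch of this extraction pipeline with torchaudio, assuming the full-band settings given later in Table 2 (16 kHz input, n_fft = 4096, hop length 512, 513 Mel filter banks over 1–8000 Hz); the log compression is an assumption, as the exact scaling is not stated here:

```python
import torch
import torchaudio

# Mel-3D: log-Mel spectrogram plus its first- and second-order derivatives.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=4096, hop_length=512,
    n_mels=513, f_min=1.0, f_max=8000.0,
)

waveform = torch.randn(1, 16000 * 3)                   # placeholder 3-s signal
spec = torch.log(mel(waveform) + 1e-6)                 # [1, 513, 94]
delta1 = torchaudio.functional.compute_deltas(spec)    # 1st-order derivative
delta2 = torchaudio.functional.compute_deltas(delta1)  # 2nd-order derivative
mel_3d = torch.cat([spec, delta1, delta2], dim=0)      # [3, 513, 94]
```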

2.2. The Design of the LW-MS-LFTFNet Model

This subsection introduces the framework of the proposed LW-MS-LFTFNet model, which is designed based on the time-frequency distribution characteristics of SRNs. As shown in Figure 4, the model consists of a multi-scale convolution backbone and two branches. The backbone processes Mel-3D inputs with depthwise separable convolutions, while CBAM modules emphasize informative spectral components such as line and continuous spectra. The two branches target distinct frequency ranges (1–100 Hz and 100–1000 Hz), where LSTM networks are employed to capture temporal dependencies. This frequency-aware design leverages domain priors: most of the energy is concentrated in the low-frequency band, whereas the high-frequency region carries only a small portion of the total energy. The outputs from the three pathways are concatenated and subsequently fed into a linear classifier for recognition. To further improve robustness, mixup augmentation is employed during training. The following subsections detail the model’s key components and design principles.
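To make the data flow concrete, a minimal sketch of the final fusion step, with hypothetical feature dimensions (the actual sizes depend on the backbone and branch configurations detailed in the following subsections):

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenates the three pathway outputs and applies a linear classifier."""

    def __init__(self, backbone_dim=512, branch_dim=128, num_classes=4):
        super().__init__()
        self.fc = nn.Linear(backbone_dim + 2 * branch_dim, num_classes)

    def forward(self, backbone_feat, lf_feat_a, lf_feat_b):
        # backbone_feat: flattened Mel-3D features from the CNN backbone;
        # lf_feat_a / lf_feat_b: outputs of the 1-100 Hz and 100-1000 Hz branches.
        fused = torch.cat([backbone_feat, lf_feat_a, lf_feat_b], dim=1)
        return self.fc(fused)
```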

2.2.1. Multi-Scale Convolution Backbone with Depthwise Separable Convolutions

The structure of the multi-scale convolution backbone is illustrated in Figure 5. The Mel-3D input is processed through three parallel branches employing 3×3, 5×5, and 7×7 depthwise separable convolutions, each followed by ReLU activation, batch normalization, and AvgPool2D. Feature fusion is achieved via two sequential channel concatenation operations, with CBAM modules integrated in the second stage, and the backbone concludes with downsampling layers. This design is motivated by the inhomogeneous energy distribution of SRNs: relying solely on single-scale 3×3 convolutions, which have limited receptive fields, is insufficient to capture both fine-grained structures and broader spectral patterns, resulting in inevitable feature loss across the temporal and frequency dimensions.
To overcome this limitation, we draw inspiration from the Inception module [34] and introduce multi-scale kernels to extract features at different receptive fields. Specifically, larger kernels capture macro-level spectral characteristics, such as the inhomogeneous energy distribution, while smaller kernels focus on fine-grained line structures and localized details. Concatenating these multi-scale representations along the channel dimension enables the network to aggregate complementary information, enriching feature diversity and mitigating potential information loss.
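A minimal sketch of one such multi-scale stage, assuming the branch layout described above (depthwise separable convolutions are detailed in the next paragraphs); the 32-channel branch width follows the text, while other details are assumptions:

```python
import torch
import torch.nn as nn

def dsc(c_in, c_out, k):
    """Depthwise separable convolution: per-channel KxK conv, then 1x1 conv."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, 1, bias=False),
    )

class MultiScaleStage(nn.Module):
    def __init__(self, c_in, c_branch=32):
        super().__init__()
        # Three parallel branches with kernel sizes 3, 5, and 7, each followed
        # by ReLU, batch normalization, and average pooling.
        self.branches = nn.ModuleList(
            nn.Sequential(dsc(c_in, c_branch, k), nn.ReLU(),
                          nn.BatchNorm2d(c_branch), nn.AvgPool2d(2))
            for k in (3, 5, 7)
        )

    def forward(self, x):  # x: [B, c_in, freq, time]
        # Concatenate branch outputs along the channel dimension.
        return torch.cat([b(x) for b in self.branches], dim=1)
```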
In this study, each branch is configured with 32 channels to enhance feature extraction capacity. However, the channel concatenation operation after the first multi-scale convolution module increases both the parameter count and the computational cost, which slows convergence, elevates hardware requirements, and raises the risk of overfitting, contradicting the lightweight design goal. To address this trade-off, we employ depthwise separable convolutions (DSCs) [35,36,37], which factorize standard convolutions into depthwise and pointwise operations. This decomposition substantially reduces parameter complexity and computational load while maintaining essential feature extraction capabilities. The processing procedure of depthwise separable convolution is illustrated in Figure 6. Depthwise convolution is first applied independently to each input channel, followed by a 1×1 pointwise convolution to restore inter-channel interactions. For an input feature map of size $H \times W \times C_{\text{in}}$ producing an output feature map of size $H \times W \times C_{\text{out}}$ with a kernel of size $K \times K$, the parameter count is
$$ K \times K \times C_{\text{in}} + 1 \times 1 \times C_{\text{in}} \times C_{\text{out}} $$
The computational cost of depthwise separable convolution is:
$$ \left( K \times K \times C_{\text{in}} + 1 \times 1 \times C_{\text{in}} \times C_{\text{out}} \right) \times H \times W $$
where $C_{\text{in}}$ and $C_{\text{out}}$ denote the numbers of input and output channels, respectively, and $H$ and $W$ represent the height and width of the feature maps. For standard convolutions under the same conditions, the parameter count is:
$$ K \times K \times C_{\text{in}} \times C_{\text{out}} $$
The computational cost of standard convolution is:
$$ K \times K \times C_{\text{in}} \times C_{\text{out}} \times H \times W $$
The ratios of parameter count and of computational cost between the two are both given by:
$$ \frac{1}{C_{\text{out}}} + \frac{1}{K^2} $$
where $K$ is set to 3, 5, and 7 in this study. As this comparison shows, depthwise separable convolution significantly reduces both parameter count and computational complexity. This enables efficient edge deployment while adhering to the lightweight design principles. The following subsection details the structure and function of the CBAM utilized in this study.
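As a quick numerical check of this ratio, a minimal PyTorch sketch with illustrative values ($C_{\text{in}} = C_{\text{out}} = 32$, $K = 5$):

```python
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

dsc = nn.Sequential(
    nn.Conv2d(32, 32, 5, padding=2, groups=32, bias=False),  # depthwise
    nn.Conv2d(32, 32, 1, bias=False),                        # pointwise
)
std = nn.Conv2d(32, 32, 5, padding=2, bias=False)            # standard conv

# (5*5*32 + 32*32) / (5*5*32*32) = 0.07125, i.e., roughly a 14x reduction.
print(count_params(dsc) / count_params(std))  # 0.07125
print(1 / 32 + 1 / 5 ** 2)                    # 0.07125
```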
Figure 6. Processing procedure of Depthwise Separable Convolution.

2.2.2. CBAM Lightweight Attention Mechanism

The Convolutional Block Attention Module (CBAM), introduced in 2018, was originally designed for image classification and object detection tasks [38]. Its lightweight architecture, characterized by minimal computational overhead and low parameter usage, aligns well with our emphasis on lightweight design. In this paper, we pioneer its application in underwater acoustic signal recognition, aiming to enhance feature discriminability and suppress noise.
As shown in Figure 4, CBAM is integrated into the second stage of the backbone, serving two primary functions: focusing on critical spectral components and identifying salient spatiotemporal domains. Figure 7 illustrates the basic structure of CBAM, which consists of a Channel Attention Module (CAM) and a Spatial Attention Module (SAM). Specifically, the CAM first aggregates spatial information from abstract features using global max and average pooling, compressing spatial dimensions into channel-wise statistical signatures that preserve essential SRN characteristics. A shared Multi-Layer Perceptron (MLP) and sigmoid activation then generate channel-wise attention weights to amplify channels that encode key discriminative information while suppressing noise-dominated channels.
Building upon the channel-wise refinement performed by CAM, the SAM processes the reweighted feature maps in two key steps. First, it concatenates the outputs of channel-wise max pooling and average pooling to form a compact spatial descriptor. This descriptor is then passed through a convolutional layer designed to capture broader contextual patterns. Finally, a sigmoid activation is applied to generate the spatial attention weights. These weights highlight spatial regions containing task-relevant spectral details while suppressing irrelevant spatial noise.
By combining these sequential operations, this two-stage structure enables CBAM to hierarchically refine abstract spectral features from both channel and spatial dimensions without significantly increasing computational burden. This balance of effectiveness and efficiency aligns with our lightweight design philosophy, enhancing the network’s ability to focus on discriminative information within the features of SRN while maintaining a manageable model size and inference speed.
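For concreteness, a minimal CBAM sketch following the original formulation in [38]; the reduction ratio of 16 and the 7×7 spatial kernel are the defaults of that paper, not values specified here:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP of the CAM
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: global average + max pooling -> shared MLP.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: channel-wise max + mean -> 7x7 convolution.
        desc = torch.cat([x.amax(dim=1, keepdim=True),
                          x.mean(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(desc))
```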

2.2.3. LSTM Network for Low-Frequency Temporal Feature Extraction

While the multi-scale convolution module effectively captures frequency-domain distributions, it fails to model temporal dependencies in underwater acoustic signals. As a result, the model misses crucial information about how underwater signals evolve over time. To overcome this limitation, we integrate Long Short-Term Memory (LSTM) networks into our model to enhance its performance. Specifically, Mel spectrograms from the 1–100 Hz and 100–1000 Hz frequency bands are fed into two distinct LSTM branches, enabling the model to effectively capture low-frequency temporal dependencies and detailed spectral features, while minimizing model parameters and computational cost.
The LSTM consists of a sequence of memory cells that share a common set of parameters, ensuring consistent processing across time steps while reducing model complexity. As depicted in Figure 8, each memory cell contains three key gating mechanisms: the forget gate, the input gate, and the output gate. The forget gate determines which information to discard from the cell state, using a sigmoid activation function to output values between 0 and 1, where a value closer to 1 indicates retention of the information, and a value near 0 denotes that it should be forgotten. The input gate controls the incorporation of new information into the cell state by combining a sigmoid layer for identifying eligible values and a tanh layer to generate candidate values. The element-wise multiplication of these outputs updates the cell state. The output gate determines the next hidden state by selecting portions of the cell state through a sigmoid layer, and processing the result through a tanh function to scale it between −1 and 1. This unique gating mechanism allows the network to preserve critical temporal features and capture long-term dependencies in underwater acoustic signals.
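For reference, the gating computations just described follow the standard LSTM formulation, where $\sigma$ denotes the sigmoid function, $\odot$ element-wise multiplication, $x_t$ the input, $h_t$ the hidden state, and $C_t$ the cell state:

$$
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right), \qquad
i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right), \\
\tilde{C}_t &= \tanh\left(W_C [h_{t-1}, x_t] + b_C\right), \qquad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right), \qquad
h_t = o_t \odot \tanh(C_t).
\end{aligned}
$$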
The working flow of the LSTM module is illustrated in Figure 9. As depicted, the frequency intensity distribution vector at each time frame of the Mel spectrogram is sequentially fed into the LSTM module. To improve the model’s feature extraction capability, we implement a two-layer LSTM, where the output of the first layer serves as the input to the second layer. This architecture allows the model to effectively capture the long-term temporal dependencies of the low-frequency spectrograms, thereby enhancing performance. The hidden and cell states at the final time step of the second LSTM layer are used as the output, which is then concatenated with the flattened output from the backbone network. This strengthens the model’s ability to capture the dynamic details of the signal and aligns with the design philosophy of our lightweight model.
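A minimal sketch of one low-frequency branch, assuming a hypothetical hidden size (not stated here); the input is a sub-band Mel spectrogram of shape [batch, n_mels, n_frames], e.g., [B, 20, 94] for the 1–100 Hz band:

```python
import torch
import torch.nn as nn

class LowFreqTemporalBranch(nn.Module):
    def __init__(self, n_mels, hidden=64):
        super().__init__()
        # Two stacked LSTM layers; the first layer's output feeds the second.
        self.lstm = nn.LSTM(input_size=n_mels, hidden_size=hidden,
                            num_layers=2, batch_first=True)

    def forward(self, spec):
        # One frequency-intensity vector per time frame: [B, n_frames, n_mels].
        seq = spec.transpose(1, 2)
        _, (h, c) = self.lstm(seq)
        # Hidden and cell states of the last layer at the final time step.
        return torch.cat([h[-1], c[-1]], dim=1)  # [B, 2*hidden]
```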

3. Experimental Setup

3.1. Dataset

We evaluated the performance of the proposed LW-MS-LFTFNet using the DeepShip dataset [4], which consists of ship-radiated noise recorded at the Strait of Georgia delta node between 2 May 2016 and 4 October 2018. It includes real underwater recordings from 265 distinct vessels, spanning four categories: cargo ships, tugs, passenger ships, and oil tankers, with a total duration of 47 h and 4 min. All samples are single-channel signals originally sampled at 32,000 Hz. To reduce computational load and mitigate the impact of high-frequency noise, we resampled all recordings to 16,000 Hz. The dataset was split according to the method described in [1], where each recording was segmented into non-overlapping 3-s samples. To avoid possible data contamination, samples from the same recording were exclusively assigned to either the training set or the test set. To better monitor the model’s performance during training, a portion of the original test set was further partitioned to form a separate validation set. This resulted in 39,371 training samples, 8457 validation samples, and 8427 test samples, with category-specific distributions detailed in Table 1.
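The contamination-free protocol can be sketched as follows; this illustrates the principle of recording-level assignment rather than reproducing the exact split of [1] (the ratios shown are hypothetical):

```python
import random
from collections import defaultdict

def split_by_recording(segments, train=0.7, val=0.15, seed=0):
    """segments: list of (recording_id, segment) pairs. All 3-s segments from
    one recording go to exactly one of the train/validation/test splits."""
    by_rec = defaultdict(list)
    for rec_id, seg in segments:
        by_rec[rec_id].append(seg)
    rec_ids = sorted(by_rec)
    random.Random(seed).shuffle(rec_ids)
    cut1 = int(len(rec_ids) * train)
    cut2 = int(len(rec_ids) * (train + val))
    pick = lambda ids: [s for r in ids for s in by_rec[r]]
    return pick(rec_ids[:cut1]), pick(rec_ids[cut1:cut2]), pick(rec_ids[cut2:])
```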

3.2. Parameters Setup

Table 2 summarizes the parameter settings for Mel spectrogram extraction across different frequency ranges. To enhance the resolution of the full-band Mel spectrogram, 513 filter banks are employed, yielding a spectrogram of dimension [513, 94]. The integrated 3D Mel spectrum (Mel-3D) therefore has dimensions of [3, 513, 94]. The two sub-band Mel spectrograms have dimensions of [80, 94] and [20, 94], respectively. This multi-scale design provides comprehensive spectral representations for model training.
The stochastic gradient descent (SGD) optimizer is employed for model training, with a learning rate of 0.03 and weight decay set to 0.001. To enable adaptive learning rate adjustment, a StepLR scheduler is applied with a step size of 1 and a decay factor (gamma) of 0.9; that is, the learning rate decays once per epoch. A batch size of 24 is adopted during training to balance GPU and CPU performance. For data augmentation, the mixup technique is introduced after the 15th training epoch with an α value of 0.2 and a probability (p) of 0.5, where α is the hyperparameter of the Beta distribution controlling the mix ratio, and p denotes the likelihood that a batch undergoes mixup. This strategy contributes to enhancing the model’s generalization performance.
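A minimal sketch of this training configuration, with a placeholder model; the mixup logic follows the standard formulation of mixing both inputs and loss terms:

```python
import numpy as np
import torch
import torch.nn as nn

model = nn.Linear(10, 4)  # placeholder standing in for LW-MS-LFTFNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, weight_decay=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

def mixup(x, y, alpha=0.2, p=0.5):
    if np.random.rand() > p:                   # batch skips mixup with prob 1-p
        return x, y, y, 1.0
    lam = float(np.random.beta(alpha, alpha))  # mix ratio ~ Beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam

# In the training loop (applied after the 15th epoch):
#   x, y_a, y_b, lam = mixup(inputs, targets)
#   loss = lam * criterion(model(x), y_a) + (1 - lam) * criterion(model(x), y_b)
# with scheduler.step() called once per epoch.
```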

3.3. Evaluation Metric

All methods used in this study were evaluated using four standard performance metrics: Accuracy, Precision, Recall, and F1-score. These metrics provide a comprehensive assessment of classification performance from multiple perspectives and follow standard definitions commonly used in machine learning tasks. Additionally, to demonstrate the complexity of all models, three metrics were used: the number of parameters (M), computational cost (GMACs), and model size (MB), enabling a direct comparison of their relative complexity.
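A minimal sketch of these metrics with scikit-learn; macro averaging over the four classes is an assumption, as the averaging scheme is not stated here, and GMACs are typically obtained with a separate profiling tool:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def classification_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"Accuracy": acc, "Precision": prec, "Recall": rec, "F1": f1}

def complexity(model):
    # Parameter count in millions and storage in MB, assuming float32 weights.
    n = sum(p.numel() for p in model.parameters())
    return {"Params (M)": n / 1e6, "Size (MB)": n * 4 / 2**20}
```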

4. Experimental Results and Analysis

The experiments are organized into three main components. First, a comprehensive performance analysis of the proposed LW-MS-LFTFNet was conducted to evaluate both its effectiveness and efficiency. Second, ablation studies were conducted to assess the effectiveness of using the LSTM modules to extract temporal features from low-frequency bands. Finally, comparative experiments were performed against a variety of models. All experiments were conducted on a cloud server using Torch 2.5 with Python 3.12. The server is equipped with an NVIDIA GeForce RTX 3080 Ti GPU and an Intel(R) Xeon(R) Gold 6430 CPU, ensuring efficient execution of the training process. Notably, strict data partitioning was used to prevent data contamination. The experimental results are discussed in this section.

4.1. The Result of LW-MS-LFTFNet

Figure 10 presents the confusion matrix of the LW-MS-LFTFNet. To facilitate quantitative analysis, Table 3 further provides key performance metrics for different vessel categories. The results demonstrate that the LW-MS-LFTFNet performs well across various types of vessels. The tug ship achieves the highest recognition precision, followed by the passenger ship, while tankers perform relatively poorly, and cargo ships show the lowest recognition accuracy among all vessel types. The observed performance differences can be attributed to the distinct energy distribution of their Mel spectrograms. As shown in Figure 1, the Mel spectrogram of the tug ship shows relatively strong energy concentrated in the low-frequency band, with a clear combination of strong line and continuous spectra. Unique spectral features, such as concentrated energy around 512 Hz, likely enable the model to more effectively identify tug ships compared to other vessel types. The passenger ship, by contrast, has the lowest overall spectral energy, primarily concentrated around 1000 Hz, with weaker energy at other frequencies, which likely makes this vessel category easier to classify. On the other hand, cargo ships and tankers exhibit a higher incidence of misclassification between categories, with 444 instances of cargo ships being misclassified as tankers. This can be attributed to the similarities in spectral energy distribution of cargo ships and tankers, which may be due to their comparable propulsion systems and speed requirements [1,26]. The recognition results are consistent with the spectral features extracted from the time-frequency analysis, highlighting the importance of analyzing the time-frequency distribution of acoustic signals before model design.
To further evaluate the classification capability of LW-MS-LFTFNet, the output features from its final fully connected layer were projected into two dimensions using t-SNE (t-distributed Stochastic Neighbor Embedding), as shown in Figure 11. The visualization, generated from the test dataset with distinct colors representing the four vessel types, shows that feature clusters exhibit clear separation, although a noticeable degree of overlap persists. Specifically, tug and passenger ship samples form relatively compact and well-separated clusters, whereas cargo and tanker ship features display substantial overlap, reflecting higher similarity in their representations. These patterns correspond with the quantitative results, in which tug and passenger ships achieve higher recognition accuracy, while tankers and cargo ships are more frequently misclassified. Overall, the t-SNE visualization provides qualitative evidence that LW-MS-LFTFNet effectively learns discriminative feature structures while highlighting the similarities among certain vessel categories.
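A minimal sketch of this visualization with scikit-learn, using placeholder arrays in place of the network’s final-layer features and the test labels:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.randn(500, 4)      # placeholder final-layer features [N, D]
labels = np.random.randint(0, 4, 500)   # placeholder class ids

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
for cls, name in enumerate(["cargo", "passenger", "tanker", "tug"]):
    mask = labels == cls
    plt.scatter(emb[mask, 0], emb[mask, 1], s=5, label=name)
plt.legend()
plt.show()
```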
Building on the previous analysis, it is crucial to emphasize the lightweight nature of the LW-MS-LFTFNet. As illustrated in Table 4, the proposed model strikes an optimal balance between recognition accuracy, parameter count, and computational complexity. With only 0.85 M parameters and a size of 3.27 MB, it achieves an accuracy of 75.04%, while maintaining a relatively low computational cost of 0.38 GMACs. This efficient design not only ensures the model’s high performance but also makes it particularly suitable for deployment on resource-constrained edge devices, such as autonomous underwater vehicles or underwater monitoring systems. Moreover, its compact size and low resource requirements facilitate efficient processing, which is vital for real-time underwater acoustic target recognition systems. The model’s scalability enables seamless adaptation to future applications with more advanced recognition needs, providing a solid foundation for potential deployment in a variety of real-world scenarios.

4.2. Ablation Experiments

The ablation experiments aim to demonstrate the effectiveness of the LSTM-based modules for extracting temporal features from low-frequency bands. Specifically, we conducted recognition experiments on the two LSTM branches separately, with one processing the 1–100 Hz Mel spectrogram (module “A”) and the other handling the 100–1000 Hz Mel spectrogram (module “B”), with the lightweight backbone as the baseline. Performance was assessed in terms of accuracy, parameter count, computational complexity, and model size. All metrics are reported relative to the baseline, with corresponding values provided in parentheses in the table. To ensure fairness, all configurations shared identical hyperparameters, and results were averaged over multiple runs with mixup augmentation applied during training.
As depicted in Table 4, the baseline achieves an accuracy of 72.40%. Incorporating module A increases the accuracy to 73.38%, while module B alone raises it to 73.01%. When both modules are combined in the model, the accuracy reaches 75.04%, representing a substantial improvement of 2.64% relative to the baseline. These results indicate that each module individually contributes positively to model performance by capturing temporal features of different low-frequency bands. The combined use of both modules enables the model to leverage complementary information across the low-frequency spectrum, thereby enhancing the richness of temporal feature representations. This offers a more comprehensive characterization of vessel-specific acoustic signatures, improving the model’s ability to discriminate between spectrally similar ship types and yielding a performance gain that surpasses the sum of the improvements achieved by the individual modules.
In addition to performance gains, the computational and memory overhead introduced by the LSTM modules is also evaluated. Incorporating module A yields a 0.98% accuracy improvement, with increases of 0.27 M parameters, 0.02 GMACs, and 1.04 MB in model size. Module B provides a 0.61% accuracy gain, accompanied by increases of 0.31 M parameters, 0.02 GMACs, and 1.16 MB. When both modules are integrated, accuracy improves by 2.64%, with parameters, computational cost, and model size increasing by 0.58 M, 0.04 GMACs, and 2.21 MB, respectively. These results demonstrate that each module individually improves performance with only a slight increase in computational cost, and their combination yields a substantially greater accuracy gain. Although incorporating both modules increases the parameter count from 0.27 M to 0.85 M, roughly triple that of the original backbone, the total remains below 1 M parameters, which is still considerably small for underwater signal recognition tasks. Consequently, the proposed model achieves notable performance improvements at a low resource cost, highlighting its lightweight design and suitability for deployment in resource-constrained environments.

4.3. Comparison Experiments

In this section, we present a detailed comparison of the proposed model with a range of lightweight and standard networks to evaluate its performance. The lightweight models include MobileNetV1 [36], MobileNetV2 [39], ShuffleNetV2 [40], LW-SEResNet10 [33], MA-CNN-A [23], and CFTANet [41], all of which are specifically designed to balance accuracy and efficiency in resource-constrained scenarios. Additionally, ResNet18 [42] and MACRN [14] were included as strong baselines, despite not being lightweight models. For implementation consistency and reliability, MobileNetV1, MobileNetV2, ShuffleNetV2, and ResNet18 were adopted from the official torchvision library to ensure standardized baseline implementations. MA-CNN-A was implemented directly using its publicly available source code, while the remaining models were reproduced according to the architectural details described in their respective publications, as their official implementations were not publicly released. To ensure fairness, all models were modified only in their classification layers to adapt to underwater acoustic signal recognition, and each experiment was repeated five times independently, with the average results reported.
Table 5 presents a comprehensive comparison of recognition accuracy, computational complexity, parameter count, and model size across all methods. The LW-MS-LFTFNet achieves the highest accuracy of 75.04%, outperforming both lightweight models and heavyweight competitors while maintaining compact size and low computational cost, and even its CNN backbone alone demonstrates strong performance relative to mainstream lightweight architectures. Compared with extremely lightweight models such as MobileNetV1 (0.5) and MobileNetV2 (0.5), which achieve only 66.82% and 65.50% accuracy with very low computational costs (0.16 and 0.11 GMACs), the proposed model delivers nearly 8% higher accuracy at a moderate cost of 0.38 GMACs. In terms of parameters, the proposed model maintains a compact size of 0.85 M, slightly larger than MobileNetV1 (0.5) (0.82 M) and MobileNetV2 (0.5) (0.69 M), yet substantially smaller than LW-SEResNet10 (4.91 M) and ResNet18 (11.18 M), while simultaneously achieving significantly higher accuracy. Regarding storage, the proposed LW-MS-LFTFNet is only 3.27 MB, much smaller than LW-SEResNet10 (18.70 MB) and MobileNetV1 (1.0) (12.30 MB), yet it surpasses them in accuracy by 7.09% and 10.32%, respectively. These results confirm the superiority of LW-MS-LFTFNet across all key metrics, demonstrating that it achieves an optimal balance between accuracy and efficiency while maintaining a lightweight architecture.
To further illustrate the trade-off between accuracy and efficiency, Figure 12 and Figure 13 plot recognition accuracy against computational complexity and parameter count, respectively, using different markers to represent different models. The proposed model occupies a well-balanced position in both plots, achieving high accuracy while efficiently utilizing parameters and computational resources. From the overall distribution, it is evident that most models do not follow a strict linear relationship in which increases in computational complexity or parameter count necessarily yield higher accuracy. However, within the lower computational complexity range of 0.04 to 0.40 GMACs, a roughly linear trend can be observed, where modest increases in computation costs often lead to gains in accuracy. For instance, MobileNetV2 and ShuffleNetV2 variants, as lightweight models, generally exhibit improved accuracy as computational cost increases within this range. By contrast, models with higher computational complexity, such as ResNet18 and MACRN, fail to achieve top-level accuracy. These observations underscore the importance of designing network architectures specifically adapted to the target task and highlight the value of incorporating domain knowledge to guide model design, enabling both high accuracy and efficiency under constrained computational resources.
The above analysis demonstrates the superiority of LW-MS-LFTFNet over other lightweight models, which may be attributed to differences in network structures or inductive biases. For example, MobileNetV1, MobileNetV2, ShuffleNetV2, and LW-SEResNet10 predominantly use 3 × 3 convolutions, limiting receptive fields and hindering the extraction of uneven spectrogram features. CFTANet stacks multiple attention modules that may lead to feature redundancy and overfitting under limited data. MA-CNN-A employs excessively large multi-scale kernels of sizes 8, 16, 32, and 64, which may capture irrelevant noise and overlook short-term details. In contrast, LW-MS-LFTFNet uses appropriately sized 3, 5, and 7 convolutional kernels to extract hierarchical features from spectrograms with uneven energy distributions, and incorporates LSTM modules to capture temporal dependencies that are typically overlooked by other lightweight models. Additionally, the CBAM module effectively suppresses noise, enhances salient time-frequency features, and strengthens the model’s focus on informative regions, thereby improving recognition performance. Overall, these results suggest that a carefully balanced combination of multi-scale convolution, temporal modeling, and attention mechanisms is crucial for achieving high recognition accuracy in lightweight models for underwater acoustic signal recognition.

4.4. Saliency Visualization and Interpretation

To further interpret the decision-making behavior of LW-MS-LFTFNet, gradient-based saliency visualization was performed on the input features of four representative ship-radiated noise samples, corresponding to those shown in Figure 1 and Figure 2. The resulting saliency maps are presented in Figure 14. In these maps, brighter regions correspond to stronger attention responses, indicating that the model gives greater attention to these time-frequency regions. From the maps, it is evident that LW-MS-LFTFNet consistently attends to the low-frequency range (0–1000 Hz), while responses above 1000 Hz are comparatively weak. This observation demonstrates that the low-frequency components carry the most discriminative information for ship-radiated noise recognition. Such behavior is consistent with the energy distribution characteristics of ship-radiated noise, in which energy is concentrated in the low-frequency range yet remains unevenly distributed within this band. Furthermore, distinct saliency patterns across ship types suggest that the model adaptively emphasizes different low-frequency features according to each ship’s acoustic signature. Overall, these findings underscore the importance of low-frequency features in ship-radiated noise recognition and demonstrate the model’s ability to effectively leverage them.
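A minimal gradient-based saliency sketch for a single-input classifier; the actual LW-MS-LFTFNet takes three inputs (Mel-3D plus two sub-band spectrograms), but the principle is the same, and the placeholder model is illustrative only:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 513 * 94, 4))  # placeholder
mel_3d = torch.randn(1, 3, 513, 94)                              # placeholder input

model.eval()
x = mel_3d.clone().requires_grad_(True)
score = model(x).max(dim=1).values.sum()   # score of the predicted class
score.backward()                           # gradients w.r.t. the input
saliency = x.grad.abs().amax(dim=1)[0]     # [513, 94]: max over input channels
```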

5. Conclusions

This paper proposed LW-MS-LFTFNet, a lightweight network for ship-radiated noise recognition, which was designed under the guidance of domain-specific priors derived from time-frequency pattern analysis. By combining multi-scale depthwise separable convolutions with LSTM-based low-frequency temporal feature extraction, the model effectively captures both spectral structures and temporal dependencies of underwater signals. Experiments on the DeepShip dataset demonstrated that LW-MS-LFTFNet achieves 75.04% accuracy with only 0.85 M parameters, 0.38 GMACs, and 3.27 MB of storage, outperforming mainstream lightweight networks. The results highlight the importance of incorporating domain-specific priors to guide lightweight model design, enabling compact architectures to achieve strong performance while remaining computationally efficient. Moreover, LW-MS-LFTFNet shows promising potential for deployment on resource-constrained edge platforms, such as autonomous underwater vehicles. Nevertheless, this study has several limitations. The current evaluation is limited to a single dataset collected under relatively stable conditions, and the proposed model has not yet been deployed or validated on real embedded hardware. Moreover, the extracted sub-band features are derived from conventional Mel-spectrogram representations, which may not fully exploit the discriminative characteristics of ship-radiated noise. Future work will focus on extending evaluations to more complex and dynamic underwater environments, performing real-world deployment tests on embedded platforms, and exploring improved model compression strategies to further enhance efficiency. In addition, more discriminative feature representations and effective denoising techniques will be investigated to improve robustness under challenging acoustic conditions.

Author Contributions

Conceptualization, Y.F. and Z.C.; methodology, Y.F., Z.C. and Y.C.; software, Y.F. and J.H.; validation, Y.F., Y.C. and J.H.; formal analysis, Y.F., Z.C. and Y.C.; investigation, Y.F. and Z.C.; resources, Y.F., T.G. and Y.C.; data curation, Y.F. and J.H.; writing-original draft preparation, Y.F. and Z.X.; writing—review and editing, Y.F., K.C., T.G., Y.C., Z.X., J.L. and H.D.; visualization, Y.F., Z.X. and J.H.; supervision, T.G. and K.C.; project administration, T.G. and K.C.; funding acquisition, T.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The DeepShip dataset was obtained from https://github.com/irfankamboh/DeepShip (accessed on 15 September 2024), and the reference data split for the dataset was adopted from https://github.com/ZhuPengsen/Method-for-Splitting-the-DeepShip-Dataset (accessed on 20 September 2024). The publicly available implementation of MA-CNN-A can be accessed at https://github.com/FlyingWhale23/MA-CNN-A (accessed on 10 August 2025).

Conflicts of Interest

Yu Feng (Intern), Zhangxin Chen, Yixuan Chen and Tao Guo were employed by Wuhan Lingjiu Microelectronics Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SRN: Ship-Radiated Noise
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
CBAM: Convolutional Block Attention Module
DSC: Depthwise Separable Convolution
MLP: Multi-Layer Perceptron
LSTM: Long Short-Term Memory
LW-MS: Lightweight Multi-Scale
LFTF: Low-Frequency Temporal Features
t-SNE: t-distributed Stochastic Neighbor Embedding

References

1. Zhu, P.; Zhang, Y.; Huang, Y.; Zhao, C.; Zhao, K.; Zhou, F. Underwater acoustic target recognition based on spectrum component analysis of ship radiated noise. Appl. Acoust. 2023, 211, 109552.
2. Ren, J.; Huang, Z.; Li, C.; Guo, X.; Xu, J. Feature Analysis of Passive Underwater Targets Recognition Based on Deep Neural Network. In Proceedings of the OCEANS 2019, Marseille, France, 17–20 June 2019; pp. 1–5.
3. Shen, S.; Yang, H.; Yao, X.; Li, J.; Xu, G.; Sheng, M. Ship Type Classification by Convolutional Neural Networks with Auditory-like Mechanisms. Sensors 2020, 20, 253.
4. Irfan, M.; Jiangbin, Z.; Ali, S.; Iqbal, M.; Masood, Z.; Hamid, U. DeepShip: An underwater acoustic benchmark dataset and a separable convolution based autoencoder for classification. Expert Syst. Appl. 2021, 183, 115270.
5. Li, S.; Yang, S.; Liang, J. Recognition of ships based on vector sensor and bidirectional long short-term memory networks. Appl. Acoust. 2020, 164, 107248.
6. Zhurba, N.; Siek, Y.; Khutornaia, E. Onboard computing environment of autonomous unmanned underwater vehicles: Possible design technologies and their comparative analysis. Vibroeng. Procedia 2021, 38, 62–67.
7. Hou, X.; Wang, J.; Bai, T.; Deng, Y.; Ren, Y.; Hanzo, L. Environment-Aware AUV Trajectory Design and Resource Management for Multi-Tier Underwater Computing. IEEE J. Sel. Areas Commun. 2023, 41, 474–490.
8. Aslam, M.A.; Zhang, L.; Liu, X.; Irfan, M.; Xu, Y.; Li, N.; Zhang, P.; Zheng, J.; Li, Y. Underwater sound classification using learning based methods: A review. Expert Syst. Appl. 2024, 255, 124498.
9. Wang, S.; Zeng, X. Robust underwater noise targets classification using auditory inspired time–frequency analysis. Appl. Acoust. 2014, 78, 68–76.
10. Song, G.; Guo, X.; Wang, W.; Li, J.; Yang, H.; Ma, L. Underwater Noise Classification based on Support Vector Machine. In Proceedings of the 2021 OES China Ocean Acoustics (COA), Harbin, China, 14–17 July 2021; pp. 410–414.
11. Ke, X.; Yuan, F.; Cheng, E. Integrated optimization of underwater acoustic ship-radiated noise recognition based on two-dimensional feature fusion. Appl. Acoust. 2020, 159, 107057.
12. Qiao, W.; Khishe, M.; Ravakhah, S. Underwater targets classification using local wavelet acoustic pattern and Multi-Layer Perceptron neural network optimized by modified Whale Optimization Algorithm. Ocean Eng. 2021, 219, 108415.
13. Qi, P.; Sun, J.; Long, Y.; Zhang, L.; Tianye. Underwater Acoustic Target Recognition with Fusion Feature. In Neural Information Processing, Proceedings of the 28th International Conference, ICONIP 2021, Sanur, Bali, Indonesia, 8–12 December 2021; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; pp. 609–620.
14. Hu, F.; Fan, J.; Kong, Y.; Zhang, L.; Guan, X.; Yu, Y. A Deep Learning Method for Ship-Radiated Noise Recognition Based on MFCC Feature. In Proceedings of the 7th International Conference on Transportation Information and Safety (ICTIS), Xi’an, China, 4–6 August 2023; pp. 1328–1335.
15. Polson, N.G.; Sokolov, V.O. Deep Learning. arXiv 2018, arXiv:1807.07987.
16. IEEE. IEEE Transactions on Audio, Speech, and Language Processing publication information. IEEE Trans. Audio Speech Lang. Process. 2006, 14, c2.
17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
18. Doan, V.S.; Huynh-The, T.; Kim, D.S. Underwater Acoustic Target Classification Based on Dense Convolutional Neural Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1500905.
19. Hu, G.; Wang, K.; Peng, Y.; Qiu, M.; Shi, J.; Liu, L. Deep Learning Methods for Underwater Target Feature Extraction and Recognition. Comput. Intell. Neurosci. 2018, 2018, 1214301.
20. Tian, S.; Chen, D.; Wang, H.; Liu, J. Deep convolution stack for waveform in underwater acoustic target recognition. Sci. Rep. 2021, 11, 9614.
21. Yang, H.; Li, J.; Shen, S.; Xu, G. A Deep Convolutional Neural Network Inspired by Auditory Perception for Underwater Acoustic Target Recognition. Sensors 2019, 19, 1104.
22. Han, X.C.; Ren, C.; Wang, L.; Bai, Y. Underwater acoustic target recognition method based on a joint neural network. PLoS ONE 2022, 17, e0266425.
23. Yan, C.; Yan, S.; Yao, T.; Yu, Y.; Pan, G.; Liu, L.; Wang, M.; Bai, J. A Lightweight Network Based on Multi-Scale Asymmetric Convolutional Neural Networks with Attention Mechanism for Ship-Radiated Noise Classification. J. Mar. Sci. Eng. 2024, 12, 130.
24. Gao, R.; Liang, M.; Dong, H.; Luo, X.; Suganthan, P.N. Underwater Acoustic Signal Denoising Algorithms: A Survey of the State-of-the-art. arXiv 2024, arXiv:2407.13264.
25. Wang, B.; Zhang, W.; Zhu, Y.; Wu, C.; Zhang, S. An Underwater Acoustic Target Recognition Method Based on AMNet. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5501105.
26. Lin, B.; Gao, L.; Zhu, P.; Zhang, Y.; Huang, Y. An Underwater Acoustic Target Recognition Method Based on Iterative Short-Time Fourier Transform. IEEE Sens. J. 2024, 24, 26199–26210.
27. Zhang, Q.; Da, L.; Zhang, Y.; Hu, Y. Integrated neural networks based on feature fusion for underwater target recognition. Appl. Acoust. 2021, 182, 108261.
28. Zhang, W.; Lin, B.; Yan, Y.; Zhou, A.; Ye, Y.; Zhu, X. Multi-Features Fusion for Underwater Acoustic Target Recognition based on Convolution Recurrent Neural Networks. In Proceedings of the 2022 8th International Conference on Big Data and Information Analytics, Guiyang, China, 24–25 August 2022.
29. Xu, J.; Li, X.; Zhang, D.; Chen, Y.; Peng, Y.; Liu, W. Enhanced underwater acoustic target recognition using parallel dual-branch network with attention mechanism. Eng. Appl. Artif. Intell. 2025, 158, 111603.
30. Li, P.; Wu, J.; Wang, Y.; Lan, Q.; Xiao, W. STM: Spectrogram Transformer Model for Underwater Acoustic Target Recognition. J. Mar. Sci. Eng. 2022, 10, 1428.
31. Chen, L.; Luo, X.; Zhou, H. A ship-radiated noise classification method based on domain knowledge embedding and attention mechanism. Eng. Appl. Artif. Intell. 2024, 127, 107320.
32. Lei, Z.; Lei, X.; Wang, N.; Zhang, Q. Present status and challenges of underwater acoustic target recognition technology: A review. Front. Phys. 2022, 10, 1044890.
33. Yang, S.; Xue, L.; Hong, X.; Zeng, X. A Lightweight Network Model Based on an Attention Mechanism for Ship-Radiated Noise Classification. J. Mar. Sci. Eng. 2023, 11, 432.
34. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
35. Sifre, L.; Mallat, S. Rigid-Motion Scattering for Texture Classification. arXiv 2014, arXiv:1403.1687.
36. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
37. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
38. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
40. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
41. Yang, S.; Jin, A.; Zeng, X.; Wang, H.; Hong, X.; Lei, M. Underwater acoustic target recognition based on sub-band concatenated Mel spectrogram and multidomain attention mechanism. Eng. Appl. Artif. Intell. 2024, 133, 107983.
42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
Figure 1. Mel spectrograms of (a) cargo; (b) passenger ship; (c) tanker; and (d) tug.
Figure 2. Mel spectrograms in the 1–100 Hz and 100–1000 Hz bands for (a) cargo; (b) passenger ship; (c) tanker; and (d) tug radiated noises.
Figure 3. The feature extraction process of Mel-3D.
Figure 4. The framework of the proposed LW-MS-LFTFNet.
Figure 5. Multi-scale convolution backbone based on depthwise separable convolution.
Figure 7. Structure of CBAM.
Figure 8. LSTM structure block diagram.
Figure 9. Flowchart of a two-layer LSTM network for extracting low-frequency temporal features from a Mel spectrogram.
Figure 10. The confusion matrix of the LW-MS-LFTFNet.
Figure 11. The t-SNE visualization of the output features from the final fully connected layer of LW-MS-LFTFNet.
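For readers who wish to reproduce a projection like Figure 11, the sketch below uses scikit-learn's t-SNE on placeholder feature vectors; in practice, the activations of the final fully connected layer and the corresponding class labels would be collected over the test set first. The feature dimension, perplexity, and variable names here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins: replace with final-FC activations (N, D) and labels (N,) from the test set.
feats = rng.normal(size=(800, 128))
labels = rng.integers(0, 4, size=800)

# Non-linear projection of the embeddings to 2-D for visualization.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)

for cls, name in enumerate(["Cargo", "Passenger ship", "Tanker", "Tug"]):
    m = labels == cls
    plt.scatter(emb[m, 0], emb[m, 1], s=5, label=name)
plt.legend()
plt.title("t-SNE of final-FC features (schematic)")
plt.show()
```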
Figure 12. Comparison of accuracy and computational cost for different methods.
Figure 13. Comparison of accuracy and number of parameters for different methods.
Figure 14. Saliency visualization of four ship-radiated noise types using the proposed LW-MS-LFTFNet. Brighter regions correspond to stronger attention responses, indicating that the model pays greater attention to these time-frequency regions.
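The excerpt does not specify which saliency technique produced Figure 14; one common choice is the vanilla input-gradient map, sketched below for a generic classifier. The toy model, input shape, and function names are placeholders for illustration only.

```python
import torch
import torch.nn as nn

def saliency_map(model, x, target_class):
    """Vanilla gradient saliency: |d class-score / d input| per time-frequency bin."""
    model.eval()
    x = x.clone().requires_grad_(True)   # track gradients w.r.t. the spectrogram
    model(x)[0, target_class].backward() # backprop the target class score
    return x.grad.abs()[0, 0]            # (freq, time) importance map

# Toy stand-in classifier over a (1, 1, 513, 94) Mel spectrogram
model = nn.Sequential(nn.Flatten(), nn.Linear(513 * 94, 4))
sal = saliency_map(model, torch.randn(1, 1, 513, 94), target_class=0)
print(sal.shape)  # torch.Size([513, 94])
```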
Table 1. Ship type data distribution in the DeepShip dataset.

| Ship Type | No. of Ships | Total Time | Total Recordings | Train Size | Validation Size | Test Size |
|---|---|---|---|---|---|---|
| Cargo | 69 | 10 h 40 min | 110 | 9185 | 1789 | 1782 |
| Passenger ship | 46 | 12 h 22 min | 193 | 10,555 | 2401 | 2393 |
| Tanker | 133 | 12 h 45 min | 240 | 10,827 | 1932 | 1925 |
| Tug | 17 | 11 h 17 min | 70 | 8804 | 2335 | 2327 |
Table 2. Parameter settings for Mel spectrogram extraction in different frequency bands.

| Frequency Range (Hz) | Pre-Emphasis Coefficient | Number of Filter Banks | Hop Length | N-FFT | Dimension |
|---|---|---|---|---|---|
| 1–8000 | 0.97 | 513 | 512 | 4096 | 513 × 94 |
| 100–1000 | 0.00 | 80 | 512 | 2048 | 80 × 94 |
| 1–100 | 0.00 | 20 | 512 | 2048 | 20 × 94 |
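As a reproducibility aid, the band-specific Mel spectrograms of Table 2 can be computed with librosa. The sketch below assumes a 16 kHz sample rate and 3 s segments (neither is stated in this excerpt; with hop length 512 these choices yield the 94-frame dimension in the table), and the file name is a placeholder.

```python
import librosa

def mel_band(y, sr, fmin, fmax, n_mels, n_fft, hop, preemph):
    """Log-Mel spectrogram for one frequency band, following Table 2."""
    if preemph > 0:
        y = librosa.effects.preemphasis(y, coef=preemph)  # boost high frequencies
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop,
        n_mels=n_mels, fmin=fmin, fmax=fmax,
    )
    return librosa.power_to_db(S)  # shape (n_mels, n_frames)

# sr=16000 and 3 s clips are assumptions, not values given in the table
y, sr = librosa.load("segment.wav", sr=16000, duration=3.0)

full = mel_band(y, sr, fmin=1,   fmax=8000, n_mels=513, n_fft=4096, hop=512, preemph=0.97)  # 513 x 94
mid  = mel_band(y, sr, fmin=100, fmax=1000, n_mels=80,  n_fft=2048, hop=512, preemph=0.0)   # 80 x 94
low  = mel_band(y, sr, fmin=1,   fmax=100,  n_mels=20,  n_fft=2048, hop=512, preemph=0.0)   # 20 x 94
```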
Table 3. The results of LW-MS-LFTFNet on the DeepShip dataset.

| Class | Precision (%) | Recall (%) | F1-Score (%) | Support |
|---|---|---|---|---|
| Cargo | 67.57 | 64.31 | 65.90 | 1782 |
| Passenger ship | 76.53 | 77.68 | 77.10 | 2393 |
| Tanker | 71.03 | 79.74 | 75.13 | 1925 |
| Tug | 83.33 | 76.67 | 79.86 | 2327 |
| Macro average | 74.62 | 74.60 | 74.50 | 8427 |
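The per-class and macro-averaged scores in Table 3 follow standard definitions and can be reproduced with scikit-learn. In the sketch below, the random label arrays are stand-ins; in practice, y_true and y_pred would be the ground-truth and predicted classes for the 8427 test clips.

```python
import numpy as np
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
# Stand-in labels; replace with the actual test-set targets and model predictions.
y_true = rng.integers(0, 4, size=1000)
y_pred = rng.integers(0, 4, size=1000)

# The "macro avg" row averages precision/recall/F1 over classes with equal weight,
# matching the macro average reported in Table 3.
print(classification_report(
    y_true, y_pred,
    target_names=["Cargo", "Passenger ship", "Tanker", "Tug"],
    digits=2,
))
```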
Table 4. Results of ablation experiments for the LSTM-based low-frequency temporal feature extraction modules. Module “A” denotes the LSTM branch processing the 1–100 Hz Mel spectrogram, and “B” denotes the branch handling the 100–1000 Hz band.

| Model Configuration | Accuracy (%) | No. Params (M) | MACs (G) | Model Size (MB) |
|---|---|---|---|---|
| Lightweight backbone (baseline) | 72.40 | 0.27 | 0.34 | 1.06 |
| Backbone + A (1–100 Hz) | 73.38 (+0.98) | 0.54 (+0.27) | 0.36 (+0.02) | 2.10 (+1.04) |
| Backbone + B (100–1000 Hz) | 73.01 (+0.61) | 0.58 (+0.31) | 0.36 (+0.02) | 2.22 (+1.16) |
| Backbone + A + B (proposed) | 75.04 (+2.64) | 0.85 (+0.58) | 0.38 (+0.04) | 3.27 (+2.21) |
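The exact layer dimensions of the fused model are not given in this excerpt; the PyTorch sketch below only illustrates the structure ablated in Table 4: a convolutional backbone on the full-band input plus two two-layer LSTM branches (A over the 20-bin 1–100 Hz spectrogram, B over the 80-bin 100–1000 Hz spectrogram) whose final hidden states are concatenated with the backbone embedding before classification. All layer sizes and the stand-in backbone are placeholders, not the authors' configuration.

```python
import torch
import torch.nn as nn

class LFTFSketch(nn.Module):
    """Schematic of backbone + low-frequency LSTM branches (dims are placeholders)."""
    def __init__(self, n_classes=4, hidden=64):
        super().__init__()
        # Stand-in for the multi-scale depthwise-separable backbone (full band, 513 x 94).
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Branch A: 1-100 Hz band (20 Mel bins); Branch B: 100-1000 Hz band (80 bins).
        self.lstm_a = nn.LSTM(input_size=20, hidden_size=hidden, num_layers=2, batch_first=True)
        self.lstm_b = nn.LSTM(input_size=80, hidden_size=hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(32 + 2 * hidden, n_classes)

    def forward(self, full, low, mid):
        # full: (B, 1, 513, 94); low: (B, 94, 20); mid: (B, 94, 80) as frame sequences
        z = self.backbone(full)
        _, (ha, _) = self.lstm_a(low)   # ha: (num_layers, B, hidden)
        _, (hb, _) = self.lstm_b(mid)
        # Fuse spectral and low-frequency temporal features, then classify.
        return self.fc(torch.cat([z, ha[-1], hb[-1]], dim=1))

model = LFTFSketch()
out = model(torch.randn(2, 1, 513, 94), torch.randn(2, 94, 20), torch.randn(2, 94, 80))
print(out.shape)  # torch.Size([2, 4])
```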
Table 5. Comparative experimental results on the DeepShip dataset.

| Model | Accuracy (%) | No. Params (M) | MACs (G) | Model Size (MB) |
|---|---|---|---|---|
| LW-MS-LFTFNet (proposed) | 75.04 | 0.85 | 0.38 | 3.27 |
| Backbone | 72.40 | 0.27 | 0.34 | 1.06 |
| LW-SEResNet10 | 67.95 | 4.91 | 0.90 | 18.70 |
| CFTANet | 66.32 | 0.54 | 0.25 | 2.08 |
| MobileNetV1 (0.5) | 66.82 | 0.82 | 0.16 | 3.23 |
| MobileNetV1 (0.75) | 63.94 | 1.82 | 0.34 | 7.05 |
| MobileNetV1 (1.0) | 64.72 | 3.21 | 0.59 | 12.30 |
| ShuffleNetV2 (0.5) | 64.06 | 0.35 | 0.04 | 1.46 |
| ShuffleNetV2 (1.0) | 66.73 | 1.26 | 0.15 | 4.97 |
| MobileNetV2 (0.5) | 65.50 | 0.69 | 0.11 | 2.82 |
| MobileNetV2 (0.75) | 67.30 | 1.36 | 0.23 | 5.40 |
| MobileNetV2 (1.0) | 69.79 | 2.23 | 0.33 | 8.74 |
| MA-CNN-A | 60.23 | 0.93 | 0.63 | 3.61 |
| ResNet18 | 67.63 | 11.18 | 1.83 | 42.72 |
| MACRN | 65.31 | 3.16 | 1.44 | 12.00 |
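For context on the efficiency columns in Tables 4 and 5, parameter counts can be read directly from a model, and MAC counts are commonly estimated with a profiler such as thop (whether the authors used thop is an assumption; the tiny network below is only a stand-in):

```python
import torch
import torch.nn as nn
from thop import profile  # pip install thop; one common MAC/parameter profiler

# Trivial stand-in network profiled at the full-band input resolution (1 x 513 x 94).
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4),
)
x = torch.randn(1, 1, 513, 94)

macs, params = profile(model, inputs=(x,))
print(f"{params / 1e6:.2f} M params, {macs / 1e9:.2f} GMACs")
# For float32 weights, model size in MB is roughly params * 4 bytes / 2**20.
```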