Article

Efficient One-Dimensional Network Design Method for Underwater Acoustic Target Recognition

School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(3), 599; https://doi.org/10.3390/jmse13030599
Submission received: 26 February 2025 / Revised: 13 March 2025 / Accepted: 14 March 2025 / Published: 18 March 2025

Abstract

Many studies have used various time-frequency feature extraction methods to convert ship-radiated noise into three-dimensional (3D) data suitable for computer vision (CV) models, with good results on public datasets. In the process, however, traditional feature engineering (FE) has escalated into interface matching–feature engineering (IM-FE), which demands considerable effort in feature design and requires larger sample durations or higher upper limits of frequency. In this context, this paper proposes a one-dimensional network design method for underwater acoustic target recognition (UATR-ND1D) that, combined only with the fast Fourier transform (FFT), effectively alleviates the IM-FE problem; the combined method is abbreviated as FFT-UATR-ND1D. FFT-UATR-ND1D was applied to the design of a one-dimensional network family named ResNet1D. Experiments were conducted on two mainstream datasets, using ResNet1D in 4320 and 360 tests, respectively. The lightweight model ResNet1D_S, with only 0.17 M parameters and 3.4 M floating-point operations (FLOPs), achieved average accuracies of 97.2% and 95.20%. The larger model, ResNet1D_B, with 2.1 M parameters and 5.0 M FLOPs, reached optimal accuracies of 98.81% and 98.42%, respectively. Existing methods with similar parameter counts performed 3–5% worse than the models proposed in this paper, while methods achieving similar recognition rates require 1 to 2 orders of magnitude more parameters and FLOPs.

1. Introduction

In recent years, countries have paid increasing attention to their marine territories, including defending sovereignty [1,2], protecting the marine environment [3,4,5], and exploiting marine resources [6,7,8]. Achieving these goals requires precise control of both natural and man-made targets within marine areas, making the accurate and efficient recognition of marine targets a critical challenge [9,10,11]. Due to severe light attenuation underwater, most researchers have focused on underwater acoustic target recognition (UATR) using acoustic data [12,13,14]. Current UATR methods are largely driven by machine learning, emphasizing feature engineering and model adaptation.
In the last decade, deep learning (DL) has achieved remarkable results in computer vision (CV) [15,16,17]. Many studies have successfully translated DL techniques to underwater acoustic target recognition (UATR), yielding excellent results on two mainstream datasets: ShipsEar [18] and DeepShip [19]. On ShipsEar, Hong [20] combined LogMel (LM), MFCC, and CCTZ (which includes Chroma, Contrast, Tonnetz, and Zero-cross ratio) to create a three-channel feature set, recognized using ResNet2D18, resulting in an accuracy of 94.3%. Khishe [21] introduced DRW-AE features, which comprise a deep autoencoder wavelet network (DAWN) and a deep recurrent autoencoder (DRA), achieving a recognition rate of 94.49%. Ma [22], using SVM, examined three feature extraction methods—MFCC, HHT, and Demon—with MFCC yielding the best results. Their method, which utilized an attention mechanism embedded in a multi-scale residual network (MR-CNN_A) and balanced the categories using the cosx-function-weighted cross-entropy loss function (CFWCEL), achieved an impressive accuracy of 98.87%. On DeepShip, Li [23] built upon the work of Hong [20] by introducing an attention mechanism to develop the AResNet model. They carefully selected 15.5 h of data, achieving a recognition rate of 99%.
Zhang and Zeng [24] elaborated on the feature extraction method MSLEFC, which includes multi-scale STFT (MS-STFT), a ladder-like encoder (LE) for data enhancement, and Frequency-CAM (FC) for analyzing frequency bands of interest, achieving a recognition rate of 82.9% using ResNet18. Alouani [25] proposed a hybrid model incorporating MFCC-extracted features and a ConvLSTM network, achieving a recognition rate of 97.27%. An end-to-end model utilized a convolution module to replace MFCC in the hybrid model, resulting in an accuracy of 91.07%.
While DL has made significant strides in the UATR field [10,26,27], several challenges remain. Current state-of-the-art DL models predominantly originate from the CV domain, and UATR cannot directly adopt these models. The prevailing approach is to apply various traditional feature extraction methods to convert ship-radiated noise into a format suitable for CV input. This has led to an upgrade from feature engineering (FE) to interface matching–feature engineering (IM-FE)—stacking and combining various traditional feature extraction methods and transforming the data into 2D or 3D formats that align with those commonly used in CV, akin to an “interface” transformation.
Figure 1 briefly illustrates the IM-FE problem. The top and bottom parts represent comparison diagrams of the CV and UATR processes, while the left and right parts illustrate DL and traditional ML methods, respectively. IM-FE is highlighted in red. The figure shows that for the ML method, the steps for CV and UATR are quite similar, primarily comprising two components: FE and a shallow classifier. In contrast, DL in CV completely eliminates the energy-intensive FE step, as DL inherently provides powerful automatic feature extraction for unstructured data [28]. UATR, however, retains and enhances FE by stacking various complex feature extraction methods to transform 1D data into 2D or 3D image formats. One piece of evidence for this is that the CV domain’s channel dimension is typically set to 3 or 1, depending on visual image characteristics; interestingly, most UATR studies adopt similar channel parameters, with few considering alternative configurations. Additionally, to achieve spatial dimensions comparable to the CV domain input data (e.g., 224 × 224), many studies utilize inappropriate time or frequency ranges. Ultimately, mainstream DL networks are merely transformed or combined with different DL modules, lacking customization for UATR-specific needs.
To summarize, IM-FE has at least three drawbacks: 1. It still requires significant effort to design features, as demonstrated in [21,22,24]. 2. It necessitates larger sample duration (SD) or upper limit of frequency (ULF), which does not align with UATR characteristics, as indicated in [17,24,29]. 3. It neglects the powerful learning capabilities of deep learning, often using advanced CV models [17] without customization for underwater acoustic data. This approach is complex, computationally intensive, and not conducive to practical deployments, as noted in [22,24].
To address the shortcomings of IM-FE, this paper proposes a Network Design of One Dimension for Underwater Acoustic Target Recognition (UATR-ND1D), combined with fast Fourier transform (FFT), and referred to as FFT-UATR-ND1D. The main advantages of this method are as follows: 1. it does not require significant effort to design features, using only FFT; 2. it yields superior results across various time and frequency ranges; and 3. it offers a concise one-dimensional network design for UATR, greatly reducing the number of parameters and FLOPs while ensuring state-of-the-art performance, which is more conducive to practical deployment.
In this paper, we use the entry-level network ResNet2D [30,31] as a foundation to construct a series of one-dimensional network families, ResNet1D, using the FFT-UATR-ND1D method. On the two mainstream datasets, ShipsEar and DeepShip, the recognition rate of ResNet1D meets or exceeds the optimal level. Additionally, in terms of the number of parameters and FLOPs, ResNet1D significantly outperforms all other models discussed in the literature.

2. Materials and Methods

2.1. FFT-UATR-ND1D

The FFT is the fundamental tool used in this paper to obtain data in the specified frequency band for recognition. This section introduces the UATR-ND1D method, using ResNet2D as an example.
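As a concrete illustration of this front end, the following minimal sketch (Python with NumPy/SciPy, the libraries listed in Section 3.1) converts a 1D noise segment into the one-sided FFT magnitude truncated at a chosen ULF. The window functions come from Table 1; the scaling choices and the helper name fft_band_input are our assumptions, not the authors' code.

```python
import numpy as np
from scipy.signal import get_window

def fft_band_input(x, fs=32000, ulf=2000, window="hann"):
    """Sketch of the FFT front end: one-sided magnitude spectrum of a
    1D ship-radiated-noise segment, keeping only bins from 0 Hz to ULF."""
    x = np.asarray(x, dtype=np.float32)
    w = get_window(window, len(x))              # e.g. hann, hamming, blackman (Table 1)
    spec = np.abs(np.fft.rfft(x * w))           # one-sided magnitude spectrum
    df = fs / len(x)                            # frequency resolution in Hz
    n_keep = int(round(ulf / df)) + 1           # bins up to and including the ULF
    return spec[:n_keep]

# A 1 s segment at 32 kHz with ULF = 2000 Hz yields a 1 x 2001 input vector,
# matching the ResNet1D_S input size reported for DeepShip in Table 5.
segment = np.random.randn(32000).astype(np.float32)
print(fft_band_input(segment).shape)            # (2001,)
```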
Figure 2a illustrates the generalized structure of ConvNets2D in the computer vision (CV) domain, which typically consists of three sequential parts: the stem, body, and head. The body is composed of several stacked stages. Figure 2b depicts the structure of the head, which is relatively simple and fixed, generally containing one or two linear layers. The head uses features extracted from the body for classification and changes the linear layer parameters based on the number of categories. Figure 2c presents a schematic diagram of the stage structure within the body. The body is the most critical component of ResNet2D, as it handles nearly all computations aimed at learning high-level features from the data. This process increases the channel dimension while reducing spatial resolution [28]. Each stage comprises structurally similar cells, categorized into reduced and normal cells based on whether they downsample the data, primarily adjusting the step size and channel counts. The stem, normal, and reduced cells in ResNet1D and ResNet2D are illustrated in Figure 3a–c and Figure 3d–f, respectively. The bold black font represents the structural differences between a typical 2D network and the 1D network used in this article. For example, the stem in ResNet2D uses a 7 × 7 convolution kernel with a stride of 2. ResNet1D uses a convolution kernel with a length of 256 and a stride of 16.
For the design of ConvNets1D, the ResNet2D design idea was followed. Assuming that the ConvNets2D kernel size is $C_2^i \times K_2 \times K_2 \times C_2^o$ (where $C_2^i$ and $C_2^o$ denote the input and output channels, respectively, and $K_2$ denotes the kernel spatial dimension), the channel magnification is $C_2^{o/i} = C_2^o / C_2^i$, the step size is $S_2$, and the spatial dimension reduction multiplier is defined as $R_2^{i/o}$. For ConvNets1D, the kernel size is $C_1^i \times K_1 \times C_1^o$, the channel magnification is $C_1^{o/i} = C_1^o / C_1^i$, the step size is $S_1$, and the spatial dimension reduction multiplier is defined as $R_1^{i/o}$.
When $R_1^{i/o} = R_2^{i/o}$, we obtain the following:
$$R_2^{i/o} = \frac{S_2^2}{C_2^{o/i}} = \frac{S_1}{C_1^{o/i}} = R_1^{i/o} \tag{1}$$
Generally, the intermediate results of the model are assumed to take the form of feature maps, with the information carried in each channel. To ensure fairness and reduce design complexity, let the output channels be equal, $C_2^o = C_1^o$. Then, we have the following:
$$C_2^i \times S_2^2 = C_1^i \times S_1, \quad \text{if } C_2^o = C_1^o \tag{2}$$
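For completeness, the step from Equation (1) to Equation (2) is only the substitution of the channel-magnification definitions; the short LaTeX fragment below spells it out.

```latex
% Substitute C_2^{o/i} = C_2^o / C_2^i and C_1^{o/i} = C_1^o / C_1^i into Eq. (1):
%   S_2^2 / (C_2^o / C_2^i) = S_1 / (C_1^o / C_1^i)
\frac{S_2^2\, C_2^i}{C_2^o} = \frac{S_1\, C_1^i}{C_1^o}
% With the fairness assumption C_2^o = C_1^o, the denominators cancel, giving Eq. (2):
C_2^i \times S_2^2 = C_1^i \times S_1
```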
According to the layer-by-layer computation model of neural networks (the output of the previous layer is the input of the next layer), all convolutions other than the initial convolution $C_2^{i,0} \times K_2^0 \times K_2^0 \times C_2^{o,0}$ (with an image channel dimension of 1 or 3, and a common radiated-noise channel dimension of 1) satisfy $C_2^{i,m} = C_1^{i,m}$. The 1D network step size $S_1^m$ then satisfies the following:
$$S_1^m = \begin{cases} C_2^{i,0} \times (S_2^0)^2 / C_1^{i,0}, & m = 0 \\ (S_2^m)^2, & 1 \le m \le M-1 \end{cases} \tag{3}$$
where M denotes the number of convolutional layers in the network.
Since $C_1^{i,m}$ depends on the data being processed, when the output channels satisfy $C_2^o = C_1^o$, the common 1D network output channel $C_1^{o,m}$ satisfies the following:
$$C_1^{o,m} = \begin{cases} C_2^{o,m}, & m = 0 \\ C_1^{o,m-1} \times C_2^{o/i,m}, & 1 \le m \le M-1 \end{cases} \tag{4}$$
The convolution kernel size $K$ can take more flexible values, with the common range being $K \ge S$, although $K_1^m = (K_2^m)^2$ for $1 \le m \le M-1$ is possible when the 1D and 2D convolutional FLOPs and parameters are required to be the same. This implies that, for the same resource overhead, the receptive field of the 1D network expands roughly as the square of its step size, and an excessively large receptive field leads to a loss of detailed information. Therefore, in this study, the smallest possible value of $K_1$ was used, as shown in Figure 3b,c. Moreover, inspired by DL-based speech classification [5], the initial convolution kernel $K_1^0$ can take larger values, selected according to a time length $\tau$, similar to the frame length in time-frequency analysis. For example, when $f_s = 32{,}000$ Hz and $\tau = 16$ ms (0.016 s), $K_1^0 = 256$, as shown in Figure 3a. In summary, the common range of the 1D network kernel size $K_1$ for ship-radiated noise is as follows:
$$K_1^m = \begin{cases} \tau \times f_s / 2, & m = 0 \\ S_1^m + 1, & S_1^m > 1 \\ 3, & S_1^m = 1,\ K_2^m \ge 2 \\ 1, & S_1^m = 1,\ K_2^m = 1 \end{cases} \tag{5}$$
Equations (3)–(5) express the UATR-ND1D design idea under the condition $R_1^{i/o} = R_2^{i/o}$; they are mainly used to transform advanced 2D networks from the CV domain into 1D networks for UATR.
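As a worked illustration of Equations (3) and (5), the helper below (a sketch; the function and argument names are ours, not the paper's) maps the stride and kernel of one 2D convolution to its 1D counterpart under the equal-reduction condition. Note that for the ResNet2D18 stem (3 input channels, stride 2), Equation (3) gives $S_1^0 = 3 \times 2^2 / 1 = 12$, whereas Section 2.2 ultimately selects $S_1^0 = 16$ from the common range in Equation (6).

```python
def map_2d_layer_to_1d(m, c2_in, s2, k2, c1_in0=1, tau=0.016, fs=32000):
    """Return (stride, kernel) of the m-th 1D convolution derived from the
    m-th 2D convolution via Eq. (3) and Eq. (5); tau is the frame length in s."""
    if m == 0:
        s1 = (c2_in * s2 ** 2) // c1_in0        # Eq. (3), initial convolution
        k1 = int(tau * fs / 2)                  # Eq. (5), e.g. 16 ms at 32 kHz -> 256
    else:
        s1 = s2 ** 2                            # Eq. (3), later convolutions
        if s1 > 1:
            k1 = s1 + 1                         # Eq. (5), reduced cell
        else:
            k1 = 3 if k2 >= 2 else 1            # Eq. (5), normal cell
    return s1, k1

print(map_2d_layer_to_1d(0, c2_in=3, s2=2, k2=7))   # stem:    (12, 256)
print(map_2d_layer_to_1d(1, c2_in=64, s2=2, k2=3))  # reduced: (4, 5)
print(map_2d_layer_to_1d(2, c2_in=64, s2=1, k2=3))  # normal:  (1, 3)
```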
Furthermore, the common parameter range of the 1D network can be estimated from the common parameter range of 2D networks in CV. Generally, $C_2^{i,0}$ takes the value 1 or 3 [32,33], while $C_2^{o,0}$ takes 16, 32, 48, 64, 96, etc., depending on resource requirements [34,35]. Similarly, the common range of $C_2^{o/i,m}$ is approximately 1–2 [36,37], with a typical value of 2 (some special bottleneck structures [38,39] set a larger value in the middle layer). The ranges of commonly used UATR-ND1D parameters can then be set as follows:
$$\begin{cases} 4 \times C_1^{i,0} \le S_1^0 \le 16 \times C_1^{i,0}, & m = 0 \\ 1 \le S_1^m \le 4, & 1 \le m \le M-1 \end{cases} \tag{6}$$
$$C_1^{o,m} = \begin{cases} 16, 32, 48, 64, \text{ or } 96, & m = 0 \\ C_1^{o,m-1} \times C_2^{o/i,m}, & 1 \le m \le M-1,\ 1 \le C_2^{o/i,m} \le 2 \end{cases} \tag{7}$$
$$\begin{cases} K_1^0 = \tau \times f_s / 2000, & m = 0,\ 0 < \tau \le 20 \\ K_1^m \ge S_1^m + 1, & S_1^m > 1,\ 1 \le m \le M-1 \\ K_1^m = 1, & S_1^m = 1,\ K_2^m = 1 \end{cases} \tag{8}$$
where $\tau$ is expressed in milliseconds in Equation (8).

2.2. ResNet1D Model Based on FFT-UATR-ND1D

In this paper, we utilize ResNet2D18, an entry-level network in deep learning, as an example to design a one-dimensional model using the FFT-UATR-ND1D method proposed in Section 2.1. The underwater acoustic data channel dimension is $C_1^{i,0} = 1$. In this study, ResNet1D was designed using Equations (4)–(6) and (8): $S_1^0$ was selected as the maximum value allowed by Equation (6) to reduce modeling complexity, and $K_1^0$ was determined according to Equation (5). The network parameters are as follows:
$$\begin{cases} C_1^{i,0} = 1 \\ S_1^0 = 16 \times C_1^{i,0} = 16 \\ K_1^0 = 16 \times 32{,}000 / 2000 = 256 \\ C_1^{o,0} = C_2^{o,0} = 64 \\ C_1^{i,m} = C_1^{o,m-1} \\ S_1^m = (S_2^m)^2 \\ K_1^m = S_1^m + 1 \ (S_1^m > 1);\ \ 3 \ (S_1^m = 1,\ K_2^m \ge 2);\ \ 1 \ (S_1^m = 1,\ K_2^m = 1) \\ C_1^{o,m} = C_2^{o,m} \end{cases}$$
Figure 4 illustrates the generic structure of the ResNet1D network family designed using the FFT-UATR-ND1D method. Compared to ResNet2D, the figure does not spell out the change in convolutional kernel dimensionality (for example, from 3 × 3 to 1 × 3); instead, the changes in kernel size, step length, and channel values are highlighted in bold. In Figure 4, N denotes the number of stages (excluding stage 0 and the classification layer). The original ResNet2D18 can be abbreviated as [2, 2, 2, 2], where the length of the vector indicates the number of stages and each value denotes the number of residual cells in the corresponding stage; including the initial and classification layers, the total number of layers is 1 + 4 × 2 × 2 + 1 = 18. Similarly, a ResNet1D member containing two stages, each with one residual cell, can be represented as [1, 1].
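As a reproduction aid, the sketch below builds such a ResNet1D member in PyTorch (the framework listed in Section 3.1) from a stage list like [2, 1] or [1, 1, 1, 1]. The stem (kernel 256, stride 16, 64 channels), the stride-4 reduced cells, and the channel doubling per stage follow the values derived above; the remaining cell internals (padding, the 1 × 1 downsampling skip branch, the pooling head) are our assumptions rather than the authors' exact implementation. Under these assumptions the [2, 1] member comes out near the reported 0.17 M parameters, though details may differ from the paper's ResNet1D_S.

```python
import torch
import torch.nn as nn

class BasicBlock1D(nn.Module):
    """1D residual cell; stride > 1 gives a reduced cell, stride = 1 a normal cell."""

    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        k = stride + 1 if stride > 1 else 3                    # kernel rule of Eq. (5)
        self.conv1 = nn.Conv1d(c_in, c_out, k, stride, padding=k // 2, bias=False)
        self.bn1 = nn.BatchNorm1d(c_out)
        self.conv2 = nn.Conv1d(c_out, c_out, 3, 1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm1d(c_out)
        self.relu = nn.ReLU(inplace=True)
        self.down = None
        if stride != 1 or c_in != c_out:                       # match the skip-branch shape
            self.down = nn.Sequential(
                nn.Conv1d(c_in, c_out, 1, stride, bias=False), nn.BatchNorm1d(c_out))

    def forward(self, x):
        identity = x if self.down is None else self.down(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

class ResNet1D(nn.Module):
    """ResNet1D member described by a stage list, e.g. [2, 1] or [1, 1, 1, 1]."""

    def __init__(self, stages, num_classes=5, c_stem=64):
        super().__init__()
        # Stem: kernel 256, stride 16, 64 channels (Section 2.2).
        self.stem = nn.Sequential(
            nn.Conv1d(1, c_stem, 256, 16, padding=128, bias=False),
            nn.BatchNorm1d(c_stem), nn.ReLU(inplace=True))
        cells, c_in = [], c_stem
        for i, n_cells in enumerate(stages):
            c_out = c_stem * (2 ** i)                          # channels double per stage
            for j in range(n_cells):
                # First cell of a stage is a reduced cell with stride 2^2 = 4 (Eq. (3)).
                cells.append(BasicBlock1D(c_in, c_out, stride=4 if j == 0 else 1))
                c_in = c_out
        self.body = nn.Sequential(*cells)
        self.head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                  nn.Linear(c_in, num_classes))

    def forward(self, x):                                      # x: (batch, 1, n_freq_bins)
        return self.head(self.body(self.stem(x)))

# A ResNet1D_S-like member on a 0.5 s / 2000 Hz ShipsEar input (1 x 1001 FFT bins).
model = ResNet1D([2, 1], num_classes=5)
print(model(torch.randn(8, 1, 1001)).shape)                    # torch.Size([8, 5])
```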
In this paper, we use ResNet2D as the example in order to limit the workload. Residual connections mark the formal entry of DL into the “deep era”, and ResNet2D is the most basic model of that era. Customizing a proprietary UATR-ND1D model is a particularly difficult task, which some papers call network engineering (NE) [28]; NE consumes a large amount of time, computational resources, and expert knowledge.

2.3. Parameter Space and Datasets

The parameter space involved in this study is outlined in Table 1, where SD denotes sample duration and ULF denotes upper limit frequency. A total of 4320 experiments were conducted on the ShipsEar dataset, from which 72 parameter combinations yielding optimal results were selected for an additional 360 experiments on the DeepShip dataset.
The ShipsEar dataset contains 90 recordings of varying duration, identified by IDs numbered from 6 to 96. It covers 11 different vessel types, which are categorized into four groups based on vessel size; with the addition of background noise, there are five categories in total, denoted by the capital letters A through E. Details can be found in Table 2.
The DeepShip dataset consists of five target classes: background, cargo, passenger ship, tanker, and tug. Given that the background noise acquisition environment differs from that of the other four classes, this paper adheres to the recommendations of most studies by using only data from the latter four target classes, as outlined in Table 3.
For dataset generation, a randomized division was employed: the training, validation, and testing sets were split 8:1:1 for ShipsEar and 7:1.5:1.5 for DeepShip.
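For illustration, such a random split can be obtained as in the sketch below; the TensorDataset here is a stand-in for the actual segmented ShipsEar samples, not the authors' data loader.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset: 1000 ShipsEar-like samples of 1 x 1001 FFT bins, 5 classes.
full = TensorDataset(torch.randn(1000, 1, 1001), torch.randint(0, 5, (1000,)))

n = len(full)
n_train, n_val = int(0.8 * n), int(0.1 * n)          # 8:1:1 split for ShipsEar
train_set, val_set, test_set = random_split(
    full, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0))      # one of the five random seeds
print(len(train_set), len(val_set), len(test_set))   # 800 100 100
```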

3. Results

3.1. Experimental Configuration

This paper employs the parameter combinations outlined in Section 2.3. All samples use the frequency-domain magnitude as input, restricted to the spectral range specified in Section 2.3. Gradients were updated using the Adam [40] algorithm with an initial learning rate of 0.01, reduced by a factor of 10 every 10 epochs, for a total of 35 training epochs. The batch size was set to 256. Experiments were run on an RTX 3090 graphics card (Nvidia, Santa Clara, CA, USA) with Torch 1.11.0 and SciPy 1.7.3.1.
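A sketch of this training loop in PyTorch is given below; ResNet1D and train_set refer to the earlier sketches, and the cross-entropy loss is our assumption, since the paper does not state the loss function used.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

model = ResNet1D([2, 1], num_classes=5).cuda()       # e.g. the ResNet1D_S-like member
criterion = nn.CrossEntropyLoss()                    # assumed loss function
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Adam, initial lr 0.01
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # /10 every 10 epochs

train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
for epoch in range(35):                              # 35 epochs in total
    model.train()
    for x, y in train_loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```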

3.2. Performance Comparison on ShipsEar and DeepShip

Table 4 and Table 5 present the detailed performance results on ShipsEar and DeepShip, respectively. In this paper, the optimal model is determined based on the average validation result over five random seeds; we report both the maximum validation result and the average validation accuracy with its standard deviation. The test-set results corresponding to the optimal validation model are consistent with these findings.
Using the FFT-UATR-ND1D method on ResNet2D, we designate the resulting network family as ResNet1D. The optimal and lightweight models are referred to as ResNet1D_B and ResNet1D_S, respectively, following the descriptions in Figure 4. Their model structures are summarized as [1, 1, 1, 1] and [2, 1], respectively. Here, RT denotes running time, SD denotes sample duration, ULF denotes upper limit frequency, and DU denotes data usage. Due to inconsistencies in experimental platforms across different papers, a comparison of running times is not possible; the data are presented for reference only.
As shown in Table 4 and Table 5, the recognition rate of ResNet1D_B on ShipsEar is 1.21% to 4.57% higher than that of all other methods except AMNet-S and MR-CNN_A. Although ResNet1D_B is 0.59% and 0.06% lower than AMNet-S and MR-CNN_A, respectively, their parameter counts are 2.6 and 2.0 times greater, and their FLOPs are 74 and 83.2 times higher. This is because the complexity of a 1D model is much lower than that of a 2D network with the same structure. Despite the slight performance gap, ResNet1D_B significantly reduces model parameters and computational complexity.
The recognition rates of the lightweight models ResNet1D_S and AMNet-T are nearly equal and higher than those of ResNet2D18, SVM, and Transformer, while the parameter count and FLOPs of AMNet-T are 10 and 50 times greater than those of ResNet1D_S, respectively. Furthermore, although ResNet1D_S has only about 0.07 M (70 K) more parameters than the SVM method, its recognition rate is 3.12% higher, and its FLOPs are only about 17% of those of the SVM method.
On DeepShip, ResNet1D_B outperforms all other models that use the full dataset across all performance metrics. Although its recognition rate of 98.42% is 0.58% lower than that of AResNet, AResNet uses only about one third of the data, whereas this paper uses all of it; moreover, the parameter count and FLOPs of AResNet are 4.5 and 110.0 times higher than those of ResNet1D_B, respectively. Models using all the data have recognition rates 1.15% to 15.52% lower than ResNet1D_B. Unfortunately, only MSLEFC provides more detailed performance data: its recognition rate is 15.52% lower than that of this paper, while its parameters and FLOPs are 7.5 and 117.1 times higher than those of ResNet1D_B, respectively.
For the lightweight model ResNet1D_S, the performance is only lower than that of the Hybrid model, and it requires only 0.17 M parameters and 6.8 M FLOPs. These results demonstrate that the ResNet1D network designed using FFT-UATR-ND1D exhibits impressive performance regarding recognition rate, number of parameters, and FLOPs. This advancement represents a significant step toward deploying deep learning on small mobile platforms.
From the perspective of feature extraction and model structure, most papers on ShipsEar and DeepShip combine various complex traditional feature extraction methods with large, intricate models, as illustrated in Figure 1. In this study, we consider FFT as a simple alternative to filters, specifically to acquire ship-radiated noise bands. ResNet2D serves as an introductory DL network. By utilizing FFT-UATR-ND1D to design the ResNet1D model, we achieved remarkable results in terms of recognition rate, number of parameters, and FLOPs. In this context, FFT-UATR-ND1D approximates a simple and efficient end-to-end network design method.
The experimental results demonstrate that the FFT-UATR-ND1D method proposed in this paper effectively addresses the three challenges of IM-FE. First, FFT-UATR-ND1D is an approximately end-to-end method that uses the FFT as a straightforward alternative to traditional filtering. Second, there is no need to accommodate CV model input interfaces by using unnatural SD and ULF values (refer to Section 3.3 for experimental results on the time-frequency details). Third, although we employ FFT-UATR-ND1D to design the ResNet1D model from the foundational ResNet2D network, the results on ShipsEar and DeepShip show significant improvements in recognition rate, number of parameters, and FLOPs.

3.3. Results of Partial Experiments on ShipsEar and DeepShip

Figure 5 presents the recognition experiment results on ShipsEar and DeepShip across various durations, frequencies, and network validation sets. Due to space limitations, only selected results are displayed. The horizontal axis represents the upper limit frequency (ULF), while “Acc” indicates the recognition rate. Figure 5a–d correspond to sample durations of 0.5, 1.0, 1.5, and 2.0 s, respectively. Each point reflects the mean and variance of 30 cumulative results from five random seeds and six window functions.
Figure 5a,d show results for ShipsEar, while Figure 5b,c depict results for DeepShip. Comparing Table 4 and Table 5, all results in Figure 5 are competitive. The following conclusions can be drawn:
Different data scales tend to favor different numbers of stages: for instance, 0.5 s and 1 s samples favor networks with two stages, while 2 s data benefit from networks with four stages. The recognition rate increases only slowly as the ULF rises, indicating that blindly using a larger ULF is resource-intensive; network structures should therefore be designed according to the specific task requirements.
Regarding SD and ULF, many studies use unnatural values for these parameters. Ship-radiated noise predominantly occupies the low-frequency band; thus, setting the ULF to 500, 1000, or 2000 Hz is reasonable for balancing recognition rate and model complexity. For SD, recognition tasks follow localization and tracking tasks; given the immediacy and non-cooperative nature of the targets, larger SDs can lead to situations in which the target's location is known but its type cannot be identified in real time. Feature extraction should not be upgraded into the more complex interface matching–feature engineering (IM-FE) merely to accommodate the input interface of CV models. It is crucial to customize the model according to the dimensions of underwater acoustic data and the mission requirements, rather than blindly adopting the CV model interface, which deviates from the fundamental requirements of ship-radiated noise and target identification missions.
Moreover, the results in Figure 5 demonstrate that the ResNet1D family, generated using the FFT-UATR-ND1D method proposed in this paper, can adapt to the time-frequency ranges typical of ship-radiated noise without necessitating IM-FE for compatibility with CV domain networks. This further emphasizes the potential of FFT-UATR-ND1D, as the ResNet family represents an early model in deep learning.

3.4. Visualization Results of T-SNE

Figure 6 illustrates the visualization results of FFT-UATR-ND1D on ShipsEar and DeepShip. The first two rows present results for 500 Hz and 2000 Hz on ShipsEar, while the last two rows show results for the same frequencies on DeepShip. All samples have a duration of 2 s.
Figure 6 indicates that FFT-UATR-ND1D effectively distinguishes between different categories. The recognition performance at 2000 Hz is superior to that at 500 Hz. Key insights from Figure 6 include the following:
Ship-radiated noise information is primarily concentrated in the low-frequency range, particularly below 500 Hz (with a recognition rate close to 97%).
A smaller portion of the information lies in the 500–2000 Hz band; therefore, using the 0–2000 Hz band for recognition can slightly improve performance at the cost of nearly four times the resource consumption.
From Section 3.3, it can also be seen that higher frequency limits not only require more resources but often result in a significant decrease in recognition rates, as seen for ResNet2D18 and Transformer on the ShipsEar dataset and for all models on DeepShip. Even when pursuing the maximum recognition rate, 2000 Hz may already be sufficient. These results suggest that blindly applying higher frequencies to match CV interfaces not only can degrade recognition performance but also leads to significant resource wastage. The FFT-UATR-ND1D method effectively addresses the IM-FE problem. Although the method yields a clear performance improvement, it still has some limitations. 1. It mainly draws on image classification network design experience from the CV field, so it is hard to guarantee that it is fully suited to UATR; only part of the knowledge about ship-radiated noise is incorporated, and almost no underwater acoustic physics is integrated, partly because underwater acoustic physics remains a worldwide research challenge. 2. At this stage, publicly available datasets still lack variety in target types, operating conditions, and temporally and spatially varying environments, which prevents the model from fully learning the target characteristics.

4. Conclusions

In response to the challenges posed by the existing issues of IM-FE within UATR design, this paper identifies three primary problems:
  • Need for Manual Feature Design: Significant effort is still required to design effective features.
  • Inappropriate Sample Duration (SD) or Upper Limit Frequency (ULF): Larger SD or ULF values are often necessary, which do not align with the specific characteristics of hydroacoustic target recognition tasks.
  • Underutilization of Deep Learning’s Learning Capability: There is a tendency to rely on complex advanced models without customizing them for one-dimensional hydroacoustic data.
To address these challenges, we propose a one-dimensional network design method, FFT-UATR-ND1D, which integrates fast Fourier transform (FFT). This method utilizes only the FFT results as inputs for the model. Using the entry-level deep learning network ResNet2D as a foundation, we designed a series of ResNet1D models employing the FFT-UATR-ND1D approach, followed by extensive experiments on the ShipsEar and DeepShip datasets.
The experimental results demonstrate that the FFT-UATR-ND1D method achieves the best current performance across various time and frequency band combinations. Importantly, there is no need to adopt SD or ULF values that are unsuited to hydroacoustic target recognition merely to match the input interface of CV-domain models. The FFT serves as a simple filter replacement aimed solely at capturing the specified frequency band, without the need for complex feature extraction methods. The ResNet1D networks generated with FFT-UATR-ND1D not only exhibit superior performance but also require significantly fewer parameters and FLOPs than other hydroacoustic models.

Author Contributions

Conceptualization, Q.H.; Data curation, M.Z.; Formal analysis, Q.H. and X.Z. (Xiaoyan Zhang); Funding acquisition, M.L. and X.Z. (Xiangyang Zeng); Investigation, P.C., Z.N. and X.Z. (Xiangyang Zeng); Methodology, Q.H.; Project administration, M.L.; Resources, P.C.; Supervision, M.Z.; Validation, Z.N.; Visualization, Q.H., X.Z. (Xiaoyan Zhang) and A.J.; Writing—review and editing, A.J. and X.Z. (Xiangyang Zeng). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 52271351).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hou, Y.; Han, G.; Zhang, F.; Lin, C.; Peng, J.; Liu, L. Distributional soft actor-critic-based multi-AUV cooperative pursuit for maritime security protection. IEEE Trans. Intell. Transp. Syst. 2023, 25, 6049–6060. [Google Scholar] [CrossRef]
  2. Li, L. Building up a sustainable path to maritime security: An analytical framework and its policy applications. Sustainability 2023, 15, 6757. [Google Scholar] [CrossRef]
  3. Nguyen, M.-H.; Duong, M.-P.T.; Nguyen, M.-C.; Mutai, N.; Jin, R.; Nguyen, P.-T.; Le, T.-T.; Vuong, Q.-H. Promoting stakeholders’ support for marine protection policies: Insights from a 42-country dataset. Sustainability 2023, 15, 12226. [Google Scholar] [CrossRef]
  4. Chen, X.; Yu, Z.; Di, Q.; Wu, H. Assessing the marine ecological welfare performance of coastal regions in China and analysing its determining factors. Ecol. Indic. 2023, 147, 109942. [Google Scholar] [CrossRef]
  5. Mazzocchi, M.G.; Di Capua, I.; Kokoszka, F.; Margiotta, F.; d’Alcalà, M.R.; Sarno, D.; Zingone, A.; Licandro, P. Coastal mesozooplankton respond to decadal environmental changes via community restructuring. Mar. Ecol. 2023, 44, e12746. [Google Scholar] [CrossRef]
  6. Fernandez Garcia, G.; Corpetti, T.; Nevoux, M.; Beaulaton, L.; Martignac, F. AcousticIA, a deep neural network for multi-species fish detection using multiple models of acoustic cameras. Aquat. Ecol. 2023, 57, 881–893. [Google Scholar] [CrossRef]
  7. Er, M.J.; Chen, J.; Zhang, Y.; Gao, W. Research challenges, recent advances, and popular datasets in deep learning-based underwater marine object detection: A review. Sensors 2023, 23, 1990. [Google Scholar] [CrossRef]
  8. El Mekkaoui, S.; Benabbou, L.; Caron, S.; Berrado, A. Deep learning-based ship speed prediction for intelligent maritime traffic management. J. Mar. Sci. Eng. 2023, 11, 191. [Google Scholar] [CrossRef]
  9. Niu, H.; Li, X.; Zhang, Y.; Xu, J. Advances and applications of machine learning in underwater acoustics. Intell. Mar. Technol. Syst. 2023, 1, 8. [Google Scholar] [CrossRef]
  10. Luo, X.; Chen, L.; Zhou, H.; Cao, H. A survey of underwater acoustic target recognition methods based on machine learning. J. Mar. Sci. Eng. 2023, 11, 384. [Google Scholar] [CrossRef]
  11. Yang, H.; Lee, K.; Choo, Y.; Kim, K. Underwater acoustic research trends with machine learning: General background. J. Ocean Eng. Technol. 2020, 34, 147–154. [Google Scholar] [CrossRef]
  12. Zhu, C.; Cao, T.; Chen, L.; Dai, X.; Ge, Q.; Zhao, X. High-order domain feature extraction technology for ocean acoustic observation signals: A review. IEEE Access 2023, 11, 17665–17683. [Google Scholar] [CrossRef]
  13. Yao, Q.; Wang, Y.; Yang, Y. Underwater acoustic target recognition based on data augmentation and residual CNN. Electronics 2023, 12, 1206. [Google Scholar] [CrossRef]
  14. Wang, B.; Zhang, W.; Zhu, Y.; Wu, C.; Zhang, S. An underwater acoustic target recognition method based on AMNet. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5501105. [Google Scholar] [CrossRef]
  15. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef]
  16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  17. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar] [CrossRef]
  18. Santos-Domínguez, D.; Torres-Guijarro, S.; Cardenal-López, A.; Pena-Gimenez, A. ShipsEar: An underwater vessel noise database. Appl. Acoust. 2016, 113, 64–69. [Google Scholar] [CrossRef]
  19. Irfan, M.; Jiangbin, Z.; Ali, S.; Iqbal, M.; Masood, Z.; Hamid, U. DeepShip: An underwater acoustic benchmark dataset and a separable convolution based autoencoder for classification. Expert Syst. Appl. 2021, 183, 115270. [Google Scholar] [CrossRef]
  20. Hong, F.; Liu, C.; Guo, L.; Chen, F.; Feng, H. Underwater acoustic target recognition with a residual network and the optimized feature extraction method. Appl. Sci. 2021, 11, 1442. [Google Scholar] [CrossRef]
  21. Khishe, M. DRW-AE: A deep recurrent-wavelet autoencoder for underwater target recognition. IEEE J. Ocean. Eng. 2022, 47, 1083–1098. [Google Scholar] [CrossRef]
  22. Ma, Y.; Liu, M.; Zhang, Y.; Zhang, B.; Xu, K.; Zou, B.; Huang, Z. Imbalanced underwater acoustic target recognition with trigonometric loss and attention mechanism convolutional network. Remote Sens. 2022, 14, 4103. [Google Scholar] [CrossRef]
  23. Li, J.; Wang, B.; Cui, X.; Li, S.; Liu, J. Underwater acoustic target recognition based on attention residual network. Entropy 2022, 24, 1657. [Google Scholar] [CrossRef]
  24. Zhang, Y.; Zeng, Q. MSLEFC: A low-frequency focused underwater acoustic signal classification and analysis system. Eng. Appl. Artif. Intell. 2023, 123, 106333. [Google Scholar] [CrossRef]
  25. Alouani, Z.; Hmamouche, Y.; Khamlichi, B.E.; Seghrouchni, A.E.F. A spatio-temporal deep learning approach for underwater acoustic signals classification. In Proceedings of the 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Madrid, Spain, 29 November–2 December 2022; IEEE Publications: Piscataway, NJ, USA, 2022; pp. 1–7. [Google Scholar] [CrossRef]
  26. Miao, Y.; Zakharov, Y.V.; Sun, H.; Li, J.; Wang, J. Underwater acoustic signal classification based on sparse time–frequency representation and deep learning. IEEE J. Ocean. Eng. 2021, 46, 952–962. [Google Scholar] [CrossRef]
  27. Cheng, Z.; Huo, G.; Li, H. A multi-domain collaborative transfer learning method with multi-scale repeated attention mechanism for underwater side-scan sonar image classification. Remote Sens. 2022, 14, 355. [Google Scholar] [CrossRef]
  28. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.; Li, Z.; Chen, X.; Wang, X. A comprehensive survey of neural architecture search: Challenges and solutions. ACM Comput. Surv. 2022, 54, 1–34. [Google Scholar] [CrossRef]
  29. Feng, S.; Zhu, X. A transformer-based deep learning network for underwater acoustic target recognition. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1505805. [Google Scholar] [CrossRef]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE Publications: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar] [CrossRef]
  31. Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 2022, 12, 8972. [Google Scholar] [CrossRef]
  32. Chitty-Venkata, K.T.; Emani, M.; Vishwanath, V.; Somani, A.K. Neural architecture search benchmarks: Insights and survey. IEEE Access 2023, 11, 25217–25236. [Google Scholar] [CrossRef]
  33. Chen, L.; Li, S.; Bai, Q.; Yang, J.; Jiang, S.; Miao, Y. Review of image classification algorithms based on convolutional neural networks. Remote Sens. 2021, 13, 4712. [Google Scholar] [CrossRef]
  34. Taye, M.M. Theoretical understanding of convolutional neural network: Concepts, architectures, applications, future directions. Computation 2023, 11, 52. [Google Scholar] [CrossRef]
  35. Cong, S.; Zhou, Y. A review of convolutional neural network architectures and their optimizations. Artif. Intell. Rev. 2023, 56, 1905–1969. [Google Scholar] [CrossRef]
  36. Zhang, T.; Lei, C.; Zhang, Z.; Meng, X.-B.; Chen, C.L.P. AS-NAS: Adaptive scalable neural architecture search with reinforced evolutionary algorithm for deep learning. IEEE Trans. Evol. Comput. 2021, 25, 830–841. [Google Scholar] [CrossRef]
  37. Khan, A.; Sohail, A.; Zahoora, U.; Qureshi, A.S. A survey of the recent architectures of deep convolutional neural networks. Artif. Intell. Rev. 2020, 53, 5455–5516. [Google Scholar] [CrossRef]
  38. Wang, B.; Li, L.; Nakashima, Y.; Nagahara, H. Learning bottleneck concepts in image classification. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 10962–10971. [Google Scholar] [CrossRef]
  39. Mellor, J.; Turner, J.; Storkey, A.; Crowley, E.J. Neural architecture search without training. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 7588–7598. [Google Scholar]
  40. Reyad, M.; Sarhan, A.M.; Arafa, M. A modified Adam algorithm for deep neural network optimization. Neural Comput. Appl. 2023, 35, 17095–17112. [Google Scholar] [CrossRef]
Figure 1. Schematic structure of interface matching–feature engineering (IM-FE).
Figure 2. ConvNet, head, and stage architecture.
Figure 3. Stem, normal, and reduced cells: 1D and 2D.
Figure 4. Generalized structure of ResNet1D based on FFT-UATR-ND1D.
Figure 5. Recognition results versus upper limit frequency for different network structures and sample durations.
Figure 6. T-SNE of ShipsEar and DeepShip.
Table 1. Parameter space.
Variable Name | Ranges | Number
SD (s) | 0.5, 1, 1.5, 2 | 4
ULF (Hz) | 500, 1000, 2000 | 3
ResNet1D | [1, 1], [2, 1], [2, 2]; [1, 1, 1], [2, 1, 1], [2, 2, 1], [2, 2, 2]; [1, 1, 1, 1], [2, 1, 1, 1], [2, 2, 1, 1], [2, 2, 2, 1], [2, 2, 2, 2] | 12
Window | Rect, Bartlett, Blackman, Hamming, Hann, Kaiser | 6
Random seed | 0, 1, 2, 3, 4 | 5
Total | | 864 × 5 = 4320
Table 2. ShipsEar dataset.
Categories | Ships | SD
A | fishing boats, trawlers, mussel boats, tugboats, dredgers | 1881.27 s
B | motorboats, pilot boats, sailboats | 1567.45 s
C | passenger ferries | 4278.16 s
D | ocean liners, ro-ro vessels | 2460.65 s
E | background noise recordings | 1146.36 s
Total | | 3.15 h
Table 3. DeepShip dataset.
Categories | SD
cargo | 38,495 s
passenger ship | 46,421 s
tanker | 44,530 s
tug | 40,539 s
Total | 47.22 h
Table 4. Comparison of ShipsEar experimental results.
Model | Result (%) | Params (M) | FLOPs (M) | RT 1 (ms) | Size | Feature | SD 2 (s), ULF 3 (Hz)
ResNet2D18 [20] | 94.3 | 11.19 | 1880 | 3.9 | 60 × 41 × 3 | LM + MFCC + CCTZ | 5, 20,480
SVM [21] | 94.49 | 0.1 | 20 | 3.4 | 126 × 128 | DAWN + DRA | 0.5, ≥16,000 4
Transformer [29] | 96.9 | 2.55 | / | 4.3 | 128 × 512 | LogMel | 5, 16,000
MR-CNN_A [22] | 98.87 | 4.3 5 | 415.8 5 | / | 98 × 12 | MFCC | None
AMNet-N [14] | 92.2 | 0.51 | 140 | 375 | 166 × 66 | STFT | 1, 16,000
AMNet-T [14] | 97.6 | 1.69 | 170 | 443 | 166 × 66 | STFT | 1, 16,000
AMNet-S [14] | 99.4 | 5.47 | 370 | 481 | 166 × 66 | STFT | 1, 16,000
ResNet1D_S | 97.61/97.1 ± 0.43; 97.62/97.2 ± 0.44 | 0.17 | 3.4 | 0.7 | 1 × 1001 | FFT | 0.5, 2000
ResNet1D_B | 98.89/98.5 ± 0.28; 98.81/98.4 ± 0.27 | 2.1 | 5.0 | 1.1 | 1 × 1001 | FFT | 0.5, 2000
1 RT: running time, in milliseconds. 2 SD: sample duration, in seconds. 3 ULF: upper limit of frequency, in Hz. 4 Each sample in [21] is divided into 32 ms frames with a frame length of 512, so the presumed frequency band is at least 512/0.032 = 16,000 Hz. 5 Estimated from the network structure diagram and the 98 × 12 input dimension in the original article.
Table 5. Comparison of DeepShip experimental results.
Model | Result (%) | Params (M) | FLOPs (M) | RT (ms) | Size | Feature | SD (s), ULF (Hz) | DU 1 (h)
Transformer [29] | 95.3 | 2.55 | / | 4.3 | 128 × 512 | LogMel | 5, 16,000 | 13.8
AResNet [23] | 99 | 9.47 | 1460 | / | 3 × 60 × 41 | LM + MFCC + CCTZ | 5, 22,050 | 15.5
MSLEFC [24] | 82.9 | 15.75 | 1556.9 | / | 9 × 6 × 2048 | MS_STFT + LE + FC | 3, 4096 | 47.2
End2end model [25] | 91.07 | / | / | 7.2 | 6 × 40 × 45 | None | 3.36, 16,000 | /
Hybrid model [25] | 97.27 | / | / | 20.6 | 128 × 3 × 3 | MFCC | 3.36, 16,000 | /
ResNet1D_S | 95.72/95.3 ± 0.28; 95.29/95.2 ± 0.16 | 0.17 | 6.8 | 0.8 | 1 × 2001 | FFT | 1, 2000 | 47.2
ResNet1D_B | 98.36/98.1 ± 0.15; 98.42/98.1 ± 0.19 | 2.1 | 13.3 | 1.2 | 1 × 3001 | FFT | 1.5, 2000 | 47.2
1 DU: duration of the dataset used.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
